- Multiview Transformers for Video Recognition: https://arxiv.org/pdf/2201.04288.pdf
- ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer: https://arxiv.org/pdf/2202.07305.pdf
- Privacy-preserving Anomaly Detection in Cloud Manufacturing via Federated Transformer: https://arxiv.org/pdf/2204.00843.pdf
- A Tour of Visualization Techniques for Computer Vision Datasets: https://arxiv.org/pdf/2204.08601.pdf
- Transfer Attacks Revisited: A Large-Scale Empirical Study in Real Computer Vision Settings: https://arxiv.org/pdf/2204.04063.pdf
- Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers: https://arxiv.org/pdf/2205.05055.pdf
- Depth Estimation with Simplified Transformer: https://arxiv.org/pdf/2204.13791.pdf
- DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers: https://arxiv.org/pdf/2204.12997.pdf
- Learning to Parallelize in a Shared-Memory Environment with Transformers: https://arxiv.org/pdf/2204.12835.pdf
- CATrans: Context and Affinity Transformer for Few-Shot Segmentation: https://arxiv.org/pdf/2204.12817.pdf
- Where in the World is this Image? Transformer-based Geo-localization in the Wild: https://arxiv.org/pdf/2204.13861.pdf
- Memformer: A Memory-Augmented Transformer for Sequence Modeling: https://arxiv.org/pdf/2010.06891.pdf
- TrackFormer: Multi-Object Tracking with Transformers: https://arxiv.org/pdf/2101.02702.pdf
- M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection: https://arxiv.org/pdf/2104.09770.pdf
- Patch Slimming for Efficient Vision Transformers: https://arxiv.org/pdf/2106.02852.pdf
- Zero-Shot Controlled Generation with Encoder-Decoder Transformers: https://arxiv.org/pdf/2106.06411.pdf
- Predicting Attention Sparsity in Transformers: https://arxiv.org/pdf/2109.12188.pdf
- MoEfication: Transformer Feed-forward Layers are Mixtures of Experts: https://arxiv.org/pdf/2110.01786.pdf
- RelViT: Concept-Guided Vision Transformer for Visual Relational Reasoning: https://arxiv.org/pdf/2204.11167.pdf
- Crystal Transformer: Self-Learning Neural Language Model For Generative And Tinkering Design Of Materials: https://arxiv.org/pdf/2204.11953.pdf
- LM-Debugger: An Interactive Tool for Inspection and Intervention in Transformer-Based Language Models: https://arxiv.org/pdf/2204.12130.pdf
- Hierarchical Transformers Are More Efficient Language Models: https://arxiv.org/pdf/2110.13711.pdf
- Swin Transformer V2: Scaling Up Capacity and Resolution: https://arxiv.org/pdf/2111.09883.pdf
- Transformer-S2A: Robust And Efficient Speech-To-Animation: https://arxiv.org/pdf/2111.09771.pdf
- SP-SEDT: Self-supervised Pre-training for Sound Event Detection Transformer: https://arxiv.org/pdf/2111.15222.pdf
- Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space: https://arxiv.org/pdf/2201.00814.pdf
- TransVPR: Transformer-Based Place Recognition with Multi-Level Attention Aggregation: https://arxiv.org/pdf/2201.02001.pdf
- TranAD: Deep Transformer Networks for Anomaly Detection in Multivariate Time Series Data: https://arxiv.org/pdf/2201.07284.pdf
- Text Spotting Transformers: https://arxiv.org/pdf/2204.01918.pdf
- MaxViT: Multi-Axis Vision Transformer: https://arxiv.org/pdf/2204.01697.pdf
- HiT-DVAE: Human Motion Generation via Hierarchical Transformer Dynamical VAE: https://arxiv.org/pdf/2204.01565.pdf
- TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting: https://arxiv.org/pdf/2204.01018.pdf
- A-ACT: Action Anticipation through Cycle Transformations: https://arxiv.org/pdf/2204.00942.pdf
- Transformer-Empowered Content-Aware Collaborative Filtering: https://arxiv.org/pdf/2204.00849.pdf
- What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions: https://arxiv.org/pdf/2204.00746.pdf
- Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention: https://arxiv.org/pdf/2203.03937.pdf
- DiT: Self-Supervised Pre-Training For Document Image Transformer: https://arxiv.org/pdf/2203.02378.pdf
- Protecting Celebrities from DeepFake with Identity Consistency Transformer: https://arxiv.org/pdf/2203.01318.pdf
- AutoFi: Towards Automatic WiFi Human Sensing via Geometric Self-Supervised Learning: https://arxiv.org/pdf/2205.01629.pdf
- GMSS: Graph-Based Multi-Task Self-Supervised Learning for EEG Emotion Recognition: https://arxiv.org/pdf/2205.01030.pdf
- DILEMMA: Self-Supervised Shape and Texture Learning with Transformers: https://arxiv.org/pdf/2204.04788.pdf
- Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data: https://arxiv.org/pdf/2204.04645.pdf
- Self-Supervised Video Representation Learning With Motion-Contrastive Perception: https://arxiv.org/pdf/2204.04607.pdf
- SCGC: Self-Supervised Contrastive Graph Clustering: https://arxiv.org/pdf/2204.12656.pdf
- Audio-Visual Contrastive Learning for Self-supervised Action Recognition: https://arxiv.org/pdf/2204.13386.pdf
- Divergence-Aware Federated Self-Supervised Learning: https://arxiv.org/pdf/2204.04385.pdf
- Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection: https://arxiv.org/pdf/2201.07131.pdf
- Self-supervised Learning of Adversarial Example: Towards Good Generalizations for Deepfake Detection: https://arxiv.org/pdf/2203.12208.pdf
- Selecting task with optimal transport self-supervised learning for few-shot classification: https://arxiv.org/pdf/2204.00289.pdf
- End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation: https://arxiv.org/pdf/2204.00540.pdf
- Federated Self-supervised Speech Representations: Are We There Yet?: https://arxiv.org/pdf/2204.02804.pdf
- Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency: https://arxiv.org/pdf/2204.03017.pdf
- Hierarchical Self-supervised Representation Learning for Movie Understanding: https://arxiv.org/pdf/2204.03101.pdf
- Self-Supervised Keypoint Discovery in Behavioral Videos: https://arxiv.org/pdf/2112.05121.pdf