- Multiview Transformers for Video Recognition: https://arxiv.org/pdf/2201.04288.pdf
- ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer: https://arxiv.org/pdf/2202.07305.pdf
- Privacy-preserving Anomaly Detection in Cloud Manufacturing via Federated Transformer: https://arxiv.org/pdf/2204.00843.pdf
- A Tour of Visualization Techniques for Computer Vision Datasets: https://arxiv.org/pdf/2204.08601.pdf
- Transfer Attacks Revisited: A Large-Scale Empirical Study in Real Computer Vision Settings: https://arxiv.org/pdf/2204.04063.pdf
- Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers: https://arxiv.org/pdf/2205.05055.pdf
- Depth Estimation with Simplified Transformer: https://arxiv.org/pdf/2204.13791.pdf
- DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers: https://arxiv.org/pdf/2204.12997.pdf
- Learning to Parallelize in a Shared-Memory Environment with Transformers: https://arxiv.org/pdf/2204.12835.pdf
- CATrans: Context and Affinity Transformer for Few-Shot Segmentation: https://arxiv.org/pdf/2204.12817.pdf
- Where in the World is this Image? Transformer-based Geo-localization in the Wild: https://arxiv.org/pdf/2204.13861.pdf
- Memformer: A Memory-Augmented Transformer for Sequence Modeling: https://arxiv.org/pdf/2010.06891.pdf
- TrackFormer: Multi-Object Tracking with Transformers: https://arxiv.org/pdf/2101.02702.pdf
- M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection: https://arxiv.org/pdf/2104.09770.pdf
- Patch Slimming for Efficient Vision Transformers: https://arxiv.org/pdf/2106.02852.pdf
- Zero-Shot Controlled Generation with Encoder-Decoder Transformers: https://arxiv.org/pdf/2106.06411.pdf
- Predicting Attention Sparsity in Transformers: https://arxiv.org/pdf/2109.12188.pdf
- MoEfication: Transformer Feed-forward Layers are Mixtures of Experts: https://arxiv.org/pdf/2110.01786.pdf
- RelViT: Concept-Guided Vision Transformer for Visual Relational Reasoning: https://arxiv.org/pdf/2204.11167.pdf
- Crystal Transformer: Self-Learning Neural Language Model For Generative And Tinkering Design Of Materials: https://arxiv.org/pdf/2204.11953.pdf
- LM-Debugger: An Interactive Tool for Inspection and Intervention in Transformer-Based Language Models: https://arxiv.org/pdf/2204.12130.pdf
- Hierarchical Transformers Are More Efficient Language Models: https://arxiv.org/pdf/2110.13711.pdf
- Swin Transformer V2: Scaling Up Capacity and Resolution: https://arxiv.org/pdf/2111.09883.pdf
- Transformer-S2A: Robust And Efficient Speech-To-Animation: https://arxiv.org/pdf/2111.09771.pdf
- SP-SEDT: Self-supervised Pre-training for Sound Event Detection Transformer: https://arxiv.org/pdf/2111.15222.pdf
- Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space: https://arxiv.org/pdf/2201.00814.pdf
- TransVPR: Transformer-Based Place Recognition with Multi-Level Attention Aggregation: https://arxiv.org/pdf/2201.02001.pdf
- TranAD: Deep Transformer Networks for Anomaly Detection in Multivariate Time Series Data: https://arxiv.org/pdf/2201.07284.pdf
- Text Spotting Transformers: https://arxiv.org/pdf/2204.01918.pdf
- MaxViT: Multi-Axis Vision Transformer: https://arxiv.org/pdf/2204.01697.pdf
- HiT-DVAE: Human Motion Generation via Hierarchical Transformer Dynamical VAE: https://arxiv.org/pdf/2204.01565.pdf
- TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting: https://arxiv.org/pdf/2204.01018.pdf
- A-ACT: Action Anticipation through Cycle Transformations: https://arxiv.org/pdf/2204.00942.pdf
- Transformer-Empowered Content-Aware Collaborative Filtering: https://arxiv.org/pdf/2204.00849.pdf
- What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions: https://arxiv.org/pdf/2204.00746.pdf
- Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention: https://arxiv.org/pdf/2203.03937.pdf
- DiT: Self-Supervised Pre-Training For Document Image Transformer: https://arxiv.org/pdf/2203.02378.pdf
- Protecting Celebrities from DeepFake with Identity Consistency Transformer: https://arxiv.org/pdf/2203.01318.pdf
- AutoFi: Towards Automatic WiFi Human Sensing via Geometric Self-Supervised Learning: https://arxiv.org/pdf/2205.01629.pdf
- GMSS: Graph-Based Multi-Task Self-Supervised Learning for EEG Emotion Recognition: https://arxiv.org/pdf/2205.01030.pdf
- DILEMMA: Self-Supervised Shape and Texture Learning with Transformers: https://arxiv.org/pdf/2204.04788.pdf
- Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data: https://arxiv.org/pdf/2204.04645.pdf
- Self-Supervised Video Representation Learning With Motion-Contrastive Perception: https://arxiv.org/pdf/2204.04607.pdf
- SCGC: Self-Supervised Contrastive Graph Clustering: https://arxiv.org/pdf/2204.12656.pdf
- Audio-Visual Contrastive Learning for Self-supervised Action Recognition: https://arxiv.org/pdf/2204.13386.pdf
- Divergence-Aware Federated Self-Supervised Learning: https://arxiv.org/pdf/2204.04385.pdf
- Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection: https://arxiv.org/pdf/2201.07131.pdf
- Self-supervised Learning of Adversarial Example: Towards Good Generalizations for Deepfake Detection: https://arxiv.org/pdf/2203.12208.pdf
- Selecting task with optimal transport self-supervised learning for few-shot classification: https://arxiv.org/pdf/2204.00289.pdf
- End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation: https://arxiv.org/pdf/2204.00540.pdf
- Federated Self-supervised Speech Representations: Are We There Yet?: https://arxiv.org/pdf/2204.02804.pdf
- Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency: https://arxiv.org/pdf/2204.03017.pdf
- Hierarchical Self-supervised Representation Learning for Movie Understanding: https://arxiv.org/pdf/2204.03101.pdf
- Self-Supervised Keypoint Discovery in Behavioral Videos: https://arxiv.org/pdf/2112.05121.pdf