- 3D Vision with Transformers: A Survey: https://arxiv.org/pdf/2208.04309v1.pdf
- Unifying Visual Perception by Dispersible Points Learning: https://arxiv.org/pdf/2208.08630v1.pdf
- ZoomNAS: Searching for Whole-body Human Pose Estimation in the Wild: https://arxiv.org/pdf/2208.11547v1.pdf
- ROLAND: Graph Learning Framework for Dynamic Graphs: https://arxiv.org/pdf/2208.07239v1.pdf
- Investigating Efficiently Extending Transformers for Long Input Summarization: https://arxiv.org/pdf/2208.04347v1.pdf
- Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion: https://arxiv.org/pdf/2207.14172v1.pdf
- TransNorm: Transformer Provides a Strong Spatial Normalization Mechanism for a Deep Segmentation Model: https://arxiv.org/pdf/2207.13415v1.pdf
- Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers: https://arxiv.org/pdf/2207.13820v1.pdf
- DETRs with Hybrid Matching: https://arxiv.org/pdf/2207.13080v1.pdf
- Safety-Enhanced Autonomous Driving Using Interpretable Sensor Fusion Transformer: https://arxiv.org/pdf/2207.14024v3.pdf
- A Simple Baseline for Multi-Camera 3D Object Detection: https://arxiv.org/pdf/2208.10035v1.pdf
- Learning Visibility for Robust Dense Human Body Estimation: https://arxiv.org/pdf/2208.10652v1.pdf
- SwinIR: Image Restoration Using Swin Transformer: https://arxiv.org/pdf/2108.10257v1.pdf
- MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures: https://arxiv.org/pdf/2208.00277v2.pdf
- Z-BERT-A: a zero-shot pipeline for unknown intent detection: https://arxiv.org/pdf/2208.07084v2.pdf
- HighlightNet: Highlighting Low-Light Potential Features for Real-Time UAV Tracking: https://arxiv.org/pdf/2208.06818v1.pdf
- Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise: https://arxiv.org/pdf/2208.09392v1.pdf
- An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion: https://arxiv.org/pdf/2208.01618v1.pdf
- PeRFception: Perception using Radiance Fields: https://arxiv.org/pdf/2208.11537v1.pdf
- A Walk in the Park: Learning to Walk in 20 Minutes With Model-Free Reinforcement Learning: https://arxiv.org/pdf/2208.07860v1.pdf
- YOLOPv2: Better, Faster, Stronger for Panoptic Driving Perception: https://arxiv.org/pdf/2208.11434v1.pdf
- Musika! Fast Infinite Waveform Music Generation: https://arxiv.org/pdf/2208.08706v1.pdf
- YOLOV: Making Still Image Object Detectors Great at Video Object Detection: https://arxiv.org/pdf/2208.09686v1.pdf
- Pix4Point: Image Pretrained Transformers for 3D Point Cloud Understanding: https://arxiv.org/pdf/2208.12259v1.pdf
- Federated Learning via Decentralized Dataset Distillation in Resource-Constrained Edge Environments: https://arxiv.org/pdf/2208.11311v1.pdf
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale: https://arxiv.org/pdf/2208.07339v1.pdf
- TotalSegmentator: robust segmentation of 104 anatomical structures in CT images: https://arxiv.org/pdf/2208.05868v1.pdf
- A Library For Representing Python Programs As Graphs For Machine Learning: https://arxiv.org/pdf/2208.07461v1.pdf
- Diffusion-based Time Series Imputation and Forecasting with Structured State Space Models: https://arxiv.org/pdf/2208.09399v1.pdf
- Refine and Represent: Region-to-Object Representation Learning: https://arxiv.org/pdf/2208.11821v1.pdf
- Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning: https://arxiv.org/pdf/2208.04202v1.pdf
- Representation Learning For The Automatic Indexing Of Sound Effects Libraries: https://arxiv.org/pdf/2208.09096v1.pdf
- Contrastive Audio-Language Learning For Music: https://arxiv.org/pdf/2208.12208v1.pdf
- Unbiased Multi-Modality Guidance for Image Inpainting: https://arxiv.org/pdf/2208.11844v1.pdf
- Paint2Pix: Interactive Painting based Progressive Image Synthesis and Editing: https://arxiv.org/pdf/2208.08092v1.pdf
- PyMIC: A deep learning toolkit for annotation-efficient medical image segmentation: https://arxiv.org/pdf/2208.09350v1.pdf
- Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data: https://arxiv.org/pdf/2107.10833v2.pdf
- DeepInteraction: 3D Object Detection via Modality Interaction: https://arxiv.org/pdf/2208.11112v2.pdf
- Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors: https://arxiv.org/pdf/2208.11356v1.pdf
- Federated Contrastive Learning and Masked Autoencoder for Dermatological Disease Diagnosis: https://arxiv.org/pdf/2208.11278v1.pdf
- Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling: https://arxiv.org/pdf/2208.12257v1.pdf
- Diverse Title Generation for Stack Overflow Posts with Multiple Sampling Enhanced Transformer: https://arxiv.org/pdf/2208.11523v1.pdf
- I Know What You Do Not Know: Knowledge Graph Embedding via Co-distillation Learning: https://arxiv.org/pdf/2208.09828v1.pdf
- Learning Spatial-Frequency Transformer for Visual Object Tracking: https://arxiv.org/pdf/2208.08829v1.pdf
- Expressing Multivariate Time Series as Graphs with Time Series Attention Transformer: https://arxiv.org/pdf/2208.09300v1.pdf
- Conviformers: Convolutionally Guided Vision Transformer: https://arxiv.org/pdf/2208.08900v1.pdf
- Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS Instance Segmentation: https://arxiv.org/pdf/2208.08315v3.pdf
- Mask and Reason: Pre-Training Knowledge Graph Transformers for Complex Logical Queries: https://arxiv.org/pdf/2208.07638v1.pdf
- Domain-Specific Text Generation for Machine Translation: https://arxiv.org/pdf/2208.05909v1.pdf
- SSformer: A Lightweight Transformer for Semantic Segmentation: https://arxiv.org/pdf/2208.02034v1.pdf
- Transformers as Meta-Learners for Implicit Neural Representations: https://arxiv.org/pdf/2208.02801v2.pdf
- Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models: https://arxiv.org/pdf/2208.03306v1.pdf
- Multi-Feature Vision Transformer via Self-Supervised Representation Learning for Improvement of COVID-19 Diagnosis: https://arxiv.org/pdf/2208.01843v1.pdf
- Reference-based Image Super-Resolution with Deformable Attention Transformer: https://arxiv.org/pdf/2207.11938v2.pdf
- Focused Decoding Enables 3D Anatomical Detection by Transformers: https://arxiv.org/pdf/2207.10774v2.pdf
- BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation: https://arxiv.org/pdf/2208.01159v4.pdf
- Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios: https://arxiv.org/pdf/2207.05501v4.pdf
- VAuLT: Augmenting the Vision-and-Language Transformer with the Propagation of Deep Language Representations: https://arxiv.org/pdf/2208.09021v1.pdf
- Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks: https://arxiv.org/pdf/2208.10442v1.pdf
- DPTDR: Deep Prompt Tuning for Dense Passage Retrieval: https://arxiv.org/pdf/2208.11503v1.pdf
- Language Supervised Training for Skeleton-based Action Recognition: https://arxiv.org/pdf/2208.05318v1.pdf
- Towards No.1 in CLUE Semantic Matching Challenge: Pre-trained Language Model Erlangshen with Propensity-Corrected Loss: https://arxiv.org/pdf/2208.02959v1.pdf
- HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein Language Model as an Alternative: https://arxiv.org/pdf/2207.13921v2.pdf
- PyABSA: Open Framework for Aspect-based Sentiment Analysis: https://arxiv.org/pdf/2208.01368v1.pdf
- Self-supervised Contrastive Representation Learning for Semi-supervised Time-Series Classification: https://arxiv.org/pdf/2208.06616v1.pdf
- COCOA: Cross Modality Contrastive Learning for Sensor Data: https://arxiv.org/pdf/2208.00467v2.pdf
- Non-Contrastive Self-Supervised Learning of Utterance-Level Speech Representations: https://arxiv.org/pdf/2208.05413v1.pdf
- Mind the Gap in Distilling StyleGANs: https://arxiv.org/pdf/2208.08840v1.pdf
- Exploring Point-BEV Fusion for 3D Point Cloud Object Tracking with Transformer: https://arxiv.org/pdf/2208.05216v1.pdf
- FRA-RIR: Fast Random Approximation of the Image-source Method: https://arxiv.org/pdf/2208.04101v1.pdf
- Learnability Enhancement for Low-light Raw Denoising: Where Paired Real Data Meets Noise Modeling: https://arxiv.org/pdf/2207.06103v2.pdf
- Expanding Language-Image Pretrained Models for General Video Recognition: https://arxiv.org/pdf/2208.02816v1.pdf
- Domain Randomization-Enhanced Depth Simulation and Restoration for Perceiving and Grasping Specular and Transparent Objects: https://arxiv.org/pdf/2208.03792v1.pdf