- High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs: https://arxiv.org/pdf/2207.00257.pdf
- Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding: https://arxiv.org/pdf/2207.02971v1.pdf
- More ConvNets in the 2020s: Scaling up Kernels Beyond 51 × 51 using Sparsity: https://arxiv.org/pdf/2207.03620v1.pdf
- Softmax-free Linear Transformers: https://arxiv.org/pdf/2207.03341v1.pdf
- Learning Quality-aware Dynamic Memory for Video Object Segmentation: https://arxiv.org/pdf/2207.07922v1.pdf
- 3D Instances as 1D Kernels: https://arxiv.org/pdf/2207.07372v2.pdf
- XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model: https://arxiv.org/pdf/2207.07115v2.pdf
- Bootstrapped Masked Autoencoders for Vision BERT Pretraining: https://arxiv.org/pdf/2207.07116v1.pdf
- Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis: https://arxiv.org/pdf/2207.05049v1.pdf
- Clover: Towards A Unified Video-Language Alignment and Fusion Model: https://arxiv.org/pdf/2207.07885v2.pdf
- KL-UCB-Switch: Optimal Regret Bounds for Stochastic Bandits from Both a Distribution-Dependent and a Distribution-Free Viewpoints: https://arxiv.org/pdf/1805.05071.pdf
- HyperTensioN and Total-order Forward Decomposition optimizations: https://arxiv.org/pdf/2207.00345.pdf
- Near-Optimal High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise: https://arxiv.org/pdf/2106.05958.pdf
- The “AI+R”-tree: An Instance-optimized R-tree: https://arxiv.org/pdf/2207.00550.pdf
- HELIX-MO: Sample-Efficient Molecular Optimization On Scene-Sensitive Latent Space: https://arxiv.org/pdf/2112.00905.pdf
- Audio-Visual Segmentation: https://arxiv.org/pdf/2207.05042v1.pdf
- YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors: https://arxiv.org/pdf/2207.02696v1.pdf
- In Defense of Online Models for Video Instance Segmentation: https://arxiv.org/pdf/2207.10661v1.pdf
- Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models: https://arxiv.org/pdf/2207.13038v1.pdf
- FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning: https://arxiv.org/pdf/2207.09653v1.pdf
- Improving Diffusion Model Efficiency Through Patching: https://arxiv.org/pdf/2207.04316v1.pdf
- Patchwork++: Fast and Robust Ground Segmentation Solving Partial Under-Segmentation Using 3D Point Cloud: https://arxiv.org/pdf/2207.11919v1.pdf
- Relighting4D: Neural Relightable Human from Videos: https://arxiv.org/pdf/2207.07104v1.pdf
- Meta-DETR: Image-Level Few-Shot Detection with Inter-Class Correlation Exploitation: https://arxiv.org/pdf/2208.00219v1.pdf
- Self-Supervised Hypergraph Transformer for Recommender Systems: https://arxiv.org/pdf/2207.14338v1.pdf
- Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration: https://arxiv.org/pdf/2207.10447v1.pdf
- AiATrack: Attention in Attention for Transformer Visual Tracking: https://arxiv.org/pdf/2207.09603v2.pdf
- A Systematic Review and Replicability Study of BERT4Rec for Sequential Recommendation: https://arxiv.org/pdf/2207.07483v1.pdf
- HiFormer: Hierarchical Multi-scale Representations Using Transformers for Medical Image Segmentation: https://arxiv.org/pdf/2207.08518v1.pdf
- FashionViL: Fashion-Focused Vision-and-Language Representation Learning: https://arxiv.org/pdf/2207.08150v1.pdf
- JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes: https://arxiv.org/pdf/2207.07895v1.pdf
- Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective: https://arxiv.org/pdf/2207.09339v1.pdf
- SeedFormer: Patch Seeds based Point Cloud Completion with Upsample Transformer: https://arxiv.org/pdf/2207.10315v1.pdf
- Dense Cross-Query-and-Support Attention Weighted Mask Aggregation for Few-Shot Segmentation: https://arxiv.org/pdf/2207.08549v1.pdf
- Panoptic Scene Graph Generation: https://arxiv.org/pdf/2207.11247v1.pdf
- N-Grammer: Augmenting Transformers with latent n-grams: https://arxiv.org/pdf/2207.06366v1.pdf
- OSLAT: Open Set Label Attention Transformer for Medical Entity Span Extraction: https://arxiv.org/pdf/2207.05817v1.pdf
- Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection: https://arxiv.org/pdf/2207.05293v1.pdf
- Masked Autoencoders that Listen: https://arxiv.org/pdf/2207.06405v2.pdf
- Dual Vision Transformer: https://arxiv.org/pdf/2207.04976v2.pdf
- DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer: https://arxiv.org/pdf/2207.04491v1.pdf
- Transformer Neural Processes: Uncertainty-Aware Meta Learning Via Sequence Modeling: https://arxiv.org/pdf/2207.04179v1.pdf
- Pure Transformers are Powerful Graph Learners: https://arxiv.org/pdf/2207.02505v1.pdf
- Divert More Attention to Vision-Language Tracking: https://arxiv.org/pdf/2207.01076v1.pdf
- PolarFormer: Multi-camera 3D Object Detection with Polar Transformers: https://arxiv.org/pdf/2206.15398v4.pdf
- Zero-Shot Video Captioning with Evolving Pseudo-Tokens: https://arxiv.org/pdf/2207.11100v2.pdf
- Language Model Cascades: https://arxiv.org/pdf/2207.10342v2.pdf
- Label2Label: A Language Modeling Framework for Multi-Attribute Learning: https://arxiv.org/pdf/2207.08677v1.pdf
- Language Modelling With Pixels: https://arxiv.org/pdf/2207.06991v1.pdf
- The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications: https://arxiv.org/pdf/2207.04043v1.pdf
- Aspect-specific Context Modeling for Aspect-based Sentiment Analysis: https://arxiv.org/pdf/2207.08099v1.pdf
- An Empirical Survey on Long Document Summarization: Datasets, Models and Metrics: https://arxiv.org/pdf/2207.00939v1.pdf
- Semi-supervised 3D Object Detection with Proficient Teachers: https://arxiv.org/pdf/2207.12655v1.pdf
- Online Knowledge Distillation via Mutual Contrastive Learning for Visual Recognition: https://arxiv.org/pdf/2207.11518v1.pdf
- Adaptive Soft Contrastive Learning: https://arxiv.org/pdf/2207.11163v1.pdf
- Decoupled Adversarial Contrastive Learning for Self-supervised Adversarial Robustness: https://arxiv.org/pdf/2207.10899v1.pdf
- HSE-NN Team at the 4th ABAW Competition: Multi-task Emotion Recognition and Learning from Synthetic Images: https://arxiv.org/pdf/2207.09508v2.pdf
- Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss: https://arxiv.org/pdf/2207.11482v1.pdf
- Contextual Information and Commonsense Based Prompt for Emotion Recognition in Conversation: https://arxiv.org/pdf/2207.13254v1.pdf
- AADG: Automatic Augmentation for Domain Generalization on Retinal Image Segmentation: https://arxiv.org/pdf/2207.13249v1.pdf
- TransNorm: Transformer Provides a Strong Spatial Normalization Mechanism for a Deep Segmentation Model: https://arxiv.org/pdf/2207.13415v1.pdf
- DETRs with Hybrid Matching: https://arxiv.org/pdf/2207.13080v1.pdf
- Compositional Human-Scene Interaction Synthesis with Semantic Control: https://arxiv.org/pdf/2207.12824v1.pdf
- Behind Every Domain There is a Shift: Adapting Distortion-aware Vision Transformers for Panoramic Semantic Segmentation: https://arxiv.org/pdf/2207.11860v2.pdf
- What is Healthy? Generative Counterfactual Diffusion for Lesion Localization: https://arxiv.org/pdf/2207.12268v1.pdf
- PCA: Semi-supervised Segmentation with Patch Confidence Adversarial Training: https://arxiv.org/pdf/2207.11683v1.pdf
- Self-Support Few-Shot Semantic Segmentation: https://arxiv.org/pdf/2207.11549v1.pdf
- DeVIS: Making Deformable Transformers Work for Video Instance Segmentation: https://arxiv.org/pdf/2207.11103v1.pdf
- Mining Relations among Cross-Frame Affinities for Video Semantic Segmentation: https://arxiv.org/pdf/2207.10436v1.pdf
- Region Aware Video Object Segmentation with Deep Motion Modeling: https://arxiv.org/pdf/2207.10258v1.pdf
- CoSMix: Compositional Semantic Mix for Domain Adaptation in 3D LiDAR Segmentation: https://arxiv.org/pdf/2207.09778v1.pdf
- Latent Discriminant Deterministic Uncertainty: https://arxiv.org/pdf/2207.10130v1.pdf
- GIPSO: Geometrically Informed Propagation for Online Adaptation in 3D LiDAR Segmentation: https://arxiv.org/pdf/2207.09763v1.pdf
- DecoupleNet: Decoupled Network for Domain Adaptive Semantic Segmentation: https://arxiv.org/pdf/2207.09988v1.pdf
- Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning: https://arxiv.org/pdf/2207.04978v1.pdf
- Refign: Align and Refine for Adaptation of Semantic Segmentation to Adverse Conditions: https://arxiv.org/pdf/2207.06825v1.pdf
- Tackling Background Distraction in Video Object Segmentation: https://arxiv.org/pdf/2207.06953v3.pdf
- Prototypical Contrast Adaptation for Domain Adaptive Semantic Segmentation: https://arxiv.org/pdf/2207.06654v1.pdf
- LightViT: Towards Light-Weight Convolution-Free Vision Transformers: https://arxiv.org/pdf/2207.05557v1.pdf
- SFNet: Faster, Accurate, and Domain Agnostic Semantic Segmentation via Semantic Flow: https://arxiv.org/pdf/2207.04415v1.pdf
- 2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds: https://arxiv.org/pdf/2207.04397v1.pdf
- Domain Adaptive Video Segmentation via Temporal Pseudo Supervision: https://arxiv.org/pdf/2207.02372v1.pdf
- GFNet: Geometric Flow Network for 3D Point Cloud Semantic Segmentation: https://arxiv.org/pdf/2207.02605v1.pdf
- OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers: https://arxiv.org/pdf/2207.02255v3.pdf
- Distilling Ensemble of Explanations for Weakly-Supervised Pre-Training of Image Segmentation Models: https://arxiv.org/pdf/2207.03335v1.pdf
- Improving Nighttime Driving-Scene Segmentation via Dual Image-adaptive Learnable Filters: https://arxiv.org/pdf/2207.01331v1.pdf
- Towards Robust Video Object Segmentation with Adaptive Object Calibration: https://arxiv.org/pdf/2207.00887v1.pdf
- Domain-invariant Feature Exploration for Domain Generalization: https://arxiv.org/pdf/2207.12020v1.pdf
- Collaborating Domain-shared and Target-specific Feature Clustering for Cross-domain 3D Action Recognition
- TinyViT: Fast Pretraining Distillation for Small Vision Transformers: https://arxiv.org/pdf/2207.10666v1.pdf
- FedX: Unsupervised Federated Learning with Cross Knowledge Distillation: https://arxiv.org/pdf/2207.09158v1.pdf
- Class-incremental Novel Class Discovery: https://arxiv.org/pdf/2207.08605v1.pdf
- ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech: https://arxiv.org/pdf/2207.06389v1.pdf
- Knowledge Condensation Distillation: https://arxiv.org/pdf/2207.05409v1.pdf
- CENet: Toward Concise and Efficient LiDAR Semantic Segmentation for Autonomous Driving: https://arxiv.org/pdf/2207.12691v1.pdf
- AutoAlignV2: Deformable Feature Aggregation for Dynamic Multi-Modal 3D Object Detection: https://arxiv.org/pdf/2207.10316v1.pdf
- Fully Sparse 3D Object Detection: https://arxiv.org/pdf/2207.10035v1.pdf
- Restoring Vision in Adverse Weather Conditions with Patch-Based Denoising Diffusion Models: https://arxiv.org/pdf/2207.14626v1.pdf
- Privacy-Preserving Face Recognition with Learnable Privacy Budgets in Frequency Domain: https://arxiv.org/pdf/2207.07316v3.pdf
- DuetFace: Collaborative Privacy-Preserving Face Recognition via Channel Splitting in the Frequency Domain: https://arxiv.org/pdf/2207.07340v1.pdf
- Time Is MattEr: Temporal Self-supervision for Video Transformers: https://arxiv.org/pdf/2207.09067v1.pdf
- AvatarPoser: Articulated Full-Body Pose Tracking from Sparse Motion Sensing: https://arxiv.org/pdf/2207.13784v1.pdf
- ShAPO: Implicit Representations for Multi-Object Shape, Appearance, and Pose Optimization: https://arxiv.org/pdf/2207.13691v1.pdf