- High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs: https://arxiv.org/pdf/2207.00257.pdf
- Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding: https://arxiv.org/pdf/2207.02971v1.pdf
- More ConvNets in the 2020s: Scaling up Kernels Beyond 51 × 51 using Sparsity: https://arxiv.org/pdf/2207.03620v1.pdf
- Softmax-free Linear Transformers: https://arxiv.org/pdf/2207.03341v1.pdf
- Learning Quality-aware Dynamic Memory for Video Object Segmentation: https://arxiv.org/pdf/2207.07922v1.pdf
- 3D Instances as 1D Kernels: https://arxiv.org/pdf/2207.07372v2.pdf
- XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model: https://arxiv.org/pdf/2207.07115v2.pdf
- Bootstrapped Masked Autoencoders for Vision BERT Pretraining: https://arxiv.org/pdf/2207.07116v1.pdf
- Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis: https://arxiv.org/pdf/2207.05049v1.pdf
- Clover: Towards A Unified Video-Language Alignment and Fusion Model: https://arxiv.org/pdf/2207.07885v2.pdf
- KL-UCB-Switch: Optimal Regret Bounds for Stochastic Bandits from Both a Distribution-Dependent and a Distribution-Free Viewpoints: https://arxiv.org/pdf/1805.05071.pdf
- HyperTensioN and Total-order Forward Decomposition optimizations: https://arxiv.org/pdf/2207.00345.pdf
- Near-Optimal High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise: https://arxiv.org/pdf/2106.05958.pdf
- The “AI+R”-tree: An Instance-optimized R-tree: https://arxiv.org/pdf/2207.00550.pdf
- HELIX-MO: Sample-Efficient Molecular Optimization On Scene-Sensitive Latent Space: https://arxiv.org/pdf/2112.00905.pdf
- Audio-Visual Segmentation: https://arxiv.org/pdf/2207.05042v1.pdf
- YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors: https://arxiv.org/pdf/2207.02696v1.pdf
- In Defense of Online Models for Video Instance Segmentation: https://arxiv.org/pdf/2207.10661v1.pdf
- Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models: https://arxiv.org/pdf/2207.13038v1.pdf
- FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning: https://arxiv.org/pdf/2207.09653v1.pdf
- Improving Diffusion Model Efficiency Through Patching: https://arxiv.org/pdf/2207.04316v1.pdf
- Patchwork++: Fast and Robust Ground Segmentation Solving Partial Under-Segmentation Using 3D Point Cloud: https://arxiv.org/pdf/2207.11919v1.pdf
- Relighting4D: Neural Relightable Human from Videos: https://arxiv.org/pdf/2207.07104v1.pdf
- Meta-DETR: Image-Level Few-Shot Detection with Inter-Class Correlation Exploitation: https://arxiv.org/pdf/2208.00219v1.pdf
- Self-Supervised Hypergraph Transformer for Recommender Systems: https://arxiv.org/pdf/2207.14338v1.pdf
- Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration: https://arxiv.org/pdf/2207.10447v1.pdf
- AiATrack: Attention in Attention for Transformer Visual Tracking: https://arxiv.org/pdf/2207.09603v2.pdf
- A Systematic Review and Replicability Study of BERT4Rec for Sequential Recommendation: https://arxiv.org/pdf/2207.07483v1.pdf
- HiFormer: Hierarchical Multi-scale Representations Using Transformers for Medical Image Segmentation: https://arxiv.org/pdf/2207.08518v1.pdf
- FashionViL: Fashion-Focused Vision-and-Language Representation Learning: https://arxiv.org/pdf/2207.08150v1.pdf
- JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes: https://arxiv.org/pdf/2207.07895v1.pdf
- Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective: https://arxiv.org/pdf/2207.09339v1.pdf
- SeedFormer: Patch Seeds based Point Cloud Completion with Upsample Transformer: https://arxiv.org/pdf/2207.10315v1.pdf
- Dense Cross-Query-and-Support Attention Weighted Mask Aggregation for Few-Shot Segmentation: https://arxiv.org/pdf/2207.08549v1.pdf
- Panoptic Scene Graph Generation: https://arxiv.org/pdf/2207.11247v1.pdf
- N-Grammer: Augmenting Transformers with latent n-grams: https://arxiv.org/pdf/2207.06366v1.pdf
- OSLAT: Open Set Label Attention Transformer for Medical Entity Span Extraction: https://arxiv.org/pdf/2207.05817v1.pdf
- Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection: https://arxiv.org/pdf/2207.05293v1.pdf
- Masked Autoencoders that Listen: https://arxiv.org/pdf/2207.06405v2.pdf
- Dual Vision Transformer: https://arxiv.org/pdf/2207.04976v2.pdf
- DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer: https://arxiv.org/pdf/2207.04491v1.pdf
- Transformer Neural Processes: Uncertainty-Aware Meta Learning Via Sequence Modeling: https://arxiv.org/pdf/2207.04179v1.pdf
- Pure Transformers are Powerful Graph Learners: https://arxiv.org/pdf/2207.02505v1.pdf
- Divert More Attention to Vision-Language Tracking: https://arxiv.org/pdf/2207.01076v1.pdf
- PolarFormer: Multi-camera 3D Object Detection with Polar Transformers: https://arxiv.org/pdf/2206.15398v4.pdf
- Zero-Shot Video Captioning with Evolving Pseudo-Tokens: https://arxiv.org/pdf/2207.11100v2.pdf
- Language Model Cascades: https://arxiv.org/pdf/2207.10342v2.pdf
- Label2Label: A Language Modeling Framework for Multi-Attribute Learning: https://arxiv.org/pdf/2207.08677v1.pdf
- Language Modelling With Pixels: https://arxiv.org/pdf/2207.06991v1.pdf
- The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications: https://arxiv.org/pdf/2207.04043v1.pdf
- Aspect-specific Context Modeling for Aspect-based Sentiment Analysis: https://arxiv.org/pdf/2207.08099v1.pdf
- An Empirical Survey on Long Document Summarization: Datasets, Models and Metrics: https://arxiv.org/pdf/2207.00939v1.pdf
- Semi-supervised 3D Object Detection with Proficient Teachers: https://arxiv.org/pdf/2207.12655v1.pdf
- Online Knowledge Distillation via Mutual Contrastive Learning for Visual Recognition: https://arxiv.org/pdf/2207.11518v1.pdf
- Adaptive Soft Contrastive Learning: https://arxiv.org/pdf/2207.11163v1.pdf
- Decoupled Adversarial Contrastive Learning for Self-supervised Adversarial Robustness: https://arxiv.org/pdf/2207.10899v1.pdf
- HSE-NN Team at the 4th ABAW Competition: Multi-task Emotion Recognition and Learning from Synthetic Images: https://arxiv.org/pdf/2207.09508v2.pdf
- Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss: https://arxiv.org/pdf/2207.11482v1.pdf
- Contextual Information and Commonsense Based Prompt for Emotion Recognition in Conversation: https://arxiv.org/pdf/2207.13254v1.pdf
- AADG: Automatic Augmentation for Domain Generalization on Retinal Image Segmentation: https://arxiv.org/pdf/2207.13249v1.pdf
- TransNorm: Transformer Provides a Strong Spatial Normalization Mechanism for a Deep Segmentation Model: https://arxiv.org/pdf/2207.13415v1.pdf
- DETRs with Hybrid Matching: https://arxiv.org/pdf/2207.13080v1.pdf
- Compositional Human-Scene Interaction Synthesis with Semantic Control: https://arxiv.org/pdf/2207.12824v1.pdf
- Behind Every Domain There is a Shift: Adapting Distortion-aware Vision Transformers for Panoramic Semantic Segmentation: https://arxiv.org/pdf/2207.11860v2.pdf
- What is Healthy? Generative Counterfactual Diffusion for Lesion Localization: https://arxiv.org/pdf/2207.12268v1.pdf
- PCA: Semi-supervised Segmentation with Patch Confidence Adversarial Training: https://arxiv.org/pdf/2207.11683v1.pdf
- Self-Support Few-Shot Semantic Segmentation: https://arxiv.org/pdf/2207.11549v1.pdf
- DeVIS: Making Deformable Transformers Work for Video Instance Segmentation: https://arxiv.org/pdf/2207.11103v1.pdf
- Mining Relations among Cross-Frame Affinities for Video Semantic Segmentation: https://arxiv.org/pdf/2207.10436v1.pdf
- Region Aware Video Object Segmentation with Deep Motion Modeling: https://arxiv.org/pdf/2207.10258v1.pdf
- CoSMix: Compositional Semantic Mix for Domain Adaptation in 3D LiDAR Segmentation: https://arxiv.org/pdf/2207.09778v1.pdf
- Latent Discriminant Deterministic Uncertainty: https://arxiv.org/pdf/2207.10130v1.pdf
- GIPSO: Geometrically Informed Propagation for Online Adaptation in 3D LiDAR Segmentation: https://arxiv.org/pdf/2207.09763v1.pdf
- DecoupleNet: Decoupled Network for Domain Adaptive Semantic Segmentation: https://arxiv.org/pdf/2207.09988v1.pdf
- Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning: https://arxiv.org/pdf/2207.04978v1.pdf
- Refign: Align and Refine for Adaptation of Semantic Segmentation to Adverse Conditions: https://arxiv.org/pdf/2207.06825v1.pdf
- Tackling Background Distraction in Video Object Segmentation: https://arxiv.org/pdf/2207.06953v3.pdf
- Prototypical Contrast Adaptation for Domain Adaptive Semantic Segmentation: https://arxiv.org/pdf/2207.06654v1.pdf
- LightViT: Towards Light-Weight Convolution-Free Vision Transformers: https://arxiv.org/pdf/2207.05557v1.pdf
- SFNet: Faster, Accurate, and Domain Agnostic Semantic Segmentation via Semantic Flow: https://arxiv.org/pdf/2207.04415v1.pdf
- 2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds: https://arxiv.org/pdf/2207.04397v1.pdf
- Domain Adaptive Video Segmentation via Temporal Pseudo Supervision: https://arxiv.org/pdf/2207.02372v1.pdf
- GFNet: Geometric Flow Network for 3D Point Cloud Semantic Segmentation: https://arxiv.org/pdf/2207.02605v1.pdf
- OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers: https://arxiv.org/pdf/2207.02255v3.pdf
- Distilling Ensemble of Explanations for Weakly-Supervised Pre-Training of Image Segmentation Models: https://arxiv.org/pdf/2207.03335v1.pdf
- Improving Nighttime Driving-Scene Segmentation via Dual Image-adaptive Learnable Filters: https://arxiv.org/pdf/2207.01331v1.pdf
- Towards Robust Video Object Segmentation with Adaptive Object Calibration: https://arxiv.org/pdf/2207.00887v1.pdf
- Domain-invariant Feature Exploration for Domain Generalization: https://arxiv.org/pdf/2207.12020v1.pdf
- Collaborating Domain-shared and Target-specific Feature Clustering for Cross-domain 3D Action Recognition
- TinyViT: Fast Pretraining Distillation for Small Vision Transformers: https://arxiv.org/pdf/2207.10666v1.pdf
- FedX: Unsupervised Federated Learning with Cross Knowledge Distillation: https://arxiv.org/pdf/2207.09158v1.pdf
- Class-incremental Novel Class Discovery: https://arxiv.org/pdf/2207.08605v1.pdf
- ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech: https://arxiv.org/pdf/2207.06389v1.pdf
- Knowledge Condensation Distillation: https://arxiv.org/pdf/2207.05409v1.pdf
- CENet: Toward Concise and Efficient LiDAR Semantic Segmentation for Autonomous Driving: https://arxiv.org/pdf/2207.12691v1.pdf
- AutoAlignV2: Deformable Feature Aggregation for Dynamic Multi-Modal 3D Object Detection: https://arxiv.org/pdf/2207.10316v1.pdf
- Fully Sparse 3D Object Detection: https://arxiv.org/pdf/2207.10035v1.pdf
- Restoring Vision in Adverse Weather Conditions with Patch-Based Denoising Diffusion Models: https://arxiv.org/pdf/2207.14626v1.pdf
- Privacy-Preserving Face Recognition with Learnable Privacy Budgets in Frequency Domain: https://arxiv.org/pdf/2207.07316v3.pdf
- DuetFace: Collaborative Privacy-Preserving Face Recognition via Channel Splitting in the Frequency Domain: https://arxiv.org/pdf/2207.07340v1.pdf
- Time Is MattEr: Temporal Self-supervision for Video Transformers: https://arxiv.org/pdf/2207.09067v1.pdf
- AvatarPoser: Articulated Full-Body Pose Tracking from Sparse Motion Sensing: https://arxiv.org/pdf/2207.13784v1.pdf
- ShAPO: Implicit Representations for Multi-Object Shape, Appearance, and Pose Optimization: https://arxiv.org/pdf/2207.13691v1.pdf