1. Recognition, Detection, Segmentation and Pose Estimation
End-to-End Object Detection with Transformers [Paper] [Code]
MutualNet: Adaptive ConvNet via Mutual Learning from Network Width and Resolution [Paper] [Code]
Gradient Centralization: A New Optimization Technique for Deep Neural Networks [Paper] [Code]
Smooth-AP: Smoothing the Path Towards Large-Scale Image Retrieval [Paper]
Hybrid Models for Open Set Recognition [Paper]
Conditional Convolutions for Instance Segmentation [Paper]
Multitask Learning Strengthens Adversarial Robustness [Paper]
Dynamic Group Convolution for Accelerating Convolutional Neural Networks [Paper]
Disentangled Non-local Neural Networks [Paper]
Hard Negative Examples are Hard, but Useful [Paper]
Volumetric Transformer Networks [Paper]
Faster AutoAugment: Learning Augmentation Strategies Using Backpropagation [Paper]
A Unifying Mutual Information View of Metric Learning: Cross-Entropy vs. Pairwise Losses [Paper]
Semantic Flow for Fast and Accurate Scene Parsing [Paper]
Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation [Paper] [Code]
Learning From Multiple Experts: Self-paced Knowledge Distillation for Long-tailed Classification [Paper]
Feature Normalized Knowledge Distillation for Image Classification [Paper] [Code]
AutoMix: Mixup Networks for Sample Interpolation via Cooperative Barycenter Learning [Paper]
OnlineAugment: Online Data Augmentation with Less Domain Knowledge [Paper] [Code]
Distribution-Balanced Loss for Multi-Label Classification in Long-Tailed Datasets [Paper] [Code]
DiVA: Diverse Visual Feature Aggregation for Deep Metric Learning [Paper]
Estimating People Flows to Better Count Them in Crowded Scenes [Paper]
SoundSpaces: Audio-Visual Navigation in 3D Environments [Paper]
Spatially Adaptive Inference with Stochastic Feature Sampling and Interpolation [Paper]
DADA: Differentiable Automatic Data Augmentation [Paper]
URIE: Universal Image Enhancement for Visual Recognition in the Wild [Paper]
BorderDet: Border Feature for Dense Object Detection [Paper] [Code]
TIDE: A General Toolbox for Understanding Errors in Object Detection [Paper] [Code]
AABO: Adaptive Anchor Box Optimization for Object Detection via Bayesian Sub-sampling [Paper]
PIoU Loss: Towards Accurate Oriented Object Detection in Complex Environments [Paper]
Learning Object Depth from Camera Motion and Video Object Segmentation [Paper]
Attentive Normalization [Paper]
Momentum Batch Normalization for Deep Learning with Small Batch Size [Paper]
A Simple Way to Make Neural Networks Robust Against Diverse Image Corruptions [Paper]
2. Semi-Supervised, Unsupervised, Transfer, Representation & Few-Shot Learning
Big Transfer (BiT): General Visual Representation Learning [Paper]
Learning Visual Representations with Caption Annotations [Paper]
Memory-augmented Dense Predictive Coding for Video Representation Learning [Paper]
SCAN: Learning to Classify Images without Labels [Paper]
GATCluster: Self-Supervised Gaussian-Attention Network for Image Clustering [Paper]
Associative Alignment for Few-shot Image Classification [Paper]
Domain Adaptation through Task Distillation [Paper]
Are Labels Necessary for Neural Architecture Search? [Paper]
The Hessian Penalty: A Weak Prior for Unsupervised Disentanglement [Paper]
Cross-Domain Cascaded Deep Translation [Paper]
Self-Challenging Improves Cross-Domain Generalization [Paper]
Label Propagation with Augmented Anchors: A Simple Semi-Supervised Learning baseline for Unsupervised Domain Adaptation [Paper]
Regularization with Latent Space Virtual Adversarial Training [Paper]
Transporting Labels via Hierarchical Optimal Transport for Semi-Supervised Learning [Paper]
Negative Margin Matters: Understanding Margin in Few-shot Classification [Paper]
Rethinking Few-Shot Image Classification: a Good Embedding Is All You Need? [Paper] [Code]
Prototype Rectification for Few-Shot Learning [Paper]
3. 3D Computer Vision & Robotics
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis [Paper]
Towards Streaming Perception [Paper]
Teaching Cameras to Feel: Estimating Tactile Physical Properties of Surfaces From Images [Paper]
Convolutional Occupancy Networks [Paper]
Tracking Emerges by Looking Around Static Scenes, with Neural 3D Mapping [Paper]
Privacy Preserving Structure-from-Motion [Paper]
Multiview Detection with Feature Perspective Transformation [Paper]
Motion Capture from Internet Videos [Paper]
Atlas: End-to-End 3D Scene Reconstruction from Posed Images [Paper]
Generative Sparse Detection Networks for 3D Single-shot Object Detection [Paper]
PointTriNet: Learned Triangulation of 3D Point Sets [Paper]
Points2Surf: Learning Implicit Surfaces from Point Cloud Patches [Paper]
Geometric Capsule Autoencoders for 3D Point Clouds [Paper]
Deep Feedback Inverse Problem Solver [Paper]
Single View Metrology in the Wild [Paper]
Shape and Viewpoint without Keypoints [Paper]
Hierarchical Kinematic Human Mesh Recovery [Paper]
3D Human Shape and Pose from a Single Low-Resolution Image with Self-Supervised Learning [Paper]
Few-Shot Single-View 3D Object Reconstruction with Compositional Priors [Paper]
NASA: Neural Articulated Shape Approximation [Paper]
Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation [Paper]
Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild [Paper]
4. Image and Video Synthesis
Transforming and Projecting Images into Class-conditional Generative Networks [Paper]
Contrastive Learning for Unpaired Image-to-Image Translation [Paper]
Rewriting a Deep Generative Model [Paper]
Learning Stereo from Single Images [Paper]
What Makes Fake Images Detectable? Understanding Properties that Generalize [Paper]
Free View Synthesis [Paper]
Unselfie: Translating Selfies to Neutral-pose Portraits in the Wild [Paper]
World-Consistent Video-to-Video Synthesis [Paper]
RetrieveGAN: Image Synthesis via Differentiable Patch Retrieval [Paper]
Generating Videos of Zero-Shot Compositions of Actions and Objects [Paper]
Across Scales & Across Dimensions: Temporal Super-Resolution using Deep Internal Learning [Paper]
Conditional Entropy Coding for Efficient Video Compression [Paper]
Semantic View Synthesis [Paper]
Learning Camera-Aware Noise Models [Paper]
In-Domain GAN Inversion for Real Image Editing [Paper]
5. Vision and Language
Connecting Vision and Language with Localized Narratives [Paper]
UNITER: UNiversal Image-TExt Representation Learning [Paper]
Learning to Learn Words from Visual Scenes [Paper]
Contrastive Learning for Weakly Supervised Phrase Grounding [Paper]
Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments [Paper]
Adaptive Text Recognition through Visual Matching [Paper]
6. Others
Deep Learning: Applications, Methodology, and Theory
A Generic Visualization Approach for Convolutional Neural Networks [Paper]
Spike-FlowNet: Event-based Optical Flow Estimation [Paper]
A Metric Learning Reality Check [Paper]
Learning Predictive Models from Observation and Interaction [Paper]
Beyond Fixed Grid: Learning Geometric Image Representation with a Deformable Grid [Paper]
Stable Low-rank Tensor Decomposition for Compression of Convolutional Neural Network [Paper]
EagleEye: Fast Sub-net Evaluation for Efficient Neural Network Pruning [Paper]
Making Sense of CNNs: Interpreting Deep Representations & Their Invariances with INNs [Paper]
Event-based Asynchronous Sparse Convolutional Networks [Paper]
Low-Level Vision, Motion and Tracking
RAFT: Recurrent All-Pairs Field Transforms for Optical Flow [Paper]
VisualEchoes: Spatial Image Representation Learning through Echolocation [Paper]
Self-Supervised Learning of Audio-Visual Objects from Video [Paper]
Tracking Objects as Points
Face, Gesture, and Body Pose
Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach [Paper]
Thinking in Frequency: Face Forgery Detection by Mining Frequency-aware Clues [Paper]
Lifespan Age Transformation Synthesis [Paper]
Monocular Expressive Body Regression through Body-Driven Attention [Paper]
DLow: Diversifying Latent Flows for Diverse Human Motion Prediction [Paper]
Fast Bi-layer Neural Synthesis of One-Shot Realistic Head Avatars [Paper]
Blind Face Restoration via Deep Multi-scale Component Dictionaries [Paper]
Action Recognition, Understanding
RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition [Paper]
Self-supervised Video Representation Learning by Pace Prediction [Paper]
Aligning Videos in Space and Time [Paper]
Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video [Paper]
Foley Music: Learning to Generate Music from Videos [Paper]
Reference: