[YOLOR] You Only Learn One Representation: Unified Network for Multiple Tasks
{Explicit, Implicit}
People “understand” the world through vision, hearing, touch, and also past experience.
Human experience can be learned through:
Normal learning (we call it explicit knowledge)
Subconscious learning (we call it implicit knowledge).
These experiences will be encoded and stored in the brain.
Using these abundant experiences as a huge database, human beings can effectively process data, even data unseen beforehand.
Figure. Performance of YOLOR vs. YOLOv4 and others.
Propose a unified network that can accomplish various tasks:
Learns a general representation by integrating implicit knowledge and explicit knowledge.
Improves the performance of the model with a very small amount of additional cost.
Introduce kernel space alignment, prediction refinement, and multi-task learning into the implicit knowledge learning process.
Discuss the ways of using vector, neural network, or matrix factorization as a tool to model implicit knowledge.
Observation = See Drone => Drone
Cover some methods that can automatically adjust or select features based on input data.
Transformer [14, 5, 20] uses query, key, or value to obtain self-attention.
Non-local networks [21, 4, 24] mainly extract pair-wise attention in time and space.
Explicit deep learning-based methods [7, 25] automatically select the appropriate kernel by input data.
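The attention-style methods listed above share one core operation; a toy NumPy sketch of scaled dot-product self-attention (shapes and weights are illustrative assumptions) shows how query, key, and value produce input-dependent feature weighting:

```python
# Minimal scaled dot-product self-attention sketch.
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # project the input into query/key/value spaces
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # pair-wise similarity scores, scaled and normalized per query
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v  # input-dependent mixture of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                        # 4 tokens, 8-dim each
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

Because the weights depend on the input itself, the features that get emphasized change per example, which is the "automatic adjustment" these methods exploit.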
Knowledge = See Drone => fly / stand
Cover the related literature of implicit deep knowledge learning and implicit differentiation.
Implicit neural representations [11]: obtain the parameterized continuous mapping representation of discrete inputs to perform different tasks.
Deep equilibrium models [2, 3, 19]: transform implicit learning into a residual-form neural network and compute the equilibrium point of that network.
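The deep-equilibrium idea above can be sketched as naive fixed-point iteration on a single residual-style layer; the layer, weight scales, and iteration count below are illustrative assumptions, not the cited models' actual code:

```python
# Deep-equilibrium sketch: instead of stacking many layers, iterate one
# layer f(z, x) = tanh(W z + U x) until it reaches its fixed point z*.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 8))  # small weights => contraction mapping
U = rng.normal(size=(8, 8))
x = rng.normal(size=8)                  # the (fixed) input injection

def f(z, x):
    return np.tanh(W @ z + U @ x)

z = np.zeros(8)
for _ in range(100):                    # naive fixed-point iteration
    z_next = f(z, x)
    if np.linalg.norm(z_next - z) < 1e-8:
        break                           # converged to the equilibrium
    z = z_next

# z now satisfies z ~ f(z, x): the "infinite-depth" equilibrium output
print(np.allclose(z, f(z, x), atol=1e-6))
```

Real deep equilibrium models solve for this fixed point with root-finding and differentiate through it implicitly, but the equilibrium condition is the same.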
Integrate implicit knowledge and explicit knowledge:
Sparse representation [1, 23] uses exemplars, a predefined over-complete dictionary, or a learned dictionary to perform modeling.
Memory networks [22, 12] rely on combining various forms of embedding to form memory, enabling the memory to be dynamically added to or changed.
Object Detection
Instance Segmentation
Panoptic Segmentation
Keypoint Detection
Image Captioning
The tasks differ mainly in their head architectures, e.g., for object detection vs. classification.
DETR, Kernel Selection (Non-local)
Scaled-YOLOv4
=> Explicit Knowledge = Observation
"A good, single representation should be able to locate a corresponding projection in the manifold space it belongs to."
Manifold learning aims to reduce high-dimensional data to a lower-dimensional manifold space.
Learn a representation so that one model can learn different tasks, e.g., pose estimation and classification.
=> PCA & SVM.
"It can be problematic to deal with kernel space misalignment with multi-task and multi-head neural networks".
Perform operations (addition and multiplication) on the output features of the explicit model.
Kernel Alignment:
Translated
Rotated
Scaled
Offset refinement.
Introducing addition lets the neural network predict an offset for the center coordinate, which is then used to refine the final predictions.
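A minimal sketch of this addition-based refinement, with made-up box values and an assumed learned implicit vector `z_add`:

```python
# Addition-based offset refinement sketch: an implicit vector is added
# to the explicit (cx, cy, w, h) predictions, shifting the box centers.
import numpy as np

preds = np.array([[0.48, 0.52, 0.30, 0.40],   # (cx, cy, w, h) per box
                  [0.10, 0.90, 0.05, 0.08]])
z_add = np.array([0.02, -0.01, 0.0, 0.0])     # learned implicit offset (assumed)

refined = preds + z_add    # broadcast addition refines every prediction
print(refined[0, :2])      # center shifted to [0.5, 0.51]
```

In YOLOR the offset is learned jointly with the network; here it is a fixed vector purely for illustration.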
Anchor refinement.
Introducing multiplication allows for anchor refinement.
Anchor boxes let one grid cell detect multiple objects and help handle overlapping objects.
In practice, this yields a more robust model by, in effect, automatically searching over the anchor hyper-parameter set.
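A minimal sketch of multiplication-based anchor refinement, with illustrative anchor priors and an assumed log-space implicit scaling vector:

```python
# Multiplicative anchor refinement sketch: implicit factors exp(z)
# rescale anchor width/height, effectively re-tuning the anchor priors.
import numpy as np

anchors = np.array([[10.0, 13.0],
                    [33.0, 23.0]])    # (w, h) anchor priors
z_mul = np.array([0.1, -0.2])         # learned implicit scaling, log-space (assumed)

refined = anchors * np.exp(z_mul)     # per-dimension multiplicative refinement
print(refined)
```

Using `exp` keeps the scale factors positive, so a refined anchor can grow or shrink but never flip sign.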
Feature selection.
The figure shows how dot multiplication and concatenation can be applied to achieve multi-task feature selection and to set preconditions for subsequent calculations.
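A toy sketch of the concatenation variant, with assumed feature and task-code dimensions:

```python
# Concatenation-based multi-task feature selection sketch: an implicit
# task code is appended to shared features so each head can condition
# on (or gate) its own slice.
import numpy as np

feat = np.ones((2, 16))                    # shared explicit features (N, C)
z_task = np.full((2, 4), 0.5)              # implicit task code, tiled per sample

selected = np.concatenate([feat, z_task], axis=1)  # (N, C + C_z)
print(selected.shape)  # (2, 20)
```

The dot-multiplication variant instead gates the channels directly, `feat * z`, which selects features by scaling rather than by appending a condition code.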
Minimize the error term so that the prediction is as close to the target as possible.
Relax the error term to find a solution for each task.
Model the error term itself to find solutions for all tasks at once.
Combine with: g(explicit(x), implicit(z))
Using addition or multiplication:
f(x): Observation = Explicit
g(z): Latent Code = Representation of Compressed Data = Implicit Knowledge
Can be modeled as: a vector, a neural network, or a matrix factorization.
Normally trained using back-propagation.
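The three modeling choices above can be sketched in a few lines; all shapes and initializations are illustrative assumptions, not the paper's actual code:

```python
# Three ways to model implicit knowledge g(z):
# a plain vector, a small neural network, and a matrix factorization.
import numpy as np

rng = np.random.default_rng(0)
d = 8

# 1) vector: g(z) = z, the implicit term is learned directly
z_vec = rng.normal(size=d)

# 2) neural network: g(z) = W z, a learned map of a smaller latent code
z_latent = rng.normal(size=4)
W = rng.normal(size=(d, 4))
z_nn = W @ z_latent

# 3) matrix factorization: g(z) = Z^T c, basis vectors mixed by coefficients
Z = rng.normal(size=(3, d))   # 3 learned basis vectors
c = rng.normal(size=3)        # learned mixing coefficients
z_mf = Z.T @ c

# each variant yields a d-dim implicit term combined with the explicit
# output f(x), e.g. by addition:
fx = rng.normal(size=d)
y = fx + z_vec
print(z_vec.shape, z_nn.shape, z_mf.shape)
```

The vector is cheapest; the neural network and factorization trade a few extra parameters for a richer implicit representation, which matches the paper's claim of improvement at very small additional cost.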
Feature Alignment on FPN
Prediction Refinement
Multi-task learning in a single model
Object Detection
Multi-level image classification
Feature Embedding
Analyzer looks like the backbone.
Explicit Knowledge is learned from the shallow layers of the Analyzer.
Implicit Knowledge is learned from other auxiliary tasks (possibly in deeper layers; it does not belong to the features extracted from the input).
Explicit Knowledge and Implicit Knowledge act as Query and Key to select the Value, as in a Transformer.
The Selector is similar to a Selective Kernel network: it selectively collects information.