[DETR] End-to-End Object Detection with Transformers
Figure. DETR pipeline.
Given an image, a CNN extracts features, which are flattened into a feature sequence and fed into the transformer encoder-decoder; the model directly outputs an unordered set of fixed size N, each element containing an object class and box coordinates.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encodes our prior knowledge about the task.
DEtection TRansformer or DETR.
Transform the object detection problem into an unordered set prediction problem.
A set-based global loss that forces unique predictions via bipartite matching,
A transformer encoder-decoder architecture.
Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel.
The basic set prediction task is multilabel classification, for which the baseline approach, one-vs-rest, does not apply to problems such as detection where there is an underlying structure between elements (i.e., near-identical boxes).
Difficulty:
Avoid near-duplicates. => Need global inference schemes that model interactions between all predicted elements to avoid redundancy.
For constant-size set prediction, dense fully connected networks [9] are sufficient but costly. => Use auto-regressive sequence models such as RNN.
The loss function should be invariant by a permutation of the predictions. => A loss based on the Hungarian algorithm [20], to find a bipartite matching between ground truth and prediction.
Use transformers with parallel decoding.
[9,25,35] used the bipartite matching loss. However, in these early deep learning models, the relation between different predictions was modeled with convolutional or fully-connected layers only and a hand-designed NMS post-processing can improve their performance.
Non-unique assignment rules [37,23,53], Learnable NMS methods [16,4], Relation networks [17]. These methods employ additional hand-crafted context features like proposal box coordinates to model relations between detections efficiently, while we look for solutions that reduce the prior knowledge encoded in the model.
Object detection [43] and instance segmentation [41,30,36,42] use bipartite-matching losses with encoder-decoder architectures based on CNN activations to directly produce a set of bounding boxes.
They do not leverage the recent transformers with parallel decoding.
Image features from the CNN backbone are passed through the transformer encoder, together with the spatial positional encoding that is added to queries and keys at every multi-head self-attention layer.
Then, the decoder receives queries (initially set to zero), the output positional encoding (object queries), and the encoder memory, and produces the final set of predicted class labels and bounding boxes through multiple multi-head self-attention and decoder-encoder attention layers.
The first self-attention layer in the first decoder layer can be skipped.
Input a (b,3,800,1200) image into ResNet-50 for feature extraction; the last-stage output has shape (b,2048,25,38).
A 1x1 convolution reduces the channel dimension, giving (b,256,25,38).
Compute the spatial positional encoding with sine-cosine functions.
Add the image features and the positional encoding and feed them into the encoder; the encoded output has the same shape.
Initialize the decoder output embeddings to all zeros, shape (100,b,256). Each decoder layer combines them with the positional encoding and the learned query_embed (object queries) and attends to the encoder memory; the next decoder layer takes the previous layer's output and again combines it with the positional encoding and query_embed. Stacking the outputs of the 6 decoder layers gives shape (6,b,100,256).
Feed the last decoder layer's output into the classification and regression heads, producing 100 unordered predictions.
Post-processing of the 100 predictions mainly extracts the foreground classes and the corresponding bbox coordinates, multiplying by (800,1200) to map back to input-image coordinates.
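The walkthrough above can be traced with a minimal, self-contained shape-check sketch. This is not the official implementation: nn.Transformer is used as a stand-in for DETR's custom transformer (which adds the positional encodings inside every attention layer rather than to the inputs), and the positional encoding below is just a random placeholder.

import torch
import torch.nn as nn
import torchvision

b = 2
x = torch.randn(b, 3, 800, 1200)                             # batch of input images

backbone = torchvision.models.resnet50()
features = nn.Sequential(*list(backbone.children())[:-2])    # drop avgpool + fc
f = features(x)                                              # (b, 2048, 25, 38), stride 32

proj = nn.Conv2d(2048, 256, kernel_size=1)                   # 1x1 conv channel reduction
f = proj(f)                                                  # (b, 256, 25, 38)

h, w = f.shape[-2:]
src = f.flatten(2).permute(2, 0, 1)                          # (950, b, 256) feature sequence
pos = torch.randn(h * w, 1, 256)                             # placeholder for the sine-cosine encoding

query_embed = nn.Embedding(100, 256)                         # learned object queries
tgt = torch.zeros(100, b, 256)                               # decoder input embeddings, all zeros

transformer = nn.Transformer(d_model=256, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6)
memory = transformer.encoder(src + pos)                      # (950, b, 256)
hs = transformer.decoder(tgt + query_embed.weight.unsqueeze(1), memory)   # (100, b, 256)

class_head = nn.Linear(256, 92)                              # 91 COCO classes + "no object"
bbox_head = nn.Sequential(nn.Linear(256, 256), nn.ReLU(),
                          nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 4))
logits = class_head(hs).transpose(0, 1)                      # (b, 100, 92)
boxes = bbox_head(hs).sigmoid().transpose(0, 1)              # (b, 100, 4), normalized cxcywh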
A CNN Backbone: extract a compact feature representation.
An encoder-decoder Transformer.
Feed-Forward Network (FFN) makes the final detection prediction.
# main.py
from models import build_model
model, criterion, postprocessors = build_model(args)

# models/detr.py
def build(args):
    backbone = build_backbone(args)        # CNN backbone + spatial positional encoding
    transformer = build_transformer(args)  # encoder-decoder transformer
    model = DETR(backbone, transformer, num_classes=num_classes,
                 num_queries=args.num_queries, aux_loss=args.aux_loss)
    matcher = build_matcher(args)          # Hungarian (bipartite) matcher
    criterion = SetCriterion(num_classes, matcher=matcher, weight_dict=weight_dict,
                             eos_coef=args.eos_coef, losses=losses)
    postprocessors = {'bbox': PostProcess()}   # maps normalized boxes back to image coordinates
    return model, criterion, postprocessors
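A hedged sketch of how the three returned objects are typically used, loosely following the training/evaluation loop of the official repo; model, samples, and targets are assumed to come from the build step above and the COCO dataloader.

import torch

outputs = model(samples)                    # {'pred_logits': (b, 100, num_classes+1), 'pred_boxes': (b, 100, 4)}
loss_dict = criterion(outputs, targets)     # Hungarian matching + classification / L1 / GIoU losses
weight_dict = criterion.weight_dict
losses = sum(loss_dict[k] * weight_dict[k] for k in loss_dict.keys() if k in weight_dict)

# at inference time: rescale the normalized boxes back to absolute image coordinates
orig_target_sizes = torch.stack([t["orig_size"] for t in targets], dim=0)
results = postprocessors['bbox'](outputs, orig_target_sizes)   # list of {'scores', 'labels', 'boxes'} per image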
ResNet50, ViT...
Last stage, stride = 32.
All BatchNorm layers in the ResNet are frozen, i.e., the stored global mean and variance are used.
The stem and the first stage of the ResNet do not update their parameters, i.e., parameter.requires_grad_(False).
The backbone uses a smaller learning rate than the transformer: lr_backbone = 1e-05, while the rest uses 0.0001.
Spatial Positional Encoding (selected by the position_embedding argument):
'v2' / 'sine': sinusoidal (the default).
'v3' / 'learned': learned embedding.
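A simplified sketch of the 2D sine-cosine variant, in the spirit of the repo's PositionEmbeddingSine but ignoring padding masks and the optional normalization.

import torch

def sine_position_encoding(h, w, num_pos_feats=128, temperature=10000):
    y = torch.arange(1, h + 1, dtype=torch.float32).view(h, 1).expand(h, w)   # row index per pixel
    x = torch.arange(1, w + 1, dtype=torch.float32).view(1, w).expand(h, w)   # column index per pixel
    dim_t = temperature ** (2 * (torch.arange(num_pos_feats) // 2) / num_pos_feats)
    pos_x, pos_y = x[..., None] / dim_t, y[..., None] / dim_t                 # (h, w, 128) each
    pos_x = torch.stack((pos_x[..., 0::2].sin(), pos_x[..., 1::2].cos()), dim=3).flatten(2)
    pos_y = torch.stack((pos_y[..., 0::2].sin(), pos_y[..., 1::2].cos()), dim=3).flatten(2)
    return torch.cat((pos_y, pos_x), dim=2).permute(2, 0, 1)                  # (256, h, w)

pos = sine_position_encoding(25, 38)    # matches the (256, 25, 38) projected feature map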
QKV:
6 encoder layers in total.
The positional encoding vector is added only to Q and K; no position information is added to V (see the sketch below).
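A minimal sketch of that pattern, paraphrasing the post-norm encoder layer in models/transformer.py (dropout and padding masks omitted; EncoderLayerSketch is an illustrative name).

import torch
from torch import nn

class EncoderLayerSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, dim_feedforward=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.linear1, self.linear2 = nn.Linear(d_model, dim_feedforward), nn.Linear(dim_feedforward, d_model)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, src, pos):
        q = k = src + pos                                             # positional encoding added to Q and K only
        src = self.norm1(src + self.self_attn(q, k, value=src)[0])   # V stays the plain features
        return self.norm2(src + self.linear2(torch.relu(self.linear1(src))))

layer = EncoderLayerSketch()
out = layer(torch.randn(950, 2, 256), torch.randn(950, 1, 256))       # (950, 2, 256)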
Object Query:
Can simply be thought of as a learned output positional encoding.
During training, the object queries learn the relationship between target objects and the global image context, acting as a form of global attention.
For example, next to a table in a room (category A) there is usually a chair (category B) rather than some unrelated object (category C), so this global context helps the decoder predict better outputs at inference time.
Positional Encoding:
The same sine-cosine positional encoding vector as in the encoder is used.
QKV:
6 decoder layers in total.
Positional encoding is not added to V (see the sketch below).
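A minimal runnable sketch of the two decoder attentions; the shapes and variable names are illustrative, and the real layer additionally applies dropout, LayerNorm, and an FFN.

import torch
from torch import nn

d_model, nhead, N, HW, b = 256, 8, 100, 950, 2
self_attn = nn.MultiheadAttention(d_model, nhead)
cross_attn = nn.MultiheadAttention(d_model, nhead)

tgt = torch.zeros(N, b, d_model)            # decoder input embeddings, initialized to zero
query_pos = torch.randn(N, 1, d_model)      # learned object queries (output positional encoding)
memory = torch.randn(HW, b, d_model)        # encoder output
pos = torch.randn(HW, 1, d_model)           # spatial sine-cosine positional encoding

q = k = tgt + query_pos                                       # self-attention over the N object slots
tgt = tgt + self_attn(q, k, value=tgt)[0]
tgt = tgt + cross_attn(query=tgt + query_pos,                 # cross-attention into the encoder memory
                       key=memory + pos, value=memory)[0]     # position added to Q/K only, not to V
print(tgt.shape)                                              # torch.Size([100, 2, 256])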
Output:
N unordered boxes in parallel at one time.
Prediction feed-forward networks (FFNs):
The final prediction is computed by a 3-layer perceptron with ReLU activation function and hidden dimension d, and a linear projection layer.
The FFN predicts the normalized center coordinates, height and width of the box w.r.t. the input image, and the linear layer predicts the class label using a softmax function.
Since we predict a fixed-size set of N bounding boxes, where N is usually much larger than the actual number of objects of interest in an image, an additional special class label ∅ is used to represent that no object is detected within a slot. This class plays a similar role to the “background” class in the standard object detection approaches.
All losses are normalized by the number of objects inside the batch.
Box loss: a linear combination of the ℓ1 loss and the generalized IoU loss, L_box(b_i, b̂_σ(i)) = λ_iou · L_iou(b_i, b̂_σ(i)) + λ_L1 · ‖b_i − b̂_σ(i)‖_1, where λ_iou, λ_L1 ∈ ℝ are hyperparameters and L_iou(·) is the generalized IoU.
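A small sketch of this box loss using torchvision's box utilities; the boxes are assumed to be already-matched prediction/ground-truth pairs in normalized (cx, cy, w, h) format.

import torch
import torch.nn.functional as F
from torchvision.ops import box_convert, generalized_box_iou

lambda_l1, lambda_iou = 5.0, 2.0
src_boxes = torch.rand(7, 4)            # matched predicted boxes
tgt_boxes = torch.rand(7, 4)            # matched ground-truth boxes
num_boxes = src_boxes.shape[0]

loss_l1 = F.l1_loss(src_boxes, tgt_boxes, reduction='none').sum()
giou = torch.diag(generalized_box_iou(box_convert(src_boxes, 'cxcywh', 'xyxy'),
                                      box_convert(tgt_boxes, 'cxcywh', 'xyxy')))
loss_giou = (1 - giou).sum()
loss_box = (lambda_l1 * loss_l1 + lambda_iou * loss_giou) / num_boxes    # normalized by #objects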
The DICE coefficient is closely related to the Intersection over Union. Denoting by m̂ the raw mask logits prediction of the model and by m the binary target mask, the loss is defined as L_DICE(m, m̂) = 1 − (2·m·σ(m̂) + 1) / (σ(m̂) + m + 1), where σ is the sigmoid function. This loss is normalized by the number of objects.
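A minimal sketch of this soft DICE loss on raw mask logits (shapes are illustrative).

import torch

def dice_loss(mask_logits, target_mask, num_objects):
    prob = mask_logits.sigmoid().flatten(1)           # sigma(m_hat), one row per object
    target = target_mask.flatten(1)
    numerator = 2 * (prob * target).sum(-1)
    denominator = prob.sum(-1) + target.sum(-1)
    loss = 1 - (numerator + 1) / (denominator + 1)
    return loss.sum() / num_objects                   # normalized by the number of objects

loss = dice_loss(torch.randn(3, 64, 64), torch.randint(0, 2, (3, 64, 64)).float(), num_objects=3)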
The Hungarian algorithm is used to compute the bipartite matching between predictions and ground-truth objects.
import torch
from scipy.optimize import linear_sum_assignment
from util.box_ops import box_cxcywh_to_xyxy, generalized_box_iou
class HungarianMatcher(torch.nn.Module):
    def __init__(self, cost_class: float = 1, cost_bbox: float = 1, cost_giou: float = 1):
        super().__init__()
        self.cost_class = cost_class  # 1
        self.cost_bbox = cost_bbox  # 5
        self.cost_giou = cost_giou  # 2
    @torch.no_grad()
    def forward(self, outputs, targets):
        bs, num_queries = outputs["pred_logits"].shape[:2]
        out_prob = outputs["pred_logits"].flatten(0, 1).softmax(-1)  # [bs*num_queries, num_classes+1]
        out_bbox = outputs["pred_boxes"].flatten(0, 1)  # [bs*num_queries, 4]
        tgt_ids = torch.cat([v["labels"] for v in targets])
        tgt_bbox = torch.cat([v["boxes"] for v in targets])
        cost_class = -out_prob[:, tgt_ids]  # classification cost: -p(GT class)
        cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)  # pairwise L1 box cost
        cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
        C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou
        C = C.view(bs, num_queries, -1).cpu()  # torch.Size([2 batch, 100 predicted objects, 21 GT objects])
        sizes = [len(v["boxes"]) for v in targets]  # [img1 num_box, img2 num_box]
        indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
        return [(torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64)) for i, j in indices]
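The returned indices hold one (prediction index, ground-truth index) pair of tensors per image: each matched query slot is supervised with the class and box of its assigned ground-truth object, while every unmatched slot is trained to predict the no-object class ∅.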
AdamW [26] with improved weight decay handling is used; the weight decay is set to 1e-4.
We also apply gradient clipping, with a maximal gradient norm of 0.1.
ImageNet pre-trained backbone ResNet-50 is imported from Torchvision, discarding the last classification layer.
Batch normalization weights and statistics are frozen during training.
The backbone is fine-tuned with a learning rate of 1e-5.
The transformer is trained with a learning rate of 1e-4.
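A sketch of this optimizer setup, following the parameter-group split used in the repo's main.py; model is the DETR model from above, and the loop snippet is illustrative.

import torch

param_dicts = [
    {"params": [p for n, p in model.named_parameters()
                if "backbone" not in n and p.requires_grad]},              # transformer: lr 1e-4
    {"params": [p for n, p in model.named_parameters()
                if "backbone" in n and p.requires_grad], "lr": 1e-5},      # backbone: lr 1e-5
]
optimizer = torch.optim.AdamW(param_dicts, lr=1e-4, weight_decay=1e-4)

# in the training loop, after losses.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)           # max gradient norm 0.1
optimizer.step()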
Additive dropout of 0.1 is applied after every multi-head attention and FFN before layer normalization.
The weights are randomly initialized with Xavier initialization.
Use a linear combination of ℓ1 and GIoU losses for bounding box regression, with weights λ_L1 = 5 and λ_iou = 2 respectively.
All models were trained with N = 100 decoder query slots.
DETR can be naturally extended by adding a mask head on top of the decoder outputs.
Add a mask head that predicts a binary mask for each of the predicted boxes.
It takes as input the output of the transformer decoder for each object and computes multi-head (with M heads) attention scores of this embedding over the output of the encoder, generating M attention heatmaps per object at a small resolution.
To make the final prediction and increase the resolution, an FPN-like architecture is used.
The final resolution of the masks has stride 4 and each mask is supervised independently using the DICE/F-1 loss [28] and Focal loss.
To predict the final panoptic segmentation we simply use an argmax over the mask scores at each pixel, and assign the corresponding categories to the resulting masks.
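A minimal sketch of that pixel-wise argmax merge, assuming low-confidence and empty masks have already been filtered out (shapes are illustrative).

import torch

num_kept, H, W = 40, 200, 300
mask_logits = torch.randn(num_kept, H, W)        # one mask score map per kept prediction
labels = torch.randint(0, 250, (num_kept,))      # predicted category id per kept prediction

owner = mask_logits.argmax(dim=0)                # (H, W): which prediction owns each pixel
panoptic_categories = labels[owner]              # (H, W): category id assigned to every pixel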
COCO: against Faster R-CNN.
Without encoder layers, overall AP drops by 3.9 points, with a more significant drop of 6.0 AP on large objects.
Hypothesis: By using global scene reasoning, the encoder is important for disentangling objects.
From Figure 3: the encoder seems to separate instances already, which likely simplifies object extraction and localization for the decoder.
Auxiliary losses are applied after each decoding layer, hence, the prediction FFNs are trained by design to predict objects out of the outputs of every decoder layer. (Fig. 4).
With its set-based loss, DETR does not need NMS by design.
A single decoding layer of the transformer is not able to compute any cross-correlations between the output elements, and thus it is prone to making multiple predictions for the same object.
In the second and subsequent layers, the self-attention mechanism over the activations allows the model to inhibit duplicate predictions. We observe that the improvement brought by NMS diminishes as depth increases.
At the last layers, we observe a small loss in AP as NMS incorrectly removes true positive predictions.
Fig. 6 visualizes decoder attention, coloring attention maps for each predicted object in different colors.
We observe that decoder attention is fairly local, meaning that it mostly attends to object extremities such as heads or legs.
We hypothesize that after the encoder has separated instances via global attention, the decoder only needs to attend to the extremities to extract the class and object boundaries.
To evaluate the importance of different components of the matching cost and the loss, we train several models turning them on and off.
There are three components to the loss: classification loss, bounding box distance loss, and GIoU [38] loss.
The classification loss is essential for training and cannot be turned off, so we train a model without bounding box distance loss, and a model without the GIoU loss, and compare with baseline, trained with all three losses.