ClusDet [32] unifies object clustering and detection in an end-to-end framework by sequentially finding clustered areas and detecting objects in these areas.
Zhang et al. [33] proposed a difficult region estimation network to find difficult high-density areas for further detection. Aiming to address vehicle detection challenges caused by the diversity of drone-captured images, AdNet [34] aligns features across different viewpoints, illumination, weather, and backgrounds, following the idea of domain adaptation.
As discussed in ClusDet [32], tiny objects and unevenly distributed objects severely hinder the performance of detection models.
GLSAN [35] adds an efficient self-adaptive region-selecting algorithm to the global–local detection network, finding high-density areas and accurately detecting objects with large size variation.
DMNet [36] proposes a novel crop strategy guided by a density map, removing areas without objects and balancing foreground and background information.
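For illustration, a minimal sketch of density-guided cropping in Python (not DMNet's exact pipeline: the fixed threshold, the connected-component grouping, and the `min_area` filter are simplifying assumptions):

```python
import numpy as np
from scipy import ndimage

def density_guided_crops(density_map, thresh=0.05, min_area=64):
    """Turn a predicted density map into crop regions: pixels above
    `thresh` are treated as foreground, connected blobs become crops,
    and object-free areas are discarded before fine detection."""
    mask = density_map > thresh
    labels, _ = ndimage.label(mask)            # group foreground pixels into blobs
    crops = []
    for sl in ndimage.find_objects(labels):    # bounding slices, one per blob
        h, w = sl[0].stop - sl[0].start, sl[1].stop - sl[1].start
        if h * w >= min_area:                  # drop tiny spurious blobs
            crops.append((sl[0].start, sl[1].start, sl[0].stop, sl[1].stop))
    return crops

# Toy usage: a synthetic density map with two dense regions.
dm = np.zeros((256, 256))
dm[20:60, 30:90] = 0.3
dm[150:220, 140:200] = 0.4
print(density_guided_crops(dm))   # two crop boxes, background ignored
```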
Yu et al. [37] analyzed the detection results of DMNet [36] and found that it suffers a clear performance degradation in long-tail scenes. They therefore designed DSHNet [37] to handle head and tail classes separately by combining class-biased samplers and bilateral box heads.
MDCT [38] designs a multi-kernel dilated convolution (MDC) block and a transformer block to identify small objects in dense scenes.
Gallo et al. [39] utilized the YOLOv7 model to address the challenges posed by unstructured crop conditions and the high biological variation of weeds.
RAANet [40] constructs a new residual atrous spatial pyramid pooling (ASPP) module by embedding an attention module and a residual structure into ASPP, dealing with the variability and complex backgrounds of land use in high-resolution imagery.
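A rough sketch of this general idea follows (the dilation rates and the squeeze-and-excitation-style gate are our assumptions, not RAANet's exact design):

```python
import torch
import torch.nn as nn

class ResidualASPP(nn.Module):
    """Parallel atrous convolutions at several rates, fused by a 1x1
    conv, gated by channel attention, and added back residually."""
    def __init__(self, ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in rates
        )
        self.fuse = nn.Conv2d(ch * len(rates), ch, 1)
        self.att = nn.Sequential(                 # channel-attention gate
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid()
        )

    def forward(self, x):
        y = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        return x + y * self.att(y)                # residual shortcut

x = torch.randn(1, 64, 32, 32)
print(ResidualASPP(64)(x).shape)                  # torch.Size([1, 64, 32, 32])
```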
HawkNet [41] proposes an up-scale feature aggregation framework to fully utilize multi-scale complementary information.
CDMNet [42] formulates density maps in a coarse-grained form and designs a lightweight dual-task density estimation network.
FiFoNet [43] effectively selects a combination of multi-scale features for an object and blocks background interference, further enhancing the discriminability of the multi-scale feature representation.
TPH-YOLOv5 [20] combines a transformer-based prediction head with the YOLOv5 detection model, achieving significant performance improvements in scenes with large size variation and high object density.
UAV-Net [44] comprehensively analyzes the influence of different backbone architectures, prediction heads, and model pruning methods, and constructs a better combination to realize fast object detection.
GDFNet [45] uses a global density model to jointly extract density information from multi-level pyramid features, making it faster than most models based on pyramid feature fusion architectures.
RHFNet [46] utilizes a bidirectional fusion architecture to fully exploit multi-layer features, efficiently realizing small object detection.
HSD [47] proposes a novel reg-offset-cls module and a stacking strategy, achieving both precision and speed.
By integrating specialized feature extraction and information fusion techniques, SODNet [48] effectively improves small object detection while maintaining high real-time performance.
Dividing a high-resolution input image into a number of chips still incurs a high computational cost, so UFPMP-Det [49] merges the sub-regions given by a coarse detector into a mosaic for a single inference, further improving detection efficiency.
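A toy sketch of the mosaic idea (simple shelf packing on a fixed canvas; UFPMP-Det's actual packing is more sophisticated and these sizes are assumptions):

```python
import numpy as np

def pack_mosaic(crops, canvas_hw=(640, 640)):
    """Place sub-region crops side by side on one canvas so the
    fine-stage detector runs a single inference over all of them."""
    H, W = canvas_hw
    canvas = np.zeros((H, W, 3), dtype=np.uint8)
    placements = []                    # (y, x) offset of each crop
    x = y = shelf_h = 0
    for c in crops:
        h, w = c.shape[:2]
        if x + w > W:                  # start a new shelf
            x, y = 0, y + shelf_h
            shelf_h = 0
        if y + h > H:
            break                      # canvas full; real systems resize or repack
        canvas[y:y + h, x:x + w] = c
        placements.append((y, x))
        x += w
        shelf_h = max(shelf_h, h)
    return canvas, placements

crops = [np.full((100, 150, 3), i * 40, np.uint8) for i in range(1, 5)]
mosaic, offsets = pack_mosaic(crops)
print(offsets)                         # where each sub-region landed
```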
Swin Transformer [50] uses a shifted windowing scheme to realize efficient Transformer computation, limiting self-attention to non-overlapping local windows while still allowing cross-window connections.
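A minimal sketch of the partitioning mechanics (the attention computation itself and the masking Swin applies to rolled windows are omitted):

```python
import torch

def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into non-overlapping
    ws x ws windows; self-attention runs inside each window."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

x = torch.randn(1, 8, 8, 32)          # toy feature map
local = window_partition(x, ws=4)     # regular windows
# Shifted windows: roll the map so the next block's windows straddle
# the previous block's window boundaries, creating cross-window links.
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))
cross = window_partition(shifted, ws=4)
print(local.shape, cross.shape)       # (4, 16, 32) each
```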
To further expand the cross-window connection, CSWin Transformer [51] proposes a cross-shaped window self-attention mechanism that computes self-attention in horizontal and vertical stripes, promoting connections from a global perspective.
CrossViT [52] proposes a novel Transformer-based module that operates between features with different spatial sizes. CrossViT first applies two ViTs to the two features separately; it then exchanges the class tokens of the two features and computes cross-attention between them.
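A minimal sketch of the token-exchange step, assuming both branches share one embedding dimension (CrossViT actually projects between branch dimensions):

```python
import torch
import torch.nn as nn

class CrossAttnToken(nn.Module):
    """One branch's CLS token attends over the other branch's
    patch tokens, the core of CrossViT-style token fusion."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cls_a, tokens_b):
        # Query: CLS from branch A; key/value: patches from branch B.
        fused, _ = self.attn(cls_a, tokens_b, tokens_b)
        return fused

dim = 64
small = torch.randn(2, 1 + 16, dim)   # CLS + 16 patch tokens
large = torch.randn(2, 1 + 4, dim)    # CLS + 4 patch tokens
xattn = CrossAttnToken(dim)
# Exchange: each branch's CLS token queries the other branch's patches.
cls_small = xattn(small[:, :1], large[:, 1:])
cls_large = xattn(large[:, :1], small[:, 1:])
print(cls_small.shape, cls_large.shape)   # (2, 1, 64) each
```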
V2X-ViT [53] proposes a novel vision Transformer for V2X communication to achieve accurate 3D object detection.
CoBEVT [54] designs a fused axial attention (FAX) module to realize bird's-eye-view semantic segmentation.
MaxViT [55] consists of two components: blocked local attention and dilated global attention, which together allow global–local spatial interaction on arbitrary input resolutions with only linear complexity.
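A minimal sketch of the two partitions on toy shapes (MaxViT additionally interleaves these attention layers with MBConv blocks, which we omit):

```python
import torch

def block_partition(x, p):
    """Group pixels into p x p local blocks (block attention)."""
    B, H, W, C = x.shape
    x = x.view(B, H // p, p, W // p, p, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, C)

def grid_partition(x, g):
    """Group pixels into a dilated g x g grid (grid attention):
    each group samples pixels strided across the whole map."""
    B, H, W, C = x.shape
    x = x.view(B, g, H // g, g, W // g, C)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, g * g, C)

x = torch.randn(1, 8, 8, 32)
print(block_partition(x, 4).shape)  # (4, 16, 32): local neighbours
print(grid_partition(x, 4).shape)   # (4, 16, 32): global, strided
```

The Transformer is also widely used in image object detection tasks.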
DETR [56] proposes an end-to-end architecture for object detection by casting the task as a direct set prediction problem. However, in DETR, each object query does not focus on a specific region.
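For intuition, a toy sketch of the bipartite matching behind set prediction (the cost matrix below is a placeholder, not DETR's actual class-probability plus box cost):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Set prediction requires a one-to-one assignment between a fixed set
# of predictions (object queries) and the ground-truth objects; DETR
# finds it by Hungarian matching on a pairwise matching cost.
cost = np.array([[0.2, 0.9, 0.8],    # 4 queries x 3 ground-truth objects
                 [0.7, 0.1, 0.6],
                 [0.5, 0.8, 0.3],
                 [0.9, 0.9, 0.9]])

rows, cols = linear_sum_assignment(cost)   # optimal one-to-one matching
for q, t in zip(rows, cols):
    print(f"query {q} -> object {t}")
# The remaining unmatched query is supervised to predict 'no object'.
```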
Anchor DETR [57] proposes a query design and an attention variant that make each object query focus on the objects near its anchor point.
YOLOS [58] proposes a series of Transformer-based object detection models.