[MST] MST: Adaptive Multi-Scale Tokens Guided Interactive Segmentation
L. Xu, S. Li, Y. Chen, and J. Luo
{Multi-Scale token adaptation, token similarity, contrastive loss, discriminant}
Interactive segmentation has gained significant attention for its application in human-computer interaction and data annotation.
Novel multi-scale token adaptation ==> Address the target scale variation issue in interactive segmentation.
A token learning algorithm based on contrastive loss ==> Enhance the robustness of multi-scale token selection.
The interactive image segmentation algorithms [FocalClick, SimpleClick, RITM] face performance bottlenecks due to the target scale variation issue.
For example, in remote sensing segmentation tasks [4], targets like green land and water systems are typically much larger than buildings and roads, making them difficult to handle with single-scale features.
To address this issue:
SimpleClick [2] used the SimpleFPN proposed by He et al. [5] to capture the scale variations of the ViT [6] features, which improved the segmentation accuracy ==> CFR-ICL [7], AdaptiveClick [8].
MViT [9] proposed a multi-scale vision transformer model using pooling attention across space-time resolution and channel dimension. ==> Segformer [10].
FocalClick [1] introduced Segformer's four-stage blocks as image encoders for interactive segmentation, achieving better accuracy by utilising the resulting multi-scale features.
RITM [3] and ClickSEG [11] adopted the CNN-based HRNet [12] and ResNet [13] to capture robust multi-scale convolution features, which helped the algorithms perform better in target scale variation scenarios.
==> Although these works can alleviate the target scale variation issue to some extent, they have ignored the effective use of multi-scale features in the input stage.
CrossViT [15] pointed out that smaller patch sizes can bring better performance in vision transformers, but smaller patches also lead to longer token sequences and higher memory requirements.
They used small and large patch sizes to generate tokens and fused these two sizes of tokens via cross-attention.
However, this strategy requires all tokens to participate in the calculation, increasing computation and reducing speed.
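To make the fusion idea concrete, below is a minimal sketch of cross-attention between two token sets of different patch sizes. It is a simplified illustration rather than CrossViT's exact dual-branch design; the class name, dimensions, and token counts are assumptions.

```python
# Simplified cross-scale fusion: queries from the large-patch branch attend to the
# small-patch branch. Illustration only, not CrossViT's exact scheme.
import torch
import torch.nn as nn

class CrossScaleFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, large_tokens, small_tokens):
        # large_tokens: (B, N_large, dim); small_tokens: (B, N_small, dim), N_small >> N_large
        q = self.norm_q(large_tokens)
        kv = self.norm_kv(small_tokens)
        fused, _ = self.attn(q, kv, kv)
        return large_tokens + fused            # residual fusion

# 196 large-patch tokens attend to 784 small-patch tokens.
out = CrossScaleFusion()(torch.randn(2, 196, 256), torch.randn(2, 784, 256))
print(out.shape)                               # torch.Size([2, 196, 256])
```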
Segformer [10] proposed a scaled self-attention algorithm, which down-sampled the key and value into larger-size tokens using a scale ratio.
This algorithm’s self-attention calculation is equivalent to the cross-attention between the base and larger tokens.
However, this strategy can only obtain larger tokens, and cannot produce tokens smaller than base tokens.
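The sketch below illustrates such scaled (spatial-reduction) self-attention in the spirit of Segformer/PVT: keys and values are downsampled by a reduction ratio with a strided convolution before attention, which is equivalent to base tokens cross-attending to coarser, larger tokens. The class name, dimensions, and ratio are assumptions for illustration.

```python
# Scaled self-attention sketch: keys and values are spatially reduced by sr_ratio,
# so base-resolution queries attend to coarser (larger) tokens.
import torch
import torch.nn as nn

class ScaledSelfAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8, sr_ratio=2):
        super().__init__()
        # A strided conv merges sr_ratio x sr_ratio base tokens into one larger token.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):
        # x: (B, H*W, dim) base tokens laid out on an H x W grid
        B, N, C = x.shape
        kv = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.norm(self.sr(kv).flatten(2).transpose(1, 2))   # (B, (H/r)*(W/r), dim)
        out, _ = self.attn(x, kv, kv)                            # queries keep base resolution
        return out

print(ScaledSelfAttention()(torch.randn(2, 28 * 28, 256), 28, 28).shape)  # (2, 784, 256)
```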
SSA [16] adopted two different scale ratios to downsample the base tokens, incorporating more abundant scale information from the input.
However, this strategy still cannot capture smaller tokens of the input.
Additionally, its complex structure introduces substantial computation overhead.
ViT [6] first used 16×16-pixel image patches to limit the token length.
DPT [17] upscaled the low-resolution feature maps of ViT to high-resolution for dense prediction tasks.
==> These algorithms need to calculate all tokens in self-attention, which is computationally intensive in high-resolution input scenarios.
To efficiently select important tokens, STTS [18] proposed a score-based selection algorithm for video processing, built on a differentiable top-k selection algorithm [19].
QuadTree [20] proposed distinguishing important and unimportant tokens to skip unimportant regions and subdivide important areas.
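As a concrete illustration of score-based selection, the sketch below keeps only the top-k tokens according to a learned importance score. It uses a hard top-k for clarity rather than the differentiable relaxation of [19]; the scorer, dimensions, and k are assumptions.

```python
# Score-based token selection with a hard top-k; scorer, dim, and k are illustrative.
import torch
import torch.nn as nn

class TokenSelector(nn.Module):
    def __init__(self, dim=256, k=128):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(dim, 1)                       # importance score per token

    def forward(self, tokens):
        # tokens: (B, N, dim) -> keep the k highest-scoring tokens
        scores = self.scorer(tokens).squeeze(-1)              # (B, N)
        idx = scores.topk(self.k, dim=1).indices              # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, idx)                          # (B, k, dim)

print(TokenSelector()(torch.randn(2, 784, 256)).shape)        # torch.Size([2, 128, 256])
```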
The proposed algorithm contributes the following innovations to interactive segmentation:
A similarity-based multi-scale token interaction algorithm is introduced, improving the performance of fine-grained segmentation.
A token learning algorithm based on contrastive loss is proposed, which enhances the discrimination between positively clicked tokens and background tokens.
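A minimal sketch of one way such a token-level contrastive objective could look, assuming an InfoNCE-style formulation that pulls the embedding of a positively clicked token toward sampled foreground tokens and pushes it away from background tokens; the paper's exact loss, sampling scheme, and temperature may differ.

```python
# Hypothetical InfoNCE-style token contrast between clicked, foreground, and background tokens.
import torch
import torch.nn.functional as F

def token_contrastive_loss(click_tok, fg_tok, bg_tok, tau=0.1):
    """click_tok: (B, D); fg_tok: (B, P, D) positives; bg_tok: (B, Q, D) negatives."""
    click_tok = F.normalize(click_tok, dim=-1)
    fg_tok = F.normalize(fg_tok, dim=-1)
    bg_tok = F.normalize(bg_tok, dim=-1)
    pos = torch.einsum('bd,bpd->bp', click_tok, fg_tok) / tau     # (B, P)
    neg = torch.einsum('bd,bqd->bq', click_tok, bg_tok) / tau     # (B, Q)
    logits = torch.cat([pos, neg], dim=1)                         # each positive vs. all tokens
    return -(pos - torch.logsumexp(logits, dim=1, keepdim=True)).mean()

loss = token_contrastive_loss(torch.randn(2, 256),
                              torch.randn(2, 32, 256),
                              torch.randn(2, 64, 256))
print(loss.item())
```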
Adaptive Patch Embedding denotes the multi-scale token extraction module, which obtains tokens of different patch sizes (8×8, 16×16, and 28×28) via a convolution module with interpolated weights.
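A sketch of how such weight interpolation could work, assuming a single learned 16×16 projection kernel that is bilinearly resized to 8×8 and 28×28 to extract tokens at the other patch sizes; the module name and the exact adaptation scheme used in the paper may differ.

```python
# Multi-scale patch embedding via weight interpolation: one learned 16x16 kernel is
# resized so tokens of all three patch sizes share a single projection. Sketch only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptivePatchEmbed(nn.Module):
    def __init__(self, in_ch=3, dim=768, base_patch=16, scales=(8, 16, 28)):
        super().__init__()
        self.scales = scales
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=base_patch, stride=base_patch)

    def forward(self, x):
        tokens = {}
        for p in self.scales:
            # Interpolate the base kernel to the target patch size, then embed non-overlapping patches.
            w = F.interpolate(self.proj.weight, size=(p, p), mode='bilinear', align_corners=False)
            feat = F.conv2d(x, w, self.proj.bias, stride=p)   # (B, dim, H/p, W/p)
            tokens[p] = feat.flatten(2).transpose(1, 2)       # (B, (H/p)*(W/p), dim)
        return tokens

out = AdaptivePatchEmbed()(torch.randn(1, 3, 448, 448))
print({p: tuple(t.shape) for p, t in out.items()})
# {8: (1, 3136, 768), 16: (1, 784, 768), 28: (1, 256, 768)}
```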
ViT is adopted as an image encoder, using 16 × 16 base tokens (black lines, Fig. 3).
The multi-scale tokens (red lines, Fig. 3) comprise additional 8×8 and 28×28 tokens.
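The following is a hypothetical illustration of similarity-guided interaction between base and multi-scale tokens: each base token gathers only its k most similar multi-scale tokens and fuses them with similarity-softmax weights. The function, k, and fusion rule are assumptions and may differ from the paper's interaction algorithm.

```python
# Hypothetical similarity-guided fusion of base (16x16) tokens with 8x8/28x28 tokens.
import torch
import torch.nn.functional as F

def similarity_guided_fusion(base_tokens, ms_tokens, k=4):
    """base_tokens: (B, N, D) 16x16 tokens; ms_tokens: (B, M, D) 8x8 and 28x28 tokens."""
    sim = F.normalize(base_tokens, dim=-1) @ F.normalize(ms_tokens, dim=-1).transpose(1, 2)  # (B, N, M)
    topk_sim, topk_idx = sim.topk(k, dim=-1)                              # per base token
    weights = topk_sim.softmax(dim=-1)                                    # (B, N, k)
    idx = topk_idx.unsqueeze(-1).expand(-1, -1, -1, ms_tokens.size(-1))   # (B, N, k, D)
    gathered = ms_tokens.unsqueeze(1).expand(-1, base_tokens.size(1), -1, -1).gather(2, idx)
    return base_tokens + (weights.unsqueeze(-1) * gathered).sum(dim=2)

fused = similarity_guided_fusion(torch.randn(1, 784, 256), torch.randn(1, 3136 + 256, 256))
print(fused.shape)   # torch.Size([1, 784, 256])
```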
The multi-scale token fusion (MST) module (Fig. 4) adopts the feature pyramid structure SimpleFPN [5].
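A minimal sketch of a SimpleFPN-style pyramid [5], which turns the single stride-16 ViT feature map into several scales with deconvolution and pooling; the channel sizes and exact layer composition here are simplified assumptions.

```python
# SimpleFPN-style pyramid: one stride-16 feature map expanded to strides 4, 8, 16, 32.
import torch
import torch.nn as nn

class SimpleFPNSketch(nn.Module):
    def __init__(self, dim=768, out_dim=256):
        super().__init__()
        self.to_stride4 = nn.Sequential(                                   # 16 -> 4: two 2x deconvs
            nn.ConvTranspose2d(dim, dim // 2, 2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(dim // 2, out_dim, 2, stride=2))
        self.to_stride8 = nn.ConvTranspose2d(dim, out_dim, 2, stride=2)    # 16 -> 8
        self.to_stride16 = nn.Conv2d(dim, out_dim, 1)                      # keep stride 16
        self.to_stride32 = nn.Sequential(nn.MaxPool2d(2), nn.Conv2d(dim, out_dim, 1))

    def forward(self, x):                                                  # x: (B, dim, H/16, W/16)
        return [self.to_stride4(x), self.to_stride8(x), self.to_stride16(x), self.to_stride32(x)]

feats = SimpleFPNSketch()(torch.randn(1, 768, 28, 28))
print([tuple(f.shape[-2:]) for f in feats])   # [(112, 112), (56, 56), (28, 28), (14, 14)]
```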
For training data generation, random cropping is used as the augmentation strategy to produce 448×448 training samples, and the model is trained end-to-end. For simulating click points, we adopt the iterative learning strategy of RITM [3] and sample positive and negative clicks using the training data generation strategy proposed by Xu et al. [25]. Specifically, the maximum number of clicks during training is set to 24, with a decay probability of 0.8. To obtain better results, a combination of the COCO [26] and LVIS [27] datasets is used to train the proposed algorithm and the compared algorithms. Random flip and resize augmentations are also applied during training to improve generalization. The optimizer is AdamW with β1 = 0.9 and β2 = 0.999. One epoch contains 30000 training samples, and training runs for 230 epochs in total. The initial learning rate is 5 × 10⁻⁶ and is reduced by a factor of 10 at epochs 50 and 70. The proposed algorithm is trained on six Nvidia RTX 3090 GPUs, which takes about 72 hours to complete.
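A sketch of the optimizer and schedule described above (AdamW with β1 = 0.9, β2 = 0.999, initial learning rate 5 × 10⁻⁶, decayed by 10× at epochs 50 and 70 over 230 epochs); the model and data pipeline are placeholders.

```python
# Optimizer and learning-rate schedule as described above; the model and loop body are placeholders.
import torch

model = torch.nn.Linear(8, 1)   # placeholder for the MST segmentation network
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 70], gamma=0.1)

for epoch in range(230):
    # ... iterate over ~30000 samples of 448x448 random crops with flip/resize augmentation,
    # simulating up to 24 clicks per sample with a decay probability of 0.8 ...
    optimizer.step()        # placeholder for the actual forward/backward/update step
    scheduler.step()        # drop the learning rate by 10x at epochs 50 and 70
```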
GrabCut (50 images): Simple dataset with one instance per image.
Berkeley (96 images): Small dataset with 100 instances.
DAVIS (345 images): High-quality segmentation masks extracted from videos.
Pascal VOC (1449 images): Validation set used for evaluation.
SBD (8497 train, 2857 val images): Larger dataset for training and validation.
COCO+LVIS (C+L, 104k images): Combined dataset used exclusively for training, offering extensive data for model learning.
Number of Clicks (NoC), measuring the number of user clicks needed for satisfactory segmentation (a computation sketch is given after this list).
Thresholds: NoC@90 and NoC@95 (more stringent than commonly used NoC@85 and NoC@90) to ensure high-quality segmentation.
Click generation: Method from [12] used for evaluation.
Click limit: 20 clicks per image for efficient assessment.
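A minimal sketch of how the NoC metric above can be computed from a sequence of per-click IoUs, assuming the evaluator stops once the IoU threshold is reached or the 20-click limit is hit.

```python
# NoC@90 from per-click IoUs: count clicks until the IoU threshold is reached, capped at 20.
def noc(per_click_ious, iou_thresh=0.90, max_clicks=20):
    """per_click_ious[i] is the IoU obtained after click i+1 for one image."""
    for i, iou in enumerate(per_click_ious[:max_clicks], start=1):
        if iou >= iou_thresh:
            return i
    return max_clicks        # threshold never met within the click limit

print(noc([0.62, 0.81, 0.93]))   # 3 (NoC@90 reached on the third click)
```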
The computational complexity of the transformer is quadratically dependent on token length.
Multi-scale Vision Transformers: CrossViT, Segformer, SSA, DPT, STTS, QuadTree.