[BPR] Look Closer to Segment Better: Boundary Patch Refinement for Instance Segmentation
{Patch Refinement; Extra Refinement Network; Concatenate Mask and Image Patches}
Journal: https://link.springer.com/content/pdf/10.1007/s11263-022-01662-0.pdf
1) Motivation, Objectives and Related Works:
Motivation:
Tremendous efforts have been made on instance segmentation but the mask quality is still not satisfactory.
The boundaries of predicted instance masks are usually imprecise due to:
The low spatial resolution of feature maps.
The imbalance problem, caused by the extremely low proportion of boundary pixels.
Objectives:
Propose BPR (Boundary Patch Refinement), a conceptually simple yet effective post-processing refinement framework to improve the boundary quality based on the results of any instance segmentation model.
Following the idea of looking closer to segment boundaries better, we extract and refine a series of small boundary patches along the predicted instance boundaries.
The refinement is accomplished by a boundary patch refinement network at higher resolution.
Related Works:
Instance Segmentation
Semantic Segmentation
Panoptic Segmentation
Boundary Refinement
Extra and specialized module: Designing a boundary-aware segmentation model by integrating an extra and specialized module to process boundaries.
BMask R-CNN [7] and Gated-SCNN [35] employ an extra branch to enhance the boundary awareness of mask features by estimating boundaries directly.
PointRend [17] iteratively samples the feature points with unreliable predictions and refines them with a shared MLP.
Post-processing scheme: Refine the boundaries based on the results of existing segmentation models with a post-processing scheme.
SegFix [48] replaces the unreliable predictions of boundary pixels with the predictions of interior pixels.
PolyTransform [21] transforms the contour of an instance into a set of polygon vertices, using a Transformer [38] based network to predict the offsets of vertices towards object boundaries.
Boundary Detection
DeepStrip (Zhou et al., 2020) proposes to convert the boundary regions into a strip image and compute a boundary prediction in the strip domain.
BPR and DeepStrip are significantly different in design:
DeepStrip predicts the boundary pixels directly while BPR learns to predict foreground pixels. BPR is substantially easier to optimize since the proportions of foreground and background pixels are roughly the same in the boundary patches.
DeepStrip requires an 80 × 4096 strip image as the input, while BPR processes square image patches (e.g., 64×64) and requires the corresponding mask patches as the input.
The pipeline of the proposed BPR is more concise than DeepStrip, which adopts a series of carefully designed operations and loss functions.
Contribution:
Motivated by the human segmentation behavior, propose a conceptually simple yet effective post-processing framework to improve the boundary quality through a crop-then-refine strategy.
Specifically, given a coarse instance mask produced by any instance segmentation model:
Extract a series of small image patches along the predicted instance boundaries.
After concatenating with mask patches, the boundary patches are fed into a refinement network, which performs binary segmentation to refine the coarse boundaries.
The refined mask patches are then reassembled into a compact and high-quality instance mask.
==> Proposed framework as BPR (Boundary Patch Refinement).
2) Methodology:
Problem:
Two critical issues leading to low-quality boundary segmentation:
The low spatial resolution of the output makes finer details around object boundaries disappear. The predicted boundaries are always coarse and imprecise:
E.g. 28×28 in Mask R-CNN or at most 1/4 input resolution in some one-stage frameworks.
Pixels around object boundaries only make up a small fraction of the whole image (less than 1%), and are inherently hard to classify.
==> Treating all pixels equally may lead to an optimization bias towards smooth interior areas while underestimating the boundary pixels.
Baseline: Mask R-CNN ResNet-FPN-50
Patch size: 64×64 without padding.
Refine Network: HRNetV2-W18-Small (input size 128×128)
Input: Concatenate[Image Patch (w,h,3), Mask Patch (w,h,1)]
Output: Refined Patch (w,h,2) Foreground and Background.
NMS threshold: 0.25
Boundary Patch Extraction
Purpose: Given an instance mask produced by an instance segmentation model, we first need to determine which part of the mask should be refined.
Method: Propose an effective sliding-window style algorithm to extract a series of patches along the predicted instance boundaries (a code sketch follows this list).
Densely assign a group of square bounding boxes whose central areas cover the boundary pixels.
Apply a Non-Maximum Suppression (NMS) algorithm to filter out overlapping and redundant patches, keeping only a subset.
Also extract the corresponding binary mask patches from the given instance mask.
The concatenated image and mask patches are resized and fed into the following boundary patch refinement network.
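A minimal NumPy sketch of the extraction scheme above, assuming a binary instance mask and a fixed 64×64 patch size; the uniform box scoring and greedy NMS ordering are simplifying assumptions, not the authors' exact implementation.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def box_iou(a, b):
    """IoU of two (y0, x0, y1, x1) boxes."""
    y0, x0 = max(a[0], b[0]), max(a[1], b[1])
    y1, x1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, y1 - y0) * max(0, x1 - x0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def extract_boundary_boxes(mask, patch_size=64, nms_thr=0.25):
    """Densely place square boxes centered on boundary pixels, then filter with NMS.

    mask: (H, W) binary instance mask. Returns a list of (y0, x0, y1, x1) boxes.
    """
    mask = mask.astype(bool)
    boundary = mask & ~binary_erosion(mask)        # boundary = mask minus its erosion
    H, W = mask.shape
    half = patch_size // 2
    keep = []
    for y, x in zip(*np.nonzero(boundary)):
        y0 = int(np.clip(y - half, 0, H - patch_size))
        x0 = int(np.clip(x - half, 0, W - patch_size))
        box = (y0, x0, y0 + patch_size, x0 + patch_size)
        # Greedy NMS with uniform scores: keep a box only if it does not overlap
        # an already-kept box by more than the eliminating threshold.
        if all(box_iou(box, k) <= nms_thr for k in keep):
            keep.append(box)
    return keep
```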
Boundary Patch Refinement
Mask Patch (Figure 2.e)
Purpose:
Context information plays a vital role in pixel-wise classification.
The cropped image patches are hard to classify independently due to the limited context information.
Method: Use the binary mask patch to accelerate training convergence and to provide location guidance for the instance to be segmented.
Reasons:
The refinement network eliminates the need for learning instance-level semantics from scratch, and only needs to learn how to locate the hard pixels around the decision boundary and push them to the correct side.
This can be achieved by exploring low-level image properties (e.g. color consistency and contrast) provided in the local and high-resolution image patches.
Adjacent instances are likely to share an identical boundary patch while having totally different, ambiguous learning goals; providing a distinct mask patch for each instance avoids this ambiguity.
Boundary Patch Refinement Network
Purpose: The refinement network performs binary segmentation for each extracted boundary patch individually.
Method:
Any semantic segmentation model can be employed for this task by simply modifying the input channels to 4 (3 for the RGB image patch and 1 for the binary mask patch) and output classes to 2.
Adopt the state-of-the-art HRNetV2 as the refinement network, which can maintain high-resolution representation throughout the whole network.
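A minimal PyTorch sketch of this adaptation, using torchvision's FCN-ResNet-50 (one of the refinement networks reported later) as a convenient stand-in for the MMSegmentation models; the torchvision API (≥ 0.13) and layer names are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import fcn_resnet50

# Binary segmentation head: 2 output classes (foreground / background).
model = fcn_resnet50(weights=None, num_classes=2)

# Widen the stem to 4 input channels: RGB image patch + binary mask patch.
old = model.backbone.conv1
model.backbone.conv1 = nn.Conv2d(4, old.out_channels, kernel_size=old.kernel_size,
                                 stride=old.stride, padding=old.padding, bias=False)

# Dummy forward pass with concatenated image and mask patches (resized 64x64 crops).
image_patch = torch.randn(8, 3, 256, 256)
mask_patch = torch.randint(0, 2, (8, 1, 256, 256)).float()
logits = model(torch.cat([image_patch, mask_patch], dim=1))["out"]  # (8, 2, 256, 256)
```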
Reassembling
Purpose:
Refined patches: The refined mask patches are reassembled into a compact instance-level mask, replacing the previous predictions in the covered regions.
Non-refined regions: Predictions remain unchanged.
Overlapping areas of adjacent patches: the results are aggregated by simply averaging the output logits and applying a threshold of 0.5 to distinguish the foreground and background.
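A sketch of the reassembling step, assuming per-patch foreground probabilities (after softmax) rather than raw logits; pixels covered by at least one refined patch take the averaged, thresholded prediction, while all other pixels keep the original mask.

```python
import numpy as np

def reassemble(coarse_mask, boxes, patch_probs):
    """coarse_mask: (H, W) binary mask from the base model.
    boxes: list of (y0, x0, y1, x1) refined patch locations.
    patch_probs: list of (h, w) foreground probabilities, one per patch.
    """
    H, W = coarse_mask.shape
    prob_sum = np.zeros((H, W), dtype=np.float32)
    count = np.zeros((H, W), dtype=np.float32)
    for (y0, x0, y1, x1), p in zip(boxes, patch_probs):
        prob_sum[y0:y1, x0:x1] += p
        count[y0:y1, x0:x1] += 1
    refined = coarse_mask.astype(bool).copy()
    covered = count > 0                                   # pixels touched by refined patches
    refined[covered] = (prob_sum[covered] / count[covered]) > 0.5
    return refined
```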
Extension to Semantic and Panoptic Segmentation
Problem 1: Pixel near borders
Patches extracted near the image border are usually entirely filled by foreground pixels (yellow boxes in Fig. 4).
These inferior patches have a negligible contribution for training since no effective boundaries are included.
They could degrade the model performance.
Method 1:
Remove these inferior patches from the training patch set when processing semantic and panoptic segmentation results (a minimal filter sketch follows this list).
For instance segmentation, objects seldom touch the image border, so this issue can be neglected.
For semantic and panoptic segmentation, image patches are more likely to be shared by adjacent semantic masks than for instance segmentation, thus mask patches play a more important role.
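A minimal sketch of this filter, assuming each training patch carries its binary mask patch: a patch is kept only when the mask patch contains both foreground and background pixels, i.e., an effective boundary. The patch container in the usage line is hypothetical.

```python
import numpy as np

def has_effective_boundary(mask_patch):
    """mask_patch: (h, w) binary array; True if both classes are present."""
    fg = int(np.count_nonzero(mask_patch))
    return 0 < fg < mask_patch.size

# Hypothetical usage: drop border patches entirely filled by foreground (or background).
# train_patches = [p for p in train_patches if has_effective_boundary(p["mask"])]
```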
Problem 2: Overlapping Pixel Predictions.
One pixel can have multiple predictions for instance segmentation (instances can overlap), but every pixel should be assigned only a single category (or instance) label for semantic (or panoptic) segmentation.
Method 2:
For pixels with more than one prediction, we keep the semantic (or instance) label with the maximum confidence, to ensure the uniqueness of mask predictions after patch reassembling.
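A sketch of this uniqueness rule, assuming refined per-mask confidence maps are available and every pixel is covered by at least one mask; each pixel keeps the label of the most confident mask.

```python
import numpy as np

def resolve_overlaps(confidence_maps, labels):
    """confidence_maps: (N, H, W) per-mask confidences after reassembling.
    labels: length-N semantic category (or instance) ids.
    Returns an (H, W) label map with a single label per pixel.
    """
    winner = np.argmax(confidence_maps, axis=0)   # index of the most confident mask per pixel
    return np.asarray(labels)[winner]
```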
Learning and Inference
The refinement network is trained based on the boundary patches extracted from training images and tested on validation or testing images.
We do not directly train or fine-tune the instance segmentation models.
Only extract boundary patches from instances whose predicted masks have an Intersection over Union (IoU) overlap larger than 0.5 with the ground-truth masks; all predicted instances are retained during inference (see the sketch after this list).
The model outputs are supervised with the corresponding ground-truth mask patches using the pixel-wise binary cross-entropy loss.
The NMS eliminating threshold is set to 0.25 during training, while different thresholds are adopted during inference depending on speed requirements.
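A sketch of the training-time instance filter described above: each predicted mask is matched to its best ground-truth mask by mask IoU, and only predictions with IoU above 0.5 contribute boundary patches. The greedy best-match strategy is an assumption.

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two binary masks of the same shape."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def select_training_instances(pred_masks, gt_masks, thr=0.5):
    """Keep predicted instances whose best-matching GT mask has IoU > thr."""
    return [pm for pm in pred_masks
            if max((mask_iou(pm, gm) for gm in gt_masks), default=0.0) > thr]
```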
3.1 Experimental Results: Instance Segmentation
Dataset:
We mainly report the results on Cityscapes (Cordts et al., 2016), a real-world dataset with high-quality instance segmentation annotations. We only used the fine data, containing 2,975/500/1,525 images for train/val/test, which were collected from 27 cities at a high resolution of 1024×2048 pixels. There are eight instance categories, including bicycle, bus, person, train, truck, motorcycle, car, and rider.
Metrics:
The COCO-style (Lin et al., 2014) mask AP (averaged over 10 IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05), AP50/AP75/AP90 (AP at an IoU of 0.5/0.75/0.9 respectively) and APS/APM/APL (for small/medium/large instances) were reported in most of our experiments. The official Cityscapes-style AP (Cordts et al., 2016) was only used to report the final results for a fair comparison, and was slightly higher than the COCO-style AP. Similar to (Takikawa et al., 2019; Liang et al., 2020; Yuan et al., 2020b), we also used a boundary F-score to evaluate the quality of the predicted boundaries. A mask was considered correct if its boundary was within a certain distance threshold from the ground-truth. We used a threshold of one pixel and only computed the score for true positives, determined using the same 10 IoU thresholds ranging from 0.5 to 0.95. The boundary F-score was computed per instance and then averaged over instances, termed AF. In addition to the AP and boundary F-score (AF) metrics, we further evaluated the performance with a newly proposed metric designed for measuring boundary quality, boundary AP (Chen et al., 2021) (APb for short), to demonstrate the effectiveness of the proposed BPR method for boundary refinement. Boundary AP is calculated based on boundary IoU, which computes the IoU for mask pixels within a certain distance from the corresponding ground-truth or prediction boundary contours.
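A simplified sketch of a boundary F-score at a 1-pixel tolerance for a single instance (the paper's exact evaluation protocol, including the true-positive matching over IoU thresholds, is not reproduced here): boundary pixels of prediction and ground truth are matched within the tolerance, and precision and recall are combined into F.

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

def boundary_f_score(pred, gt, tol=1):
    """pred, gt: (H, W) binary masks; tol: pixel distance tolerance."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    pred_b = pred & ~binary_erosion(pred)              # predicted boundary pixels
    gt_b = gt & ~binary_erosion(gt)                    # ground-truth boundary pixels
    gt_band = binary_dilation(gt_b, iterations=tol)    # pixels within tol of the GT boundary
    pred_band = binary_dilation(pred_b, iterations=tol)
    precision = (pred_b & gt_band).sum() / max(pred_b.sum(), 1)
    recall = (gt_b & pred_band).sum() / max(gt_b.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)
```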
Implementation Details:
The MMSegmentation (MMSegmentation, 2020) codebase was adopted to implement the boundary patch refinement network. During training, the image patches were augmented by random horizontal flipping and random photometric distortion. The binary mask patches were normalized with the mean and standard deviation both equal to 0.5. We used the SGD optimizer with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 0.0005. The learning rate was decayed using the poly learning rate policy with a power of 0.9. The models were trained for 160K iterations with a batch size of 32 on 4 GPUs and syncBN (Zhang et al., 2018). To give an impression of the training speed, we take the default setting adopted in the ablation studies (see below) as an example. We extracted 280k/67k patches from the train/val results of Mask R-CNN (adopted from MMDetection (Chen et al., 2019)). It took about 10 hours to train on 4 NVIDIA RTX 2080Ti GPUs under this setting.
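The poly learning-rate policy mentioned above, as a small sketch with the stated hyperparameters (base LR 0.01, power 0.9, 160K iterations).

```python
def poly_lr(cur_iter, base_lr=0.01, max_iter=160_000, power=0.9):
    """Poly decay: lr = base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1 - cur_iter / max_iter) ** power

# e.g. poly_lr(80_000) == 0.01 * 0.5 ** 0.9 ≈ 0.0054
```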
Ablations:
Effects of Mask Patch
Made a comparison by eliminating the mask patches while keeping other settings unchanged.
Table 2: With mask patches, achieved a significant improvement
Figure 3: With the help of mask patches, produced high-quality predictions with accurate and distinct boundaries.
Patch Size
Increased the boundary patch size by cropping with a larger box and/or with padding.
The padded areas were only used to enrich the context and were not used for reassembling (see the sketch after this list).
As the patch size gets larger, the model becomes less focused but can access more context information.
Table 3: 64×64 patch without padding works better.
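A small sketch of cropping with padding (a hypothetical helper, assuming the padded window stays inside the image): the extra border only enriches context for the refinement network, and only the central core region is pasted back during reassembling.

```python
def crop_with_padding(image, y0, x0, size=64, pad=16):
    """Crop a (size + 2*pad)-sized context window around a size x size core patch.

    Assumes the padded window lies fully inside the image. The returned `core`
    slice marks the central region that is used for reassembling.
    """
    patch = image[y0 - pad: y0 + size + pad, x0 - pad: x0 + size + pad]
    core = (slice(pad, pad + size), slice(pad, pad + size))
    return patch, core
```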
Different Patch Extraction Schemes
The most important contribution of this work is the idea of looking closer at instance boundaries to achieve better segmentation results. There are multiple choices for how to extract the boundary patches for refinement. We compared three extraction schemes, as shown in Fig. 6. The most straightforward scheme is to pre-define a grid and divide the input image into small patches (Fig. 6b); we then select patches that cover boundary pixels as boundary patches for refinement. Another scheme is to extract the instance-level patch (Fig. 6c) based on the detected bounding box and further re-segment the instance patch, similar to previous studies (Liang et al., 2020; Liu et al., 2020). This scheme can be viewed as an improved Mask R-CNN equipped with a stand-alone mask head. It does not solve the optimization bias issue and the learning process is dominated by interior pixels. Experimental results are listed in Table 4. For the pre-defined grid scheme, we varied the patch size and found the results were consistently worse than our proposed “dense sampling + NMS filtering” scheme. The results were improved slightly by enabling padding but were still sub-optimal. One of the most important reasons is the imbalanced foreground/background ratio. We observed that some extracted patches were almost entirely filled with either foreground or background pixels (yellow dashed boxes in Fig. 6b). These patches were hard to refine due to the lack of context. In contrast, by restricting the center of patches to cover the boundary pixels (Fig. 6a), the imbalance problem can be alleviated. For the instance-level patch scheme, even when the patch size was enlarged to 512×512 pixels, the results were still sub-optimal.
Input Size of the Refinement Network
The extracted boundary patches were upsampled into a larger scale before refinement. Table 5 shows the impact of input size. We also report the approximate inference speed of the refinement network, with a fixed batch size of 135 (on average 135 patches per image). As the input size increased, the AP and AF scores increased accordingly, and slightly dropped after 256. This is reasonable as more details are retained with larger input size, but very large input would diverge the network’s focus.
Alternatives of refinement network
Most semantic segmentation models can be used to perform binary segmentation for boundary patches. We compared the performance of different segmentation networks (Table 6). The performance was robust to the choice of refinement network. A stronger backbone usually led to higher performance, but at the expense of lower speed. Even with some lightweight networks (e.g., HRNet-W18s, ResNet-18, Fast-SCNN (Poudel et al., 2019)), we still achieved nontrivial improvements. We adopted FCN-ResNet-50 in the final model. Since the model essentially performs binary segmentation for patches, it can further benefit from advances in semantic segmentation, such as increasing the resolution of feature maps (Wang et al., 2020b; Chen et al., 2017, 2018).
NMS Eliminating Threshold
We studied the impact of different NMS eliminating thresholds during inference, shown in Table 7. The reported “#patch/img” indicates the total number of patches for all instances in an image. There might be several instances per image, and the number of patches also varied across different instances (e.g., larger instances may produce more boundary patches) in the image. As the threshold got larger, the number of boundary patches increased rapidly. The overlap of adjacent patches provides a chance to correct unreliable predictions of the inferior patches. As shown, the resulting boundary quality was consistently improved with a larger threshold, and saturated around 0.55. We fixed the NMS eliminating threshold to 0.25 during training. During inference, 0.25 was used in ablation experiments (Sect. 4.2) and 0.55 was used in stronger models (Sects. 4.3 and 4.4).
Transferability:
What the BPR model learned is a general ability to correct error pixels around instance boundaries. We can easily transfer this boundary refinement ability to refine the results of any instance segmentation model. After training, the BPR model is model-agnostic, similar to SegFix (Yuan et al., 2020b). Specifically, once we have a model trained on the boundary patches extracted from the predictions of Mask R-CNN on Cityscapes, we can run inference to refine the predictions of any model (not only Mask R-CNN) on the same dataset, without retraining. We validated the transferability by applying the model trained on Mask R-CNN results to refine the predictions of PointRend (Kirillov et al., 2020) and SegFix (Yuan et al., 2020b). Note that these two methods are also designed to improve boundary quality in segmentation. As shown in Table 8, the transferred model still improved the results of PointRend and SegFix by a large margin, suggesting that our method is compatible with them. In these experiments, the BPR model was trained on Mask R-CNN (w/ COCO pre-training) predictions. At the same time, choosing which model's predictions to train the BPR model on is also worth considering. We trained several BPR models on different segmentation results and applied them to refine the result of Mask R-CNN (Table 9). We adopted FCN-ResNet-50 as the refinement network with an NMS threshold of 0.55; other settings were the same as in the ablation experiments above. We found that training the BPR model with the predictions of Mask R-CNN without COCO pre-training worked the best. The reason why training with the predictions of the weaker Mask R-CNN (w/o COCO pre-training) was better than training with those of Mask R-CNN (w/ COCO pre-training) might be that the boundaries of the weaker model were usually coarser, thus providing more diverse boundary patches for training the BPR model.
Overall Results:
Comparison with State-of-the-art Methods
We adopted the optimal design choices and hyperparameters found in the ablation experiments to train a stronger BPR model. Specifically, we adopted FCN-ResNet-50 as our refinement network, with 256×256 input patches resized from 64×64, and an NMS threshold of 0.55 during inference. The BPR model here was trained on the results of Mask R-CNN (w/o COCO pre-training). The model was evaluated on the Cityscapes val and test sets and compared against some state-of-the-art methods, including DWT (Bai & Urtasun, 2017), SGN (Liu et al., 2017), Mask R-CNN (He et al., 2017), BMask R-CNN (Chen et al., 2020b), AdaptIS (Sofiiuk et al., 2019), PANet (Liu et al., 2018), SSAP (Gao et al., 2019), and UPSNet (Xiong et al., 2019) (Table 10). We had the following observations. (1) Compared with the Mask R-CNN baseline, we achieved a significant improvement (+4.5% and +4.6% AP on the val and test sets). Our BPR outperformed SegFix (Yuan et al., 2020b), which is also a boundary refinement module applied to the same baseline, by a large margin. Applying our BPR model to the results already refined by SegFix led to even better results (slightly lower than applying BPR only). (2) We also applied BPR to the strong PolyTransform (Liang et al., 2020) baseline (1st place at CVPR 2020): “PolyTransform + BPR” consistently improved AP by 2.6% on the Cityscapes test set and also outperformed “PolyTransform + SegFix” (2nd place at ECCV 2020) by a large margin (+1.5%). (3) By applying BPR to the stronger “PolyTransform + SegFix” (Yuan et al., 2020b) baseline, we achieved state-of-the-art results on the Cityscapes test set with an AP of 42.8%. (4) Our BPR improved the results of RefineMask (Zhang et al., 2021), a newly published method focusing on boundary refinement for instance segmentation, by a large margin (+3.2% AP). (5) The results of the proposed BPR model were slightly better than those reported in our conference paper (Tang et al., 2021), due to the change of the refinement network (from HRNet-W48 to FCN-ResNet-50) and of the training data source (from Mask R-CNN w/ COCO pre-training to Mask R-CNN w/o COCO pre-training).
We further applied the proposed BPR method to refine the results of Mask2Former (Cheng et al., 2022), a recently proposed query-based method for instance segmentation that achieves remarkable performance on popular benchmarks. As shown in Table 11, the proposed BPR model successfully improved the results of Mask2Former on Cityscapes val, even with the powerful Swin-L (Liu et al., 2021) backbone. Note that the best-performing model (45.6%) still lagged behind “PolyTransform + BPR” (46.9%) since COCO pre-training was not used in Mask2Former (Cityscapes fine data only).
Comparison with Similar Methods
Several previous methods also focus on boundary refinement for segmentation, such as BMask R-CNN (Chen et al., 2020b), PointRend (Kirillov et al., 2020), and SegFix (Yuan et al., 2020b). SegFix and our BPR are model-agnostic. BMask R-CNN and PointRend add or replace the head of Mask R-CNN, and the original papers only report results based on the Mask R-CNN (w/o COCO) baseline. We further compared with these methods under the same baseline. As shown in Table 12, our BPR method remarkably improved the baseline results (+6.1% and +5.8% AP on the val and test sets respectively), and outperformed these similar approaches by large margins.
Qualitative Results
We show some qualitative results on Cityscapes val in Fig. 7a. Compared with the coarse predictions of Mask R-CNN, our BPR generated substantially better segmentation results with precise and clear boundaries. It largely alleviated the over-smoothing issues (Kirillov et al., 2020) in previous methods caused by the low resolution feature maps.
Speed
The inference time of our proposed framework is independent of the original instance segmentation model and consists of three parts: patch extraction, refinement, and reassembling. Note that only the refinement part was considered when we calculated the FPS in Tables 5, 6, and 7. Besides, the FPS was measured in an imprecise manner by fixing the batch size to 135 (the average number of patches per image), while the exact number of patches varies from image to image. Here we report the total inference time, measured by calculating the exact inference time for each image individually and then taking the average. Taking the default setting (HRNet-W18s with an input size of 128×128) in the ablation experiments as an example, it took about 211ms (52ms, 81ms, and 78ms for the three parts respectively) to process an image (1024×2048) of Cityscapes on a single RTX 2080Ti GPU, which is still much faster than PolyTransform (575ms per image (Liang et al., 2020), measured on a single GTX 1080Ti GPU, which is about 35% slower than our RTX 2080Ti GPU with FP32 training (Li, 2019)). Undoubtedly, the network speed can be further improved with more efficient backbones (e.g., MobileNets), smaller input sizes (e.g., 32×32 or 64×64), and fewer inference patches (e.g., with lower NMS thresholds or by adaptively selecting the most unreliable patches). Note that the BPR models still achieve remarkable performance under these lightweight settings (Tables 5, 6, 7). The patch extraction and reassembling steps can also be accelerated with more CPU cores.
Limitation Analysis
The performance of our proposed framework relies on the initial masks. Some failure cases are illustrated in Fig. 8. For example, our model failed to produce an optimal mask if the initially predicted boundaries were far from the real object boundaries (the 1st row), but note that we still refined this case to some extent (the IoU was improved). In addition, if the initial mask over-segments a neighboring instance, our model may regard the two instances as a whole and further amplify this error (the 2nd and 3rd rows), since we only process the local boundary regions without a global view. We analyzed the IoU improvements for all predicted instances on the Cityscapes val set, shown in Fig. 9. In most cases, our refinement model improved the mask IoU (red dots above the dashed line). However, we found that it was hard to refine instance masks with extremely low IoU (e.g., < 0.1) due to the poor quality of the initial boundaries. In addition, we observed that the improvement for smaller instances (about 2% in APS) was not as high as for larger instances (about 5% in APL).
Results on COCO Dataset.
To demonstrate the generality of our framework, we also report the results on the COCO dataset (Lin et al., 2014), which contains 80 categories and more images (118k/5k for train/val). It is important to note that the coarse annotations in COCO may not fully reflect the improvements in mask quality (Gupta et al., 2019). Following some previous works (Kirillov et al., 2020; Zhang et al., 2021), we further report the AP measured using the higher-quality LVIS (Gupta et al., 2019) annotations. We randomly sampled about 8% of instances for fast training. As shown in Table 13, we improved the powerful Mask R-CNN ResNeXt-101-FPN baseline by 0.8% AP and 1.7% AP (with COCO and LVIS annotations, respectively) on val2017. The AP improvement on the COCO dataset was not as high as on Cityscapes. The most critical problem is that the coarse polygon-based annotations on the COCO dataset yield significantly lower boundary quality (Gupta et al., 2019). Several examples (which are ubiquitous on COCO) are shown in Fig. 10 (top row). The misalignment between annotations and real instance boundaries may greatly increase the optimization difficulty of our refinement model. In particular, the coarse annotations may provide ambiguous optimization objectives for our local boundary patches, thus hampering model convergence. We observed that some contour-based instance segmentation methods (Xie et al., 2020; Xu et al., 2019; Peng et al., 2020), which are sensitive to the quality of boundary annotations, also suffered from this misalignment issue. It seems that the coarse COCO annotations may not be friendly to these methods, and it is hard to achieve very high AP scores using these approaches. In spite of this, we still improved the Mask R-CNN results in some cases, as shown in Fig. 10 (the middle and bottom rows). Some results were even better than the annotations.
3.2 Experimental Results: Semantic Segmentation
Dataset:
We used the fine data of the Cityscapes dataset, containing 2,975/500/1,525 images for train/val/test. Different from instance segmentation, 19 categories are involved for semantic segmentation. For evaluation, we report the frequently-used class-wise mIoU to measure the semantic segmentation performance, and the boundary F-score (AF) to measure the quality of predicted boundaries. The AF calculation for semantic segmentation is slightly different from instance segmentation (Sect. 4.1) since there is no instance concept. We adopted exactly the same metric as previous studies (Takikawa et al., 2019; Yuan et al., 2020b; Wang et al., 2022) to calculate the boundary F-score for each category, and then averaged over categories.
Implementation Details:
The differences between applying BPR to instance segmentation and to semantic segmentation are described in Sect. 3.3. Most configurations were the same as for instance segmentation, except the batch size. Since more boundary patches were extracted from semantic segmentation results, the batch size was increased to 144. In this set of experiments, the BPR model trained on the results of HRNet-W18s was transferred to refine the results of other models.
Ablation:
We have validated the effectiveness of most configurable design choices in previous experiments (Sect. 4.2). Here we conducted ablation experiments for the design choices specific to semantic segmentation. The configurations described in Sect. 4.2 were adopted. We have the following observations. (1) As shown in Table 14 (the last two rows), removing the inferior patches extracted near the image border improved the performance, echoing the analysis in Sect. 3.3 and Fig. 4. (2) The BPR model without mask patches failed to converge (only 45.1% AP). The reason is that image patches were usually shared by adjacent objects but the learning goals were different (see the last two rows of Fig. 5). The location and semantic information provided by the mask patches avoids this issue.
Quantitative Results:
We applied the BPR model to refine a variety of semantic segmentation results (Wang et al., 2020b; Yuan et al., 2020a; Long et al., 2015; Chen et al., 2017; Kirillov et al., 2020; adopted from MMSegmentation). The BPR model was trained with the same configurations as described in Sect. 4.4. As shown in Table 15, we achieved consistent improvements over different baselines. For example, we improved the HRNet-W18s baseline by 2.6% mIoU and 8.5% AF. The significant improvement on the boundary-sensitive AF score demonstrates the effectiveness of the boundary refinement. On the powerful HRNet-W48-OCR baseline, we still improved mIoU by 1.0% and 0.6% under the single-scale and multi-scale settings respectively. In addition, we consistently outperformed SegFix (Yuan et al., 2020b) on both mIoU and AF metrics over different baselines.
Note that the overall improvements on semantic segmentation models (+0.6% ∼ +2.6%) are not as high as those on instance segmentation (+2.5% ∼ +6.1%). Delving into the class-wise mIoU in Table 15, we found that categories with smaller and more fragmented labeling areas (e.g., traffic light, traffic sign, rider) usually got more significant improvements than categories with larger and more coherent areas (e.g., road, building, vegetation). The reason lies in the IoU definition: for smaller regions, boundary pixels contribute much more to the IoU calculation than for larger regions. Thus categories with smaller regions can benefit more from boundary refinement.
Qualitative Results:
We show some qualitative results on Cityscapes val in Fig. 7b. Compared with the initial predictions of HRNet, our BPR framework generated better semantic segmentation results with precise boundaries. One limitation is that the performance of refinement relies on the initial predictions, which is similar to instance segmentation. For example, our model failed to refine the poor initial masks illustrated in Fig. 7b (last column, dashed boxes).
3.3 Experimental Results: Panoptic Segmentation
Dataset:
We used the fine data of the Cityscapes dataset. There are 8 ‘thing’ and 11 ‘stuff’ classes. For evaluation, we used the panoptic quality (PQ) metric (Kirillov et al., 2019b) to measure the performance, including the breakdowns of recognition (RQ) vs. segmentation (SQ) performance and stuff (PQSt) vs. things (PQTh) performance. In addition to the standard mask PQ (Kirillov et al., 2019b), we further evaluated the performance with a recently proposed metric designed for measuring boundary quality, boundary PQ (Chen et al., 2021), to demonstrate the effectiveness of the proposed method for boundary refinement.
Implementation Details:
The differences between applying BPR to instance segmentation and to panoptic segmentation are described in Sect. 3.3. Most configurations were the same as for instance segmentation, except the batch size (which was increased to 256). In this set of experiments, the BPR model trained on the results of UPSNet-ResNet-50 (Xiong et al., 2019) was transferred to refine the results of other models. We omitted ablation experiments here because the conclusions drawn in the previous ablation experiments (Sects. 4.2 and 5.2) were also valid for panoptic segmentation.
Quantitative Results:
We applied the BPR model to refine the results of several typical panoptic segmentation models. The BPR model was trained with the same configurations as in Sect. 4.4. As shown in Table 16, we achieved consistent improvements over different baselines (Xiong et al., 2019; Chen et al., 2020a). For example, we improved the UPSNet-ResNet-101-COCO (Xiong et al., 2019) baseline by 2.7% PQ and by 6.4% boundary PQ. The improvements in terms of boundary PQ are more significant than those of the standard mask PQ, which suggests that the proposed method successfully improved the boundary quality for panoptic segmentation. Notably, for the two UPSNet models, the improvements on thing classes (PQTh) were larger than on stuff classes (PQSt), while the observation for the two Panoptic-DeepLab models was the opposite. Besides, the overall improvements on UPSNet models (+2.4% ∼ +2.7% standard PQ) were larger than on Panoptic-DeepLab models (+1.3% ∼ +1.5% standard PQ). These differences may be due to the distinct design principles of the two methods. UPSNet (Xiong et al., 2019) is a top-down method, which first produces well-segmented instance results and then fills in the remaining regions with a semantic head. Panoptic-DeepLab (Chen et al., 2020a) is a bottom-up method, which groups instances from the well-segmented semantic results. As a result, UPSNet pays more attention to thing classes (higher PQTh), while Panoptic-DeepLab pays more attention to stuff classes (higher PQSt).
Qualitative Results:
We show some qualitative results on Cityscapes val in Fig. 7c. Compared with the initial predictions of Panoptic-DeepLab, our BPR framework generated better panoptic segmentation results with precise and clear boundaries. A similar limitation was observed for panoptic segmentation: as shown in Fig. 7c (last column, red dashed box), the model failed to refine poor initial masks.