FocusCut
{Global view; Focus view}
Interactive image segmentation is an essential tool in pixel-level annotation and image editing.
To obtain a high-precision binary segmentation mask, users tend to add interaction clicks around the object details, such as edges and holes, for efficient refinement.
Current methods regard these repair clicks as the guidance to jointly determine the global prediction.
However, the global view makes the model lose focus on later clicks and is not in line with user intentions.
In this paper, we dive into the clicks' own view to restore their decisive role in refining object details.
FocusCut integrates the functions of object segmentation and local refinement.
After obtaining the global prediction, it crops click-centered patches from the original image with adaptive scopes to refine the local predictions progressively.
More efficient modes of user interaction:
The bounding-box-based: [50].
Deep grab cut for object selection. In BMVC, 2017.
The polygon-based: [1, 6, 32].
Efficient interactive annotation of segmentation datasets with polygon-rnn++. In CVPR, 2018.
Annotating object instances with a polygon-rnn. In CVPR, 2017.
Fast interactive object annotation with curve-gcn. In CVPR, 2019.
The click-based: [2, 29, 36].
Interactive full image segmentation by considering all regions jointly. In CVPR, 2019.
Deep interactive thin object selection. In WACV, 2021.
Deep extreme cut: From extreme points to object segmentation. In CVPR, 2018.
The scribble-based: [3, 48].
Error-tolerant scribbles based interactive image segmentation. In CVPR, 2014.
Milcut: A sweeping line multiple instance learning paradigm for interactive image segmentation. In CVPR, 2014.
Combinations [34, 52].
Two-in-one refinement for interactive segmentation. In BMVC, 2020.
Interactive object segmentation with inside-outside guidance. In CVPR, 2020.
More efficient use of the interaction provided by users:
The interaction ambiguity [9, 26, 30].
Conditional diffusion for interactive segmentation. In ICCV, 2021.
Interactive image segmentation with latent diversity. In CVPR, 2018.
Multiseg: Semantically meaningful, scale-diverse segmentations from minimal user input. In ICCV, 2019.
The input information [31, 35].
Interactive image segmentation with first click attention. In CVPR, 2020.
Content-aware multilevel guidance for interactive instance segmentation. In CVPR, 2019.
The back-propagating refinement [20, 41].
Interactive image segmentation via backpropagating refinement scheme. In CVPR, 2019.
f-brs: Rethinking backpropagating refinement for interactive segmentation. In CVPR, 2020.
Traditional methods of interactive segmentation build models on the low-level features of the image. These methods may become invalid in complex environments.
Intelligent scissors and lazy snapping.
GraphCut => GrabCut, Random Walks (RW), RW with Restart.
Generative image segmentation using random walks with restart. In ECCV, 2008.
Approaches [21, 46, 47] further improve the traditional methods:
Interactive image segmentation using adaptive constraint propagation. IEEE TIP, 2016.
Diffusive likelihood for interactive image segmentation. PR, 2018.
Probabilistic diffusion for interactive image segmentation.
Neural-network-based: thanks to the ability to comprehensively consider global and local features, neural networks greatly improve the segmentation results.
Recurrent-neural-network-based [1, 6]
Efficient interactive annotation of segmentation datasets with polygon-rnn++. In CVPR, 2018.
Annotating object instances with a polygon-rnn. In CVPR, 2017.
Graph convolutional network [32]
Fast interactive object annotation with curve-gcn. In CVPR, 2019.
Iterative training [33]
Iteratively trained interactive segmentation. In BMVC, 2018.
Reinforcement learning [27, 42]
Iteratively refined interactive 3d medical image segmentation with multi-agent reinforcement learning. In CVPR, 2020.
Automatic seed generation with deep reinforcement learning for robust interactive segmentation. In CVPR, 2018.
CNN-based
The extreme points:
common objects [36] - Deep extreme cut: From extreme points to object segmentation. In CVPR, 2018.
thin objects [29] - Deep interactive thin object selection. In WACV, 2021.
full image [2] - Interactive full image segmentation by considering all regions jointly. In CVPR, 2019.
The boundary clicks [19, 24].
Click carving: Interactive object segmentation in images and videos with point clicks. IJCV, 2019.
Interactive boundary prediction for object selection. In ECCV, 2018.
The combination of interactions, such as the bounding box and clicks [4, 52].
Large-scale interactive object segmentation with human annotators. In CVPR, 2019.
Interactive object segmentation with inside-outside guidance. In CVPR, 2020.
Providing points in the foreground and background.
Deep interactive object selection. In CVPR, 2016. (deep-learning-based algorithm, along with a click map transformation and several random sampling strategies)
Regional interactive image segmentation networks. In ICCV, 2017 (exploit the local region from click pairs to refine the segmentation results)
A fully convolutional two-stream fusion network for interactive image segmentation. NN, 2019. (provide a two-branch architecture for this task)
Content-aware multilevel guidance for interactive instance segmentation. In CVPR, 2019. (improve the transformation of user clicks by generating content-aware guidance maps.)
Interactive image segmentation via back-propagating refinement scheme. In CVPR, 2019. (develop BRS to correct the mislabeled pixels in the initial results, which has been improved in f-BRS [41])
Continuous adaptation for interactive object segmentation by learning from corrections. In ECCV, 2020. (employ user corrections as training samples and update the model parameters instantly.)
Interactive image segmentation with latent diversity. In CVPR, 2018. (couple two convolutional networks to train and select the proper result)
Multiseg: Semantically meaningful, scale-diverse segmentations from minimal user input. In ICCV, 2019. (introduce scale diversity into the model to help users quickly locate their desired target).
Interactive image segmentation with first-click attention. In CVPR, 2020. (emphasize the critical role of the first click and take it as special guidance).
Conditional diffusion for interactive segmentation. In ICCV, 2021. (introduce a nonlocal method to fully exploit the user cues.)
Most methods transform user interactions into a guidance map that shares the same size as the whole image. In contrast, we additionally view each click in a focus view, exploiting it to its full potential.
Local information has been fully exploited in many segmentation tasks:
HAZN [49] - Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In ECCV, 2016. (can adaptively adjust the scale of view to the object or the part to refine the segmentation)
GLNet [8] - Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution images. In CVPR, 2019. (aggregates feature maps captured by local and global branches).
AWMF-CNN [44] - Adaptive weighting multi-field-of-view cnn for semantic segmentation in pathology. In CVPR, 2019. (semantic segmentation - assigns weights to different magnifications of local patches separately).
CascadePSP [11] - CascadePSP: Toward class-agnostic and very high-resolution segmentation via global and local refinement. In CVPR, 2020. (feeds image patches from the original image through the refinement module).
MagNet [18] - Progressive semantic segmentation. In CVPR, 2021. (refines segmentation results of local patches with different scales in a progressive way).
In interactive segmentation, the local views can be decided directly by the user's clicks, thus avoiding the shortcoming of having to locate which areas need refinement.
In interactive segmentation, RIS-Net [28] also exploits local information.
It generates the local patch by finding the nearest negative click for each positive click and constructing a bounding box.
The local features are extracted with an ROI pooling layer from the main branch, whose input is the image concatenated with the transformed clicks.
The local refinement is still under the influence of the entire image and other clicks, weakening the dominant role of local clicks to some extent.
Additionally, the local features are somewhat lost due to the down-sampling operation of the network.
We go a step further and adopt a purer focus view for local refinement, directly feeding the local patches centered on each click into the network and completely ignoring the influence of the whole image and other distant clicks.
Revisiting Classic Pipeline
DeepLab series - DeepLab v3+ [7].
Backbone network: ResNet [16] is most commonly adopted in interactive segmentation.
The ASPP part contains four dilated convolution branches and a global average pooling branch.
The decoder part refines the ASPP module’s output by fusing the backbone’s low-level features to generate the final prediction.
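A minimal PyTorch-style sketch of the ASPP structure just described follows; the dilation rates and channel widths are illustrative assumptions, not values taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    # Four dilated-convolution branches plus a global-average-pooling branch,
    # followed by a 1x1 projection, in the spirit of DeepLab v3+.
    # Rates and channel widths below are assumptions for illustration.
    def __init__(self, in_ch=2048, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList()
        for r in rates:
            k, pad = (1, 0) if r == 1 else (3, r)
            self.branches.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True)))
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=x.shape[2:],
                               mode='bilinear', align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))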
Interactive Image Segmentation
The input should contain information about the interactions.
The click locations are transformed into two click maps (e.g., distance, disk, or, as in our case, Gaussian maps), representing the positive and negative points.
Most works in interactive segmentation modify the input part of the network and take a 5-channel map as input, which concatenates the RGB image and the two click maps.
This can be implemented either by adding another head that encodes the 5-channel map into a 3-channel map to satisfy the standard architecture, or by directly changing the first convolutional layer, as we do.
The output will be supervised by the ground truth with the binary cross-entropy loss and binarized to the final prediction.
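As a concrete illustration, here is a minimal NumPy sketch of how Gaussian click maps and the 5-channel input could be assembled; the sigma value and the helper names are assumptions, not the paper's implementation.

import numpy as np

def gaussian_click_map(clicks, height, width, sigma=10.0):
    # One channel: each click contributes a Gaussian bump centered on it;
    # overlapping bumps are merged with a per-pixel maximum.
    ys, xs = np.mgrid[0:height, 0:width]
    cmap = np.zeros((height, width), dtype=np.float32)
    for cy, cx in clicks:
        bump = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        cmap = np.maximum(cmap, bump)
    return cmap

def build_5ch_input(rgb, pos_clicks, neg_clicks):
    # rgb: (H, W, 3) float image; clicks are lists of (y, x) coordinates.
    h, w = rgb.shape[:2]
    pos = gaussian_click_map(pos_clicks, h, w)
    neg = gaussian_click_map(neg_clicks, h, w)
    return np.concatenate([rgb, pos[..., None], neg[..., None]], axis=-1)  # (H, W, 5)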
FocusCut Pipeline
Motivation:
In the process of interactive segmentation, the user often repairs the incorrectly segmented region by providing more foreground and background points.
As the number of clicks increases, the later points gradually repair more and more local areas. Especially in the later stage, many interaction points are likely to be gathered together to repair a small area.
Due to the size of the receptive field and downsampling operations of the neural network, it is difficult to segment the whole object and detail areas simultaneously.
Focus Cut
As shown in Fig. 2, the proposed FocusCut is a pipeline for interactive segmentation that contains two interactive views.
One is the global view to segment the whole object, and the other is the focus view to refine the segmentation according to the previous coarse mask around clicks.
To reflect the effectiveness of our method, we change the architecture of the commonly used network as little as possible. We take DeepLab v3+ with an output stride of 16 as the basic network. The difference is that we regard it as a shared network, which learns not only the segmentation of the whole object but also the refinement of local areas.
To achieve this, we need to unify the two inputs.
Since the refinement in the focus view is generated based on the coarse mask, we add an extra input channel for the previous prediction. We hope that, besides object segmentation, our network can learn to generate a more accurate segmentation based on the previous prediction and the interaction points.
To achieve this goal, we use the data of global view and focus view alternately to train our network.
For the global view, we adopt the iterative training strategy [33]: in the iterative step, the coarse prediction is set to the previous segmentation; otherwise, it is set to an empty map.
The RGB image contains the whole object, and the clicks are also simulated according to the object mask, which will include at least one positive point to indicate the location of the object.
In the global view, the network takes this 6-channel map as the input to generate the prediction of the whole object.
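A minimal sketch of how this 6-channel input could be built; the channel order and helper names are assumptions (gaussian_click_map is the helper sketched earlier).

import numpy as np

def build_6ch_input(rgb, pos_map, neg_map, prev_pred=None):
    # rgb: (H, W, 3); pos_map / neg_map: (H, W) click maps;
    # prev_pred: (H, W) previous prediction, or None for an empty map
    # (the non-iterative case in the global view).
    h, w = rgb.shape[:2]
    if prev_pred is None:
        prev_pred = np.zeros((h, w), dtype=np.float32)
    return np.concatenate(
        [rgb, pos_map[..., None], neg_map[..., None], prev_pred[..., None]],
        axis=-1)  # (H, W, 6)

In the global view this would be applied to the whole image; in the focus view, the same function would be applied to the cropped patch, its local click maps, and the cropped coarse mask.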
For the focus view, the core of our method, we train the network with patch samples that represent the local information of the target.
As shown in Fig. 2, the input map is also a 6-channel map in this phase. However, the RGB images are local areas cropped from the original image, which do not necessarily cover the whole object and pay more attention to fine details.
Unlike the click maps in the global view, these click maps must contain the center point of the patch, which may be either positive or negative. We generate the coarse mask by degrading the local ground truth to reduce its fineness. These maps are concatenated and fed into the network.
Fig. 2 shows the inference phase in detail. In this phase, the user will click continually until the result meets the user’s needs.
Since the first click is bound to segment the whole object, we introduce our focus view from the second click onward.
When the current click is added, the global-view pipeline is adopted first, as shown in the top part of Fig. 2.
According to the position of the current click and the difference between the current prediction P and the previous prediction P′ in the global view, a judgment is made to determine whether the click should go through an additional focus-view path.
If the focus view is adopted, we will calculate the focus scope r for the current click. This will be introduced in Sec. 3.4.
Then the original image, clicks, and the current prediction will be cropped to a local patch according to the focus scope, which will be fed into the path of focus view to generate the local prediction P̂, as shown in the bottom part of Fig. 2.
It is worth mentioning that the image patch here is cropped from the original image.
For high-resolution images, this helps to avoid information missing and get a clearer RGB patch.
Finally, the local prediction will be pasted back to the original prediction. If there are overlaps between patches, the overlapping part adopts their mean value.
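The crop and paste-back step can be sketched as follows; the helper names are assumptions, local_pred is assumed to be the refined prediction resized back to the crop's shape, and overlapping patches are averaged as described above.

import numpy as np

def crop_patch(arr, cy, cx, r):
    # Square crop from (cy - r, cx - r) to (cy + r, cx + r), clipped to the image.
    H, W = arr.shape[:2]
    return arr[max(cy - r, 0):min(cy + r, H), max(cx - r, 0):min(cx + r, W)]

def paste_back(global_pred, patches):
    # patches: list of (local_pred, cy, cx, r). Where focus patches overlap,
    # the overlapping part takes the mean of the local predictions.
    H, W = global_pred.shape
    acc = np.zeros((H, W), dtype=np.float32)
    cnt = np.zeros((H, W), dtype=np.float32)
    out = global_pred.astype(np.float32).copy()
    for local_pred, cy, cx, r in patches:
        y0, y1 = max(cy - r, 0), min(cy + r, H)
        x0, x1 = max(cx - r, 0), min(cx + r, W)
        acc[y0:y1, x0:x1] += local_pred
        cnt[y0:y1, x0:x1] += 1.0
    mask = cnt > 0
    out[mask] = acc[mask] / cnt[mask]
    return out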
In Sec. 3.4, we also provide a progressive focus strategy to pay attention to local areas iteratively to achieve better results.
Focus Patch Simulation
In this section, we will introduce our simulation algorithm to generate the focus patches around the clicks for training.
We find that in the middle and later stages of interactive segmentation, users often click around the object boundary to make the boundary more accurate, and the object details are often near the boundary.
We generate the patch to simulate this situation. We select a point on the boundary of the object and give it a random offset based on β within [βmin, βmax] to serve as the center point of the patch.
The focus scope r is a random number related to the object size.
The object size is reflected by k, which is calculated from the ground truth G, and the random coefficient α is drawn from [αmin, αmax]. The detailed calculation process is described in Algorithm 1.
The default αmin, αmax, βmin, and βmax are 0.2, 0.8, -0.3, and 0.3 in our experiments.
With the patch center p and focus scope r, we crop the image and the corresponding ground truth as a square patch from (px − r, py − r) to (px + r, py + r).
With the patch data, we generate a coarse mask as the previous prediction through dilating and eroding randomly as in [11].
The center point will always be included as a user click.
We will also select 0 ∼ 3 positive and negative points in the patch to simulate these clicks around the center one.
These patch clicks will be transformed into click maps and fed into the network with the RGB image and the coarse mask.
In Fig. 3, we illustrate simulated patches from an image of a chair and its ground truth. It can be seen that our algorithm simulates the user's interaction positions and crops different parts, with at least one interaction point at the center of each patch. These coarse masks have low segmentation quality but retain macroscopic information, making our neural network pay attention to the refinement.
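A hedged sketch of this patch simulation follows; the way the object size k is measured and the way the offset is applied are our assumptions about Algorithm 1, while the coefficient ranges follow the defaults above.

import numpy as np

def boundary_points(gt):
    # Foreground pixels with at least one background 4-neighbour.
    fg = gt.astype(bool)
    pad = np.pad(fg, 1, constant_values=False)
    nb = (~pad[:-2, 1:-1]) | (~pad[2:, 1:-1]) | (~pad[1:-1, :-2]) | (~pad[1:-1, 2:])
    return np.where(fg & nb)

def simulate_focus_patch(image, gt, alpha=(0.2, 0.8), beta=(-0.3, 0.3)):
    # gt: binary ground-truth mask (H, W) containing a non-empty object.
    H, W = gt.shape
    k = int(np.sqrt(gt.sum()))                       # object size measure (assumed: sqrt of area)
    by, bx = boundary_points(gt)
    i = np.random.randint(len(by))                   # random boundary point
    r = max(int(np.random.uniform(*alpha) * k), 1)   # focus scope r = alpha * k
    # Offset the patch center away from the boundary point (assumed: offset proportional to r).
    cy = int(np.clip(by[i] + np.random.uniform(*beta) * r, 0, H - 1))
    cx = int(np.clip(bx[i] + np.random.uniform(*beta) * r, 0, W - 1))
    # Square crop from (c - r) to (c + r), clipped to the image bounds.
    y0, y1 = max(cy - r, 0), min(cy + r, H)
    x0, x1 = max(cx - r, 0), min(cx + r, W)
    return image[y0:y1, x0:x1], gt[y0:y1, x0:x1], (cy, cx), r

The coarse mask would then be obtained by randomly dilating and eroding the cropped ground truth, and the center point plus 0 to 3 extra positive/negative points would be turned into click maps, as described above.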
Focus Scope Calculation
In the inference phase for the focus view, how to choose the focus scope is of great significance for the refinement.
We find that these local clicks can still have a certain effect in the global view, although they are insufficient for detail refinement.
Therefore, by comparing the current and previous predictions, the influence scope of the current point can be estimated.
According to the size of varied prediction areas and the object, we can decide whether to dive into a focus view around this point.
The above process is based on the situation that the user clicks on the area where the prediction is wrong.
In practice, the users sometimes click in the area where the prediction is already correct, e.g., they put positive clicks on the predicted foreground for refining small components or negative clicks on the predicted background to constrain the boundary.
For this situation, we always go through a focus view with the focus scope set to the distance between the click and the previous prediction boundary. Because our cropping is based on a square, we use the Chebyshev distance in the practical calculation.
The function η is defined to calculate the Chebyshev distance between points a and b:
η(a, b) = max(|a_x − b_x|, |a_y − b_y|).
Algorithm 2 shows the process. The default λ and ω are 0.2 and 1.75.
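The sketch below is not a reproduction of Algorithm 2; only the Chebyshev distance η and the correct-region case follow the description above, while the use of λ as an area-ratio threshold and ω as a scaling factor on the estimated influence scope are our assumptions.

import numpy as np

def chebyshev(a, b):
    # eta(a, b) = max(|a_x - b_x|, |a_y - b_y|)
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

def boundary_points(mask):
    # Foreground pixels with at least one background 4-neighbour.
    fg = mask.astype(bool)
    pad = np.pad(fg, 1, constant_values=False)
    nb = (~pad[:-2, 1:-1]) | (~pad[2:, 1:-1]) | (~pad[1:-1, :-2]) | (~pad[1:-1, 2:])
    return np.where(fg & nb)

def focus_scope(click, P, P_prev, lam=0.2, omega=1.75):
    # click: (y, x); P, P_prev: binary global predictions after/before this click.
    # Returns a focus scope in pixels, or None to stay with the global view only.
    changed = P.astype(bool) != P_prev.astype(bool)
    ys, xs = np.nonzero(changed)
    if len(ys) == 0:
        # The click landed in an already-correct region: the scope is the Chebyshev
        # distance from the click to the previous prediction boundary.
        by, bx = boundary_points(P_prev)
        if len(by) == 0:
            return None
        return int(np.min(np.maximum(np.abs(by - click[0]), np.abs(bx - click[1]))))
    # Otherwise estimate the click's influence scope from the changed region.
    reach = int(np.max(np.maximum(np.abs(ys - click[0]), np.abs(xs - click[1]))))
    if changed.sum() > lam * max(int(P.sum()), 1):
        return None   # change is large relative to the object (assumed use of lambda)
    return int(omega * reach)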
Progressive Focus Strategy
For our focus view, the smaller the focus scope is, the more detailed information may be focused on.
Based on this, we propose Progressive Focus Strategy (PFS), which gradually focuses on areas that need to be repaired more.
This is different from the traditional multi-scale approach: the scale changes dynamically according to the variation between the previous and current patch predictions.
Each time a new prediction is obtained, part of it is used as the next input of the progressive focus view.
We show this iterative process in Algorithm 3. The default T is set to 3, ω̂ to 1.1, and ε to 2.
The standard PFS iteratively makes use of the current prediction to repair the next patch, so these iterative steps cannot be executed in parallel.
Therefore, we also propose a fast version to alleviate this problem and improve the speed by sacrificing a little performance.
For each turn, we use 0.8 times the previous focus scope as the current one.
At the same time, the previous prediction of the cropped patch comes from the original global prediction. In this way, the three turns can be conducted in parallel, accelerating the calculation process.
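A hedged sketch of this fast PFS variant, reusing crop_patch and paste_back from the earlier inference sketch; refine_patch stands in for a forward pass of the shared network and is an assumed callable, not the paper's API.

def fast_pfs(image, global_pred, click, r0, refine_patch, turns=3, shrink=0.8):
    # Fast variant: every turn crops its previous prediction from the ORIGINAL
    # global prediction, so the `turns` refinements are independent and could
    # run in parallel; here they are executed sequentially for clarity.
    cy, cx = click
    r = float(r0)
    patches = []
    for _ in range(turns):
        ri = max(int(r), 1)
        patch_img = crop_patch(image, cy, cx, ri)
        patch_prev = crop_patch(global_pred, cy, cx, ri)
        local_pred = refine_patch(patch_img, patch_prev)   # assumed network call
        patches.append((local_pred, cy, cx, ri))
        r *= shrink                                        # 0.8 x the previous focus scope
    return paste_back(global_pred, patches)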