SOLO: Segmenting Objects by Locations
{Instance segmentation, Location category}
0. Motivation, Objective, and Related Works:
Motivation:
Instance segmentation is challenging because it requires the correct separation of all objects in an image while also semantically segmenting each instance at the pixel level. Objects in an image belong to a fixed set of semantic categories, but the number of instances varies. As a result, semantic segmentation can be easily formulated as a dense per-pixel classification problem, while it is challenging to predict instance labels directly following the same paradigm.
To overcome this obstacle, recent instance segmentation methods can be categorized into two groups, i.e., top-down and bottom-up paradigms. The former approach, namely ‘detect-then-segment’, first detects bounding boxes and then segments the instance mask within each bounding box. The latter approach learns an affinity relation, assigning an embedding vector to each pixel, by pushing away pixels belonging to different instances and pulling close pixels in the same instance. A grouping post-processing step is then needed to separate instances. Both paradigms are step-wise and indirect: the former relies heavily on accurate bounding-box detection, while the latter depends on per-pixel embedding learning and grouping post-processing.
In contrast, we aim to directly segment instance masks, under the supervision of full instance mask annotations instead of masks in boxes or additional pixel pairwise relations. We start by rethinking a question: What are the fundamental differences between object instances in an image? Take the challenging MS COCO dataset [16] for example. There are in total 36,780 objects in the validation subset, and 98.3% of object pairs have a center distance greater than 30 pixels. As for the remaining 1.7% of object pairs, 40.5% of them have a size ratio greater than 1.5×. To conclude, in most cases two instances in an image either have different center locations or have different object sizes. This observation makes one wonder whether we could directly distinguish instances by their center locations and object sizes.
In the closely related field of semantic segmentation, the now-dominant paradigm leverages a fully convolutional network (FCN) to output dense predictions with N channels. Each output channel is responsible for one of the semantic categories (including background). Semantic segmentation aims to distinguish different semantic categories. Analogously, in this work, we propose to distinguish object instances in the image by introducing the notion of “instance categories”, i.e., the quantized center locations and object sizes, which enables us to segment objects by their locations, hence the name of our method, SOLO.
Locations An image can be divided into a grid of S×S cells, thus leading to S² center location classes. According to the coordinates of the object center, an object instance is assigned to one of the grid cells as its center location category. Note that grids are used conceptually to assign a location category to each pixel. Each output channel is responsible for one of the center location categories, and the corresponding channel map should predict the instance mask of the object belonging to that location. Thus, structural geometric information is naturally preserved in the spatial matrix with dimensions of height by width. Unlike DeepMask [24] and TensorMask [4], which run in a dense sliding-window manner and segment an object in a fixed local patch, our method naturally outputs accurate masks for instances of all scales without the limitation of (anchor) box locations and scales.
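To make the mapping concrete, the following minimal Python sketch (ours, not the released code; the helper name center_to_cell is hypothetical) shows how an object center is quantized into one of the S² location categories:

    # Minimal sketch: mapping an object's center to one of the S*S
    # location categories. `center_to_cell` is an illustrative helper.
    def center_to_cell(cx, cy, img_w, img_h, S=12):
        """Return the (i, j) grid cell and flat category k for center (cx, cy)."""
        i = min(int(cy / img_h * S), S - 1)  # row index from the y coordinate
        j = min(int(cx / img_w * S), S - 1)  # column index from the x coordinate
        return i, j, i * S + j               # k = i * S + j, one of S^2 classes

    # e.g. an object centered at (400, 300) in a 640x480 image on a 12x12 grid
    print(center_to_cell(400, 300, 640, 480))  # -> (7, 7, 91)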
In essence, an instance location category approximates the location of the object center of an instance. Thus, classifying each pixel into its instance location category is equivalent to predicting the object center of each pixel in the latent space. The importance of converting the location prediction task into classification is that classification makes it much more straightforward to model a varying number of instances with a fixed number of channels, while not relying on post-processing such as grouping or embedding learning.
Sizes To distinguish instances of different object sizes, we employ the feature pyramid network (FPN) [14] to assign objects of different sizes to different levels of feature maps. Thus, all the object instances are separated regularly, enabling us to classify objects by “instance categories”. Note that FPN was designed for the purpose of detecting objects of different sizes in an image.
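As an illustration of how instances can be routed to pyramid levels by object scale, consider the sketch below; the scale ranges here are assumptions chosen for illustration, not values quoted from the paper:

    # Illustrative sketch: assigning an instance to FPN levels by its scale.
    # SCALE_RANGES is an assumed, illustrative configuration; ranges overlap,
    # so one instance may be assigned to more than one level.
    import math

    SCALE_RANGES = [(1, 96), (48, 192), (96, 384), (192, 768), (384, 2048)]

    def fpn_levels_for_instance(mask_h, mask_w):
        """Return the pyramid levels responsible for an instance of this size."""
        scale = math.sqrt(mask_h * mask_w)  # characteristic object scale
        return [lvl for lvl, (lo, hi) in enumerate(SCALE_RANGES)
                if lo <= scale <= hi]

    print(fpn_levels_for_instance(120, 80))  # mid-sized object -> [1, 2]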
In the sequel, we empirically show that FPN is one of the core components of our method and has a profound impact on the segmentation performance, especially when objects of varying sizes are present.
Objectives:
We present a new, embarrassingly simple approach to instance segmentation.
Compared to many other dense prediction tasks, e.g., semantic segmentation, it is the arbitrary number of instances that has made instance segmentation much more challenging.
In order to predict a mask for each instance, mainstream approaches either follow the “detect-then-segment” strategy (e.g., Mask R-CNN), or predict embedding vectors first and then use clustering techniques to group pixels into individual instances.
We view the task of instance segmentation from a completely new perspective by introducing the notion of “instance categories”, which assigns categories to each pixel within an instance according to the instance’s location and size, thus nicely converting instance segmentation into a single-shot classification-solvable problem.
We demonstrate a much simpler and more flexible instance segmentation framework with strong performance, achieving accuracy on par with Mask R-CNN and outperforming recent single-shot instance segmenters.
We hope that this simple and strong framework can serve as a baseline for many instance-level recognition tasks besides instance segmentation.
Code is available at https://git.io/AdelaiDet
With the proposed SOLO framework, we are able to optimize the network in an end-to-end fashion for the instance segmentation task using mask annotations solely, and to perform pixel-level instance segmentation free of the restrictions of local box detection and pixel grouping. For the first time, we demonstrate a very simple instance segmentation approach achieving results on par with the dominant “detect-then-segment” method on the challenging COCO dataset [16], with its diverse scenes and semantic classes. Additionally, we showcase the generality of our framework via the task of instance contour detection: by viewing instance edge contours as a one-hot binary mask, SOLO can generate reasonable instance contours with almost no modification. Since the proposed SOLO only needs to solve two pixel-level classification tasks, it may be possible to borrow some of the recent advances in semantic segmentation to improve SOLO. The embarrassing simplicity and strong performance of the proposed SOLO method suggest its applicability to a wide range of instance-level recognition tasks.
Related Works:
Top-down Instance Segmentation.
The methods that segment object instances within a prior bounding box fall into the typical top-down paradigm. FCIS [13] assembles the position-sensitive score maps within the regions of interest (ROIs) generated by a region proposal network (RPN) to predict instance masks. Mask R-CNN [9] extends the Faster R-CNN detector [25] by adding a branch for segmenting the object instances within the detected bounding boxes. Building on Mask R-CNN, PANet [19] further enhances the feature representation to improve accuracy, and Mask Scoring R-CNN [10] adds a mask-IoU branch to predict the quality of the predicted masks and scores them to improve performance. HTC [2] interweaves box and mask branches for joint multi-stage processing. TensorMask [4] adopts the dense sliding-window paradigm to segment the instance in a local window for each pixel, with a predefined number of windows and scales. In contrast to the top-down methods above, our SOLO is totally box-free, thus not being restricted by (anchor) box locations and scales, and naturally benefits from the inherent advantages of FCNs.
Bottom-up Instance Segmentation.
This category of approaches generates instance masks by grouping pixels into an arbitrary number of object instances present in an image. In [22], pixels are grouped into instances using learned associative embeddings. A discriminative loss function [7] learns pixel-level instance embeddings efficiently, by pushing away pixels belonging to different instances and pulling close pixels in the same instance. SGN [18] decomposes the instance segmentation problem into a sequence of sub-grouping problems. SSAP [8] learns a pixel-pair affinity pyramid, i.e., the probability that two pixels belong to the same instance, and sequentially generates instances by a cascaded graph partition. Bottom-up methods typically lag behind top-down methods in accuracy, especially on datasets with diverse scenes. Instead of exploiting pixel-pairwise relations, SOLO learns directly from instance mask annotations alone during training, and predicts instance masks end-to-end without grouping post-processing.
Direct Instance Segmentation.
To our knowledge, no prior method directly trains with mask annotations solely and predicts instance masks and semantic categories in one shot without the need for grouping post-processing. Several recently proposed methods may be viewed as a ‘semi-direct’ paradigm. AdaptIS [26] first predicts point proposals, and then sequentially generates the mask for the object located at each detected point proposal. PolarMask [28] proposes to use a polar representation to encode masks, transforming per-pixel mask prediction into distance regression. Neither requires bounding boxes for training, but they are either step-wise or founded on a compromise, e.g., a coarse parametric representation of masks. Our SOLO takes an image as input and directly outputs instance masks and the corresponding class probabilities, in a fully convolutional, box-free and grouping-free paradigm.
1. Method - SOLO:
1.1 Problem Formulation
The central idea of the SOLO framework is to reformulate instance segmentation as two simultaneous category-aware prediction problems. Concretely, our system divides the input image into a uniform S×S grid. If the center of an object falls into a grid cell, that grid cell is responsible for 1) predicting the semantic category and 2) segmenting that object instance.
Semantic Category For each grid cell, our SOLO predicts a C-dimensional output indicating the semantic class probabilities, where C is the number of classes. These probabilities are conditioned on the grid cell. If we divide the input image into an S×S grid, the output space will be S×S×C, as shown in Figure 2 (top). This design is based on the assumption that each cell of the S×S grid must belong to one individual instance, and thus only to one semantic category. During inference, the C-dimensional output indicates the class probability for each object instance.
Instance Mask In parallel with the semantic category prediction, each positive grid cell also generates the corresponding instance mask. For an input image I, if we divide it into an S×S grid, there will be at most S² predicted masks in total. We explicitly encode these masks in the third dimension (channel) of a 3D output tensor. Specifically, the instance mask output will have dimensions H_I×W_I×S². The k-th channel is responsible for segmenting the instance at grid cell (i, j), where k = i · S + j (with i and j zero-based). In this way, a one-to-one correspondence is established between the semantic category and the class-agnostic mask (Figure 2).
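A minimal sketch of the two target tensors and the k = i · S + j correspondence, under assumed shapes (illustrative code, not the released implementation):

    # Sketch of the two output spaces: a category label per grid cell, and
    # one mask channel per cell (S^2 channels). Shapes are assumptions.
    import torch

    S, C, H, W = 12, 80, 104, 104  # grid size, classes, mask resolution
    cate_target = torch.zeros(S, S, dtype=torch.long)  # category per cell
    mask_target = torch.zeros(S * S, H, W)             # one mask per cell

    def assign_instance(i, j, cls_id, inst_mask):
        """Write one instance: category at cell (i, j), binary mask into
        channel k = i * S + j of the S^2-channel mask tensor."""
        k = i * S + j
        cate_target[i, j] = cls_id
        mask_target[k] = inst_mask  # inst_mask: (H, W) binary tensor

    assign_instance(7, 7, cls_id=17, inst_mask=torch.ones(H, W))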
A direct approach to predicting the instance mask is to adopt fully convolutional networks, like the FCNs in semantic segmentation [20]. However, conventional convolutional operations are spatially invariant to some degree. Spatial invariance is desirable for some tasks such as image classification, as it introduces robustness. Here, however, we need a model that is spatially variant, or, more precisely, position-sensitive, since our segmentation masks are conditioned on the grid cells and must be separated by different feature channels.
Our solution is very simple: at the beginning of the network, we directly feed normalized pixel coordinates to the network, inspired by the ‘CoordConv’ operator [17]. Specifically, we create a tensor of the same spatial size as the input that contains pixel coordinates normalized to [−1, 1]. This tensor is then concatenated to the input features and passed to the following layers. By simply giving the convolution access to its own input coordinates, we add spatial functionality to the conventional FCN model. It should be noted that CoordConv is not the only choice; for example, the semi-convolutional operators [23] may be competent, but we employ CoordConv for its simplicity and ease of implementation. If the original feature tensor is of size H×W×D, the size of the new tensor becomes H×W×(D + 2), in which the last two channels are the x-y pixel coordinates. For more information on CoordConv, we refer readers to [17].
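A minimal CoordConv-style layer in PyTorch might look as follows; this sketch follows the idea of [17] but is our own illustration, not the authors' implementation:

    # Sketch of a CoordConv-style layer: concatenate normalized x/y
    # coordinate maps to the features before a standard convolution.
    import torch
    import torch.nn as nn

    class CoordConv(nn.Module):
        def __init__(self, in_ch, out_ch, **kwargs):
            super().__init__()
            self.conv = nn.Conv2d(in_ch + 2, out_ch, **kwargs)  # D -> D + 2

        def forward(self, x):
            b, _, h, w = x.shape
            ys = torch.linspace(-1, 1, h, device=x.device)  # normalized [-1, 1]
            xs = torch.linspace(-1, 1, w, device=x.device)
            gy, gx = torch.meshgrid(ys, xs, indexing="ij")  # (H, W) maps
            coords = torch.stack([gx, gy]).expand(b, -1, -1, -1)  # (B, 2, H, W)
            return self.conv(torch.cat([x, coords], dim=1))

    layer = CoordConv(256, 256, kernel_size=3, padding=1)
    out = layer(torch.randn(2, 256, 64, 64))  # -> (2, 256, 64, 64)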
Forming Instance Segmentation. In SOLO, the category prediction and the corresponding mask are naturally associated by their reference grid cell, i.e., k = i · S + j. Based on this, we can directly form the final instance segmentation result for each grid cell. The raw instance segmentation results are generated by gathering all grid results. Finally, non-maximum suppression (NMS) is used to obtain the final instance segmentation results. No other post-processing operations are needed.
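A rough sketch of this inference procedure (hypothetical helpers, using a simple greedy mask NMS; the released pipeline may differ in details):

    # Sketch: gather per-cell predictions, keep confident cells, then
    # suppress duplicates with a greedy NMS based on mask IoU.
    import torch

    def _mask_iou(a, b):
        inter = (a & b).sum().item()
        union = (a | b).sum().item()
        return inter / union if union else 0.0

    def solo_inference(cate_pred, mask_pred, score_thr=0.1, iou_thr=0.5):
        """cate_pred: (S*S, C) class probabilities;
        mask_pred: (S*S, H, W) mask probabilities (after sigmoid)."""
        scores, labels = cate_pred.max(dim=1)
        keep = scores > score_thr  # drop low-confidence grid cells
        scores, labels = scores[keep], labels[keep]
        masks = mask_pred[keep] > 0.5  # binarize the kept masks

        order = scores.argsort(descending=True)
        kept = []
        for idx in order.tolist():  # greedy NMS over binary masks
            if all(_mask_iou(masks[idx], masks[j]) < iou_thr for j in kept):
                kept.append(idx)
        return masks[kept], labels[kept], scores[kept]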