[SANET] SANet: A Slice-Aware Network for Pulmonary Nodule Detection
Jie Mei, Ming-Ming Cheng, Gang Xu, Lan-Ruo Wan, and Huan Zhang
Paper:
Even experienced doctors find it hard to pick out pulmonary nodules from a large number of CT slices.
Existing nodule datasets are limited in scale and category.
Introduce PN9, the largest and most diverse dataset for pulmonary nodule detection (8,798 CT scans and 40,439 annotated nodules from 9 common classes).
Propose a slice-aware network (SANet) for pulmonary nodule detection.
A slice grouped non-local (SGNL) module is developed to capture long-range dependencies among any positions and any channels of one slice group in the feature map.
A 3D region proposal network generates pulmonary nodule candidates with high sensitivity, although this detection stage usually produces many false positives.
A false positive reduction (FPR) module is proposed that uses the multi-scale feature maps.
2D methods integrate 2D proposals into 3D proposals through post-processing, which is inefficient and may hurt the accuracy of nodule detection.
Ding et al. [26]: introduce a deconvolutional structure into Faster R-CNN for candidate detection on axial slices.
Setio et al. [24]: multi-view ConvNets for pulmonary nodule detection.
The outputs from multiple 2D ConvNets are combined using a dedicated fusion method to get the final results.
Ding et al. [26]
Setio et al. [24]
Dou et al. [27] propose a method employing 3D CNNs for nodule detection from CT scans, and introduce an effective strategy encoding multilevel contextual information to deal with the large variations and hard mimics of lung nodules.
In [39], a 3D CNN with an encoder-decoder structure is developed for pulmonary nodule detection. It also adopts a dynamically scaled cross-entropy to reduce the false positive rate and the squeeze-and-excitation structure to fully utilize channel inter-dependency.
Zhu et al. [23] propose a 3D Faster R-CNN with 3D dual-path blocks for nodule detection and a U-Net-like [40] architecture to effectively learn nodule features.
Liao et al. [7] adopt a 3D RPN to detect pulmonary nodules. They introduce a leaky noisy-OR gate to evaluate the cancer probabilities by selecting the top five nodules based on the detection confidences.
[41] proposes a novel multi-scale gradual integration CNN to learn features of multi-scale inputs with a gradual feature extraction strategy, which reduces many false positives.
[42] presents an end-to-end probabilistic diagnostic system, which contains a Computer-Aided Detection (CADe) module for detecting suspicious lung nodules and a Computer-Aided Diagnosis (CADx) module for patient-level malignancy classification.
Harsono et al. [43] propose I3DR-Net, a lung nodule detection and classification model that combines the I3D backbone with RetinaNet and a modified FPN framework.
Song et al. [44] develop a 3D centerpoints matching detection network (CPM-Net) for pulmonary nodule detection. It automatically predicts the position and aspect ratio of nodules without the manual design of anchor parameters. (CenterNet)
1) Multilevel Contextual 3-D CNNs [27]
2) DeepSEED [39]
3) 3D Faster R-CNN with 3D dual-path blocks [23]
4) 3D Deep Leaky Noisy-or Network [7]
5) multi-scale gradual integration CNN [41]
6) end-to-end probabilistic diagnostic system [42]
7) I3D backbone with RetinaNet and modified FPN [43].
8) 3D centerpoints matching detection network (CPM-Net) [44]
SANet is a two-stage model consisting of four parts:
Encoder-Decoder architecture.
Slice grouped non-local module (SGNL).
3D Region proposal.
False-positive reduction module.
Training CT scans (PN9 dataset): 6,707 in total = 6,037 for training + 670 for validation.
Clipping HU range: [-1200; 600].
Transform the HU range linearly into [0; 255].
Crop a 3D patch p as the model input: 128x128x128x1 (Depth x Height x Width x Channel). If a patch extends beyond the CT volume, the out-of-range region is padded with the value 170 (the luminance of common tissue, which can be distinguished from pulmonary nodules).
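The three preprocessing steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names `normalize_hu` and `crop_patch`, the rounding to `uint8`, and the (z, y, x) start-corner convention are assumptions; only the HU range, the [0, 255] target range, the patch size, and the padding value come from the notes.

```python
import numpy as np

HU_MIN, HU_MAX = -1200.0, 600.0   # clipping range from the notes
PAD_VALUE = 170                   # padding value from the notes
PATCH = 128                       # patch edge length from the notes


def normalize_hu(volume_hu):
    """Clip HU values to [HU_MIN, HU_MAX] and map them linearly to [0, 255]."""
    clipped = np.clip(np.asarray(volume_hu, dtype=np.float32), HU_MIN, HU_MAX)
    scaled = (clipped - HU_MIN) / (HU_MAX - HU_MIN) * 255.0
    return np.round(scaled).astype(np.uint8)  # rounding is an assumption


def crop_patch(volume, start, size=PATCH, pad_value=PAD_VALUE):
    """Crop a size^3 patch whose corner is `start` (z, y, x).

    Voxels that fall outside the volume keep the padding value.
    """
    patch = np.full((size, size, size), pad_value, dtype=volume.dtype)
    src, dst = [], []
    for s, dim in zip(start, volume.shape):
        lo = max(s, 0)
        hi = max(min(s + size, dim), lo)  # empty range if fully outside
        src.append(slice(lo, hi))
        dst.append(slice(lo - s, hi - s))
    patch[tuple(dst)] = volume[tuple(src)]
    return patch[..., np.newaxis]  # add channel axis: (D, H, W, 1)
```

With a full-size scan, `crop_patch(normalize_hu(scan), (z0, y0, x0))` yields a 128x128x128x1 array ready to feed to the network.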
Backbone: (3D ResNet-50) + Neck.
Generate pulmonary nodule candidates (with high sensitivity but many false positives).
Reduce the number of false positives among the nodule candidates and generate the final results.
Problem: Standard 3D ResNet50 Encoders are good for feature extraction but can struggle with the small, variable size of lung nodules in CT images.
Solution:
A U-shaped encoder-decoder architecture [40] built on 3D ResNet50.
The decoder network consists of two 2x2x2 deconvolution layers for up-sampling the feature maps.
The output feature map of each deconvolution layer is concatenated with the corresponding encoder output, whose channel count is adjusted by a 1x1x1 convolutional layer.
The feature maps produced are defined as {Mres1; Mres2; Mres3; Mres4; Mres5; Mde1; Mde2}, respectively.
To generate pulmonary nodule candidates, a 3x3x3 convolutional layer is employed over the concatenated feature map Mde2.
The 3x3x3 convolution is followed by two parallel 1x1x1 convolutional layers:
Regressing the 3D bounding box of each voxel (i.e., Reg Layer in Fig. 2).
Predicting classification probability (i.e., Cls Layer in Fig. 2).
Five anchors with sizes 5, 10, 20, 30, and 50. Each anchor is specified by six regression parameters: central z-, y-, x-coordinates, depth, height, and width.
The multitask loss function LRPN:
Lcls: weighted binary cross-entropy loss.
Lreg: smooth L1 loss [34].
i is the index of an anchor in one 3D patch.
Ncls and Nreg are the numbers of anchors considered for computing classification loss and regression loss, respectively.
λ is a parameter used to balance the two losses.
pi is the predicted probability of the ith anchor being a nodule; pi* is 1 if the anchor is positive and 0 otherwise.
IoU > 0.5 with a ground-truth box, or the highest IoU overlap with a ground-truth box: pi* is assigned a positive label.
IoU < 0.02 with all ground-truth boxes: pi* is considered negative.
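The labelling rule above can be sketched as follows. This is an illustration under stated assumptions: boxes are axis-aligned and given as (z, y, x, d, h, w) with center plus size, anchors that are neither positive nor negative are marked -1 and ignored (the notes do not say what happens to them), at least one ground-truth box is present, and the names `iou_3d` and `assign_labels` are hypothetical.

```python
import numpy as np


def iou_3d(box_a, box_b):
    """3D IoU of two axis-aligned boxes given as (z, y, x, d, h, w)."""
    a, b = np.asarray(box_a, float), np.asarray(box_b, float)
    lo = np.maximum(a[:3] - a[3:] / 2, b[:3] - b[3:] / 2)
    hi = np.minimum(a[:3] + a[3:] / 2, b[:3] + b[3:] / 2)
    inter = np.prod(np.clip(hi - lo, 0, None))  # zero if boxes do not overlap
    union = np.prod(a[3:]) + np.prod(b[3:]) - inter
    return inter / union


def assign_labels(anchors, gt_boxes, pos_thr=0.5, neg_thr=0.02):
    """Return 1 (positive), 0 (negative) or -1 (ignored, an assumption) per anchor."""
    ious = np.array([[iou_3d(a, g) for g in gt_boxes] for a in anchors])
    best = ious.max(axis=1)
    labels = np.full(len(anchors), -1)
    labels[best < neg_thr] = 0        # below 0.02 with all ground-truth boxes
    labels[best > pos_thr] = 1        # above 0.5 with some ground-truth box
    labels[ious.argmax(axis=0)] = 1   # highest-IoU anchor for each ground-truth box
    return labels
```

The last assignment runs after the thresholds so that every ground-truth box keeps at least one positive anchor even when no anchor exceeds 0.5.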
ti is a vector denoting the predicted 6 parameterised coordinates for nodule position, and ti* is the ground-truth vector.
x, y, z, w, h, and d represent the predicted box's center coordinates, width, height, and depth.
x*, y*, z*, w*, h*, and d* are the parameters for the ground-truth box.
xi, yi, zi, wi, hi, and di denote the parameters of the anchor box.
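The loss and the box parameterisation themselves are not reproduced in these notes. The symbol definitions above match the usual Faster R-CNN-style form extended to 3D, so they presumably read as below; this is a reconstruction under that assumption, not a copy of the paper's equations. Note that the subscript i on the anchor parameters follows the notes and is separate from the summation index.

```latex
L_{RPN} = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
        + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)

% Box parameterisation relative to the anchor (x_i, y_i, z_i, w_i, h_i, d_i):
t_x = (x - x_i) / w_i, \quad t_y = (y - y_i) / h_i, \quad t_z = (z - z_i) / d_i
t_w = \log(w / w_i),   \quad t_h = \log(h / h_i),   \quad t_d = \log(d / d_i)

% The ground-truth vector t_i^* uses the same formulas
% with x^*, y^*, z^*, w^*, h^*, d^* in place of x, y, z, w, h, d.
```

The factor pi* in the second term means the regression loss is counted only for positive anchors.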
Motivation
The candidate detection stage is introduced to detect nodule candidates with high sensitivity, which usually carries many false positives.
Some thoracic tissues, such as nodular-like structures, mediastinal structures, large vessels, and scarring, are often found as false positives.
Objective: Propose a false positive reduction module.
Method: Using the multi-scale feature maps
Cropping the feature maps Mres1; Mres2; Mde2 using nodule candidates, we obtain three regions of interest (RoI) of different scales: Rres1; Rres2; Rde2.
Rde2 is up-sampled and concatenated with Rres2; the result is up-sampled again and concatenated with Rres1.
The final RoI is passed through 3D max pooling and then two fully connected (FC) layers that produce the classification probability and bounding-box regression offsets.
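The fusion order described above can be sketched in NumPy. Several things here are assumptions made only for illustration: each scale is taken to be exactly 2x the previous one, up-sampling is nearest-neighbour, RoIs are cubic with channels first (C, D, H, W), the pooled size is 2, and the two FC layers are left out. The function names are hypothetical.

```python
import numpy as np


def upsample2x(roi):
    """Nearest-neighbour 2x up-sampling of a (C, D, H, W) array."""
    for axis in (1, 2, 3):
        roi = np.repeat(roi, 2, axis=axis)
    return roi


def max_pool_to(roi, out_size):
    """3D max pooling of a cubic (C, S, S, S) RoI down to (C, out, out, out)."""
    c, s = roi.shape[0], roi.shape[1]
    k = s // out_size  # pooling window; assumes S is divisible by out_size
    blocks = roi.reshape(c, out_size, k, out_size, k, out_size, k)
    return blocks.max(axis=(2, 4, 6))


def fuse_rois(r_res1, r_res2, r_de2, out_size=2):
    """Fuse three RoIs coarse-to-fine and return a flat feature vector."""
    x = np.concatenate([upsample2x(r_de2), r_res2], axis=0)  # match Rres2 scale
    x = np.concatenate([upsample2x(x), r_res1], axis=0)      # match Rres1 scale
    return max_pool_to(x, out_size).reshape(-1)              # input to the FC layers
```

The flat vector returned by `fuse_rois` is what the two FC layers would consume to produce the class probability and the box offsets.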
Loss function: Same as the 3D RPN.
Optimizer: Stochastic gradient descent (SGD).
Batch size: 16
Epochs: 200
Learning rate: 0.01 (< 100 epochs); 0.001 (100~160 epochs); 0.0001 (> 160 epochs)
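The step schedule can be written as a small function. The notes do not say which stage the boundary epochs 100 and 160 belong to, so assigning both to the middle stage is an assumption, as is the name `learning_rate`.

```python
def learning_rate(epoch):
    """Step learning-rate schedule from the notes (boundary handling assumed)."""
    if epoch < 100:
        return 0.01
    if epoch <= 160:   # epochs 100..160 inclusive: an assumption
        return 0.001
    return 0.0001
```

In a training loop this value would be set on the SGD optimizer at the start of each of the 200 epochs.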
Motivation:
In thoracic CT images, vessels and bronchi are continuous pipe-like structures, while nodules are usually isolated and spherical.
To distinguish nodules from other tissues, doctors need to view multiple consecutive slices to capture the correlation among them.
Objective: building on the non-local module in [57], the SGNL learns explicit correlations among any elements across slices.
Physical meaning:
A nodule usually exists in several consecutive slices, and utilising all depths to detect the nodule is unnecessary. ==> Group operation.
The slice grouping operation captures the similarity between any positions and any channels in one group, which improves the discrimination of nodules of different sizes using the information within one slice group.
X: input feature map for the non-local module.
D, H, W, and C represent depth, height, width, and the number of channels. (In the figure above, D corresponds to T and C = 1024.)
The original non-local operation in [57] is defined as:
θ, ϕ, g are implemented by 1x1x1 convolution and can be written as:
Wθ, Wϕ, and Wg are weight matrices to be learned.
The function f ( . ) computes the similarity (correlation) between all locations in the feature map.
The dot product is probably the simplest choice.
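The two equations referred to above are missing from these notes. Based on the definitions given (θ, ϕ, g as 1x1x1 convolutions with weights Wθ, Wϕ, Wg, and f as a pairwise similarity), they presumably take the following form; this is a reconstruction, with X assumed to be reshaped to a DHW x C matrix, and the equation numbers of the paper are not reproduced.

```latex
% Original non-local operation (reconstructed):
Y = f\big(\theta(X), \phi(X)\big) \, g(X)

% 1x1x1 convolutions written as matrix products:
\theta(X) = X W_\theta, \quad \phi(X) = X W_\phi, \quad g(X) = X W_g

% Dot-product similarity (the simplest choice of f):
f\big(\theta(X), \phi(X)\big) = \theta(X) \, \phi(X)^{\top}
```

With the dot product, f is a DHW x DHW matrix: one similarity value for every pair of positions.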
The original non-local module can capture long-range dependencies among any positions in the feature map.
==> The affinity between any channels is also important for discriminating the fine-grained objects.
Consider cross-channel information in the original non-local operation to model long-range dependencies among any positions and any channels.
Capture long-range dependencies among any positions and any channels of one slice group in the feature map.
Reshape the output of Eq. (5) by merging the channel into position.
The response Y is computed as:
where vec denotes that it is a vector after the reshape operation.
This requires a DHWC x DHWC pairwise matrix, so the computational cost is much higher ==> not feasible.
==> Group the depth dimension D into G groups (e.g., G = 4), each containing D' = D/G depths of the feature map. Each group is processed independently by Eq. (7) to compute Y', and the results are concatenated along the depth dimension to obtain Y.
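The grouping idea can be sketched in NumPy on a tiny feature map. This is a naive illustration, not the authors' implementation: it builds the full pairwise matrix inside each group, uses the dot product for f, and divides by the number of elements as a normalisation (an assumption borrowed from common non-local practice). The name `sgnl_naive` and the channels-last layout (D, H, W, C) are also assumptions.

```python
import numpy as np


def sgnl_naive(x, w_theta, w_phi, w_g, groups):
    """Slice-grouped non-local response for x of shape (D, H, W, C).

    w_theta, w_phi, w_g have shape (C, C') and play the role of the
    1x1x1 convolutions. Channels are merged into positions per group.
    """
    d, h, w, _ = x.shape
    assert d % groups == 0, "depth must be divisible by the number of groups"
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g   # each (D, H, W, C')
    c_out = theta.shape[-1]
    step = d // groups                                # D' = D / G
    outputs = []
    for k in range(groups):
        sl = slice(k * step, (k + 1) * step)
        t = theta[sl].reshape(-1)                     # vec: length D'*H*W*C'
        p = phi[sl].reshape(-1)
        v = g[sl].reshape(-1)
        affinity = np.outer(t, p)                     # pairwise matrix of one group
        y = affinity @ v / t.size                     # normalisation is an assumption
        outputs.append(y.reshape(step, h, w, c_out))
    return np.concatenate(outputs, axis=0)            # concatenate along depth
```

Each group builds a (D'HWC')^2 matrix instead of one (DHWC')^2 matrix, so the total pairwise storage drops by a factor of G; with G = 4 that is a 4x reduction.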
The SGNL operation in Eq. (7) is wrapped into the SGNL block, which is defined as
where Wz represents a 1x1x1 convolutional layer, and BN is a Batch Normalization [64].
“concatenate” denotes that all groups are concatenated along the depth dimension.
The residual connection “+ X” makes the SGNL compatible with the existing neural network blocks.
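The response and the block definition are not reproduced in these notes. From the descriptions above (vec as the reshape that merges channels into positions, Y' as the per-group response, Wz as a 1x1x1 convolution, BN, concatenation along depth, and the residual "+ X"), they presumably read as follows; this is a reconstruction, not a copy of the paper's equations.

```latex
% Response with channels merged into positions (reconstructed):
Y = f\big(\theta^{vec}(X), \phi^{vec}(X)\big) \, g^{vec}(X)

% SGNL block: per-group response Y', projection, normalisation,
% concatenation over the G groups, and residual connection:
Z = \mathrm{concatenate}\big(\mathrm{BN}(Y' W_z)\big) + X
```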
For the configuration of the SGNL block, 5 blocks are added to 3D ResNet50 following [57]: 2 blocks on res3 and 3 blocks on res4, placed at every other residual block.
PN9: the largest and most diverse dataset for pulmonary nodule detection.
Contains 8,798 CT scans and 40,439 annotated nodules of 9 different classes.
Free-Response Receiver Operating Characteristic (FROC) is the official evaluation metric of the LUNA16 dataset [30], which is defined as the average recall rate at 0.125, 0.25, 0.5, 1, 2, 4, and 8 false positives per scan.
True positive: a nodule candidate located within a distance R of the center of any nodule in the reference standard, where R is the radius of that reference nodule.
False positive: a nodule candidate not located within the range of any reference nodule.
FROCIoU: a candidate counts as a true positive if its 3D Intersection over Union with any reference nodule exceeds a threshold (0.25 in the experiments).
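The FROC score described above can be sketched as follows. This is a simplified illustration, not the official LUNA16 evaluation script: it assumes each true-positive candidate matches a distinct reference nodule, treats a distance equal to R as a hit, and uses hypothetical names (`is_hit`, `froc_score`). For FROCIoU the distance test would be replaced by a 3D IoU test against the 0.25 threshold.

```python
import numpy as np

FP_RATES = (0.125, 0.25, 0.5, 1, 2, 4, 8)  # false positives per scan


def is_hit(candidate_center, nodule_center, nodule_radius):
    """Distance criterion: candidate lies within radius R of the nodule center."""
    diff = np.asarray(candidate_center, float) - np.asarray(nodule_center, float)
    return np.linalg.norm(diff) <= nodule_radius  # '<=' is an assumption


def froc_score(scores, hits, num_nodules, num_scans, fp_rates=FP_RATES):
    """Average recall at the given false-positives-per-scan rates.

    scores: confidence of each candidate; hits: True for true positives.
    """
    order = np.argsort(-np.asarray(scores, float))   # most confident first
    hits = np.asarray(hits, bool)[order]
    tp = np.cumsum(hits)                             # true positives so far
    fp = np.cumsum(~hits)                            # false positives so far
    recalls = []
    for rate in fp_rates:
        allowed = fp <= rate * num_scans             # prefix within the FP budget
        recalls.append(tp[allowed].max() / num_nodules if allowed.any() else 0.0)
    return float(np.mean(recalls))
```

For example, four candidates with hits `[True, False, True, False]` over 4 scans and 2 reference nodules reach recall 0.5 at 0.125 FPs/scan and 1.0 at every higher rate.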
3D mean Average Precision (mAP) as the detection evaluation metric.
AP@0.25 (AP at 3D IoU = 0.25).
AP@0.35 (AP at 3D IoU = 0.35).
APs (AP for small nodules: size 0-5 mm, volume < 512).
APm (AP for medium nodules: size 5-10 mm, 512 < volume < 4096).
APl (AP for large nodules: size > 10 mm, volume > 4096).
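The three size categories can be expressed as a small helper. The notes do not state the unit of the volume thresholds or which bucket the exact boundary values 512 and 4096 belong to; putting them in the larger bucket is an assumption, as is the name `size_bucket`.

```python
def size_bucket(volume):
    """Map a nodule volume to the AP size category used in the notes."""
    if volume < 512:
        return "small"    # APs
    if volume < 4096:     # boundary handling is an assumption
        return "medium"   # APm
    return "large"        # APl
```

Note that 512 = 8^3 and 4096 = 16^3, so the thresholds correspond to cubes with edge lengths 8 and 16.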
Group convolution ideas: Xception [60], MobileNet [61], ResNeXt [62], and Group normalisation [63]