Loss Functions for Semantic Segmentation
Paper: https://www.semanticscholar.org/reader/b8601c86905b0184b9387b042400609febb93d10
Code: https://github.com/shruti-jadon/Semantic-Segmentation-Loss-Functions
Loss functions for medical image segmentation methods [Link]
Loss functions are one of the important ingredients in deep learning-based medical image segmentation methods.
In the past four years, more than 20 loss functions have been proposed for various segmentation tasks. Most of them can be used in any segmentation task in a plug-and-play way.
We present a systematic taxonomy to sort existing loss functions into four meaningful categories. This helps to reveal links and fundamental similarities between them.
Moreover, we implement all the loss functions in PyTorch. The code and references are publicly available here.
Distribution-based Loss: derived from cross-entropy, measuring the discrepancy between the predicted and ground-truth label distributions.
Region-based Loss: aims to minimize the mismatch or maximize the overlap regions between ground truth and predicted segmentation.
Boundary-based Loss: a newer type of loss function that aims to minimize the distance between the ground-truth and predicted segmentation boundaries. To make training more robust, boundary-based losses are usually combined with region-based losses.
Compound Loss: combines several of the above losses (e.g., Dice + CE).
Cross Entropy (CE)
Weighted Cross-Entropy (WCE)
Balanced Cross-Entropy* (BCE)
TopK loss
Focal loss
Distance penalized CE loss
Binary Cross-entropy is defined as a measure of the difference between two probability distributions for a given random variable or set of events.
It is widely used for classification objectives, and as segmentation is pixel-level classification, it works well.
The positive examples are weighted by a coefficient β.
β adjusts the trade-off between false negatives and false positives:
To reduce the number of false negatives, set β > 1.
To reduce the number of false positives, set β < 1.
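A minimal PyTorch sketch of this weighting (the name weighted_bce_loss and the default beta are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def weighted_bce_loss(pred, target, beta=2.0):
    """Weighted BCE: the loss on positive pixels is scaled by beta.

    pred:   raw logits, shape (N, 1, H, W)
    target: binary ground truth (float), same shape
    """
    # pos_weight > 1 penalizes false negatives more; < 1 penalizes false positives more
    return F.binary_cross_entropy_with_logits(
        pred, target, pos_weight=torch.tensor(beta)
    )
```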
Balanced Cross-entropy (BCE)
Similar to weighted-cross entropy.
Weighs both positive as well as negative examples by β and 1 − β, respectively.
Here, β is defined as 1 − y/(H∗W), i.e., one minus the fraction of positive pixels in the image.
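A sketch, assuming a binary target of shape (N, 1, H, W) and computing β over the whole batch rather than per image:

```python
import torch.nn.functional as F

def balanced_bce_loss(pred, target):
    """Balanced BCE: beta = 1 - (positive pixels / total pixels).

    Positives are weighted by beta, negatives by (1 - beta).
    pred: logits; target: binary ground truth (float), same shape.
    """
    beta = 1.0 - target.mean()  # fraction of negative pixels in the batch
    weights = beta * target + (1.0 - beta) * (1.0 - target)
    return F.binary_cross_entropy_with_logits(pred, target, weight=weights)
```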
Top-K Loss
Instead of considering the loss for all pixels equally, Top-K loss selects the K pixels with the highest loss values (i.e., the most uncertain predictions) and computes the loss only on those pixels.
This forces the model to pay more attention to the areas where it's making the biggest errors, leading to faster learning and better generalization.
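A possible PyTorch sketch, where k is the fraction of pixels kept (the function name and default are illustrative):

```python
import torch
import torch.nn.functional as F

def topk_ce_loss(pred, target, k=0.1):
    """Top-K cross-entropy: average only over the k fraction of pixels
    with the highest per-pixel loss.

    pred:   logits, shape (N, C, H, W)
    target: class indices, shape (N, H, W)
    """
    pixel_loss = F.cross_entropy(pred, target, reduction="none").flatten()
    num_kept = max(1, int(k * pixel_loss.numel()))
    topk_vals, _ = torch.topk(pixel_loss, num_kept)
    return topk_vals.mean()
```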
Focal Loss
Adapts the standard CE to deal with extreme foreground-background class imbalance, where the loss assigned to well-classified examples is reduced.
Works best with highly-imbalanced datasets.
Focal Loss proposes to down-weight easy examples and focus training on hard negatives using a modulating factor (1 − p_t)^γ.
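A sketch of the binary case with the usual α-balancing term (names and defaults are illustrative):

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(pred, target, alpha=0.25, gamma=2.0):
    """Binary focal loss: per-pixel BCE scaled by (1 - p_t)^gamma so that
    easy, well-classified pixels contribute little to the total loss."""
    ce = F.binary_cross_entropy_with_logits(pred, target, reduction="none")
    p = torch.sigmoid(pred)
    p_t = p * target + (1 - p) * (1 - target)             # probability of the true class
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```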
Aims to guide the network’s focus toward hard-to-segment boundary regions.
Distance maps are defined as the distance between the ground truth and predicted map. There are two ways to combine distance maps:
By adding a reconstruction head alongside the segmentation head in the network architecture.
By making it a loss function.
DPCE: introduces distance-based penalization, meaning the cross-entropy between the predicted probability (p) and the true label (y) is weighted by a distance map, so errors are penalized according to their distance from the ground-truth boundary. It is defined as follows:
Here, ϕ is created from the distance maps.
∘ denotes the Hadamard (element-wise) product.
The constant 1 is added to avoid the gradient vanishing problem in U-net and V-net architectures.
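A sketch under the assumption that ϕ is a distance-to-boundary map computed from the ground-truth mask; the exact construction of ϕ in the paper may differ:

```python
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def distance_penalty_map(target_np):
    """One possible phi: distance of every pixel of a binary 2-D ground-truth
    mask to the object boundary (larger far from the boundary)."""
    return distance_transform_edt(target_np) + distance_transform_edt(1 - target_np)

def dpce_loss(pred, target, phi):
    """(1 + phi) ⊙ CE: the constant 1 keeps gradients alive where phi == 0."""
    ce = F.binary_cross_entropy_with_logits(pred, target, reduction="none")
    return ((1.0 + phi) * ce).mean()

# usage (hypothetical names): phi = torch.from_numpy(distance_penalty_map(gt_np)).float()
```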
Sensitivity-Specificity (SS) Loss
Dice Loss
IoU Loss
Tversky Loss
Focal Tversky Loss
Generalized Dice Loss
Penalty Loss
Log-Cosh Dice Loss
Inspired by the sensitivity and specificity metrics; used when there is more focus on true positives.
A weighted sum of a sensitivity term and a specificity term, each computed as a mean squared error (over foreground and background pixels, respectively).
To address class imbalance, SS weights the specificity term higher via the w parameter.
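A sketch following the description above (the parameter name w and its default are illustrative):

```python
def sensitivity_specificity_loss(pred, target, w=0.95, eps=1e-6):
    """Weighted sum of a sensitivity term (squared error on foreground pixels)
    and a specificity term (squared error on background pixels).

    pred:   probabilities in [0, 1]; target: binary ground truth, same shape
    w:      weight on the specificity term (higher w for imbalanced data)
    """
    sq_err = (target - pred) ** 2
    sensitivity = (sq_err * target).sum() / (target.sum() + eps)
    specificity = (sq_err * (1 - target)).sum() / ((1 - target).sum() + eps)
    return (1 - w) * sensitivity + w * specificity
```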
Directly optimizes the Dice Coefficient, the most commonly used segmentation evaluation metric.
As Dice Coefficient is non-convex in nature, it has been modified to make it more tractable.
α is a very small number used to ensure that the denominator of the expression is always different from 0.
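A sketch of the soft Dice loss with the smoothing term α:

```python
import torch

def soft_dice_loss(pred, target, alpha=1e-6):
    """Soft Dice loss: 1 - 2|P ∩ G| / (|P| + |G|); alpha keeps the
    denominator away from zero."""
    probs = torch.sigmoid(pred)            # logits -> probabilities
    intersection = (probs * target).sum()
    return 1 - (2 * intersection + alpha) / (probs.sum() + target.sum() + alpha)
```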
A.k.a Jaccard loss, similar to Dice loss, is also used to directly optimize the segmentation metric.
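A corresponding sketch for the soft IoU loss:

```python
import torch

def iou_loss(pred, target, eps=1e-6):
    """Soft IoU (Jaccard) loss: 1 - |P ∩ G| / |P ∪ G|."""
    probs = torch.sigmoid(pred)
    intersection = (probs * target).sum()
    union = probs.sum() + target.sum() - intersection
    return 1 - (intersection + eps) / (union + eps)
```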
Tversky Loss
Tversky Index (TI) is a generalization of Dice Coefficient. It adds a weight to FP (false positives) and FN (false negatives) with the help of β coefficient
Adds different weights to false positives and false negatives, unlike Dice loss, which weights FN and FP equally.
Tversky Coefficient:
When β = 1/2, TI reduces to the Dice Coefficient.
Tversky Loss:
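A sketch using a single β that splits the weight between FN and FP (one common parameterization; with β = 0.5 it reduces to Dice):

```python
import torch

def tversky_loss(pred, target, beta=0.7, eps=1e-6):
    """Tversky loss: Dice-like, but FN and FP get different weights."""
    probs = torch.sigmoid(pred)
    tp = (probs * target).sum()
    fp = (probs * (1 - target)).sum()
    fn = ((1 - probs) * target).sum()
    ti = (tp + eps) / (tp + beta * fn + (1 - beta) * fp + eps)
    return 1 - ti
```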
Applies the concept of Focal loss to focus on hard cases with low probabilities.
Focal Tversky loss also attempts to learn hard examples, such as small ROIs (regions of interest), with the help of a γ coefficient (range [1, 3]).
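A sketch building on the tversky_loss function above, using the (1 − TI)^(1/γ) form with γ in [1, 3]:

```python
def focal_tversky_loss(pred, target, beta=0.7, gamma=4.0 / 3.0):
    """Focal Tversky loss: (1 - TI) ** (1 / gamma), so hard examples with a
    low Tversky index dominate the loss (reuses tversky_loss above)."""
    return tversky_loss(pred, target, beta) ** (1.0 / gamma)
```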
Hybrid loss function: Combines the strengths of both Dice loss and log-cosh loss.
Dice loss: Effectively handles class imbalance and focuses on pixel-wise overlap.
Log-cosh loss: Smoother than Dice loss, reducing sensitivity to outliers and improving convergence.
The derivative of cosh(x) is sinh(x).
cosh(x) is unbounded; its range can go up to infinity.
So, to keep it in range, log space is used,
making the log-cosh function log(cosh(x)).
Using the chain rule, d/dx log(cosh(x)) = sinh(x)/cosh(x) = tanh(x).
tanh(x) is a continuous function with range [-1, 1].
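A sketch reusing the soft_dice_loss function from the Dice loss section above:

```python
import torch

def log_cosh_dice_loss(pred, target):
    """Log-cosh Dice: a smoothed Dice loss that is less sensitive to outliers."""
    return torch.log(torch.cosh(soft_dice_loss(pred, target)))
```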
The multi-class extension of Dice loss where the weight of each class is inversely proportional to the square of label frequencies.
pGD (penalty loss): adds penalty weights to FP and FN in the Generalized Dice loss.
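A sketch of the Generalized Dice loss for the multi-class case, assuming softmax probabilities and one-hot targets:

```python
def generalized_dice_loss(pred, target, eps=1e-6):
    """Generalized Dice loss: each class is weighted by the inverse square of
    its volume in the ground truth.

    pred:   softmax probabilities, shape (N, C, H, W)
    target: one-hot ground truth, same shape
    """
    dims = (0, 2, 3)                                    # sum over batch and spatial dims
    weights = 1.0 / (target.sum(dim=dims) ** 2 + eps)   # shape (C,)
    intersection = (weights * (pred * target).sum(dim=dims)).sum()
    denominator = (weights * (pred + target).sum(dim=dims)).sum()
    return 1 - 2 * intersection / (denominator + eps)
```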
Boundary loss
Hausdorff Distance loss
Shape Aware loss*
Aims to estimate Hausdorff Distance from the CNN output probability so as to learn to reduce HD directly.
Calculates the maximum distance between any point on one boundary and the closest point on the other boundary.
Directed Hausdorff Distance: (Maximum distance from any point in A to the closest point in B)
DHD(A, B) = max_{a ∈ A} min_{b ∈ B} d(a, b)
Symmetric Hausdorff Distance: (Combines distances in both directions)
HD(A, B) = max(DHD(A, B), DHD(B, A))
Specifically, HD can be estimated by the distance transform of ground truth and segmentation.
The loss tackles the non-convex nature of the distance metric by adding some variations.
Here d_G and d_S are the distance transforms of the ground truth and the segmentation; a sketch of this estimate follows the list of variants below.
Weakness: Sensitive to outliers and might over-penalize small segmentation errors.
Variants:
Average Hausdorff Distance: Reduces outlier sensitivity by averaging distances.
Modified Hausdorff Distance: Excludes farthest points to mitigate outlier effects.
Weighted Hausdorff Distance: Assigns different weights to points based on their importance.
Combining with Other Losses: Often used with cross-entropy or other losses for a balanced approach.
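A sketch of the distance-transform estimate described above, assuming binary masks of shape (N, 1, H, W); binarizing the prediction before the distance transform is a simplification:

```python
import numpy as np
import torch
from scipy.ndimage import distance_transform_edt

def _boundary_distance_map(mask):
    """Distance of every pixel of a binary 2-D mask to the mask boundary."""
    return distance_transform_edt(1 - mask) + distance_transform_edt(mask)

def hausdorff_dt_loss(pred, target, alpha=2.0):
    """HD-style loss estimated with distance transforms:
    (p - g)^2 weighted by d_G^alpha + d_S^alpha."""
    pred_bin = (pred.detach().cpu().numpy() > 0.5).astype(np.uint8)
    gt = target.detach().cpu().numpy().astype(np.uint8)

    dist = np.zeros_like(gt, dtype=np.float32)
    for n in range(gt.shape[0]):
        d_gt = _boundary_distance_map(gt[n, 0])
        d_seg = _boundary_distance_map(pred_bin[n, 0])
        dist[n, 0] = d_gt ** alpha + d_seg ** alpha

    dist = torch.from_numpy(dist).to(pred.device)
    return ((pred - target) ** 2 * dist).mean()
```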
Most loss functions work at the pixel level. Shape-aware loss, however, calculates the average Euclidean distance from points around the predicted segmentation curve to the ground-truth curve and uses it as a coefficient in the Cross-Entropy loss.
Variation of cross-entropy loss by adding a shape-based coefficient, used in cases of hard-to-segment boundaries.
E is considered to be a learned network mask similar to the training shapes.
Dice+CE
Dice+TopK
Dice+Focal
Exponential Logarithm loss*
Correlation Maximized Structural Similarity loss*
Used for mildly class-imbalanced datasets.
It attempts to leverage the flexibility of Dice loss for class imbalance while using cross-entropy for curve smoothing.
DL is Dice loss.
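A sketch of the Dice + CE combination (binary case), reusing soft_dice_loss from the Dice loss section:

```python
import torch.nn.functional as F

def dice_ce_loss(pred, target, ce_weight=1.0, dice_weight=1.0):
    """Compound Dice + cross-entropy loss."""
    ce = F.binary_cross_entropy_with_logits(pred, target)
    return ce_weight * ce + dice_weight * soft_dice_loss(pred, target)
```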
Focuses on less accurately predicted structures.
We can use: γ_cross = γ_Dice
Focuses on Segmentation Structure.
Used in cases of structural importance such as medical images.
Introduced a Structural Similarity Loss (SSL) to achieve a high positive linear correlation between the ground truth map and the predicted map.
It's divided into 3 steps:
Structure Comparison.
Cross-Entropy weight coefficient determination.
Mini-batch loss definition.
In the structure comparison step, the authors compute an e-coefficient that measures the degree of linear correlation between the GT and the prediction:
C4 is a stability factor, empirically set to 0.01.
μ_y and σ_y are the local mean and standard deviation of the GT y.
After computing the correlation coefficient e, the authors use it as a weighting coefficient in the Cross-Entropy loss, defined as follows:
This coefficient is then used to compute the CMSSL as follows:
And the loss for a mini-batch is defined as follows:
With the formulation above, the loss automatically ignores pixels that do not exhibit structural correlation.
Dice loss is not a convex function, so optimizing it directly can be unstable.
Dice loss is therefore usually combined with convex functions, such as the Cross-Entropy loss.