Loss Functions for Semantic Segmentation
Paper: https://www.semanticscholar.org/reader/b8601c86905b0184b9387b042400609febb93d10
Code: https://github.com/shruti-jadon/Semantic-Segmentation-Loss-Functions
Loss functions for medical image segmentation methods [Link]
Loss functions are one of the important ingredients in deep learning-based medical image segmentation methods.
In the past four years, more than 20 loss functions have been proposed for various segmentation tasks. Most of them can be used in any segmentation task in a plug-and-play way.
We present a systematic taxonomy to sort existing loss functions into four meaningful categories. This helps to reveal links and fundamental similarities between them.
Moreover, we implement all the loss functions in PyTorch. The code and references are publicly available here.
Distribution-based Loss: derived from cross-entropy, measuring the discrepancy between the predicted and ground-truth label distributions.
Region-based Loss: aims to minimize the mismatch or maximize the overlap regions between ground truth and predicted segmentation.
Boundary-based Loss: a newer type of loss function that aims to minimize the distance between the ground-truth and predicted segmentation boundaries. To make training more robust, boundary-based losses are usually combined with region-based losses.
Compound Loss: combines several of the above losses (e.g., Dice + CE).
Cross Entropy (CE)
Weighted Cross-Entropy (WCE)
Balanced Cross-Entropy* (BCE)
TopK loss
Focal loss
Distance penalized CE loss
Binary Cross-entropy is defined as a measure of the difference between two probability distributions for a given random variable or set of events.
It is widely used for classification objectives, and as segmentation is pixel-level classification, it works well.
The positive examples are weighted by a coefficient β.
β adjusts the trade-off between false negatives and false positives:
To reduce the number of false negatives, set β > 1.
To reduce the number of false positives, set β < 1.
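A minimal PyTorch sketch of this weighting (the name weighted_bce_loss and the default beta are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def weighted_bce_loss(pred, target, beta=2.0):
    """Weighted BCE: the loss on positive pixels is scaled by beta.

    pred:   raw logits, shape (N, 1, H, W)
    target: binary ground truth (float), same shape
    """
    # pos_weight > 1 penalizes false negatives more; < 1 penalizes false positives more
    return F.binary_cross_entropy_with_logits(
        pred, target, pos_weight=torch.tensor(beta)
    )
```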
Balanced Cross-entropy (BCE)
Similar to weighted-cross entropy.
Weighs both positive as well as negative examples by β and 1 − β, respectively.
Here, β is defined as 1 − y/(H∗W), i.e., one minus the fraction of positive pixels in the image.
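A sketch, assuming a binary target of shape (N, 1, H, W) and computing β over the whole batch rather than per image:

```python
import torch.nn.functional as F

def balanced_bce_loss(pred, target):
    """Balanced BCE: beta = 1 - (positive pixels / total pixels).

    Positives are weighted by beta, negatives by (1 - beta).
    pred: logits; target: binary ground truth (float), same shape.
    """
    beta = 1.0 - target.mean()  # fraction of negative pixels in the batch
    weights = beta * target + (1.0 - beta) * (1.0 - target)
    return F.binary_cross_entropy_with_logits(pred, target, weight=weights)
```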
Top-K Loss
Instead of considering the loss for all pixels equally, Top-K loss selects the K pixels with the highest loss values (i.e., the most uncertain predictions) and computes the loss only on those pixels.
This forces the model to pay more attention to the areas where it's making the biggest errors, leading to faster learning and better generalization.
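A possible PyTorch sketch, where k is the fraction of pixels kept (the function name and default are illustrative):

```python
import torch
import torch.nn.functional as F

def topk_ce_loss(pred, target, k=0.1):
    """Top-K cross-entropy: average only over the k fraction of pixels
    with the highest per-pixel loss.

    pred:   logits, shape (N, C, H, W)
    target: class indices, shape (N, H, W)
    """
    pixel_loss = F.cross_entropy(pred, target, reduction="none").flatten()
    num_kept = max(1, int(k * pixel_loss.numel()))
    topk_vals, _ = torch.topk(pixel_loss, num_kept)
    return topk_vals.mean()
```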
Focal Loss
Adapts the standard CE to deal with extreme foreground-background class imbalance, where the loss assigned to well-classified examples is reduced.
Works best with highly-imbalanced datasets.
Focal Loss proposes to down-weight easy examples and focus training on hard negatives using a modulating factor (1 − p_t)^γ.
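A sketch of the binary case with the usual α-balancing term (names and defaults are illustrative):

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(pred, target, alpha=0.25, gamma=2.0):
    """Binary focal loss: per-pixel BCE scaled by (1 - p_t)^gamma so that
    easy, well-classified pixels contribute little to the total loss."""
    ce = F.binary_cross_entropy_with_logits(pred, target, reduction="none")
    p = torch.sigmoid(pred)
    p_t = p * target + (1 - p) * (1 - target)             # probability of the true class
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```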
Aims to guide the network’s focus toward hard-to-segment boundary regions.
Distance maps are defined as the distance between the ground truth and predicted map. There are two ways to combine distance maps:
By adding a reconstruction head alongside the segmentation head in the network architecture.
By making it a loss function.
DPCE: introduces distance-based penalization, meaning the cross-entropy between the predicted probability (p) and the true label (y) is weighted by a distance map, so errors are penalized according to their distance from the ground-truth boundary. It is defined as follows:
Here, ϕ is created from the distance maps.
∘ denotes the Hadamard (element-wise) product.
The constant 1 is added to avoid the gradient vanishing problem in U-net and V-net architectures.
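A sketch under the assumption that ϕ is a distance-to-boundary map computed from the ground-truth mask; the exact construction of ϕ in the paper may differ:

```python
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def distance_penalty_map(target_np):
    """One possible phi: distance of every pixel of a binary 2-D ground-truth
    mask to the object boundary (larger far from the boundary)."""
    return distance_transform_edt(target_np) + distance_transform_edt(1 - target_np)

def dpce_loss(pred, target, phi):
    """(1 + phi) ⊙ CE: the constant 1 keeps gradients alive where phi == 0."""
    ce = F.binary_cross_entropy_with_logits(pred, target, reduction="none")
    return ((1.0 + phi) * ce).mean()

# usage (hypothetical names): phi = torch.from_numpy(distance_penalty_map(gt_np)).float()
```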
Sensitivity-Specificity (SS) Loss
Dice Loss
IoU Loss
Tversky Loss
Focal Tversky Loss
Generalized Dice Loss
Penalty Loss
Log-Cosh Dice Loss
Inspired by the sensitivity and specificity metrics; used when there is more focus on true positives.
A weighted sum of a sensitivity term and a specificity term, each computed as a mean squared error (over foreground and background pixels, respectively).
To address class imbalance, SS weights the specificity term higher via the w parameter.
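A sketch following the description above (the parameter name w and its default are illustrative):

```python
def sensitivity_specificity_loss(pred, target, w=0.95, eps=1e-6):
    """Weighted sum of a sensitivity term (squared error on foreground pixels)
    and a specificity term (squared error on background pixels).

    pred:   probabilities in [0, 1]; target: binary ground truth, same shape
    w:      weight on the specificity term (higher w for imbalanced data)
    """
    sq_err = (target - pred) ** 2
    sensitivity = (sq_err * target).sum() / (target.sum() + eps)
    specificity = (sq_err * (1 - target)).sum() / ((1 - target).sum() + eps)
    return (1 - w) * sensitivity + w * specificity
```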
Directly optimizes the Dice Coefficient, the most commonly used segmentation evaluation metric.
As Dice Coefficient is non-convex in nature, it has been modified to make it more tractable.
α is a very small number used to ensure that the denominator of the expression is always different from 0.
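A sketch of the soft Dice loss with the smoothing term α:

```python
import torch

def soft_dice_loss(pred, target, alpha=1e-6):
    """Soft Dice loss: 1 - 2|P ∩ G| / (|P| + |G|); alpha keeps the
    denominator away from zero."""
    probs = torch.sigmoid(pred)            # logits -> probabilities
    intersection = (probs * target).sum()
    return 1 - (2 * intersection + alpha) / (probs.sum() + target.sum() + alpha)
```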
A.k.a Jaccard loss, similar to Dice loss, is also used to directly optimize the segmentation metric.
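A corresponding sketch for the soft IoU loss:

```python
import torch

def iou_loss(pred, target, eps=1e-6):
    """Soft IoU (Jaccard) loss: 1 - |P ∩ G| / |P ∪ G|."""
    probs = torch.sigmoid(pred)
    intersection = (probs * target).sum()
    union = probs.sum() + target.sum() - intersection
    return 1 - (intersection + eps) / (union + eps)
```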
Tversky Loss
Tversky Index (TI) is a generalization of Dice Coefficient. It adds a weight to FP (false positives) and FN (false negatives) with the help of β coefficient
Adds different weights to false positives and false negatives, unlike Dice loss, which weights FN and FP equally.
Tversky Coefficient:
When β = 1/2, TI reduces to the Dice Coefficient.
Tversky Loss:
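A sketch using a single β that splits the weight between FN and FP (one common parameterization; with β = 0.5 it reduces to Dice):

```python
import torch

def tversky_loss(pred, target, beta=0.7, eps=1e-6):
    """Tversky loss: Dice-like, but FN and FP get different weights."""
    probs = torch.sigmoid(pred)
    tp = (probs * target).sum()
    fp = (probs * (1 - target)).sum()
    fn = ((1 - probs) * target).sum()
    ti = (tp + eps) / (tp + beta * fn + (1 - beta) * fp + eps)
    return 1 - ti
```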
Applies the concept of Focal loss to focus on hard cases with low probabilities.
Focal Tversky loss also attempts to learn hard examples, such as small ROIs (regions of interest), with the help of a γ coefficient (range [1, 3]).
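A sketch building on the tversky_loss function above, using the (1 − TI)^(1/γ) form with γ in [1, 3]:

```python
def focal_tversky_loss(pred, target, beta=0.7, gamma=4.0 / 3.0):
    """Focal Tversky loss: (1 - TI) ** (1 / gamma), so hard examples with a
    low Tversky index dominate the loss (reuses tversky_loss above)."""
    return tversky_loss(pred, target, beta) ** (1.0 / gamma)
```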
Hybrid loss function: Combines the strengths of both Dice loss and log-cosh loss.
Dice loss: Effectively handles class imbalance and focuses on pixel-wise overlap.
Log-cosh loss: Smoother than Dice loss, reducing sensitivity to outliers and improving convergence.
The derivative of cosh(x) is sinh(x).
cosh(x) is unbounded; its range can go up to infinity.
So, to keep it in range, log space is used,
making the log-cosh function log(cosh(x)).
Using the chain rule, d/dx log(cosh(x)) = sinh(x)/cosh(x) = tanh(x).
tanh(x) is a continuous function with range [-1, 1].
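A sketch reusing the soft_dice_loss function from the Dice loss section above:

```python
import torch

def log_cosh_dice_loss(pred, target):
    """Log-cosh Dice: a smoothed Dice loss that is less sensitive to outliers."""
    return torch.log(torch.cosh(soft_dice_loss(pred, target)))
```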
The multi-class extension of Dice loss where the weight of each class is inversely proportional to the square of label frequencies.
pGD (penalty loss): adds penalty weights to FP and FN in the Generalized Dice loss.
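A sketch of the Generalized Dice loss for the multi-class case, assuming softmax probabilities and one-hot targets:

```python
def generalized_dice_loss(pred, target, eps=1e-6):
    """Generalized Dice loss: each class is weighted by the inverse square of
    its volume in the ground truth.

    pred:   softmax probabilities, shape (N, C, H, W)
    target: one-hot ground truth, same shape
    """
    dims = (0, 2, 3)                                    # sum over batch and spatial dims
    weights = 1.0 / (target.sum(dim=dims) ** 2 + eps)   # shape (C,)
    intersection = (weights * (pred * target).sum(dim=dims)).sum()
    denominator = (weights * (pred + target).sum(dim=dims)).sum()
    return 1 - 2 * intersection / (denominator + eps)
```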
Boundary loss
Hausdorff Distance loss
Shape Aware loss*
Aims to estimate Hausdorff Distance from the CNN output probability so as to learn to reduce HD directly.
Calculates the maximum distance between any point on one boundary and the closest point on the other boundary.
Directed Hausdorff Distance: (Maximum distance from any point in A to the closest point in B)
DHD(A, B) = max_{a ∈ A} min_{b ∈ B} d(a, b)
Symmetric Hausdorff Distance: (Combines distances in both directions)
HD(A, B) = max(DHD(A, B), DHD(B, A))
Specifically, HD can be estimated by the distance transform of ground truth and segmentation.
The loss tackles the non-convex nature of the distance metric by adding some variations.
Here d_G and d_S are the distance transforms of the ground truth and the segmentation; a sketch of this estimate follows the list of variants below.
Weakness: Sensitive to outliers and might over-penalize small segmentation errors.
Variants:
Average Hausdorff Distance: Reduces outlier sensitivity by averaging distances.
Modified Hausdorff Distance: Excludes farthest points to mitigate outlier effects.
Weighted Hausdorff Distance: Assigns different weights to points based on their importance.
Combining with Other Losses: Often used with cross-entropy or other losses for a balanced approach.
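A sketch of the distance-transform estimate described above, assuming binary masks of shape (N, 1, H, W); binarizing the prediction before the distance transform is a simplification:

```python
import numpy as np
import torch
from scipy.ndimage import distance_transform_edt

def _boundary_distance_map(mask):
    """Distance of every pixel of a binary 2-D mask to the mask boundary."""
    return distance_transform_edt(1 - mask) + distance_transform_edt(mask)

def hausdorff_dt_loss(pred, target, alpha=2.0):
    """HD-style loss estimated with distance transforms:
    (p - g)^2 weighted by d_G^alpha + d_S^alpha."""
    pred_bin = (pred.detach().cpu().numpy() > 0.5).astype(np.uint8)
    gt = target.detach().cpu().numpy().astype(np.uint8)

    dist = np.zeros_like(gt, dtype=np.float32)
    for n in range(gt.shape[0]):
        d_gt = _boundary_distance_map(gt[n, 0])
        d_seg = _boundary_distance_map(pred_bin[n, 0])
        dist[n, 0] = d_gt ** alpha + d_seg ** alpha

    dist = torch.from_numpy(dist).to(pred.device)
    return ((pred - target) ** 2 * dist).mean()
```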
Most loss functions work at the pixel level. Shape-aware loss, however, calculates the average Euclidean distance from points around the predicted segmentation curve to the ground-truth curve and uses it as a coefficient in the Cross-Entropy loss.
Variation of cross-entropy loss by adding a shape-based coefficient, used in cases of hard-to-segment boundaries.
E is considered to be a learned network mask similar to the training shapes.
Dice+CE
Dice+TopK
Dice+Focal
Exponential Logarithm loss*
Correlation Maximized Structural Similarity loss*
Used for mildly class-imbalanced datasets.
It attempts to leverage the flexibility of Dice loss for class imbalance while using cross-entropy for curve smoothing.
DL is Dice loss.
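A sketch of the Dice + CE combination (binary case), reusing soft_dice_loss from the Dice loss section:

```python
import torch.nn.functional as F

def dice_ce_loss(pred, target, ce_weight=1.0, dice_weight=1.0):
    """Compound Dice + cross-entropy loss."""
    ce = F.binary_cross_entropy_with_logits(pred, target)
    return ce_weight * ce + dice_weight * soft_dice_loss(pred, target)
```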
Focuses on less accurately predicted structures.
We can use: γ_cross = γ_Dice
Focuses on Segmentation Structure.
Used in cases of structural importance such as medical images.
Introduced a Structural Similarity Loss (SSL) to achieve a high positive linear correlation between the ground truth map and the predicted map.
It's divided into 3 steps:
Structure Comparison.
Cross-Entropy weight coefficient determination.
Mini-batch loss definition.
In the structure comparison step, the authors compute an e-coefficient that measures the degree of linear correlation between the GT and the prediction:
C4 is a stability factor, empirically set to 0.01.
μ_y and σ_y are the local mean and standard deviation of the GT y.
After computing the correlation coefficient e, the authors use it as a weighting coefficient in the Cross-Entropy loss, defined as follows:
This coefficient is then used to compute the CMSSL as follows:
And the loss for a mini-batch is defined as follows:
With the formulation above, the loss automatically ignores pixels that do not exhibit structural correlation.
Dice loss is not a convex function, so optimizing it directly can be unstable.
Dice loss is therefore usually combined with convex functions, such as the Cross-Entropy loss.