0) Motivation, Object and Related works:
Motivation:
With the advent of deep learning, many dense prediction tasks, i.e. tasks that produce pixel-level predictions, have seen significant performance improvements.
Typical approach: learn these tasks in isolation, that is, a separate neural network is trained for each individual task.
Yet, recent multi-task learning (MTL) techniques have shown promising results w.r.t. performance, computations and/or memory footprint, by jointly tackling multiple tasks through a learned shared representation.
Objectives:
In this survey, we provide a well-rounded view on state-of-the-art deep learning approaches for MTL in computer vision, with an explicit emphasis on dense prediction tasks.
Contributions:
First, we consider MTL from a network architecture point-of-view. We include an extensive overview and discuss the advantages/disadvantages of recent popular MTL models.
Second, we examine various optimization methods to tackle the joint learning of multiple tasks. We summarize the qualitative elements of these works and explore their commonalities and differences.
Finally, we provide an extensive experimental evaluation across a variety of dense prediction benchmarks to examine the pros and cons of the different methods, including both architectural and optimization based strategies.
Introduction:
Multi-Task Learning (MTL) aims to improve generalization by leveraging domain-specific information contained in the training signals of related tasks. In the deep learning era, MTL translates to designing networks capable of learning shared representations from multi-task supervisory signals.
Compared to the single-task case, where each individual task is solved separately by its own network, such multi-task networks bring several advantages to the table.
First, due to their inherent layer sharing, the resulting memory footprint is substantially reduced.
Second, since they avoid repeatedly calculating the features in the shared layers, once for every task, they show increased inference speed.
Most importantly, they have the potential for improved performance if the associated tasks share complementary information, or act as a regularizer for one another.
Tackling multiple dense prediction tasks differs in several aspects from solving multiple classification tasks.
First, jointly learning multiple dense prediction tasks is governed by the use of different loss functions, unlike classification tasks that mostly use cross-entropy losses, so additional care is required to avoid a scenario where some tasks overwhelm the others during training.
Second, as opposed to image-level classification tasks, dense prediction tasks cannot be directly predicted from a shared global image representation [35], which renders the network design more difficult.
Third, pixel-level tasks in scene understanding often have similar characteristics [14], and these similarities can potentially be used to boost the performance under a MTL setup. A popular example is semantic segmentation and depth estimation [13].
Related works:
MTL has been the subject of several surveys [30], [31], [36], [37]
In [30], Caruana showed that MTL can be beneficial as it allows for the acquisition of inductive bias through the inclusion of related additional tasks into the training pipeline. The author showcased the use of MTL in artificial neural networks, decision trees and k-nearest neighbors methods, but this study is placed in the very early days of neural networks, rendering it outdated in the deep learning era.
Ruder [36] gave an overview of recent MTL techniques (e.g. [5], [6], [9], [19]) applied in deep neural networks.
In the same vein, Zhang and Yang [31] provided a survey that includes feature learning, low-rank, task clustering, task relation learning, and decomposition approaches for MTL. Yet, both works are literature review studies without an empirical evaluation or comparison of the presented techniques.
Gong et al. [37] benchmarked several optimization techniques (e.g. [8], [19]) across three MTL datasets. Still, the scope of this study is rather limited, and explicitly focuses on the optimization aspect.
=> All prior studies provide a general overview on MTL without giving specific attention to dense prediction tasks that are of utmost importance in computer vision.
Non-Deep Learning Methods:
These methods assumed that task parameters should lie close to each other w.r.t. some distance metric [38], [39], [40], [41], share a common probabilistic prior [42], [43], [44], [45], [46], or reside in a low-dimensional subspace [47], [48], [49] or manifold [50].
These assumptions work well when all tasks are related [38], [47], [51], [52], but can lead to performance degradation if information sharing happens between unrelated tasks.
The latter is a known problem in MTL, referred to as negative transfer. To mitigate this problem, some of these works opted to cluster tasks into groups based on prior beliefs about their similarity or relatedness.
Soft and Hard Parameter Sharing in Deep Learning: historically, deep multi-task architectures have been classified into hard or soft parameter sharing techniques.
In hard parameter sharing, the parameter set is divided into shared and task-specific parameters (see Figure 2a). MTL models using hard parameter sharing typically consist of a shared encoder that branches out into task-specific heads [19], [20], [22], [53], [54].
UberNet [55] was the first hard parameter sharing model to jointly tackle a large number of low-, mid-, and high-level vision tasks. The model featured a multi-head design across different network layers and scales. Still, the most characteristic hard parameter sharing design consists of a shared encoder that branches out into task-specific decoding heads [19], [20], [22], [53], [54]. Multilinear relationship networks [56] extended this design by placing tensor normal priors on the parameter set of the fully connected layers. In these works the branching points in the network are determined ad hoc, which can lead to suboptimal task groupings. To alleviate this issue, several recent works [9], [10], [11], [12] proposed efficient design procedures that automatically decide where to share or branch within the network. Similarly, stochastic filter groups [57] re-purposed the convolution kernels in each layer to support shared or task-specific behaviour.
In soft parameter sharing, each task is assigned its own set of parameters and a feature sharing mechanism handles the cross-task talk (see Figure 2b).
Cross-stitch networks [5] introduced soft parameter sharing in deep MTL architectures. The model uses a linear combination of the activations in every layer of the task-specific networks as a means for soft feature fusion. Sluice networks [6] extended this idea by additionally learning the selective sharing of layers, subspaces and skip connections. NDDR-CNN [7] incorporated dimensionality reduction techniques into the feature fusion layers. Differently, MTAN [8] used an attention mechanism to share a general feature pool amongst the task-specific networks. A concern with soft parameter sharing approaches is scalability, as the size of the multi-task network tends to grow linearly with the number of tasks.
Distilling Task Predictions in Deep Learning: a few recent works first employed a multi-task network to make initial task predictions, and then leveraged features from these initial predictions to further improve each task output – in a one-off or recursive manner.
PAD-Net [13] proposed to distill information from the initial task predictions of other tasks, by means of spatial attention, before adding it as a residual to the task of interest. JTRL [15] opted for sequentially predicting each task, with the intention to utilize information from the past predictions of one task to refine the features of another task at each iteration. PAP-Net [14] extended upon this idea, and used a recursive procedure to propagate similar cross-task and task-specific patterns found in the initial task predictions. To do so, they operated on the affinity matrices of the initial predictions, and not on the features themselves, as was the case before [13], [15]. Zhou et al. [17] refined the use of pixel affinities to distill the information by separating inter- and intra-task patterns from each other. MTI-Net [16] adopted a multi-scale multimodal distillation procedure to explicitly model the unique task interactions that happen at each individual scale.
A New Taxonomy of MTL Approaches: several recent works took inspiration from both groups of works to jointly solve multiple pixel-level tasks. As a consequence, it is debatable whether the soft vs hard parameter sharing paradigm should still be used as the main framework for classifying MTL architectures.
In this survey, we propose an alternative taxonomy that discriminates between different architectures on the basis of where the task interactions take place, i.e. locations in the network where information or features are exchanged or shared between tasks. The impetus for this framework was given in Section 2.1.3. Based on the proposed criterion, we distinguish between two types of models: encoder-focused and decoder-focused architectures. The encoder-focused architectures (see Figure 3a) only share information in the encoder, using either hard- or soft-parameter sharing, before decoding each task with an independent task-specific head. Differently, the decoder-focused architectures (see Figure 3b) also exchange information during the decoding stage. Figure 1 gives an overview of the proposed taxonomy, listing representative works in each case.
1) DEEP MULTI-TASK ARCHITECTURES:
1.1 Encoder-focused Architectures: (Figure 3a)
1.1.0 Strategy:
Share the task features in the encoding stage (off-the-shelf backbone network), before processing them with a set of independent task-specific heads.
This model relies on the encoder (i.e. backbone network) to learn a generic representation of the scene. The features from the encoder are then used by the task-specific heads to get the predictions for every task.
While this simple model shares the full encoder amongst all tasks, recent works have considered where and how the feature sharing should happen in the encoder
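To make this baseline concrete, the following is a minimal PyTorch-style sketch of an encoder-focused model under assumed choices: a torchvision ResNet-18 backbone, two illustrative dense tasks ('semseg' and 'depth'), and simple 1x1 convolutional heads with bilinear upsampling. None of these choices correspond to the exact configuration of any cited work.

```python
import torch
import torch.nn as nn
import torchvision

class EncoderFocusedMTL(nn.Module):
    """Shared backbone + independent task-specific heads (hard parameter sharing)."""

    def __init__(self, num_classes=21):
        super().__init__()
        # Off-the-shelf backbone shared by all tasks (illustrative choice).
        resnet = torchvision.models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool/fc
        feat_dim = 512

        # One lightweight head per task; each head only sees the shared features.
        self.heads = nn.ModuleDict({
            'semseg': nn.Conv2d(feat_dim, num_classes, kernel_size=1),
            'depth': nn.Conv2d(feat_dim, 1, kernel_size=1),
        })

    def forward(self, x):
        shared = self.backbone(x)  # generic representation of the scene
        out = {name: head(shared) for name, head in self.heads.items()}
        # Upsample the dense predictions back to the input resolution.
        out = {name: nn.functional.interpolate(
                   pred, size=x.shape[-2:], mode='bilinear', align_corners=False)
               for name, pred in out.items()}
        return out

# Usage: all task outputs are produced in a single forward pass.
model = EncoderFocusedMTL()
preds = model(torch.randn(2, 3, 224, 224))
print({k: v.shape for k, v in preds.items()})
```

Because the backbone computation is shared across tasks in a single forward pass, this design realizes the memory and inference-speed advantages mentioned above.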
1.1.1 Cross-Stitch Networks [5]:
Shared the activations amongst all single-task networks in the encoder.
Assume we are given two activation maps x_A, x_B at a particular layer, belonging to tasks A and B respectively. A learnable linear combination of these activation maps is applied, before feeding the transformed result x̃_A, x̃_B to the next layer of the single-task networks. The transformation is parameterized by learnable weights α, and can be expressed as:
[x̃_A, x̃_B]^T = [[α_AA, α_AB], [α_BA, α_BB]] · [x_A, x_B]^T
As illustrated in Figure 4, this procedure is repeated at multiple locations in the encoder. By learning the weights α, the network can decide the degree to which the features are shared between tasks. In practice, we are required to pre-train the single-task networks, before stitching them together, in order to maximize the performance. A disadvantage of cross-stitch networks is that the size of the network increases linearly with the number of tasks. Furthermore, it is not clear where the cross-stitch units should be inserted in order to maximize their effectiveness. Sluice networks [6] extended this work by also supporting the selective sharing of subspaces and skip connections.
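In code, the operation reduces to a small learnable mixing matrix applied at matching layers of the two single-task networks. Below is a minimal sketch of a cross-stitch unit; the near-identity initialization and the use of a single 2x2 matrix per unit (rather than, e.g., per-channel weights) are simplifying assumptions, not necessarily the exact configuration of [5].

```python
import torch
import torch.nn as nn

class CrossStitchUnit(nn.Module):
    """Learned linear mixing of two task-specific activation maps."""

    def __init__(self):
        super().__init__()
        # 2x2 mixing matrix alpha, initialized close to identity so each
        # task initially keeps mostly its own features.
        self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1],
                                                [0.1, 0.9]]))

    def forward(self, x_a, x_b):
        # [x̃_A, x̃_B]^T = alpha @ [x_A, x_B]^T, applied at every spatial location.
        x_a_new = self.alpha[0, 0] * x_a + self.alpha[0, 1] * x_b
        x_b_new = self.alpha[1, 0] * x_a + self.alpha[1, 1] * x_b
        return x_a_new, x_b_new

# Usage: stitch the activations of two single-task networks at a given layer.
unit = CrossStitchUnit()
x_a, x_b = torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56)
x_a_new, x_b_new = unit(x_a, x_b)
```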
1.1.2 Neural Discriminative Dimensionality Reduction (NDDR-CNN) [7]:
Used a similar architecture to cross-stitch networks (see Figure 4). However, instead of utilizing a linear combination to fuse the activations from all single-task networks, a dimensionality reduction mechanism is employed. First, features with the same spatial resolution in the single-task networks are concatenated channel-wise. Second, the number of channels is reduced by processing the features with a 1×1 convolutional layer, before feeding the result to the next layer. The convolutional layer allows activations to be fused across all channels. Differently, cross-stitch networks only allow the fusion of activations from channels that share the same index. The NDDR-CNN behaves as a cross-stitch network when the non-diagonal elements in the weight matrix of the convolutional layer are zero. Due to their similarity with cross-stitch networks, NDDR-CNNs are prone to the same problems. First, there is a scalability concern when dealing with a large number of tasks. Second, NDDR-CNNs involve additional design choices, since we need to decide where to include the NDDR layers. Finally, both cross-stitch networks and NDDR-CNNs only use limited local information (i.e. a small receptive field) when fusing the activations from the different single-task networks. We hypothesize that this is suboptimal because the use of sufficient context is very important during encoding – as already shown for the tasks of image classification [58] and semantic segmentation [59], [60], [61]. This is backed up by certain decoder-focused architectures in Section 2.3 that overcome the limited receptive field by predicting the tasks at multiple scales and by sharing the features repeatedly at every scale.
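For illustration, here is a minimal sketch of such a fusion layer for two tasks with equal channel widths: channel-wise concatenation followed by a 1×1 convolution per task. The BatchNorm/ReLU placement is a common-practice assumption rather than the exact recipe of [7].

```python
import torch
import torch.nn as nn

class NDDRLayer(nn.Module):
    """Channel-wise concatenation followed by 1x1 convolutions, acting as a
    learned dimensionality-reduction fusion across two single-task networks."""

    def __init__(self, channels):
        super().__init__()
        # One 1x1 convolution per task maps the concatenated 2*C channels back to C.
        self.reduce_a = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.reduce_b = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, x_a, x_b):
        fused = torch.cat([x_a, x_b], dim=1)  # concatenate along the channel axis
        return self.reduce_a(fused), self.reduce_b(fused)

# Usage: fuse same-resolution activations of two single-task networks.
layer = NDDRLayer(channels=64)
x_a, x_b = torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56)
x_a_new, x_b_new = layer(x_a, x_b)
```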
1.1.3 Multi-Task Attention Networks (MTAN) [8]:
Used a shared backbone network in conjunction with task-specific attention modules in the encoder. The shared backbone extracts a general pool of features. Then, each task-specific attention module selects features from the general pool by applying a soft attention mask. The attention mechanism is implemented using regular convolutional layers and a sigmoid non-linearity. Since the attention modules are small compared to the backbone network, the MTAN model does not suffer as severely from the scalability issues that are typically associated with cross-stitch networks and NDDR-CNNs. However, similar to the fusion mechanism in the latter works, the MTAN model can only use limited local information to produce the attention mask.
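A simplified sketch of one task-specific attention module is given below: a small convolutional block with a sigmoid produces a mask that gates the shared feature pool. The actual MTAN modules [8] also merge features from the preceding attention stage; that part is omitted here, so this should be read as a single-stage approximation.

```python
import torch
import torch.nn as nn

class TaskAttentionModule(nn.Module):
    """Soft attention mask over a shared feature pool (simplified MTAN-style)."""

    def __init__(self, channels):
        super().__init__()
        # Small convolutional block producing a per-pixel, per-channel mask in [0, 1].
        self.attention = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, shared_features):
        mask = self.attention(shared_features)
        return mask * shared_features  # task-specific selection of shared features

# Usage: each task owns one module, but all modules read the same shared pool.
shared = torch.randn(2, 256, 28, 28)
modules = nn.ModuleDict({t: TaskAttentionModule(256) for t in ['semseg', 'depth']})
task_features = {t: m(shared) for t, m in modules.items()}
```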
1.1.4 Branched Multi-Task Learning Networks:
1.2 Decoder-Focused Architectures: (Figure 3b)
1.2.0 Strategy:
The encoder-focused architectures described above directly predict all task outputs from the same input in one processing cycle (i.e. all predictions are generated once, in parallel or sequentially, and are not refined afterwards).
By doing so, they fail to capture commonalities and differences among tasks that are likely fruitful for one another (e.g. depth discontinuities are usually aligned with semantic edges).
Arguably, this might be the reason for the only moderate performance improvements achieved by the encoder-focused approaches to MTL (see Section 4.3.1).
To alleviate this issue, a few recent works first employed a multi-task network to make initial task predictions, and then leveraged features from these initial predictions in order to further improve each task output – in a one-off or recursive manner.
As these MTL approaches also share or exchange information during the decoding stage, we refer to them as decoder-focused architectures (see Figure 3b)
1.2.1 PAD-Net [13]
PAD-Net [13] was one of the first decoder-focused architectures. The model itself is visualized in Figure 6. As can be seen, the input image is first processed by an off-the-shelf backbone network. The backbone features are further processed by a set of task-specific heads that produce an initial prediction for every task. These initial task predictions add deep supervision to the network, but they can also be used to exchange information between tasks, as will be explained next. The task features in the last layer of the task-specific heads contain a per-task feature representation of the scene. PAD-Net proposed to re-combine them via a multi-modal distillation unit, whose role is to extract cross-task information, before producing the final task predictions.
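The sketch below illustrates a multi-modal distillation unit in this spirit: each task's features receive attention-gated residual messages computed from the other tasks' features. Details such as which features the attention map is computed from and the convolution widths are plausible assumptions for illustration, not the exact PAD-Net [13] design.

```python
import torch
import torch.nn as nn

class MultiModalDistillation(nn.Module):
    """Refine each task's features with attention-gated messages from the other tasks."""

    def __init__(self, tasks, channels):
        super().__init__()
        self.tasks = tasks
        # One message transform and one attention branch per (source, target) task pair.
        pairs = [(s, t) for t in tasks for s in tasks if s != t]
        self.transform = nn.ModuleDict({
            f'{s}_{t}': nn.Conv2d(channels, channels, 3, padding=1) for s, t in pairs})
        self.gate = nn.ModuleDict({
            f'{s}_{t}': nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())
            for s, t in pairs})

    def forward(self, feats):  # feats: dict task -> (B, C, H, W) features
        out = {}
        for tgt in self.tasks:
            refined = feats[tgt]
            for src in self.tasks:
                if src == tgt:
                    continue
                key = f'{src}_{tgt}'
                message = self.transform[key](feats[src])
                attention = self.gate[key](feats[src])    # gate values in [0, 1]
                refined = refined + attention * message   # residual cross-task message
            out[tgt] = refined
        return out

# Usage: refine the initial per-task representations before the final heads.
distill = MultiModalDistillation(['semseg', 'depth', 'normals'], channels=256)
feats = {t: torch.randn(2, 256, 28, 28) for t in ['semseg', 'depth', 'normals']}
refined = distill(feats)
```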
1.2.2 Pattern-Affinitive Propagation Networks (PAP-Net) [14]
1.2.3 Joint Task-Recursive Learning (JTRL) [15]
1.2.4 Multi-Scale Task Interaction Networks (MTI-Net) [16]
Fig. 9: The architecture in Multi-Scale Task Interaction Networks [16]. Starting from a backbone that extracts multi-scale features, initial task predictions are made at each scale. These task features are then distilled separately at every scale, allowing the model to capture unique task interactions at multiple scales, i.e. receptive fields. After distillation, the distilled task features from all scales are aggregated to make the final task predictions. To boost performance, a feature propagation module is included to pass information from lower resolution task features to higher ones.
1.3 Other Approaches:
A number of approaches that fall outside the aforementioned categories have been proposed in the literature. For example, multilinear relationship networks [56] placed tensor normal priors on the parameter set of the task-specific heads to allow interactions in the decoding stage. Different from the standard parallel ordering scheme, where layers are aligned and shared (e.g. [5], [7]), soft layer ordering [64] proposed a flexible sharing scheme across tasks and network depths. Yang et al. [65] generalized matrix factorisation approaches to MTL in order to learn cross-task sharing structures in every layer of the network. Routing networks [66] proposed a principled approach to determine the connectivity of a network's function blocks through routing. Piggyback [67] showed how to adapt a single, fixed neural network to a multi-task network by learning binary masks. Huang et al. [68] introduced a method rooted in Neural Architecture Search (NAS) for the automated construction of a tree-based multi-attribute learning network. Stochastic filter groups [57] re-purposed the convolution kernels in each layer of the network to support shared or task-specific behaviour. In a similar vein, feature partitioning [69] presented partitioning strategies to assign the convolution kernels in each layer of the network to different tasks. In general, these works have a different scope within MTL, e.g. automating the network architecture design. Moreover, they mostly focus on solving multiple (binary) classification tasks, rather than multiple dense prediction tasks. As a result, they fall outside the scope of this survey, with one notable exception that is discussed next.
Attentive Single-Tasking of Multiple Tasks (ASTMT) [18] proposed to take a 'single-tasking' route for the MTL problem. That is, within a multi-tasking framework they performed separate forward passes, one for each task, that activate shared responses among all tasks, plus some residual responses that are task-specific. Furthermore, to suppress the negative transfer issue they applied adversarial training at the gradient level, enforcing the gradients to be statistically indistinguishable across tasks. An advantage of this approach is that shared and task-specific information within the network can be naturally disentangled. On the negative side, however, the tasks cannot be predicted all at once, but only one after the other, which significantly increases the inference time and somewhat defeats the purpose of MTL.
2) OPTIMIZATION IN MTL: => task balancing problem
2.1 Task Balancing Approaches:
Uncertainty Weighting (see the sketch after this list)
Gradient Normalization
Dynamic Weight Averaging (DWA)
Dynamic Task Prioritization (DTP)
MTL as Multi-Objective Optimization
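As a concrete instance of a task balancing scheme, the sketch below implements homoscedastic uncertainty weighting in the commonly used log-variance form: each task loss is scaled by a learned precision exp(-log σ²), with a log-variance penalty that prevents the weights from collapsing to zero. The exact constants (e.g. the 1/2 factor) and the treatment of classification vs. regression losses vary between formulations, so this is an assumed simplification rather than the reference implementation.

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Weight each task loss by a learned (homoscedastic) uncertainty term."""

    def __init__(self, tasks):
        super().__init__()
        # One learnable log-variance per task, initialized to 0 (i.e. sigma^2 = 1).
        self.log_vars = nn.ParameterDict(
            {t: nn.Parameter(torch.zeros(())) for t in tasks})

    def forward(self, losses):  # losses: dict task -> scalar loss tensor
        total = 0.0
        for task, loss in losses.items():
            log_var = self.log_vars[task]
            # exp(-log_var) down-weights noisy/hard tasks; the log_var term
            # penalizes trivially inflating the uncertainty.
            total = total + torch.exp(-log_var) * loss + 0.5 * log_var
        return total

# Usage: optimize the log-variances jointly with the network parameters.
weighter = UncertaintyWeighting(['semseg', 'depth'])
losses = {'semseg': torch.tensor(1.3, requires_grad=True),
          'depth': torch.tensor(0.4, requires_grad=True)}
total_loss = weighter(losses)
```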
2.2 Other Approaches
References:
Multi-task:
[30] R. Caruana, "Multitask learning," Machine Learning, 1997.
[31] Y. Zhang and Q. Yang, "A survey on multi-task learning," arXiv, 2017.
[36] S. Ruder, "An overview of multi-task learning in deep neural networks," arXiv, 2017.
[37] T. Gong, T. Lee, C. Stephenson, V. Renduchintala, S. Padhy, A. Ndirango, ..., and O. H. Elibol, "A comparison of loss weighting strategies for multi task learning in deep neural networks," IEEE Access, 2019.
Encoder-focused Architectures:
[19]
[20]
[22]
[53]
[54]
Decoder-Focused Architectures:
PAD-Net [13]
Pattern-Affinitive Propagation Networks (PAP-Net) [14]
Joint Task-Recursive Learning (JTRL) [15]
Multi-Scale Task Interaction Networks (MTI-Net) [16]