Motivation:
Surface-defect detection: deep-learning methods have recently begun to be employed to address surface-defect detection problems in industrial quality control.
The lack of high-precision labelled data for learning: due to the need for large amounts of annotated data, many industrial problems cannot be easily solved, or the cost of the solutions increases significantly.
Objectives:
Proposed an end-to-end architecture composed of two sub-networks yielding defect segmentation and classification results (a minimal sketch follows this list).
Mixed supervision: learning from a combination of fully (pixel-level) and weakly (image-level) labelled samples.
Outperforms fully-supervised settings and weakly-supervised methods.
Demonstrate state-of-the-art results on all four datasets: KolektorSDD, DAGM, Severstal Steel Defect, and KolektorSDD2.
New dataset: KolektorSDD2.
Over 3000 images containing several types of defects, obtained while addressing a real-world industrial problem.
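A minimal PyTorch sketch of the two-sub-network idea described above; the layer widths, pooling choices and the exact way the two sub-networks are connected are my own assumptions for illustration, not the authors' architecture. A segmentation sub-network predicts a pixel-wise defect mask, and a classification sub-network pools the segmentation features and the mask into a per-image defect score.

```python
# Sketch of a two-sub-network model (segmentation + classification).
# Layer widths and pooling are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class TwoStageDefectNet(nn.Module):
    def __init__(self, in_channels=1):
        super().__init__()
        # Segmentation sub-network: downsampling conv backbone + 1-channel mask head.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 64, 5, stride=2, padding=2), nn.ReLU(),
        )
        self.seg_head = nn.Conv2d(64, 1, 1)  # pixel-wise defect logits
        # Classification sub-network: uses backbone features plus the predicted mask.
        self.cls_conv = nn.Sequential(
            nn.Conv2d(64 + 1, 32, 3, padding=1), nn.ReLU(),
        )
        self.cls_head = nn.Linear(32 + 1, 1)  # image-level defect logit

    def forward(self, x):
        feat = self.backbone(x)
        seg_logits = self.seg_head(feat)                    # B x 1 x h x w
        cls_feat = self.cls_conv(torch.cat([feat, seg_logits], dim=1))
        # Global max-pooling of classification features and of the mask itself.
        pooled_feat = cls_feat.amax(dim=(2, 3))             # B x 32
        pooled_mask = seg_logits.amax(dim=(2, 3))           # B x 1
        cls_logit = self.cls_head(torch.cat([pooled_feat, pooled_mask], dim=1))
        return seg_logits, cls_logit
```

Feeding both the backbone features and the segmentation output into the classification sub-network lets the image-level decision reuse the localization evidence, which is the motivation for the two-stage design.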
Related work
Fully supervised defect detection:
Several related works have explored the use of deep learning for industrial anomaly detection and categorisation [24, 18, 13, 42, 35, 27, 17, 38, 14].
Masci et al. [21] used a shallow network for steel defect classification.
Weimer et al. [40] presented a more comprehensive study of a modern deep-network architecture.
More recently, Kim et al. [14] used a VGG16 architecture pre-trained on general images for optical inspection of surfaces, while Wang et al. [38] applied a custom 11-layer network for the same task.
Racki et al. [27] further improved the efficiency of the patch-based processing from [40] with a fully convolutional architecture and proposed a two-stage network architecture, with a segmentation net for pixel-wise localization of the defect and a classification network for per-image defect detection.
In our more recent work [35], we performed an extensive study of the two-stage architecture with several additional improvements and showed state-of-the-art results that outperformed methods such as U-Net [28] and DeepLabv3 [6] on a real-world anomaly-detection problem.
We also extended this work and presented an end-to-end learning method for the two-stage approach [5]; however, this was still in a fully-supervised regime, without considering the task in the context of mixed or weakly supervised learning.
Dong et al. [10] also used the U-Net architecture but combined it with SVM for classification and random forests for detection.
Other recent approaches also explored lightweight networks [41, 18, 13, 20].
Lin et al. [18] used a compact multi-scale cascade CNN termed MobileNet-v2-dense for surface-defect detection.
Huang et al. [13] proposed an even more lightweight network using atrous spatial pyramid pooling (ASPP) and depthwise separable convolution.
Unsupervised learning
In unsupervised learning, annotations are not needed (and are not taken into account even when available); features are learned from a reconstruction objective [15, 7], an adversarial loss [12] or a similar self-supervised objective [8, 39, 45]. In unsupervised anomaly-detection solutions, the models are usually trained on non-anomalous images only, and anomalies are detected as out-of-distribution samples, i.e. as significant deviations in the feature space. Various methods based on this principle have been proposed, such as AnoGAN [31] and its successor f-AnoGAN [30], which utilize Generative Adversarial Networks, a deep-metric-learning approach with a triplet loss that learns features of non-anomalous samples [34], or an approach, termed Uninformed Students, that transfers a pre-trained discriminative latent embedding into a smaller network using knowledge transfer for out-of-distribution detection [4]. The latter achieved state-of-the-art results in unsupervised anomaly detection on the MVTec dataset [3], which, however, only partially reflects the complexity of real-world industrial examples.
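As a hedged illustration of the out-of-distribution principle above (a generic reconstruction-based sketch, not the pipeline of any specific cited method such as AnoGAN or Uninformed Students): an autoencoder is trained only on defect-free images, and the reconstruction error at test time serves as the anomaly score.

```python
# Illustrative sketch of reconstruction-based unsupervised anomaly detection:
# an autoencoder is trained only on defect-free images; large reconstruction
# errors at test time indicate anomalies. Architecture and threshold are assumptions.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_score(model, images):
    """Mean squared reconstruction error per image; high values indicate anomalies."""
    with torch.no_grad():
        recon = model(images)
    return ((images - recon) ** 2).mean(dim=(1, 2, 3))
```

A threshold on this score, calibrated on held-out defect-free images, then yields image-level anomaly decisions.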
Weakly supervised learning
Various weakly supervised deep-learning approaches have been developed in the context of semantic segmentation and object detection [26, 29, 2, 16, 37, 36]. In early applications, convolutional neural networks were trained with image-tags using Multiple Instance Learning (MIL) [26] or with constrained optimization as in Constrained CNN [25]. The approach by Saleh et al. [29] further used dense conditional random fields to generate foreground/background masks that act as priors on an object, while Bearman et al. [2] used a single-pixel point label of the object location instead of image-tags. Ge et al. [11] used a segmentation-aggregation framework learned from weakly annotated visual data and applied it to insulator detection on power transmission lines. Others utilized class activation maps (CAM) [46]. Zhu et al. [47] applied CAM to instance segmentation, while Diba et al. [9] simultaneously addressed image classification, object detection and semantic segmentation, where the CAM from image classification is used in a separate cascaded network to improve the last two tasks. Class activation maps have also been applied to anomaly detection. Lin et al. [17] addressed defect detection in LED chips, using CAM from the AlexNet architecture [46] to localize the defects while learning only from image-level labels. Zhang et al. [44] extended CAM to defect localization with bounding-box prediction in their proposed CADN model. Their model directly predicts the bounding boxes from category-aware heatmaps while also using knowledge distillation to reduce the complexity of the final inference model. However, neither method considers pixel-level labels in the learning process, thus failing to utilize this information when available.
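A short sketch of the class-activation-map (CAM) mechanism these weakly supervised methods build on (a generic formulation with an assumed toy backbone, not the exact networks of [17] or CADN [44]): the classifier is trained with image-level labels only, and the defect heatmap is obtained by weighting the last convolutional feature maps with the classification weights of the defect class.

```python
# Generic class-activation-map (CAM) computation for weakly supervised defect
# localization: the network is trained with image-level labels only; the heatmap
# is a weighted sum of the final conv feature maps using the classifier weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAMClassifier(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(                # assumed small backbone
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(64, num_classes)          # trained with image-level labels

    def forward(self, x):
        fmap = self.features(x)                       # B x 64 x H x W
        logits = self.fc(fmap.mean(dim=(2, 3)))       # global average pooling
        return logits, fmap

    def cam(self, fmap, class_idx):
        # Heatmap = sum_k w_k * fmap_k for the chosen class (e.g. "defective").
        weights = self.fc.weight[class_idx]           # shape: (64,)
        heatmap = torch.einsum("c,bchw->bhw", weights, fmap)
        return F.relu(heatmap)                        # keep positive evidence only
```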
Mixed supervision
Several related approaches have also considered learning with different precision of labels. Souly et al. [33] combined fully labeled segmentation masks with unlabeled images for pixel-wise semantic segmentation tasks. They train the model in an adversarial manner by generating images with a GAN and feed any provided weak, image-level labels to the GAN discriminator, which further improves the semantic segmentation. Mlynarski et al. [22] addressed the problem of segmenting brain tumors from magnetic resonance images; they proposed to use fully segmented images and combine them with weakly annotated, image-level information. They focus on the goal of segmenting brain-tumor images, while our primary concern is image-level anomaly detection in the industrial surface-defect-detection domain. They also do not perform any analysis of different mixtures of weakly and fully supervised learning, which is the central point of this paper.
To reduce the need for data:
Unsupervised methods [34, 4, 43] use only defect-free images to train models.
Weakly-supervised methods [17, 44, 47] utilize weak (image-level) labels and do not require pixel-level annotations.
=> Mixed-supervision mode with some fully-labeled samples and a number of weakly-labeled ones, as depicted in Fig. 1 (previously applied to other tasks [33, 22]); a minimal loss sketch follows.
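A minimal sketch of how such a mixed-supervision objective can be written down (my own hedged formulation with an assumed weighting factor, not the paper's exact loss): every sample contributes an image-level classification loss from its weak label, while the pixel-level segmentation loss is added only for the samples that come with full masks.

```python
# Sketch of a mixed-supervision loss: image-level (weak) supervision for every
# sample, pixel-level (full) supervision only where a mask is available.
# The weighting factor `lambda_seg` is an assumption for illustration.
import torch
import torch.nn.functional as F

def mixed_supervision_loss(seg_logits, cls_logit, image_label, mask=None, lambda_seg=1.0):
    """
    seg_logits : B x 1 x h x w  pixel-wise defect logits
    cls_logit  : B x 1          image-level defect logits
    image_label: B x 1          weak labels as floats (0 = ok, 1 = defective), always available
    mask       : B x 1 x h x w  pixel-level labels, or None for weakly labeled samples
    """
    # Weak (image-level) supervision: used for every sample.
    loss = F.binary_cross_entropy_with_logits(cls_logit, image_label)
    # Full (pixel-level) supervision: added only when a segmentation mask exists.
    if mask is not None:
        if seg_logits.shape[-2:] != mask.shape[-2:]:
            mask = F.interpolate(mask, size=seg_logits.shape[-2:], mode="nearest")
        loss = loss + lambda_seg * F.binary_cross_entropy_with_logits(seg_logits, mask)
    return loss
```

In a mini-batch mixing both label precisions, the segmentation term would simply be masked out for the weakly labeled samples.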