PCT: Point Cloud Transformer
0) Motivation, Objectives and Related Work:
Motivation:
The irregular domain and lack of ordering make it challenging to design deep neural networks for point cloud processing.
Objectives:
This paper presents a novel framework named Point Cloud Transformer (PCT) for point cloud learning. PCT is based on the Transformer, which has achieved great success in natural language processing and shows great potential in image processing. It is inherently permutation invariant when processing a sequence of points, making it well-suited for point cloud learning. To better capture local context within the point cloud, we enhance the input embedding with farthest point sampling and nearest neighbor search. Extensive experiments demonstrate that PCT achieves state-of-the-art performance on shape classification, part segmentation, semantic segmentation and normal estimation tasks.
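To make the input embedding stage concrete: farthest point sampling picks well-spread anchor points, and a nearest neighbor search groups the points around each anchor so local features can be aggregated. The following is a minimal NumPy sketch under assumed shapes; the function names (farthest_point_sampling, knn_group) and parameters are illustrative, not the paper's actual implementation.

```python
# Minimal sketch of FPS + kNN grouping; names and shapes are assumptions.
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Greedily pick n_samples points, each maximally far from those chosen so far.

    points: (N, 3) array of xyz coordinates.
    Returns indices of the sampled points, shape (n_samples,).
    """
    n = points.shape[0]
    chosen = np.zeros(n_samples, dtype=np.int64)
    dist = np.full(n, np.inf)            # distance to the nearest chosen point
    chosen[0] = np.random.randint(n)     # arbitrary seed point
    for i in range(1, n_samples):
        d = np.sum((points - points[chosen[i - 1]]) ** 2, axis=1)
        dist = np.minimum(dist, d)       # update nearest-chosen distances
        chosen[i] = np.argmax(dist)      # farthest from all chosen so far
    return chosen

def knn_group(points, centers, k):
    """For each center, return the indices of its k nearest neighbors in points."""
    d = np.sum((centers[:, None, :] - points[None, :, :]) ** 2, axis=-1)  # (M, N)
    return np.argsort(d, axis=1)[:, :k]                                   # (M, k)

# Usage: sample 512 anchors from a 1024-point cloud, group 32 neighbors per anchor.
cloud = np.random.rand(1024, 3)
idx = farthest_point_sampling(cloud, 512)
neighbors = knn_group(cloud, cloud[idx], 32)   # (512, 32) index array
```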
Related work:
Transformer in NLP Bahdanau et al. [2] proposed a neural machine translation method with an attention mechanism, in which attention weights are computed from the hidden state of an RNN. Self-attention was proposed by Lin et al. [18] to visualize and interpret sentence embeddings. Building on these, Vaswani et al. [26] proposed the Transformer for machine translation; it is based solely on self-attention, without any recurrence or convolution operators. Devlin et al. [6] proposed the bidirectional transformer (BERT) approach, one of the most powerful models in the NLP field. More recently, language learning networks such as XLNet [36], Transformer-XL [5] and BioBERT [15] have further extended the Transformer framework. However, in natural language processing the input is ordered and each word carries basic semantics, whereas point clouds are unordered and individual points generally have no semantic meaning.
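The unordered nature of point clouds is exactly what self-attention tolerates: without positional encodings, scaled dot-product self-attention is permutation equivariant, so pooling its outputs with a symmetric function yields a permutation-invariant representation, which is the property the objectives above rely on. Below is a minimal NumPy sketch checking this; the shapes and variable names are illustrative assumptions.

```python
# Sketch: self-attention permutes with its input; symmetric pooling is then invariant.
import numpy as np

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over a set of n feature vectors.

    x: (n, d) input features; wq/wk/wv: (d, d) projection matrices.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])           # (n, n) pairwise logits
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)          # row-wise softmax
    return attn @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
wq, wk, wv = (rng.normal(size=(4, 4)) for _ in range(3))
perm = rng.permutation(8)

out = self_attention(x, wq, wk, wv)
out_perm = self_attention(x[perm], wq, wk, wv)
# Permuting the inputs only permutes the outputs (equivariance) ...
assert np.allclose(out[perm], out_perm)
# ... so a symmetric pooling over points gives a permutation-invariant feature.
assert np.allclose(out.max(axis=0), out_perm.max(axis=0))
```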
Transformer for vision Many frameworks have introduced attention into vision tasks. Wang et al. [27] proposed a residual attention approach with stacked attention modules for image classification. Hu et al. [10] presented a novel spatial encoding unit, the SE block, whose idea is derived from the attention mechanism. Zhang et al. [38] designed SAGAN, which uses self-attention for image generation. There has also been an increasing trend to employ the Transformer as a module to optimize neural networks. Wu et al. [30] proposed visual transformers that apply the Transformer to token-based images from feature maps for vision tasks. Recently, Dosovitskiy et al. [7] proposed an image recognition network, ViT, based on patch encoding and the Transformer, showing that, given sufficient training data, the Transformer provides better performance than a traditional convolutional neural network. Carion et al. [4] presented an end-to-end detection transformer that takes CNN features as input and generates bounding boxes with a Transformer encoder-decoder. Inspired by the local patch structure used in ViT and the basic semantic information carried by words in language, we present a neighbor embedding module that aggregates features from a point's local neighborhood, capturing both local and semantic information.
Point-based deep learning PointNet [21] pioneered point cloud learning. Subsequently, Qi et al. proposed PointNet++ [22], which uses query ball grouping and a hierarchical PointNet to capture local structures. Several subsequent works considered how to define convolution operations on point clouds. One main approach converts a point cloud into a regular voxel array to allow convolution operations. Tchapmi et al. [24] proposed SEGCloud for pointwise segmentation; it maps convolution features of 3D voxels to point clouds using trilinear interpolation and keeps global consistency through fully connected conditional random fields. Atzmon et al. [1] presented the PCNN framework with extension and restriction operators to map between point-based and voxel-based representations; volumetric convolution is performed on voxels for point feature extraction. MCCNN by Hermosilla et al. [8] allows non-uniformly sampled point clouds; convolution is treated as a Monte Carlo integration problem. Similarly, in PointConv proposed by Wu et al. [31], 3D convolution is performed through Monte Carlo estimation and importance sampling. A different approach redefines convolution to operate on irregular point cloud data. Li et al. [17] introduced a point cloud convolution network, PointCNN, in which a χ-transformation is trained to determine a 1D point order for convolution. Tatarchenko et al. [23] proposed tangent convolution, which can learn surface geometric features from projected virtual tangent images. SPG, proposed by Landrieu et al. [13], divides the scanned scene into similar elements and establishes a superpoint graph structure to learn contextual relationships between object parts. Pan et al. [35] use a parallel framework to extend CNNs from the conventional domain to a curved two-dimensional manifold; however, it requires dense 3D gridded data as input, so it is unsuitable for 3D point clouds. Wang et al. [29] designed an EdgeConv operator for dynamic graphs, allowing point cloud learning by recovering local topology. Various other methods also employ attention and the Transformer. Yan et al. [34] proposed PointASNL to deal with noise in point cloud processing, using a self-attention mechanism to update features for local groups of points. Hertz et al. [9] proposed PointGMM for shape interpolation with both multi-layer perceptron (MLP) splits and attentional splits. Unlike the above methods, our PCT is based on the Transformer rather than using self-attention as an auxiliary module. While the framework of Wang et al. [28] uses the Transformer to optimize point cloud registration, our PCT is a more general framework that can be used for various point cloud tasks.