Non-local Neural Networks
Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He
Code: https://github.com/facebookresearch/video-nonlocal-net?utm_source=catalyzex.com
Both convolutional and recurrent operations are building blocks that process one local neighbourhood at a time.
Present non-local operations as a generic family of building blocks for capturing long-range dependencies.
A non-local operation computes the response at a position as a weighted sum of the features at all positions.
This building block can be plugged into many computer vision architectures.
Non-local means [4]
A classical filtering algorithm that computes a weighted mean of all pixels in an image.
It allows distant pixels to contribute to the filtered response at a location based on patch appearance similarity.
BM3D (block-matching 3D) [10]
Implements the non-local filtering idea.
Performs filtering on a group of similar, but non-local, patches.
Graphical Models
Conditional random fields (CRF) [29, 28] ==> a graphical model that can model long-range dependencies.
A CRF can be exploited to post-process semantic segmentation predictions of a network [9].
The iterative mean-field inference of CRF can be turned into a recurrent network and trained [56, 42, 8, 18, 34].
Feedforward Modelling for Sequences
Using feedforward (i.e., non-recurrent) networks for modelling sequences in speech and language [36, 54, 15].
Long-term dependencies are captured by the large receptive fields contributed by very deep 1-D convolutions.
These feedforward models are amenable to parallelised implementations and can be more efficient than widely used recurrent models.
Self-attention
A self-attention module [49] computes the response at a position in a sequence (e.g., a sentence) by attending to all positions and taking their weighted average in an embedding space.
Interaction Networks
Interaction Networks (IN) [2, 52] model physical systems, operating on graphs of objects involved in pairwise interactions.
Hoshen [24] presented the more efficient Vertex Attention IN (VAIN) in the context of multi-agent predictive modelling.
Relation Networks [40] compute a function on the feature embeddings at all pairs of positions in their input.
Video Classification Architectures
A natural solution to video classification is to combine the success of CNNs for images and RNNs for sequences [55, 11].
Feedforward models are achieved by 3D convolutions (C3D) [26, 48] in spacetime, and the 3D filters can be formed by “inflating” [13, 7] pre-trained 2D filters.
Optical flow [45] and trajectories [50, 51] can be helpful.
Definition:
The NLM (non-local means) algorithm aims to reduce noise in images while preserving image details and textures.
Unlike traditional "local" filters (e.g., Gaussian, median), which average pixels within a small neighborhood around the target pixel, NLM considers similarities between patches of pixels throughout the entire image.
How It Works:
Patch Comparison: For a target pixel, the NLM algorithm defines a small patch around that pixel. It then searches the entire image for other patches that are similar to the target patch.
Weighted Averaging: The value of the target pixel is replaced with a weighted average of pixels from similar patches. The weights are determined by how similar the patches are to the target patch. More similar patches get higher weights.
Python:
skimage.restoration.denoise_nl_means (scikit-image)
cv2.fastNlMeansDenoising (OpenCV)
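As a quick illustration, here is a minimal scikit-image example of non-local means denoising; the sample image, noise level, and parameter values are arbitrary choices for demonstration, not prescribed by the paper.

```python
import numpy as np
from skimage import data, img_as_float
from skimage.restoration import denoise_nl_means, estimate_sigma
from skimage.util import random_noise

# Load a sample grayscale image and add synthetic Gaussian noise.
image = img_as_float(data.camera())
noisy = random_noise(image, var=0.01)

# Estimate the noise standard deviation and run non-local means denoising.
sigma_est = np.mean(estimate_sigma(noisy))
denoised = denoise_nl_means(
    noisy,
    h=1.15 * sigma_est,   # filtering strength (larger -> smoother result)
    patch_size=5,         # size of the patches used for comparison
    patch_distance=6,     # maximal search distance for similar patches
    fast_mode=True,
)
```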
Formula:
Given an image u, the denoised value at a pixel p is calculated as

$$\hat{u}(p) = \frac{1}{C(p)} \sum_{q} u(q)\, w(p, q), \qquad C(p) = \sum_{q} w(p, q),$$

with

$$w(p, q) = \exp\!\left(-\frac{|B(q) - B(p)|^2}{h^2}\right),$$

where
C(p): normalizing factor (the sum of all weights for pixel p).
w(p, q): weighting function between pixels p and q; it can be based on a (weighted) Euclidean distance between their patches.
B(p): the average value of the pixels in a patch around pixel p.
h: a parameter that adjusts how quickly the weights decay as the Euclidean distance increases.
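To make the formula concrete, below is a toy NumPy sketch that evaluates the denoised value of a single pixel exactly as written above, using the patch mean B(p) and the exponential weight; the window sizes and h are arbitrary illustrative values.

```python
import numpy as np

def nlm_pixel(u, p, h=10.0, patch=3, search=10):
    """Denoised value at pixel p = (row, col) following the formula above.

    B(r, c) is the mean of the window around (r, c); the weight is
    w(p, q) = exp(-(B(q) - B(p))**2 / h**2), and C(p) normalizes the sum.
    """
    rows, cols = u.shape

    def B(r, c):  # local mean around (r, c), clipped at the image borders
        r0, r1 = max(r - patch, 0), min(r + patch + 1, rows)
        c0, c1 = max(c - patch, 0), min(c + patch + 1, cols)
        return u[r0:r1, c0:c1].mean()

    pr, pc = p
    Bp = B(pr, pc)
    num = 0.0   # weighted sum of pixel values
    C = 0.0     # normalizing factor C(p)
    for qr in range(max(pr - search, 0), min(pr + search + 1, rows)):
        for qc in range(max(pc - search, 0), min(pc + search + 1, cols)):
            w = np.exp(-((B(qr, qc) - Bp) ** 2) / h ** 2)
            num += w * u[qr, qc]
            C += w
    return num / C

# Toy usage on a random "noisy image"
u = np.random.rand(32, 32) * 255
print(nlm_pixel(u, (16, 16)))
```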
Non-local modules enhance deep neural networks by directly computing relationships between distant positions within an image or video, capturing long-range dependencies that convolutional layers might miss.
In a neural network, the generic non-local operation can be written as

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \qquad (1)$$

where:
i: index of an output position (in space, time, or spacetime) whose response is to be computed.
j: index that enumerates all possible positions.
x: input signal (image, sequence, video; often their features)
y: output signal of the same size as x.
f: a pairwise function that computes a scalar (representing a relationship such as affinity) between position i and all positions j.
g: a unary function that computes a representation of the input signal at position j.
C(x): normalizing factor.
$g(x_j) = W_g x_j$, where $W_g$ is a weight matrix to be learned.
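A tiny NumPy sketch of Eq.(1), assuming the Gaussian pairwise function and a linear g; the shapes and the random weight matrix are made up for illustration.

```python
import numpy as np

# Minimal NumPy sketch of Eq.(1) with f(x_i, x_j) = exp(x_i^T x_j) and g(x_j) = W_g x_j.
N, C_in, C_out = 6, 8, 8             # positions, input channels, output channels
x = np.random.randn(N, C_in)         # input features, one row per position
W_g = np.random.randn(C_in, C_out)   # learned weight matrix (random here)

f = np.exp(x @ x.T)                  # pairwise affinities f(x_i, x_j), shape (N, N)
C_x = f.sum(axis=1, keepdims=True)   # normalizing factor C(x) = sum_j f(x_i, x_j)
g = x @ W_g                          # unary representations g(x_j), shape (N, C_out)
y = (f / C_x) @ g                    # y_i = (1/C(x)) * sum_j f(x_i, x_j) g(x_j)
```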
Choices for the pairwise function f:
Gaussian:
$$f(x_i, x_j) = e^{x_i^T x_j}, \quad C(x) = \sum_{\forall j} f(x_i, x_j) \qquad (2)$$
Here $x_i^T x_j$ is dot-product similarity. Euclidean distance, as used in [4, 47], is also applicable, but the dot product is more implementation-friendly on modern deep learning platforms.

Embedded Gaussian:
$$f(x_i, x_j) = e^{\theta(x_i)^T \phi(x_j)}, \quad C(x) = \sum_{\forall j} f(x_i, x_j) \qquad (3)$$
Here $\theta(x_i) = W_\theta x_i$ (query) and $\phi(x_j) = W_\phi x_j$ (key) are two embeddings.

Dot product:
$$f(x_i, x_j) = \theta(x_i)^T \phi(x_j), \quad C(x) = N \qquad (4)$$
Here $\theta(x_i) = W_\theta x_i$ and $\phi(x_j) = W_\phi x_j$ are two embeddings, and N is the number of positions in x, used as the normalizer rather than the sum of f.

Concatenation:
$$f(x_i, x_j) = \mathrm{ReLU}\!\left(w_f^T \left[\theta(x_i), \phi(x_j)\right]\right), \quad C(x) = N \qquad (5)$$
Concatenation is the pairwise function used in Relation Networks [40] for visual reasoning. Here $[\cdot\,, \cdot]$ denotes concatenation and $w_f$ is a weight vector that projects the concatenated vector to a scalar; in this case, ReLU [35] is adopted inside f.
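For reference, the four pairwise-function choices in Eqs.(2)-(5) can be written in a few NumPy lines; the random weights and dimensions below are placeholders, not values from the paper.

```python
import numpy as np

N, d = 6, 4
x = np.random.randn(N, d)            # raw features, one row per position
theta = x @ np.random.randn(d, d)    # theta(x_i) = W_theta x_i (random W here)
phi = x @ np.random.randn(d, d)      # phi(x_j) = W_phi x_j
w_f = np.random.randn(2 * d)         # projection vector for the concatenation form

f_gaussian = np.exp(x @ x.T)                 # Eq. (2)
f_embedded_gaussian = np.exp(theta @ phi.T)  # Eq. (3)
f_dot = theta @ phi.T                        # Eq. (4)

# Eq. (5): ReLU(w_f^T [theta_i, phi_j]) for every pair (i, j)
pairs = np.concatenate([np.repeat(theta[:, None, :], N, axis=1),
                        np.repeat(phi[None, :, :], N, axis=0)], axis=-1)
f_concat = np.maximum(pairs @ w_f, 0.0)      # shape (N, N)
```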
Wrap the non-local operation in Eq.(1) into a non-local block:

$$z_i = W_z y_i + x_i \qquad (6)$$
where yi is given in Eq.(1) and “+xi” denotes a residual connection [21].
The residual connection allows us to insert a new non-local block into any pre-trained model, without breaking its initial behavior (e.g., if Wz is initialized as zero).
The pairwise computation in Eq.(2), (3), or (4) can be simply done by matrix multiplication as shown in Figure 2; the concatenation version in (5) is straightforward.
Lightweight: the pairwise computation of a non-local block is lightweight when it is used on high-level, sub-sampled feature maps (e.g., T = 4, H = W = 14 or 7). The pairwise computation, done by matrix multiplication, is comparable in cost to a typical convolutional layer in standard networks.
Implementation of Non-local Blocks:
Set the number of channels of $W_g$, $W_\theta$, and $W_\phi$ to half of the number of channels in x, following the bottleneck design of [21].
The weight matrix Wz in Eq.(6) computes a position-wise embedding on yi, matching the number of channels to that of x.
A subsampling trick can be used to further reduce computation: apply pooling (e.g., max pooling) to the inputs of φ and g, so the pairwise computation is performed against a subsampled set of positions (see the sketch below).
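As a concrete illustration, here is a minimal PyTorch sketch of a 2D embedded-Gaussian non-local block using the design choices above (half-channel embeddings, the $W_z$ projection followed by BN with zero-initialized scale, the residual connection, and max-pooling subsampling on φ and g). The class name and hyperparameters are assumptions for illustration, not the authors' reference implementation; the video version would use 3D (spacetime) convolutions and pooling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock2D(nn.Module):
    """Embedded-Gaussian non-local block for (N, C, H, W) feature maps."""

    def __init__(self, channels, subsample=True):
        super().__init__()
        inter = channels // 2                       # half of the input channels
        self.theta = nn.Conv2d(channels, inter, 1)  # query embedding W_theta
        self.phi = nn.Conv2d(channels, inter, 1)    # key embedding W_phi
        self.g = nn.Conv2d(channels, inter, 1)      # unary function W_g
        self.w_z = nn.Conv2d(inter, channels, 1)    # W_z, restores the channel count
        self.bn = nn.BatchNorm2d(channels)          # BN after W_z
        nn.init.zeros_(self.bn.weight)              # zero-init scale -> block starts as identity
        self.subsample = subsample

    def forward(self, x):
        n, c, h, w = x.shape
        theta = self.theta(x).flatten(2)            # (n, c/2, HW)
        phi, g = self.phi(x), self.g(x)
        if self.subsample:                          # subsampling trick on phi and g
            phi, g = F.max_pool2d(phi, 2), F.max_pool2d(g, 2)
        phi, g = phi.flatten(2), g.flatten(2)       # (n, c/2, HW')

        # Pairwise affinities f(x_i, x_j) = exp(theta_i . phi_j); the softmax
        # provides the normalization C(x) = sum_j f(x_i, x_j).
        attn = torch.softmax(theta.transpose(1, 2) @ phi, dim=-1)  # (n, HW, HW')
        y = attn @ g.transpose(1, 2)                               # (n, HW, c/2)
        y = y.transpose(1, 2).reshape(n, c // 2, h, w)
        return x + self.bn(self.w_z(y))             # residual connection (Eq. 6)

# Example: insert a block after a pre-trained layer's output.
feat = torch.randn(2, 256, 14, 14)
block = NonLocalBlock2D(256)
out = block(feat)          # same shape as feat; identity mapping at initialization
```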
Baseline ResNet-50 C2D model for video:
Training.
Pre-trained on ImageNet [39].
Fine-tune using 32-frame input clips.
These clips are formed by randomly cropping out 64 consecutive frames from the original full-length video and then dropping every other frame.
The spatial size is 224×224 pixels, randomly cropped from a scaled video whose shorter side is randomly sampled in [256, 320] pixels, following [46].
Train on an 8-GPU machine with 8 clips per GPU in each mini-batch (a total mini-batch size of 64 clips).
400k iterations in total, starting with a learning rate of 0.01 and reducing it by a factor of 10 every 150k iterations (see also Figure 4).
Momentum of 0.9 and a weight decay of 0.0001.
Dropout [22] after the global pooling layer, with a dropout ratio of 0.5.
Fine-tune the models with BatchNorm (BN) [25] enabled wherever it is applied (rather than freezing BN) ==> reduces overfitting.
Add a BN layer right after the last 1×1×1 layer that represents Wz; the scale parameter of this BN layer is initialized as zero, so the whole non-local block starts as an identity mapping.
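A minimal PyTorch sketch of the optimizer and schedule described above; the placeholder model and variable names are assumptions for illustration, not the authors' training code.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 400)  # placeholder for the actual C2D / non-local network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
# Step decay: divide the learning rate by 10 at 150k and 300k iterations (400k total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150_000, 300_000], gamma=0.1)
dropout = nn.Dropout(p=0.5)  # applied after the global pooling layer, as above
```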
Inference.
Spatial: Perform spatially fully convolutional inference on videos whose shorter side is rescaled to 256.
Temporal: Sample 10 clips evenly from a full-length video and compute the softmax scores on them individually.
The final prediction is the averaged softmax scores of all clips.
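A small sketch of the multi-clip inference protocol above, assuming a generic PyTorch classification model; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def predict_video(model, clips):
    """Average softmax scores over clips sampled evenly from one video.

    `model` and the clip tensors are placeholders; the protocol follows the
    10-clip evaluation described above.
    """
    scores = [F.softmax(model(clip), dim=1) for clip in clips]  # per-clip softmax
    return torch.stack(scores).mean(dim=0)                      # averaged prediction
```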
Kinetics and Charades Datasets
Metrics: top-1 classification accuracy on Kinetics and classification mAP on Charades.
AP_box and AP_mask for the COCO object detection and instance segmentation experiments.
Both optical flow and trajectories are off-the-shelf modules that may find long-range, non-local dependencies.
[4] Buades, Antoni, Bartomeu Coll, and Jean-Michel Morel. "A non-local algorithm for image denoising." In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2005.
https://viblo.asia/p/paper-explained-non-local-neural-networks-EvbLbx3Z4nk
https://colab.research.google.com/drive/1xB-bkB1DmpttoukIddfOTihFKvILnZth?usp=sharing