Non-local Neural Networks

Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He

TL;DR

This work introduces non-local operations as a generic, efficient means to capture long-range dependencies in visual data, extending beyond traditional local convolutions and recurrent connections. By formulating a flexible non-local block that aggregates information from all positions via various pairwise affinities, the authors demonstrate substantial improvements in video classification (Kinetics, Charades) and image tasks (COCO) when inserted into existing architectures. Key contributions include a generalized formulation, multiple instantiations (Gaussian, embedded Gaussian, dot product, concatenation), a residual non-local block design, and extensive ablations showing robustness across spacetime, depth, and sequence length. The results show non-local blocks offer strong accuracy gains with modest computational overhead and can complement 3D convolutions, suggesting broad applicability as a standard building block in vision networks.

Abstract

Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete with or outperform current competition winners on both the Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code is available at https://github.com/facebookresearch/video-nonlocal-net .
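The "weighted sum of the features at all positions" can be illustrated with a small sketch. It is not the authors' code. It assumes the plain Gaussian affinity, f(x_i, x_j) = exp(x_i · x_j), with an identity mapping on the features, so the normalized weights reduce to a softmax over all positions j. The function name `non_local_gaussian` is hypothetical.

```python
import math


def non_local_gaussian(x):
    """Apply a Gaussian non-local operation to a list of feature vectors.

    x is a list of N positions, each a list of C floats. Assumption for this
    sketch: f(x_i, x_j) = exp(x_i . x_j) and the features are used as-is.
    Returns N output vectors, each a weighted average over ALL positions.
    """
    outputs = []
    for xi in x:
        # Dot-product similarity between position i and every position j.
        scores = [sum(a * b for a, b in zip(xi, xj)) for xj in x]
        # Subtract the maximum before exponentiating for numerical stability;
        # the normalized weights are unchanged by this shift.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)  # normalization factor over all positions
        # Response at position i: normalized weighted sum of all features.
        yi = [
            sum(w * xj[d] for w, xj in zip(weights, x)) / norm
            for d in range(len(xi))
        ]
        outputs.append(yi)
    return outputs
```

Every output depends on all N positions, so the cost grows quadratically with N. A convolution, by contrast, only sees a fixed local window around each position.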

Paper Structure

This paper contains 39 sections, 6 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: A spacetime non-local operation in our network trained for video classification in Kinetics. A position $\mathbf{x}_i$'s response is computed by the weighted average of the features of all positions $\mathbf{x}_j$ (only the highest weighted ones are shown here). In this example computed by our model, note how it relates the ball in the first frame to the ball in the last two frames. More examples are in Figure 3.
  • Figure 2: A spacetime non-local block. The feature maps are shown as the shape of their tensors, e.g., $T \times H \times W \times 1024$ for 1024 channels (proper reshaping is performed when noted). "$\otimes$" denotes matrix multiplication, and "$\oplus$" denotes element-wise sum. The softmax operation is performed on each row. The blue boxes denote $1 \times 1 \times 1$ convolutions. Here we show the embedded Gaussian version, with a bottleneck of 512 channels. The vanilla Gaussian version can be done by removing $\theta$ and $\phi$, and the dot-product version can be done by replacing softmax with scaling by $1/N$.
  • Figure 3: Examples of the behavior of a non-local block in res$_3$ computed by a 5-block non-local model trained on Kinetics. These examples are from held-out validation videos. The starting point of arrows represents one $\mathbf{x}_i$, and the ending points represent $\mathbf{x}_j$. The 20 highest weighted arrows for each $\mathbf{x}_i$ are visualized. The 4 frames are from a 32-frame input, shown with a stride of 8 frames. These visualizations show how the model finds related clues to support its prediction.
  • Figure 4: Curves of the training procedure on Kinetics for the ResNet-50 C2D baseline (blue) vs. non-local C2D with 5 blocks (red). We show the top-1 training error (dashed) and validation error (solid). The validation error is computed in the same way as the training error (so it is 1-clip testing with the same random jittering at training time); the final results are in the ablation table on deeper non-local models (R50, 5-block).
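The block described in the Figure 2 caption can be sketched in plain Python. This is an illustration under stated assumptions, not the released implementation. The $T \times H \times W$ positions are flattened into N rows. Each $1 \times 1 \times 1$ convolution is modeled as a per-position linear map, that is, a matrix multiply. The helper names and the weight names `w_theta`, `w_phi`, `w_g`, `w_z` are hypothetical; only $\theta$ and $\phi$ are named in the caption.

```python
import math


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of rows."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row of a matrix."""
    result = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def non_local_block(x, w_theta, w_phi, w_g, w_z):
    """Embedded Gaussian non-local block with a residual connection.

    x:        N x C features (N = T*H*W flattened positions).
    w_theta, w_phi, w_g: C x Cb maps into the bottleneck (Cb channels).
    w_z:      Cb x C map back to the input width.
    Returns an N x C matrix, the same shape as x.
    """
    theta = matmul(x, w_theta)  # N x Cb
    phi = matmul(x, w_phi)      # N x Cb
    g = matmul(x, w_g)          # N x Cb
    # Pairwise affinities between all positions, softmax-normalized per row.
    attention = softmax_rows(matmul(theta, transpose(phi)))  # N x N
    y = matmul(attention, g)    # N x Cb, weighted sum over all positions
    z = matmul(y, w_z)          # N x C, restore the channel count
    # Element-wise sum with the input (the residual path).
    return [[zv + xv for zv, xv in zip(z_row, x_row)] for z_row, x_row in zip(z, x)]
```

Because of the residual sum, the block reduces to the identity when `w_z` is all zeros. For the dot-product variant mentioned in the caption, replace `softmax_rows` with a division of each affinity by N.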