DCVNet: Dilated Cost Volume Networks for Fast Optical Flow

Huaizu Jiang, Erik Learned-Miller

TL;DR

DCVNet proposes a single-pass optical flow model that uses multiple dilated cost volumes to capture both small and large displacements without sequential refinement. A U-Net converts the concatenated dilated volumes into interpolation weights, enabling a weighted combination of candidate displacements to produce the flow. The approach achieves competitive accuracy on Sintel and KITTI while maintaining real-time performance (30 fps) on a mid-range GPU, thanks to efficient cost-volume construction and a lightweight decoding stage. Training combines SceneFlow pre-training with targeted fine-tuning, and ablations demonstrate the benefit of multiple dilation factors and supervised interpolation weights. This method offers a fast alternative to coarse-to-fine or recurrent cost-volume strategies in optical flow estimation.

Abstract

The cost volume, capturing the similarity of possible correspondences across two input images, is a key ingredient in state-of-the-art optical flow approaches. When sampling correspondences to build the cost volume, a large neighborhood radius is required to deal with large displacements, introducing a significant computational burden. To address this, coarse-to-fine or recurrent processing of the cost volume is usually adopted, where correspondence sampling in a local neighborhood with a small radius suffices. In this paper, we propose an alternative by constructing cost volumes with different dilation factors to capture small and large displacements simultaneously. A U-Net with skip connections is employed to convert the dilated cost volumes into interpolation weights between all possible captured displacements to get the optical flow. Our proposed model DCVNet only needs to process the cost volume once in a simple feedforward manner and does not rely on the sequential processing strategy. DCVNet obtains accuracy comparable to existing approaches and achieves real-time inference (30 fps on a mid-range GTX 1080 Ti GPU). The code and model weights are available at https://github.com/neu-vi/ezflow.
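The idea of sampling correspondences on a dilated grid can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `dilated_cost_volume`, the use of a channel-averaged dot product as the similarity, and the zero padding at image borders are all assumptions made for illustration.

```python
import numpy as np

def dilated_cost_volume(f1, f2, radius=2, dilation=1):
    """Correlate f1 with f2 at every offset of a dilated (2*radius+1)^2 grid.

    f1, f2: feature maps of shape (C, H, W).
    Returns (cost, offsets):
      cost    has shape (K, K, H, W) with K = 2*radius + 1,
      offsets has shape (K, K, 2) holding the (dx, dy) displacement of each slice.
    Hypothetical helper: dot-product similarity and zero padding are assumptions.
    """
    C, H, W = f1.shape
    K = 2 * radius + 1
    pad = radius * dilation
    # Zero-pad the second feature map so every shifted window stays in bounds.
    f2p = np.pad(f2, ((0, 0), (pad, pad), (pad, pad)))
    cost = np.zeros((K, K, H, W), dtype=f1.dtype)
    offsets = np.zeros((K, K, 2), dtype=np.int64)
    for i in range(K):          # vertical candidate index
        for j in range(K):      # horizontal candidate index
            dy = (i - radius) * dilation
            dx = (j - radius) * dilation
            # shifted[:, y, x] equals f2[:, y + dy, x + dx] (zero outside the image).
            shifted = f2p[:, pad + dy: pad + dy + H, pad + dx: pad + dx + W]
            cost[i, j] = (f1 * shifted).sum(axis=0) / C
            offsets[i, j] = (dx, dy)
    return cost, offsets
```

With `radius=2`, a dilation of 1 covers displacements of up to 2 feature pixels per axis, while a dilation of 2 reaches 4 feature pixels at the same cost of 25 candidates. Cost volumes built with several dilation factors could then be concatenated along the candidate axis before further processing.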

Paper Structure

This paper contains 15 sections, 6 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Illustration of our proposed model, DCVNet, compared with two representative existing approaches. DCVNet is an alternative to existing approaches that does not need sequential processing of the cost volume. The key idea is to construct cost volumes with different dilation rates to capture small and large displacements at the same time. It achieves real-time inference (30 fps) on a mid-range GTX 1080 Ti GPU and accuracy comparable to existing approaches.
  • Figure 2: Illustration of using dilation to capture both small and large displacements. (a) Two input images, where points $A$ and $B$ move to $A'$ and $B'$, respectively. (b) Two patches around $A$ in the two images. (c) Two patches around $B$ in the two images. Blue dots in (b) and (c) correspond to candidate displacements when constructing cost volumes. With a small search radius (2 in this example), the correct displacements (denoted by red and blue crosses, respectively) can be captured using two different dilation factors. Best viewed in color.
  • Figure 3: Pipeline of DCVNet. Feature representations of two input images are obtained from the encoder, which are used to construct the dilated cost volumes at different strides and dilation rates. A U-Net is employed to process the cost volumes to produce a set of interpolation weights over the captured displacements in the cost volume to compute the optical flow.
  • Figure 4: Illustration of interpolation weights. For both points A and B, on the right, we show the interpolation weights obtained with and without the U-Net filtering on the top and bottom, respectively. Each image represents $U\times V$ ($9\times 9$) interpolation weights. The feature stride is 8, and the different dilation factors are shown at the bottom. For point A, whose motion magnitude is small, a small dilation factor is sufficient to capture the correspondence, while for point B, whose motion magnitude is large, a large dilation factor is needed. (Color encoding: blue is close to 0 and yellow is close to 1. Best viewed in color.)
  • Figure 5: Visual comparison of optical flow estimations. From left to right: (a) input images, (b) PWCNet [sun2018pwc], (c) VCN [yang2019volumetric], (d) RAFT [teed2020raft], and (e) our DCVNet. For each method, we show the colorized optical flow and error maps (obtained from online servers). In the error maps, white and red indicate large errors, while black and blue indicate small errors. Best viewed in color.
  • ...and 9 more figures
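
The captions for Figures 3 and 4 describe interpolation weights over the captured displacements. A minimal sketch of how such weights could be turned into a flow field follows. It assumes the weights come from a softmax over per-candidate scores and that the flow is the weighted sum of candidate displacements; the function name `flow_from_weights` and the softmax normalization are assumptions for illustration, not details confirmed by the text above.

```python
import numpy as np

def flow_from_weights(logits, offsets):
    """Combine candidate displacements into a flow field.

    logits:  (N, H, W) unnormalized scores, one per candidate displacement.
    offsets: (N, 2) candidate displacements as (dx, dy).
    Returns a flow field of shape (2, H, W).
    Hypothetical helper: softmax normalization is an assumption.
    """
    # Numerically stable softmax over the candidate axis.
    z = logits - logits.max(axis=0, keepdims=True)
    w = np.exp(z)
    w /= w.sum(axis=0, keepdims=True)
    # Weighted sum of candidate displacements at every pixel.
    return np.einsum("nhw,nc->chw", w, offsets.astype(w.dtype))
```

Because the weights are continuous, the result can fall between grid candidates: two equally weighted candidates at (0, 0) and (2, 0) yield a flow of (1, 0), which is how a coarse, dilated set of displacements can still express intermediate motions.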