CoordFlow: Coordinate Flow for Pixel-wise Neural Video Representation

Daniel Silver, Ron Kimmel

TL;DR

CoordFlow addresses the challenge of high-quality video reconstruction at low bitrates by introducing a pixel-wise INR that partitions video into temporally coherent layers, each with motion compensation. The method employs an ensemble of CoordFlow layers, where a Flow Network learns a time-dependent similarity transform to canonical space and a Color Network encodes RGB and $\alpha$, with layer fusion enabling unsupervised segmentation. Empirically, CoordFlow achieves state-of-the-art performance among pixel-wise INRs and is competitive with leading frame-wise methods on UVG and Boat datasets, while enabling upsampling, segmentation, inpainting, stabilization, and denoising. These results highlight the potential of motion-aware, layer-wise neural representations for flexible and scalable video compression, though training and inference remain slow and merit acceleration.

Abstract

In video compression, achieving better quality at lower bit rates remains a long-standing goal. Recent developments have demonstrated the potential of Implicit Neural Representation (INR) as a promising alternative to traditional transform-based methodologies. Video INRs can be roughly divided into frame-wise and pixel-wise methods, according to the structure of the network's output. While pixel-wise methods are better suited to upsampling and parallelization, frame-wise methods have demonstrated better performance. We introduce CoordFlow, a novel pixel-wise INR for video compression. It yields state-of-the-art results compared to other pixel-wise INRs and on-par performance compared to leading frame-wise techniques. The method is based on separating the visual information into visually consistent layers, each represented by a dedicated network that compensates for the layer's motion. When the layers are integrated, a byproduct is an unsupervised segmentation of the video sequence. Object motion trajectories are implicitly utilized to compensate for visual-temporal redundancies. Additionally, the proposed method provides inherent video upsampling, stabilization, inpainting, and denoising capabilities.
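As noted in the TL;DR, each layer's Flow Network learns a time-dependent similarity transform that maps pixel coordinates into a canonical space. The following is a minimal sketch of that coordinate mapping only, not the authors' implementation. The parameterization (uniform scale, rotation angle, translation), the sin/cos form of the positional encoding, and the function `toy_flow_network` (a hand-written stand-in for the learned network) are all assumptions made for illustration.

```python
import math


def positional_encoding(t, num_freqs=4):
    """Sin/cos features of a scalar t at octave-spaced frequencies (assumed form)."""
    feats = []
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        feats.append(math.sin(freq * t))
        feats.append(math.cos(freq * t))
    return feats


def similarity_transform(x, y, scale, theta, tx, ty):
    """Rotate (x, y) by theta, scale uniformly, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    x_c = scale * (c * x - s * y) + tx
    y_c = scale * (s * x + c * y) + ty
    return x_c, y_c


def toy_flow_network(t):
    """Hypothetical stand-in for the learned Flow Network.

    Reads the encoded time and returns (scale, theta, tx, ty).
    Here it only produces a small horizontal shift that varies with t.
    """
    feats = positional_encoding(t)
    return 1.0, 0.0, 0.1 * feats[0], 0.0


def to_canonical(x, y, t):
    """Map a pixel coordinate (x, y) at time t to canonical coordinates."""
    scale, theta, tx, ty = toy_flow_network(t)
    return similarity_transform(x, y, scale, theta, tx, ty)
```

In a full model, the canonical coordinates would themselves be positionally encoded and passed to the Color Network; that step is omitted here.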

Paper Structure (25 sections, 6 equations, 9 figures, 8 tables, 1 algorithm)

Figures (9)

  • Figure 1: CoordFlow layer architecture workflow. The process begins with the input pixel coordinates $(x, y, t)$, where $t$ undergoes positional encoding (PE) before being processed by the Flow Network. This network computes a similarity transformation that realigns the spatial coordinates $(x, y)$, counteracting the motion within the video sequence and yielding a set of transformed coordinates $(x', y', t)$. These stabilized coordinates, after positional encoding, are then fed into the Color Network, which produces the color (RGB) and alpha ($\alpha$) outputs for each pixel. The operation of the Flow Network effectively creates a 'canonical space' in which the temporal motion is neutralized, allowing the Color Network to generate a consistent representation across time.
  • Figure 2: Visualization of CoordFlow. The input coordinates are passed through the CoordFlow layers in parallel, outputting RGB and alpha values. In this example there are only two layers, and the RGB output of each layer can be seen in the middle, in addition to the softmax value of the alphas. The softmax map acts similarly to an attention map, and we can see the background/foreground segmentation. At the far right is the ground truth frame, next to the final output of the model.
  • Figure 3: Comparison to pixel-wise models, NeRV, and classic methods. Results taken from lee2023ffnerv, kwan2024hinerv, kim2022scalable, maiya2023nirvana.
  • Figure 4: Visualization of the interaction between the flow and the color networks. The large image represents the canonical space at frame 100. It was created by sampling the color network with x and y values that extend beyond the network's expected sample area, giving a wider view. The expected sample areas for times 0, 100, and 200 are the marked boundaries. On the right are the ground truth frames, marked with the corresponding colors. In this example, CoordFlow treats the video as one large image and, using the flow network, simply changes the sample area of the canonical space in order to create the current frame.
  • Figure 5: Upsampling of the boat video using CoordFlow (trained on every fourth pixel), bi-linear, and nearest neighbor interpolation. All methods were applied to the same down-sampled video.
  • ...and 4 more figures
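
The Figure 2 caption describes a softmax over the per-layer alpha values that behaves like an attention map and exposes a background/foreground segmentation. The sketch below illustrates one plausible reading of that fusion for a single pixel: the softmax weights blend the per-layer RGB outputs, and the index of the largest weight serves as a segmentation label. The weighted-sum blend and the names `fuse_layers` and `segment` are assumptions for illustration, not a confirmed description of the authors' code.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_outputs):
    """Blend per-layer outputs for one pixel.

    layer_outputs: list of ((r, g, b), alpha) tuples, one per layer.
    Returns the fused (r, g, b) and the softmax weights over layers.
    """
    alphas = [alpha for _, alpha in layer_outputs]
    weights = softmax(alphas)
    fused = tuple(
        sum(w * rgb[c] for w, (rgb, _) in zip(weights, layer_outputs))
        for c in range(3)
    )
    return fused, weights


def segment(weights):
    """Index of the dominant layer, usable as an unsupervised segment label."""
    return max(range(len(weights)), key=lambda i: weights[i])


# Two layers: a blue "background" and a red "foreground" with a larger alpha.
pixel_layers = [((0.0, 0.0, 1.0), 0.0), ((1.0, 0.0, 0.0), 20.0)]
fused_rgb, layer_weights = fuse_layers(pixel_layers)
label = segment(layer_weights)
```

With these inputs the foreground layer dominates, so the fused color is almost pure red and the pixel is assigned to layer 1.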