CoordFlow: Coordinate Flow for Pixel-wise Neural Video Representation

Daniel Silver; Ron Kimmel

TL;DR

CoordFlow addresses the challenge of high-quality video reconstruction at low bitrates by introducing a pixel-wise INR that partitions video into temporally coherent layers, each with motion compensation. The method employs an ensemble of CoordFlow layers, where a Flow Network learns a time-dependent similarity transform to canonical space and a Color Network encodes RGB and $\alpha$, with layer fusion enabling unsupervised segmentation. Empirically, CoordFlow achieves state-of-the-art performance among pixel-wise INRs and is competitive with leading frame-wise methods on UVG and Boat datasets, while enabling upsampling, segmentation, inpainting, stabilization, and denoising. These results highlight the potential of motion-aware, layer-wise neural representations for flexible and scalable video compression, though training and inference remain slow and merit acceleration.

Abstract

In video compression, the pursuit of better quality at lower bit rates remains a long-standing goal. Recent developments have demonstrated the potential of Implicit Neural Representations (INRs) as a promising alternative to traditional transform-based methodologies. Video INRs can be roughly divided into frame-wise and pixel-wise methods according to the structure of the network's output. While pixel-wise methods are better suited to upsampling and parallelization, frame-wise methods have demonstrated better performance. We introduce CoordFlow, a novel pixel-wise INR for video compression. It yields state-of-the-art results compared to other pixel-wise INRs and on-par performance compared to leading frame-wise techniques. The method is based on the separation of the visual information into visually consistent layers, each represented by a dedicated network that compensates for the layer's motion. When the layers are integrated, a byproduct is an unsupervised segmentation of the video sequence. Objects' motion trajectories are implicitly utilized to compensate for visual-temporal redundancies. Additionally, the proposed method provides inherent video upsampling, stabilization, inpainting, and denoising capabilities.
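The motion compensation described above can be illustrated with a minimal sketch. It assumes a 2D similarity transform parameterized by a uniform scale, a rotation angle, and a translation; the function names and the `toy_flow` stand-in (a hand-written replacement for a learned Flow Network that cancels a constant-velocity translation) are hypothetical and not taken from the paper.

```python
import math

def apply_similarity(x, y, scale, theta, tx, ty):
    """Rotate (x, y) by theta, scale uniformly, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return (scale * (c * x - s * y) + tx,
            scale * (s * x + c * y) + ty)

def toy_flow(t, velocity=(0.1, -0.05)):
    """Hypothetical stand-in for a learned Flow Network.

    Returns (scale, theta, tx, ty) that undoes a constant-velocity
    translation of the layer at time t.
    """
    vx, vy = velocity
    return 1.0, 0.0, -vx * t, -vy * t

def to_canonical(x, y, t):
    """Map a pixel coordinate (x, y, t) to motion-compensated (x', y', t)."""
    x_c, y_c = apply_similarity(x, y, *toy_flow(t))
    return x_c, y_c, t

# A point that starts at (0.2, 0.3) and drifts with the layer's velocity
# lands on the same canonical location at every time step.
for t in (0.0, 1.0, 2.0):
    x_t, y_t = 0.2 + 0.1 * t, 0.3 - 0.05 * t
    print(t, to_canonical(x_t, y_t, t))
```

Because the canonical coordinates of a moving point stay fixed, a network that maps them to color only needs to represent the layer's appearance once rather than once per frame.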

Paper Structure (25 sections, 6 equations, 9 figures, 8 tables, 1 algorithm)


Introduction & related works
Method
CoordFlow layer
Model architecture
Loss
Experiments
Neural representation
Model Compression
Ablation study
Advanced Applications of CoordFlow
Upsampling
Segmentation and inpainting
Video stabilization
Denoising
Conclusion
...and 10 more sections

Figures (9)

Figure 1: CoordFlow layer architecture workflow. The process starts with the input pixel coordinates $(x, y, t)$, where $t$ undergoes positional encoding (PE) before being processed by the Flow Network. This network computes a similarity transformation to realign the spatial coordinates $(x, y)$, counteracting the motion within the video sequence, and yielding a set of transformed coordinates $(x', y', t)$. These stabilized coordinates, after positional encoding, are then fed into the Color Network, which produces the color (RGB) and alpha ($\alpha$) outputs for each pixel. The operation of the Flow Network effectively creates a 'canonical space', in which the temporal motion is neutralized, allowing the Color Network to generate a consistent representation across time.
Figure 2: Visualization of CoordFlow. The input coordinates are passed through the CoordFlow layers in parallel, outputting RGB and alpha values. In this example there are only two layers, and the RGB output of each layer can be seen in the middle, in addition to the softmax value of the alphas. The softmax map acts similarly to an attention map, and we can see the background/foreground segmentation. At the far right is the ground truth frame, next to the final output of the model.
Figure 3: Comparison to pixel-wise models, NeRV and classic methods. Results taken from [lee2023ffnerv, kwan2024hinerv, kim2022scalable, maiya2023nirvana].
Figure 4: Visualization of the interaction between the flow and the color networks. The large image represents the canonical space at frame 100. It was created by sampling the color network with x and y values that extend beyond its expected sample area, giving a wider view. The expected sample areas for times 0, 100, and 200 are the marked boundaries. On the right are the ground truth frames, marked with the corresponding colors. In this example, CoordFlow treats the video as one large image and simply shifts the sample area of the canonical space, using the flow network, to create the current frame.
Figure 5: Upsampling of the boat video using CoordFlow (trained on every fourth pixel), bilinear, and nearest-neighbor interpolation. All methods were applied to the same down-sampled video.
...and 4 more figures
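
The layer fusion described in the Figure 2 caption can be sketched for a single pixel as follows. This is a minimal illustration under an assumption: the final color is taken to be the softmax-weighted sum of the per-layer RGB outputs, which the caption suggests but does not state. The function names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_layers(layer_outputs):
    """Fuse per-layer outputs for one pixel.

    layer_outputs: list of ((r, g, b), alpha), one entry per layer.
    Returns the fused color, the softmax weights, and the index of the
    dominant layer (usable as an unsupervised segmentation label).
    Assumption: fusion is a softmax(alpha)-weighted sum of RGB values.
    """
    weights = softmax([alpha for _, alpha in layer_outputs])
    fused = tuple(
        sum(w * rgb[c] for w, (rgb, _) in zip(weights, layer_outputs))
        for c in range(3)
    )
    segment = max(range(len(weights)), key=weights.__getitem__)
    return fused, weights, segment

# Two layers: a red "foreground" with a high alpha and a blue "background".
color, weights, label = fuse_layers([((1.0, 0.0, 0.0), 4.0),
                                     ((0.0, 0.0, 1.0), 0.0)])
print(color, weights, label)
```

Evaluating the weights over all pixels of a frame gives a per-layer map that plays the role of the attention-like softmax map in the caption, and taking the dominant layer per pixel gives a background/foreground split.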
