CoordFlow: Coordinate Flow for Pixel-wise Neural Video Representation
Daniel Silver, Ron Kimmel
TL;DR
CoordFlow addresses the challenge of high-quality video reconstruction at low bitrates by introducing a pixel-wise INR that partitions a video into temporally coherent layers, each with its own motion compensation. The method employs an ensemble of CoordFlow layers: in each layer, a Flow Network learns a time-dependent similarity transform to a canonical space and a Color Network encodes RGB and $\alpha$, and fusing the layers yields an unsupervised segmentation. Empirically, CoordFlow achieves state-of-the-art performance among pixel-wise INRs and is competitive with leading frame-wise methods on the UVG and Boat datasets, while enabling upsampling, segmentation, inpainting, stabilization, and denoising. These results highlight the potential of motion-aware, layer-wise neural representations for flexible and scalable video compression, though training and inference remain slow and merit acceleration.
Abstract
In video compression, the pursuit of better quality at lower bit rates remains a long-standing goal. Recent developments have demonstrated the potential of Implicit Neural Representations (INRs) as a promising alternative to traditional transform-based methodologies. Video INRs can be roughly divided into frame-wise and pixel-wise methods according to the structure of the network output. While pixel-wise methods are better suited to upsampling and parallelization, frame-wise methods have demonstrated better performance. We introduce CoordFlow, a novel pixel-wise INR for video compression. It yields state-of-the-art results compared to other pixel-wise INRs and on-par performance compared to leading frame-wise techniques. The method is based on separating the visual information into visually consistent layers, each represented by a dedicated network that compensates for the layer's motion. When the layers are integrated, a byproduct is an unsupervised segmentation of the video sequence. Object motion trajectories are implicitly utilized to compensate for visual-temporal redundancies. Additionally, the proposed method provides inherent video upsampling, stabilization, inpainting, and denoising capabilities.
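
To make the layered design concrete, the following is a minimal NumPy sketch of a forward pass, not the authors' implementation. It assumes things the text does not specify: tiny randomly initialized MLPs with sine activations, a flow network that maps time $t$ to four similarity parameters (log-scale, rotation angle, 2D translation), a color network that sees only canonical coordinates, and a softmax over the per-layer $\alpha$ values as the fusion rule. All function names (`init_mlp`, `mlp`, `similarity_to_canonical`, `render`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random weights for a tiny MLP; sizes = [in, hidden, ..., out]."""
    return [(rng.normal(0.0, 0.5, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Forward pass with sine activations on hidden layers (an assumption)."""
    for W, b in params[:-1]:
        x = np.sin(x @ W + b)
    W, b = params[-1]
    return x @ W + b

def similarity_to_canonical(flow_params, xy, t):
    """Map pixel coordinates xy (N, 2) at time t into a layer's canonical space.

    The flow network outputs (log-scale, angle, tx, ty) for the given time.
    """
    log_s, theta, tx, ty = mlp(flow_params, np.array([[t]]))[0]
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return np.exp(log_s) * (xy @ R.T) + np.array([tx, ty])

def render(layers, xy, t):
    """Fuse all layers at time t; returns RGB (N, 3) and a label map (N,)."""
    rgbs, alphas = [], []
    for flow_params, color_params in layers:
        canon = similarity_to_canonical(flow_params, xy, t)
        out = mlp(color_params, canon)                  # (N, 4): RGB logits + alpha
        rgbs.append(1.0 / (1.0 + np.exp(-out[:, :3])))  # squash RGB into (0, 1)
        alphas.append(out[:, 3])
    rgbs, alphas = np.stack(rgbs), np.stack(alphas)     # (L, N, 3), (L, N)
    w = np.exp(alphas - alphas.max(axis=0))             # softmax over layers
    w /= w.sum(axis=0)
    rgb = (w[..., None] * rgbs).sum(axis=0)
    seg = w.argmax(axis=0)                              # dominant layer per pixel
    return rgb, seg

# Two layers, each a (flow network, color network) pair, on a 4x6 pixel grid.
layers = [(init_mlp([1, 16, 4]), init_mlp([2, 32, 4])) for _ in range(2)]
ys, xs = np.meshgrid(np.linspace(-1, 1, 4), np.linspace(-1, 1, 6), indexing="ij")
xy = np.stack([xs.ravel(), ys.ravel()], axis=1)
rgb, seg = render(layers, xy, t=0.5)
print(rgb.shape, seg.shape)
```

Because every pixel is queried independently by its coordinates, the same `render` call can be evaluated on a denser grid to upsample, and the per-pixel dominant layer in `seg` plays the role of the unsupervised segmentation mentioned above.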
