
NeuralLVC: Neural Lossless Video Compression via Masked Diffusion with Temporal Conditioning

Tiberio Uricchio, Marco Bertini

Abstract

While neural lossless image compression has advanced significantly with learned entropy models, lossless video compression remains largely unexplored in the neural setting. We present NeuralLVC, a neural lossless video codec that combines masked diffusion with an I/P-frame architecture to exploit temporal redundancy. Our I-frame model compresses individual frames using bijective linear tokenization that guarantees exact pixel reconstruction. The P-frame model compresses temporal differences between consecutive frames, conditioned on the previous decoded frame via a lightweight reference embedding that adds only 1.3% trainable parameters. Group-wise decoding enables controllable speed-compression trade-offs. Our codec is lossless in the input domain: for video, it reconstructs YUV420 planes exactly; for image evaluation, RGB channels are reconstructed exactly. Experiments on 9 Xiph CIF sequences show that NeuralLVC outperforms H.264 and H.265 lossless by a significant margin. We verify exact reconstruction through end-to-end encode-decode testing with arithmetic coding. These results suggest that masked diffusion with temporal conditioning is a promising direction for neural lossless video compression.

Paper Structure

This paper contains 23 sections, 1 equation, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: High-level overview of NeuralLVC. The first frame is coded independently, while later frames are coded from the current frame together with the previous decoded frame. Both branches use the same masked-diffusion entropy-modeling backbone and are finally compressed with arithmetic coding into a single lossless bitstream. Exact token mappings and the group-wise decoding strategy are described in the method section rather than embedded in the figure.
  • Figure 2: Grouping patterns for different $\delta$ values on an 8$\times$8 grid (32$\times$32 in practice). Each color represents a group of positions predicted in parallel. $\delta=0$ yields column-wise groups; $\delta=1$ produces diagonal bands with more groups and better compression; $\delta=2$ creates steeper diagonals. The number in each cell indicates the group index.
  • Figure 3: Temporal redundancy and compression cost (coastguard, Y channel). (a) Reference frame. (b) Temporal difference with the next frame (amplified $5{\times}$): most change occurs along the moving boat. (c) Per-patch compression rate of the P-frame (dark = high rate, bright = low rate): patches with large temporal differences require more bits, while static regions compress to ${\sim}$28%.
  • Figure 4: Per-frame compression rate on two CIF sequences with different motion levels (YUV420, all 300 frames, sampled every 5). Our method (solid) produces stable P-frame rates that consistently outperform H.265 lossless (dashed). H.265 exhibits large per-frame fluctuations due to its B-frame GOP structure (235 B, 63 P, 2 I frames), while our codec maintains near-constant rates without error accumulation.
  • Figure 5: Rate composition per video. The I-frame cost (dark) is amortized over $T$ frames and contributes less than 1% to the total. Compression is dominated by P-frame performance.
  • ...and 1 more figure
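The grouping pattern described in the Figure 2 caption can be sketched in a few lines. This is a hypothetical reconstruction based only on the caption: it assumes the group index of grid position $(r, c)$ is $c + \delta r$, which reproduces column-wise groups for $\delta=0$, anti-diagonal bands for $\delta=1$, and steeper diagonals for $\delta=2$; the paper's actual assignment rule may differ.

```python
def group_index(row: int, col: int, delta: int) -> int:
    """Assumed group assignment: delta=0 -> column-wise groups,
    delta=1 -> diagonal bands, delta=2 -> steeper diagonals."""
    return col + delta * row

def grouping(height: int, width: int, delta: int):
    """Grid of group indices plus the number of parallel decoding groups."""
    grid = [[group_index(r, c, delta) for c in range(width)]
            for r in range(height)]
    n_groups = width + delta * (height - 1)  # largest index + 1
    return grid, n_groups
```

On the 8$\times$8 illustration grid this gives 8 groups for $\delta=0$, 15 for $\delta=1$, and 22 for $\delta=2$, matching the caption's observation that larger $\delta$ yields more (and smaller) groups, i.e. more sequential decoding steps but finer conditioning and better compression.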