Reconstructing the Invisible: Video Frame Restoration through Siamese Masked Conditional Variational Autoencoder

Yongchen Zhou; Richard Jiang

TL;DR

This work addresses the restoration of missing information in video frames. It proposes SiamMCVAE, a Siamese Vision Transformer-based conditional variational autoencoder that exploits frame-to-frame similarity to reconstruct masked regions. The model combines two weight-sharing ViT encoders, a latent variable $\mathbf{Z}$ sampled via reparameterization, and a ViT decoder. It optimizes the objective $\max_{\phi,\theta} \mathbb{E}_{q_\phi(\mathbf{Z}|\mathbf{X}_1,\mathbf{X}_2)} \log p_\theta(\mathbf{R}|\mathbf{Z})$ under a constraint on the KL divergence, using the loss $\mathcal{L} = \mathcal{L}_{\mathrm{r}} + \beta \mathcal{L}_{\mathrm{KL}}$, where $\mathcal{L}_{\mathrm{r}}$ is the reconstruction term and $\beta$ weights the KL term. On the BDD100K driving dataset, SiamMCVAE consistently surpasses MAE, MAE-ST, and VideoMAE across masking ratios and frame gaps on PSNR, SSIM, and FSIM, which indicates robustness in real-world scenarios where much of a frame is missing. The results support combining siamese encoders with variational inference in a generative restoration framework for dynamic visual data, with practical implications for autonomous systems and surveillance. Ablations further indicate that reparameterization and a carefully chosen $\beta$ (e.g., $0.2$) improve restoration quality.
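The loss above can be illustrated with a minimal sketch. It assumes a diagonal Gaussian posterior, a standard normal prior, and a mean-squared-error reconstruction term; the summary does not state any of these, and the function names are hypothetical. Only the form $\mathcal{L}_{\mathrm{r}} + \beta \mathcal{L}_{\mathrm{KL}}$ and the example value $\beta = 0.2$ come from the text.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one draw per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions.

    Assumption: standard normal prior (not confirmed by the source).
    """
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def restoration_loss(recon, target, mu, log_var, beta=0.2):
    """L = L_r + beta * L_KL, with L_r taken as mean squared error (an assumption)."""
    recon_loss = sum((r - t) ** 2 for r, t in zip(recon, target)) / len(target)
    return recon_loss + beta * kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.5, -0.5], [0.0, 0.0]
    z = reparameterize(mu, log_var, rng)
    print("sampled z:", z)
    print("loss:", restoration_loss([0.9, 0.1], [1.0, 0.0], mu, log_var))
```

Because the sample is written as a deterministic function of `mu`, `log_var`, and external noise, gradients can flow through the sampling step in an autodiff framework. Under these assumptions, the KL term is zero when the posterior equals the prior (`mu = 0`, `log_var = 0`).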

Abstract

In computer vision, restoring missing information in video frames is a critical challenge, particularly in applications such as autonomous driving and surveillance systems. This paper introduces the Siamese Masked Conditional Variational Autoencoder (SiamMCVAE), which uses a siamese architecture with twin encoders based on vision transformers. This design improves the model's ability to infer lost content by capturing intrinsic similarities between paired frames. SiamMCVAE reconstructs missing elements in masked frames through variational inference, addressing issues that arise from camera malfunctions. Experimental results demonstrate the model's effectiveness in restoring missing information, thereby enhancing the resilience of computer vision systems. The use of Siamese Vision Transformer (SiamViT) encoders in SiamMCVAE shows promise for addressing real-world challenges in computer vision and for improving the adaptability of autonomous systems in dynamic environments.

Paper Structure (11 sections, 11 equations, 4 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Method
Experiments
Experiment Setup
Comparison with Prior Work
Model Robustness
Qualitative Analysis
Ablation Studies
Discussion
Conclusion

Figures (4)

Figure 1: Our SiamMCVAE architecture. The model is designed to restore missing information in video frames. It uses a siamese architecture with twin encoders built on vision transformers, which capture similarities between paired frames and so help reconstruct the missing content in masked frames. Variational principles are incorporated to let the model generate diverse and meaningful representations.
Figure 2: Performance comparison of different models across varying masking ratios. As the masking ratio increases, SiamMCVAE consistently outperforms the other models in restoring missing information within video frames.
Figure 3: Performance comparison across different frame gaps. SiamMCVAE consistently outperforms both MAE-ST and VideoMAE.
Figure 4: Comparative visualization of model outputs at a 90% masking ratio. The first column shows the masked video frames. The following columns show, from left to right, the outputs of MAE (He et al., 2022), MAE-ST (Feichtenhofer et al., 2022), VideoMAE (Tong et al., 2022), and our SiamMCVAE. The rightmost column shows the unaltered ground-truth frames.
