RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism

Elia Peruzzo, Dejia Xu, Xingqian Xu, Humphrey Shi, Nicu Sebe

TL;DR

This work tackles the challenge of motion realism in text-to-video diffusion by introducing RagMe, a Retrieval Augmented Generation framework that conditions generation on externally retrieved videos. The approach integrates a CLIP-based retrieval mechanism (RM), a temporal transformer-based conditioning module (RagCA), and a novel RAG-based noise initialization (RagInit) to guide motion without copying. Empirical results on WebVid10M and VBench show improved motion dynamics and reduced FVD, at modest latency costs, along with demonstrated applicability to motion transfer via flexible retrieval databases. The method provides a practical, plug-in augmentation to existing T2V diffusion models, with potential for specialization via task-specific databases and broader architectural generalization.

Abstract

Video generation is experiencing rapid growth, driven by advances in diffusion models and the development of better and larger datasets. However, producing high-quality videos remains challenging due to the high-dimensional data and the complexity of the task. Recent efforts have primarily focused on enhancing visual quality and addressing temporal inconsistencies, such as flickering. Despite progress in these areas, the generated videos often fall short in terms of motion complexity and physical plausibility, with many outputs either appearing static or exhibiting unrealistic motion. In this work, we propose a framework to improve the realism of motion in generated videos, exploring a complementary direction to much of the existing literature. Specifically, we advocate for the incorporation of a retrieval mechanism during the generation phase. The retrieved videos act as grounding signals, providing the model with demonstrations of how the objects move. Our pipeline is designed to apply to any text-to-video diffusion model, conditioning a pretrained model on the retrieved samples with minimal fine-tuning. We demonstrate the superiority of our approach through established metrics, recently proposed benchmarks, and qualitative results, and we highlight additional applications of the framework.
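The retrieval step described above can be illustrated with a minimal sketch. It assumes the prompt and every database video have already been embedded into a shared space (for example by a CLIP-style encoder); random vectors stand in for those embeddings here. The function names `l2_normalize` and `retrieve_top_k`, the database size, and the embedding width are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def retrieve_top_k(query_emb, db_embs, k=5):
    """Return indices and cosine similarities of the k closest database entries.

    query_emb: (D,) embedding of the text prompt (assumed precomputed).
    db_embs:   (N, D) one embedding per database video (assumed precomputed).
    """
    q = l2_normalize(query_emb)
    db = l2_normalize(db_embs)
    sims = db @ q                    # (N,) cosine similarity to the query
    top = np.argsort(-sims)[:k]      # indices sorted by descending similarity
    return top, sims[top]


# Toy data: random vectors stand in for real encoder outputs.
rng = np.random.default_rng(0)
db_embs = rng.normal(size=(100, 32))
# Build a query that is a slightly perturbed copy of entry 42.
query_emb = db_embs[42] + 0.01 * rng.normal(size=32)

indices, scores = retrieve_top_k(query_emb, db_embs, k=5)
print(indices, scores)
```

In this toy setup the perturbed entry ranks first. For a large database, an approximate nearest-neighbour index would typically replace the exhaustive dot product.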

Paper Structure

This paper contains 26 sections, 7 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: We evaluate the Fréchet Video Distance (FVD) using the captions and videos from the validation set of the WebVid10M [Bain21] dataset. We plot it against the cosine similarity with respect to the retrieved examples in the DINOv2 embedding space. Ideally, the best model should produce high-quality videos (indicated by low FVD) while avoiding direct copying from the grounding examples (indicated by low cosine similarity).
  • Figure 2: Pipeline of RagMe. (a) We show a general T2V pipeline with RAG capabilities. Given a textual prompt, we retrieve related videos from a database and use them to enhance the generation capabilities of a T2V model. (b) We detail the specific implementation. Each frame of the retrieved videos is encoded using CLIP and then processed by a transformer-based temporal enhancer module to obtain the final conditioning vector. This vector is used to condition a T2V model through cross-attention layers. Each video is color-coded, with different frames represented by varying shades of the base color.
  • Figure 3: We compare the effect of different retrieval databases on the person-related subset of VBench [huang2023vbench], retrieving from either Kinetics [kay2017kinetics] or WebVid10M [Bain21].
  • Figure 4: We study the impact of the number of retrieved samples $K$ on the trade-off between FVD and cosine similarity. We select $K=5$ as a good balance between the two.
  • Figure 5: Visual comparison of the different methods. We report the prompt at the bottom.
  • ...and 5 more figures
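
The conditioning path described in Figure 2(b) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's architecture: random arrays stand in for CLIP frame embeddings, the transformer-based temporal enhancer is replaced by plain mean pooling over frames, each retrieved video is assumed to yield one conditioning token, and the cross-attention is a single head with random projection weights. The names `temporal_pool` and `cross_attention` and all dimensions other than $K=5$ are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def temporal_pool(frame_embs):
    """Stand-in for the temporal enhancer: average each video over its frames.

    frame_embs: (K, F, D) -> returns (K, D), one conditioning token per video.
    The paper uses a transformer here; mean pooling is a simplification.
    """
    return frame_embs.mean(axis=1)


def cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: latents attend to the conditioning tokens."""
    q = latent_tokens @ w_q                      # (N, H) queries from the latents
    k = cond_tokens @ w_k                        # (K, H) keys from retrieved videos
    v = cond_tokens @ w_v                        # (K, H) values from retrieved videos
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (N, K) scaled dot products
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
K, F, D = 5, 8, 32     # retrieved videos, frames per video, embedding width
N, C, H = 16, 24, 12   # latent tokens, latent width, attention head width

frame_embs = rng.normal(size=(K, F, D))   # stand-in for per-frame CLIP embeddings
cond_tokens = temporal_pool(frame_embs)   # (K, D)
latent_tokens = rng.normal(size=(N, C))   # stand-in for T2V model activations

# Random projections; in a real model these would be learned.
w_q = rng.normal(size=(C, H)) / np.sqrt(C)
w_k = rng.normal(size=(D, H)) / np.sqrt(D)
w_v = rng.normal(size=(D, H)) / np.sqrt(D)

attended, weights = cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v)
print(attended.shape, weights.shape)
```

Each row of `weights` shows how strongly one latent token draws on each of the $K$ retrieved videos, which is the mechanism that lets retrieved motion inform generation.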