MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation

Chenhui Zhu, Yilu Wu, Shuai Wang, Gangshan Wu, Limin Wang

TL;DR

MotionRAG tackles the challenge of realistic motion in image-to-video generation by retrieving motion priors from a large video-text database and adapting them to a target image through Context-Aware Motion Adaptation (CAMA). A text-based retrieval stage identifies semantically relevant references, while a causal Transformer (Motion Context Transformer) performs in-context learning to produce adapted motion tokens that are injected into pretrained diffusion-based video generators via Motion-Adapter. The approach delivers consistent motion-quality gains across multiple base models and domains, with negligible inference overhead and strong zero-shot generalization by simply updating the retrieval database. This retrieval-augmented paradigm demonstrates practical benefits for open-domain video synthesis, enabling realistic dynamics without domain-specific fine-tuning.

Abstract

Image-to-video generation has made remarkable progress with the advancements in diffusion models, yet generating videos with realistic motion remains highly challenging. This difficulty arises from the complexity of accurately modeling motion, which involves capturing physical constraints, object interactions, and domain-specific dynamics that are not easily generalized across diverse scenarios. To address this, we propose MotionRAG, a retrieval-augmented framework that enhances motion realism by adapting motion priors from relevant reference videos through Context-Aware Motion Adaptation (CAMA). The key technical innovations include: (i) a retrieval-based pipeline that extracts high-level motion features using a video encoder and specialized resamplers to distill semantic motion representations; (ii) an in-context learning approach for motion adaptation implemented through a causal transformer architecture; (iii) an attention-based motion injection adapter that seamlessly integrates transferred motion features into pretrained video diffusion models. Extensive experiments demonstrate that our method achieves significant improvements across multiple domains and various base models, all with negligible computational overhead during inference. Furthermore, our modular design enables zero-shot generalization to new domains by simply updating the retrieval database without retraining any components. This research enhances the core capability of video generation systems by enabling the effective retrieval and transfer of motion priors, facilitating the synthesis of realistic motion dynamics.
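The text-based retrieval stage described above can be illustrated with a minimal sketch. It assumes that captions in the video-text database and the text prompt have already been embedded into vectors by some text encoder, and that references are ranked by cosine similarity; the summary does not specify the encoder or the similarity measure. The names `cosine_similarity` and `retrieve_top_k`, the toy captions, and the three-dimensional vectors are all hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query_embedding: Sequence[float],
    caption_embeddings: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (database index, score) for the k captions most similar to the query."""
    scored = [
        (index, cosine_similarity(query_embedding, embedding))
        for index, embedding in enumerate(caption_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy database: (caption, made-up caption embedding). A real system would
# obtain these vectors from a text encoder; the values here are assumptions.
database = [
    ("a person riding a horse across a field", [0.9, 0.1, 0.0]),
    ("water being poured into a teacup", [0.0, 0.2, 0.9]),
    ("a jockey on a galloping horse", [0.8, 0.3, 0.1]),
]

# Made-up embedding standing in for the prompt
# "an astronaut riding a horse on the moon".
query = [1.0, 0.2, 0.0]

top = retrieve_top_k(query, [embedding for _, embedding in database], k=2)
top_indices = [index for index, _ in top]
top_captions = [database[index][0] for index in top_indices]
```

In this toy setup the two horse-riding captions rank above the teacup caption, mirroring the cross-domain example in which horse-riding videos serve as references for an astronaut riding a horse. Because the database is only consulted at query time, swapping its contents changes which references are retrieved without touching any model weights, which is the property the abstract relies on for zero-shot generalization.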

Paper Structure

This paper contains 25 sections, 7 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Illustration of cross-domain motion transfer. Our approach retrieves videos of people riding horses and transfers their motion priors to generate an astronaut riding a horse on the moon, while preserving the appearance characteristics of the input image.
  • Figure 2: Our MotionRAG framework. Text prompts retrieve relevant videos from a database. Motion information from these references is adapted to the input image via our Motion Context Transformer, then injected into an image-to-video generator to produce the final output.
  • Figure 3: Context-Aware Motion Adaptation (CAMA) architecture. Appearance and motion features from retrieved videos and the target image are processed through a causal transformer, which learns to predict appropriate motion features for the target image through in-context learning.
  • Figure 4: Qualitative comparison between baseline models and our retrieval-augmented approach across diverse scenarios. Our method generates more physically plausible and coherent motion, such as realistic object physics and natural animal/human movements, and corrects the static results or artifacts found in baseline models. Video results are available at https://github.com/MCG-NJU/MotionRAG
  • Figure 5: Retrieval and generation examples. Each panel shows a different scenario: (top-left) metal balls suspended in air with pendulum-like motion, (top-right) a person pouring water into a teacup, (bottom-left) a man running on a dirt road, and (bottom-right) a person riding on a horse led by another person. For each example, the top row displays frames from our generated video, while the rows below show frames from retrieved reference videos. Note how our system extracts relevant motion patterns from visually different but semantically similar videos.
  • ...and 1 more figure
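The in-context setup summarized in the Figure 3 caption can be sketched as a token layout plus a causal attention mask. This is only an illustration under stated assumptions: the interleaved ordering (appearance then motion for each reference, with the target appearance last) and the use of a single token per feature are guesses made for clarity, not details confirmed by the summary. The helper names `build_context_sequence` and `causal_mask` are hypothetical.

```python
from typing import List


def build_context_sequence(num_references: int) -> List[str]:
    """Lay out one placeholder token per feature (assumed ordering).

    Each retrieved reference contributes an (appearance, motion) pair that acts
    as an in-context example; the target image contributes appearance only.
    """
    sequence: List[str] = []
    for i in range(num_references):
        sequence.append(f"appearance_ref{i}")
        sequence.append(f"motion_ref{i}")
    sequence.append("appearance_target")
    return sequence


def causal_mask(length: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


sequence = build_context_sequence(num_references=2)
mask = causal_mask(len(sequence))

# Tokens the final (target appearance) position is allowed to attend to.
visible_to_target = [
    token for token, allowed in zip(sequence, mask[-1]) if allowed
]
```

Under this layout the last position sees every reference pair before it, so a causal transformer reading the sequence could emit the target's motion features at that position, while earlier positions never see the target. How the actual model tokenizes features and orders the context may differ.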