
Anchoring and Rescaling Attention for Semantically Coherent Inbetweening

Tae Eun Choi, Sumin Shim, Junhyeok Kim, Seong Jae Hwang

Abstract

Generative inbetweening (GI) seeks to synthesize realistic intermediate frames between the first and last keyframes, going beyond mere interpolation. As sequences become sparser and motions larger, previous GI models struggle, producing inconsistent frames with unstable pacing and semantic misalignment. Because GI involves fixed endpoints and many plausible paths between them, the task requires additional guidance from the keyframes and text to specify the intended path. We therefore apply semantic and temporal guidance from the keyframes and text to each intermediate frame through Keyframe-anchored Attention Bias. We also enforce frame consistency more strongly with Rescaled Temporal RoPE, which allows self-attention to attend to keyframes more faithfully. TGI-Bench, the first benchmark specifically designed for text-conditioned GI evaluation, enables challenge-targeted evaluation to analyze GI models. Without additional training, our method achieves state-of-the-art frame consistency, semantic fidelity, and pace stability for both short and long sequences across diverse challenges.

Paper Structure (33 sections, 8 equations, 19 figures, 6 tables)


Figures (19)

  • Figure 1: We introduce a training-free approach to the task of generative inbetweening, which generates intermediate frames from the two keyframes and text. In (a), our method correctly recognizes the train and produces consistent and coherent frames. In (b), we improve semantic alignment between the text and generated frames, accurately capturing the 'counterclockwise' movement, in contrast to Wan.
  • Figure 2: Pace Stability Comparison. This figure compares the pace stability of Wan and our method against ground truth (GT). The paraglider’s motion is visualized by overlaying frames at the same uniformly sampled indices for GT, Wan, and Ours with the background aligned, marking sampled positions with red dots ($\bullet$) and displacements with black arrows ($\pmb{\dashleftarrow}$). Wan exhibits pace instability: the paraglider alternately accelerates and decelerates, producing uneven spacing, whereas our method closely matches the ground truth with smooth motion and stable pacing.
  • Figure 3: Overall Pipeline of Our Method. Our model is built upon a video DiT pipeline that consists of DiT blocks with self-attention and cross-attention layers. Left: Keyframe-anchored Attention Bias is applied to each condition's cross-attention; it aggregates the cross-attention maps from each keyframe to form keyframe anchors. These keyframe anchors are interpolated into frame-wise target anchors, which are used as a small logit bias to guide each intermediate frame. Right: We also introduce Rescaled Temporal RoPE, which increases the temporal RoPE scale at the edges and reduces it in the middle. As a result, edge frames place most of their attention on nearby frames, while middle frames spread their attention across a wider temporal range.
  • Figure 4: TGI-Bench. (a) One example from each challenge of our TGI-Bench is presented. For each example, the first and last frames of the video are shown along with its text description. (b) The distribution of challenges according to the number of frames is illustrated.
  • Figure 5: Qualitative Comparison with Baselines. (a) Our method outperforms prior works in all three target challenges: semantic fidelity, pace stability and frame consistency. Although Wan performs better than SVD-based models such as TRF, ViBiDSampler, GI and FCVG, it fails in one or more of these qualities. For instance, in (b), the dog marked by a yellow circle disappears to the left and suddenly reappears in the middle of the frame later in the sequence, showing semantic infidelity, while in (c), the person barely moves from frame 48 to frame 60, showing pace instability. In contrast, our method overcomes all three challenges.
  • ...and 14 more figures
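
The Figure 3 caption describes keyframe anchors being interpolated into frame-wise targets and then used as a small logit bias. The following is a minimal NumPy sketch of that idea only, not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names `frame_targets` and `biased_attention`, the use of linear interpolation, the additive form of the bias, the `strength` value, and the simplification of each anchor to a single vector over text tokens.

```python
import numpy as np


def frame_targets(anchor_first, anchor_last, num_frames):
    """Interpolate two keyframe anchors into one target anchor per frame.

    Assumption: linear interpolation; the caption only says "interpolated".
    anchor_first, anchor_last: arrays of shape (num_tokens,)
    returns: array of shape (num_frames, num_tokens)
    """
    # Weight 0 corresponds to the first keyframe, weight 1 to the last.
    w = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - w) * anchor_first[None, :] + w * anchor_last[None, :]


def biased_attention(logits, targets, strength=0.1):
    """Add a small bias to attention logits, then softmax over tokens.

    Assumption: the bias is added directly to the logits with a
    hypothetical scalar `strength`.
    logits, targets: arrays of shape (num_frames, num_tokens)
    """
    biased = logits + strength * targets
    # Subtract the row maximum for a numerically stable softmax.
    biased = biased - biased.max(axis=-1, keepdims=True)
    weights = np.exp(biased)
    return weights / weights.sum(axis=-1, keepdims=True)


# Toy usage: two text tokens, five frames, uniform base logits.
first = np.array([1.0, 0.0])   # first keyframe attends to token 0
last = np.array([0.0, 1.0])    # last keyframe attends to token 1
targets = frame_targets(first, last, num_frames=5)
attn = biased_attention(np.zeros((5, 2)), targets, strength=1.0)
```

In this toy setup, frames near the first keyframe lean toward token 0 and frames near the last keyframe lean toward token 1, with the middle frame balanced between them. A real video DiT would apply such a bias per spatial query and per attention head, which this sketch omits.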