Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation

Junyoung Seo, Rodrigo Mira, Alexandros Haliassos, Stella Bounareli, Honglie Chen, Linh Tran, Seungryong Kim, Zoe Landgraf, Jie Shen

TL;DR

Identity drift limits long-form audio-driven animation with diffusion transformers. Lookahead Anchoring relocates keyframes to a future lookahead distance $D$, turning them into directional guidance and enabling self-keyframing via $I_{\text{ref}}$ without extra keyframe generation. The approach, validated across three DiTs, improves lip synchronization, identity preservation, and visual quality on HDTF and AVSpeech, with a dedicated fine-tuning strategy to learn distance-dependent conditioning. It supports narrative-driven generation by leveraging external image models for distant anchors, offering a practical path to high-quality, arbitrarily long audio-driven human animation. The method strikes a practical balance between expressivity and identity fidelity and generalizes across architectures, datasets, and long-sequence generation tasks.

Abstract

Audio-driven human animation models often suffer from identity drift during temporal autoregressive generation, where characters gradually lose their identity over time. One solution is to generate keyframes as intermediate temporal anchors that prevent degradation, but this requires an additional keyframe generation stage and can restrict natural motion dynamics. To address this, we propose Lookahead Anchoring, which leverages keyframes from future timesteps ahead of the current generation window, rather than within it. This transforms keyframes from fixed boundaries into directional beacons: the model continuously pursues these future anchors while responding to immediate audio cues, maintaining consistent identity through persistent guidance. This also enables self-keyframing, where the reference image serves as the lookahead target, eliminating the need for keyframe generation entirely. We find that the temporal lookahead distance naturally controls the balance between expressivity and consistency: larger distances allow for greater motion freedom, while smaller ones strengthen identity adherence. When applied to three recent human animation models, Lookahead Anchoring achieves superior lip synchronization, identity preservation, and visual quality, demonstrating improved temporal conditioning across several different architectures. Video results are available at the following link: https://lookahead-anchoring.github.io.
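
The placement of the anchor relative to each generation window can be illustrated with a small scheduling sketch. It is not the authors' implementation: the names `Segment` and `plan_segments` are hypothetical, the distance is assumed to be measured from the last frame of the window, and windows are assumed not to overlap (real segment-wise pipelines often carry over a few motion frames). It only shows that the anchor always receives a temporal position beyond the frames being generated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: int       # first generated frame index (inclusive)
    end: int         # one past the last generated frame index (exclusive)
    anchor_pos: int  # temporal position assigned to the lookahead anchor


def plan_segments(total_frames: int, window: int, lookahead: int) -> List[Segment]:
    """Split a video into consecutive windows and place one anchor per window.

    Assumption: the anchor sits `lookahead` frames after the last frame of
    the window, so it never coincides with a generated frame.
    """
    if total_frames <= 0 or window <= 0:
        raise ValueError("total_frames and window must be positive")
    if lookahead < 1:
        raise ValueError("lookahead must be >= 1 so the anchor lies beyond the window")

    segments: List[Segment] = []
    start = 0
    while start < total_frames:
        end = min(start + window, total_frames)
        # The anchor position moves forward with every window, so it stays
        # a fixed distance ahead instead of becoming a window endpoint.
        segments.append(Segment(start, end, (end - 1) + lookahead))
        start = end
    return segments


if __name__ == "__main__":
    for seg in plan_segments(total_frames=10, window=4, lookahead=3):
        print(seg)
```

Under self-keyframing as described in the abstract, the image placed at `anchor_pos` would be the reference image for every window, so no separate keyframe generation stage is needed. A larger `lookahead` corresponds to the looser, more expressive setting, and a smaller one to stronger identity adherence.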

Paper Structure

This paper contains 41 sections, 7 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Lookahead Anchoring enables robust long-form audio-driven animation. While autoregressive generation with HunyuanAvatar chen2025hunyuanvideo (left) and OmniAvatar xu2023omniavatar (right) progressively loses character identity and lip sync quality, our approach maintains both throughout extended generation. We provide video results in https://lookahead-anchoring.github.io.
  • Figure 2: Motivation. (a) We depart from the convention of using conditional keyframes as generation window endpoints. Instead, we reposition keyframes as temporally distant anchors beyond the window, decoupling them from the actual generated sequence. This eliminates constraints such as audio synchronization requirements while enabling flexible conditioning. (b) Models naturally learn that longer temporal distances allow for greater scene variation. We exploit this prior strategically: distant keyframes provide high-level guidance without imposing strict physical constraints, enabling diverse yet coherent generation.
  • Figure 3: Exploration of distant frame relationships in a pretrained video DiT xu2023omniavatar. Given a conditional frame, we generate separate two-frame videos with artificially increased temporal gaps. Testing beyond the training distribution naturally degrades visual quality but reveals adaptive motion behavior. We propose fine-tuning to harness this observed temporal structure.
  • Figure 4: Qualitative results. We compare our method with three audio-conditioned DiT baselines under the temporal segment-wise autoregressive framework on AVSpeech ephrat2018looking and HDTF zhang2021flow, presenting mid-sequence frames to demonstrate generation quality. Video results are available at https://lookahead-anchoring.github.io.
  • Figure 5: Performance over time. We report FID computed with 1-second sliding windows, normalized relative to the first window.
  • ...and 7 more figures
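
The windowing and normalization mentioned in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `sliding_windows` and `normalize_to_first` are hypothetical, the stride between windows is a free parameter the caption does not specify, and "normalized relative to the first window" is assumed to mean division by the first window's value. The FID computation itself is not shown.

```python
from typing import List, Sequence, Tuple


def sliding_windows(num_frames: int, fps: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges for 1-second windows.

    Each window spans `fps` frames; `stride` is the hop between window
    starts (an assumption, since the caption does not state it).
    """
    if fps <= 0 or stride <= 0:
        raise ValueError("fps and stride must be positive")
    return [(s, s + fps) for s in range(0, num_frames - fps + 1, stride)]


def normalize_to_first(values: Sequence[float]) -> List[float]:
    """Divide every per-window score by the score of the first window."""
    if not values:
        raise ValueError("values must be non-empty")
    first = values[0]
    if first == 0:
        raise ValueError("first window score is zero; cannot normalize")
    return [v / first for v in values]


if __name__ == "__main__":
    windows = sliding_windows(num_frames=75, fps=25, stride=25)
    print(windows)
    # Placeholder per-window scores, not results from the paper.
    print(normalize_to_first([10.0, 20.0, 25.0]))
```

With this convention a value above 1 means the window scores worse than the start of the video, so a flat curve near 1 corresponds to quality that does not degrade over time.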