TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

Leigang Qu, Ziyang Wang, Na Zheng, Wenjie Wang, Liqiang Nie, Tat-Seng Chua

TL;DR

Video foundation models struggle with compositionality when generating scenes with multiple objects, relations, and dynamics. TTOM mitigates this by performing test-time optimization of lightweight parameters $\phi$ guided by an LLM-generated spatiotemporal layout and by maintaining a parametric memory to reuse past optimizations in a streaming prompt setting, avoiding per-sample latent edits. The core contributions are the layout-to-video alignment objective $L_{align}$ based on cross-attention to soft layout masks, a memory mechanism with Insert/Read/Update/Delete capabilities, and continual TTO that enhances transferability across prompts and sessions. Empirically, TTOM improves cross-modal alignment and semantic coherence on T2V-CompBench and VBench, demonstrating practical, scalable, and efficient compositional video generation without additional supervision.
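As a rough illustration of what a cross-attention-to-layout objective can look like, the sketch below scores how much of one object token's attention mass falls inside its soft layout mask in a single frame. This is a common form of layout guidance and is an assumption here, not the paper's definition of $L_{align}$; the function name `alignment_loss`, the `eps` term, and the toy arrays are all hypothetical.

```python
def alignment_loss(attn, mask, eps=1e-8):
    """Return 1 - (attention mass inside the soft mask) / (total attention mass).

    attn: 2D list of non-negative cross-attention weights for one token.
    mask: 2D list of soft layout-mask values in [0, 1], same shape as attn.
    NOTE: an assumed, generic formulation; the paper's L_align may differ.
    """
    inside = 0.0
    total = 0.0
    for attn_row, mask_row in zip(attn, mask):
        for a, m in zip(attn_row, mask_row):
            inside += a * m  # attention weighted by the layout mask
            total += a       # all attention for this token
    return 1.0 - inside / (total + eps)


# Toy 2x3 attention map and a mask covering the middle column.
attn = [[0.1, 0.4, 0.0],
        [0.0, 0.4, 0.1]]
mask = [[0.0, 1.0, 0.0],
        [0.0, 1.0, 0.0]]
loss = alignment_loss(attn, mask)
```

In a test-time optimization loop, a loss of this kind would be minimized with respect to the lightweight parameters $\phi$ rather than the latents, so the value drops toward zero as attention concentrates inside the planned region.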

Abstract

Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relations). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than intervening directly on latents or attention per sample, as in existing work, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations such as insert, read, update, and delete. Notably, we find that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and VBench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework for achieving cross-modal alignment in compositional video generation on the fly.
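The four memory operations named in the abstract can be pictured with a small key-value store. The sketch below is only a toy under stated assumptions: string keys, parameters held as plain lists of floats, and first-in-first-out eviction are all hypothetical choices, and the class name `ParametricMemory` is not taken from the paper.

```python
class ParametricMemory:
    """Toy store mapping a prompt/layout key to optimized parameters.

    Assumptions (not from the paper): string keys, list-of-float
    parameters, and FIFO eviction once `capacity` is reached.
    """

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.store = {}  # dicts preserve insertion order in Python 3.7+

    def insert(self, key, params):
        # Evict the oldest entry when full and the key is new.
        if key not in self.store and len(self.store) >= self.capacity:
            self.delete(next(iter(self.store)))
        self.store[key] = list(params)

    def read(self, key):
        # Return a copy so callers cannot mutate stored parameters.
        params = self.store.get(key)
        return None if params is None else list(params)

    def update(self, key, params):
        if key not in self.store:
            raise KeyError(key)
        self.store[key] = list(params)

    def delete(self, key):
        self.store.pop(key, None)


memory = ParametricMemory(capacity=2)
memory.insert("left-to-right motion", [0.0, 0.0])
init = memory.read("left-to-right motion")       # warm start for a new prompt
optimized = [p + 0.1 for p in init]              # stand-in for a TTO step
memory.update("left-to-right motion", optimized)
memory.insert("two objects", [1.0])
memory.insert("object above object", [2.0])      # evicts the oldest key
```

In a streaming setting, a read that returns stored parameters gives the next prompt a warm start, while a miss (`None`) falls back to optimizing from scratch and inserting the result.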

Paper Structure

This paper contains 15 sections, 3 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Current video generative models [wan2025wan] still suffer from text-video misalignment problems in compositional scenarios. We introduce a test-time optimization and memorization method that substantially enhances alignment while maintaining high visual fidelity.
  • Figure 2: Overview of the TTOM framework for compositional text-to-video generation. A stream of text prompts is first fed into LLMs for spatial-temporal layout planning. Meanwhile, a denoising sampling process of video foundation models is performed, in which cross-attention maps are extracted, followed by test-time optimization for alignment. Historical optimization context is maintained by the parametric memory.
  • Figure 3: Attention-layout overlap (evaluated by mIoU [everingham2010pascal] over 200 prompts) between cross-modal attention maps extracted from each layer of foundation models and segmentation maps detected from generated videos by GroundingDINO [liu2024grounding] + SAM 2 [ravi2024sam].
  • Figure 4: Qualitative results of motion pattern transfer with memory. Solid arrows indicate insert or update operations, while dotted arrows represent reading and loading parameters from memory into foundation models for inference.
  • Figure 5: Qualitative comparison between the foundation model, the baseline, and our method on T2V-CompBench.
  • ...and 1 more figure
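The attention-layout overlap in Figure 3 is reported as mIoU. A minimal sketch of that measurement follows, assuming each attention map has already been normalized to [0, 1] and is binarized at a fixed threshold before comparison with a binary segmentation mask. The threshold value, the helper names `iou` and `mean_iou`, and the toy data are hypothetical; the paper's exact protocol is not given in this section.

```python
def iou(attn, seg, threshold=0.5):
    """Intersection over union between a thresholded attention map and a mask.

    attn: 2D list of attention values, assumed normalized to [0, 1].
    seg:  2D list of 0/1 segmentation labels, same shape as attn.
    """
    inter = 0
    union = 0
    for attn_row, seg_row in zip(attn, seg):
        for a, s in zip(attn_row, seg_row):
            pred = a >= threshold   # binarize attention (assumed step)
            gt = bool(s)
            inter += int(pred and gt)
            union += int(pred or gt)
    # Two empty regions are treated as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs, threshold=0.5):
    """Average IoU over (attention map, segmentation mask) pairs."""
    scores = [iou(attn, seg, threshold) for attn, seg in pairs]
    return sum(scores) / len(scores)


pairs = [
    ([[0.9, 0.2], [0.7, 0.1]], [[1, 0], [0, 0]]),  # IoU = 1/2
    ([[0.6, 0.6]], [[1, 1]]),                      # IoU = 2/2
]
score = mean_iou(pairs)
```

Computing this score per layer, as the caption describes, would indicate which cross-attention layers track object locations most closely.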