TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

Leigang Qu; Ziyang Wang; Na Zheng; Wenjie Wang; Liqiang Nie; Tat-Seng Chua

TL;DR

Video foundation models struggle with compositionality when generating scenes with multiple objects, relations, and dynamics. TTOM mitigates this with two components. First, it performs test-time optimization of lightweight parameters $\phi$, guided by an LLM-generated spatiotemporal layout, instead of editing latents per sample. Second, it maintains a parametric memory so that past optimizations can be reused in a streaming prompt setting. The core contributions are the layout-to-video alignment objective $L_{align}$, based on cross-attention to soft layout masks; a memory mechanism with Insert/Read/Update/Delete operations; and continual TTO that enhances transferability across prompts and sessions. Empirically, TTOM improves cross-modal alignment and semantic coherence on T2V-CompBench and VBench, demonstrating practical, scalable, and efficient compositional video generation without additional supervision.
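To make the idea of an alignment objective over soft layout masks concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's definition of $L_{align}$: the loss form (one minus the fraction of attention mass that falls inside the mask, averaged over frames), the sigmoid-edged box mask, and the names `soft_box_mask` and `alignment_loss` are all hypothetical.

```python
import math


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_box_mask(height, width, box, sharpness=20.0):
    """Return a height x width mask in (0, 1) that is close to 1 inside `box`.

    `box` is (x0, y0, x1, y1) in normalized [0, 1] coordinates.
    The sigmoid edges are an assumed way to make the mask soft.
    """
    x0, y0, x1, y1 = box
    mask = []
    for i in range(height):
        y = (i + 0.5) / height  # pixel-center coordinate
        in_y = _sigmoid(sharpness * (y - y0)) * _sigmoid(sharpness * (y1 - y))
        row = []
        for j in range(width):
            x = (j + 0.5) / width
            in_x = _sigmoid(sharpness * (x - x0)) * _sigmoid(sharpness * (x1 - x))
            row.append(in_x * in_y)
        mask.append(row)
    return mask


def alignment_loss(attn_frames, mask_frames, eps=1e-8):
    """Assumed loss: mean over frames of 1 - (attention inside mask / total attention)."""
    total = 0.0
    for attn, mask in zip(attn_frames, mask_frames):
        inside = sum(a * m for a_row, m_row in zip(attn, mask)
                     for a, m in zip(a_row, m_row))
        mass = sum(a for a_row in attn for a in a_row)
        total += 1.0 - inside / (mass + eps)
    return total / len(attn_frames)


# Usage: one frame, object expected in the top-left quadrant.
mask = soft_box_mask(8, 8, (0.0, 0.0, 0.5, 0.5))
attn_inside = [mask]                                         # attention follows the layout
attn_outside = [[[1.0 - m for m in row] for row in mask]]    # attention avoids it
loss_inside = alignment_loss(attn_inside, [mask])
loss_outside = alignment_loss(attn_outside, [mask])
```

In this toy setup the loss is lower when the attention map concentrates inside the layout box. In a real system the attention maps would come from the model's cross-attention layers and the loss would be minimized with respect to $\phi$ by gradient descent, which this sketch does not attempt.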

Abstract

Video Foundation Models (VFMs) exhibit remarkable visual generation performance but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relations). In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment. Rather than directly intervening on latents or attention per sample, as existing work does, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation in a streaming setting and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations such as insert, read, update, and delete. Notably, we find that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and VBench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework for achieving cross-modal alignment in compositional video generation on the fly.
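The memory operations named in the abstract (insert, read, update, delete) can be illustrated with a small key-value store. This is a hedged sketch only: the abstract does not specify how keys are formed, how entries are matched, or how the memory is bounded, so the token-overlap (Jaccard) lookup, the similarity threshold, the least-recently-used eviction, and the class name `ParametricMemory` are all assumptions made for illustration.

```python
from collections import OrderedDict


def _tokens(prompt):
    return set(prompt.lower().split())


def _jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class ParametricMemory:
    """Toy store mapping prompts to optimized parameter vectors (hypothetical design)."""

    def __init__(self, capacity=4, threshold=0.3):
        self.capacity = capacity      # assumed bound on stored entries
        self.threshold = threshold    # assumed minimum similarity for a hit
        self._store = OrderedDict()   # prompt -> (token set, params), oldest first

    def insert(self, prompt, params):
        if prompt in self._store:
            self.update(prompt, params)
            return
        if len(self._store) >= self.capacity:
            self.delete(next(iter(self._store)))  # evict least recently used
        self._store[prompt] = (_tokens(prompt), list(params))

    def read(self, prompt):
        """Return a copy of the params of the most similar stored prompt, or None."""
        query = _tokens(prompt)
        best_key, best_sim = None, 0.0
        for key, (toks, _) in self._store.items():
            sim = _jaccard(query, toks)
            if sim > best_sim:
                best_key, best_sim = key, sim
        if best_key is None or best_sim < self.threshold:
            return None
        self._store.move_to_end(best_key)  # mark as recently used
        return list(self._store[best_key][1])

    def update(self, prompt, params):
        toks, _ = self._store[prompt]
        self._store[prompt] = (toks, list(params))
        self._store.move_to_end(prompt)

    def delete(self, prompt):
        self._store.pop(prompt, None)

    def __len__(self):
        return len(self._store)


# Usage: a later, similar prompt reuses parameters optimized for an earlier one.
memory = ParametricMemory(capacity=2)
memory.insert("a cat left of a dog", [0.1, 0.2])
warm_start = memory.read("a cat left of a bird")   # similar prompt: hit
cold_start = memory.read("two cars racing")        # unrelated prompt: miss
```

In a streaming setting, a hit would serve as the initialization for test-time optimization on the new prompt, and the refined parameters would be written back with `update` or `insert`. A real system would more plausibly match prompts with learned embeddings than with token overlap.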
