TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets

Zhixuan Liu, Peter Schaldenbrand, Yijun Li, Long Mai, Aniruddha Mahapatra, Cusuh Ham, Jean Oh, Jui-Hsien Wang

Abstract

We present TokenDial, a framework for continuous, slider-style attribute control in pretrained text-to-video generation models. While modern generators produce videos of strong overall quality, they offer limited control over how much an attribute changes (e.g., effect intensity or motion magnitude) without drift in identity, background, or temporal coherence. TokenDial is built on the observation that additive offsets in the intermediate spatiotemporal visual patch-token space form a semantic control direction, where adjusting the offset magnitude yields coherent, predictable edits for both appearance and motion dynamics. We learn attribute-specific token offsets without retraining the backbone, using signals from pretrained understanding models: semantic direction matching for appearance and motion-magnitude scaling for motion. We demonstrate TokenDial's effectiveness on diverse attributes and prompts, achieving stronger controllability and higher-quality edits than state-of-the-art baselines, supported by extensive quantitative evaluation and human studies.
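
The core mechanism described above can be sketched in a few lines of PyTorch: a learnable offset per intermediate spatiotemporal patch token, added to the frozen backbone's activations and scaled by a user-chosen slider value. This is a minimal illustrative sketch, not the authors' implementation; the class name, tensor shapes, and injection point are assumptions.

import torch
import torch.nn as nn

class TokenOffsetSlider(nn.Module):
    # Sketch: a learnable offset added to intermediate spatiotemporal patch
    # tokens of a frozen text-to-video DiT, scaled by a slider value.
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        # One learnable offset vector per spatiotemporal patch token.
        self.offset = nn.Parameter(torch.zeros(num_tokens, dim))

    def forward(self, tokens: torch.Tensor, strength: float) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) intermediate tokens from the frozen backbone.
        # strength = 0 reproduces the original generation; larger values push the
        # attribute further along the learned semantic direction.
        return tokens + strength * self.offset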

Paper Structure

This paper contains 24 sections, 23 equations, 18 figures, 7 tables.

Figures (18)

  • Figure 1: Explicit spatiotemporal masking. (top) Composable masking: localize different sliders to different concepts (person vs. campfire) and compose them in one video. (bottom) Spatial/temporal masking: leveraging TokenDial and soft masks, we can easily make only the top portion of the ink videos redder and create a gradient effect, or make aurora brighter only towards the end of the video.
  • Figure 2: Overview of TokenDial. We inject learnable spatiotemporal token offsets into intermediate video patch tokens of a frozen text-to-video DiT. Offsets are trained with external understanding models: appearance via semantic direction matching and motion via motion-magnitude scaling.
  • Figure 3: (a) TokenDial's learned token offsets transfer zero-shot across video resolutions and lengths. (b) TokenDial composes attributes by combining offsets, enabling independent control along multiple sliders (e.g., ink "redder" and "more diluted"); a minimal composition sketch follows the figure list.
  • Figure 4: Generalization to Wan. TokenDial transfers to the Wan 2.1 backbone, enabling continuous appearance sliders by injecting offsets into Wan’s feature stream. Examples show a "more kitten" slider (cat) and a "more furry" slider (dog).
  • Figure 5: Semantic debiasing. "Older" edits learned from InternVideo2 can also increase body weight (b); debiasing removes this coupling (c).
  • ...and 13 more figures
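
Figures 1 and 3 describe localizing sliders with soft spatiotemporal masks and composing several sliders in one video. Below is a minimal sketch of how masked, composed offsets might be applied to the intermediate tokens; the function name, shapes, and the temporal ramp mask are illustrative assumptions, not the paper's code.

import torch

def apply_composed_offsets(tokens, offsets, strengths, masks=None):
    # tokens:    (batch, num_tokens, dim) intermediate patch tokens
    # offsets:   list of (num_tokens, dim) learned per-attribute offsets
    # strengths: list of slider values, one per attribute
    # masks:     optional list of (num_tokens,) soft masks in [0, 1] (entries may be None)
    edited = tokens
    for i, (offset, s) in enumerate(zip(offsets, strengths)):
        delta = s * offset
        if masks is not None and masks[i] is not None:
            delta = delta * masks[i].unsqueeze(-1)   # localize the slider in space/time
        edited = edited + delta                      # offsets compose additively
    return edited

# Example soft mask: a temporal ramp so an edit (e.g., brighter aurora) grows
# towards the end of the video; T frames, each contributing H*W patch tokens.
T, H, W = 16, 32, 32
time_ramp = torch.linspace(0.0, 1.0, T).repeat_interleave(H * W)   # shape (T*H*W,)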