AVControl: Efficient Framework for Training Audio-Visual Controls

Matan Ben-Yosef, Tavi Halperin, Naomi Ken Korem, Mohammad Salama, Harel Cain, Asaf Joseph, Anthony Chen, Urska Jelercic, Ofir Bibi

Abstract

Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image-based in-context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially-aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.
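To make the parallel-canvas idea more concrete, the sketch below shows one possible realization in PyTorch: the control (reference) tokens are concatenated with the latent tokens inside a self-attention block, and only low-rank LoRA branches are trainable while the backbone weights stay frozen. All class names, shapes, and the LoRA placement (here only on an output projection) are illustrative assumptions and do not reflect the released AVControl or LTX-2 implementation.

```python
# Minimal, assumption-laden sketch of parallel-canvas conditioning with a LoRA adapter.
# Shapes, module names, and where the LoRA is attached are hypothetical.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # backbone projection stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # LoRA starts as a no-op

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

class ParallelCanvasAttention(nn.Module):
    """Self-attention over generated tokens concatenated with control tokens."""
    def __init__(self, dim: int = 1024, heads: int = 16, rank: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        for p in self.attn.parameters():
            p.requires_grad = False          # attention backbone is frozen too
        # Only this LoRA branch is trainable in the sketch; a real setup would
        # typically wrap the q/k/v projections as well.
        self.out_lora = LoRALinear(nn.Linear(dim, dim), rank)

    def forward(self, x, control_tokens):
        # x:              (B, N, dim) latent audio-visual tokens being denoised
        # control_tokens: (B, M, dim) reference signal placed on the parallel canvas
        joint = torch.cat([x, control_tokens], dim=1)    # extra tokens in attention
        attended, _ = self.attn(joint, joint, joint)
        attended = self.out_lora(attended)
        return attended[:, : x.shape[1]]                 # keep only the generated-token outputs

# Usage: only LoRA parameters would be handed to the optimizer.
block = ParallelCanvasAttention()
x = torch.randn(2, 128, 1024)
ctrl = torch.randn(2, 128, 1024)
out = block(x, ctrl)                                     # (2, 128, 1024)
```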

Paper Structure

This paper contains 37 sections, 21 figures, and 5 tables.

Figures (21)

  • Figure 1: AVControl trains each control modality as a lightweight LoRA. Each column shows control input (top) and generated output (bottom), covering spatial controls, camera trajectory, motion, editing, and audio-visual generation.
  • Figure 2: Overview of AVControl. The reference signal is placed on a parallel canvas as additional tokens in self-attention. A LoRA adapter is the only trainable component; the backbone remains frozen.
  • Figure 3: Spatial concatenation for depth-guided generation. Each panel shows the input depth map (top) and the output from a concatenation-based LoRA (bottom). The model captures general scene semantics but fails to faithfully follow the spatial structure of the depth signal, motivating our adoption of the parallel canvas approach.
  • Figure 4: Qualitative comparison on the VACE Benchmark (depth and pose). Each triplet shows the control input, our result, and VACE \cite{jiang2025vace}. Our outputs show higher structural fidelity, consistent with Table \ref{tab:vbench_comparison}.
  • Figure 5: Partial gallery of control modalities. Each row pair shows control input (top) and generated output (bottom, blue border) across five sampled frames. Each modality is an independent LoRA trained in 200--15,000 steps. We provide additional video examples in the supplementary.
  • ...and 16 more figures