CAGE: Unsupervised Visual Composition and Animation for Controllable Video Generation

Aram Davtyan, Sepehr Sameni, Björn Ommer, Paolo Favaro

TL;DR

CAGE tackles the problem of controllable video generation without supervision by conditioning on sparse, self-supervised visual tokens (DINOv2) to jointly compose scenes and animate objects. It formulates generation as conditional flow matching in the latent space of a pretrained VQGAN, integrating controls via cross-attention and training with varied conditioning to enable both guided and from-scratch synthesis. The method introduces scale/position invariance through random crops, and uses classifier-free guidance to enable out-of-distribution generalization, achieving zero-shot transfer across domains. Experiments on CLEVRER, BAIR, and EPIC-KITCHENS demonstrate improved controllability and video quality relative to prior unsupervised approaches, with qualitative results showing versatile scene composition, cross-domain transfers, and longer-horizon generation capabilities.
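
The two ingredients named above, conditional flow matching and classifier-free guidance, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the common linear (optimal-transport) probability path from the flow matching literature, and a guidance rule in which `w = 0` returns the purely conditional prediction, which matches the "no guidance ($w = 0.0$)" wording in the Figure 4 caption. The function names and the `sigma_min` default are illustrative choices, not taken from the paper.

```python
import numpy as np

def cfm_training_pair(x1, rng, sigma_min=1e-4):
    """Build one flow-matching training example for a clean latent x1.

    Assumed path (not confirmed by this summary):
        x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1,  x0 ~ N(0, I)
    The regression target is the time derivative of x_t.
    """
    x0 = rng.standard_normal(x1.shape)          # noise endpoint of the path
    t = rng.uniform(0.0, 1.0)                   # random time in [0, 1]
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    target = x1 - (1.0 - sigma_min) * x0        # d x_t / d t
    return x_t, t, target

def cfm_loss(pred_velocity, target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((pred_velocity - target) ** 2))

def guided_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance on velocities (assumed form).

    w = 0 gives the conditional prediction; larger w pushes the
    prediction further away from the unconditional one.
    """
    return v_cond + w * (v_cond - v_uncond)
```

In a real model, `pred_velocity` would come from the network evaluated at `(x_t, t)` together with the past frames and the sparse controls; `v_uncond` would be the same network evaluated with the controls dropped.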

Abstract

The field of video generation has expanded significantly in recent years, with controllable and compositional video generation garnering considerable interest. Most methods rely on leveraging annotations such as text, objects' bounding boxes, and motion cues, which require substantial human effort and thus limit their scalability. In contrast, we address the challenge of controllable and compositional video generation without any annotations by introducing a novel unsupervised approach. Our model is trained from scratch on a dataset of unannotated videos. At inference time, it can compose plausible novel scenes and animate objects by placing object parts at the desired locations in space and time. The core innovation of our method lies in the unified control format and the training process, where video generation is conditioned on a randomly selected subset of pre-trained self-supervised local features. This conditioning compels the model to learn how to inpaint the missing information in the video both spatially and temporally, thereby learning the inherent compositionality of a scene and the dynamics of moving objects. The abstraction level and the imposed invariance of the conditioning input to minor visual perturbations enable control over object motion by simply using the same features at all the desired future locations. We call our model CAGE, which stands for visual Composition and Animation for video GEneration. We conduct extensive experiments to validate the effectiveness of CAGE across various scenarios, demonstrating its capability to accurately follow the control and to generate high-quality videos that exhibit coherent scene composition and realistic animation.

Paper Structure

This paper contains 17 sections, 5 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Scene composition and animation with CAGE on the CLEVRER and the BAIR datasets. CAGE is able to combine multiple object features from different source images and use them to compose and animate the scene in a controllable way. The selected features are shown as overlaying red patches. Blue patches in the controls correspond to the intended future locations of the objects. Notice the ability of the model to carefully adjust the appearances (e.g., sizes, shadows and lights) of the objects based on their location in the target layout. Use the Acrobat Reader to play the first images in the generated sequences as videos.
  • Figure 2: Overall pipeline of CAGE. The model takes all the colored frames and processes them equally and in parallel. The pipeline for a single frame ($x^i$, in red) is illustrated. CAGE is trained to predict the denoising direction for the future frames ($x^{i:i+2}$) in the CFM [lipman2022flow] framework, conditioned on the past frames (context and reference) and sparse random sets of DINOv2 [Oquab2023DINOv2LR] features. The frames communicate with each other via the Temporal Blocks while being separately processed by the Spatial Blocks. The controls are incorporated through Cross-Attention.
  • Figure 3: The process of selecting controls for conditioning. The image is first cropped and resized to a $224\times224$ resolution that can be fed to DINOv2 to obtain the spatial tokens. These tokens are then pasted back to the original location of the crop and sparsified. This is done to prevent overfitting to the position information that is present in DINOv2 features. In addition, calculating the features on crops of the image makes the model scale invariant. That is, at inference we are able to copy objects from the background and paste them to the foreground and vice versa. The model should be able to automatically figure out how to scale the objects according to their target position in the scene, as well as how to add other position-related textures (e.g., shadows).
  • Figure 4: The effect of CFG on the generalization to out-of-distribution controls on the BAIR dataset. While the robotic arm can be controlled with no guidance ($w = 0.0$), with larger $w$ the model is also able to move background objects that do not move on their own in the training data. Click on the images to play them as videos in Acrobat Reader.
  • Figure 5: Examples of cross-domain transfer. The features of the objects from images in the first column are borrowed to compose and animate the scenes in the CLEVRER dataset. Notice how CAGE resolves the domain gap and performs a reasonable transfer of the objects (in terms of shapes and colors). However, some objects with irregular shape and texture, such as the Rubik's dodecahedron, may turn into multiple objects when transferred. Click on the first images in the generated sequences to play them as videos in Acrobat Reader.
  • ...and 8 more figures
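
The crop-and-paste-back step described in the Figure 3 caption can be made concrete with a small geometric sketch. It only computes where the tokens of a square crop land in the original image; the feature extractor itself is left out. The sketch assumes a square crop and a 16 x 16 token grid, which is what a $224\times224$ input yields with a patch size of 14 (the patch size used by DINOv2); the function name and arguments are hypothetical.

```python
import numpy as np

def token_centers_in_image(top, left, size, grid=16):
    """Map a grid x grid token map computed on a square crop back to
    original-image pixel coordinates.

    top, left: pixel position of the crop's upper-left corner.
    size:      side length of the crop in original-image pixels.
    Returns an array of shape (grid, grid, 2) holding (y, x) centers.
    """
    cell = size / grid                       # footprint of one token in the image
    offsets = (np.arange(grid) + 0.5) * cell # token centers inside the crop
    yy, xx = np.meshgrid(top + offsets, left + offsets, indexing="ij")
    return np.stack([yy, xx], axis=-1)
```

Because the token grid has a fixed size regardless of the crop size, a small crop yields densely spaced tokens and a large crop yields widely spaced ones. This is one way to read the caption's point that computing features on crops decouples the features from absolute position and scale.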