AKiRa: Augmentation Kit on Rays for optical video generation

Xi Wang, Robin Courant, Marc Christie, Vicky Kalogeiton

TL;DR

AKiRa introduces an optical video generation framework that endows diffusion-based backbones with explicit camera and optical controls by learning a camera adapter on top of a frozen backbone. It couples an extended pinhole camera model, a Plücker-ray representation, and an aperture map with a data-augmentation kit to disentangle motion, focal length, distortion, and depth-of-field effects. The approach delivers cinematic capabilities such as dolly zoom, fisheye distortion, and adjustable bokeh while maintaining high video quality and temporal coherence, outperforming state-of-the-art methods across multiple backbones and datasets. By providing scalable evaluation via FlowSim and a comprehensive qualitative study, AKiRa demonstrates a practical, optically coherent path toward controllable video generation with broad creative and research implications.
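
To make the Plücker-ray representation mentioned above concrete, here is a minimal sketch of how a per-pixel ray embedding can be computed from a plain pinhole camera. It is an illustration under stated assumptions, not the paper's implementation: the function names (`plucker_ray`, `plucker_map`), the camera-to-world convention, and the (moment, direction) channel ordering are all hypothetical, and the distortion and aperture terms of the extended model are not covered.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def matvec(R, v):
    """Multiply a 3x3 matrix (nested tuples or lists) by a 3-vector."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))


def plucker_ray(u, v, fx, fy, cx, cy, R_c2w, origin):
    """Return the 6-D Pluecker embedding (o x d, d) of the ray through pixel (u, v).

    Assumptions (hypothetical, for illustration): a pinhole camera with focal
    lengths fx, fy and principal point (cx, cy); R_c2w rotates camera-space
    directions into world space; origin is the camera centre in world space.
    """
    # Back-project the pixel to a direction in camera space.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate into world space and normalise to unit length.
    d_world = matvec(R_c2w, d_cam)
    norm = math.sqrt(sum(c * c for c in d_world))
    d = tuple(c / norm for c in d_world)
    # The moment o x d encodes where the ray sits in space.
    m = cross(origin, d)
    return m + d


def plucker_map(width, height, fx, fy, cx, cy, R_c2w, origin):
    """Build a height x width grid of 6-D ray embeddings, sampled at pixel centres."""
    return [[plucker_ray(x + 0.5, y + 0.5, fx, fy, cx, cy, R_c2w, origin)
             for x in range(width)]
            for y in range(height)]
```

In this sketch the focal length enters only through the ray directions, so changing `fx` and `fy` changes the whole ray map. That is one way a ray-based conditioning signal can carry both camera motion (through `R_c2w` and `origin`) and focal length.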

Abstract

Recent advances in text-conditioned video diffusion have greatly improved video quality. However, these methods give users limited, or sometimes no, control over camera aspects, including dynamic camera motion, zoom, lens distortion, and focus shifts. These motion and optical aspects are crucial for adding controllability and cinematic elements to generation frameworks, ultimately resulting in visual content that draws focus, enhances mood, and guides emotions according to filmmakers' controls. In this paper, we aim to close the gap between controllable video generation and camera optics. To achieve this, we propose AKiRa (Augmentation Kit on Rays), a novel augmentation framework that builds and trains a camera adapter with a complex camera model over an existing video generation backbone. It enables fine-tuned control over camera motion as well as complex optical parameters (focal length, distortion, aperture) to achieve cinematic effects such as zoom, fisheye effect, and bokeh. Extensive experiments demonstrate AKiRa's effectiveness in combining and composing camera optics while outperforming all state-of-the-art methods. This work sets a new landmark in controlled and optically enhanced video generation, paving the way for future optical video generation methods.

Paper Structure

This paper contains 32 sections, 14 equations, 12 figures, 6 tables, 1 algorithm.

Figures (12)

  • Figure 1: While current state-of-the-art video generation approaches offer users limited control over camera motion, we propose a dedicated data augmentation framework, AKiRa, to train an optical video generation model that provides users with a panel of controls over camera motion (top row), camera focal length (second row), lens distortion (third row), and bokeh (camera aperture and in/out-of-focus regions, bottom row). See more on our project page: https://www.lix.polytechnique.fr/vista/projects/2024_akira_wang.
  • Figure 2: Overview of AKiRa training. The camera adapter is trained by jointly augmenting camera data and frames using AKiRa augmentations. The adapter processes multiple camera parameters—motion, focal length, distortion, aperture, and focus point. This adapter is integrated into a pre-trained, frozen backbone, resulting in an optical video generation model.
  • Figure 3: Optical effect overview. Visualization of the optical effects supported by our system (zoom, distortion, and bokeh) and their impact on both the camera parameters (top row) and the visual output (bottom row). In addition, as with state-of-the-art techniques, we enable control of the camera motion (not displayed here).
  • Figure 4: Qualitative results of AKiRa on the AnimateDiff [guo2023animatediff] and SVD [blattmann2023svd] backbones. We recommend viewing the supplementary video.
  • Figure 5: Difference between zoom and push forward. Zooming (change of focal length) is similar to image cropping and resizing while pushing forward changes the perspective of the scene.
  • ...and 7 more figures
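
The distinction drawn in the Figure 5 caption can be checked numerically with a pinhole projection. The sketch below is illustrative only and assumes a camera looking down the +z axis with the principal point at the image origin; the helper names `project` and `magnification` are hypothetical. It compares how far two points at different depths move in the image when the focal length doubles (zoom) and when the camera translates forward (push forward).

```python
def project(point, focal, cam_z=0.0):
    """Pinhole projection of a 3-D point for a camera at (0, 0, cam_z) looking along +z."""
    x, y, z = point
    depth = z - cam_z  # distance from the camera along the optical axis
    return (focal * x / depth, focal * y / depth)


def magnification(point, before, after):
    """Ratio of the projected x coordinate after vs. before a camera change.

    `before` and `after` are (focal, cam_z) pairs.
    """
    x_before, _ = project(point, *before)
    x_after, _ = project(point, *after)
    return x_after / x_before


near = (1.0, 0.0, 2.0)   # a point 2 units in front of the starting camera
far = (1.0, 0.0, 10.0)   # a point 10 units in front of the starting camera

# Zoom: focal length 1 -> 2, camera stays in place.
zoom_near = magnification(near, (1.0, 0.0), (2.0, 0.0))
zoom_far = magnification(far, (1.0, 0.0), (2.0, 0.0))

# Push forward: focal length fixed, camera moves 1 unit along +z.
push_near = magnification(near, (1.0, 0.0), (1.0, 1.0))
push_far = magnification(far, (1.0, 0.0), (1.0, 1.0))

print("zoom:", zoom_near, zoom_far)
print("push:", push_near, push_far)
```

Under zoom, both points are scaled by the same factor, which is why the result matches cropping and resizing the image. Under a push forward, the near point is magnified more than the far one, and this depth-dependent scaling is what changes the perspective of the scene.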