AKiRa: Augmentation Kit on Rays for optical video generation
Xi Wang, Robin Courant, Marc Christie, Vicky Kalogeiton
TL;DR
AKiRa introduces an optical video generation framework that endows diffusion-based backbones with explicit camera and optical controls by learning a camera adapter on top of a frozen backbone. It couples an extended pinhole camera model, a Plücker-ray representation, and an aperture map with a data-augmentation kit to disentangle motion, focal length, distortion, and depth-of-field effects. The approach delivers cinematic capabilities such as dolly zoom, fisheye distortion, and adjustable bokeh while maintaining high video quality and temporal coherence, outperforming state-of-the-art methods across multiple backbones and datasets. By providing scalable evaluation via FlowSim and a comprehensive qualitative study, AKiRa demonstrates a practical, optically coherent path toward controllable video generation with broad creative and research implications.
Abstract
Recent advances in text-conditioned video diffusion have greatly improved video quality. However, these methods offer limited or sometimes no control to users on camera aspects, including dynamic camera motion, zoom, distorted lens and focus shifts. These motion and optical aspects are crucial for adding controllability and cinematic elements to generation frameworks, ultimately resulting in visual content that draws focus, enhances mood, and guides emotions according to filmmakers' controls. In this paper, we aim to close the gap between controllable video generation and camera optics. To achieve this, we propose AKiRa (Augmentation Kit on Rays), a novel augmentation framework that builds and trains a camera adapter with a complex camera model over an existing video generation backbone. It enables fine-tuned control over camera motion as well as complex optical parameters (focal length, distortion, aperture) to achieve cinematic effects such as zoom, fisheye effect, and bokeh. Extensive experiments demonstrate AKiRa's effectiveness in combining and composing camera optics while outperforming all state-of-the-art methods. This work sets a new landmark in controlled and optically enhanced video generation, paving the way for future optical video generation methods.
