CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video

Lingen Li, Guangzhi Wang, Xiaoyu Li, Zhaoyang Zhang, Qi Dou, Jinwei Gu, Tianfan Xue, Ying Shan

Abstract

Generating high-quality 360° panoramic videos from perspective input is a crucial application for virtual reality (VR), where high resolution is especially important for an immersive experience. Existing methods are constrained by the computational limitations of vanilla diffusion models: they support native generation only at $\leq$ 1K resolution and rely on suboptimal post-hoc super-resolution to reach higher resolutions. We introduce CubeComposer, a novel spatio-temporal autoregressive diffusion model that natively generates 4K-resolution 360° videos. By decomposing videos into cubemap representations with six faces, CubeComposer autoregressively synthesizes content in a well-planned spatio-temporal order, reducing memory demands while enabling high-resolution output. Specifically, to address the challenges of multi-dimensional autoregression, we propose: (1) a spatio-temporal autoregressive strategy that orchestrates 360° video generation across cube faces and time windows for coherent synthesis; (2) a cube face context management mechanism, equipped with a sparse context attention design to improve efficiency; and (3) continuity-aware techniques, including cube-aware positional encoding, padding, and blending, to eliminate boundary seams. Extensive experiments on benchmark datasets demonstrate that CubeComposer outperforms state-of-the-art methods in native resolution and visual quality, supporting practical VR application scenarios. Project page: https://lg-li.github.io/project/cubecomposer
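
To make the cubemap decomposition mentioned in the abstract concrete, the following is a minimal sketch of the standard mapping from an equirectangular pixel to one of six cube faces. It is not taken from the CubeComposer code: the function names, the axis convention (+z front, +x right, +y up), and the face-local orientation are assumptions chosen for illustration, and the face labels F, R, L, B, U, D stand for front, right, left, back, up, and down.

```python
import math


def equirect_to_direction(px, py, width, height):
    """Convert an equirectangular pixel (px, py) to a unit view direction.

    Assumed convention: longitude 0 / latitude 0 looks along +z (front),
    +x points right and +y points up.
    """
    lon = (px + 0.5) / width * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (py + 0.5) / height * math.pi
    return (
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
        math.cos(lat) * math.cos(lon),
    )


def direction_to_face_uv(x, y, z):
    """Map a non-zero direction to (face, u, v), with u and v in [0, 1].

    The face is chosen by the dominant axis; (a, b) are the image-right and
    image-up components on that face before normalization.
    """
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax == 0 and ay == 0 and az == 0:
        raise ValueError("direction must be non-zero")
    if az >= ax and az >= ay:
        face, a, b, m = ("F", x, y, az) if z > 0 else ("B", -x, y, az)
    elif ax >= ay:
        face, a, b, m = ("R", -z, y, ax) if x > 0 else ("L", z, y, ax)
    else:
        face, a, b, m = ("U", x, -z, ay) if y > 0 else ("D", x, z, ay)
    u = 0.5 * (a / m + 1.0)  # 0 = left edge of the face, 1 = right edge
    v = 0.5 * (1.0 - b / m)  # 0 = top edge of the face, 1 = bottom edge
    return face, u, v
```

For example, `direction_to_face_uv(*equirect_to_direction(0, 0, 8, 4))` lands on the up face, since the top row of an equirectangular image looks toward the zenith. Working per face is what lets a model process one square view at a time instead of the full panorama.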

Paper Structure (26 sections, 9 equations, 7 figures, 3 tables)

This paper contains 26 sections, 9 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Existing perspective-to-360° video generation models are typically limited to a maximum resolution of 1K [tan2024imagine360, luo2025beyond, fang2025panoramic], as they rely on the generation capability of vanilla video diffusion models with full attention. Even when augmented with advanced video super-resolution techniques like VEnhancer [he2024venhancer], the video quality of the state-of-the-art method Argus [luo2025beyond] remains unsatisfactory. In contrast, our CubeComposer introduces a spatio-temporal autoregressive diffusion model featuring an effective context mechanism and efficient attention design, enabling for the first time the native generation (without super-resolution) of 4K 360° videos with diffusion models. Zoom in for a better view.
  • Figure 2: Comparison between the overall pipeline of previous methods and ours. CubeComposer generates the 360° video in a cubemap face-wise spatio-temporal autoregressive manner, significantly reducing the peak memory requirement and enabling native 4K generation.
  • Figure 3: Pipeline overview of CubeComposer. Given a perspective video, we convert it to a cubemap to obtain masked conditional inputs. The generation sequence is divided into multiple temporal windows, in which the faces are generated in a coverage-guided spatio-temporal order. At each step, CubeComposer generates a video conditioned on the context tokens with an efficient sparse context attention mechanism. F, R, L, B, U, D represent the front, right, left, back, up, and down faces, respectively.
  • Figure 4: Context mechanism of CubeComposer, taking the generation step of face R in the $i$-th time window as an example. For each generation iteration, our context mechanism composes three groups of tokens: (a) History Tokens, which include the $H$ windows already generated in previous iterations; (b) Current Time Window Tokens, which include the generated faces in the current window and the perspective video conditions for ungenerated faces, always serving as a local context; and (c) Future Fragment Tokens, where we dynamically select the temporally nearest fragment from spatially adjacent future faces (including the current face) that contains effective content above the spatial coverage threshold $r$.
  • Figure 5: Continuity-aware designs in CubeComposer, which tackle the discontinuities caused by generating spatially separated faces in our spatio-temporal autoregressive scheme.
  • ...and 2 more figures
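
The generation order and context composition described in the captions of Figures 3 and 4 can be sketched roughly as follows. This is an illustrative reading of the captions, not the authors' implementation: the function names, the `coverage` data layout, the rule "generate faces in descending coverage within each window", and the choice to pick one nearest future fragment per face are all assumptions. The symbols `history` and `r` correspond to $H$ and the coverage threshold $r$ in the Figure 4 caption.

```python
FACES = ("F", "R", "L", "B", "U", "D")

# Faces that share an edge on a cube (each face touches four others).
ADJACENT = {
    "F": ("R", "L", "U", "D"),
    "B": ("R", "L", "U", "D"),
    "R": ("F", "B", "U", "D"),
    "L": ("F", "B", "U", "D"),
    "U": ("F", "R", "L", "B"),
    "D": ("F", "R", "L", "B"),
}


def plan_generation_order(coverage):
    """Return a list of (window, face) steps.

    coverage[w][face] is the fraction of that face covered by the input
    perspective video in time window w. Assumption: within a window, faces
    with more coverage are generated first (ties keep the FACES order).
    """
    steps = []
    for w, window_cov in enumerate(coverage):
        ordered = sorted(FACES, key=lambda f: -window_cov[f])
        steps.extend((w, f) for f in ordered)
    return steps


def build_context(step_index, steps, coverage, history=1, r=0.5):
    """Assemble the three context groups for one generation step."""
    w, face = steps[step_index]
    done = set(steps[:step_index])

    # (a) History: every face of the last `history` completed windows.
    history_tokens = [
        (hw, f) for hw in range(max(0, w - history), w) for f in FACES
    ]

    # (b) Current window: generated faces, or the perspective-video
    # condition for faces not generated yet.
    current_tokens = {
        f: "generated" if (w, f) in done else "condition" for f in FACES
    }

    # (c) Future fragments: for the current face and its neighbours, the
    # nearest later window whose coverage exceeds the threshold r.
    future_tokens = []
    for f in (face,) + ADJACENT[face]:
        for fw in range(w + 1, len(coverage)):
            if coverage[fw][f] > r:
                future_tokens.append((fw, f))
                break

    return {
        "history": history_tokens,
        "current": current_tokens,
        "future": future_tokens,
    }
```

Because each step attends only to these selected groups rather than to every token of the full panorama, the context stays bounded as the video grows, which is the intuition behind the reduced peak memory noted in the Figure 2 caption.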