CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video

Lingen Li; Guangzhi Wang; Xiaoyu Li; Zhaoyang Zhang; Qi Dou; Jinwei Gu; Tianfan Xue; Ying Shan

Abstract

Generating high-quality 360° panoramic videos from perspective input is a crucial application for virtual reality (VR), where high resolution is especially important for an immersive experience. Existing methods are constrained by the computational cost of vanilla diffusion models: they support native generation only at $\leq$ 1K resolution and rely on suboptimal post-hoc super-resolution to reach higher resolutions. We introduce CubeComposer, a novel spatio-temporal autoregressive diffusion model that natively generates 4K-resolution 360° videos. By decomposing videos into cubemap representations with six faces, CubeComposer autoregressively synthesizes content in a well-planned spatio-temporal order, reducing memory demands while enabling high-resolution output. Specifically, to address challenges in multi-dimensional autoregression, we propose: (1) a spatio-temporal autoregressive strategy that orchestrates 360° video generation across cube faces and time windows for coherent synthesis; (2) a cube face context management mechanism, equipped with a sparse context attention design to improve efficiency; and (3) continuity-aware techniques, including cube-aware positional encoding, padding, and blending to eliminate boundary seams. Extensive experiments on benchmark datasets demonstrate that CubeComposer outperforms state-of-the-art methods in native resolution and visual quality, supporting practical VR application scenarios. Project page: https://lg-li.github.io/project/cubecomposer
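To make the cubemap decomposition concrete, the following is a minimal sketch of how a viewing direction can be assigned to one of the six faces by its dominant axis. The axis convention (+z front, +x right, +y up) and the function name `face_of_direction` are assumptions chosen for illustration; the paper's own convention is not stated here.

```python
def face_of_direction(x: float, y: float, z: float) -> str:
    """Return the cube face hit by the ray (x, y, z) from the cube centre.

    Assumed convention (not from the paper): +z is front (F), +x is right (R),
    +y is up (U); the opposite directions are back (B), left (L), down (D).
    """
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax == 0 and ay == 0 and az == 0:
        raise ValueError("direction must be non-zero")
    # The face is decided by the axis with the largest absolute component.
    # Ties are resolved in the order z, then x, then y.
    if az >= ax and az >= ay:
        return "F" if z > 0 else "B"
    if ax >= ay:
        return "R" if x > 0 else "L"
    return "U" if y > 0 else "D"
```

Applying this test to every pixel direction of an equirectangular frame partitions the sphere into six square faces, each of which can then be processed at a fraction of the full panorama's token count.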

Paper Structure (26 sections, 9 equations, 7 figures, 3 tables)

Introduction
Related Work
360° Video Generation
Video Diffusion Model
Autoregressive Video Generation
Methodology
Problem Formulation and Notation
Model Overview
Spatio-Temporal Autoregressive Planning
Context Mechanism with Efficient Attention
Context Mechanism.
Sparse Context Attention.
Continuity-aware Designs
Training and Inference
Experiments
...and 11 more sections

Figures (7)

Figure 1: Existing perspective-to-360° video generation models are typically limited to a maximum resolution of 1K [tan2024imagine360, luo2025beyond, fang2025panoramic], as they rely on the generation capability of vanilla video diffusion models with full attention. Even when augmented with advanced video super-resolution techniques like VEnhancer [he2024venhancer], the video quality of the state-of-the-art method Argus [luo2025beyond] remains unsatisfactory. In contrast, our CubeComposer introduces a spatio-temporal autoregressive diffusion model featuring an effective context mechanism and efficient attention design, enabling for the first time the native generation (without super-resolution) of 4K 360° videos with diffusion models. Zoom in for a better view.
Figure 2: Comparison between the overall pipeline of previous methods and ours. CubeComposer generates the 360° video in a cubemap face-wise spatio-temporal autoregressive manner, significantly reducing the peak memory requirement and enabling native 4K generation.
Figure 3: Pipeline overview of CubeComposer. Given a perspective video, we convert it to a cubemap to obtain masked conditional inputs. The generation sequence is divided into multiple temporal windows, within each of which the faces are generated in a coverage-guided spatio-temporal order. At each step, CubeComposer generates a video conditioned on the context tokens with an efficient sparse context attention mechanism. F, R, L, B, U, D represent the front, right, left, back, up, and down faces, respectively.
Figure 4: Context mechanism of CubeComposer, taking the generation step of face R in the $i$-th time window as an example. For each generation iteration, our context mechanism assembles three groups of tokens: (a) History Tokens, which include $H$ windows already generated in previous iterations; (b) Current Time Window Tokens, which include generated faces in the current window and perspective video conditions for ungenerated faces, always serving as a local context; and (c) Future Fragment Tokens, where we dynamically select the temporally nearest fragment from spatially adjacent future faces (including the current face) containing effective content above the spatial coverage threshold $r$.
Figure 5: Continuity-aware designs in CubeComposer, which address the discontinuities caused by generating spatially separated faces in our spatio-temporal autoregressive scheme.
...and 2 more figures
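
The coverage-guided ordering of Figure 3 and the future-fragment selection of Figure 4 can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the per-window coverage table, the "descending coverage" ordering rule, the reading of "future" as "later time windows", and the helper names `plan_order` and `select_future_fragments` are all hypothetical.

```python
from typing import Dict, List, Tuple

FACES = ["F", "R", "L", "B", "U", "D"]
# On a cube, every face touches all faces except the one opposite to it.
OPPOSITE = {"F": "B", "B": "F", "R": "L", "L": "R", "U": "D", "D": "U"}

# coverage[w][face] is the fraction of that face covered by the perspective
# input in time window w (hypothetical data layout).
Coverage = List[Dict[str, float]]


def adjacent_faces(face: str) -> List[str]:
    """Faces sharing an edge with `face`."""
    return [f for f in FACES if f != face and f != OPPOSITE[face]]


def plan_order(coverage: Coverage) -> List[Tuple[int, str]]:
    """Visit windows in temporal order; within each window, visit faces by
    descending coverage (assumed rule). Ties keep the order of FACES."""
    steps: List[Tuple[int, str]] = []
    for w, cov in enumerate(coverage):
        ordered = sorted(FACES, key=lambda f: -cov[f])  # sorted() is stable
        steps.extend((w, f) for f in ordered)
    return steps


def select_future_fragments(
    coverage: Coverage, window: int, face: str, r: float
) -> Dict[str, int]:
    """For the current face and each adjacent face, pick the nearest later
    window whose coverage exceeds the threshold r, if one exists."""
    picks: Dict[str, int] = {}
    for f in [face] + adjacent_faces(face):
        for w in range(window + 1, len(coverage)):
            if coverage[w][f] > r:
                picks[f] = w
                break
    return picks


coverage = [
    {"F": 1.0, "R": 0.3, "L": 0.0, "B": 0.0, "U": 0.1, "D": 0.1},
    {"F": 0.6, "R": 0.7, "L": 0.0, "B": 0.0, "U": 0.0, "D": 0.0},
]
order = plan_order(coverage)
fragments = select_future_fragments(coverage, window=0, face="R", r=0.5)
```

In this toy setup the camera pans from the front face toward the right face, so the second window starts with R rather than F, and generating R in the first window can draw on the well-covered R and F content of the second window as future context.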
