On Computational Limits of FlowAR Models: Expressivity and Efficiency
Yang Cao, Chengyue Gong, Yekun Ke, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song
TL;DR
This work analyzes the FlowAR visual autoregressive model through the lens of circuit complexity. It proves that FlowAR, including its attention and flow-matching components, can be simulated by a $\mathsf{DLOGTIME}$-uniform $\mathsf{TC}^0$ circuit family of constant depth and polynomial size, so its expressive power is bounded by $\mathsf{TC}^0$. To address practical efficiency, the authors introduce a low-rank approximate attention and a fast FlowAR variant that achieves almost quadratic inference time $O(n^{2+o(1)})$ with controlled additive error. The results establish a theoretical baseline for flow-based autoregressive architectures and offer concrete guidelines for designing more efficient, yet expressive, generative models.
Abstract
The expressive power and computational complexity of deep visual generative models, such as flow-based and autoregressive (AR) models, have attracted considerable interest because of their wide-ranging applications in generative tasks. However, the theoretical characterization of their expressiveness through the lens of circuit complexity remains underexplored, particularly for state-of-the-art architectures such as FlowAR, proposed by [Ren et al., 2024], which integrates flow-based and autoregressive mechanisms. This gap limits our understanding of their inherent computational limits and practical efficiency. In this study, we address this gap by analyzing the circuit complexity of the FlowAR architecture. We demonstrate that when the largest feature map produced by the FlowAR model has dimensions $n \times n \times c$, the FlowAR model can be simulated by a family of threshold circuits in $\mathsf{TC}^0$, which have constant depth $O(1)$ and polynomial width $\mathrm{poly}(n)$. This is the first study to rigorously establish limits on the expressive power of FlowAR models. Furthermore, we identify the conditions under which FlowAR computations can be carried out in almost quadratic time. To validate our theoretical findings, we present efficient model variants based on low-rank approximations that satisfy the derived criteria. Our work provides a foundation for future comparisons with other generative paradigms and guides the development of more efficient and expressive implementations.
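To give a rough sense of why a low-rank factorization of the attention matrix reduces cost, the following NumPy sketch approximates $\exp(QK^\top/d)$ by a product $U_1 U_2^\top$ and reorders the matrix products so that the full attention matrix is never formed. It is an illustration under stated assumptions, not the construction used in this work: the degree-2 Taylor feature map `taylor_features`, the $1/d$ score scaling, and all function names are hypothetical choices, and the approximation is only accurate when the entries of $Q$ and $K$ are small.

```python
import numpy as np


def exact_attention(Q, K, V):
    """Softmax-style attention that materializes the full score matrix."""
    d = Q.shape[1]
    A = np.exp(Q @ K.T / d)  # one row per query, one column per key
    return (A / A.sum(axis=1, keepdims=True)) @ V


def taylor_features(X):
    """Hypothetical feature map phi with
    phi(q) . phi(k) = 1 + x + x^2 / 2, where x = (q . k) / d,
    i.e. the degree-2 Taylor polynomial of exp(x)."""
    n, d = X.shape
    ones = np.ones((n, 1))
    linear = X / np.sqrt(d)
    # vec(q q^T) . vec(k k^T) = (q . k)^2, scaled to give x^2 / 2
    quadratic = np.einsum("ni,nj->nij", X, X).reshape(n, d * d)
    quadratic = quadratic / (d * np.sqrt(2.0))
    return np.concatenate([ones, linear, quadratic], axis=1)


def low_rank_attention(U1, U2, V):
    """Attention with the score matrix replaced by U1 @ U2.T.

    The products are reassociated so that only matrices with
    k = U1.shape[1] columns (or V's column count) are ever formed."""
    row_sums = U1 @ U2.sum(axis=0)   # equals (U1 @ U2.T) @ ones
    numerator = U1 @ (U2.T @ V)      # equals (U1 @ U2.T) @ V
    return numerator / row_sums[:, None]


def approx_attention(Q, K, V):
    """Low-rank approximation of exact_attention for small-entry Q and K."""
    return low_rank_attention(taylor_features(Q), taylor_features(K), V)
```

With $k = 1 + d + d^2$ feature columns, the reassociated products cost on the order of $n \cdot k \cdot d$ operations for $n$ tokens, rather than the $n^2 d$ needed when the score matrix is formed explicitly. The polynomial $1 + x + x^2/2$ is strictly positive, so the row normalizers in this sketch never vanish.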
