Building Vision Models upon Heat Conduction
Zhaozhi Wang, Yue Liu, Yunjie Tian, Yunfan Liu, Yaowei Wang, Qixiang Ye
TL;DR
This work addresses the high computational cost of attention in vision models that pursue large receptive fields. It introduces the Heat Conduction Operator (HCO), a physics-inspired module that simulates the diffusion of visual information in the frequency domain via $\mathbf{DCT_{2D}}$ and $\mathbf{IDCT_{2D}}$, yielding a global receptive field with $O(N^{1.5})$ complexity. HCO is coupled with learnable Frequency Value Embeddings (FVEs) that predict an adaptive diffusivity $k$. The resulting vHeat backbone performs strongly on ImageNet, COCO, and ADE20K while delivering higher throughput, lower memory use, and fewer FLOPs than Swin-Transformer. Ablations support the effectiveness of adaptive diffusion and the advantage of HCO over global filters, indicating that physics-inspired diffusion with frequency-domain processing can yield vision models that are scalable, accurate, efficient, and interpretable.
Abstract
Visual representation models leveraging attention mechanisms are challenged by significant computational overhead, particularly when pursuing large receptive fields. In this study, we aim to mitigate this challenge by introducing the Heat Conduction Operator (HCO), built upon the physical principle of heat conduction. HCO conceptualizes image patches as heat sources and models their correlations through adaptive thermal energy diffusion, enabling robust visual representations. HCO has a computational complexity of $O(N^{1.5})$, as it can be implemented using discrete cosine transform (DCT) operations. HCO is plug-and-play: combining it with deep learning backbones produces visual representation models (termed vHeat) with global receptive fields. Experiments across vision tasks demonstrate that, in addition to stronger performance, vHeat achieves up to 3x the throughput, 80% less GPU memory allocation, and 35% fewer computational FLOPs compared to the Swin-Transformer. Code is available at https://github.com/MzeroMiko/vHeat.
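As a rough illustration of the idea of diffusing features through DCT operations, the sketch below applies the textbook solution of the 2D heat equation with insulated boundaries: transform the input with a 2D DCT, scale each frequency by $e^{-k(\omega_x^2+\omega_y^2)t}$, and transform back. This is a minimal NumPy sketch under stated assumptions, not the released vHeat code: the function names `dct_matrix` and `heat_conduction` are hypothetical, the decay form is assumed to stand in for HCO, and `k` is passed in as a fixed value rather than predicted from FVEs. Computing the DCT as a matrix product on a $\sqrt{N}\times\sqrt{N}$ grid costs $O(N^{1.5})$ multiplications, which is one way to arrive at the stated complexity.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C: C @ x is the DCT of x, C.T @ y inverts it."""
    freq = np.arange(n)[:, None]  # frequency index (rows)
    pos = np.arange(n)[None, :]   # spatial index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * pos + 1) * freq / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)    # rescale the DC row so that C @ C.T = I
    return c


def heat_conduction(u0: np.ndarray, k, t: float = 1.0) -> np.ndarray:
    """Diffuse a 2D map u0 for time t with diffusivity k (hypothetical helper).

    k may be a scalar or an (H, W) array with one value per frequency
    (assumption: a fixed stand-in for a learned, adaptive diffusivity).
    """
    h, w = u0.shape
    ch, cw = dct_matrix(h), dct_matrix(w)

    # Angular frequencies of the cosine basis along each axis.
    wy = np.pi * np.arange(h)[:, None] / h
    wx = np.pi * np.arange(w)[None, :] / w
    decay = np.exp(-k * (wx**2 + wy**2) * t)

    coeffs = ch @ u0 @ cw.T                # 2D DCT as two matrix products
    return ch.T @ (coeffs * decay) @ cw    # 2D inverse DCT


# Usage: a single "heat source" spreads over the whole 8x8 grid.
u0 = np.zeros((8, 8))
u0[3, 4] = 1.0
u1 = heat_conduction(u0, k=0.5)
```

Because the zero-frequency coefficient is left unscaled, the total "heat" `u1.sum()` matches `u0.sum()` up to floating-point error, and setting `k=0` returns the input unchanged.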
