Building Vision Models upon Heat Conduction

Zhaozhi Wang; Yue Liu; Yunjie Tian; Yunfan Liu; Yaowei Wang; Qixiang Ye

TL;DR

This work tackles the high computational cost of attention in vision models that pursue large receptive fields. It introduces the Heat Conduction Operator (HCO), a physics-inspired module that simulates the diffusion of visual information in the frequency domain via $\mathbf{DCT_{2D}}$ and $\mathbf{IDCT_{2D}}$, yielding a global receptive field with $\mathcal{O}(N^{1.5})$ complexity. HCO is coupled with learnable Frequency Value Embeddings (FVEs) that predict an adaptive thermal diffusivity $k$. The resulting vHeat backbone achieves strong performance on ImageNet, COCO, and ADE20K while delivering higher throughput and lower memory use and FLOPs than Swin-Transformer. Ablations support the effectiveness of adaptive diffusion and the advantage of HCO over global filters.
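
The $\mathcal{O}(N^{1.5})$ figure can be made concrete with a short count. The sketch below assumes a square feature map with $N = H \times W$ tokens and $H = W = \sqrt{N}$, and assumes the 2D transform is applied as two dense matrix products $C_H\, U\, C_W^{\top}$ per channel (the symbols $C_H$, $C_W$, and $U$ are introduced here for illustration only). Under those assumptions, one transform costs

$$
\underbrace{H \cdot H \cdot W}_{C_H U} \;+\; \underbrace{H \cdot W \cdot W}_{(C_H U)\, C_W^{\top}} \;=\; 2N^{1.5}
$$

multiply-adds per channel, whereas a pairwise similarity map over all tokens needs $N \cdot N = N^{2}$ per channel. For a $14 \times 14$ map ($N = 196$), this gives $2 \cdot 196^{1.5} = 5488$ against $196^{2} = 38416$, a ratio of $\sqrt{N}/2 = 7$ that grows with resolution.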

Abstract

Visual representation models that rely on attention mechanisms incur significant computational overhead, particularly when pursuing large receptive fields. In this study, we mitigate this challenge by introducing the Heat Conduction Operator (HCO), built upon the physical principle of heat conduction. HCO conceptualizes image patches as heat sources and models their correlations through adaptive thermal energy diffusion, enabling robust visual representations. HCO has a computational complexity of $\mathcal{O}(N^{1.5})$, as it can be implemented using discrete cosine transform (DCT) operations. HCO is plug-and-play: combining it with deep learning backbones produces visual representation models (termed vHeat) with global receptive fields. Experiments across vision tasks demonstrate that, in addition to stronger performance, vHeat achieves up to 3x the throughput, 80% less GPU memory allocation, and 35% fewer FLOPs compared with the Swin-Transformer. Code is available at https://github.com/MzeroMiko/vHeat.
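
The abstract describes HCO as diffusing heat from image patches using DCT operations. A minimal NumPy sketch of that idea follows. It is based on the classical closed-form solution of the heat equation on a grid with insulated boundaries: transform the map, damp each frequency by $e^{-k(\omega_x^2+\omega_y^2)t}$, and transform back. The function names `dct_matrix` and `heat_conduct`, the frequency grid $\omega_n = n\pi/H$, the fixed scalar $k$, and the single-channel input are assumptions made for illustration. This is not the released vHeat code, in which $k$ is predicted from FVEs.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix C of shape (n, n).

    C @ x transforms a length-n signal; C.T @ X inverts it (C is orthogonal).
    """
    freq = np.arange(n)[:, None]   # frequency index, one per row
    pos = np.arange(n)[None, :]    # spatial index, one per column
    c = np.cos(np.pi * (2 * pos + 1) * freq / (2 * n)) * np.sqrt(2.0 / n)
    c[0] /= np.sqrt(2.0)           # rescale the constant row to unit norm
    return c


def heat_conduct(u0, k, t=1.0):
    """Diffuse a 2D map u0 for time t with thermal diffusivity k.

    k may be a scalar or an (H, W) array of per-frequency values
    (hypothetical interface, chosen for this sketch).
    """
    h, w = u0.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    # Assumed frequency grid: omega_n = n * pi / size.
    wx = np.pi * np.arange(h) / h
    wy = np.pi * np.arange(w) / w
    decay = np.exp(-k * (wx[:, None] ** 2 + wy[None, :] ** 2) * t)
    coeffs = ch @ u0 @ cw.T                  # 2D DCT as two dense products
    return ch.T @ (coeffs * decay) @ cw      # damp, then 2D inverse DCT


# A single heat source at the center of a 14 x 14 map.
u0 = np.zeros((14, 14))
u0[7, 7] = 1.0
ut = heat_conduct(u0, k=2.0)
```

With `k = 0` the map is returned unchanged. For any `k`, the zero-frequency term is not damped, so the total heat `ut.sum()` matches `u0.sum()` while the peak spreads out across the map. Each dense product costs on the order of $H^2 W$ operations, which is one way to reach the $N^{1.5}$ scaling for a square map.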

Paper Structure (28 sections, 11 equations, 12 figures, 12 tables)


Introduction
Related Work
Methodology
Preliminaries: Physical Heat Conduction
vHeat: Visual Heat Conduction
Heat Conduction Operator (HCO)
Adaptive Thermal Diffusivity
vHeat Model
Discussion
Experiment & Analysis
Experimental Results
Analysis of Dynamic Locality
Comparison With Global Filters
Conclusion
Acknowledgement
...and 13 more sections

Figures (12)

Figure 1: Throughput / GPU memory / FLOPs comparisons of our proposed approach (vHeat) with Swin-Transformer (Swin2021) under different image resolutions. The throughput and GPU memory are tested on 80 GB Tesla A100 GPUs with batch size 64. Swin-B is tested with scaled window size here.
Figure 2: Comparison of information conduction mechanisms: self-attention vs. heat conduction. (a) The self-attention operator uniformly "conducts" information from a pixel to all other pixels, resulting in $\mathcal{O}(N^2)$ complexity. (b) The heat conduction operator (HCO) conceptualizes the center pixel as the heat source and conducts information propagation through DCT ($\mathcal{F}$) and IDCT ($\mathcal{F}^{-1}$), which enjoys interpretability, global receptive fields, and $\mathcal{O}(N^{1.5})$ complexity.
Figure 3: The network architecture of vHeat. Following the traditional principles of visual model design, we built vHeat with 4 HCO blocks, connected by downsampling layers in between.
Figure 4: HCO and HCO layer. FVEs, FFN, LN, and DWConv respectively denote frequency value embeddings, feed-forward network, layer normalization, and depth-wise convolution. Please refer to the depth-wise convolution section of the supplementary, where we demonstrate that while depth-wise convolution aids in feature extraction, the primary improvements are attributed to the proposed HCO.
Figure 5: Illustration of temperature distribution $U^t$ w.r.t. thermal diffusivity $k$, given a heat source as the initial condition. The predicted $k$ leads to nonuniform visual heat conduction, which facilitates the adaptability of visual representation. (Best viewed in color)
...and 7 more figures
