Gradient Residual Connections

Yangchen Pan; Qizhen Ying; Philip Torr; Bo Liu

TL;DR

This paper introduces gradient residual connections that incorporate gradient information from intermediate representations into skip connections to better approximate high-frequency functions. By combining the gradient residual with the standard skip through a convex, sigmoid-controlled mix, and optionally normalizing the gradient term, the method aims to maintain stability while enhancing sensitivity to high-frequency structure. Theoretical analysis links gradient directions to local frequency content, demonstrating conditions under which gradient directions become nearly opposite for nearby points, aiding discrimination in high-frequency regions. Empirical results on synthetic 1D tasks and image super-resolution show clear gains in high-frequency settings, while standard vision tasks like classification and segmentation remain largely comparable to traditional residual networks, underscoring both the potential and the architecture-dependent nature of gradient residuals.
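The convex, sigmoid-controlled mix described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the branch `tanh(W h)`, the use of the summed branch output as the scalar whose gradient is taken, the trainable scalar `alpha`, and all function names are hypothetical, since this summary does not specify them.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def branch(W, h):
    """Residual branch F(h) = tanh(W h); W is a square matrix (assumed form)."""
    return [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]


def branch_sum_grad(W, h):
    """Analytic gradient of s(h) = sum_i F(h)_i with respect to h.

    Taking the gradient of this particular scalar is a placeholder assumption.
    """
    out = branch(W, h)
    return [
        sum((1.0 - out[i] ** 2) * W[i][j] for i in range(len(W)))
        for j in range(len(h))
    ]


def normalize(v, eps=1e-8):
    """Scale v to (approximately) unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def gradient_residual_block(W, h, alpha, normalize_grad=True):
    """Assumed form: F(h) + (1 - lam) * h + lam * g, with lam = sigmoid(alpha)."""
    lam = sigmoid(alpha)  # mix weight in (0, 1), so the skip is a convex combination
    g = branch_sum_grad(W, h)
    if normalize_grad:
        g = normalize(g)
    f = branch(W, h)
    return [f_i + (1.0 - lam) * h_i + lam * g_i for f_i, h_i, g_i in zip(f, h, g)]
```

With a strongly negative `alpha` the mix weight is close to zero and the block behaves like a standard residual block; raising `alpha` shifts weight toward the gradient term.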

Abstract

Existing work has linked properties of a function's gradient to the difficulty of function approximation. Motivated by these insights, we study how gradient information can be leveraged to improve a neural network's ability to approximate high-frequency functions, and we propose a gradient-based residual connection as a complement to the standard identity skip connection used in residual networks. We provide simple theoretical intuition for why gradient information can help distinguish inputs and improve the approximation of functions with rapidly varying behaviour. On a synthetic regression task with a high-frequency sinusoidal ground truth, we show that conventional residual connections struggle to capture high-frequency patterns, whereas our gradient residual substantially improves approximation quality. We then introduce a convex combination of the standard and gradient residuals, allowing the network to flexibly control how strongly it relies on gradient information. After validating the design choices of our proposed method through an ablation study, we demonstrate our approach's utility on the single-image super-resolution task, where the underlying function may be high-frequency. Finally, on standard tasks such as image classification and segmentation, our method achieves performance comparable to standard residual networks, suggesting its broad utility.
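The intuition that gradients help distinguish nearby inputs can be seen in a one-dimensional toy case. For $f(x) = \sin(\omega x)$ the derivative is $\omega \cos(\omega x)$, so two inputs separated by $\pi/\omega$ have derivatives of equal magnitude and opposite sign, and that separation shrinks as $\omega$ grows. The sketch below only illustrates this elementary fact; the probe points and helper names are hypothetical, and it is not the paper's lemma or experiment.

```python
import math


def grad_sin(x, omega):
    """Derivative of f(x) = sin(omega * x)."""
    return omega * math.cos(omega * x)


def opposite_gradient_pair(omega, x0):
    """Return (gap, g0, g1) for x0 and the point half a period to its right.

    Shifting the input by pi / omega flips the sign of cos(omega * x),
    so g1 equals -g0 up to floating-point error.
    """
    x1 = x0 + math.pi / omega
    return x1 - x0, grad_sin(x0, omega), grad_sin(x1, omega)


if __name__ == "__main__":
    for omega in (1.0, 10.0, 100.0):
        # Hypothetical probe point, chosen so that omega * x0 is the same each time.
        gap, g0, g1 = opposite_gradient_pair(omega, 0.1 / omega)
        print(f"omega={omega:6.1f}  gap={gap:.4f}  g0={g0:+.3f}  g1={g1:+.3f}")
```

As `omega` increases, the two inputs move closer together while their derivatives stay opposite in sign, which is the kind of signal a gradient-based skip could exploit.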

Paper Structure (23 sections, 6 theorems, 44 equations, 11 figures, 18 tables)

Introduction
Background
Residual Connections
Frequency and Function Approximation
Gradient-based Residual Connections
Gradient-based Residual
Theoretical Insight
Synthetic Experiments
Image Experiments
Super-resolution task
Classification and Segmentation
Discussions
Appendix
Proof
Bound the sum of gradient in high frequency components.
...and 8 more sections

Key Result

Lemma 3.5

[Upper bound of the high-frequency component gradient.] The gradient of $f_h$ at $\boldsymbol{x}_0, \boldsymbol{x}_1$ satisfies:

Figures (11)

Figure 1: Learning curves on the sin dataset in terms of test MSE vs. number of training epochs. Results are averaged over $30$ random seeds. For standard residual, the solid line is the version with a trainable scalar on the residual connection.
Figure 2: Learning curves on the sin dataset in terms of test MSE vs. number of training epochs. (a) compares variants with and without backpropagation through the gradient. (b) shows variants without gradient normalization. Results are averaged over $30$ runs.
Figure 3: Learned functions vs. the ground-truth function when $d=16$. For each algorithm, we visualize results from the first random seed under the best hyperparameter setting, using models saved at the end of training. The region where our gradient residual method achieves substantially better approximation is highlighted by the purple circle. For clarity of visualization, only the most relevant subset of algorithms is included.
Figure 4: Learning curves of test performance measured by mean PSNR (mean over images) vs. training epochs on standard benchmarks. We train for 500 epochs in total and evaluate every 10 epochs. The results are averaged over 3 random seeds.
Figure 5: Learning curves (test MSE vs. training epochs) when using different initialization scalars for our approach. Results are averaged over 30 random seeds.
...and 6 more figures

Theorems & Definitions (13)

Remark 3.1
Remark 3.2
Remark 3.5
Lemma 3.5
Lemma 3.5
Remark 3.7
Theorem 3.8
Lemma 1.0
proof
Lemma 1.0
...and 3 more
