Gradient Residual Connections
Yangchen Pan, Qizhen Ying, Philip Torr, Bo Liu
TL;DR
This paper introduces gradient residual connections that incorporate gradient information from intermediate representations into skip connections to better approximate high-frequency functions. By combining the gradient residual with the standard skip through a convex, sigmoid-controlled mix, and optionally normalizing the gradient term, the method aims to maintain stability while enhancing sensitivity to high-frequency structure. Theoretical analysis links gradient directions to local frequency content, demonstrating conditions under which gradient directions become nearly opposite for nearby points, aiding discrimination in high-frequency regions. Empirical results on synthetic 1D tasks and image super-resolution show clear gains in high-frequency settings, while standard vision tasks like classification and segmentation remain largely comparable to traditional residual networks, underscoring both the potential and the architecture-dependent nature of gradient residuals.
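The mixing rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-layer tanh branch, the choice to differentiate the summed branch output with respect to the block input, and the names `branch`, `branch_input_gradient`, `gradient_residual_block`, and `gate_logit` are all hypothetical. The gradient is written out analytically to keep the sketch dependency-free; in practice an automatic differentiation library would compute it.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]


def branch(x, W1, b1, W2, b2):
    """Residual branch F(x) = W2 tanh(W1 x + b1) + b2; also returns the hidden layer."""
    h = [math.tanh(z + b) for z, b in zip(matvec(W1, x), b1)]
    out = [o + b for o, b in zip(matvec(W2, h), b2)]
    return out, h


def branch_input_gradient(h, W1, W2):
    """Analytic gradient of sum_i F_i(x) with respect to x.

    ASSUMPTION: reducing the branch output to a scalar by summation is a
    choice made for this sketch, not necessarily the paper's definition.
    """
    # d(sum F)/dh_j is the j-th column sum of W2; tanh'(z) = 1 - tanh(z)^2.
    dz = [(1.0 - hj * hj) * sum(row[j] for row in W2) for j, hj in enumerate(h)]
    # Chain rule through W1: g_k = sum_j W1[j][k] * dz_j.
    return [sum(W1[j][k] * dz[j] for j in range(len(dz))) for k in range(len(W1[0]))]


def gradient_residual_block(x, params, gate_logit, normalize=True, eps=1e-8):
    """Return F(x) + (1 - alpha) * x + alpha * g, with alpha = sigmoid(gate_logit)."""
    W1, b1, W2, b2 = params
    fx, h = branch(x, W1, b1, W2, b2)
    g = branch_input_gradient(h, W1, W2)
    if normalize:
        # Optional normalization keeps the gradient term at (roughly) unit length.
        norm = math.sqrt(sum(gk * gk for gk in g))
        g = [gk / (norm + eps) for gk in g]
    alpha = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid-controlled mixing weight
    return [f + (1.0 - alpha) * xi + alpha * gi for f, xi, gi in zip(fx, x, g)]


# Toy parameters (arbitrary values chosen for illustration only).
W1 = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]]
b1 = [0.0, 0.1, -0.2]
W2 = [[1.0, -0.5, 0.2], [0.3, 0.9, -1.1]]
b2 = [0.0, 0.0]
params = (W1, b1, W2, b2)
x = [0.2, -0.4]

y_skip = gradient_residual_block(x, params, gate_logit=-50.0)  # alpha near 0
y_mixed = gradient_residual_block(x, params, gate_logit=0.0)   # alpha = 0.5
```

With a very negative `gate_logit` the block reduces to an ordinary residual block, `F(x) + x`, so in this sketch the standard skip connection is recovered as a limiting case of the convex mix.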
Abstract
Existing work has linked properties of a function's gradient to the difficulty of function approximation. Motivated by these insights, we study how gradient information can be leveraged to improve a neural network's ability to approximate high-frequency functions, and we propose a gradient-based residual connection as a complement to the standard identity skip connection used in residual networks. We provide simple theoretical intuition for why gradient information can help distinguish inputs and improve the approximation of functions with rapidly varying behaviour. On a synthetic regression task with a high-frequency sinusoidal ground truth, we show that conventional residual connections struggle to capture high-frequency patterns, whereas our gradient residual substantially improves approximation quality. We then introduce a convex combination of the standard and gradient residuals, allowing the network to flexibly control how strongly it relies on gradient information. After validating the design choices of our method through an ablation study, we evaluate it on single-image super-resolution, a task where the underlying function may be high-frequency. Finally, on standard tasks such as image classification and segmentation, our method achieves performance comparable to standard residual networks, suggesting its broad utility.
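The intuition that gradients can separate nearby inputs can be illustrated with a one-dimensional sinusoid. This worked example is our own illustration with an assumed frequency parameter $\omega$, and the paper's formal statement may be more general:

```latex
f(x) = \sin(\omega x), \qquad f'(x) = \omega \cos(\omega x).

\text{For } x' = x + \frac{\pi}{\omega}: \qquad
f'(x') = \omega \cos(\omega x + \pi) = -\,\omega \cos(\omega x) = -f'(x),
\qquad |x' - x| = \frac{\pi}{\omega} \xrightarrow{\;\omega \to \infty\;} 0.
```

As the frequency grows, the two inputs become arbitrarily close while their gradients stay exactly opposite, so in this example the gradient carries a signal that distinguishes points the raw input barely separates.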
