Understand the Effectiveness of Shortcuts through the Lens of DCA

Youran Sun, Yihua Liu, Yi-Shuai Niu

TL;DR

This work reframes neural-network training through the Difference-of-Convex Algorithm (DCA) lens, modeling a vanilla network as a DC decomposition so that the resulting gradient contains a parallel shortcut component akin to ResNet paths. It shows that applying DCA yields a ResNet-like gradient structure for both MSE and CE losses, and that high-dimensional geometry supports this with near-orthogonality between typical "vanilla" and shortcut directions. The authors also propose NegNet, a quasi-DC decomposition, which achieves CIFAR-10 performance on par with ResNet and demonstrates the practical utility of a DCA-guided architectural design. Overall, the paper offers a unified, mathematically grounded explanation for the effectiveness of shortcuts and outlines how DC-based reasoning can inspire new architectures with better convergence properties.
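The near-orthogonality mentioned above rests on a general fact about high-dimensional geometry: two independently drawn random directions are almost orthogonal once the dimension is large. The following is a minimal sketch of that generic fact only, not the paper's analysis of vanilla and shortcut gradient directions. The Gaussian sampling, the dimensions 3 and 3000, and the function names are assumptions made for this illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_abs_cosine(dim, pairs=50, seed=0):
    """Average |cos| over `pairs` independent Gaussian vector pairs in R^dim."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0.0
    for _ in range(pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine(u, v))
    return total / pairs


# Hypothetical dimensions chosen only to contrast "low" and "high".
low_dim_value = mean_abs_cosine(3)
high_dim_value = mean_abs_cosine(3000)

print(f"mean |cos| in dim 3:    {low_dim_value:.4f}")
print(f"mean |cos| in dim 3000: {high_dim_value:.4f}")
```

The typical |cos| shrinks roughly like 1/sqrt(dim), so the second value should be far smaller than the first.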

Abstract

The Difference-of-Convex Algorithm (DCA) is a well-known algorithm for minimizing a nonconvex function that can be expressed as the difference of two convex functions. Many well-known optimization algorithms, such as SGD and proximal point methods, can be viewed as special cases of DCA with specific DC decompositions, which makes DCA a powerful framework for optimization. Shortcuts, meanwhile, are a key architectural feature of modern deep neural networks that facilitates training and optimization. We show that the gradient of a shortcut neural network can be obtained by applying DCA to vanilla neural networks, i.e., networks without shortcut connections. The DCA perspective therefore helps explain why networks with shortcuts are effective. Moreover, we propose a new architecture called NegNet that does not fit the previous interpretation but performs on par with ResNet and can be included in the DCA framework.
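The claim that gradient-type methods are special cases of DCA can be illustrated with a minimal Python sketch. It is an example under stated assumptions, not the paper's construction: the test function f(x) = x^2/2 + 2cos(x), the constant L = 3, and all function names are chosen here for illustration. Write f = g - h with g(x) = (L/2)x^2 and h(x) = (L/2)x^2 - f(x). Both are convex because f''(x) = 1 - 2cos(x) never exceeds L. Each DCA step takes y^k = h'(x^k) and minimizes g(x) - y^k x, which reduces to the gradient step x^k - f'(x^k)/L.

```python
import math

L = 3.0  # upper bound on f''(x) = 1 - 2cos(x), so h below is convex


def f(x):
    """Nonconvex, L-smooth test objective (an assumption of this sketch)."""
    return 0.5 * x * x + 2.0 * math.cos(x)


def grad_f(x):
    return x - 2.0 * math.sin(x)


def grad_h(x):
    """Gradient of h(x) = (L/2) x^2 - f(x), the convex part being subtracted."""
    return L * x - grad_f(x)


def dca_step(x):
    """One DCA iteration for f = g - h with g(x) = (L/2) x^2."""
    y = grad_h(x)  # y^k = h'(x^k)
    return y / L   # argmin_x (L/2) x^2 - y * x, solved in closed form


def gd_step(x):
    """Plain gradient descent with step size 1/L, for comparison."""
    return x - grad_f(x) / L


def run(step, x0, iters=50):
    """Apply `step` repeatedly and return all iterates, including x0."""
    xs = [x0]
    for _ in range(iters):
        xs.append(step(xs[-1]))
    return xs


dca_iterates = run(dca_step, 0.5)
gd_iterates = run(gd_step, 0.5)

max_gap = max(abs(a - b) for a, b in zip(dca_iterates, gd_iterates))
print(f"largest gap between DCA and GD iterates: {max_gap:.2e}")
print(f"final iterate: {dca_iterates[-1]:.6f}, f' there: {grad_f(dca_iterates[-1]):.2e}")
```

Other choices of g and h for the same f give different update rules, which is why the choice of DC decomposition matters in this framework.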

Paper Structure

This paper contains 9 sections, 1 theorem, 50 equations, 2 figures, 1 algorithm.

Key Result

Theorem 2.1

Let $\{x^k\}$ and $\{y^k\}$ be the sequences generated by DCA for the DC problem (P), starting from an initial point $x^0 \in \mathrm{dom}~\partial H$. Suppose that both $\{x^k\}$ and $\{y^k\}$ are bounded. Then:

Figures (2)

  • Figure 1: The architectures of the vanilla network, ResNet, and the NegNet proposed in this paper.
  • Figure 2: Performance comparison between NegNet18 (Eq. \ref{eq:negnet}) and ResNet18 on the CIFAR-10 dataset. The results show that NegNet achieves performance on par with ResNet. Note that the activation layer has been moved before the shortcut in NegNet for consistency with Eq. \ref{eq:resnet}.

Theorems & Definitions (1)

  • Theorem 2.1: Convergence Theorem of DCA; see, e.g., [dca1997, niu2022convergence]