LSNet: See Large, Focus Small

Ao Wang, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding

TL;DR

LSNet addresses the efficiency–accuracy trade-off in vision models by introducing LS convolution, which pairs large-kernel perception with small-kernel dynamic aggregation, following a "See Large, Focus Small" strategy. Built from LS blocks across a four-stage backbone, LSNet achieves strong accuracy at low FLOPs on classification, detection, segmentation, and robustness benchmarks. Ablation studies show the benefits of large-kernel perception, adaptive aggregation, and group-based dynamics, and LS convolution also transfers to ResNet and DeiT. The results position LSNet as a competitive baseline for lightweight vision systems with practical deployment benefits.

Abstract

Vision network designs, including Convolutional Neural Networks and Vision Transformers, have significantly advanced the field of computer vision. Yet, their complex computations pose challenges for practical deployments, particularly in real-time applications. To tackle this issue, researchers have explored various lightweight and efficient network designs. However, existing lightweight models predominantly leverage self-attention mechanisms and convolutions for token mixing. This dependence brings limitations in effectiveness and efficiency in the perception and aggregation processes of lightweight networks, hindering the balance between performance and efficiency under limited computational budgets. In this paper, we draw inspiration from the dynamic heteroscale vision ability inherent in the efficient human vision system and propose a "See Large, Focus Small" strategy for lightweight vision network design. We introduce LS (Large-Small) convolution, which combines large-kernel perception and small-kernel aggregation. It can efficiently capture a wide range of perceptual information and achieve precise feature aggregation for dynamic and complex visual representations, thus enabling proficient processing of visual information. Based on LS convolution, we present LSNet, a new family of lightweight models. Extensive experiments demonstrate that LSNet achieves superior performance and efficiency over existing lightweight networks in various vision tasks. Codes and models are available at https://github.com/jameslahm/lsnet.
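
The abstract describes LS convolution only at a high level, so the following is a toy sketch of the general idea rather than the authors' operator. It works on a single-channel 2D map in pure Python: a large window is summarised into a context value ("see large"), that context is turned into position-specific weights, and those weights aggregate a small neighbourhood ("focus small"). The function names, the window mean used as the large-kernel summary, the `proj` vector, and the softmax normalisation are all assumptions made for illustration; the released code at the URL above is the reference for the real design.

```python
import math


def window_mean(x, i, j, k):
    """Mean of the k x k window centred at (i, j); out-of-range cells count as 0."""
    h, w, r = len(x), len(x[0]), k // 2
    total = 0.0
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            ii, jj = i + di, j + dj
            if 0 <= ii < h and 0 <= jj < w:
                total += x[ii][jj]
    return total / (k * k)


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    s = sum(exps)
    return [e / s for e in exps]


def ls_conv_toy(x, proj, k_large=7, k_small=3):
    """Hypothetical single-channel sketch: large-kernel context -> dynamic small-kernel weights."""
    assert len(proj) == k_small * k_small
    h, w, r = len(x), len(x[0]), k_small // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # "See large": summarise a k_large x k_large neighbourhood (assumed: plain mean).
            context = window_mean(x, i, j, k_large)
            # Map the context to k_small * k_small weights (assumed: linear map + softmax).
            weights = softmax([p * context for p in proj])
            # "Focus small": weighted sum over the k_small x k_small neighbourhood.
            acc, idx = 0.0, 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += weights[idx] * x[ii][jj]
                    idx += 1
            out[i][j] = acc
    return out


# Example call on a constant 9 x 9 map with an arbitrary projection vector.
x = [[1.0] * 9 for _ in range(9)]
proj = [0.1 * k for k in range(9)]
y = ls_conv_toy(x, proj)
```

In this sketch the aggregation weights differ from pixel to pixel because they depend on the surrounding large window, while the sum itself only touches a 3 x 3 neighbourhood. Because the softmax weights sum to one, a constant input is reproduced unchanged away from the zero-padded border.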

Paper Structure

This paper contains 21 sections, 7 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: The mechanisms of self-attention (a) and convolution (b). (c) shows that the human vision system can "See Large" through peripheral vision and "Focus Small" through central vision. (d) shows the distribution of rods and cones as a function of eccentricity from the fovea of the human eye. They contribute to the formation of extensive peripheral vision and focal central vision.
  • Figure 2: Comparison of self-attention, convolution, and LS conv.
  • Figure 3: (a) The illustration of our proposed LS convolution. (b) The illustration of our proposed LSNet. LSNet has four stages with $\frac{H}{8}\times \frac{W}{8}$, $\frac{H}{16}\times \frac{W}{16}$, $\frac{H}{32}\times \frac{W}{32}$, and $\frac{H}{64}\times \frac{W}{64}$ resolutions, respectively, where $H$ and $W$ denote the height and width of the input image. $C$ represents the channel dimension. The norm layer and nonlinearity are omitted for simplicity.
  • Figure 4: Visualization of the effective receptive field. Best viewed when zoomed in. (a) and (b) show that RepMixer and CGA exhibit unnatural patterns in the effective receptive field. (c) illustrates that LS convolution enables broad peripheral perception and central view focusing simultaneously. (d) shows that without LKP, LS convolution presents a smaller receptive field compared with (c), indicating the effectiveness of LKP.
  • Figure 5: Superiority of LS conv.
  • ...and 5 more figures
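
To make the stage resolutions in the Figure 3 caption concrete, the short sketch below evaluates them for a given input size. The function name `stage_resolutions` is hypothetical, and rounding up when the input size is not divisible by the stride is an assumption (it matches what a stride-2, padded downsampling layer typically produces), not something stated in the caption.

```python
import math


def stage_resolutions(height, width, strides=(8, 16, 32, 64)):
    """Feature-map size (H/s, W/s) per stage; ceiling rounding is an assumption."""
    return [(math.ceil(height / s), math.ceil(width / s)) for s in strides]


print(stage_resolutions(224, 224))
# → [(28, 28), (14, 14), (7, 7), (4, 4)]
```

For a 224 x 224 input the last stage comes out at 3.5 x 3.5 before rounding, which is why the rounding rule matters there.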