OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels

Meng Lou, Yizhou Yu

TL;DR

This paper addresses the difficulty that purely hierarchical ConvNets have in capturing global context by proposing OverLoCK, a pure ConvNet backbone that emulates the overview-first, look-closely-next strategy of human vision. It introduces a Deep-stage Decomposition Strategy (DDS) that splits the network into a Base-Net, an Overview-Net, and a Focus-Net, and a Context-Mixing Dynamic Convolution (ContMix) that injects top-down semantic context into dynamic kernels. The approach yields state-of-the-art or competitive results on ImageNet-1K, COCO, and ADE20K, reaching 84.2% Top-1 accuracy on ImageNet-1K for OverLoCK-T and 85.1% for OverLoCK-B at favorable compute, along with strong robustness to out-of-distribution data. The work shows that combining biomimetic top-down attention with a dynamic, context-aware convolution can surpass large-kernel and transformer-based rivals while retaining the efficiency and inductive biases of ConvNets.
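
The three-way split described above can be pictured as a simple data flow. The following is a minimal single-channel sketch and not the authors' implementation: the function names (`base_net`, `overview_net`, `focus_net`, `overlock_forward`), the pooling factors, and the sigmoid gate are all assumptions made for illustration, since this page does not specify how the top-down signal is fused.

```python
# Toy, single-channel sketch of the Base-Net -> Overview-Net -> Focus-Net flow.
# All names, pooling factors, and the sigmoid gate are illustrative assumptions.
import math


def avg_pool(x, f):
    """Average-pool a 2D list by an integer factor f (sizes must divide evenly)."""
    h, w = len(x), len(x[0])
    return [
        [
            sum(x[i * f + di][j * f + dj] for di in range(f) for dj in range(f))
            / (f * f)
            for j in range(w // f)
        ]
        for i in range(h // f)
    ]


def upsample(x, f):
    """Nearest-neighbour upsampling of a 2D list by an integer factor f."""
    return [
        [x[i // f][j // f] for j in range(len(x[0]) * f)]
        for i in range(len(x) * f)
    ]


def base_net(image):
    # Stand-in for the stages that encode low/mid-level features.
    return avg_pool(image, 2)


def overview_net(base_feat):
    # Lightweight branch: further downsampling gives a coarse context map.
    return avg_pool(base_feat, 2)


def focus_net(base_feat, context):
    # Finer-grained branch: the coarse context is brought back to the feature
    # resolution and used as a top-down gate on the base features.
    ctx = upsample(context, 2)
    return [
        [v / (1.0 + math.exp(-c)) for v, c in zip(row_v, row_c)]
        for row_v, row_c in zip(base_feat, ctx)
    ]


def overlock_forward(image):
    feat = base_net(image)           # low/mid-level features
    context = overview_net(feat)     # "overview first"
    return focus_net(feat, context)  # "look closely next"


image = [[float(i + j) for j in range(8)] for i in range(8)]
out = overlock_forward(image)
print(len(out), len(out[0]))  # 4 4
```

The point of the sketch is the ordering: the coarse branch runs first and its output conditions the finer branch, rather than the network being a single chain of downsampling stages.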

Abstract

Top-down attention plays a crucial role in the human vision system, wherein the brain initially obtains a rough overview of a scene to discover salient cues (i.e., overview first), followed by a more careful finer-grained examination (i.e., look closely next). However, modern ConvNets remain confined to a pyramid structure that successively downsamples the feature map for receptive field expansion, neglecting this crucial biomimetic principle. We present OverLoCK, the first pure ConvNet backbone architecture that explicitly incorporates a top-down attention mechanism. Unlike pyramid backbone networks, our design features a branched architecture with three synergistic sub-networks: 1) a Base-Net that encodes low/mid-level features; 2) a lightweight Overview-Net that generates dynamic top-down attention through coarse global context modeling (i.e., overview first); and 3) a robust Focus-Net that performs finer-grained perception guided by top-down attention (i.e., look closely next). To fully unleash the power of top-down attention, we further propose a novel context-mixing dynamic convolution (ContMix) that effectively models long-range dependencies while preserving inherent local inductive biases even when the input resolution increases, addressing critical limitations in existing convolutions. Our OverLoCK exhibits a notable performance improvement over existing methods. For instance, OverLoCK-T achieves a Top-1 accuracy of 84.2%, significantly surpassing ConvNeXt-B while using only around one-third of the FLOPs/parameters. On object detection, our OverLoCK-S clearly surpasses MogaNet-B by 1% in AP^b. On semantic segmentation, our OverLoCK-T remarkably improves UniRepLKNet-T by 1.7% in mIoU. Code is publicly available at https://github.com/LMMMEng/OverLoCK.
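
To make the idea of a context-mixing dynamic convolution more concrete, here is a rough single-channel sketch. It is not the paper's ContMix formulation, which this page does not reproduce. It assumes one plausible reading of the description: each pixel's kernel weights are generated from its affinities to a few pooled region centers of a context map (a global signal), and the resulting kernel is then applied only to the pixel's local window (a local inductive bias). The names `region_centers`, `contmix_like`, and `proj`, the scalar affinity, and the softmax normalization are all assumptions.

```python
# Toy single-channel sketch of a context-mixed dynamic convolution.
# Names, shapes, and the random projection are illustrative assumptions.
import math
import random


def region_centers(context, s):
    """Pool a 2D context map into s*s region centers (returned as a flat list)."""
    h, w = len(context), len(context[0])
    bh, bw = h // s, w // s
    return [
        sum(context[i][j]
            for i in range(r * bh, (r + 1) * bh)
            for j in range(c * bw, (c + 1) * bw)) / (bh * bw)
        for r in range(s) for c in range(s)
    ]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contmix_like(x, context, proj, s, k):
    """Apply a per-pixel k*k kernel generated from global context affinities."""
    h, w = len(x), len(x[0])
    centers = region_centers(context, s)  # s*s global summaries
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Affinity of this pixel to every region center (global information).
            affinity = [x[i][j] * c for c in centers]
            # Mix the s*s affinities into k*k kernel logits, then normalize.
            logits = [sum(a * proj[n][m] for n, a in enumerate(affinity))
                      for m in range(k * k)]
            kernel = softmax(logits)
            # Apply the kernel to the local k*k window (zero padding at borders).
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    ii, jj = i + di - pad, j + dj - pad
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[di * k + dj] * x[ii][jj]
            out[i][j] = acc
    return out


random.seed(0)
S, K = 2, 3
x = [[random.random() for _ in range(6)] for _ in range(6)]
context = [[random.random() for _ in range(6)] for _ in range(6)]
proj = [[random.gauss(0.0, 1.0) for _ in range(K * K)] for _ in range(S * S)]
y = contmix_like(x, context, proj, S, K)
print(len(y), len(y[0]))  # 6 6
```

In this sketch the number of region centers (`S * S`) is fixed, so the kernel generator sees the whole context map regardless of input size, while the kernel itself stays a small local window.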

Paper Structure

This paper contains 20 sections, 7 figures, 16 tables.

Figures (7)

  • Figure 1: Performance comparisons between our OverLoCK and other representative backbone networks on vision tasks.
  • Figure 2: (a) Comparison of Effective Receptive Fields (ERF) (Luo et al., 2016) at the last layer of deep stages (i.e., Stages 3 and 4) among backbone networks. The results are obtained by averaging over 300 images from the ImageNet-1K validation set. As shown, despite being a pure ConvNet, OverLoCK-T has a larger ERF than VMamba-T, which emphasizes global modeling, in both Stages 3 and 4. (b) Visualizations of class activation maps computed using Grad-CAM (Selvaraju et al., 2020) for the output of deep stages (i.e., Stages 3 and 4). The category labels of these two images are "Barrel" and "Neck Brace". The results demonstrate that although classic hierarchical models can capture long-range dependencies to varying degrees, they struggle to localize objects with the correct category label, especially in Stage 3, which is farther from the classifier. In contrast, our proposed new network architecture can produce more accurate class activation maps in both Stages 3 and 4.
  • Figure 3: The architecture of our OverLoCK network.
  • Figure 4: Structures of network building blocks.
  • Figure 5: (a) A schematic diagram of our proposed dynamic convolution (ContMix). (b) An illustration of ContMix's ability to capture long-range dependencies while preserving inductive biases.
  • ...and 2 more figures