OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels
Meng Lou, Yizhou Yu
TL;DR
This paper addresses the limited ability of purely hierarchical ConvNets to capture global context by proposing OverLoCK, a pure ConvNet backbone that emulates the overview-first, look-closely-next strategy of human vision. It introduces a Deep-stage Decomposition Strategy (DDS) that splits the network into a Base-Net, an Overview-Net, and a Focus-Net, and a Context-Mixing Dynamic Convolution (ContMix) that injects top-down semantic context into dynamic kernels. The approach yields state-of-the-art or competitive results on ImageNet-1K, COCO, and ADE20K, notably achieving $84.2\%$ Top-1 accuracy on ImageNet-1K for OverLoCK-T and $85.1\%$ for OverLoCK-B at favorable compute cost, along with strong robustness to out-of-distribution data. The work shows that combining biomimetic top-down attention with a dynamic, context-aware convolution can surpass large-kernel and transformer-based rivals while retaining the efficiency and inductive biases of ConvNets.
Abstract
Top-down attention plays a crucial role in the human vision system, wherein the brain initially obtains a rough overview of a scene to discover salient cues (i.e., overview first), followed by a more careful finer-grained examination (i.e., look closely next). However, modern ConvNets remain confined to a pyramid structure that successively downsamples the feature map for receptive field expansion, neglecting this crucial biomimetic principle. We present OverLoCK, the first pure ConvNet backbone architecture that explicitly incorporates a top-down attention mechanism. Unlike pyramid backbone networks, our design features a branched architecture with three synergistic sub-networks: 1) a Base-Net that encodes low/mid-level features; 2) a lightweight Overview-Net that generates dynamic top-down attention through coarse global context modeling (i.e., overview first); and 3) a robust Focus-Net that performs finer-grained perception guided by top-down attention (i.e., look closely next). To fully unleash the power of top-down attention, we further propose a novel context-mixing dynamic convolution (ContMix) that effectively models long-range dependencies while preserving inherent local inductive biases even when the input resolution increases, addressing critical limitations in existing convolutions. Our OverLoCK exhibits a notable performance improvement over existing methods. For instance, OverLoCK-T achieves a Top-1 accuracy of 84.2%, significantly surpassing ConvNeXt-B while using only around one-third of the FLOPs/parameters. On object detection, our OverLoCK-S clearly surpasses MogaNet-B by 1% in AP^b. On semantic segmentation, our OverLoCK-T remarkably improves UniRepLKNet-T by 1.7% in mIoU. Code is publicly available at https://github.com/LMMMEng/OverLoCK.
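The three-branch data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`base_net`, `overview_net`, `focus_net`, `overlock_like_forward`) are hypothetical, and average pooling with sigmoid gating are simple stand-ins assumed here for the real convolutional blocks and for ContMix, which is not modeled. The sketch only shows the ordering: shared features are computed first, a coarse overview is derived from them, and that overview then guides a finer pass over the same features.

```python
import math

def avg_pool(fmap, k):
    """Non-overlapping k x k average pooling on a 2D list (H, W divisible by k)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            sum(fmap[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
            for j in range(0, w, k)
        ]
        for i in range(0, h, k)
    ]

def upsample_nearest(fmap, k):
    """Nearest-neighbour upsampling of a 2D list by an integer factor k."""
    return [[v for v in row for _ in range(k)] for row in fmap for _ in range(k)]

def base_net(image):
    # Assumed placeholder for low/mid-level feature encoding: one 2x reduction.
    return avg_pool(image, 2)

def overview_net(features):
    # Assumed placeholder for coarse global context ("overview first"):
    # a further 2x reduction, squashed into (0, 1) to act as a guidance map.
    coarse = avg_pool(features, 2)
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in coarse]

def focus_net(features, guidance):
    # Assumed placeholder for guided fine-grained perception ("look closely next"):
    # the coarse guidance is upsampled and used to gate the finer features.
    gate = upsample_nearest(guidance, 2)
    return [[f * g for f, g in zip(frow, grow)] for frow, grow in zip(features, gate)]

def overlock_like_forward(image):
    features = base_net(image)          # shared low/mid-level features
    guidance = overview_net(features)   # top-down signal from a coarse pass
    return focus_net(features, guidance)

# Toy 8x8 single-channel "image" with non-negative values.
image = [[float((i * 8 + j) % 5) for j in range(8)] for i in range(8)]
out = overlock_like_forward(image)
print(len(out), len(out[0]))  # 4 4
```

In this toy version the guidance is a fixed gate on the features. In the paper's design, by contrast, the top-down signal is described as feeding into dynamic convolution kernels via ContMix.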
