Efficiency Follows Global-Local Decoupling

Zhenyu Yang, Gensheng Pei, Tao Chen, Yichao Zhou, Tianfei Zhou, Yazhou Yao, Fumin Shen

Abstract

Modern vision models must capture image-level context without sacrificing local detail while remaining computationally affordable. We revisit this tradeoff and advance a simple principle: decouple the roles of global reasoning and local representation. To operationalize this principle, we introduce ConvNeur, a two-branch architecture in which a lightweight neural memory branch aggregates global context on a compact set of tokens, and a locality-preserving branch extracts fine structure. A learned gate lets global cues modulate local features without entangling their objectives. This separation yields subquadratic scaling with image size, retains inductive priors associated with local processing, and reduces overhead relative to fully global attention. On standard classification, detection, and segmentation benchmarks, ConvNeur matches or surpasses comparable alternatives at similar or lower compute and offers favorable accuracy versus latency trade-offs at similar budgets. These results support the view that efficiency follows global-local decoupling.

Paper Structure

This paper contains 15 sections, 6 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Efficiency vs. accuracy on the ImageNet-1K benchmark. The proposed ConvNeur family establishes a new state-of-the-art frontier, delivering higher top-1 accuracy while requiring significantly fewer FLOPs and parameters than existing methods.
  • Figure 2: The overall architecture of ConvNeur. On the left, a locality-preserving convolutional branch extracts fine-grained features. In parallel, a compressed global branch performs chunked neural-memory aggregation and produces a gating map that modulates the local features. On the right, we show the neural-memory module: each chunk is linearly mapped to $\{q_t,k_t,v_t\}$. $q_t$ reads from the previous memory state $\mathcal{M}_{t-1}$, while $k_t$ and $v_t$ define a surprise loss whose minimization updates the memory to $\mathcal{M}_t$.
  • Figure 3: Qualitative results on the ADE20K dataset (Zhou et al., 2017). All examples are from the validation set. In the figure, B denotes the Boundary F1 score, T the trimap-based mIoU, and H the Hausdorff distance.
  • Figure 4: Stage-wise visualization of global-to-local modulation. Columns correspond to the four stages from shallow to deep. Row 1 shows local-only features $F_\text{loc}$ produced by the locality-preserving branch without the global path. Row 2 shows the local features in ConvNeur before gating. Row 3 shows the gated features $F_\text{out}$. Green arrows mark the gating step.
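
The mechanism described in the captions of Figures 2 and 4 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: the memory $\mathcal{M}$ is taken to be a single linear map, the surprise loss is taken to be the mean squared error between $k_t \mathcal{M}_{t-1}$ and $v_t$, the update is one plain gradient step, and the gating map is a sigmoid of the global read-out that is upsampled to the local resolution by nearest-neighbor repetition. The names `neural_memory`, `gate_local`, `lr`, and `ratio` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def neural_memory(tokens, w_q, w_k, w_v, chunk_size, lr=0.1):
    """Chunked read/update of a linear memory (assumed form, for illustration).

    tokens: (n, d) compact global tokens; w_q, w_k, w_v: (d, d) projections.
    Returns the per-token read-out, the final memory, and per-chunk surprise.
    """
    n, d = tokens.shape
    memory = np.zeros((d, d))  # M_0
    reads, surprises = [], []
    for start in range(0, n, chunk_size):
        chunk = tokens[start:start + chunk_size]
        q, k, v = chunk @ w_q, chunk @ w_k, chunk @ w_v
        reads.append(q @ memory)            # q_t reads from M_{t-1}
        err = k @ memory - v                # how badly M_{t-1} predicts v_t from k_t
        surprises.append(float(np.mean(err ** 2)))
        grad = 2.0 * k.T @ err / err.size   # gradient of the mean squared error w.r.t. M
        memory = memory - lr * grad         # M_t
    return np.concatenate(reads, axis=0), memory, surprises


def gate_local(f_loc, global_ctx, w_g, ratio):
    """Modulate local features with a gate derived from the global read-out.

    f_loc: (n * ratio, d) local features; global_ctx: (n, d) global read-out.
    """
    gate = sigmoid(global_ctx @ w_g)        # (n, d), values in (0, 1)
    gate = np.repeat(gate, ratio, axis=0)   # nearest-neighbor upsampling (assumption)
    return f_loc * gate                     # F_out


rng = np.random.default_rng(0)
d, n_tokens, chunk, ratio = 8, 16, 4, 4
tokens = rng.standard_normal((n_tokens, d))
w_q, w_k, w_v, w_g = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

ctx, memory, surprises = neural_memory(tokens, w_q, w_k, w_v, chunk)
f_loc = rng.standard_normal((n_tokens * ratio, d))
f_out = gate_local(f_loc, ctx, w_g, ratio)
print(ctx.shape, f_out.shape, len(surprises))
```

In this sketch the recurrence runs over chunks of the compact token set only, so its cost grows with the number of global tokens rather than with the number of pixel pairs, while the gate touches each local position once. This is one way to read the decoupling described in the abstract, under the assumptions stated above.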