IMC-Net: A Lightweight Content-Conditioned Encoder with Multi-Pass Processing for Image Classification

YiZhou Li

TL;DR

IMC-Net addresses the inefficiency of fixed-depth encoders with content-conditioned multi-pass processing driven by region-wise scores. A single lightweight core block is re-applied selectively, using a percentile-based mask and a compact representation cache, so that depth adapts to the input while the architectural footprint stays minimal. The approach yields competitive accuracy on ImageNet and transfer tasks with substantially fewer parameters and FLOPs and higher throughput, without relying on distillation or large-scale pretraining. The design transfers to several datasets without task-specific customization, which makes it a practical option for resource-efficient visual recognition.

Abstract

We present a compact encoder for image categorization that emphasizes computation economy through content-conditioned multi-pass processing. The model employs a single lightweight core block that can be re-applied a small number of times, while a simple score-based selector decides whether further passes are beneficial for each region unit in the feature map. This design provides input-conditioned depth without introducing heavy auxiliary modules or specialized pretraining. On standard benchmarks, the approach attains competitive accuracy with reduced parameters, lower floating-point operations, and faster inference compared to similarly sized baselines. The method keeps the architecture minimal, implements module reuse to control footprint, and preserves stable training via mild regularization on selection scores. We discuss implementation choices for efficient masking, pass control, and representation caching, and show that the multi-pass strategy transfers well to several datasets without requiring task-specific customization.

Paper Structure

This paper contains 24 sections, 4 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Schematic of the baseline encoder. The image is partitioned into patches, linearly embedded, concatenated with a summary vector, augmented with positional information, and then processed by encoder blocks; the summary output is used for classification.
  • Figure 2: Workflow of the multi-pass encoder. A lightweight selector produces region-wise scores; a percentile mask retains only the regions deemed beneficial for an extra pass, while others remain unchanged. Representation caching keeps the procedure efficient.
  • Figure 3: Comparison of model parameters among different methods (smaller is better).
  • Figure 4: Comparison of FLOPs among different methods (lower is better).
  • Figure 5: Comparison of inference speed among different methods (higher is better).
  • ...and 1 more figure
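
The multi-pass workflow described for Figure 2 can be illustrated with a minimal sketch. This is not the authors' implementation: the names `percentile_mask`, `multi_pass`, `core_block`, `selector`, `keep_fraction`, and `max_passes` are hypothetical, the percentile threshold is approximated by keeping a fixed top fraction of regions, and the "cache" is modeled simply by leaving unselected regions untouched.

```python
import math


def percentile_mask(scores, keep_fraction):
    """Return a boolean mask that keeps the highest-scoring regions.

    Assumption: a percentile threshold is modeled as "keep the top
    `keep_fraction` of regions", with at least one region kept.
    """
    if not scores:
        return []
    k = max(1, math.ceil(keep_fraction * len(scores)))
    # Indices of the k largest scores (stable for equal scores).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(scores))]


def multi_pass(regions, core_block, selector, max_passes=3, keep_fraction=0.5):
    """Re-apply one shared block only to the regions selected in each pass.

    `core_block` and `selector` are hypothetical callables standing in for
    the lightweight core block and the score-based selector.
    """
    regions = list(regions)            # unselected entries act as the cache
    pass_counts = [0] * len(regions)   # effective depth per region
    for _ in range(max_passes):
        scores = [selector(r) for r in regions]
        mask = percentile_mask(scores, keep_fraction)
        for i, selected in enumerate(mask):
            if selected:
                regions[i] = core_block(regions[i])
                pass_counts[i] += 1
    return regions, pass_counts


if __name__ == "__main__":
    # Toy stand-ins: the "block" halves each feature, the "selector"
    # scores a region by the sum of absolute feature values.
    halve = lambda r: [0.5 * x for x in r]
    magnitude = lambda r: sum(abs(x) for x in r)
    toy = [[4.0], [1.0], [3.0], [2.0]]
    out, depth = multi_pass(toy, halve, magnitude, max_passes=2)
    print(out, depth)
```

In this toy run, regions receive different numbers of passes depending on their scores, which is the sense in which depth is input-conditioned; a real model would operate on batched tensors and learn the selector.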