REN: Fast and Efficient Region Encodings from Patch-Based Image Encoders

Savya Khosla, Sethuraman TV, Barnett Lee, Alexander Schwing, Derek Hoiem

TL;DR

REN tackles the inefficiency of patch-based ViT representations by producing region-level tokens with point prompts and cross-attention, eliminating dependence on costly segmentation masks during inference. It trains with a self-supervised objective combining contrastive learning ($L_{cont}$) and feature-space alignment ($L_{feat}$) using SAM-derived region IDs, enabling robust, content-aware region tokens that generalize across encoders. REN achieves substantial efficiency gains (e.g., $60\times$ faster token generation and $35\times$ less memory) and competitive or superior performance on semantic segmentation, retrieval, and visual query localization, including state-of-the-art results on Ego4D VQ2D. By supporting transfer to multiple encoders (e.g., DINO, DINOv2, OpenCLIP) without retraining and providing practical prompting and aggregation strategies, REN offers a scalable route to region-based image representations. Code and models are available at the authors' GitHub repository.

Abstract

We introduce the Region Encoder Network (REN), a fast and effective model for generating region-based image representations using point prompts. Recent methods combine class-agnostic segmenters (e.g., SAM) with patch-based image encoders (e.g., DINO) to produce compact and effective region representations, but they suffer from high computational cost due to the segmentation step. REN bypasses this bottleneck using a lightweight module that directly generates region tokens, enabling 60x faster token generation with 35x less memory, while also improving token quality. It uses a few cross-attention blocks that take point prompts as queries and features from a patch-based image encoder as keys and values to produce region tokens that correspond to the prompted objects. We train REN with three popular encoders (DINO, DINOv2, and OpenCLIP) and show that it can be extended to other encoders without dedicated training. We evaluate REN on semantic segmentation and retrieval tasks, where it consistently outperforms the original encoders in both performance and compactness, and matches or exceeds SAM-based region methods while being significantly faster. Notably, REN achieves state-of-the-art results on the challenging Ego4D VQ2D benchmark and outperforms proprietary LMMs on Visual Haystacks' single-needle challenge. Code and models are available at: https://github.com/savya08/REN.
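
The core mechanism in the abstract (point prompts as queries, patch features as keys and values) can be illustrated with a minimal single-head cross-attention sketch in plain Python. This is not the authors' implementation: the abstract does not say how a point prompt is turned into a query vector, so the sketch assumes, purely for illustration, that the query is the feature of the patch containing the prompt. The names `cross_attention`, `patch_feats`, and `point_to_patch` are hypothetical, and learned projections, multiple heads, and stacked blocks are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: one vector per point prompt
    keys, values: one vector per image patch
    Returns (region_tokens, attention_weights).
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    tokens, weights = [], []
    for q in queries:
        # Similarity of this prompt's query to every patch key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        attn = softmax(scores)
        # Region token = attention-weighted average of patch values.
        token = [sum(a * v[j] for a, v in zip(attn, values))
                 for j in range(len(values[0]))]
        tokens.append(token)
        weights.append(attn)
    return tokens, weights


# Toy image: 4 patches with 2-d features; patches 0-1 and 2-3 form two "regions".
patch_feats = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]

# ASSUMPTION (illustration only): a prompt's query is the feature of the patch
# it falls in. Prompt 0 lands in patch 0, prompt 1 lands in patch 3.
point_to_patch = [0, 3]
queries = [patch_feats[i] for i in point_to_patch]

region_tokens, attn = cross_attention(queries, patch_feats, patch_feats)
```

In this toy setup each prompt attends almost entirely to the patches of its own region, which mirrors the behavior the paper reports for its thresholded attention maps; in a trained model that behavior would come from learned parameters rather than from the hand-picked features used here.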

Paper Structure

This paper contains 14 sections, 2 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Overview of REN. Point prompts interact with patch-based features through cross-attention blocks to produce region tokens. The training objective combines two components: (1) a contrastive loss that aligns region tokens with those generated from an augmented view of the same image, and (2) a feature similarity loss that aligns a linear projection of these tokens with average-pooled patch features obtained using SAM masks. REN eliminates the need for explicit segmentation at inference time while producing efficient and semantically rich region representations. We also show thresholded attention maps for three query points inside the cross-attention block, which show that the model learns to aggregate features primarily from the regions marked by the corresponding point prompts.
  • Figure 1: Runtime comparison. With 10$\times$ fewer parameters, REN achieves over 60$\times$ speedup compared to the fastest SAM-based approach, as measured on a single NVIDIA A40 GPU. Evaluations use either a 32$\times$32 grid of point prompts or 1024 SLIC-based prompts. Reported metrics exclude the patch-based image encoder (DINO ViT-B/8: 85.8M parameters, 0.011 s/img).
  • Figure 2: Point prompting strategies and token aggregation results. Region tokens corresponding to point prompts within the same-colored area are aggregated, and we show a representative point prompt for each region. Thus, each image can be represented with a few dozen tokens instead of the hundreds required by patch-based methods. (Best viewed in color)
  • Figure 2: Visual query localization on the Ego4D VQ2D benchmark. Our method substantially outperforms existing approaches, including those specifically developed for this task. Baseline results are sourced from the leaderboard at https://eval.ai/web/challenges/challenge-page/1843/leaderboard/4326.
  • Figure 3: Semantic segmentation using a linear classifier on frozen features. For reference, the absolute state-of-the-art results from deeplabv3 and onepiece are reported in the column heading. Results for methods using external segmenters are taken from dinosam.
  • ...and 6 more figures
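
The two-part training objective described in the overview caption (a contrastive loss across augmented views, plus a feature similarity loss against SAM-mask average-pooled patch features) can be sketched as follows. The exact loss forms, temperature, and weighting are not given in this summary, so the sketch makes explicit assumptions: an InfoNCE-style contrastive term, a cosine-distance feature term, and an unweighted sum. All function names and toy values are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(tokens_a, tokens_b, temperature=0.1):
    """ASSUMED InfoNCE form: token i of view A should match token i of view B."""
    a = [normalize(t) for t in tokens_a]
    b = [normalize(t) for t in tokens_b]
    loss = 0.0
    for i, ta in enumerate(a):
        logits = [dot(ta, tb) / temperature for tb in b]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]  # negative log-softmax of the positive pair
    return loss / len(a)


def masked_average_pool(patch_feats, mask):
    """Average the patch features selected by a binary region mask."""
    selected = [f for f, m in zip(patch_feats, mask) if m]
    return [sum(col) / len(selected) for col in zip(*selected)]


def feature_similarity_loss(tokens, proj, patch_feats, masks):
    """ASSUMED cosine-distance form between a linear projection of each region
    token and the average-pooled patch features under its region mask."""
    loss = 0.0
    for token, mask in zip(tokens, masks):
        projected = [dot(row, token) for row in proj]  # linear projection
        target = masked_average_pool(patch_feats, mask)
        loss += 1.0 - dot(normalize(projected), normalize(target))
    return loss / len(tokens)


# Toy data: 4 patches, 2 regions, region tokens from two augmented views.
patch_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
masks = [[1, 1, 0, 0], [0, 0, 1, 1]]   # stand-ins for SAM-derived region masks
tokens_view_a = [[2.0, 0.1], [0.1, 2.0]]
tokens_view_b = [[1.9, 0.2], [0.0, 2.1]]
proj = [[1.0, 0.0], [0.0, 1.0]]        # identity stands in for a learned projection

l_cont = contrastive_loss(tokens_view_a, tokens_view_b)
l_feat = feature_similarity_loss(tokens_view_a, proj, patch_feats, masks)
total_loss = l_cont + l_feat           # ASSUMPTION: equal weighting
```

The masks are only needed to build the training target; at inference the region tokens come directly from the cross-attention module, which is why the segmenter can be dropped.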