TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement

Arian Sabaghi, José Oramas

TL;DR

TriLite is a single-stage WSOL framework that builds on a frozen Vision Transformer pre-trained with self-supervised DINOv2 and adds only a small number of trainable parameters for both classification and localization.

Abstract

Weakly supervised object localization (WSOL) aims to localize target objects in images using only image-level labels. Despite recent progress, many approaches still rely on multi-stage pipelines or full fine-tuning of large backbones, which increases training cost, and partial object coverage remains an open problem across the field. We present TriLite, a single-stage WSOL framework that leverages a frozen Vision Transformer pre-trained with self-supervised DINOv2 and introduces only a minimal number of trainable parameters (fewer than 800K on ImageNet-1K) for both classification and localization. At its core is the proposed TriHead module, which decomposes patch features into foreground, background, and ambiguous regions, thereby improving object coverage while suppressing spurious activations. By disentangling classification and localization objectives, TriLite effectively exploits the universal representations learned by self-supervised ViTs without requiring expensive end-to-end training. Extensive experiments on CUB-200-2011, ImageNet-1K, and OpenImages demonstrate that TriLite sets a new state of the art while remaining significantly more parameter-efficient and easier to train than prior methods. The code will be released soon.
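
The tri-region decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the TriHead acts as a per-patch linear map (equivalent to a 1x1 convolution) followed by a softmax over three channels, and both the kernel size and the normalization are assumptions, since the text above states only that patch features are split into foreground, background, and ambiguous regions. The names `tri_head`, `weights`, and `biases` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tri_head(patch_features, weights, biases):
    """Assign each patch a (foreground, background, ambiguous) distribution.

    patch_features: H x W grid of feature vectors (length D) from a frozen backbone.
    weights: 3 x D matrix, one row per region channel (assumed 1x1 convolution).
    biases: 3 offsets, one per region channel.
    The softmax across channels is an assumption made for this illustration.
    """
    maps = []
    for row in patch_features:
        out_row = []
        for feat in row:
            logits = [
                sum(w * f for w, f in zip(channel_weights, feat)) + bias
                for channel_weights, bias in zip(weights, biases)
            ]
            out_row.append(softmax(logits))
        maps.append(out_row)
    return maps
```

In this sketch only `weights` and `biases` would be trained, which is consistent with the small trainable-parameter budget the abstract reports; the backbone that produces `patch_features` stays frozen.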

Paper Structure (14 sections, 6 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Overview of TriLite for WSOL. A frozen ViT backbone extracts patch features, while only a lightweight classification layer is trained using the class token. The TriHead module applies a single convolutional layer to produce foreground, background, and ambiguous heatmaps. Supervision is applied to foreground and background embeddings.
  • Figure 2: Comparison of localization results on the CUB-200-2011 and ILSVRC datasets. Green and red denote ground-truth and predicted bounding boxes, respectively.
  • Figure 3: Partial coverage on CUB-200-2011. TriLite activations may miss occluded regions, yielding fragmented bounding boxes (“Partial”). Merging all activated regions into a single box (“Merged”) recovers full coverage.
  • Figure 4: Qualitative results on the OpenImages dataset. The first column shows the original image with the ground-truth mask overlaid.
  • Figure 5: Comparison between binary and three-channel outputs. Regions mistakenly activated in the binary setting are reassigned to the ambiguous channel in the three-channel formulation.
  • ...and 1 more figure
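
The difference between the "Partial" and "Merged" boxes in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: it assumes the foreground heatmap has already been thresholded into a binary mask and that regions are grouped by 4-connectivity, neither of which is specified in the captions above. The function names `component_boxes` and `merged_box` are hypothetical.

```python
from collections import deque


def component_boxes(mask):
    """Return one (row_min, col_min, row_max, col_max) box per 4-connected region.

    mask: non-empty H x W grid of 0/1 values (an assumed thresholded heatmap).
    """
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search over one activated region, tracking its extent.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < height and 0 <= nx < width:
                        if mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes


def merged_box(boxes):
    """Smallest single box enclosing every per-region box (raises on an empty list)."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )
```

When an occluder splits the activation into two fragments, `component_boxes` returns two separate boxes, while `merged_box` spans both and so covers the full object extent.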