TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement

Arian Sabaghi; José Oramas

TL;DR

TriLite is a single-stage WSOL framework that uses a frozen Vision Transformer with self-supervised DINOv2 pre-training and introduces only a small number of trainable parameters for both classification and localization.

Abstract

Weakly supervised object localization (WSOL) aims to localize target objects in images using only image-level labels. Despite recent progress, many approaches still rely on multi-stage pipelines or full fine-tuning of large backbones, which increases training cost, and partial object coverage remains an open challenge for the broader WSOL community. We present TriLite, a single-stage WSOL framework that uses a frozen Vision Transformer with self-supervised DINOv2 pre-training and introduces only a small number of trainable parameters (fewer than 800K on ImageNet-1K) for both classification and localization. At its core is the proposed TriHead module, which decomposes patch features into foreground, background, and ambiguous regions, thereby improving object coverage while suppressing spurious activations. By disentangling the classification and localization objectives, TriLite exploits the universal representations learned by self-supervised ViTs without requiring expensive end-to-end training. Extensive experiments on CUB-200-2011, ImageNet-1K, and OpenImages demonstrate that TriLite sets a new state of the art while remaining significantly more parameter-efficient and easier to train than prior methods. The code will be released soon.
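To make the parameter budget and the tri-region idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear classifier on the class token and a three-channel head over patch features, as the abstract and the Figure 1 caption describe. The embedding width of 768, the 1x1 kernel size, and the softmax across the three channels are illustrative assumptions that the source does not state. The function names are hypothetical.

```python
import math
import random


def linear_params(in_dim, out_dim):
    """Weights plus biases of a dense layer (equivalently a 1x1 convolution)."""
    return in_dim * out_dim + out_dim


def trainable_params(embed_dim, num_classes, num_regions=3):
    """Trainable parameters when the backbone is frozen (hypothetical helper)."""
    classifier = linear_params(embed_dim, num_classes)  # applied to the class token
    tri_head = linear_params(embed_dim, num_regions)    # applied to every patch
    return classifier + tri_head


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tri_region_maps(patch_features, weights, biases):
    """Map each patch feature to (foreground, background, ambiguous) scores.

    ASSUMPTION: normalising with a softmax across the three channels is an
    illustrative choice; the source only says a single convolutional layer
    produces the three heatmaps.
    """
    maps = []
    for feat in patch_features:
        scores = [
            sum(w * f for w, f in zip(row, feat)) + b
            for row, b in zip(weights, biases)
        ]
        maps.append(softmax(scores))
    return maps


# ASSUMPTION: embed_dim=768 is a ViT-B-sized width chosen for illustration.
total = trainable_params(embed_dim=768, num_classes=1000)

# Toy example: 4 patches with 8-dimensional features and random head weights.
random.seed(0)
feats = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
weights = [[random.gauss(0, 0.1) for _ in range(8)] for _ in range(3)]
biases = [0.0, 0.0, 0.0]
maps = tri_region_maps(feats, weights, biases)
```

Under these assumptions the count is 769,000 parameters for the classifier plus 2,307 for the three-channel head, 771,307 in total, which is consistent with the stated budget of fewer than 800K on ImageNet-1K. The budget is dominated by the classifier, so the three-region head itself adds almost nothing to training cost.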

Paper Structure (14 sections, 6 equations, 6 figures, 4 tables)

Introduction
Related Work
TriLite
Experiments
Experimental Settings
Results of WSOL
Supervised vs. Self-supervised Backbones
Analysis of CUB-200-2011 Performance
Results of WSSS
Parameter Efficiency and Simplicity
Ablation Study
Failure cases and Future Directions
Conclusion
Acknowledgements

Figures (6)

Figure 1: Overview of TriLite for WSOL. A frozen ViT backbone extracts patch features, while only a lightweight classification layer is trained using the class token. The TriHead module applies a single convolutional layer to produce foreground, background, and ambiguous heatmaps. Supervision is applied to the foreground and background embeddings.
Figure 2: Comparison of localization results on CUB-200-2011 and ILSVRC datasets. Green and red colors are used for ground-truth and predicted bounding boxes.
Figure 3: Partial coverage on CUB-200-2011. TriLite activations may miss occluded regions, yielding fragmented bounding boxes (“Partial”). Merging all activated regions into a single box (“Merged”) recovers full coverage.
Figure 4: Qualitative results on the OpenImages dataset. The first column shows the original image with the ground-truth mask overlaid.
Figure 5: Comparison between binary and three-channel outputs. Regions mistakenly activated in the binary setting are reassigned to the ambiguous channel in the three-channel formulation.
...and 1 more figure
