ORION: ORthonormal Text Encoding for Universal VLM AdaptatION

Omprakash Chakraborty; Jose Dolz; Ismail Ben Ayed

TL;DR

This work introduces ORION, a text encoder fine-tuning framework that improves pretrained VLMs using only class names, and provides a probabilistic interpretation of the orthogonality penalty, connecting it to the general maximum likelihood estimation (MLE) principle via Huygens' theorem.
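
For readers unfamiliar with the scatter decomposition mentioned above, the classical statement is recalled here. This is the textbook identity only, not the paper's derivation. The symbols $x_i$, $\bar{x}$, $c$, and $n$ are notation chosen for this note, and the paper's own notation may differ.

```latex
% Huygens' theorem (scatter decomposition), classical form.
% Points x_1, ..., x_n in R^d, mean \bar{x} = (1/n) \sum_i x_i, arbitrary reference point c.
\sum_{i=1}^{n} \lVert x_i - c \rVert^2
  = \sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2 + n \, \lVert \bar{x} - c \rVert^2

% Equivalent pairwise form: the scatter around the mean equals a sum over pairs.
\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2
  = \frac{1}{2n} \sum_{i=1}^{n} \sum_{j=1}^{n} \lVert x_i - x_j \rVert^2

% For unit-norm vectors, \lVert x_i - x_j \rVert^2 = 2 - 2\, x_i^\top x_j,
% so pairwise distances and pairwise inner products carry the same information.
```

The last line suggests why a penalty on pairwise inner products between normalized prototypes can be read as a statement about scatter. How the paper ties this to MLE is developed in its "Link to Maximum Likelihood Estimation" section.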

Abstract

Vision-language models (VLMs) have demonstrated remarkable generalization across diverse tasks, yet their performance remains constrained by the quality and geometry of the textual prototypes used to represent classes. Standard zero-shot classifiers, derived from frozen text encoders and handcrafted prompts, may yield correlated or weakly separated embeddings that limit task-specific discriminability. We introduce ORION, a text encoder fine-tuning framework that improves pretrained VLMs using only class names. Our method optimizes, via low-rank adaptation, a novel loss integrating two terms: one promoting pairwise orthogonality between the textual representations of the classes of a given task, and the other penalizing deviations from the initial class prototypes. Furthermore, we provide a probabilistic interpretation of our orthogonality penalty, connecting it to the general maximum likelihood estimation (MLE) principle via Huygens' theorem. We report extensive experiments on 11 benchmarks and three large VLM backbones, showing that the refined textual embeddings yield powerful replacements for the standard CLIP prototypes. Added as a plug-and-play module on top of various state-of-the-art methods, and across different prediction settings (zero-shot, few-shot, and test-time adaptation), ORION improves performance consistently and significantly.
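
The two-term objective described in the abstract can be illustrated with a toy sketch. The exact form of each term is not given in this section, so everything below is an assumption made for illustration: the orthogonality term is taken as the sum of squared cosine similarities over distinct class pairs, the anchoring term as the summed squared Euclidean distance to the initial prototypes, and `lam` as a hypothetical weight on the orthogonality term. The sketch evaluates the loss on raw vectors and does not model the text encoder or the low-rank adaptation.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonality_penalty(protos):
    """Assumed form: sum of squared cosine similarities over distinct class pairs."""
    unit = [normalize(p) for p in protos]
    total = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            total += dot(unit[i], unit[j]) ** 2
    return total


def anchor_penalty(protos, init_protos):
    """Assumed form: squared Euclidean distance to the initial prototypes, summed over classes."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, p0))
        for p, p0 in zip(protos, init_protos)
    )


def toy_loss(protos, init_protos, lam):
    """Anchoring term plus lam times the orthogonality term (illustrative weighting)."""
    return anchor_penalty(protos, init_protos) + lam * orthogonality_penalty(protos)


# Two correlated initial prototypes (cosine similarity 0.8) and an orthogonal alternative.
init_protos = [[1.0, 0.0], [0.8, 0.6]]
ortho_protos = [[1.0, 0.0], [0.0, 1.0]]

for lam in (0.1, 10.0):
    stay = toy_loss(init_protos, init_protos, lam)
    move = toy_loss(ortho_protos, init_protos, lam)
    print(f"lam={lam}: keep initial={stay:.3f}, orthogonalize={move:.3f}")
```

In this toy setting a small `lam` favors keeping the initial prototypes, while a large `lam` favors the orthogonal ones. That is the trade-off the two terms are meant to balance.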

Paper Structure (22 sections, 8 equations, 3 figures, 9 tables)

Introduction
Related Work
Methodology
Problem Setup
Orthonormal Text Fine-Tuning
Parameter-Efficient Text Encoder Adaptation
Training-free Variants
Link to Maximum Likelihood Estimation
Huygens’ theorem (scatter decomposition)
Experiments
Zero-shot Classification with ORION
Results in the Few-shot Scenario
Robustness of ORION in Test-Time Adaptation
Conclusion
Datasets and Prompt Templates
...and 7 more sections

Figures (3)

Figure 1: Motivation of ORION. Visualization on EuroSAT (Helber et al., 2019) shows that semantically related categories such as Crop Land and Pasture Land are highly entangled in CLIP’s zero-shot space, with their textual prototypes ($\times$) misaligned from visual clusters. Our orthonormal text encoder (ORION) re-centers them ($\circ$) toward the true image manifolds, improving class separation and inter-class geometry. Best viewed in color.
Figure 2: Motivational overview. Our text-only orthogonal fine-tuning (ORION) uses only class names—no images or captions—to refine the textual encoder of a frozen VLM. Across zero-shot (CLIP), test-time adaptation (StatA offline/online, MTA, TPT), and few-shot (CoOp/CLAP, 1- and 16-shot) settings, ORION consistently improves Top-1 accuracy (averaged over 11 datasets). Bars correspond to Baseline, +Avg Prompts, and ORION; numbers above the bars (in green) indicate absolute gains over the respective baselines. By optimizing only the textual prototypes, ORION yields a universal classifier that enhances diverse adaptation regimes without any visual supervision.
Figure 3: Effect of the orthogonality penalty weight $\lambda$ on average zero-shot performance across 11 datasets. We sweep $\log_{10}\lambda \in \{-1,0,1,2,3,4\}$, with "no-orth" corresponding to $\lambda=0$. Moderate regularization ($\log_{10}\lambda \approx 2$) yields the best performance, while overly large values degrade accuracy due to excessive hardening of class directions.
