Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness

Yehonatan Elisha, Oren Barkan, Noam Koenigstein

TL;DR

This work introduces a novel finetuning framework that steers model reasoning toward concept-level semantics and confirms that concept-guided masks provide more effective supervision for model robustness than conventional segmentation maps, supporting the central hypothesis.

Abstract

Vision Transformers (ViTs) often degrade under distribution shifts because they rely on spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods typically rely on simple foreground-background masks, which fail to capture the fine-grained semantic concepts that define an object (e.g., "long beak" and "wings" for a "bird"). As a result, these methods provide limited robustness to distribution shifts. To address this limitation, we introduce a novel finetuning framework that steers model reasoning toward concept-level semantics. Our approach optimizes the model's internal relevance maps to align with spatially grounded concept masks. These masks are generated automatically, without manual annotation: class-relevant concepts are first proposed using an LLM-based, label-free method and then segmented using a VLM. The finetuning objective aligns relevance with these concept regions while simultaneously suppressing focus on spurious background areas. Notably, this process requires only a minimal set of images and uses only half of the dataset classes. Extensive experiments on five out-of-distribution benchmarks demonstrate that our method improves robustness across multiple ViT-based models. Furthermore, we show that the resulting relevance maps exhibit stronger alignment with semantic object parts, offering a scalable path toward more robust and interpretable vision models. Finally, we confirm that concept-guided masks provide more effective supervision for model robustness than conventional segmentation maps, supporting our central hypothesis.
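
To make the idea of aligning relevance with concept regions while suppressing background focus more concrete, here is a minimal sketch of what such an objective could look like. It is not the paper's loss, since the paper's equations are not reproduced on this page. The function name `concept_alignment_loss`, the weight `lambda_bg`, and the cross-entropy form of both terms are assumptions made for illustration. The sketch operates on a flattened grid of per-patch relevance scores in [0, 1] and a binary concept mask of the same length.

```python
import math

def concept_alignment_loss(relevance, concept_mask, lambda_bg=1.0, eps=1e-8):
    """Toy relevance-alignment loss (illustrative assumption, not the paper's objective).

    relevance:    per-patch relevance scores in [0, 1], flattened to a list
    concept_mask: per-patch 0/1 labels, 1 = patch lies inside a concept region
    lambda_bg:    hypothetical weight on the background-suppression term
    """
    if len(relevance) != len(concept_mask):
        raise ValueError("relevance and concept_mask must have the same length")

    # Split relevance scores by whether the patch falls inside a concept region.
    fg = [r for r, m in zip(relevance, concept_mask) if m == 1]
    bg = [r for r, m in zip(relevance, concept_mask) if m == 0]

    # Concept term: low when relevance inside concept regions is close to 1.
    fg_loss = -sum(math.log(r + eps) for r in fg) / max(len(fg), 1)

    # Background term: low when relevance outside concept regions is close to 0.
    bg_loss = -sum(math.log(1.0 - r + eps) for r in bg) / max(len(bg), 1)

    return fg_loss + lambda_bg * bg_loss


# Relevance concentrated on the concept patches scores lower (better)
# than relevance concentrated on the background patches.
mask = [1, 1, 0, 0]
print(concept_alignment_loss([0.9, 0.9, 0.1, 0.1], mask))
print(concept_alignment_loss([0.1, 0.1, 0.9, 0.9], mask))
```

In an actual finetuning setup, a term of this kind would be computed on differentiable relevance maps inside an autodiff framework, presumably alongside a term that preserves classification accuracy. The plain-Python version above only illustrates the direction in which each term pushes the relevance.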

Paper Structure (34 sections, 7 equations, 2 figures, 9 tables)

Figures (2)

  • Figure 1: Motivation for Concept-Guided Fine-Tuning (CFT): Relevance maps produced by ViTs often concentrate on spurious background cues rather than semantically meaningful concepts. The figure illustrates this issue using ViT-B on ImageNet-A and ImageNet-R, showing relevance maps before and after applying CFT. By encouraging the model to focus on class-relevant, discriminative concepts, CFT substantially improves the semantic alignment of relevance maps. Notably, after CFT, the model highlights meaningful object parts, such as the beak and wings of the bird (top row) or the fins and mouth of the fish (bottom row), despite never being fine-tuned on these datasets.
  • Figure 2: Qualitative examples of CFT correcting prediction failures on OOD datasets using the ViT-B model: the baseline model (Original) misclassifies the images, with relevance maps often highlighting misleading context. Our CFT-finetuned model successfully corrects the prediction (e.g., "scorpion" $\rightarrow$ "common newt") by focusing its relevance on the object's core semantic concepts, demonstrating improved reasoning.