FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations

Cheng-Yu Hsieh, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, Hadi Pouransari

TL;DR

FocalLens introduces a zero-shot conditional visual encoding framework that shapes image embeddings according to natural-language instructions by contrastive alignment with instruction outputs. It leverages visual instruction tuning data to train two instantiations, FocalLens-MLLM and FocalLens-CLIP, achieving improved focus on task-relevant features over fixed CLIP representations. Across more than 60 tasks, including image-image and image-text retrieval as well as classification, FocalLens provides consistent gains, with pronounced improvements on SugarCrepe and MMVP-VLM and notable benefits in low-data regimes. The work demonstrates the practicality and impact of adaptable, instruction-conditioned vision encoders for a wide range of downstream applications.

Abstract

Visual understanding is inherently contextual: what we focus on in an image depends on the task at hand. For instance, given an image of a person holding a bouquet of flowers, we may focus on either the person (for example, their clothing) or the type of flowers, depending on the context of interest. Yet most existing image encoding paradigms represent an image as a fixed, generic feature vector, overlooking the need to prioritize different visual information for different downstream use cases. In this work, we introduce FocalLens, a conditional visual encoding method that produces different representations for the same image based on the context of interest, expressed flexibly through natural language. We leverage vision instruction tuning data and contrastively finetune a pretrained vision encoder to take natural language instructions as additional inputs for producing conditional image representations. Extensive experiments validate that conditional image representations from FocalLens emphasize the visual features of interest better than the generic features produced by standard vision encoders like CLIP. In addition, we show that FocalLens improves performance on a range of downstream tasks, including image-image retrieval, image classification, and image-text retrieval, with average gains of 5 and 10 points on the challenging SugarCrepe and MMVP-VLM benchmarks, respectively.
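
To make the contrastive finetuning step more concrete, here is a minimal NumPy sketch of a symmetric InfoNCE-style loss that pulls each instruction-conditioned image embedding toward the text embedding of its instruction output. The abstract does not give the loss, so the symmetric form, the temperature value, and the names `contrastive_loss`, `cond_image_emb`, and `output_text_emb` are all assumptions made for illustration and may differ from the objective used in the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(cond_image_emb, output_text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched pairs (assumed form).

    cond_image_emb:  (B, D) embeddings of (image, instruction) inputs.
    output_text_emb: (B, D) embeddings of the matching instruction outputs.
    Row i of each array is a positive pair; other rows act as negatives.
    """
    z = l2_normalize(cond_image_emb)
    t = l2_normalize(output_text_emb)
    logits = z @ t.T / temperature            # (B, B) scaled similarities
    idx = np.arange(len(z))                   # positives sit on the diagonal
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)
```

In this sketch, a batch whose conditional image embeddings already match their output embeddings yields a lower loss than the same batch with the pairing shuffled, which is the signal that would drive the encoder toward instruction-relevant features.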

Paper Structure

This paper contains 34 sections, 1 equation, 6 figures, 10 tables.

Figures (6)

  • Figure 1: For a given image, the CLIP embedding space is static and structured based on overall semantics. However, FocalLens dynamically rearranges the embedding space based on the specified condition, bringing instances that are more similar under that condition closer together. We show the top-2 nearest neighbors for both CLIP and FocalLens embeddings (once conditioned on "background" and once on "quantity").
  • Figure 2: FocalLens is applied to two vision-language models to extract text-conditioned visual features: (a) modifying LLaVA-like VLMs, which already have text-conditioning capabilities, to produce a global visual feature, and (b) modifying ViT-based (Dosovitskiy et al., 2020) CLIP-like VLMs, which already produce a global visual feature, to condition their output feature on a text condition.
  • Figure 3: ColorShape examples with a query image, three conditions, and corresponding positives and distractors.
  • Figure 4: Image-image retrieval results on the ColorShape dataset. Conditional representations from FocalLens capture the given conditions better than the task-agnostic representations of CLIP.
  • Figure 5: Linear probing results comparing CLIP and FocalLens-CLIP.
  • ...and 1 more figure
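
The conditional retrieval described in Figures 1 and 4 can be sketched as a cosine-similarity nearest-neighbor search over conditional embeddings. The snippet below is an illustrative sketch only: `top_k_neighbors` is a hypothetical helper, and the 2-D embeddings are hand-made toy values standing in for encoder outputs under the "background" and "quantity" conditions, not results from the paper.

```python
import numpy as np


def top_k_neighbors(query_emb, gallery_emb, k=2):
    """Return indices of the k gallery rows most cosine-similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = g @ q                              # cosine similarity per gallery item
    return np.argsort(-sims)[:k].tolist()     # highest similarity first


# Toy stand-ins for conditional embeddings of the SAME four gallery images,
# encoded once per condition (values are made up for illustration).
query = np.array([1.0, 0.0])
gallery_background = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.0, 1.0]])
gallery_quantity = np.array([[0.1, 0.9], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]])

print(top_k_neighbors(query, gallery_background))  # [0, 2]
print(top_k_neighbors(query, gallery_quantity))    # [1, 3]
```

Because the condition changes the embeddings, the same query retrieves different neighbors under each condition, whereas a fixed encoder would return one ranking regardless of the condition.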