Text-Guided Semantic Image Encoder

Raghuveer Thirukovalluru, Xiaochuang Han, Bhuwan Dhingra, Emily Dinan, Maha Elbayad

TL;DR

This work introduces the Text-Guided Semantic Image Encoder (TIE), a query-conditioned image encoder designed to integrate text inputs directly into image representation learning for vision–language models. By injecting text embeddings into all layers of the image encoder, TIE produces semantically richer, task-specific visual features that improve downstream image-to-text tasks while reducing the need for excessive image tiles. Across 1B and 3B-scale models and 14 diverse benchmarks, PLM-TIE achieves consistent gains over conventional baselines (average +1.5 and +1.3), with larger improvements on challenging datasets like DocVQA and InfoVQA, and maintains efficiency by enabling fewer tokens per image. Qualitative analyses further confirm query-aligned attention, demonstrating better grounding and interpretability, and ablations show that the gains stem from effective query conditioning rather than mere increases in compute. Overall, TIE establishes a robust framework for integrating textual guidance into image encoding, with practical implications for faster inference and improved multimodal understanding.

Abstract

Image encoders, a fundamental component of vision-language models (VLMs), are typically pretrained independently before being aligned with a language model. This standard paradigm results in encoders that process images agnostically, without regard to the specific downstream task or text query. To address this limitation, we propose the Text-Guided Semantic Image Encoder (TIE), which generates image representations conditioned on the input text query. VLMs equipped with TIE outperform their conventional counterparts by +1.5 and +1.3 points on average across nine image-to-text benchmarks at the 1B and 3B scales, respectively, with gains reaching up to 6 points on tasks such as DocVQA and InfoVQA. Moreover, TIE-based VLMs attain superior performance while utilizing only half as many image tiles (tokens), resulting in notably improved inference efficiency. TIE also generalizes well with generic queries, indicating that text-conditioned training effectively optimizes the encoder to capture key visual features. Qualitative analysis confirms that TIE consistently attends to query-relevant regions, enhancing both interpretability and query-specific grounding.
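To make the idea of query-conditioned encoding concrete, the following is a minimal sketch, not the authors' implementation. It assumes one plausible conditioning mechanism: at every layer, image patch tokens attend over both the image tokens and the text query tokens, and only the image tokens are carried forward. The function names (`attend`, `encode`), the identity attention projections, and the toy two-dimensional embeddings are all hypothetical choices made for illustration; the source does not specify how text embeddings are injected.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, context):
    # Single-head dot-product attention with identity projections
    # (an illustrative simplification; real encoders learn Q/K/V projections).
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)])
    return out

def encode(image_tokens, text_tokens=None, num_layers=2):
    # Hypothetical encoder: in every layer, image tokens attend over
    # image tokens plus (optionally) text tokens. Text tokens stay fixed
    # and only the image tokens are updated and returned.
    hidden = [list(t) for t in image_tokens]
    for _ in range(num_layers):
        context = hidden + (text_tokens or [])
        update = attend(hidden, context)
        # Residual connection around the attention update.
        hidden = [[h + u for h, u in zip(h_tok, u_tok)]
                  for h_tok, u_tok in zip(hidden, update)]
    return hidden

patches = [[1.0, 0.0], [0.0, 1.0]]   # toy image patch embeddings (assumed)
query = [[2.0, 0.0]]                 # toy text query embedding (assumed)

agnostic = encode(patches)           # conventional, query-agnostic encoding
conditioned = encode(patches, query) # query-conditioned encoding
```

In this toy setup the number of image tokens is unchanged by conditioning, but their values shift toward the direction of the query embedding, which is the qualitative behavior the abstract describes as query-specific image representations.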

Paper Structure

This paper contains 41 sections, 4 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: (a) VLMs in prior work restrict text–image interaction to the LLM layers; (b) the proposed TIE image encoder generates image representations/tokens conditioned on the given query.
  • Figure 2: TIE encodes an image conditioned on the corresponding query, yielding semantically enriched, query-specific image representations. Conditioning is performed across all layers of TIE. PLM-TIE training mechanism is depicted on the left, and inference on the right.
  • Figure 3: PLM-1B performance difference between 36 and 1 tiles, compared with the TIE–baseline gap. On datasets that benefit from more tiles, TIE correspondingly achieves larger gains.
  • Figure 4: Ablation results. Fewer tokens per image, either with dedicated models (left) or versatile models that support a variable number of tokens (right).
  • Figure 5: Attention patterns over image patches in the regular image encoder from PLM-Cont (left) and the TIE model (right). Brown patches indicate higher attention values and white patches indicate lower values. In both examples, the TIE model focuses more strongly on the content most relevant to the given question. Examples are from the DocVQA benchmark.
  • ...and 1 more figures