Text-Guided Semantic Image Encoder

Raghuveer Thirukovalluru; Xiaochuang Han; Bhuwan Dhingra; Emily Dinan; Maha Elbayad

TL;DR

This work introduces the Text-Guided Semantic Image Encoder (TIE), a query-conditioned image encoder that feeds the text query directly into image representation learning for vision–language models. By injecting text embeddings into all layers of the image encoder, TIE produces task-specific visual features that improve downstream image-to-text performance while needing fewer image tiles. At the 1B and 3B scales, PLM-TIE improves on conventional baselines by +1.5 and +1.3 points on average, respectively, across nine image-to-text benchmarks, with larger gains on challenging datasets such as DocVQA and InfoVQA, and it does so with fewer tokens per image. Qualitative analyses show query-aligned attention, indicating better grounding and interpretability, and ablations attribute the gains to query conditioning rather than to added compute. Overall, TIE offers a framework for integrating textual guidance into image encoding, with practical implications for faster inference and improved multimodal understanding.

Abstract

Image encoders, a fundamental component of vision-language models (VLMs), are typically pretrained independently before being aligned with a language model. This standard paradigm results in encoders that process images agnostically, without regard to the specific downstream task or text query. To address this limitation, we propose the Text-Guided Semantic Image Encoder (TIE), which generates image representations conditioned on the input text query. VLMs equipped with TIE outperform their conventional counterparts by +1.5 and +1.3 points on average across nine image-to-text benchmarks at the 1B and 3B scales, respectively, with gains reaching up to 6 points on tasks such as DocVQA and InfoVQA. Moreover, TIE-based VLMs attain superior performance while utilizing only half as many image tiles (tokens), resulting in notably improved inference efficiency. TIE also generalizes well with generic queries, indicating that text-conditioned training effectively optimizes the encoder to capture key visual features. Qualitative analysis confirms that TIE consistently attends to query-relevant regions, enhancing both interpretability and query-specific grounding.
