Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models

Shojiro Yamabe, Futa Waseda, Daiki Shiono, Tsubasa Takahashi

TL;DR

Text-Printed Image (TPI) is a lightweight, architecture-agnostic method for bridging the image-text modality gap in text-centric LVLM training: it renders textual descriptions onto a white canvas and routes them through the visual encoder. Across multiple models and benchmarks, TPI consistently outperforms diffusion-based synthetic images and, in many cases, approaches the performance of training with ground-truth images, while enabling far faster, CPU-based data generation and low-cost data augmentation via LLM-generated descriptions. Analyses using JS divergence, layer-wise CKA, t-SNE, and OCR considerations indicate that TPI improves cross-modal alignment and preserves semantic fidelity. The augmentation experiments further show that adding TPI-rendered descriptions derived from a small seed set or from the full dataset yields measurable gains, indicating a scalable path toward automated, large-scale data generation for LVLMs.

Abstract

Recent large vision-language models (LVLMs) have been applied to diverse VQA tasks. However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect. In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort. While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image-text modality gap. To address this issue, we propose the Text-Printed Image (TPI), which generates synthetic images by directly rendering the given textual description on a plain white canvas. This simple rendering projects text into the image modality and can be integrated into arbitrary existing LVLM training pipelines at low cost. Moreover, TPI preserves the semantics of the text, whereas text-to-image models often fail to do so. Across four models and seven benchmarks, our systematic experiments show that TPI enables more effective text-centric training than synthetic images generated by a diffusion model. We further explore TPI as a low-cost data-augmentation strategy and demonstrate its practical utility. Overall, our findings highlight the significant potential of text-centric training and, more broadly, chart a path toward fully automated data generation for LVLMs.
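The rendering step described in the abstract (printing a description on a plain white canvas) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `render_text_printed_image`, the 336x336 canvas, the margin, the wrap width, and the use of Pillow's default font are all assumptions chosen for illustration, and text that does not fit the canvas is simply clipped.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont


def render_text_printed_image(text, size=(336, 336), margin=8, chars_per_line=50):
    """Draw `text` in black on a white canvas and return the PIL image.

    All defaults are illustrative assumptions, not values from the paper.
    """
    canvas = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()  # assumption: any legible font would do
    # Wrap on a fixed character budget so lines stay inside the canvas width.
    wrapped = "\n".join(textwrap.wrap(text, width=chars_per_line))
    draw.multiline_text((margin, margin), wrapped, fill="black", font=font)
    return canvas


# Hypothetical description standing in for a text-only training sample.
tpi = render_text_printed_image("The plant on the left has green leaves.")
```

The resulting image would then be passed to the LVLM in place of a real image, so the description reaches the model through the visual encoder rather than the text input. Rendering of this kind needs only a CPU, which is consistent with the fast data generation claimed above.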

Paper Structure

This paper contains 67 sections, 4 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Text-Printed Image (TPI) provides an efficient and broadly applicable approach for text-centric training. While raw text input suffers from an image-text modality gap and synthetic images generated by a text-to-image model lose fidelity to Q&A pairs, TPI bridges this gap by embedding textual content into the visual pathway, achieving high fidelity to the ground-truth visual supervision.
  • Figure 2: TPI training yields output distributions most similar to models trained on GT-Images. We compare the similarity to the model trained on GT-Image in output distributions. Each value represents the JS divergence on the test split.
  • Figure 3: Text-only training causes substantial drift in intermediate representations, while TPI mitigates it. We compare the similarity to the model trained on GT-Image in intermediate representations by computing the layer-wise CKA.
  • Figure 4: TPI reduces the image–text modality gap. We visualize t-SNE embeddings of intermediate features. While Text-only produces a large separation from image features, TPI aligns more closely with the visual manifold.
  • Figure 5: Qualitative comparison of synthetic images on ScienceQA.
  • ...and 5 more figures
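The captions for Figures 2 and 3 refer to two standard similarity measures: Jensen-Shannon (JS) divergence between output distributions and centered kernel alignment (CKA) between intermediate representations. The sketch below shows the textbook definitions only. How the paper aggregates over tokens, examples, or layers is not stated in this summary, so the inputs here (two probability vectors over the same support, and two feature matrices given as lists of rows with one row per example) are assumptions, as is the choice of the linear variant of CKA.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log), bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def _centered_gram(rows):
    """Gram matrix X X^T after subtracting the per-feature mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in centered]
            for ci in centered]


def _frobenius_inner(a, b):
    """Sum of elementwise products of two equally shaped matrices."""
    return sum(u * v for ra, rb in zip(a, b) for u, v in zip(ra, rb))


def linear_cka(x, y):
    """Linear CKA between two feature matrices with the same number of rows."""
    k, l = _centered_gram(x), _centered_gram(y)
    return _frobenius_inner(k, l) / math.sqrt(
        _frobenius_inner(k, k) * _frobenius_inner(l, l)
    )
```

A JS divergence near 0 means two models produce nearly the same output distribution, while a linear CKA near 1 means two sets of features agree up to rotation and isotropic scaling. In the figures, both quantities are measured against the model trained on GT-Image.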