Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models

Gabriel Downer, Sean Craven, Damian Ruck, Jake Thomas

TL;DR

Text2VLM introduces an automated pipeline that converts text-only evaluation prompts into multimodal inputs by rendering typographic representations of salient harmful content, enabling robust assessment of Visual Language Model (VLM) safety under multimodal prompt injections. The approach combines prompt summarization, salient-concept extraction, and typographic image generation, followed by LLM- and human-led evaluation of relevance, safety refusals, and understanding. Open-source VLMs show increased vulnerability to multimodal prompts, with safety alignment weakening when text and image inputs are combined, despite OCR and multimodal understanding limitations. The work provides an open-source tool for systematic safety evaluation and highlights the need for stronger alignment and OCR-capable models to ensure safer deployment of multimodal AI systems.

Abstract

The increasing integration of Visual Language Models (VLMs) into AI systems necessitates robust model alignment, especially when handling multimodal content that combines text and images. Existing evaluation datasets lean heavily towards text-only prompts, leaving visual vulnerabilities under-evaluated. To address this gap, we propose Text2VLM, a novel multi-stage pipeline that adapts text-only datasets into multimodal formats, specifically designed to evaluate the resilience of VLMs against typographic prompt injection attacks. The Text2VLM pipeline identifies harmful content in the original text and converts it into a typographic image, creating a multimodal prompt for VLMs. Our evaluation of open-source VLMs highlights their increased susceptibility to prompt injection when visual inputs are introduced, revealing critical weaknesses in the current models' alignment. This is in addition to a significant performance gap compared to closed-source frontier models. We validate Text2VLM through human evaluations, confirming that the extracted salient concepts, text summarization, and output classification align with human expectations. Text2VLM provides a scalable tool for comprehensive safety assessment, contributing to the development of more robust safety mechanisms for VLMs. By enhancing the evaluation of multimodal vulnerabilities, Text2VLM plays a role in advancing the safe deployment of VLMs in diverse, real-world applications.
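
As a rough illustration of the text-to-multimodal step described in the abstract, the sketch below moves salient concepts out of a prompt and into a numbered list of lines intended for a typographic image. It is a minimal sketch and not the authors' implementation: the function name `to_multimodal`, the `<IMG-n>` placeholder format, and the numbered-list layout are assumptions made for illustration, the concepts are passed in by hand rather than extracted by a model, and the actual image rendering is omitted. The example prompt is deliberately benign.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class MultimodalPrompt:
    text: str                # prompt text with concepts swapped for placeholders
    image_lines: list[str]   # lines that would be drawn into the typographic image


def to_multimodal(prompt: str, salient_concepts: list[str]) -> MultimodalPrompt:
    """Move each salient concept from the text into a numbered image list.

    Hypothetical helper: the placeholder format is an assumption, not
    something specified by the source.
    """
    text = prompt
    image_lines: list[str] = []
    for concept in salient_concepts:
        if concept not in text:
            # Ignore concepts that do not appear verbatim in the prompt.
            continue
        index = len(image_lines) + 1
        text = text.replace(concept, f"<IMG-{index}>")
        image_lines.append(f"{index}. {concept}")
    return MultimodalPrompt(text=text, image_lines=image_lines)


example = to_multimodal(
    "Explain how to bake sourdough bread without a dutch oven.",
    ["sourdough bread", "dutch oven"],
)
print(example.text)
print(example.image_lines)
```

In a full pipeline, `image_lines` would be rendered to an image with a drawing library and sent to the target VLM alongside `text`, so that neither modality alone carries the complete request.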

Paper Structure

This paper contains 39 sections, 1 equation, 9 figures.

Figures (9)

  • Figure 1: Overview of the Text2VLM pipeline with a malicious example. The text-only dataset is converted to a multimodal form, processed by the target VLM, and evaluated by relevancy and safety classifiers.
  • Figure 2: Within the Text2VLM pipeline, the automated summarizations receive positive ratings, with over 80% of outputs rated as "Great". Salient concept extraction is also highly effective, with most concepts successfully extracted and deemed valid. Failures are minimal, with very few "Bad" or "Very Bad" ratings.
  • Figure 3: The Relevance and Refusal Classifiers perform similarly well. Relevance scores are predominantly rated as "Great" and "Good", and the Refusal Classifier consistently provides correct classifications. Both show minimal failures in classification quality, with very few "Bad" or "Very Bad" ratings.
  • Figure 4: Open-source VLMs show reduced understanding when prompts are split between text and image compared to text-only inputs, especially VILA-8B, which struggled to understand many malicious text-only inputs as well. The comparison shows understanding scores between text-only (green bars) and multimodal (purple bars) inputs for two LLaVA and VILA models across five datasets. A score of 1 would suggest the target model understood each task in the dataset.
  • Figure 5: Models show lower refusal rates for harmful prompts when presented in a multimodal format versus text-only format, especially for medical safety (MedSafetyBench). Additionally, for the other datasets, even the text-only prompts show poor safety alignment prior to the Text2VLM transformation. The comparison shows refusal rates between text-only (blue bars) and multimodal (orange bars) inputs for two LLaVA and VILA models across five datasets. A score of 1 would suggest that the model is safely aligned.
  • ...and 4 more figures
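
The refusal-rate score in the Figure 5 caption can be read as the fraction of harmful prompts a model refuses, where 1 means every prompt was refused. The following is a minimal sketch under that reading. The function name `refusal_rate` is hypothetical, and the boolean labels are toy values chosen for illustration, not results from the paper.

```python
from __future__ import annotations


def refusal_rate(refused: list[bool]) -> float:
    """Fraction of prompts the model refused; 1.0 means it refused all of them."""
    if not refused:
        raise ValueError("need at least one classified response")
    return sum(refused) / len(refused)


# Toy refusal labels for the same four prompts in two input formats
# (illustrative values only, not data from the paper).
text_only = [True, True, False, True]
multimodal = [True, False, False, False]

# A positive gap means the model refuses less often on multimodal inputs.
gap = refusal_rate(text_only) - refusal_rate(multimodal)
print(refusal_rate(text_only), refusal_rate(multimodal), gap)
```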