Template-Based Text-to-Image Alignment for Language Accessibility: A Study on Visualizing Text Simplifications

Belkiss Souayed, Sarah Ebling, Yingqiang Gao

TL;DR

This work tackles the challenge of generating cognitively accessible visuals from text simplifications by introducing a template-based prompting framework with five distinct layouts and explicit accessibility constraints. In a two-phase pipeline, the authors first identify a high-performing template (Basic Object Focus) using automatic CLIP-based scoring on 400 sentence pairs, then scale to 4,000 images across ten visual styles and obtain annotations from four accessibility experts. Key findings show that visual minimalism (Basic Object Focus) and concrete styles (Retro, Realistic) are the most accessible, with Wikipedia and ASSET providing favorable simplifications. However, automatic metrics such as CLIPScore align only weakly with human judgments, underscoring the need for human-centered evaluation and careful bias auditing. The study provides practical guidelines for accessible content generation and demonstrates the value of structured prompting in AI-assisted accessibility tools, while also highlighting limitations in inter-annotator agreement and style recognition that warrant future methodological refinements.

Abstract

Individuals with intellectual disabilities often have difficulty comprehending complex texts. Many text-to-image models prioritize aesthetics over accessibility, and it is not clear how visual illustrations relate to the text simplifications (TS) they are generated from. This paper presents a structured vision-language model (VLM) prompting framework for generating accessible images from simplified texts. We designed five prompt templates (Basic Object Focus, Contextual Scene, Educational Layout, Multi-Level Detail, and Grid Layout), each following a distinct spatial arrangement while adhering to accessibility constraints such as object count limits, spatial separation, and content restrictions. Using 400 sentence-level simplifications from four established TS datasets (OneStopEnglish, SimPA, Wikipedia, and ASSET), we conducted a two-phase evaluation: Phase 1 assessed prompt template effectiveness with CLIPScores, and Phase 2 involved human annotation of generated images across ten visual styles by four accessibility experts. Results show that the Basic Object Focus prompt template achieved the highest semantic alignment, indicating that visual minimalism enhances language accessibility. Expert evaluation further identified Retro style as the most accessible and Wikipedia as the most effective data source. Inter-annotator agreement varied across dimensions, with Text Simplicity showing strong reliability and Image Quality proving more subjective. Overall, our framework offers practical guidelines for accessible content generation and underscores the importance of structured prompting in AI-generated visual accessibility tools.
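
To make the template idea concrete, the following is a minimal sketch of how a template-based prompt with accessibility constraints might be assembled, and how a best template could be picked from per-image alignment scores in Phase 1. Only the five template names come from the abstract. The layout wording, the object limits, the constraint phrasing, and the helper names `build_prompt` and `best_template` are illustrative assumptions, not the authors' actual prompts or code.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Mapping, Sequence


@dataclass(frozen=True)
class PromptTemplate:
    name: str         # template name, taken from the abstract
    layout: str       # ASSUMED layout instruction, for illustration only
    max_objects: int  # ASSUMED object count limit, for illustration only


# Names are from the paper; layouts and limits are hypothetical placeholders.
TEMPLATES = [
    PromptTemplate("Basic Object Focus", "one main object, centered, plain background", 1),
    PromptTemplate("Contextual Scene", "main object inside a simple scene", 3),
    PromptTemplate("Educational Layout", "main object with simple supporting visuals", 3),
    PromptTemplate("Multi-Level Detail", "overview plus one close-up detail", 2),
    PromptTemplate("Grid Layout", "objects arranged in separate grid cells", 4),
]


def build_prompt(template: PromptTemplate, simplified_text: str, style: str) -> str:
    """Combine a simplified sentence, a visual style, and accessibility constraints."""
    return (
        f"{style} style illustration of: {simplified_text} "
        f"Layout: {template.layout}. "
        # Hypothetical wording for the object-count, separation, and content constraints.
        f"Show at most {template.max_objects} objects, clearly separated from each other. "
        "Do not include any written text in the image."
    )


def best_template(scores: Mapping[str, Sequence[float]]) -> str:
    """Return the template name with the highest mean alignment score."""
    return max(scores, key=lambda name: mean(scores[name]))


if __name__ == "__main__":
    print(build_prompt(TEMPLATES[0], "A dog runs in the park.", "Retro"))
```

In this sketch the scores passed to `best_template` would be per-image text-image alignment values such as CLIPScores. Computing them requires a CLIP model and is deliberately left out here.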

Paper Structure

This paper contains 33 sections, 1 equation, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Example image generated based on the simplified text "I will never forget the wonderful memories he has given us, like that magical night in Moscow." (Style: Artistic, Dataset: OneStopEnglish).
  • Figure 2: Relative contribution of each evaluation dimension per expert.
  • Figure 3: Highest-rated images by each expert, with corresponding simplified sentences.
  • Figure 4: Lowest-rated images by Expert A and K.
  • Figure 5: Lowest-rated images by Expert L and M.
  • ...and 1 more figure