AltGen: AI-Driven Alt Text Generation for Enhancing EPUB Accessibility

Yixian Shen, Hang Zhang, Yanxin Shen, Lun Wang, Chuanqi Shi, Shaoshuai Du, Yiyi Tao

TL;DR

AltGen tackles the accessibility gap in EPUBs by automating high-quality alt-text generation through an integrated AI pipeline that fuses computer vision features with surrounding textual context and transformer-based generation. It leverages CLIP and ViT for visual understanding, contextual grounding from EPUB content, and fine-tuned text generators to produce descriptive, contextually accurate alt text, followed by metadata enrichment, file reconstruction, and rigorous validation. The evaluation reports strong quantitative metrics (Cosine Similarity ≈ 0.93, BLEU ≈ 0.76) and a 97.5% reduction in accessibility errors, complemented by positive qualitative feedback from visually impaired users. This work delivers a scalable, WCAG-aligned solution for automated EPUB accessibility, with practical implications for large-scale digital publishing and inclusive content delivery.

Abstract

Digital accessibility is a cornerstone of inclusive content delivery, yet many EPUB files fail to meet fundamental accessibility standards, particularly in providing descriptive alt text for images. Alt text plays a critical role in enabling visually impaired users to understand visual content through assistive technologies. However, generating high-quality alt text at scale is a resource-intensive process, creating significant challenges for organizations aiming to ensure accessibility compliance. This paper introduces AltGen, a novel AI-driven pipeline designed to automate the generation of alt text for images in EPUB files. By integrating state-of-the-art generative models, including advanced transformer-based architectures, AltGen achieves contextually relevant and linguistically coherent alt text descriptions. The pipeline encompasses multiple stages, starting with data preprocessing to extract and prepare relevant content, followed by visual analysis using computer vision models such as CLIP and ViT. The extracted visual features are enriched with contextual information from surrounding text, enabling the fine-tuned language models to generate descriptive and accurate alt text. Validation of the generated output employs both quantitative metrics, such as cosine similarity and BLEU scores, and qualitative feedback from visually impaired users. Experimental results demonstrate the efficacy of AltGen across diverse datasets, achieving a 97.5% reduction in accessibility errors and high scores in similarity and linguistic fidelity metrics. User studies highlight the practical impact of AltGen, with participants reporting significant improvements in document usability and comprehension. Furthermore, comparative analyses reveal that AltGen outperforms existing approaches in terms of accuracy, relevance, and scalability.
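As an illustration of the quantitative validation mentioned above, the following is a minimal sketch of a cosine similarity score between a generated alt text and a reference description. It assumes a simple bag-of-words representation built from whitespace tokens; the abstract does not state which text representation AltGen uses, and an embedding-based variant would be equally plausible. The function name `cosine_similarity` and the sample strings are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between two texts using token-count vectors.

    Assumption: lowercase whitespace tokenization stands in for whatever
    representation the actual pipeline uses.
    """
    vec_a = Counter(generated.lower().split())
    vec_b = Counter(reference.lower().split())

    # Dot product over the tokens the two texts share.
    dot = sum(vec_a[tok] * vec_b[tok] for tok in vec_a.keys() & vec_b.keys())

    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))

    # An empty text has no direction; treat its similarity as zero.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    score = cosine_similarity(
        "a bar chart of monthly sales",
        "bar chart showing monthly sales figures",
    )
    print(f"cosine similarity: {score:.2f}")
```

A score near 1 indicates that the generated description shares most of its vocabulary with the reference, while a score of 0 indicates no overlap. Token counts ignore word order and synonyms, which is one reason a metric of this kind is typically paired with BLEU and with user feedback.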

Paper Structure

This paper contains 22 sections, 2 equations, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: A visual representation of the proposed methodology pipeline for AI-driven alt text generation in EPUB files. The pipeline includes five stages: (1) Data Preprocessing for file parsing and image identification; (2) Generative AI Model Integration for feature extraction, contextual analysis, and alt text generation; (3) Metadata Enrichment with language detection and metadata updates; (4) File Reconstruction for reassembly and integrity checks; and (5) Postprocessing and Validation through error verification and user feedback.
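
The first stage in the figure, file parsing and image identification, can be sketched as follows. This is not the authors' implementation: it is a minimal example that relies only on the fact that an EPUB is a ZIP archive of XHTML content documents, and it uses Python's standard `zipfile` and `html.parser` modules. The names `ImgCollector` and `images_missing_alt` are hypothetical, and treating an empty `alt` attribute as missing is an assumption (an empty `alt` is legitimate for purely decorative images).

```python
import zipfile
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src of every <img> whose alt attribute is absent or blank."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are routed here as well.
        if tag == "img":
            attributes = dict(attrs)
            # Assumption: a blank alt counts as missing for this sketch.
            if not (attributes.get("alt") or "").strip():
                self.missing.append(attributes.get("src"))


def images_missing_alt(epub_file):
    """Map each content document in the EPUB to its images lacking alt text.

    `epub_file` may be a path or a binary file-like object.
    """
    result = {}
    with zipfile.ZipFile(epub_file) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = ImgCollector()
                parser.feed(archive.read(name).decode("utf-8", errors="replace"))
                if parser.missing:
                    result[name] = parser.missing
    return result
```

The returned mapping gives later stages the list of images that need a generated description, keyed by the document in which each image appears, so the surrounding text can be retrieved for contextual analysis.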