Alt-Text with Context: Improving Accessibility for Images on Twitter

Nikita Srivatsan, Sofia Samaniego, Omar Florez, Taylor Berg-Kirkpatrick

TL;DR

The paper tackles alt-text generation for social media by conditioning a language model on both the image and the surrounding tweet text. It introduces a mapping from CLIP image embeddings into word-embedding space; the mapped image features and the embedded tweet text together form a multimodal prefix that is fed into GPT-2, enabling context-aware alt-text generation. CLIP is kept frozen to leverage its pretrained vision-language knowledge. A large Twitter dataset of 371k image–alt-text pairs with associated tweets is released, along with extensive experiments showing >2x gains on BLEU@4 and >4x gains on CIDEr compared with baselines such as ClipCap and BLIP-2. The work demonstrates the practical value of leveraging contextual social media information to improve accessibility, while also addressing ethical considerations and limitations related to data quality and potential misuse.

Abstract

In this work we present an approach for generating alternative text (or alt-text) descriptions for images shared on social media, specifically Twitter. More than just a special case of image captioning, alt-text is both more literally descriptive and context-specific. Critically, images posted to Twitter are often accompanied by user-written text that, despite not necessarily describing the image, may provide useful context that can be informative if properly leveraged. We address this task with a multimodal model that conditions on both textual information from the associated social media post and visual signal from the image, and demonstrate that the utility of these two information sources stacks. We put forward a new dataset of 371k images paired with alt-text and tweets scraped from Twitter and evaluate on it across a variety of automated metrics as well as human evaluation. We show that our approach of conditioning on both tweet text and visual information significantly outperforms prior work, by more than 2x on BLEU@4.

Paper Structure (22 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Left: An image that requires textual context to write accurate alt-text for. Without conditioning on the tweet text, the election flyers are indistinguishable from books to a traditional captioning system. Right: Two similar images from our dataset and Conceptual Captions with their gold labels. The alt-text for the first image is literally descriptive while the second is more colloquial.
  • Figure 2: Overview of the alt-text model. An image is encoded via CLIP to obtain an embedding of visual features. This gets projected via a mapping network into word embedding space, where it is then concatenated with an embedded representation of the text from the corresponding tweet. This prefix is passed to a finetuned GPT-2 which autoregressively generates the alt-text caption.
  • Figure 3: Selected tweets with the user-written alt-text alongside our prediction and ClipCap's. We see that by conditioning on the tweet text, our model is able to focus on relevant details in the images, reference named places, and provide better transcription despite not being trained on OCR.
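
The prefix construction described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the single linear map standing in for the mapping network, the toy dimensions, the random weights, the order of concatenation (image vectors first, tweet vectors second), and every name below (`map_image_embedding`, `embed_tweet`, `build_prefix`) are assumptions made for illustration. In the real model the image vector would come from frozen CLIP and the resulting prefix would be passed to a finetuned GPT-2; here only the shape bookkeeping is shown, using the Python standard library.

```python
import random

# Toy sizes, all assumed for illustration (real CLIP/GPT-2 widths are much larger).
CLIP_DIM = 8      # width of the image embedding produced by the frozen encoder
EMBED_DIM = 4     # width of the language model's word embeddings
PREFIX_LEN = 3    # number of "visual tokens" the mapping network emits
VOCAB_SIZE = 10   # toy vocabulary size

rng = random.Random(0)

def rand_matrix(rows, cols):
    """Random stand-in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Hypothetical parameters: one linear map and a word-embedding table.
mapping_weights = rand_matrix(PREFIX_LEN * EMBED_DIM, CLIP_DIM)
word_embeddings = rand_matrix(VOCAB_SIZE, EMBED_DIM)

def map_image_embedding(clip_embedding):
    """Project one image vector to PREFIX_LEN vectors in word-embedding space."""
    flat = [sum(w * x for w, x in zip(row, clip_embedding))
            for row in mapping_weights]
    return [flat[i * EMBED_DIM:(i + 1) * EMBED_DIM] for i in range(PREFIX_LEN)]

def embed_tweet(token_ids):
    """Look up the word embedding of each tweet token."""
    return [word_embeddings[t] for t in token_ids]

def build_prefix(clip_embedding, tweet_token_ids):
    """Concatenate mapped image vectors with embedded tweet tokens."""
    return map_image_embedding(clip_embedding) + embed_tweet(tweet_token_ids)

clip_embedding = [rng.uniform(-1.0, 1.0) for _ in range(CLIP_DIM)]
tweet_token_ids = [2, 7, 5, 1]  # made-up token ids for a four-token tweet
prefix = build_prefix(clip_embedding, tweet_token_ids)
print(len(prefix), len(prefix[0]))  # prints "7 4"
```

The resulting prefix has one row per visual token plus one row per tweet token, all of the same width, which is what allows a language model to consume both sources as a single input sequence before generating the alt-text autoregressively.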