
ARMADA: Attribute-Based Multimodal Data Augmentation

Xiaomeng Jin, Jeonghwan Kim, Yu Zhou, Kuan-Hao Huang, Te-Lin Wu, Nanyun Peng, Heng Ji

TL;DR

ARMADA tackles the high cost and semantic gaps of multimodal data augmentation by introducing a knowledge-guided, attribute-based pipeline. It extracts text entities and visual attributes, substitutes attribute values via a Wikidata/Wikipedia–based KB (or LLMs for auxiliary attributes), and edits the corresponding images with InstructPix2Pix to produce semantically grounded, diverse image–text pairs. The framework includes a data-selection step using Fréchet Inception Distance to maintain distributional fidelity. Across image classification, VQA, image–text retrieval, and image captioning, ARMADA yields consistent performance gains over strong baselines, validating the value of combining symbolic KBs with LLMs for grounded multimodal augmentation.
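The data-selection step relies on Fréchet Inception Distance, which the summary does not define. For reference, the standard definition fits a Gaussian to Inception features of each image set and compares the two fits. The symbols below are notation chosen here: $\mu_r, \Sigma_r$ are the feature mean and covariance of the original images, and $\mu_g, \Sigma_g$ those of the augmented images. How ARMADA ranks or thresholds on this score is not stated in this summary.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

A lower value means the augmented images stay closer to the distribution of the originals; identical feature statistics give a value of zero.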

Abstract

In Multimodal Language Models (MLMs), the cost of manually annotating high-quality image-text pair data for fine-tuning and alignment is extremely high. While existing multimodal data augmentation frameworks propose ways to augment image-text pairs, they either suffer from semantic inconsistency between texts and images, or generate unrealistic images, creating a knowledge gap with real-world examples. To address these issues, we propose Attribute-based Multimodal Data Augmentation (ARMADA), a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes of the mentioned entities. Specifically, we extract entities and their visual attributes from the original text data, then search for alternative values for the visual attributes under the guidance of knowledge bases (KBs) and large language models (LLMs). We then utilize an image-editing model to edit the images with the extracted attributes. ARMADA is a novel multimodal data generation framework that: (i) extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation, (ii) generates visually similar images of disparate categories using neighboring entities in the KB hierarchy, and (iii) uses the commonsense knowledge of LLMs to modulate auxiliary visual attributes such as backgrounds for more robust representation of original entities. Our empirical results over four downstream tasks demonstrate the efficacy of our framework in producing high-quality data and enhancing model performance. This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
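The attribute-substitution step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy knowledge base `TOY_KB`, the function `substitute_attribute`, and the wording of the edit instruction are hypothetical and are not taken from the paper's implementation. The sketch finds the current attribute value in a caption, swaps in each alternative value, and emits a text instruction of the kind an instruction-following image editor could consume.

```python
import re

# Hypothetical toy KB (illustration only): entity -> visual attribute -> values.
TOY_KB = {
    "ladybug": {"color": ["red", "orange", "yellow"]},
}


def substitute_attribute(caption, entity, attribute, kb=TOY_KB):
    """Return (new_caption, edit_instruction) pairs, one per alternative value.

    Returns an empty list when the entity, the attribute, or the current
    attribute value cannot be found.
    """
    values = kb.get(entity, {}).get(attribute, [])
    # Find which known value the caption currently mentions (whole-word match).
    current = next(
        (v for v in values if re.search(rf"\b{re.escape(v)}\b", caption)),
        None,
    )
    if current is None:
        return []

    pairs = []
    for new in values:
        if new == current:
            continue
        # Rewrite only the first mention so the rest of the caption is untouched.
        new_caption = re.sub(rf"\b{re.escape(current)}\b", new, caption, count=1)
        instruction = (
            f"change the {attribute} of the {entity} from {current} to {new}"
        )
        pairs.append((new_caption, instruction))
    return pairs


if __name__ == "__main__":
    for cap, instr in substitute_attribute(
        "The red ladybug sits on a leaf", "ladybug", "color"
    ):
        print(cap, "|", instr)
```

Because the caption and the edit instruction are derived from the same substituted value, the rewritten text and the edited image stay consistent by construction, which is the property the abstract emphasizes.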

Paper Structure (24 sections, 4 figures, 3 tables)


Figures (4)

  • Figure 1: Generated examples from two previous data augmentation methods and from our approach. (a) is generated by TrivialAugment [muller2021trivialaugment]: random solarizing or cropping removes the dog and the fence from the original image, which makes the image semantically inconsistent with its text. (b) shows the output image from MixGen [hao2023mixgen], illustrating the unrealistic result of simple image interpolation and text concatenation. (c) shows the augmented data from our method, ARMADA, which remain semantically consistent.
  • Figure 2: The overall framework of our data augmentation method. Given an image-text pair as input, we first extract entities and their corresponding visual attributes from the text. If the object can be linked to an entity in our pre-defined attribute knowledge base, we collect all possible attribute values from the information of the linked entity. If the object cannot be linked to the knowledge base, we use Large Language Models (LLMs) to extract other possible values. After selecting which visual attribute to modify, we rewrite the original text and use an image editing model to generate new images based on the new text. Finally, we rank the augmented data by similarity score and output the selected samples.
  • Figure 3: An example from our pre-defined attribute library. Each node represents an entity collected from Wikidata. An outgoing edge connects a node to its parent category. Each node carries visual attributes extracted from Wikipedia articles.
  • Figure 4: A case analysis showing sample outputs on the Flickr30k dataset for the image captioning task. We show two images from the test set, their human-annotated captions, and the captions generated by each method. For the image on the left, our method recognizes the fine-grained concept karate. The image on the right demonstrates that the model provides an accurate description of the hat, specifying its knit texture and beer logo pattern.
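The structures described in the captions of Figures 2 and 3 can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's data model: the `Entity` class, the functions `siblings` and `candidate_values`, the sample entries, and the `stub_llm` placeholder are all hypothetical. Each node stores an edge to its parent category plus its visual attributes; entities sharing a parent act as neighboring entities, and objects missing from the knowledge base fall back to an LLM-style callback.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Entity:
    """One KB node: a parent-category edge and its visual attributes."""
    name: str
    parent: Optional[str] = None
    attributes: Dict[str, List[str]] = field(default_factory=dict)


def siblings(kb: Dict[str, Entity], name: str) -> List[str]:
    """Entities that share the same parent category (neighbors in the hierarchy)."""
    parent = kb[name].parent
    if parent is None:
        return []
    return sorted(n for n, e in kb.items() if e.parent == parent and n != name)


def candidate_values(
    kb: Dict[str, Entity],
    name: str,
    attribute: str,
    llm_fallback: Callable[[str, str], List[str]],
) -> List[str]:
    """Use KB values when the object links to an entity, else ask the fallback."""
    if name in kb:
        return kb[name].attributes.get(attribute, [])
    return llm_fallback(name, attribute)


# Hypothetical sample entries, for illustration only.
KB = {
    "beetle": Entity("beetle"),
    "ladybug": Entity("ladybug", "beetle", {"color": ["red", "orange", "yellow"]}),
    "firefly": Entity("firefly", "beetle", {"color": ["brown", "black"]}),
}


def stub_llm(name: str, attribute: str) -> List[str]:
    """Placeholder for an LLM call; returns canned values so the sketch runs."""
    return ["wooden", "metal"]


if __name__ == "__main__":
    print(siblings(KB, "ladybug"))
    print(candidate_values(KB, "ladybug", "color", stub_llm))
    print(candidate_values(KB, "fence", "material", stub_llm))
```

Keeping the parent edge on each node makes the neighbor lookup a simple filter; a real knowledge base of this kind would likely index children per category instead of scanning every entry.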