Contrastive Visual Data Augmentation

Yu Zhou, Bingxuan Li, Mohan Tang, Xiaomeng Jin, Te-Lin Wu, Kuan-Hao Huang, Heng Ji, Kai-Wei Chang, Nanyun Peng

TL;DR

CoDA tackles the challenge of recognizing novel and confusing visual concepts in large multimodal models by extracting contrastive textual and visual features that separate a target concept from the concepts it is confused with, then generating targeted synthetic data with diffusion-based text-to-image (T2I) models. A two-stage feature filter, based on discriminability and generability, selects informative attributes, which are validated automatically and, if needed, by human evaluators, before use in model updating and optional in-context prompting. The authors introduce NovelSpecies, a benchmark of newly described animal species unseen by existing models, and demonstrate substantial accuracy gains on iNaturalist, SUN, and NovelSpecies across diverse backbones, including LLaVA-NeXT and GPT-4o-mini. CoDA’s modular, plug-and-play design allows components to be swapped (e.g., for better T2I models) to achieve further gains, which makes it practical for improving novel-concept recognition in real-world deployments.

Abstract

Large multimodal models (LMMs) often struggle to recognize novel concepts, as they rely on pre-trained knowledge and have limited ability to capture subtle visual details. Domain-specific knowledge gaps in training also make them prone to confusing visually similar, commonly misrepresented, or low-resource concepts. To help LMMs better align nuanced visual features with language, improving their ability to recognize and reason about novel or rare concepts, we propose a Contrastive visual Data Augmentation (CoDA) strategy. CoDA extracts key contrastive textual and visual features of target concepts against the known concepts they are misrecognized as, and then uses multimodal generative models to produce targeted synthetic data. Automatic filtering of extracted features and augmented images is implemented to guarantee their quality, as verified by human annotators. We show the effectiveness and efficiency of CoDA on low-resource concept and diverse scene recognition datasets including iNaturalist and SUN. We additionally collect NovelSpecies, a benchmark dataset consisting of newly discovered animal species that are guaranteed to be unseen by LMMs. LLaVA-1.6 1-shot updating results on these three datasets show that CoDA outperforms SOTA visual data augmentation strategies by absolute accuracy gains of 12.3% (NovelSpecies), 5.1% (SUN), and 6.0% (iNat).

Paper Structure

This paper contains 27 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: CoDA uses diffusion-generated synthetic data to help LMMs recognize novel and confusing concepts in the wild. The "Clouded Tiger Cat (L. pardinoides)" is a new animal species first described in April 2024, while "Resupply Base" is an example of a confusing concept for LMMs. Based on model failures (collected from GPT4o-2024-08-06 and LLaVA-NeXT 34B), CoDA extracts contrastive visual and textual features to generate synthetic image data for model updating.
  • Figure 2: The CoDA method, which consists of Feature Extraction, Feature Filtering, Feature-controlled Augmentation, and Augmented Image Filtering. The target concept and the misidentified concept are each highlighted. Specific feature filtering scores are for illustration only. The example concepts Anodorhynchus leari (Lear's Macaw) and Cyanopsitta spixii (Spix's Macaw) are from the iNaturalist dataset (Van Horn et al., 2018), and the augmented images are produced by the Recraft V3 model (Recraft, 2024).
  • Figure 3: Qualitative comparison of CoDA and baseline visual data augmentation methods. Phyllobates samperi and Tail-Spot Wrasse are example concepts from the NovelSpecies dataset. All CoDA images are generated using contrastive textual + visual features.
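
The Feature Filtering stage named in the TL;DR and in Figure 2 can be illustrated with a minimal sketch. This is not the authors' implementation: the `Feature` class, the threshold values, the ranking by score product, the prompt template, and the example scores are all assumptions made for illustration. The sketch takes the discriminability and generability scores as given and does not show how the paper computes them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """A candidate contrastive feature with two precomputed scores (hypothetical)."""
    text: str
    discriminability: float  # how well it separates target from confused concept
    generability: float      # how reliably a T2I model can render it


def filter_features(features, min_discriminability=0.5, min_generability=0.5):
    """Keep features that pass both thresholds; thresholds are assumed values."""
    kept = [
        f for f in features
        if f.discriminability >= min_discriminability
        and f.generability >= min_generability
    ]
    # Rank survivors by the product of both scores (an illustrative choice).
    return sorted(kept, key=lambda f: f.discriminability * f.generability, reverse=True)


def build_prompt(concept, features, top_k=3):
    """Compose a text-to-image prompt from the top-ranked features (assumed template)."""
    chosen = ", ".join(f.text for f in features[:top_k])
    return f"A photo of {concept}, with {chosen}" if chosen else f"A photo of {concept}"


# Made-up candidates and scores, for illustration only.
candidates = [
    Feature("overall blue plumage", 0.30, 0.90),               # easy to draw, not distinctive
    Feature("distinctive marking near the beak", 0.85, 0.80),  # passes both stages
    Feature("fine feather texture", 0.90, 0.20),               # distinctive, hard to draw
]

selected = filter_features(candidates)
prompt = build_prompt("Lear's Macaw", selected)
print(prompt)
```

In this toy run only the second candidate survives: the first is not distinctive enough to separate the two concepts, and the third is distinctive but unlikely to be rendered faithfully by the generator. The resulting prompt would then be passed to whichever T2I model is plugged in, and the generated images would go through the separate Augmented Image Filtering stage.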