Can Text-to-image Model Assist Multi-modal Learning for Visual Recognition with Visual Modality Missing?

Tiantian Feng, Daniel Yang, Digbalay Bose, Shrikanth Narayanan

TL;DR

This paper addresses missing visual modalities in multimodal visual recognition and introduces GTI-MM, a Generative-Transformer Imputation framework that augments training with synthetic visuals generated from text-to-image prompts. The authors show that synthetic visuals improve data efficiency and robustness in audio-visual action recognition on datasets such as UCF101, ActivityNet, and Moments in Time when visual data are missing in training or testing, and they analyze the effects of generation quantity, diversity, and prompt complexity. GTI-MM is compatible with robustness techniques such as dropout and prompt learning, and the work discusses generalization to text-visual tasks while identifying challenges in audio imputation with current models. Here $p$ and $q$ denote the degree of visual-modality missingness in training and testing, respectively, with $p$ up to $95\text{--}99\%$ and $q$ up to $90\%$. Overall, GTI-MM provides a practical route to leverage synthetic data for resilient multimodal learning, with opportunities for improved multimodal generative models in future work.

Abstract

Multi-modal learning has emerged as an increasingly promising avenue in vision recognition, driving innovations across diverse domains ranging from media and education to healthcare and transportation. Despite its success, the robustness of multi-modal learning for visual recognition is often challenged by the unavailability of a subset of modalities, especially the visual modality. Conventional approaches to mitigate missing modalities in multi-modal learning rely heavily on algorithms and modality fusion schemes. In contrast, this paper explores the use of text-to-image models to assist multi-modal learning. Specifically, we propose a simple but effective multi-modal learning framework GTI-MM to enhance the data efficiency and model robustness against missing visual modality by imputing the missing data with generative transformers. Using multiple multi-modal datasets with visual recognition tasks, we present a comprehensive analysis of diverse conditions involving missing visual modality in data, including model training. Our findings reveal that synthetic images benefit training data efficiency with visual data missing in training and improve model robustness with visual data missing involving training and testing. Moreover, we demonstrate GTI-MM is effective with lower generation quantity and simple prompt techniques.

Paper Structure (43 sections, 13 figures, 11 tables)

This paper contains 43 sections, 13 figures, 11 tables.

Figures (13)

  • Figure 1: Problem formulation of missing modalities in this work with audio-visual recognition as the example. The missing modality includes cases in training data alone or any data.
  • Figure 2: Visual data generation process in GTI-MM.
  • Figure 3: Learning framework of GTI-MM: Imputing missing visual modality with synthetic visual content for robust multi-modal learning.
  • Figure 4: Performance comparisons among GTI-MM and other baselines at different training visual modality ratios.
  • Figure 5: Comparisons between GTI-MM and zero-shot learning with synthetic visual data.
  • ...and 8 more figures