Enhance Modality Robustness in Text-Centric Multimodal Alignment with Adversarial Prompting

Yun-Da Tsai, Ting-Yu Yen, Keng-Te Liao, Shou-De Lin

TL;DR

Text-centric multimodal alignment enables cross-modal understanding by converting diverse inputs into text for LLMs, but robustness suffers under noise, permutation, and missing data. The paper introduces a text-centric adversarial training framework that adds an LLM-based perturbation module atop a text-centric pipeline of text transformation, modality summarization, and reasoning/augmentation, generating adversarial examples denoted by $x_{adv}$ with prompts and temperature $T$. Empirical results on PetFinder, InsideAirbnb, and Avito show significant robustness gains over traditional robust training and pure MLLMs, with complementary qualitative analysis illustrating how LLMs recover lost content and articulate implicit relations. The approach demonstrates strong invariance to modality perturbations and transferability across LLMs, suggesting practical benefits for real-world multimodal systems.

Abstract

Converting different modalities into generalized text, which then serves as input prompts for large language models (LLMs), is a common approach for aligning multimodal models, particularly when pairwise data is limited. Text-centric alignment methods leverage the unique properties of text as a modality space, transforming diverse inputs into a unified textual representation, thereby enabling downstream models to effectively interpret various modal inputs. This study evaluates the quality and robustness of multimodal representations in the face of noise imperfections, dynamic input order permutations, and missing modalities, revealing that current text-centric alignment methods can compromise downstream robustness. To address this issue, we propose a new text-centric adversarial training approach that significantly enhances robustness compared to traditional robust training methods and pre-trained multimodal foundation models. Our findings underscore the potential of this approach to improve the robustness and adaptability of multimodal representations, offering a promising solution for dynamic and real-world applications.
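
The three perturbation types named in the abstract (noise, input order permutation, and missing modalities) can be illustrated on text-form modality inputs. The following is a minimal sketch, not the paper's implementation: the function names, the word-dropping noise model, and the sample record are all assumptions made for illustration.

```python
import random


def add_noise(text, drop_prob, rng):
    """Drop each word with probability drop_prob (a hypothetical noise model)."""
    words = text.split()
    kept = [w for w in words if rng.random() >= drop_prob]
    return " ".join(kept)


def permute_modalities(sample, rng):
    """Return the same modality texts in a shuffled order."""
    keys = list(sample)
    rng.shuffle(keys)
    return {k: sample[k] for k in keys}


def drop_modality(sample, name):
    """Remove one modality entirely to mimic a missing input."""
    return {k: v for k, v in sample.items() if k != name}


# Hypothetical record: each modality has already been converted to text.
rng = random.Random(0)
sample = {
    "image": "A small brown dog sitting on grass.",
    "table": "Type: dog; Age: 3 months; Fur length: short",
    "text": "Friendly puppy looking for a home.",
}

noisy = {k: add_noise(v, 0.3, rng) for k, v in sample.items()}
shuffled = permute_modalities(sample, rng)
missing = drop_modality(sample, "table")
```

A robust downstream model would ideally give similar predictions for `sample`, `shuffled`, and mildly `noisy` inputs, and degrade gracefully on `missing`.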

Paper Structure (29 sections, 9 figures, 3 tables)

This paper contains 29 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Text-centric multimodal alignment, which converts different modalities into text to serve as input prompts for LLMs, is a common method for aligning large multimodal language models when pairwise multimodal data is limited. The potential model collapse phenomenon can jeopardize the robustness of the aligned representation.
  • Figure 2: Each raw input modality is transformed into a text representation using a corresponding foundation model. Modality summarization and LLM reasoning are then applied in parallel. Finally, the output texts are concatenated as the input to a transformer model for downstream prediction. The inference phase follows a similar pattern. We apply a one-shot in-context learning approach to adapt the linguistic style to the one anticipated during training.
  • Figure 3: Examples of prompt templates for each module, with the required input and output information specified.
  • Figure 4: To assess robustness under noisy conditions, we measured both relative robustness (top) and effective robustness (bottom) on three datasets. Both metrics consistently show that the text-centric method is more robust and resilient to noise than the baseline methods, particularly as noise levels increase. Performance was measured with accuracy, MSE, or RMSE, depending on the dataset.
  • Figure 5: The color and fur length columns were dropped from the tabular data (gray). They were recovered (blue) after applying the alignment module, in which the LLM compensates for the missing information using the input image.
  • ...and 4 more figures
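
The data flow described in the Figure 2 caption (per-modality text conversion, then summarization and reasoning applied in parallel, then concatenation) can be sketched roughly as follows. This is an illustrative outline only: `build_model_input` and every callable passed to it are hypothetical stand-ins for the foundation models and LLM calls, which the caption does not specify.

```python
from typing import Any, Callable, Dict


def build_model_input(
    raw: Dict[str, Any],
    to_text: Dict[str, Callable[[Any], str]],
    summarize: Callable[[str], str],
    reason: Callable[[str], str],
) -> str:
    """Convert each modality to text, then concatenate summary and reasoning."""
    # Step 1: per-modality text transformation (one converter per modality).
    texts = {name: to_text[name](value) for name, value in raw.items()}
    joined = "\n".join(f"{name}: {text}" for name, text in texts.items())
    # Step 2: summarization and reasoning both read the same joined text,
    # so they are independent of each other and could run in parallel.
    summary = summarize(joined)
    reasoning = reason(joined)
    # Step 3: concatenate the outputs as the downstream model's input.
    return summary + "\n" + reasoning


# Stub converters and LLM calls, used here only to make the sketch runnable.
to_text = {
    "image": lambda path: f"caption of {path}",
    "table": lambda row: "; ".join(f"{k}: {v}" for k, v in row.items()),
}
raw = {"image": "pet.jpg", "table": {"Type": "dog", "Age": "3 months"}}
model_input = build_model_input(
    raw,
    to_text,
    summarize=lambda t: "SUMMARY: " + t.replace("\n", " | "),
    reason=lambda t: "REASONING: " + t.replace("\n", " | "),
)
```

In a real system the stubs would be replaced by captioning or serialization models and by prompted LLM calls.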