Enhance Modality Robustness in Text-Centric Multimodal Alignment with Adversarial Prompting
Yun-Da Tsai, Ting-Yu Yen, Keng-Te Liao, Shou-De Lin
TL;DR
Text-centric multimodal alignment enables cross-modal understanding by converting diverse inputs into text for LLMs, but its robustness suffers under noise, input-order permutation, and missing modalities. The paper introduces a text-centric adversarial training framework that adds an LLM-based perturbation module on top of a text-centric pipeline of text transformation, modality summarization, and reasoning/augmentation. The module generates adversarial examples, denoted $x_{adv}$, by prompting an LLM at temperature $T$. Empirical results on PetFinder, InsideAirbnb, and Avito show significant robustness gains over traditional robust training and over pure multimodal LLMs (MLLMs), and a complementary qualitative analysis illustrates how LLMs recover lost content and articulate implicit relations. The approach demonstrates strong invariance to modality perturbations and transfers across LLMs, suggesting practical benefits for real-world multimodal systems.
Abstract
Converting different modalities into generalized text, which then serves as input prompts for large language models (LLMs), is a common approach for aligning multimodal models, particularly when pairwise data is limited. Text-centric alignment methods leverage the unique properties of text as a modality space, transforming diverse inputs into a unified textual representation so that downstream models can effectively interpret inputs from various modalities. This study evaluates the quality and robustness of multimodal representations under noisy inputs, dynamic input-order permutations, and missing modalities, and reveals that current text-centric alignment methods can compromise downstream robustness. To address this issue, we propose a new text-centric adversarial training approach that significantly enhances robustness compared to traditional robust training methods and pre-trained multimodal foundation models. Our findings underscore the potential of this approach to improve the robustness and adaptability of multimodal representations, offering a promising solution for dynamic, real-world applications.
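To make the three perturbation types concrete, the following minimal sketch applies them to a text-centric input in which each modality has already been converted to text. It is a rule-based stand-in for illustration only, not the LLM-based perturbation module the paper proposes. The function name `perturb_modalities`, its parameters, the `[MASK]` token, and the sample record are all assumptions made for this example.

```python
import random


def perturb_modalities(modal_texts, drop_prob=0.3, noise_prob=0.1, seed=None):
    """Perturb a dict of {modality_name: text} and serialize it as one prompt.

    Hypothetical helper: a simple rule-based illustration of the three
    robustness conditions (missing modalities, noise, order permutation),
    not the paper's LLM-based adversarial prompting.
    """
    rng = random.Random(seed)
    items = list(modal_texts.items())

    # 1) Missing modalities: drop each modality independently,
    #    but always keep at least one so the prompt is never empty.
    kept = [kv for kv in items if rng.random() >= drop_prob]
    if not kept:
        kept = [rng.choice(items)]

    # 2) Noise: replace individual words with a mask token.
    noisy = []
    for name, text in kept:
        words = ["[MASK]" if rng.random() < noise_prob else w for w in text.split()]
        noisy.append((name, " ".join(words)))

    # 3) Order permutation: shuffle the order in which modalities appear.
    rng.shuffle(noisy)

    return "\n".join(f"{name}: {text}" for name, text in noisy)


if __name__ == "__main__":
    # Made-up record in the spirit of a pet-adoption listing.
    sample = {
        "image": "a small brown dog sitting on grass",
        "tabular": "age is 3 months, vaccinated is yes",
        "text": "friendly and playful puppy looking for a home",
    }
    print(perturb_modalities(sample, seed=0))
```

In an adversarial training setup, perturbed prompts of this kind would be paired with the original labels so that the downstream model learns to predict consistently despite the perturbation. How the paper's module constructs $x_{adv}$ through prompting is not specified in this section.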
