Enhance the Robustness of Text-Centric Multimodal Alignments

Ting-Yu Yen, Yun-Da Tsai, Keng-Te Liao, Shou-De Lin

TL;DR

This work scrutinizes the robustness of text-centric multimodal alignment, revealing that existing approaches are sensitive to missing data and noise due to issues like captioning collapse. It proposes a strengthened pipeline that adds modality summarization and LLM-based reasoning to fuse modalities and leverage external knowledge, improving downstream robustness across modalities. Empirical results on a multi-modal adoption dataset show significant improvements in resilience to imperfections and demonstrate transferability across different LLMs. The findings offer a practical pathway to more reliable multimodal perception systems in real-world settings.

Abstract

Converting different modalities into general text, serving as input prompts for large language models (LLMs), is a common method to align multimodal models when there is limited pairwise data. This text-centric approach leverages the unique properties of text as a modality space, transforming diverse inputs into a unified textual representation. This enables downstream models to effectively interpret various modal inputs. This study assesses the quality and robustness of multimodal representations in the presence of missing entries, noise, or absent modalities, revealing that current text-centric alignment methods compromise downstream robustness. To address this issue, we propose a new text-centric approach that achieves superior robustness compared to previous methods across various modalities in different settings. Our findings highlight the potential of this approach to enhance the robustness and adaptability of multimodal representations, offering a promising solution for dynamic and real-world applications.
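
To make the text-centric conversion described above concrete, here is a minimal sketch of turning one multimodal record into a single text prompt. It is an illustration under stated assumptions, not the authors' implementation: the field names, the "missing" marker, the `serialize_table` and `to_text_prompt` helpers, and the `dummy_captioner` stand-in for an image captioning model are all hypothetical.

```python
from typing import Callable, Mapping, Optional


def serialize_table(row: Mapping[str, Optional[object]]) -> str:
    """Turn a tabular row into 'column is value' clauses, skipping missing cells."""
    parts = [f"{col} is {val}" for col, val in row.items() if val is not None]
    return "; ".join(parts) if parts else "no tabular data available"


def to_text_prompt(
    row: Optional[Mapping[str, Optional[object]]],
    image: Optional[bytes],
    text: Optional[str],
    captioner: Callable[[bytes], str],
) -> str:
    """Build one unified text prompt; absent modalities are stated explicitly."""
    sections = {
        "Table": serialize_table(row) if row is not None else "missing",
        "Image": captioner(image) if image is not None else "missing",
        "Text": text.strip() if text else "missing",
    }
    return "\n".join(f"{name}: {content}" for name, content in sections.items())


def dummy_captioner(image: bytes) -> str:
    # Hypothetical stand-in for an image captioning model; not a real model call.
    return f"an image of {len(image)} bytes"


# Example record with a missing table cell and an absent image modality.
prompt = to_text_prompt(
    row={"age": 3, "breed": None, "fee": 0},
    image=None,
    text=" Friendly and playful. ",
    captioner=dummy_captioner,
)
print(prompt)
```

Stating an absent modality explicitly, rather than silently dropping it, is one possible design choice; it keeps the prompt layout stable when inputs are incomplete.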

Paper Structure

This paper contains 19 sections, 9 figures, 1 table.

Figures (9)

  • Figure 1: Text-centric multimodal alignment, which converts different modalities into text to serve as input prompts for LLMs, is a common method for aligning large multimodal language models when pairwise multimodal data is limited.
  • Figure 2: Each raw input modality is transformed into a text representation using a corresponding foundation model. Modality summarization and LLM reasoning are then applied in parallel. Finally, the output texts are concatenated as the input to a transformer model for downstream prediction. The inference phase follows a similar pattern. We apply a one-shot in-context learning approach to adapt the linguistic style to the one expected from training.
  • Figure 3: To evaluate robustness with noise applied to all three modalities, we analyzed accuracy (left) and drop ratio (right). Both metrics indicate that the text-centric method demonstrates stronger robustness and better resistance to noise compared to other baselines as the noise level increases.
  • Figure 4: Drop ratio when noise is applied to each modality separately: Image (left), Table (center), and Text (right). Text-centric methods outperform all baselines and show almost no decline as the noise level increases.
  • Figure 5: Component ablation studies suggest that removing modality summarization and reasoning greatly reduces the performance of our method under increased noise levels. This indicates that both components are critical to the method.
  • ...and 4 more figures