Corrupted but Not Broken: Understanding and Mitigating the Negative Impacts of Corrupted Data in Visual Instruction Tuning

Yunhao Gou, Hansi Yang, Zhili Liu, Kai Chen, Yihan Zeng, Lanqing Hong, Zhenguo Li, Qun Liu, Bo Han, James T. Kwok, Yu Zhang

TL;DR

Corrupted data in Visual Instruction Tuning (VIT) can harm Multimodal LLMs (MLLMs), but the authors show that the damage is largely reversible and that a corrupted model can still distinguish clean samples from corrupted ones. They demonstrate that pruning about 1.4% of parameters, or using a corruption-aware self-validation loop, can largely restore performance, or even improve on the performance obtained with clean data. They propose a corruption-robust training paradigm that uses the corrupted MLLM's ability to identify clean samples to guide training, and it significantly surpasses existing mitigation methods. The findings offer a practical way to exploit corrupted data rather than relying solely on costly data collection, with implications for robust multimodal instruction tuning.

Abstract

Visual Instruction Tuning (VIT) aims to enhance Multimodal Large Language Models (MLLMs), yet its effectiveness is often compromised by corrupted datasets with issues such as hallucinated content, incorrect responses, and poor OCR quality. Previous approaches to address these challenges have focused on refining datasets through high-quality data collection or rule-based filtering that can be costly or limited in scope. In this paper, we conduct a systematic investigation into the impact of corrupted data on MLLMs and discover that, although corrupted data degrade model performance, such adverse effects are largely reversible, and MLLMs are "corrupted but not broken". Specifically, we find that disabling a small subset of parameters can almost fully restore performance. Moreover, corrupted MLLMs inherently possess the capability to differentiate between clean and corrupted samples, facilitating dataset cleaning without external intervention. Building on these insights, we introduce a corruption-robust training paradigm that significantly surpasses existing strategies for mitigating the effects of corrupted data.

Paper Structure

This paper contains 52 sections, 22 equations, 20 figures, 9 tables.

Figures (20)

  • Figure 1: Examples of corrupted samples in VIT.
  • Figure 2: Left: Average task performance of MLLMs under various corruption ratios. Although simple fine-tuning suffers a performance drop, disabling corruption-related parameters (1.4%) largely restores performance. Our method is robust across corruption ratios. Right: Precision of an MLLM fine-tuned on corrupted samples when classifying samples as clean or corrupted. Details are in the appendix.
  • Figure 3: Performance (y-axis) of LLaVA-1.5 (LLaMA-3.1-8B) under different corruption ratios (x-axis).
  • Figure 4: Effects of corruption on LLaVA-1.5 (LLaMA-3.1-8B). The evaluation datasets are shown in 3 groups: VQA, Conversation and MC-VQA. The corruption ratio here is 60%.
  • Figure 5: Precision-recall curves of the MLLM's predictions of the correctness of 100K samples ($cr=50\%$). $x$-axis: recall; $y$-axis: precision. Solid and dotted lines denote predictions based on Val_PPL and PPL, respectively. Color represents the corruption ratio of the training dataset: 0%, 10%, 20%, 30%, 40%, 50%.
  • ...and 15 more figures
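
The Figure 5 caption describes scoring samples by perplexity (PPL) and reporting precision-recall curves for the resulting clean-versus-corrupted predictions. The following is a minimal sketch of that kind of evaluation, not the authors' implementation. It assumes that a lower perplexity means a sample is predicted clean, which this excerpt does not state, and it covers only plain PPL because Val_PPL is not defined here. The function names `perplexity` and `precision_recall_curve` are hypothetical, and the toy log-probabilities are invented for illustration rather than taken from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one response: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def precision_recall_curve(scores, is_clean):
    """Sweep a threshold over scores, treating lower scores as 'predicted clean'.

    ASSUMPTION: lower perplexity indicates a clean sample.
    Returns (recall, precision) points obtained by predicting the k
    lowest-scoring samples as clean, for k = 1..n.
    """
    total_clean = sum(is_clean)
    if total_clean == 0:
        raise ValueError("need at least one clean sample to compute recall")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    points, true_pos = [], 0
    for k, i in enumerate(order, start=1):
        true_pos += is_clean[i]
        points.append((true_pos / total_clean, true_pos / k))
    return points


# Toy per-token log-probabilities (illustrative values only, not paper data).
logprobs = [
    [-0.1, -0.2, -0.1],  # labeled clean
    [-2.0, -1.5, -2.5],  # labeled corrupted
    [-0.3, -0.2, -0.4],  # labeled clean
    [-1.0, -3.0, -2.0],  # labeled corrupted
]
labels = [1, 0, 1, 0]  # 1 = clean, 0 = corrupted

scores = [perplexity(lp) for lp in logprobs]
curve = precision_recall_curve(scores, labels)
for recall, precision in curve:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

In practice the log-probabilities would come from the fine-tuned model's scoring of each response. The sweep simply ranks samples by score, so any other per-sample score can be substituted without changing the curve computation.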