Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench

Zheyuan Liu, Guangyao Dou, Mengzhao Jia, Zhaoxuan Tan, Qingkai Zeng, Yongle Yuan, Meng Jiang

TL;DR

This work addresses privacy risks in multimodal large language models by introducing MLLMU-Bench, a dedicated benchmark for studying multimodal unlearning. The dataset combines 500 fictitious profiles and 153 real celebrities across four sets (Forget, Test, Retain, Real Celebrity) with 20k+ image+text and text-only questions, enabling evaluation of unlearning efficacy, generalizability, and model utility under 5%, 10%, and 15% forget scenarios. Baseline experiments on two base MLLMs reveal modality-dependent patterns: unimodal unlearning tends to excel in generation and cloze tasks, while multimodal unlearning better supports classification with multimodal inputs. They also show a notable trade-off between forgetting effectiveness and overall model utility. The findings underscore the need for more sophisticated multimodal unlearning strategies and provide a framework for future work on privacy-preserving mechanisms and potential certified unlearning guarantees in MLLMs.

Abstract

Generative models such as Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) trained on massive web corpora can memorize and disclose individuals' confidential and private data, raising legal and ethical concerns. While many previous works have addressed this issue in LLMs via machine unlearning, it remains largely unexplored for MLLMs. To tackle this challenge, we introduce the Multimodal Large Language Model Unlearning Benchmark (MLLMU-Bench), a novel benchmark aimed at advancing the understanding of multimodal machine unlearning. MLLMU-Bench consists of 500 fictitious profiles and 153 profiles for public celebrities, each profile featuring over 14 customized question-answer pairs, evaluated from both multimodal (image+text) and unimodal (text) perspectives. The benchmark is divided into four sets to assess unlearning algorithms in terms of efficacy, generalizability, and model utility. Finally, we provide baseline results using existing generative model unlearning algorithms. Surprisingly, our experiments show that unimodal unlearning algorithms excel in generation and cloze tasks, while multimodal unlearning approaches perform better in classification tasks with multimodal inputs.
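
To make the 5%, 10%, and 15% forget scenarios concrete, the following is a minimal sketch of partitioning the 500 fictitious profiles into a forget portion and a retained portion. The function name `split_profiles`, the integer profile IDs, and the seeded random shuffle are assumptions made for illustration; the summary above does not state how the benchmark actually selects the profiles to forget.

```python
import random


def split_profiles(profile_ids, forget_ratio, seed=0):
    """Partition profile IDs into (forget, retain) lists.

    Hypothetical helper: the random, seeded selection is an assumption,
    not the benchmark's documented procedure.
    """
    ids = list(profile_ids)
    rng = random.Random(seed)  # local RNG keeps the split reproducible
    rng.shuffle(ids)
    n_forget = round(len(ids) * forget_ratio)
    return sorted(ids[:n_forget]), sorted(ids[n_forget:])


if __name__ == "__main__":
    fictitious_profiles = range(500)  # placeholder IDs for the 500 profiles
    for ratio in (0.05, 0.10, 0.15):
        forget, retain = split_profiles(fictitious_profiles, ratio)
        print(f"{ratio:.0%}: forget={len(forget)} retain={len(retain)}")
```

With 500 profiles, these ratios correspond to 25, 50, and 75 forgotten profiles, leaving 475, 450, and 425 respectively.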

Paper Structure

This paper contains 49 sections, 14 equations, 28 figures, 3 tables.

Figures (28)

  • Figure 1: Demonstration of the multimodal unlearning task. The MLLM is first fine-tuned on the constructed profiles in the proposed benchmark. After fine-tuning, the MLLM can answer multimodal questions related to the profiles. We then apply various unlearning methods to a portion of the profiles (the forget set). Finally, performance on tasks related to the forget set and on the remaining evaluation datasets is tested simultaneously.
  • Figure 2: Examples of question-answer pairs from all four distinct datasets used to assess model unlearning efficacy and model utility. The Forget, Test, and Retain Sets contain fictitious individuals, while the Real Celebrity Set includes real public figures.
  • Figure 3: Classification, generation, and cloze performance of the GA algorithm applied to multimodal and unimodal setups with 5% forget data, using LLaVA as the base model. In subplots (a), (b), (e), (f), (i), (j), the $y$-axis shows the difference in classification accuracy, Rouge-L score, and cloze accuracy compared to the vanilla model, evaluated on the Forget and Test sets. In the remaining subplots, the $y$-axis shows the classification accuracy, Rouge-L score, and cloze accuracy, respectively. The $x$-axis reflects performance across different modalities.
  • Figure 4: The overall trade-off between unlearning effectiveness and model utility across all baselines using different forget data, with LLaVA as the base model. The $x$-axis shows the difference in forget classification accuracy relative to the vanilla model, while the $y$-axis reflects model utility from various perspectives. From left to right, these perspectives include retain accuracy, real celebrity accuracy, MMMU, and LLaVA-Bench performance, respectively.
  • Figure 5: GPT-4o Prompting Strategy for Factuality Score Evaluation with Few-Shot Examples.
  • ...and 23 more figures
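
Several captions above report scores as a difference from the vanilla model, including the Rouge-L score. As a rough illustration of those quantities, here is a minimal sketch of an LCS-based Rouge-L F1 score and the delta relative to a vanilla baseline. Whitespace tokenization, lowercasing, the choice of the F1 variant, and the helper names `rouge_l_f1` and `delta_vs_vanilla` are all assumptions; the captions do not specify which Rouge-L variant or tokenizer the paper uses.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Rouge-L F1 over whitespace tokens (assumed variant, see lead-in)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def delta_vs_vanilla(unlearned_score, vanilla_score):
    """Score difference relative to the vanilla model (negative = lower)."""
    return unlearned_score - vanilla_score
```

On the Forget set, a strongly negative delta would indicate that the unlearned model reproduces less of the reference answer than the vanilla model did.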