LUMIA: Linear probing for Unimodal and MultiModal Membership Inference Attacks leveraging internal LLM states
Luis Ibanez-Lissen, Lorena Gonzalez-Manzano, Jose Maria de Fuentes, Nicolas Anciaux, Joaquin Garcia-Alfaro
TL;DR
LUMIA introduces a white-box, layer-wise probing framework that applies Linear Probes (LPs) to internal LLM activations to detect Membership Inference Attacks (MIAs) in both unimodal and multimodal settings. By training an LP after every transformer layer and evaluating it with Area Under the Curve (AUC), LUMIA improves detection performance (an average AUC gain of $15.71\%$ over the state of the art in unimodal cases) and identifies the most informative layers for MIAs, revealing how detectability varies with model size, dataset type, and deduplication. Multimodal experiments show strong detectability across several image-text tasks, with $85.9\%$ of experiments achieving $AUC>0.6$, suggesting that visual inputs contribute valuable signal for MIAs. The work provides a layer-by-layer analysis across a broad set of models and datasets, offering insights for auditing and defending LLMs against MIAs while highlighting the influence of TB/NGB biases and data deduplication on attack effectiveness.
Abstract
Large Language Models (LLMs) are increasingly used in a variety of applications, but concerns around membership inference have grown in parallel. Previous efforts focus on black-to-grey-box models, thus neglecting the potential benefit of internal LLM information. To address this, we propose the use of Linear Probes (LPs) as a method to detect Membership Inference Attacks (MIAs) by examining the internal activations of LLMs. Our approach, dubbed LUMIA, applies LPs layer by layer to obtain fine-grained data on the model's inner workings. We test this method across several model architectures, sizes and datasets, including unimodal and multimodal tasks. In unimodal MIA, LUMIA achieves an average gain of 15.71% in Area Under the Curve (AUC) over previous techniques. Remarkably, LUMIA reaches AUC>60% in 65.33% of cases -- an increment of 46.80% against the state of the art. Furthermore, our approach reveals key insights, such as the model layers where MIAs are most detectable. In multimodal models, LPs indicate that visual inputs can significantly contribute to detecting MIAs -- AUC>60% is reached in 85.90% of experiments.
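The layer-by-layer probing described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes synthetic Gaussian vectors in place of real hidden states, a plain logistic-regression probe trained by gradient descent, and a hypothetical per-layer `signal` value that controls how strongly members differ from non-members. The function names (`synthetic_layer_activations`, `train_linear_probe`, `probe_all_layers`) are illustrative only.

```python
import math
import random


def sigmoid(z):
    # Clamp the logit to avoid overflow in math.exp.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def synthetic_layer_activations(n, dim, signal, rng):
    """Hypothetical stand-in for one layer's hidden states.

    Members (label 1) are shifted by `signal` on every dimension;
    non-members (label 0) are centred at zero.
    """
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(signal * label, 1.0) for _ in range(dim)])
        y.append(label)
    return X, y


def train_linear_probe(X, y, epochs=200, lr=0.1):
    """Fit a logistic-regression probe (weights and bias) by batch gradient descent."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auc(scores, labels):
    """Probability that a random member outscores a random non-member (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def probe_all_layers(layer_signals, n_train=200, n_test=200, dim=8, seed=0):
    """Train one probe per layer and return the held-out AUC of each."""
    rng = random.Random(seed)
    aucs = []
    for signal in layer_signals:
        X_tr, y_tr = synthetic_layer_activations(n_train, dim, signal, rng)
        X_te, y_te = synthetic_layer_activations(n_test, dim, signal, rng)
        w, b = train_linear_probe(X_tr, y_tr)
        scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X_te]
        aucs.append(auc(scores, y_te))
    return aucs


# Assumed membership signal per layer; layer 2 is made the most informative.
layer_signals = [0.0, 0.2, 0.8, 0.4]
layer_aucs = probe_all_layers(layer_signals)
best_layer = max(range(len(layer_aucs)), key=layer_aucs.__getitem__)
print("AUC per layer:", [round(a, 3) for a in layer_aucs])
print("Most informative layer:", best_layer)
```

With a real model, the synthetic vectors would be replaced by activations extracted at each layer for known member and non-member samples. The per-layer AUC list is what allows the most informative layers to be compared.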
