Permissive Information-Flow Analysis for Large Language Models
Shoaib Ahmed Siddiqui, Radhika Gaonkar, Boris Köpf, David Krueger, Andrew Paverd, Ahmed Salem, Shruti Tople, Lukas Wutschitz, Menglin Xia, Santiago Zanella-Béguelin
TL;DR
The paper tackles security and privacy risks in retrieval-augmented LLMs by introducing a permissive information-flow label propagator that avoids label creep. It formalizes a diffusion-aware, influence-based method that propagates only labels of inputs actually influencing the output, and implements two realizations (prompt-based augmentation and $k$NN-LM) within a safety-guaranteed wrapper. Empirical results across synthetic, news, and LLM-agent datasets show that the prompt-based approach identifies minimal labels with high accuracy (up to ~86% exact matches on large label sets) and significantly improves labels in total-order lattices, while maintaining strong output alignment. The work demonstrates practical system-level benefits, enabling more permissive yet safe information flows and offering extensions to broader applications and efficiency improvements in LLM-based pipelines.
Abstract
Large Language Models (LLMs) are rapidly becoming commodity components of larger software systems. This poses natural security and privacy problems: poisoned data retrieved from one component can change the model's behavior and compromise the entire system, including coercing the model to spread confidential data to untrusted components. One promising approach is to tackle this problem at the system level via dynamic information flow (aka taint) tracking. Unfortunately, this approach of propagating the most restrictive input label to the output is too conservative for applications where LLMs operate on inputs retrieved from diverse sources. In this paper, we propose a novel, more permissive approach to propagate information flow labels through LLM queries. The key idea behind our approach is to propagate only the labels of the samples that were influential in generating the model output and to eliminate the labels of unnecessary inputs. We implement and investigate the effectiveness of two variations of this approach, based on (i) prompt-based retrieval augmentation, and (ii) a $k$-nearest-neighbors language model. We compare these with a baseline that uses introspection to predict the output label. Our experimental results in an LLM agent setting show that the permissive label propagator improves over the baseline in more than 85% of the cases, which underscores the practicality of our approach.
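The difference between conservative and permissive label propagation can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a hypothetical three-level, totally ordered confidentiality lattice (`Label`) and a hypothetical `influential` argument holding the indices of the retrieved documents judged to have influenced the output. How that influence set is computed (for example, via prompt-based retrieval augmentation or a $k$-nearest-neighbors language model) is outside the scope of the sketch.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, Sequence


class Label(IntEnum):
    """Hypothetical total-order lattice; a higher value is more restrictive."""
    PUBLIC = 0
    INTERNAL = 1
    SECRET = 2


@dataclass(frozen=True)
class Doc:
    """A retrieved document together with its information-flow label."""
    text: str
    label: Label


def join(labels: Iterable[Label]) -> Label:
    """Least upper bound. In a total order this is max; the empty join is bottom."""
    return max(labels, default=Label.PUBLIC)


def conservative_label(context: Sequence[Doc]) -> Label:
    """Standard taint tracking: the output inherits the join of all input labels."""
    return join(d.label for d in context)


def permissive_label(context: Sequence[Doc], influential: Iterable[int]) -> Label:
    """Join only the labels of the documents assumed to have influenced the output."""
    return join(context[i].label for i in influential)


# Illustrative context: only documents 0 and 2 are assumed to matter for the answer.
context = [
    Doc("Weather report for Monday", Label.PUBLIC),
    Doc("Payroll spreadsheet", Label.SECRET),
    Doc("Team lunch menu", Label.INTERNAL),
]

if __name__ == "__main__":
    print(conservative_label(context).name)        # SECRET
    print(permissive_label(context, [0, 2]).name)  # INTERNAL
```

In this example the conservative propagator labels the output SECRET because one unused document carries that label, whereas the permissive propagator yields INTERNAL. The permissive result is only as trustworthy as the influence set it is given.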
