Atomic Information Flow: A Network Flow Model for Tool Attributions in RAG Systems
James Gao, Josh Zhou, Qi Sun, Ryan Huang, Steven Yoo
TL;DR
The paper presents Atomic Information Flow (AIF), a graph-based framework that attributes LLM-generated responses in multi-tool RAG systems to specific tool components by decomposing all outputs into minimal semantic atoms and tracking how those atoms flow. It introduces a three-stage methodology (atomic decomposition, atomic signal injection, and response-atom assignment) plus flow heuristics, and trains a small model (Gemma3-4B) to approximate a minimum-cut compression policy. Empirical results on HotpotQA and other QA benchmarks show that AIF signals enable substantial context compression with only modest losses in accuracy, bridging the gap between small and large models and enabling fine-grained attribution and trajectory debugging. The work lays a foundation for trajectory-level rewards and explainability in RAG stacks, with implications for efficiency and scalable tool orchestration in LLM systems.
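The decomposition and assignment stages can be illustrated with a minimal Python sketch. Everything in it is an assumption for illustration, not the paper's implementation: the hypothetical `Atom` type, a regex sentence splitter standing in for atomic decomposition, and a Jaccard token-overlap matcher (with a hypothetical 0.3 threshold) standing in for response-atom assignment. Atomic signal injection and the paper's flow heuristics are not modeled. The sketch only shows how, once response atoms are assigned to tool atoms, a per-tool attribution score follows as the share of response atoms each tool supplies.

```python
from __future__ import annotations

import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    source: str  # node that emitted the atom (tool name or "response")
    text: str


def decompose(source: str, output: str) -> list[Atom]:
    """Stage 1 stand-in: split an output into sentence-level atoms.
    Assumption: a real decomposer (e.g. an LLM prompt) would yield finer,
    self-contained facts; a regex sentence split is only a placeholder."""
    parts = re.split(r"(?<=[.!?])\s+", output.strip())
    return [Atom(source, p) for p in parts if p]


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def assign(resp: Atom, tool_atoms: list[Atom], threshold: float = 0.3) -> Atom | None:
    """Assignment stand-in: map a response atom to the tool atom with the
    highest Jaccard token overlap; None if below the hypothetical threshold."""
    best, best_score = None, 0.0
    r = _tokens(resp.text)
    for atom in tool_atoms:
        t = _tokens(atom.text)
        union = r | t
        score = len(r & t) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = atom, score
    return best if best_score >= threshold else None


def tool_attribution(response: str, tool_outputs: dict[str, str]) -> dict[str, float]:
    """Fraction of response atoms whose assigned atom came from each tool."""
    tool_atoms = [a for name, out in tool_outputs.items() for a in decompose(name, out)]
    response_atoms = decompose("response", response)
    if not response_atoms:
        return {}
    counts = Counter()
    for ra in response_atoms:
        src = assign(ra, tool_atoms)
        counts[src.source if src else "unattributed"] += 1
    return {name: n / len(response_atoms) for name, n in counts.items()}


tools = {
    "search": "Paris is the capital of France. The Eiffel Tower opened in 1889.",
    "calculator": "1889 plus 100 equals 1989.",
}
answer = "The Eiffel Tower opened in 1889. Paris is the capital of France. The tower is 330 metres tall."
print(tool_attribution(answer, tools))  # search: 2/3, unattributed: 1/3
```

The third response atom overlaps no tool atom above the threshold, so it lands in the `unattributed` bucket; in a real trajectory, that bucket is a natural place to start when debugging unsupported content.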
Abstract
Many tool-based Retrieval Augmented Generation (RAG) systems lack precise mechanisms for tracing final responses back to specific tool components, a critical gap as systems scale to complex multi-agent architectures. We present \textbf{Atomic Information Flow (AIF)}, a graph-based network flow model that decomposes tool outputs and LLM calls into atoms: indivisible, self-contained units of information. By modeling LLM orchestration as a directed flow of atoms from tool and LLM nodes to a response super-sink, AIF enables granular attribution metrics for AI explainability. Motivated by the max-flow min-cut theorem in network flow theory, we train a lightweight Gemma3 (4B-parameter) language model as a context compressor to approximate the minimum cut of tool atoms using flow signals computed offline by AIF. The base Gemma3-4B model identifies critical information with only \textbf{54.7\%} accuracy on HotpotQA, barely outperforming a lexical baseline (BM25). Post-training on AIF signals raises accuracy to \textbf{82.71\%} (+28.01 points) while achieving \textbf{87.52\%} (+1.85\%) context token compression, bridging the gap with the Gemma3-27B variant, a model nearly $7\times$ larger.
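The abstract does not spell out how the minimum cut over tool atoms is defined. As one illustrative reading, which is our assumption and not the paper's stated objective, the sketch below casts compression as a maximum-closure (project-selection) problem on the atom flow graph. A super-source feeds each tool atom with capacity equal to a hypothetical token cost (scaled by a hypothetical `lam`). Each tool atom connects with infinite capacity to the response atoms it supports, and each response atom drains into the response super-sink with a hypothetical value. Tool atoms on the sink side of the minimum cut are kept as compressed context. It further assumes a response atom survives only if all of its supporting tool atoms are kept, and it uses a hand-written Edmonds-Karp max-flow rather than any particular library.

```python
import math
from collections import deque

INF = math.inf


def max_flow_min_cut(cap, s, t):
    """Edmonds-Karp on a dict-of-dicts capacity graph.
    Returns (max-flow value, set of nodes on the source side of a minimum cut)."""
    # Residual graph: copy forward capacities, add zero-capacity reverse edges.
    res = {u: dict(nbrs) for u, nbrs in cap.items()}
    for u, nbrs in cap.items():
        for v in nbrs:
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            break  # BFS ran to completion: parent holds every reachable node
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
        flow += bottleneck
    # Nodes still reachable from s in the residual graph form the source side.
    return flow, set(parent)


def select_tool_atoms(token_cost, value, supports, lam=1.0):
    """Hypothetical compression objective: keep response atom r only if every
    tool atom supporting it is kept; maximise
    sum(value[r] for kept r) - lam * sum(token_cost[t] for kept t)."""
    S, T = "__source__", "__sink__"
    cap = {S: {("tool", t): lam * c for t, c in token_cost.items()}}
    for r, ts in supports.items():
        for t in ts:
            cap.setdefault(("tool", t), {})[("resp", r)] = INF
        cap.setdefault(("resp", r), {})[T] = value[r]
    _, source_side = max_flow_min_cut(cap, S, T)
    # Kept tool atoms are those on the sink side of the minimum cut.
    return {t for t in token_cost if ("tool", t) not in source_side}


kept = select_tool_atoms(
    token_cost={"a": 5, "b": 3, "c": 4, "d": 10},
    value={"r1": 6, "r2": 8, "r3": 2},
    supports={"r1": ["a"], "r2": ["b", "c"], "r3": ["d"]},
)
print(sorted(kept))  # ['a', 'b', 'c']
```

In this toy graph, atom `d` costs 10 tokens but supports a response atom worth only 2, so the cut drops it; raising `lam` makes the policy more aggressive. In the abstract's framing, the trained Gemma3-4B compressor learns to approximate cut decisions from flow signals computed offline, whereas this sketch computes an exact cut on a small, fully known graph.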
