On-Premise AI for the Newsroom: Evaluating Small Language Models for Investigative Document Search
Nick Hagar, Nicholas Diakopoulos, Jeremy Gilbert
TL;DR
This work asks how investigative journalists can use AI-assisted document search without compromising accuracy, transparency, or data security. It introduces a journalist-centered, on-premise pipeline built on small language models, with a five-stage workflow and explicit citation chains that ground every claim in a source document. Evaluation of Gemma 3 12B, Qwen 3 14B, and GPT-OSS 20B on two corpora shows that small models can deliver high citation validity and run on desktop hardware, but that reliability varies substantially across models and multi-stage synthesis can propagate errors. These results underscore the need for careful model selection and sustained human oversight. The study demonstrates that auditable, on-premise AI is feasible for resource-constrained newsrooms and offers practical design guidance for balancing analytical power with editorial standards.
Abstract
Investigative journalists routinely confront large document collections. Large language models (LLMs) with retrieval-augmented generation (RAG) capabilities promise to accelerate document discovery, but newsroom adoption remains limited due to hallucination risks, verification burden, and data privacy concerns. We present a journalist-centered approach to LLM-powered document search that prioritizes transparency and editorial control through a five-stage pipeline: corpus summarization, search planning, parallel thread execution, quality evaluation, and synthesis. The pipeline uses small, locally deployable language models, which preserve data security, and maintains complete auditability through explicit citation chains. Evaluating three quantized models (Gemma 3 12B, Qwen 3 14B, and GPT-OSS 20B) on two corpora, we find substantial variation in reliability. All models achieved high citation validity and ran effectively on standard desktop hardware (e.g., 24 GB of memory), demonstrating feasibility for resource-constrained newsrooms. However, systematic challenges emerged, including error propagation through multi-stage synthesis and large differences in performance depending on how much a model's training data overlaps with corpus content. These findings suggest that effective newsroom AI deployment requires careful model selection and system design, alongside human oversight to maintain standards of accuracy and accountability.
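To make the idea of an auditable citation chain concrete, the following is a minimal sketch of how citation validity could be checked automatically. It is an illustration under stated assumptions, not the procedure used in this work: the abstract does not define how citation validity is measured, and the `Citation` type, the `normalize`, `is_valid`, and `citation_validity` helpers, and the rule that a citation is valid when its quoted passage appears in the cited document are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record: not a structure defined by the paper."""
    doc_id: str  # identifier of the cited source document
    quote: str   # passage the model claims to have drawn from that document


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def is_valid(citation: Citation, corpus: dict[str, str]) -> bool:
    """Assumed rule: the cited document must exist and contain the quoted passage."""
    document = corpus.get(citation.doc_id)
    if document is None:
        return False
    quote = normalize(citation.quote)
    return bool(quote) and quote in normalize(document)


def citation_validity(citations: list[Citation], corpus: dict[str, str]) -> float:
    """Fraction of citations that pass is_valid (0.0 when there are none)."""
    if not citations:
        return 0.0
    return sum(is_valid(c, corpus) for c in citations) / len(citations)


if __name__ == "__main__":
    # Toy corpus and citations, invented purely for illustration.
    corpus = {"memo-01": "The  contract was signed\nin March 2019."}
    citations = [
        Citation("memo-01", "contract was signed in March"),  # passage present
        Citation("memo-02", "contract was signed"),           # unknown document
        Citation("memo-01", "signed in April"),               # passage absent
    ]
    print(f"citation validity: {citation_validity(citations, corpus):.2f}")
```

A check of this kind only confirms that a cited passage exists. It says nothing about whether the passage supports the claim attached to it, which is one reason human review remains necessary even when citation validity is high.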
