Unvalidated Trust: Cross-Stage Vulnerabilities in Large Language Model Architectures
Dominik Schwarz
TL;DR
This work addresses semantic security in multi-stage LLM pipelines by exposing cross-stage vulnerabilities that arise when downstream components implicitly trust upstream outputs. It introduces a mechanism-centered taxonomy of 41 risk patterns, backed by empirical observations from provider-default, text-only sessions with three commercial models and a local offline baseline, using the metrics DS, IEO, POB, PDI, and RR. The study identifies core architectural failure modes, namely unvalidated trust inheritance, interpretation-driven state changes, and cross-modal or structural influences, and argues that traditional string filtering is insufficient. It proposes zero-trust architectural principles, including provenance enforcement, context sealing, and plan revalidation, and outlines Countermind as a companion blueprint for implementing these defenses. Together, the findings support defense-in-depth strategies against cross-stage manipulation and offer concrete guidance for auditing, architecture design, and ongoing evaluation of real-world AI pipelines.
Abstract
As Large Language Models (LLMs) are increasingly integrated into automated, multi-stage pipelines, risk patterns that arise from unvalidated trust between processing stages become a practical concern. This paper presents a mechanism-centered taxonomy of 41 recurring risk patterns in commercial LLMs. The analysis shows that inputs are often interpreted non-neutrally and can trigger implementation-shaped responses or unintended state changes even without explicit commands. We argue that these behaviors constitute architectural failure modes and that string-level filtering alone is insufficient. To mitigate such cross-stage vulnerabilities, we recommend zero-trust architectural principles, including provenance enforcement, context sealing, and plan revalidation, and we introduce "Countermind" as a conceptual blueprint for implementing these defenses.
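To make the idea of provenance enforcement between stages more concrete, the following is a minimal illustrative sketch, not the Countermind design itself. It assumes a hypothetical pipeline in which each stage seals its output with an HMAC over the stage name and content, and the downstream stage refuses anything whose tag does not verify or whose source is not on its allow-list. All names (`StageOutput`, `seal`, `verify`, `STAGE_KEY`) are assumptions introduced for illustration.

```python
import hashlib
import hmac
from dataclasses import dataclass

# Hypothetical shared key; a real deployment would load it from a secret store.
STAGE_KEY = b"demo-key-not-for-production"


@dataclass(frozen=True)
class StageOutput:
    source: str   # name of the stage that produced the content
    content: str  # text handed to the next stage
    tag: str      # HMAC-SHA256 over source and content


def _mac(source: str, content: str) -> str:
    # Length-prefix the source so distinct (source, content) pairs
    # cannot serialize to the same byte string.
    msg = f"{len(source)}:{source}|{content}".encode("utf-8")
    return hmac.new(STAGE_KEY, msg, hashlib.sha256).hexdigest()


def seal(source: str, content: str) -> StageOutput:
    """Wrap a stage's output together with its provenance tag."""
    return StageOutput(source, content, _mac(source, content))


def verify(out: StageOutput, trusted_sources: set[str]) -> bool:
    """Accept output only if the source is allow-listed and the tag matches."""
    expected = _mac(out.source, out.content)
    return out.source in trusted_sources and hmac.compare_digest(expected, out.tag)


if __name__ == "__main__":
    good = seal("planner", "summarize document 7")
    # Content swapped after sealing: the old tag no longer matches.
    forged = StageOutput("planner", "delete all records", good.tag)
    print(verify(good, {"planner"}))
    print(verify(forged, {"planner"}))
```

In this sketch the downstream stage does not inspect the text for suspicious strings at all; it only checks where the text came from and whether it was altered in transit, which is the distinction the abstract draws between string-level filtering and architectural controls. It does not address content that a trusted stage itself produced under manipulation, which is where measures such as plan revalidation would apply.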
