Unvalidated Trust: Cross-Stage Vulnerabilities in Large Language Model Architectures

Dominik Schwarz

TL;DR

This work tackles semantic security in multi-stage LLM pipelines by revealing cross-stage vulnerabilities that arise when downstream components implicitly trust upstream outputs. It introduces a mechanism-centered taxonomy of 41 risk patterns, backed by empirical observations across provider-default, text-only sessions with three commercial models and a local offline baseline, using metrics DS, IEO, POB, PDI, and RR. The study identifies core architectural failure modes (unvalidated trust inheritance, interpretation-driven state changes, and cross-modal/structural influences) and argues that traditional string filtering is insufficient. It proposes zero-trust architectural principles, including provenance enforcement, context sealing, and plan revalidation, and outlines Countermind as a companion blueprint for implementing these defenses. Collectively, the findings support defense-in-depth strategies to curb cross-stage manipulation, offering concrete guidance for auditing, architecture design, and ongoing evaluation in real-world AI pipelines.

Abstract

As Large Language Models (LLMs) are increasingly integrated into automated, multi-stage pipelines, risk patterns that arise from unvalidated trust between processing stages become a practical concern. This paper presents a mechanism-centered taxonomy of 41 recurring risk patterns in commercial LLMs. The analysis shows that inputs are often interpreted non-neutrally and can trigger implementation-shaped responses or unintended state changes even without explicit commands. We argue that these behaviors constitute architectural failure modes and that string-level filtering alone is insufficient. To mitigate such cross-stage vulnerabilities, we recommend zero-trust architectural principles, including provenance enforcement, context sealing, and plan revalidation, and we introduce "Countermind" as a conceptual blueprint for implementing these defenses.

Paper Structure

This paper contains 806 sections, 8 equations, 4 figures, 102 tables.

Figures (4)

  • Figure 1: A diagram illustrating the Attack Graph, depicting the relationships and potential escalation pathways among the different risk pattern classes, as detailed in the Attack Graph section.
  • Figure 2: The diagram includes layers like the Multimodal Sandbox, Intent Routing, and the Core Large Language Model with Introspective Filters, with callouts indicating which threat classes are mitigated at each layer.
  • Figure 3: OCR Example 1
  • Figure 4: Visual Trigger