Cross-LLM Generalization of Behavioral Backdoor Detection in AI Agent Supply Chains

Arun Chowdary Sanna

TL;DR

The paper investigates cross-LLM generalization of behavioral backdoor detection in AI agent supply chains, revealing a substantial generalization gap: detectors trained on a single LLM achieve 92.7% within-model accuracy but only 49.2% across models. It identifies temporal features as the primary source of this gap due to high cross-model variability, while structural features remain stable. The authors demonstrate a practical mitigation through model-aware detection that conditions the classifier on the generating model, achieving 90.6% universal accuracy across six production LLMs. Open science contributions include releasing a multi-LLM trace dataset and detection framework to facilitate reproducible, cross-model security research. Collectively, the work highlights cross-LLM generalization as a fundamental challenge and provides a viable defense path for diverse enterprise AI ecosystems.
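As a minimal sketch of what "conditioning the classifier on the generating model" could look like, the snippet below appends a one-hot model indicator to a trace's behavioral feature vector before it is passed to a classifier. The one-hot encoding, the short model identifiers, the helper name `add_model_identity`, and the feature values are all assumptions made for illustration; the summary only states that model identity is used as an additional feature.

```python
from typing import Dict, List, Sequence

# Hypothetical short identifiers for the six LLMs named in the paper.
MODELS: List[str] = [
    "gpt-5.1",
    "claude-sonnet-4.5",
    "grok-4.1",
    "llama-4-maverick",
    "gpt-oss-120b",
    "deepseek-chat-v3.1",
]
MODEL_INDEX: Dict[str, int] = {name: i for i, name in enumerate(MODELS)}


def add_model_identity(features: Sequence[float], model: str) -> List[float]:
    """Append a one-hot encoding of the generating model to a feature vector.

    The one-hot scheme is an assumption; the source does not say how
    model identity is encoded.
    """
    if model not in MODEL_INDEX:
        raise ValueError(f"unknown model: {model!r}")
    one_hot = [0.0] * len(MODELS)
    one_hot[MODEL_INDEX[model]] = 1.0
    return list(features) + one_hot


# Illustrative feature values only, not taken from the released dataset.
trace_features = [0.42, 3.0, 17.5]
augmented = add_model_identity(trace_features, "grok-4.1")
print(augmented)
```

A single detector trained on such augmented vectors can learn model-specific decision boundaries, which is one plausible reading of why a model-aware detector would avoid the cross-model collapse seen with single-model detectors.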

Abstract

As AI agents become integral to enterprise workflows, their reliance on shared tool libraries and pre-trained components creates significant supply chain vulnerabilities. While previous work has demonstrated behavioral backdoor detection within individual LLM architectures, the critical question of cross-LLM generalization remains unexplored, a gap with serious implications for organizations deploying multiple AI systems. We present the first systematic study of cross-LLM behavioral backdoor detection, evaluating generalization across six production LLMs (GPT-5.1, Claude Sonnet 4.5, Grok 4.1, Llama 4 Maverick, GPT-OSS 120B, and DeepSeek Chat V3.1). Through 1,198 execution traces and 36 cross-model experiments, we quantify a critical finding: single-model detectors achieve 92.7% accuracy within their training distribution but only 49.2% across different LLMs, a 43.4 percentage point generalization gap that leaves cross-model detection no better than random guessing. Our analysis reveals that this gap stems from model-specific behavioral signatures, particularly in temporal features (coefficient of variation > 0.8), while structural features remain stable across architectures. We show that model-aware detection, which incorporates model identity as an additional feature, achieves 90.6% accuracy across all evaluated models. We release our multi-LLM trace dataset and detection framework to enable reproducible research.
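The coefficient of variation (standard deviation divided by the mean) mentioned above can be sketched as follows. This is an illustration under stated assumptions: it computes the CV over per-model mean values of a single feature using the population standard deviation, and the feature names and numbers are hypothetical. The abstract does not specify exactly how the authors compute the statistic.

```python
from statistics import mean, pstdev
from typing import Dict


def coefficient_of_variation(per_model_means: Dict[str, float]) -> float:
    """Return std / |mean| of one feature's per-model mean values.

    Uses the population standard deviation; this choice is an assumption.
    """
    values = list(per_model_means.values())
    mu = mean(values)
    if mu == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(values) / abs(mu)


# Hypothetical per-model means for one temporal and one structural feature.
temporal = {"m1": 1.0, "m2": 1.0, "m3": 1.0, "m4": 1.0, "m5": 1.0, "m6": 7.0}
structural = {"m1": 9.0, "m2": 10.0, "m3": 11.0, "m4": 10.0, "m5": 9.0, "m6": 11.0}

temporal_cv = coefficient_of_variation(temporal)
structural_cv = coefficient_of_variation(structural)
print(f"temporal CV:   {temporal_cv:.3f}")
print(f"structural CV: {structural_cv:.3f}")
```

With these made-up numbers the temporal feature exceeds the 0.8 threshold quoted in the abstract while the structural feature stays far below it, which mirrors the qualitative contrast the authors describe.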

Paper Structure

This paper contains 99 sections, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Cross-LLM detection accuracy matrix. Diagonal (blue boxes): same-model detection averaging 92.7%. Off-diagonal: cross-model detection averaging 49.2% (equivalent to random guessing).
  • Figure 2: Ensemble approach comparison. Model-aware detection (rightmost) achieves 90.6% universal accuracy, outperforming all alternatives.
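
The within-model and cross-model averages reported for Figure 1 correspond to the diagonal and off-diagonal means of a train-model by test-model accuracy matrix. The sketch below shows that computation on a small hypothetical 3x3 matrix; the values are illustrative and are not the paper's results, and the function name `summarize_matrix` is introduced here only for the example.

```python
from statistics import mean
from typing import List, Tuple


def summarize_matrix(acc: List[List[float]]) -> Tuple[float, float]:
    """Return (mean diagonal, mean off-diagonal) of a square accuracy matrix.

    acc[i][j] is the accuracy of a detector trained on model i and
    evaluated on traces from model j.
    """
    n = len(acc)
    if any(len(row) != n for row in acc):
        raise ValueError("accuracy matrix must be square")
    diagonal = [acc[i][i] for i in range(n)]
    off_diagonal = [acc[i][j] for i in range(n) for j in range(n) if i != j]
    return mean(diagonal), mean(off_diagonal)


# Hypothetical accuracies for three models (rows: train, columns: test).
example = [
    [0.90, 0.50, 0.50],
    [0.50, 0.95, 0.50],
    [0.50, 0.50, 0.85],
]
within, cross = summarize_matrix(example)
gap_points = (within - cross) * 100  # generalization gap in percentage points
print(f"within={within:.3f} cross={cross:.3f} gap={gap_points:.1f} pp")
```

For the six LLMs in the study, the same computation would run over a 6x6 matrix, matching the 36 cross-model experiments mentioned in the abstract.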