Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems

Ziyuan Yang, Wenxuan Ding, Shangbin Feng, Yulia Tsvetkov

TL;DR

The paper studies safety risks in decentralized multi-LLM collaboration. It constructs four threat models (M1–M4) and evaluates their impact across API-, text-, logit-, and weight-level collaboration on ten datasets, finding substantial performance degradation in the reasoning and safety domains. It introduces two mitigation strategies, supervisor-free internal checks and supervisor-based external evaluation (LLM-as-judge or reward models), and shows that these defenses recover about $95.3\%$ of the initial performance on average, though full resistance remains an open problem. Malicious models built with activation steering and RL produce the strongest adverse effects, while prompting-based attacks are comparatively mild; robustness varies across domains, with CocoNot and HumanEval proving notably challenging. Overall, the work highlights safety considerations for open, decentralized AI, provides practical defense mechanisms, and outlines directions for future research on malicious generalization and stronger defenses.

Abstract

Language models (LMs) are increasingly used in collaboration: multiple LMs trained by different parties collaborate through routing systems, multi-agent debate, model merging, and more. Critical safety risks remain in this decentralized paradigm: what if some of the models in multi-LLM systems are compromised or malicious? We first quantify the impact of malicious models by engineering four categories of malicious LMs, plugging them into four types of popular model collaboration systems, and evaluating the compromised systems across 10 datasets. We find that malicious models have a severe impact on multi-LLM systems, especially in the reasoning and safety domains, where performance is lowered by 7.12% and 7.94% on average, respectively. We then propose mitigation strategies that employ external supervisors to oversee model collaboration and disable or mask out malicious components, reducing their influence. On average, these strategies recover 95.31% of the initial performance, but making model collaboration systems fully resistant to malicious models remains an open research question.
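
To make the supervisor-based mitigation described above more concrete, the following is a minimal sketch for a text-level collaboration system that aggregates answers by majority vote. It is an illustration under stated assumptions, not the paper's implementation: the function `supervised_vote`, the `threshold` parameter, the fallback rule, and the toy judge are all hypothetical, and a real supervisor would be an LLM-as-judge or a reward model rather than a hard-coded scorer.

```python
from collections import Counter
from typing import Callable, Dict


def supervised_vote(
    question: str,
    answers: Dict[str, str],
    judge: Callable[[str, str], float],
    threshold: float = 0.5,
) -> str:
    """Majority vote over answers that an external supervisor trusts.

    `judge` is a hypothetical supervisor mapping (question, answer) to a
    score in [0, 1]; `threshold` is an assumed cutoff, not a value from
    the paper.
    """
    # Score each model's answer with the external supervisor.
    scores = {name: judge(question, ans) for name, ans in answers.items()}
    # Mask out models whose answers fall below the cutoff.
    kept = [ans for name, ans in answers.items() if scores[name] >= threshold]
    # Assumed fallback: if every model is masked, keep the best-scored answer.
    if not kept:
        best = max(scores, key=scores.get)
        return answers[best]
    # Aggregate the surviving answers by plain majority vote.
    return Counter(kept).most_common(1)[0][0]


# Toy setup: three malicious models outvote two honest ones.
answers = {
    "honest_1": "4",
    "honest_2": "4",
    "malicious_1": "5",
    "malicious_2": "5",
    "malicious_3": "5",
}


def toy_judge(question: str, answer: str) -> float:
    # Stand-in for an LLM-as-judge or reward model (assumption).
    return 1.0 if answer == "4" else 0.0


print(supervised_vote("What is 2 + 2?", answers, toy_judge))  # prints 4
```

In this toy case an unsupervised majority vote would return "5", while masking by supervisor score returns "4". The sketch also shows why full resistance is hard: the outcome is only as reliable as the supervisor's scores.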

Paper Structure (22 sections, 2 equations, 4 figures, 15 tables)

Figures (4)

  • Figure 1: We study the impact of malicious models in four levels of multi-LLM collaboration systems. We construct malicious LLMs via non-parametric and parametric methods, evaluate their impact across four types of model collaboration systems, and propose both supervisor-free and supervisor-based mitigation strategies that effectively identify malicious models and recover collaboration performance.
  • Figure 2: We show how malicious task diversity affects collaboration system performance. As malicious diversity decreases, collaboration performance generally degrades.
  • Figure 3: We show how the number of malicious models influences collaboration performance. As the number of malicious models increases, collaboration performance generally degrades.
  • Figure 4: Impact of out-of-domain SFT malicious models on collaboration system performance. Boxes on the diagonal (w/ red boundaries) indicate in-domain SFT. While cross-domain malicious models also degrade collaboration performance, their impact is generally weaker than that of in-domain SFT malicious models.