
Where Did It All Go Wrong? A Hierarchical Look into Multi-Agent Error Attribution

Adi Banerjee, Anirudh Nair, Tarik Borogovac

TL;DR

This work tackles the challenge of attributing errors in large language model–driven multi-agent systems, where errors can propagate across agents and steps. It introduces ECHO, a framework that integrates a four-layer hierarchical context representation with a panel of diverse objective analysts and a confidence-weighted consensus voting mechanism to attribute errors at both agent and step levels. Empirical results on the Who&When benchmark show that ECHO significantly outperforms traditional all-at-once, step-by-step, and binary-search baselines, achieving robust agent-level accuracy around 0.68 and improving step-level attribution with tolerance windows. The approach offers a scalable, bias-mitigating debugging paradigm for complex multi-agent AI deployments and opens avenues for further enhancements in dynamic context relevance, multi-agent debate, and partial correctness evaluation.

Abstract

Error attribution in Large Language Model (LLM) multi-agent systems presents a significant challenge in debugging and improving collaborative AI systems. Current approaches to pinpointing agent-level and step-level failures in interaction traces (all-at-once evaluation, step-by-step analysis, and binary search) fall short when analyzing complex patterns, struggling with both accuracy and consistency. We present ECHO (Error attribution through Contextual Hierarchy and Objective consensus analysis), a novel algorithm that combines hierarchical context representation, objective analysis-based evaluation, and consensus voting to improve error attribution accuracy. Our approach leverages a position-based leveling of contextual understanding while maintaining objective evaluation criteria, ultimately reaching conclusions through a consensus mechanism. Experimental results demonstrate that ECHO outperforms existing methods across various multi-agent interaction scenarios, showing particular strength in cases involving subtle reasoning errors and complex interdependencies. Our findings suggest that structured, hierarchical context representation, combined with consensus-based objective decision-making, provides a more robust framework for error attribution in multi-agent systems.

Paper Structure

This paper contains 22 sections, 1 figure, 3 tables, 1 algorithm.

Figures (1)

  • Figure 1: ECHO Architecture. The system comprises: (1) Hierarchical Context, which processes traces through 4 compression layers (L1-L4: full content → milestones) with specialized modules for handoffs, decisions, errors, and patterns; (2) Decoupled Analysis, which uses 6 specialized agents (conservative to balanced) generating structured outputs with evidence, confidence scores, and hypotheses; (3) Consensus Voting, which aggregates analyses via confidence-weighted voting and disagreement resolution.
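
The consensus-voting stage named in the caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Analysis` record, the `weighted_vote` function, the analyst labels, and the agent names are all hypothetical, and the sketch assumes each analyst reports one blamed agent, one blamed step, and a confidence in [0, 1]. It only shows the general idea of confidence-weighted voting and leaves out the disagreement-resolution step entirely.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One analyst's verdict (hypothetical structure, not taken from the paper)."""
    analyst: str       # label of the analyst that produced this verdict
    agent: str         # agent blamed for the failure
    step: int          # step index blamed for the failure
    confidence: float  # self-reported confidence, assumed to lie in [0, 1]


def weighted_vote(analyses):
    """Pick (agent, step) by summing confidences instead of counting votes.

    The agent with the largest total confidence wins; the step is then
    chosen the same way among the analyses that blamed that agent.
    Ties go to the candidate that appears first in the input.
    """
    if not analyses:
        raise ValueError("need at least one analysis")

    agent_scores = defaultdict(float)
    for a in analyses:
        agent_scores[a.agent] += a.confidence
    best_agent = max(agent_scores, key=agent_scores.get)

    step_scores = defaultdict(float)
    for a in analyses:
        if a.agent == best_agent:
            step_scores[a.step] += a.confidence
    best_step = max(step_scores, key=step_scores.get)

    return best_agent, best_step


# Illustrative panel: analyst labels and agent names are made up.
panel = [
    Analysis("conservative", "coder", 4, 0.9),
    Analysis("balanced", "coder", 5, 0.6),
    Analysis("analyst_3", "planner", 2, 0.7),
    Analysis("analyst_4", "coder", 4, 0.5),
]
print(weighted_vote(panel))
```

Weighting by confidence means a single highly confident analyst can outweigh several uncertain ones, which is the main behavioral difference from a plain majority vote.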