CAML: Collaborative Auxiliary Modality Learning for Multi-Agent Systems

Rui Liu, Yu Shen, Peng Gao, Pratap Tokekar, Ming Lin

TL;DR

CAML introduces a unified framework for multi-modal multi-agent learning in which agents share multi-modal data during training but can run inference with reduced modalities at test time. A teacher model aggregates cross-agent embeddings over the full set of modalities, and a student model distills this knowledge so that it can operate on limited modalities, which keeps performance robust in dynamic, resource-constrained environments. The approach yields large gains in accident detection for connected autonomous driving (up to a 58.1% improvement in accident detection rate, ADR) and in collaborative semantic segmentation for aerial-ground robots (up to a 10.6% improvement in mIoU), while improving communication efficiency relative to prior methods. These results highlight CAML’s practical value for safe, scalable deployment in real-world multi-agent sensing scenarios.

Abstract

Multi-modal learning has emerged as a key technique for improving performance across domains such as autonomous driving, robotics, and reasoning. However, in certain scenarios, particularly in resource-constrained environments, some modalities available during training may be absent during inference. While existing frameworks effectively utilize multiple data sources during training and enable inference with reduced modalities, they are primarily designed for single-agent settings. This poses a critical limitation in dynamic environments such as connected autonomous vehicles (CAV), where incomplete data coverage can lead to decision-making blind spots. Conversely, some works explore multi-agent collaboration but without addressing missing modality at test time. To overcome these limitations, we propose Collaborative Auxiliary Modality Learning (CAML), a novel multi-modal multi-agent framework that enables agents to collaborate and share multi-modal data during training, while allowing inference with reduced modalities during testing. Experimental results in collaborative decision-making for CAV in accident-prone scenarios demonstrate that CAML achieves up to a 58.1% improvement in accident detection. Additionally, we validate CAML on real-world aerial-ground robot data for collaborative semantic segmentation, achieving up to a 10.6% improvement in mIoU.

Paper Structure

This paper contains 37 sections, 2 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Illustration of CAML. CAML (1) enables multiple agents to collaborate and share multi-modal data during training while allowing inference with reduced modalities during testing, and (2) allows the number of agents to vary between training and testing, ensuring flexibility and robustness in deployment.
  • Figure 2: Overview of the Pipeline of CAML. The teacher model (top) aggregates and shares multi-modal embeddings across agents to make predictions using the full set of modalities. In contrast, the student model (bottom) processes a subset of modalities per agent and shares them to form a multi-modal embedding. Through knowledge distillation from the teacher, the student learns to produce robust predictions despite missing modalities, enabling effective inference during deployment. In the teacher model, the set of agents is denoted as ${\mathcal{A}}_\text{train} = \{{\mathcal{A}}_1, {\mathcal{A}}_2, \ldots, {\mathcal{A}}_N\}$ and the set of modalities as ${\mathcal{I}}_\text{train}$. The observations of all agents are denoted as $X = \{x_1, x_2, \ldots, x_N\}$, where $x_i^k$ is the observation acquired by the $i$-th agent ${\mathcal{A}}_i \in {\mathcal{A}}_\text{train}$ for the $k$-th modality. In the student model, the set of agents is denoted as ${\mathcal{A}}_\text{test} = \{{\mathcal{A}}_1, {\mathcal{A}}_2, \ldots, {\mathcal{A}}_M\}$ and the set of modalities as ${\mathcal{I}}_\text{test}$, which is a subset of ${\mathcal{I}}_\text{train}$. The set of agents that have access to the $j$-th modality ${\mathcal{I}}_j \in {\mathcal{I}}_\text{test}$ is denoted as ${\mathcal{A}}_\text{test}^{{\mathcal{I}}_j}$, and the number of agents in this set is $|{\mathcal{A}}_\text{test}^{{\mathcal{I}}_j}| = M_j$, where $x_{M_j}^j$ represents the observation acquired by the $M_j$-th agent ${\mathcal{A}}_{M_j} \in {\mathcal{A}}_\text{test}$ for the $j$-th modality.
  • Figure 3: Performance Comparison of CAML Against Baselines. We evaluate performance using two metrics, Accident Detection Rate (ADR) and Expert Imitation Rate (EIR), across three accident-prone scenarios: (a) Overtaking, (b) Left Turn, and (c) Red Light Violation. CAML outperforms the baselines in all scenarios, by up to ${\bf 58.1\%}$, benefiting considerably from multi-modal multi-agent collaboration.
  • Figure 4: Single-Agent System Generalizability of CAML. We evaluate the generalizability of CAML by testing the case where we have multi-agent collaboration during training, but only a single agent during testing. We compare the performance of our approach with COOPERNAUT and STGN under single-agent settings. CAML with a single agent during testing consistently outperforms the other baselines across all scenarios, offering a valuable and cost-effective solution for practical applications.
  • Figure 5: Qualitative results of different approaches to semantic segmentation on real-world data from aerial-ground robots in both indoor and outdoor environments. From left to right: input image for the ground robot, ground truth segmentation map, FCN prediction, AML prediction, and CAML prediction. The CAML prediction is the closest to the ground truth.
  • ...and 2 more figures
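
The teacher-student pipeline described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation. The element-wise mean used to aggregate embeddings across agents, the temperature-scaled KL divergence used as the distillation loss, and the shortcut of treating the aggregated embedding directly as class logits are all assumptions made for illustration, since the captions above do not specify them. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def mean_aggregate(embeddings: List[Vector]) -> Vector:
    """Element-wise mean over agents (assumed aggregation operator).

    Works for any number of agents >= 1, so the teacher (N agents)
    and the student (M agents) can use different agent counts.
    """
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def softmax(logits: Vector, temperature: float = 1.0) -> Vector:
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: Vector,
                      student_logits: Vector,
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions (assumed loss)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Toy data: per-agent embeddings, used directly as logits for simplicity.
# Teacher: N = 3 agents, each embedding built from the full modality set.
teacher_embeddings = [[2.0, 0.5, -1.0], [1.0, 1.5, 0.0], [3.0, 1.0, -2.0]]
# Student: M = 2 agents, each embedding built from a reduced modality set.
student_embeddings = [[1.5, 0.8, -0.5], [0.5, 1.2, 0.1]]

teacher_logits = mean_aggregate(teacher_embeddings)
student_logits = mean_aggregate(student_embeddings)
loss = distillation_loss(teacher_logits, student_logits)
print(f"distillation loss: {loss:.4f}")
```

In a real training loop this term would typically be combined with a task loss on the student's predictions, and only the student would be updated while the teacher stays fixed. That weighting is likewise an assumption here.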