CAML: Collaborative Auxiliary Modality Learning for Multi-Agent Systems
Rui Liu, Yu Shen, Peng Gao, Pratap Tokekar, Ming Lin
TL;DR
CAML introduces a unified framework for multi-modal multi-agent learning in which agents share multi-modal data during training but can run inference with reduced modalities at test time. A teacher model aggregates cross-agent embeddings, and a student model distills this knowledge so that it can operate on limited modalities, giving robust performance in dynamic, resource-constrained environments. The approach yields large gains in accident detection for connected autonomous driving (up to a 58.1% improvement in ADR) and in collaborative semantic segmentation for aerial-ground robots (up to a 10.6% improvement in mIoU), while improving communication efficiency relative to prior methods. These results highlight CAML’s practical value for safe, scalable deployment in real-world multi-agent sensing scenarios.
Abstract
Multi-modal learning has emerged as a key technique for improving performance across domains such as autonomous driving, robotics, and reasoning. However, in certain scenarios, particularly in resource-constrained environments, some modalities available during training may be absent during inference. While existing frameworks effectively utilize multiple data sources during training and enable inference with reduced modalities, they are primarily designed for single-agent settings. This poses a critical limitation in dynamic environments such as connected autonomous vehicles (CAV), where incomplete data coverage can lead to decision-making blind spots. Conversely, other works explore multi-agent collaboration but do not address missing modalities at test time. To overcome these limitations, we propose Collaborative Auxiliary Modality Learning (CAML), a novel multi-modal multi-agent framework that enables agents to collaborate and share multi-modal data during training, while allowing inference with reduced modalities during testing. Experimental results on collaborative decision-making for CAV in accident-prone scenarios demonstrate that CAML achieves up to a 58.1% improvement in accident detection. Additionally, we validate CAML on real-world aerial-ground robot data for collaborative semantic segmentation, achieving up to a 10.6% improvement in mIoU.
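To make the teacher-student idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes mean pooling as the cross-agent aggregation and a standard temperature-scaled KL distillation term combined with cross-entropy; neither choice is confirmed by the text above. All function names and the parameters `temperature` and `alpha` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_embeddings(agent_embeddings):
    """Mean-pool equal-length embeddings from several agents.

    Assumption: mean pooling stands in for whatever aggregation
    the teacher actually uses.
    """
    count = len(agent_embeddings)
    dim = len(agent_embeddings[0])
    return [sum(e[i] for e in agent_embeddings) / count for i in range(dim)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target KL term with hard-label cross-entropy.

    teacher_logits: from the teacher, which sees all agents and modalities.
    student_logits: from the student, which sees reduced modalities.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # The T^2 factor keeps the soft-target term on a comparable scale.
    soft_term = kl_divergence(p_teacher, p_student) * temperature ** 2
    hard_term = -math.log(softmax(student_logits)[label] + 1e-12)
    return alpha * soft_term + (1.0 - alpha) * hard_term


# Toy usage: two agents contribute embeddings to the teacher side.
shared = aggregate_embeddings([[1.0, 2.0], [3.0, 4.0]])
loss = distillation_loss([2.0, 0.5, -1.0], [1.0, 0.8, -0.5], label=0)
```

At test time only the student would be run, so the cost of collecting every modality from every agent is paid during training only.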
