HiMemFormer: Hierarchical Memory-Aware Transformer for Multi-Agent Action Anticipation

Zirui Wang, Xinran Zhao, Simon Stepputtis, Woojun Kim, Tongshuang Wu, Katia Sycara, Yaqi Xie

TL;DR

This paper tackles multi-agent action anticipation, where prior models make limited use of inter-agent interactions and long-range context. It introduces HiMemFormer, a transformer-based architecture with a dual-hierarchical memory mechanism: a global memory module that aggregates joint history through an Agent-to-Context Encoder, and a Context-to-Agent Decoder that performs coarse-to-fine refinement to produce agent-specific forecasts. The memory flow involves $M_L^{(a)}$, $M_L^{(c)}$, $\widehat{M}_L$, $M_S^{(c)}$, $M_S^{(a)}$, and learnable tokens $Q_F$, $Q'_F$. Empirical results on the LEMMA dataset show consistent gains over baselines such as LSTR and MAT across four scenarios, with HiMemFormer+ achieving further improvements. The work highlights the importance of modeling both long-term joint context and agent-specific short-term cues, advancing capabilities for safe and coordinated multi-agent systems.

Abstract

Understanding and predicting human actions has been a long-standing challenge and is a crucial measure of perception in robotics and AI. While significant progress has been made in anticipating the future actions of individual agents, prior work has largely overlooked a key aspect of real-world human activity: interactions. To address this gap in human-like forecasting within multi-agent environments, we present the Hierarchical Memory-Aware Transformer (HiMemFormer), a transformer-based model for online multi-agent action anticipation. HiMemFormer integrates and distributes global memory that captures joint historical information across all agents through a transformer framework, with a hierarchical local memory decoder that interprets agent-specific features based on these global representations using a coarse-to-fine strategy. In contrast to previous approaches, HiMemFormer applies the global context hierarchically, together with agent-specific preferences, to avoid noisy or redundant information in multi-agent action anticipation. Extensive experiments on various multi-agent scenarios demonstrate that HiMemFormer performs strongly compared with other state-of-the-art methods.

Paper Structure

This paper contains 26 sections, 5 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: HiMemFormer architecture. In the Agent-to-Context Encoder, the observed agent's long-term memory is encoded into an abstract representation $\hat{\mathbf{M}}_L^{(a)}$ and cross-attended with the context's past history $\hat{\mathbf{M}}_L^{(c)}$. The Context-to-Agent Decoder then uses both the agent's and the global recent memories to learn future information through a two-stage refinement approach.
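
The memory flow described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head attention with no learned projections, layer norms, or feed-forward blocks, and it assumes that the coarse stage reads the context's short-term memory while the fine stage reads the agent's. All function names, token counts, and the use of the coarse output as the refined queries are hypothetical choices made for illustration only.

```python
import math
import random
from typing import List

Matrix = List[List[float]]  # rows are tokens, columns are feature dimensions


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, memory: Matrix) -> Matrix:
    """Scaled dot-product attention where memory serves as both keys and values.

    Assumption: no learned projections, so this only shows the data flow.
    """
    dim = len(memory[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(dim)
                  for key in memory]
        weights = softmax(scores)
        attended.append([sum(w * row[j] for w, row in zip(weights, memory))
                         for j in range(dim)])
    return attended


def random_tokens(count: int, dim: int, rng: random.Random) -> Matrix:
    """Stand-in for extracted features or learnable tokens (hypothetical)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


def agent_to_context_encoder(m_l_a: Matrix, m_l_c: Matrix,
                             latent_queries: Matrix) -> Matrix:
    """Compress the agent's long-term memory, then cross-attend with context."""
    m_hat_l_a = cross_attention(latent_queries, m_l_a)  # abstract agent memory
    return cross_attention(m_hat_l_a, m_l_c)            # fused long-term memory


def context_to_agent_decoder(m_hat_l: Matrix, m_s_c: Matrix, m_s_a: Matrix,
                             q_f: Matrix) -> Matrix:
    """Two-stage refinement: coarse (context) first, then fine (agent)."""
    coarse = cross_attention(q_f, m_hat_l + m_s_c)   # list concat along tokens
    # Assumption: the coarse output plays the role of the refined queries Q'_F.
    return cross_attention(coarse, m_hat_l + m_s_a)


rng = random.Random(0)
dim = 8
m_l_a = random_tokens(16, dim, rng)   # agent long-term memory
m_l_c = random_tokens(16, dim, rng)   # context long-term memory
m_s_c = random_tokens(4, dim, rng)    # context short-term memory
m_s_a = random_tokens(4, dim, rng)    # agent short-term memory
latents = random_tokens(4, dim, rng)  # hypothetical latent queries
q_f = random_tokens(3, dim, rng)      # future tokens Q_F

m_hat_l = agent_to_context_encoder(m_l_a, m_l_c, latents)
forecast = context_to_agent_decoder(m_hat_l, m_s_c, m_s_a, q_f)
print(len(forecast), len(forecast[0]))  # 3 8
```

Each of the three future tokens ends up as a weighted mixture of the fused long-term memory and the agent's short-term memory, which is the coarse-to-fine narrowing the caption describes. A real model would add learned projections and a classification head on top of these tokens.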