Who Can We Trust? Scope-Aware Video Moment Retrieval with Multi-Agent Conflict

Chaochen Wu, Guan Luo, Meiyun Zuo, Zhitao Fan

TL;DR

This paper addresses text-driven video moment retrieval by introducing an evidential multi-agent reinforcement learning framework. It combines ESRL, which scans videos with a fixed window and outputs locational evidence, with MARLCC, a tri-agent system that uses inter-agent conflict and trusted IoU to select the best localization and to detect out-of-scope (OOS) queries without extra training. The approach achieves state-of-the-art results among RL-based methods on Charades-STA and ActivityNet-Captions and demonstrates robust zero-shot OOS detection. Using evidential learning to quantify uncertainty and manage competition among agents provides a principled mechanism for improving localization accuracy and reliability in real-world video retrieval tasks.

Abstract

Video moment retrieval uses a text query to locate a moment in a given untrimmed reference video. Locating video moments with text queries helps people interact with videos efficiently. Current solutions for this task have not considered conflicts among the localization results of different models, so the models cannot be integrated correctly to produce better results. This study introduces a reinforcement learning-based video moment retrieval model that scans the whole video once to find the moment's boundary while producing its locational evidence. Moreover, we propose a multi-agent system framework that uses evidential learning to resolve conflicts between agents' localization outputs. As a by-product of observing and handling conflicts between agents, we can decide whether a query has no corresponding moment in a video (out-of-scope) without additional training, which is suitable for real-world applications. Extensive experiments on benchmark datasets show the effectiveness of our proposed methods compared with state-of-the-art approaches. Furthermore, our results reveal that modeling competition and conflict in the multi-agent system is an effective way to improve RL performance in moment retrieval, and they show a new role for evidential learning in the multi-agent framework.

Paper Structure

This paper contains 12 sections, 26 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: An example of video moment retrieval.
  • Figure 2: Method overview. The left part shows the actions and outputs of ESRL; the right part shows the architecture of MARLCC. ESRL produces an IoU, a location, and evidence at each step, and moves rightward through the video with a fixed window size and step size. In MARLCC, ESRL is one of the three agents. MARLCC uses the agents' trusted IoU to select the best result from their competition; in the 2DSTB map, this is the agent "closest" to the ground truth. MARLCC can also use conflict to detect OOS queries; in the 2DSTB map, an OOS query shows higher conflict than matched queries.
  • Figure 3: ESRL architecture and two-dimensional representation of locational evidence.
  • Figure 4: MR examples with OOS queries and 2DSTB map visualizations of the agents' actions.
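
The Figure 2 caption mentions three building blocks that are easy to illustrate: temporal IoU between two moments, a fixed-window rightward scan, and choosing one agent by its trusted IoU. The Python sketch below is only an illustration of those ideas under simple assumptions, not the authors' implementation. The names `temporal_iou`, `fixed_windows`, and `select_agent`, the agent labels, and all numeric values are hypothetical. How the trusted IoU is actually derived from evidence is not described in this summary, so the scores are treated as given inputs.

```python
from typing import Dict, List, Tuple

Span = Tuple[float, float]  # (start, end) of a moment, in seconds


def temporal_iou(a: Span, b: Span) -> float:
    """Standard temporal intersection-over-union of two moments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def fixed_windows(duration: float, window: float, step: float) -> List[Span]:
    """Enumerate windows scanned left to right with a fixed size and step.

    Assumption: the last windows are clipped at the end of the video.
    """
    spans = []
    start = 0.0
    while start < duration:
        spans.append((start, min(start + window, duration)))
        start += step
    return spans


def select_agent(trusted_iou: Dict[str, float]) -> str:
    """Pick the agent whose (given) trusted IoU is highest."""
    return max(trusted_iou, key=trusted_iou.get)


if __name__ == "__main__":
    # Hypothetical 10-second video scanned with a 4-second window, 2-second step.
    for span in fixed_windows(10.0, 4.0, 2.0):
        print(span, round(temporal_iou(span, (3.0, 7.0)), 3))
    # Hypothetical trusted-IoU scores for three agents.
    print(select_agent({"ESRL": 0.6, "agent_b": 0.7, "agent_c": 0.5}))
```

Clipping the final windows is one possible design choice; an alternative sketch could drop windows that overrun the video instead.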