Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning

Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, Ziwei Liu

TL;DR

This work addresses the challenge of reasoning over ultra-long egocentric videos, where questions require evidence spanning days or weeks. It proposes Ego-R1, a Chain-of-Tool-Thought (CoTT) framework in which an RL-trained controller dynamically calls specialized perceptual tools (Hierarchical RAG, a Video-LLM, and a general Vision-Language Model). Training proceeds in two stages: supervised fine-tuning on CoTT data, followed by reinforcement learning with Group Relative Policy Optimization (GRPO). Together with the Ego-R1 Data and Ego-R1 Bench, this regimen enables scalable, interpretable long-duration reasoning. Empirical results show strong performance on both exocentric and egocentric long-video benchmarks, with notable gains from dynamic tool calling and multi-turn CoTT reasoning, suggesting practical implications for long-term, life-oriented AI assistants and human-AI collaboration in open-world settings.

Abstract

We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., in days and weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process, orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking specific tools, one per step, to iteratively and collaboratively answer sub-questions tackling such tasks as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm involving supervised finetuning (SFT) of a pretrained language model using CoTT data and RL to enable our agent to dynamically propose step-by-step tools for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our Ego-R1 Agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning by our Ego-R1 Agent can effectively tackle the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from a few hours to a week.
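
The one-tool-per-step loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `run_cott`, `Step`, the `(thought, action, argument)` policy interface, the `"answer"` action, and the stub tools are all assumptions made for illustration, and the scripted policy stands in for the trained Ego-R1 Agent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interface: a tool maps a textual query to a textual observation.
Tool = Callable[[str], str]


@dataclass
class Step:
    thought: str       # the agent's reasoning before acting
    tool: str          # name of the tool invoked at this step
    query: str         # argument passed to the tool
    observation: str   # what the tool returned


# Hypothetical interface: the policy sees the question and the trace so far,
# and returns (thought, action, argument). Action "answer" ends the episode.
Policy = Callable[[str, List[Step]], Tuple[str, str, str]]


def run_cott(question: str, policy: Policy, tools: Dict[str, Tool],
             max_steps: int = 5) -> Tuple[Optional[str], List[Step]]:
    """Run a think -> call one tool -> observe loop until the policy answers."""
    trace: List[Step] = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, trace)
        if action == "answer":
            return arg, trace
        if action in tools:
            observation = tools[action](arg)
        else:
            observation = f"unknown tool: {action}"
        trace.append(Step(thought, action, arg, observation))
    return None, trace  # step budget exhausted without an answer


# Stand-in for the trained agent: a fixed script of two tool calls, then answer.
def scripted_policy(question: str, trace: List[Step]) -> Tuple[str, str, str]:
    if len(trace) == 0:
        return ("Locate the relevant time span first.", "rag", "coffee machine")
    if len(trace) == 1:
        return ("Inspect the retrieved clip.", "video_llm", trace[-1].observation)
    return ("Enough evidence gathered.", "answer", "B")


# Stub tools; real tools would query a memory bank or a video model.
stub_tools: Dict[str, Tool] = {
    "rag": lambda q: "DAY2 clip about " + q,
    "video_llm": lambda q: "person uses machine in " + q,
}
```

Calling `run_cott("Who used the coffee machine?", scripted_policy, stub_tools)` returns the final answer together with the full trace, which is what makes the reasoning chain inspectable step by step.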

Paper Structure

This paper contains 32 sections, 2 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview of Ego-R1. In this figure, we demonstrate how the Ego-R1 Agent orchestrates specialized tools (e.g., Hierarchical_RAG, Video LLM, and VLM) to answer the question step-by-step, based on the observations and previous actions. The system effectively answers questions that require careful searching within ultra-long videos and precise analysis of frame details.
  • Figure 2: Data generation pipeline of the Ego-R1 Data. We first obtained raw QA pairs from both AI-generated and human-annotated sources based on 6 raw videos collected from 6 participants and the corresponding log. The verified and processed Multiple Choice Questions (MCQs) serve as the foundation of the Ego-R1 Data (left). We take questions without answers for Chain-of-Tool-Thought (CoTT) generation, which involves creating reasoning chains that include explicit thinking steps and dynamic tool-calling sequences (right).
  • Figure 3: Overview of the two-stage training strategies in Ego-R1. Ego-R1 employs a two-stage training approach: Stage 1 utilizes supervised fine-tuning with CoTT data to establish structured tool-calling capabilities, while Stage 2 applies multi-turn reinforcement learning with rule-based rewards to optimize iterative reasoning and tool execution across diverse question types.
  • Figure 4: Overview of the Hierarchical RAG system. Starting from the raw video and its 30-second clips, we build a memory bank for each video that ranges from 30-second-level summaries up to day-level summaries. During keyword retrieval, the system searches efficiently by starting with day-level summaries and drilling down to 10-minute segments as needed.
  • Figure 5: Qualitative comparison with Video-R1. Cases 1-3 illustrate successful examples where Ego-R1 outperforms Video-R1 by producing more detailed, interpretable step-by-step reasoning chains through dynamic tool-calling. In contrast, Case 4 highlights a failure case of the Ego-R1 Agent. Although the observation in Step 1 correctly identified relevant information near timestamp DAY2_15500000, the subsequent tool call failed to adjust the temporal range accordingly, so the next step retrieved incorrect or suboptimal content and the final answer was wrong.
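
The coarse-to-fine retrieval described in the Figure 4 caption can be sketched as a search over a tree of summaries. The sketch below is illustrative only: the `Node` structure, the label format, and the substring-based `matches` test are assumptions, and the example tree uses just two levels (day and 10-minute segment), whereas the caption indicates the memory bank also holds finer-grained summaries.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str                      # hypothetical naming, e.g. "DAY1" or "DAY1 18:10-18:20"
    summary: str                    # text summary at this level of granularity
    children: List["Node"] = field(default_factory=list)


def matches(summary: str, keywords: List[str]) -> bool:
    """Stand-in relevance test: case-insensitive substring match on any keyword."""
    text = summary.lower()
    return any(k.lower() in text for k in keywords)


def search(nodes: List[Node], keywords: List[str]) -> List[str]:
    """Start at the coarsest summaries and descend only into matching branches."""
    hits: List[str] = []
    for node in nodes:
        if not matches(node.summary, keywords):
            continue  # prune the whole branch without reading its children
        if node.children:
            hits.extend(search(node.children, keywords))
        else:
            hits.append(node.label)  # finest level reached
    return hits


# Toy memory bank with invented summaries.
memory = [
    Node("DAY1", "Shopping for groceries and cooking dinner", [
        Node("DAY1 18:00-18:10", "Chopping vegetables for dinner"),
        Node("DAY1 18:10-18:20", "Cooking dinner on the stove"),
    ]),
    Node("DAY2", "Working at a desk and a video call", [
        Node("DAY2 09:00-09:10", "Typing an email"),
    ]),
]
```

With this toy data, `search(memory, ["cooking"])` skips DAY2 entirely after reading its day-level summary and returns only the matching 10-minute segment in DAY1. The trade-off of this pruning is that a detail missing from a coarse summary cannot be found at finer levels.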