H$^2$R: Hierarchical Hindsight Reflection for Multi-Task LLM Agents

Shicheng Ye; Chao Yu; Kaiqiang Ke; Chengdong Xu; Yinqi Wei

TL;DR

This work tackles coarse-grained knowledge transfer in multi-task LLM agents by introducing a hierarchical memory system that separates high-level planning memory from low-level execution memory. The central mechanism, Hierarchical Hindsight Reflection (H$^2$R), distills task-level strategies and subgoal-specific execution patterns from past interactions into structured memory units, enabling test-time retrieval that supports hierarchical decision making. Empirical results on AlfWorld and PDDLGame show that H$^2$R outperforms strong baselines like ReAct and ExpeL, with notable improvements in complex planning scenarios. The findings highlight the value of modular, level-specific memories and reflection-driven memory construction for robust, scalable multi-task reasoning with LLM agents.

Abstract

Large language model (LLM)-based agents have shown strong potential in multi-task scenarios, owing to their ability to transfer knowledge across diverse tasks. However, existing approaches often treat prior experiences and knowledge as monolithic units, leading to inefficient and coarse-grained knowledge transfer. In this work, we propose a novel hierarchical memory architecture that enables fine-grained knowledge transfer by decoupling high-level planning memory from low-level execution memory. To construct and refine these hierarchical memories, we introduce Hierarchical Hindsight Reflection (H$^2$R), a mechanism that distills reusable and hierarchical knowledge from past agent-environment interactions. At test time, H$^2$R retrieves high-level and low-level memories separately, allowing LLM-based agents to efficiently access and utilize task-relevant knowledge for new tasks. Experimental results across two benchmarks demonstrate that H$^2$R can improve generalization and decision-making performance, outperforming prior baselines such as ExpeL.
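The decoupled retrieval described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `MemoryUnit` and `HierarchicalMemory` names, the token-overlap (Jaccard) similarity, and the example memory entries are all assumptions made for illustration, and the retriever actually used by H$^2$R may differ.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryUnit:
    key: str      # task description (high level) or subgoal text (low level)
    insight: str  # distilled strategy or execution pattern


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for the real retriever (assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass
class HierarchicalMemory:
    # Two separate stores, so planning and execution knowledge never mix.
    high: list[MemoryUnit] = field(default_factory=list)
    low: list[MemoryUnit] = field(default_factory=list)

    @staticmethod
    def _top_k(store: list[MemoryUnit], query: str, k: int) -> list[MemoryUnit]:
        return sorted(store, key=lambda m: jaccard(query, m.key), reverse=True)[:k]

    def retrieve_high(self, task: str, k: int = 1) -> list[MemoryUnit]:
        """Query the planning memory with the task description."""
        return self._top_k(self.high, task, k)

    def retrieve_low(self, subgoal: str, k: int = 1) -> list[MemoryUnit]:
        """Query the execution memory with the current subgoal."""
        return self._top_k(self.low, subgoal, k)


# Hypothetical memory entries, written in the style of household tasks.
memory = HierarchicalMemory()
memory.high.append(MemoryUnit("put a clean mug on the shelf",
                              "find the object, clean it, then place it"))
memory.high.append(MemoryUnit("heat an egg and put it in the fridge",
                              "find the object, heat it, then place it"))
memory.low.append(MemoryUnit("clean mug", "go to the sinkbasin before cleaning"))
memory.low.append(MemoryUnit("heat egg", "go to the microwave before heating"))

plan_hint = memory.retrieve_high("put a clean plate on the table")[0]
step_hint = memory.retrieve_low("clean plate")[0]
```

Here the new task retrieves a planning hint from a different task with a similar structure, while the subgoal retrieves an execution hint independently of which task it originally came from. That independence is the point of keeping the two stores separate.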
