Reward Prediction with Factorized World States

Yijun Shen; Delong Chen; Xianming Hu; Jiaming Mi; Hongbo Zhao; Kai Zhang; Pascale Fung

TL;DR

This paper investigates whether well-defined world state representations alone can enable accurate reward prediction across domains, and shows promising zero-shot results against both VLWM-critic and LLM-as-a-Judge reward models.

Abstract

Agents must infer action outcomes and select actions that maximize a reward signal indicating how close the goal is to being reached. Supervised learning of reward models could introduce biases inherent to training data, limiting generalization to novel goals and environments. In this paper, we investigate whether well-defined world state representations alone can enable accurate reward prediction across domains. To answer this question, we introduce StateFactory, a factorized representation method that transforms unstructured observations into a hierarchical object-attribute structure using language models. This structured representation allows rewards to be estimated naturally as the semantic similarity between the current state and the goal state under a hierarchical constraint. Overall, the compact representation structure induced by StateFactory enables strong reward generalization capabilities. We evaluate on RewardPrediction, a new benchmark dataset spanning five diverse domains and comprising 2,454 unique action-observation trajectories with step-wise ground-truth rewards. Our method shows promising zero-shot results against both VLWM-critic and LLM-as-a-Judge reward models, achieving 60% and 8% lower EPIC distance, respectively. Furthermore, this superior reward quality successfully translates into improved agent planning performance, yielding success rate gains of +21.64% on AlfWorld and +12.40% on ScienceWorld over reactive system-1 policies and enhancing system-2 agent planning. Project Page: https://statefactory.github.io
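The idea of scoring progress as similarity between a factorized state and a goal state can be illustrated with a minimal Python sketch. This is not the StateFactory implementation: the `State` layout, the function names, and the token-overlap similarity (a toy stand-in for an embedding model) are all assumptions made for illustration, as is the reading of the hierarchical constraint as "compare attributes only between matched objects".

```python
from typing import Dict

# Hypothetical layout: object name -> attribute name -> attribute value.
State = Dict[str, Dict[str, str]]


def text_similarity(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity: Jaccard overlap of lowercase tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def attribute_score(goal_attrs: Dict[str, str], state_attrs: Dict[str, str]) -> float:
    """Mean similarity over the attributes the goal cares about."""
    if not goal_attrs:
        return 1.0
    scores = [
        text_similarity(value, state_attrs.get(name, ""))
        for name, value in goal_attrs.items()
    ]
    return sum(scores) / len(scores)


def predict_reward(state: State, goal: State) -> float:
    """Average, over goal objects, of the best object-level match in the state.

    Attributes are only compared between objects whose names match, which is
    one assumed way to impose a hierarchical (object, then attribute) constraint.
    """
    if not goal:
        return 0.0
    total = 0.0
    for goal_obj, goal_attrs in goal.items():
        best = 0.0
        for obj, attrs in state.items():
            name_sim = text_similarity(goal_obj, obj)
            if name_sim == 0.0:
                continue  # unrelated object: its attributes are never compared
            best = max(best, name_sim * attribute_score(goal_attrs, attrs))
        total += best
    return total / len(goal)


# Illustrative trajectory for a made-up goal "heat the mug in the microwave".
goal = {"mug": {"location": "in microwave", "temperature": "hot"}}
trajectory = [
    {"mug": {"location": "on table", "temperature": "cold"},
     "microwave": {"door": "closed"}},
    {"mug": {"location": "in microwave", "temperature": "cold"}},
    {"mug": {"location": "in microwave", "temperature": "hot"}},
]
rewards = [predict_reward(state, goal) for state in trajectory]
print(rewards)  # [0.0, 0.5, 1.0]
```

In this toy example the reward rises step by step as more goal attributes are satisfied, which is the kind of dense progress signal the abstract describes. A real system would replace `text_similarity` with a learned semantic similarity.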

Paper Structure (46 sections, 15 equations, 5 figures, 33 tables)

Introduction
The RewardPrediction Benchmark
Formulation
Implementation
Reward Prediction Methods
Representation-free Methods
Finetuned Language Models
LLM-as-a-Judge Prompting
Representation-based Method (StateFactory)
State Extraction
Goal Interpretation
Hierarchical Routing
Experiments
Baselines and Setup
Main results on RewardPrediction
...and 31 more sections

Figures (5)

Figure 1: RewardPrediction benchmark Overview. Given a textual goal description, the model computes step-wise progress estimates $[\hat{r}_t]_{t=0}^n$ from sequences of action-observation pairs. The reward predictions are compared against ground-truth $[r_t]_{t=0}^n$ using EPIC distance (Gleave et al., 2020) to quantify alignment.
Figure 2: RewardPrediction Benchmark Overview. Representative trajectories across five diverse domains. Each column displays a task instance with actions, observations, and the corresponding ground-truth task progress score ($R \in [0, 1]$) at key time steps.
Figure 3: Two paradigms of reward prediction. Unlike representation-free methods that regress rewards directly from raw inputs (top), representation-based frameworks derive progress signals by measuring the alignment between factorized state representations $s_t$ and goal interpretations $g_t$ (bottom).
Figure 4: StateFactory Framework. State extraction and goal interpretation are coupled recurrent processes (left). Our approach factorizes states into explicit objects and attributes (right), deriving dense rewards from semantic similarity between $\hat{s}_t$ and $\hat{g}_t$.
Figure 5: StateFactory ablation experiments. We report the EPIC distance ($D_{\text{EPIC}}$) across settings (a-d) and Triplet-based Accuracy for embedding models (e). Lower $D_{\text{EPIC}}$ and higher Accuracy indicate better performance. Triplet-based Accuracy reflects the model's capability to enforce smaller distances for positive pairs than for negative ones. Red hatched bars highlight the best-performing configurations; blue hatched bars indicate the worst-performing ones.
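The two metrics named in the captions above can be sketched in a few lines of Python. This is a simplified illustration, not the benchmark's evaluation code. EPIC distance as defined by Gleave et al. first canonicalizes the rewards and then takes a Pearson distance $\sqrt{(1-\rho)/2}$; the sketch below assumes the canonicalization step can be skipped and applies the Pearson distance directly to step-wise reward sequences. The triplet accuracy follows the caption's description (positive pair closer than negative pair); the function names and the 1-D "embeddings" in the example are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def pearson_distance(pred: Sequence[float], true: Sequence[float]) -> float:
    """sqrt((1 - rho) / 2), where rho is the Pearson correlation.

    Ranges from 0 (perfectly correlated) to 1 (perfectly anti-correlated).
    Assumption: no EPIC canonicalization is applied before this step.
    """
    n = len(pred)
    if n != len(true) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p, mean_t = sum(pred) / n, sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in true)
    if var_p == 0.0 or var_t == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    rho = cov / math.sqrt(var_p * var_t)
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point drift
    return math.sqrt((1.0 - rho) / 2.0)


def triplet_accuracy(
    triplets: Sequence[Tuple[T, T, T]],
    distance: Callable[[T, T], float],
) -> float:
    """Fraction of (anchor, positive, negative) triplets with d(a, p) < d(a, n)."""
    if not triplets:
        raise ValueError("need at least one triplet")
    correct = sum(1 for a, p, n in triplets if distance(a, p) < distance(a, n))
    return correct / len(triplets)


# Made-up reward sequences: a rescaled prediction and a reversed prediction.
ground_truth = [0.0, 0.5, 1.0]
scaled_prediction = [0.0, 1.0, 2.0]
reversed_prediction = [1.0, 0.5, 0.0]
d_scaled = pearson_distance(scaled_prediction, ground_truth)
d_reversed = pearson_distance(reversed_prediction, ground_truth)

# Made-up 1-D "embeddings" with absolute difference as the distance.
toy_triplets = [(0.0, 0.1, 0.9), (1.0, 0.2, 0.9)]
toy_accuracy = triplet_accuracy(toy_triplets, lambda a, b: abs(a - b))
```

Because the Pearson distance is invariant to positive rescaling and shifting of the predicted rewards, `scaled_prediction` scores as a perfect match here, while `reversed_prediction` receives the maximum distance.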