TRACE: A Self-Improving Framework for Robot Behavior Forecasting with Vision-Language Models

Gokul Puthumanaillam; Paulo Padrao; Jose Fuentes; Pranay Thangeda; William E. Schafer; Jae Hyuk Song; Karan Jagdale; Leonardo Bobadilla; Melkior Ornik

TL;DR

TRACE is an iterative framework that couples vision-language reasoning with a world model and counterfactual exploration to forecast near-term robot trajectories from sparse observations. By building a tree-of-thought hypothesis space and probing it for edge cases with a counterfactual critic, TRACE achieves broader and more robust trajectory coverage than single-shot VLM or traditional model-based baselines. The self-improvement loop lets the VLM internalize domain constraints and failure patterns over iterations, reducing invalid predictions while expanding the set of feasible maneuvers it considers, including rare edge cases. Demonstrations on marine autonomous surface vessels and simulated ground vehicles show significant gains in coverage and edge-case detection, highlighting TRACE’s practical potential for safe, proactive navigation in partially observable environments.

Abstract

Predicting the near-term behavior of a reactive agent is crucial in many robotic scenarios, yet remains challenging when observations of that agent are sparse or intermittent. Vision-Language Models (VLMs) offer a promising avenue by integrating textual domain knowledge with visual cues, but their one-shot predictions often miss important edge cases and unusual maneuvers. Our key insight is that iterative, counterfactual exploration can significantly enhance VLM-based behavioral forecasting: a dedicated module probes each proposed behavior hypothesis, explicitly represented as a plausible trajectory, for overlooked possibilities. We present TRACE (Tree-of-thought Reasoning And Counterfactual Exploration), an inference framework that couples tree-of-thought generation with domain-aware feedback to refine behavior hypotheses over multiple rounds. Concretely, a VLM first proposes candidate trajectories for the agent; a counterfactual critic then suggests edge-case variations consistent with the partial observations, prompting the VLM to expand or adjust its hypotheses in the next iteration. This creates a self-improving cycle in which the VLM progressively internalizes edge cases from previous rounds and systematically uncovers not only typical behaviors but also rare or borderline maneuvers, ultimately yielding more robust trajectory predictions from minimal sensor data. We validate TRACE on both ground-vehicle simulations and real-world marine autonomous surface vehicles. Experimental results show that our method consistently outperforms standard VLM-driven and purely model-based baselines, capturing a broader range of feasible agent behaviors despite sparse sensing. Evaluation videos and code are available at trace-robotics.github.io.
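The propose, critique, and refine cycle described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. Every name here (`Hypothesis`, `refine`, `toy_propose`, `toy_critique`, `toy_feasible`) is hypothetical, the proposer and critic are hard-coded stand-ins for the VLM and the counterfactual critic, the feasibility check is a toy substitute for domain-aware feedback, and hypotheses are kept in a flat set rather than a tree of thoughts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
Trajectory = Tuple[Point, ...]


@dataclass
class Hypothesis:
    label: str              # short name for the maneuver (hypothetical)
    trajectory: Trajectory  # waypoints that represent the hypothesis


def refine(
    propose: Callable[[List[Point], List[str]], List[Hypothesis]],
    critique: Callable[[Hypothesis, List[Point]], List[str]],
    is_feasible: Callable[[Hypothesis], bool],
    observations: List[Point],
    rounds: int = 3,
) -> List[Hypothesis]:
    """Grow a set of behavior hypotheses over several rounds (illustrative)."""
    kept: Dict[str, Hypothesis] = {}
    feedback: List[str] = []
    for _ in range(rounds):
        # The proposer (stand-in for the VLM) sees the observations plus
        # the edge-case suggestions collected in the previous round.
        candidates = propose(observations, feedback)
        feedback = []
        for hyp in candidates:
            if not is_feasible(hyp):
                continue  # drop candidates that violate the toy constraint
            kept.setdefault(hyp.label, hyp)
            # The critic (stand-in for the counterfactual critic) suggests
            # variations that have not been explored yet.
            for suggestion in critique(hyp, observations):
                if suggestion not in kept and suggestion not in feedback:
                    feedback.append(suggestion)
        if not feedback:
            break  # nothing new to explore
    return list(kept.values())


# Toy stand-ins, assumed purely for illustration.
OFFSETS = {"straight": 0.0, "veer_left": 1.0, "veer_right": -1.0, "u_turn": 5.0}


def toy_propose(obs: List[Point], feedback: List[str]) -> List[Hypothesis]:
    x, y = obs[-1]
    labels = feedback or ["straight"]  # first round: only the typical behavior
    return [Hypothesis(l, ((x, y), (x + 1.0, y + OFFSETS[l]))) for l in labels]


def toy_critique(hyp: Hypothesis, obs: List[Point]) -> List[str]:
    variations = {"straight": ["veer_left", "veer_right", "u_turn"]}
    return variations.get(hyp.label, [])


def toy_feasible(hyp: Hypothesis) -> bool:
    (_, y0), (_, y1) = hyp.trajectory
    return abs(y1 - y0) <= 2.0  # toy limit on lateral motion per step


result = refine(toy_propose, toy_critique, toy_feasible, [(0.0, 0.0)])
labels = [h.label for h in result]
print(labels)
```

In this toy run the first round yields only the typical "straight" behavior. The critic's suggestions then drive a second round that adds two lateral variations, while the infeasible suggestion is filtered out. The real system would replace each stub with a model call and a domain-specific check.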
