Grounded Test-Time Adaptation for LLM Agents

Arthur Chen, Zuxin Liu, Jianguo Zhang, Akshara Prabhakar, Zhiwei Liu, Shelby Heinecke, Silvio Savarese, Victor Zhong, Caiming Xiong

TL;DR

The paper investigates why LLM-based agents struggle to generalize to unseen environments and attributes the failures to syntactic and semantic mismatches. It introduces two deployment-time strategies: a parametric online adapter that learns a bias on the model's output distribution to match environment-specific formats, and a non-parametric dynamics grounding approach that builds an in-context description of environment transitions through persona-driven exploration. Across web navigation and function calling benchmarks, both strategies yield consistent performance gains, with dynamics grounding helping most in complex, unfamiliar settings. The study also reports ablations, trade-offs, and limitations, and argues for a principled mechanism for integrating the two strategies according to environmental complexity.

Abstract

Large language model (LLM)-based agents struggle to generalize to novel and complex environments, such as unseen websites or new sets of functions, due to a fundamental mismatch between their pre-training and test-time conditions. This challenge stems from two distinct failure modes: a syntactic misunderstanding of environment-specific components like observation formats, and a semantic misunderstanding of state-transition dynamics, which are only revealed at test time. To address these issues, we propose two distinct and complementary strategies for adapting LLM agents by leveraging environment-specific information available during deployment. First, an online distributional adaptation method parameterizes environmental nuances by learning a lightweight adaptation vector that biases the model's output distribution, enabling rapid alignment with an environment response format. Second, a deployment-time dynamics grounding method employs a persona-driven exploration phase to systematically probe and learn the environment's causal dynamics before task execution, equipping the agent with a nonparametric world model. We evaluate these strategies across diverse agentic benchmarks, including function calling and web navigation. Our empirical results show the effectiveness of both strategies across all benchmarks with minimal computational cost. We find that dynamics grounding is particularly effective in complex environments where unpredictable dynamics pose a major obstacle, demonstrating a robust path toward more generalizable and capable LLM-based agents. For example, on the WebArena multi-site split, this method increases the agent's success rate from 2% to 23%.

Paper Structure

This paper contains 33 sections, 4 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Overview of parametric test-time adaptation. The figure uses a web navigation shopping task to illustrate how the agent adapts to a new environment. (1) At the start of each episode, we initialize an adaptation vector $\delta$ as a zero vector and construct the inputs to the LLM agent. (2) During task execution, the agent receives environment instructions and observations. (3) At each step, we update the adaptation vector using a cross-entropy loss on the current input and apply it as a bias to the LLM's final hidden layer. This enables rapid alignment with environment-specific observation and action formats. (4) The LLM agent takes a new action with the updated vector, which shifts the model's output distribution to better match the test-time environment.
  • Figure 2: Overview of non-parametric test-time adaptation. The figure uses a web navigation shopping task to illustrate how the pipeline generates a natural-language description of environment dynamics. (1) We synthesize diverse, persona-based exploration tasks from environment descriptions. (2) An exploration agent interacts with the environment to collect interaction logs of state transitions and updates the environment rules/dynamics as it goes. (3) An LLM extracts and summarizes environment dynamics from these logs. (4) A reasoning model filters out less informative rules; the remaining rules are used to augment the agent's context during evaluation, enabling more transition-aware decision making.
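
The update described in the Figure 1 caption can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: random arrays stand in for the frozen LLM's final hidden states (`hidden`) and output projection (`W`), the bias is assumed to be added to the final hidden state before the projection, and the names `adapt_step` and `lr` as well as the plain gradient-descent update are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def adapt_step(delta, hidden, target_ids, W, lr=0.02):
    """One gradient step on the adaptation vector delta.

    hidden:     (T, d) final-layer hidden states for the current input (frozen).
    target_ids: (T,) observed next-token ids for the current input.
    W:          (V, d) frozen output projection (assumed, hypothetical name).
    Returns the updated delta and the mean cross-entropy before the update.
    """
    grad = np.zeros_like(delta)
    loss = 0.0
    for h, y in zip(hidden, target_ids):
        p = softmax(W @ (h + delta))   # delta acts as a bias on the hidden state
        loss -= np.log(p[y])           # cross-entropy on the observed token
        p[y] -= 1.0                    # d(loss)/d(logits) = softmax - one_hot
        grad += W.T @ p                # chain rule back to delta
    n = len(target_ids)
    return delta - lr * grad / n, loss / n

# Toy stand-ins for a frozen model and one environment input (all assumed).
rng = np.random.default_rng(0)
d, V, T = 8, 20, 5
W = rng.normal(size=(V, d))
hidden = rng.normal(size=(T, d))
targets = rng.integers(0, V, size=T)

delta = np.zeros(d)                    # (1) start each episode from a zero vector
losses = []
for _ in range(50):                    # stands in for successive agent steps
    delta, loss = adapt_step(delta, hidden, targets, W)
    losses.append(loss)
```

Only `delta` changes in this sketch; `W` and `hidden` stay fixed, which mirrors the idea of a lightweight vector that shifts the output distribution without touching the model's weights. In a real agent the hidden states would be recomputed from each new observation rather than reused as they are here.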