TARSE: Test-Time Adaptation via Retrieval of Skills and Experience for Reasoning Agents

Junda Wang; Zonghai Tao; Hansi Zeng; Zhichao Yang; Hamed Zamani; Hong Yu

TL;DR

This work frames clinical question answering as an agent problem with two explicit, retrievable resources: skills, which are reusable clinical procedures such as guidelines, protocols, and pharmacologic mechanisms; and experience, which consists of verified reasoning trajectories from previously solved cases. At test time, the agent retrieves both and adapts the language model on the retrieved items to align its intermediate reasoning with clinically valid logic.

Abstract

Complex clinical decision making often fails not because a model lacks facts, but because it cannot reliably select and apply the right procedural knowledge and the right prior example at the right reasoning step. We frame clinical question answering as an agent problem with two explicit, retrievable resources: skills, reusable clinical procedures such as guidelines, protocols, and pharmacologic mechanisms; and experience, verified reasoning trajectories from previously solved cases (e.g., chain-of-thought solutions and their step-level decompositions). At test time, the agent retrieves both relevant skills and experiences from curated libraries and performs lightweight test-time adaptation to align the language model's intermediate reasoning with clinically valid logic. Concretely, we build (i) a skills library from guideline-style documents organized as executable decision rules, (ii) an experience library of exemplar clinical reasoning chains indexed by step-level transitions, and (iii) a step-aware retriever that selects the most useful skill and experience items for the current case. We then adapt the model on the retrieved items to reduce instance-step misalignment and to prevent reasoning from drifting toward unsupported shortcuts. Experiments on medical question-answering benchmarks show consistent gains over strong medical RAG baselines and prompting-only reasoning methods. Our results suggest that explicitly separating and retrieving clinical skills and experience, and then aligning the model at test time, is a practical approach to more reliable medical agents.
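To make the idea of step-aware retrieval over two separate libraries concrete, here is a minimal sketch. It is not the paper's retriever: the `Item` record, the `step` labels, and the Jaccard token-overlap score are all assumptions introduced for illustration, standing in for whatever indexing and scoring the authors actually use. The example entries paraphrase the cisplatin case from the Figure 1 caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    kind: str  # "skill" or "experience" (the two libraries named in the abstract)
    step: str  # hypothetical step label the item is indexed under
    text: str


def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens; a stand-in for a real encoder."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-overlap similarity, used here instead of a learned retriever."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(library, query, step, kind, k=1):
    """Return the top-k items of one kind that are indexed under the current step."""
    q = tokens(query)
    candidates = [it for it in library if it.kind == kind and it.step == step]
    ranked = sorted(candidates, key=lambda it: jaccard(q, tokens(it.text)), reverse=True)
    return ranked[:k]


# Toy library; contents are illustrative, not taken from the paper's data.
LIBRARY = [
    Item("skill", "mechanism",
         "cisplatin antitumor mechanism is dna cross-linking leading to apoptosis"),
    Item("skill", "side_effect",
         "cisplatin ototoxicity arises from free radicals causing hearing loss"),
    Item("experience", "mechanism",
         "bladder cancer patient on platinum chemotherapy with hearing loss drug was cisplatin"),
]

query = "bladder cancer chemotherapy hearing loss what is the drug mechanism"
top_skill = retrieve(LIBRARY, query, step="mechanism", kind="skill")
top_experience = retrieve(LIBRARY, query, step="mechanism", kind="experience")
```

Filtering by step before scoring is the point of the sketch: the query mentions hearing loss, so a purely lexical match could favor the ototoxicity rule, but conditioning on the step that the question actually targets keeps the mechanism rule in front.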

Paper Structure (45 sections, 11 equations, 3 figures, 10 tables)

Introduction
Related Work
Problem Formulation
Hidden-chain view of reasoning.
Why naive retrieval misses the failing step.
Objective.
Experience & Skills Library Construction
Skills Library: Guideline-Style Decision Rules
Experience Library: Logical Chains & QA Traces
Design Principle 1 (Structured, Step-Indexed Reasoning).
Design Principle 2 (Entity-Grounded Abstraction).
Design Principle 3 (QA-Coupled Trajectories).
TARSE
Stage A: Test-time experience alignment and provisional chain.
Stage B: Step-aware skills retrieval and verification.
...and 30 more sections

Figures (3)

Figure 1: Motivation and overview of Skills & Experience Library + TARSE. (Left) A clinical QA example where the key difficulty is goal-conditional disambiguation: cisplatin’s ototoxicity (free radicals $\rightarrow$ hearing loss) is a side-effect chain, while its antitumor mechanism is the therapeutic chain (covalent DNA binding $\rightarrow$ DNA cross-linking $\rightarrow$ apoptosis). Humans solve this by combining experience (recalling a similar bladder-cancer + platinum-chemo case that links the symptom to cisplatin) with skills (reusable guideline/mechanism logical chains) to apply a “mechanism vs. side-effect” gate and select the correct branch (DNA cross-linking, option D). In contrast, a vanilla LLM tends to follow heuristic associations and conflates toxicity with mechanism, and traditional chunk RAG retrieves true but unstructured snippets that are not step-aligned, leading to the same confusion. (Right) Our method operationalizes this workflow: (1) retrieve similar experience traces and relevant skills; (2) perform lightweight test-time adaptation to align reasoning with the question intent; (3) generate a provisional hypothesis/chain; and (4) verify/repair it using retrieved skills to produce the final answer. The bottom-right plot summarizes the accuracy–latency trade-off, showing that TARSE achieves larger gains than CoT-only and conventional RAG baselines at comparable inference time.
Figure 2: TARSE process. Starting from a real clinical diagnosis question (top), the agent first retrieves a similar verified experience trace from an Experience Library (Step 1), then performs experience-conditioned hypothesis generation to propose candidate mechanisms (Step 2). Next, it retrieves Skills from a Skills Library (e.g., guideline/protocol rules) together with supporting experience items to verify key transitions and enforce “gate” checks (Step 3), producing a refined reasoning trace that cleanly separates the side-effect chain (cisplatin → free radicals → ototoxicity/hearing loss) from the therapeutic mechanism chain (cisplatin → DNA cross-linking → apoptosis). The final answer follows the verified therapeutic chain (Step 4).
Figure 3: Ablation study for TARSE. (Top) Effect of the number of retained small-batch samples used in logical chain supervision. (Bottom) Impact of varying the number of parallel users during batch-wise test-time training.
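
The verify/repair step described in the Figure 1 and Figure 2 captions can be illustrated with a small sketch. Everything here is an assumption made for illustration: `SKILL_CHAINS`, the goal labels, and the transition-matching rule are hypothetical and much simpler than the verification the paper describes. The sketch checks each transition of a provisional chain against the skill chain for the question's goal and, at the first unsupported transition, replaces the remainder with the skill chain.

```python
# Hypothetical skill chains, paraphrasing the two chains in the Figure 2 caption.
SKILL_CHAINS = {
    "therapeutic": ["cisplatin", "DNA cross-linking", "apoptosis"],
    "side_effect": ["cisplatin", "free radicals", "hearing loss"],
}


def transitions(chain):
    """Adjacent (from, to) pairs of a reasoning chain."""
    return list(zip(chain, chain[1:]))


def verify_and_repair(provisional, goal):
    """Return (chain, index of the first repaired transition, or None if verified)."""
    reference = SKILL_CHAINS[goal]
    allowed = set(transitions(reference))
    for i, step in enumerate(transitions(provisional)):
        if step not in allowed:
            anchor = provisional[i]
            if anchor in reference:
                # Keep the verified prefix, then follow the skill chain from the anchor.
                j = reference.index(anchor)
                return provisional[:i] + reference[j:], i
            # Nothing in the provisional chain is supported: fall back to the skill chain.
            return list(reference), i
    return list(provisional), None


# The question asks for the antitumor mechanism, but the provisional chain
# follows the side-effect branch, so the gate rejects its first transition.
provisional = ["cisplatin", "free radicals", "hearing loss"]
repaired, failed_at = verify_and_repair(provisional, goal="therapeutic")
```

In this toy case `repaired` is the therapeutic chain and `failed_at` is 0, mirroring the "mechanism vs. side-effect" gate in the captions; a chain that already matches the goal's skill chain is returned unchanged with `None`.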
