Polaris: A Gödel Agent Framework for Small Language Models through Experience-Abstracted Policy Repair

Aditya Kakade, Vivek Srivastava, Shirish Karande

Abstract

Gödel agents realize recursive self-improvement: an agent inspects its own policy and traces and then modifies that policy in a tested loop. We introduce Polaris, a Gödel agent for compact models that performs policy repair via experience abstraction, turning failures into policy updates through a structured cycle of analysis, strategy formation, abstraction, and minimal code patch repair with conservative checks. Unlike response-level self-correction or parameter tuning, Polaris makes policy-level changes with small, auditable patches that persist in the policy and are reused on unseen instances within each benchmark. As part of the loop, the agent engages in meta-reasoning: it explains its errors, proposes concrete revisions to its own policy, and then updates the policy. To enable cumulative policy refinement, we introduce experience abstraction, which distills failures into compact, reusable strategies that transfer to unseen instances. On MGSM, DROP, GPQA, and LitBench (covering arithmetic reasoning, compositional inference, graduate-level problem solving, and creative writing evaluation), a 7-billion-parameter model equipped with Polaris achieves consistent gains over the base policy and competitive baselines.
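
To make the repair cycle described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a policy is a plain Python callable, that tasks are (question, answer) pairs, and that the "conservative check" means rejecting any candidate that lowers validation accuracy. The names `evaluate`, `repair_loop`, `propose_repair`, `base_policy`, and `toy_repair` are hypothetical, and a hand-written toy patch stands in for the language-model steps of failure analysis, strategy formation, and patch generation.

```python
from typing import Callable, List, Tuple

Task = Tuple[str, str]          # (question, expected answer); assumed format
Policy = Callable[[str], str]   # a policy maps a question to an answer


def evaluate(policy: Policy, tasks: List[Task]) -> Tuple[float, List[Task]]:
    """Return accuracy and the list of failed tasks."""
    failures = [t for t in tasks if policy(t[0]) != t[1]]
    accuracy = 1.0 - len(failures) / len(tasks)
    return accuracy, failures


def repair_loop(
    policy: Policy,
    validation: List[Task],
    propose_repair: Callable[[Policy, List[Task]], Policy],
    n_failures: int = 3,
    max_iterations: int = 5,
) -> Policy:
    """Iteratively repair `policy`, keeping a candidate only if it does not regress."""
    best_acc, failures = evaluate(policy, validation)
    for _ in range(max_iterations):
        if not failures:
            break
        # Hand at most N failed tasks to the (hypothetical) repair step.
        candidate = propose_repair(policy, failures[:n_failures])
        cand_acc, cand_failures = evaluate(candidate, validation)
        # Assumed conservative check: reject any candidate that lowers accuracy.
        if cand_acc >= best_acc:
            policy, best_acc, failures = candidate, cand_acc, cand_failures
    return policy


# Toy stand-ins: the base policy only handles addition.
def base_policy(question: str) -> str:
    if "+" in question:
        a, b = question.split("+")
        return str(int(a) + int(b))
    return "unknown"


def toy_repair(policy: Policy, failures: List[Task]) -> Policy:
    """Pretend the analysis of `failures` concluded: 'subtraction is unsupported'."""
    def patched(question: str) -> str:
        if "-" in question:
            a, b = question.split("-")
            return str(int(a) - int(b))
        return policy(question)
    return patched


validation_set = [("2+3", "5"), ("10-4", "6"), ("7+1", "8")]
repaired_policy = repair_loop(base_policy, validation_set, toy_repair)
```

In this toy run the base policy fails the subtraction task, the patch is accepted because validation accuracy does not drop, and the patched behaviour persists in `repaired_policy` for later inputs.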

Paper Structure

This paper contains 19 sections, 3 equations, 29 figures, 3 tables.

Figures (29)

  • Figure 1: Architectural overview of Polaris. (a) Recursive self-improvement cycle: The agent selects actions based on its policy and goals, storing outputs and reasoning traces in Memory. Evaluation collects $N$ failed tasks from the validation set, triggering the Policy Repair module. (b) Policy repair cycle: Through experience abstraction, the agent performs Failure Analysis on the $N$ tasks, distills reusable strategies in Strategy Synthesis, generates minimal code patches, and integrates them into the current policy. A candidate version is execution-checked and, if valid, applied via runtime code mutation.
  • Figure 2: Policy update example on the MGSM dataset. Changes to the current policy relative to the previous policy are highlighted in green (statements added) and red (statements deleted). The update adds logic to break down the problem and validate each part, and deletes the comment about post-processing the response.
  • Figure 3: Successful evolution runs of Polaris with performance improvement compared to the base policy and COT-SC. Policy Repair Iteration 0 shows the performance with the base policy. For policy repair and experience abstraction, we consider a set of three failed instances from the validation set of each dataset ($N = 3$).
  • Figure 4: Successful evolution runs of Polaris using the Qwen3-8B model, with performance improvement compared to the base policy and COT-SC. Policy Repair Iteration 0 shows the performance with the base policy. For policy repair and experience abstraction, we consider a set of three failed instances from the validation set of each dataset ($N = 3$).
  • Figure 5: Prompt for analyzing failures on task samples through self-reflection.
  • ...and 24 more figures
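
The captions of Figures 1(b) and 2 mention an execution check, runtime code mutation, and a highlighted policy diff. The sketch below illustrates one way these pieces could fit together, under stated assumptions: the policy is stored as Python source that defines a function named `solve`, the execution check is a single smoke-test call, and the swap is done with the built-in `exec`. The two policy strings and the helpers `load_policy` and `policy_diff` are hypothetical and are not taken from the paper.

```python
import difflib
from typing import Callable, List, Optional

# Hypothetical previous and patched policy sources (not from the paper).
OLD_POLICY = '''\
def solve(question):
    # post-process the response
    return question.strip()
'''

NEW_POLICY = '''\
def solve(question):
    parts = [p.strip() for p in question.split(",")]
    parts = [p for p in parts if p]  # validate each part
    return " | ".join(parts)
'''


def load_policy(source: str, smoke_input: str) -> Optional[Callable[[str], str]]:
    """Compile `source` in a fresh namespace and smoke-test its `solve` function.

    Returns the function if it compiles and runs, otherwise None.
    """
    namespace: dict = {}
    try:
        exec(compile(source, "<candidate_policy>", "exec"), namespace)
        solve = namespace["solve"]
        solve(smoke_input)  # assumed execution check: must run without raising
    except Exception:
        return None
    return solve


def policy_diff(old: str, new: str) -> List[str]:
    """Return added ('+') and deleted ('-') lines between two policy versions."""
    diff = list(
        difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="", n=0)
    )
    body = diff[2:]  # skip the '---' and '+++' file header lines
    return [line for line in body if line.startswith(("+", "-"))]


current_policy = load_policy(OLD_POLICY, "a, b")
candidate_policy = load_policy(NEW_POLICY, "a, b")
if candidate_policy is not None:
    current_policy = candidate_policy  # swap in the patched policy at runtime

changes = policy_diff(OLD_POLICY, NEW_POLICY)
```

A candidate that fails to compile or raises during the smoke test yields `None`, so the previous policy stays in place. Running untrusted generated code through `exec` would need sandboxing in practice, which this sketch omits.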