Sensi: Learn One Thing at a Time -- Curriculum-Based Test-Time Learning for LLM Game Agents

Mohsen Arjmandi

Abstract

Large language model (LLM) agents deployed in unknown environments must learn task structure at test time, but current approaches require thousands of interactions to form useful hypotheses. We present Sensi, an LLM agent architecture for the ARC-AGI-3 game-playing challenge that introduces structured test-time learning through three mechanisms: (1) a two-player architecture separating perception from action, (2) a curriculum-based learning system managed by an external state machine, and (3) a database-as-control-plane that makes the agent's context window programmatically steerable. We further introduce an LLM-as-judge component with dynamically generated evaluation rubrics to determine when the agent has learned enough about one topic to advance to the next. We report results across two iterations: Sensi v1 solves 2 game levels using the two-player architecture alone, while Sensi v2 adds curriculum learning and solves 0 levels but completes its entire learning curriculum in approximately 32 action attempts, achieving 50-94x greater sample efficiency than comparable systems that require 1600-3000 attempts. We precisely diagnose the failure mode as a self-consistent hallucination cascade originating in the perception layer, demonstrating that the architectural bottleneck has shifted from learning efficiency to perceptual grounding, a more tractable problem.
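
As a quick check on the quoted range, dividing the attempt counts reported for comparable systems by the approximately 32 attempts Sensi v2 needs gives the two endpoints:

$$
\frac{1600}{32} = 50, \qquad \frac{3000}{32} = 93.75 \approx 94 .
$$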

Paper Structure (29 sections, 14 equations, 5 figures, 2 tables, 1 algorithm)

This paper contains 29 sections, 14 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Sensi v2 architecture. Each turn involves up to five LLM calls orchestrated through a pipeline. The SQLite database serves as the control plane: all agent state resides in database tables that are queried to construct prompts and updated with each turn's outputs. The curriculum state machine (left) manages learning progression, promoting figured-out items to facts when a learning item is completed. MetricGen (LLM$_2$) runs only once when a new learning item is activated.
  • Figure 2: Learning item state machine. Each curriculum item progresses through three states. The self-loop on learning represents repeated evaluation until the sense score meets the threshold $\tau$. Upon completion, accumulated figured-out items are promoted to facts, creating a knowledge accumulation chain.
  • Figure 3: Sense score progression over turns (illustrative). The agent's sense score for each curriculum item rises toward the threshold $\tau = 8$ as it accumulates figured-out items. Vertical dashed lines mark curriculum transitions where one item is completed and the next begins. The entire curriculum is completed in approximately 32 turns. Scores are representative of observed behavior; exact per-turn values vary across runs.
  • Figure 4: Architectural comparison of Sensi v1 and v2. V1 (left) uses a simple two-player loop with shared hypothesis lists. V2 (right) preserves the Player$_1$/Player$_2$ core (blue) but adds frame differencing, dynamic metric generation, sense scoring, and the database-as-control-plane (red/orange). The two-player core is embedded within a richer learning infrastructure.
  • Figure 5: Self-consistent hallucination cascade. A frame differencing error (step 1) propagates through the pipeline. Player$_1$ builds a wrong hypothesis (step 2), which the sense scorer validates because it is internally consistent (step 3). This triggers premature completion (step 4), and the wrong knowledge contaminates subsequent learning items (step 5). The ground truth (green, bottom-left) never enters the pipeline.
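
The captions for Figures 1 to 3 describe a curriculum state machine backed by SQLite, in which figured-out items are promoted to facts once the sense score meets the threshold $\tau$. The following is a minimal sketch of that idea, not the paper's implementation. The table names, the state labels `pending`, `learning`, and `completed`, and every function name are assumptions made for illustration. The sense score is passed in by the caller, where the paper uses an LLM judge.

```python
import sqlite3

TAU = 8  # threshold value taken from the Figure 3 caption

# Hypothetical schema: the captions say only that agent state lives in tables.
SCHEMA = """
CREATE TABLE learning_items (
    id INTEGER PRIMARY KEY, topic TEXT, state TEXT NOT NULL DEFAULT 'pending');
CREATE TABLE figured_out (id INTEGER PRIMARY KEY, item_id INTEGER, text TEXT);
CREATE TABLE facts (id INTEGER PRIMARY KEY, text TEXT);
"""


def init_db(topics):
    """Create an in-memory database holding one curriculum item per topic."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO learning_items (topic) VALUES (?)", [(t,) for t in topics]
    )
    conn.commit()
    return conn


def active_item(conn):
    """Return (id, topic) of the item being learned, activating the next
    pending item if none is active. Returns None when the curriculum is done."""
    query = "SELECT id, topic FROM learning_items WHERE state = ? ORDER BY id LIMIT 1"
    row = conn.execute(query, ("learning",)).fetchone()
    if row is None:
        row = conn.execute(query, ("pending",)).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE learning_items SET state = 'learning' WHERE id = ?", (row[0],)
            )
            conn.commit()
    return row


def record_turn(conn, item_id, findings, sense_score, tau=TAU):
    """Store this turn's findings. If the score meets tau, promote the item's
    findings to facts and mark the item completed; otherwise it keeps learning."""
    conn.executemany(
        "INSERT INTO figured_out (item_id, text) VALUES (?, ?)",
        [(item_id, f) for f in findings],
    )
    if sense_score >= tau:
        conn.execute(
            "INSERT INTO facts (text) "
            "SELECT text FROM figured_out WHERE item_id = ? ORDER BY id",
            (item_id,),
        )
        conn.execute("DELETE FROM figured_out WHERE item_id = ?", (item_id,))
        conn.execute(
            "UPDATE learning_items SET state = 'completed' WHERE id = ?", (item_id,)
        )
    conn.commit()


def build_prompt(conn):
    """Assemble prompt context purely from database rows (the control-plane idea)."""
    item = active_item(conn)
    facts = [r[0] for r in conn.execute("SELECT text FROM facts ORDER BY id")]
    topic = item[1] if item else "(curriculum complete)"
    return "Current topic: " + topic + "\nKnown facts:\n" + "\n".join(facts)
```

A caller would loop over `active_item`, run the agent's turn, and pass the judged score to `record_turn`. Because the prompt is rebuilt from tables each turn, editing a row changes what the agent sees next. The same property means that a wrong finding promoted to `facts` reaches every later prompt, which is the contamination path the Figure 5 caption describes.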