PRECEPT: Planning Resilience via Experience, Context Engineering & Probing Trajectories
A Unified Framework for Test-Time Adaptation with Compositional Rule Learning and Pareto-Guided Prompt Evolution

Arash Shahmansoori

TL;DR

This work introduces PRECEPT, a unified framework for test-time adaptation with three tightly coupled components: deterministic exact-match rule retrieval over structured condition keys, conflict-aware memory with Bayesian source reliability and threshold-based rule invalidation, and COMPASS, a Pareto-guided prompt-evolution outer loop.

Abstract

LLM agents that store knowledge as natural language suffer steep retrieval degradation as condition count grows, often struggle to compose learned rules reliably, and typically lack explicit mechanisms to detect stale or adversarial knowledge. We introduce PRECEPT, a unified framework for test-time adaptation with three tightly coupled components: (1) deterministic exact-match rule retrieval over structured condition keys, (2) conflict-aware memory with Bayesian source reliability and threshold-based rule invalidation, and (3) COMPASS, a Pareto-guided prompt-evolution outer loop. Exact retrieval eliminates partial-match interpretation errors on the deterministic path (0% by construction, vs. 94.4% under Theorem B.6's independence model at $N=10$) and supports compositional stacking through a semantic tier hierarchy; conflict-aware memory resolves static–dynamic disagreements and supports drift adaptation; COMPASS evaluates prompts through the same end-to-end execution pipeline. Results (9–10 seeds): PRECEPT achieves a +41.1pp first-try advantage over Full Reflexion ($d>1.9$), +33.3pp compositional generalization ($d=1.55$), 100% $P_1$ on 2-way logistics compositions ($d=2.64$), +40–55pp continuous learning gains, strong eventual robustness under adversarial static knowledge (100% logistics with adversarial SK active; partial recovery on integration), +55.0pp drift recovery ($d=0.95$, $p=0.031$), and 61% fewer steps. Core comparisons are statistically significant, often at $p<0.001$.
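
To make the first component concrete, the following is a minimal sketch of exact-match retrieval over structured condition keys. It is not PRECEPT's implementation: the class `ExactRuleStore`, the helper `make_key`, the `name=value` condition format, and the example conditions and action are all hypothetical and chosen only for illustration. The point it shows is that a lookup keyed on the full, normalized condition set either hits or misses, so a rule learned for two conditions is never returned for a query that shares only one of them.

```python
from typing import Dict, FrozenSet, Iterable, Optional

# All names below are hypothetical illustrations, not the PRECEPT API.
ConditionKey = FrozenSet[str]


def make_key(conditions: Iterable[str]) -> ConditionKey:
    """Normalize conditions so ordering, casing, and padding do not matter."""
    return frozenset(c.strip().lower() for c in conditions)


class ExactRuleStore:
    """Dictionary-backed rule store with full-key matching only."""

    def __init__(self) -> None:
        self._rules: Dict[ConditionKey, str] = {}

    def learn(self, conditions: Iterable[str], action: str) -> None:
        self._rules[make_key(conditions)] = action

    def retrieve(self, conditions: Iterable[str]) -> Optional[str]:
        # A subset or superset of a stored key is a miss, never a partial hit.
        return self._rules.get(make_key(conditions))


store = ExactRuleStore()
store.learn(["port=hamburg", "cargo=hazmat"], "route_via_rotterdam")

hit = store.retrieve(["Cargo=Hazmat", "port=hamburg"])  # same set, different order and case
miss = store.retrieve(["port=hamburg"])                 # partial overlap only
```

Here `hit` is the stored action and `miss` is `None`. A miss would hand control to whatever fallback path the agent has, which this sketch leaves out.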

Paper Structure (81 sections, 15 equations, 14 figures, 31 tables, 2 algorithms)

Figures (14)

  • Figure 1: PRECEPT architecture overview. The client handles orchestration, high-frequency monitoring, retrieval-time decision support, and learning updates; the server handles MCP dispatch, conflict-aware retrieval, low-frequency COMPASS evolution, domain execution, and persistent memory. The dashed arrow denotes the evolved prompt flowing from the server-side architect back to the client monitor.
  • Figure 2: Complete seven-phase execution flow of the PRECEPT agent: (1) Task parsing with hybrid rule+LLM fallback, (2) COMPASS complexity evaluation with block/proceed/fast-path decisions, (3) Three-mode context retrieval (compositional, hybrid, semantic), (4) Solution derivation with tier-sorted priority, (5) Domain action execution, (6) Outcome processing with threshold-based invalidation ($\theta{=}2$), and (7) Knowledge update via atomic precept extraction. Dashed arrow indicates the retry loop.
  • Figure 3: Knowledge Layer with three retrieval modes: exact-match $O(1)$ via dictionary lookup (highest priority), semantic similarity $O(\log n)$ via hybrid BM25+embedding search, and compositional retrieval via atomic precept decomposition with tier-sorted stacking. Conflicts are resolved through Bayesian Thompson sampling respecting the Safety $>$ Compliance $>$ Preferences hierarchy.
  • Figure 4: Complete PRECEPT execution pipeline. Tasks flow through the COMPASS Monitor ($O(1)$ constraint check), three-mode retrieval (exact-match, semantic, compositional), Bayesian conflict resolution via Thompson sampling, execution with deterministic pruning via RefineInterceptor, and threshold-based rule invalidation ($\theta{=}2$). On success, rules are persisted; on state-change events, the COMPASS Architect (low-frequency loop, dashed teal) triggers prompt evolution and Pareto selection. On failure, the retry loop (dashed red) returns to the Monitor with the failed option pruned.
  • Figure 5: Evo-Memory lifecycle for Type II (rule drift) handling. On success, the failure counter resets and confidence increases; on failure, confidence decays by half ($c \times 0.5$) and the failure counter increments. When $f \geq \theta$ ($\theta{=}2$ by default), the rule is invalidated via record_rule_failure(), triggering re-learning. This yields stale-rule persistence probability $(1{-}d)^\theta \leq 0.0025$ for PRECEPT (Corollary 6.7).
  • ...and 9 more figures
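
The lifecycle described in the Figure 5 caption can be sketched in a few lines. This is a sketch under stated assumptions rather than the paper's code. The caption gives the halving of confidence on failure, the counter reset on success, and the threshold $\theta{=}2$. The size of the confidence increase on success is not given, so the additive `boost` below is an assumption, as are the names `RuleState`, `record_outcome`, and `stale_persistence`. The symbol $d$ is not defined in this excerpt; the helper simply evaluates $(1-d)^\theta$ for a given value.

```python
from dataclasses import dataclass

THETA = 2  # invalidation threshold stated in the caption


@dataclass
class RuleState:
    confidence: float = 0.5
    failures: int = 0  # consecutive failures, since a success resets it
    valid: bool = True


def record_outcome(rule: RuleState, success: bool, boost: float = 0.1) -> RuleState:
    """Update one rule after an execution outcome (hypothetical helper)."""
    if success:
        rule.failures = 0
        # Assumed update: the caption only says confidence increases.
        rule.confidence = min(1.0, rule.confidence + boost)
    else:
        rule.confidence *= 0.5  # decay by half, as in the caption
        rule.failures += 1
        if rule.failures >= THETA:
            rule.valid = False  # invalidated; the agent would re-learn the rule
    return rule


def stale_persistence(d: float, theta: int = THETA) -> float:
    """Evaluate (1 - d)^theta from the caption for a given d."""
    return (1.0 - d) ** theta


rule = RuleState(confidence=0.8)
record_outcome(rule, success=False)  # confidence 0.4, one failure, still valid
record_outcome(rule, success=False)  # confidence 0.2, two failures, invalidated
```

With $\theta = 2$, the stated bound of 0.0025 corresponds to $(1-d)^2 \leq 0.0025$, that is $d \geq 0.95$, and `stale_persistence(0.95)` evaluates to approximately 0.0025.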

Theorems & Definitions (8)

  • proof
  • proof
  • proof
  • proof
  • proof
  • proof
  • proof
  • proof