Learning When to Switch: Adaptive Policy Selection via Reinforcement Learning

Chris Tava

TL;DR

The paper addresses when to switch between exploration and exploitation in maze navigation. It introduces a Q-learning framework that adaptively selects the coverage threshold at which the agent switches from spiral coverage to A* convergence, using a compact 50-state representation based on coverage and distance to goal. Across 240 test configurations, adaptive switching yields 23-55% faster completion, an 83% reduction in runtime variance, and a 71% improvement in worst-case scenarios, with gains scaling with maze complexity. The approach adapts within each episode using memory-efficient policies and may generalize to other domains that require structured policy switching, though the paper acknowledges limitations and directions for broader application.

Abstract

Autonomous agents often require multiple strategies to solve complex tasks, but determining when to switch between strategies remains challenging. This research introduces a reinforcement learning technique to learn switching thresholds between two orthogonal navigation policies. Using maze navigation as a case study, this work demonstrates how an agent can dynamically transition between systematic exploration (coverage) and goal-directed pathfinding (convergence) to improve task performance. Unlike fixed-threshold approaches, the agent uses Q-learning to adapt switching behavior based on coverage percentage and distance to goal, requiring only minimal domain knowledge: maze dimensions and target location. The agent does not require prior knowledge of wall positions, optimal threshold values, or hand-crafted heuristics; instead, it discovers effective switching strategies dynamically during each run. The agent discretizes its state space into coverage and distance buckets, then adapts which coverage threshold (20-60\%) to apply based on observed progress signals. Experiments across 240 test configurations (4 maze sizes, from 16$\times$16 to 128$\times$128, with 10 unique mazes per size and 6 agent variants) demonstrate that adaptive threshold learning outperforms both single-strategy agents and fixed 40\% threshold baselines. Results show 23-55\% improvements in completion time, 83\% reduction in runtime variance, and 71\% improvement in worst-case scenarios. The learned switching behavior generalizes within each size class to unseen wall configurations. Performance gains scale with problem complexity: 23\% improvement for 16$\times$16 mazes, 34\% for 32$\times$32, and 55\% for 64$\times$64, indicating that as the space of possible maze structures grows, the value of adaptive policy selection over fixed heuristics increases.
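
To make the mechanism concrete, the following is a minimal sketch of tabular Q-learning over a bucketed (coverage, distance) state, where each action is a candidate coverage threshold. It is not the paper's implementation. The 10-by-5 bucket split (chosen only so the table has 50 states), the five threshold values in the 20-60% range, the hyperparameters, and all function and class names are assumptions made for illustration; the reward signal is left to the caller.

```python
import random

# Assumed bucket split: 10 coverage buckets x 5 distance buckets = 50 states.
N_COVERAGE_BUCKETS = 10
N_DISTANCE_BUCKETS = 5
N_STATES = N_COVERAGE_BUCKETS * N_DISTANCE_BUCKETS

# Assumed action set: candidate coverage thresholds within the 20-60% range.
THRESHOLDS = [0.20, 0.30, 0.40, 0.50, 0.60]


def encode_state(coverage, distance, max_distance):
    """Map a coverage fraction in [0, 1] and a distance to goal to a state index."""
    c = min(int(coverage * N_COVERAGE_BUCKETS), N_COVERAGE_BUCKETS - 1)
    d = min(int(distance / max_distance * N_DISTANCE_BUCKETS), N_DISTANCE_BUCKETS - 1)
    return c * N_DISTANCE_BUCKETS + d


def should_switch(coverage, threshold):
    """Switch from coverage (exploration) to convergence (pathfinding)."""
    return coverage >= threshold


class ThresholdLearner:
    """Tabular Q-learning over thresholds; hyperparameters are illustrative."""

    def __init__(self, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.q = [[0.0] * len(THRESHOLDS) for _ in range(N_STATES)]

    def choose(self, state):
        """Epsilon-greedy choice of a threshold index for the given state."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(THRESHOLDS))
        row = self.q[state]
        return max(range(len(THRESHOLDS)), key=lambda a: row[a])

    def update(self, state, action, reward, next_state):
        """Standard one-step Q-learning update."""
        target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])
```

In use, an agent would call `encode_state` at each decision point, pick a threshold with `choose`, apply `should_switch` to decide whether to keep exploring or start pathfinding, and then call `update` with whatever reward it observes (for example, a negative step cost, which is also an assumption here). The table holds only 50 rows of 5 values, which is consistent with the compact, memory-efficient policy the abstract describes.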
