Deployable Vision-driven UAV River Navigation via Human-in-the-loop Preference Alignment
Zihan Wang, Jianwen Li, Li-Fan Wu, Nina Mahmoudian
TL;DR
The paper tackles the challenge of deploying vision-based UAV river-navigation policies under distribution shift and safety constraints by introducing SPAR-H, a statewise hybrid preference alignment method. SPAR-H combines direct statewise preferences on the agent's actions with a reward-based pathway trained from the same preferences and updated via a trust-region RL surrogate, enabling data-efficient online adaptation under human-in-the-loop (HITL) feedback. Key contributions include a unified HITL framework that converts interventions into both direct policy updates and reward-target updates, a controlled simulation study with five HITL rollouts, and a real-world UAV deployment demonstrating rapid online adaptation despite perception imperfections. Results show SPAR-H achieves the highest final episodic reward and the most stable performance across initial conditions, with the reward estimator shifting toward human-preferred actions and propagating improvements to nearby non-intervened states. The work suggests dual statewise preferences offer a practical approach for safe, data-efficient online adaptation in safety-critical, partially observable domains such as river navigation.
Abstract
Rivers are critical corridors for environmental monitoring and disaster response, where unmanned aerial vehicles (UAVs) guided by vision-driven policies can provide fast, low-cost coverage. However, deployment exposes simulation-trained policies to distribution shift and safety risks and requires efficient adaptation from limited human interventions. We study human-in-the-loop (HITL) learning with a conservative overseer who vetoes unsafe or inefficient actions and provides statewise preferences by comparing the agent's proposal with a corrective override. We introduce Statewise Hybrid Preference Alignment for Robotics (SPAR-H), which fuses direct preference optimization on policy logits with a reward-based pathway that trains an immediate-reward estimator from the same preferences and updates the policy using a trust-region surrogate. With five HITL rollouts collected from a fixed novice policy, SPAR-H achieves the highest final episodic reward and the lowest variance across initial conditions among tested methods. The learned reward model aligns with human-preferred actions and elevates nearby non-intervened choices, supporting stable propagation of improvements. We benchmark SPAR-H against imitation learning (IL), direct preference variants, and evaluative reinforcement learning (RL) in the HITL setting, and demonstrate real-world feasibility of continual preference alignment for UAV river following. Overall, dual statewise preferences empirically provide a practical route to data-efficient online adaptation in riverine navigation.
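
To make the two pathways described above concrete, the following is a minimal NumPy sketch of how one statewise preference (the overseer's override preferred over the agent's vetoed proposal) could feed both a direct loss on policy logits and a reward-estimator loss, with the estimator then driving a policy surrogate. It is an illustration under stated assumptions, not the paper's implementation: the abstract gives no loss formulas, so the discrete action space, the logistic (Bradley-Terry style) preference losses, the PPO-style clipped ratio used as a stand-in for the trust-region surrogate, and the names `beta`, `clip`, and `lam` are all assumptions.

```python
import numpy as np

def log_softmax(logits):
    # Row-wise log-probabilities from policy logits, shape (batch, n_actions).
    z = logits - np.max(logits, axis=-1, keepdims=True)
    return z - np.log(np.sum(np.exp(z), axis=-1, keepdims=True))

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)

def direct_preference_loss(logits, a_pref, a_rej, beta=1.0):
    # Direct pathway (assumed form): raise the log-probability of the
    # human override a_pref relative to the vetoed proposal a_rej
    # at the same state.
    logp = log_softmax(logits)
    idx = np.arange(len(a_pref))
    margin = beta * (logp[idx, a_pref] - logp[idx, a_rej])
    return -np.mean(log_sigmoid(margin))

def reward_preference_loss(rewards, a_pref, a_rej):
    # Reward pathway (assumed form): the immediate-reward estimator should
    # score the preferred action above the rejected one at the same state.
    idx = np.arange(len(a_pref))
    return -np.mean(log_sigmoid(rewards[idx, a_pref] - rewards[idx, a_rej]))

def clipped_surrogate_loss(logits, old_logits, rewards, clip=0.2):
    # Stand-in for the trust-region surrogate: a PPO-style clipped ratio
    # objective, with advantages taken as estimated immediate rewards
    # minus their expectation under the old policy.
    logp, old_logp = log_softmax(logits), log_softmax(old_logits)
    old_p = np.exp(old_logp)
    adv = rewards - np.sum(old_p * rewards, axis=-1, keepdims=True)
    ratio = np.exp(logp - old_logp)
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip)
    per_action = np.minimum(ratio * adv, clipped * adv)
    return -np.mean(np.sum(old_p * per_action, axis=-1))

def hybrid_policy_loss(logits, old_logits, rewards, a_pref, a_rej,
                       beta=1.0, clip=0.2, lam=1.0):
    # Hypothetical combination: direct term plus weighted surrogate term.
    direct = direct_preference_loss(logits, a_pref, a_rej, beta)
    surrogate = clipped_surrogate_loss(logits, old_logits, rewards, clip)
    return direct + lam * surrogate
```

In this sketch the same `(a_pref, a_rej)` pair is consumed twice, once by `direct_preference_loss` and once by `reward_preference_loss`, which mirrors the "dual statewise preferences" idea: the reward estimator can then score actions at states where no intervention occurred, while the direct term acts only at intervened states.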
