Deployable Vision-driven UAV River Navigation via Human-in-the-loop Preference Alignment

Zihan Wang, Jianwen Li, Li-Fan Wu, Nina Mahmoudian

TL;DR

The paper tackles the challenge of deploying vision-based UAV river-navigation policies under distribution shift and safety constraints by introducing SPAR-H, a statewise hybrid preference alignment method. SPAR-H combines direct statewise preferences on the agent's actions with a reward-based pathway trained from the same preferences and updated via a trust-region RL surrogate, enabling data-efficient online adaptation under HITL feedback. Key contributions include a unified HITL framework that converts interventions into both direct policy updates and reward-target updates, a controlled simulation study with five HITL rollouts, and a real-world UAV deployment demonstrating rapid online adaptation despite perception imperfections. Results show SPAR-H achieves the highest final episodic reward and the most stable performance across initial conditions, with the reward estimator shifting toward human-preferred actions and propagating improvements to nearby non-intervened states. The work suggests dual statewise preferences offer a practical approach for safe, data-efficient online adaptation in safety-critical, partially observable domains such as river navigation.

Abstract

Rivers are critical corridors for environmental monitoring and disaster response, where Unmanned Aerial Vehicles (UAVs) guided by vision-driven policies can provide fast, low-cost coverage. However, deployment exposes simulation-trained policies to distribution shift and safety risks, and requires efficient adaptation from limited human interventions. We study human-in-the-loop (HITL) learning with a conservative overseer who vetoes unsafe or inefficient actions and provides statewise preferences by comparing the agent's proposal with a corrective override. We introduce Statewise Hybrid Preference Alignment for Robotics (SPAR-H), which fuses direct preference optimization on policy logits with a reward-based pathway that trains an immediate-reward estimator from the same preferences and updates the policy using a trust-region surrogate. With five HITL rollouts collected from a fixed novice policy, SPAR-H achieves the highest final episodic reward and the lowest variance across initial conditions among the tested methods. The learned reward model aligns with human-preferred actions and elevates nearby non-intervened choices, supporting stable propagation of improvements. We benchmark SPAR-H against imitation learning (IL), direct preference variants, and evaluative reinforcement learning (RL) in the HITL setting, and demonstrate the real-world feasibility of continual preference alignment for UAV river following. Overall, dual statewise preferences empirically provide a practical route to data-efficient online adaptation in riverine navigation.
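
To make the two preference pathways in the abstract concrete, the following is a minimal sketch of how a single statewise comparison (human override preferred over agent proposal) could feed both a direct loss on policy logits and a loss on an immediate-reward estimator. It is not the paper's implementation. The discrete action space, the frozen reference policy, the `beta` temperature, and all function names are assumptions borrowed from standard DPO and Bradley-Terry formulations, and the trust-region policy update is omitted.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_softmax(logits):
    """Log-probabilities of a categorical policy given its logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def direct_preference_loss(logits, ref_logits, preferred, rejected, beta=1.0):
    """DPO-style loss for one state (assumed form, not taken from the paper).

    `preferred` is the index of the human override action and `rejected`
    is the index of the agent's vetoed proposal. `ref_logits` come from a
    frozen reference policy, e.g. the novice policy (assumption).
    """
    logp = log_softmax(logits)
    ref_logp = log_softmax(ref_logits)
    margin = (logp[preferred] - ref_logp[preferred]) - (
        logp[rejected] - ref_logp[rejected]
    )
    return -log_sigmoid(beta * margin)


def reward_preference_loss(rewards, preferred, rejected):
    """Bradley-Terry loss on per-action immediate-reward estimates."""
    return -log_sigmoid(rewards[preferred] - rewards[rejected])


if __name__ == "__main__":
    # One intervened state with three hypothetical discrete actions.
    policy_logits = [0.2, -0.1, 0.4]
    novice_logits = [0.2, -0.1, 0.4]
    reward_estimates = [0.0, 0.3, -0.2]
    # Human override is action 1; the agent proposed action 2.
    print(direct_preference_loss(policy_logits, novice_logits, 1, 2))
    print(reward_preference_loss(reward_estimates, 1, 2))
```

In this sketch both losses consume the same (preferred, rejected) pair, which mirrors the abstract's statement that the reward estimator is trained from the same preferences as the direct pathway.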

Paper Structure

This paper contains 15 sections, 15 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison of HITL learning methods. Statewise hybrid preference alignment combines direct (RL-free) preference optimization applied to policy logits with RLHF-style reward-model preferences that indirectly drive the policy update. Both signals arise from per-state comparisons between the human override and the agent proposal. Imitation learning methods mainly use weighted behavior cloning on the human-intervened trajectory, where human corrective actions are given larger weights.
  • Figure 2: Vision-to-action pipeline for river following. RGB is segmented by SAM2 into a water mask, patchified, passed through a frozen GRU encoder, then split to a policy head (action) and a reward head (immediate reward).
  • Figure 3: Experiment design for HITL learning. Starting from a novice policy, models are trained sequentially on the currently available episodes with human corrective data and saved as checkpoints. Cp stands for checkpoint, and Ep stands for episode.
  • Figure 4: Episodic rewards per checkpoint. SPAR-H yields the overall largest gains by combining direct and reward-based preference alignment.
  • Figure 5: Final checkpoint performance. SPAR-H achieves the highest mean reward and lowest variance across initial conditions.
  • ...and 3 more figures
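
The weighted behavior cloning baseline mentioned in the Figure 1 caption can be illustrated with a short sketch. It assumes a discrete action space and a single scalar `human_weight` applied to steps where the human intervened. The weight value, the normalization by total weight, and the function name are hypothetical choices, not details taken from the paper.

```python
import math


def weighted_bc_loss(logits_per_step, actions, is_human, human_weight=2.0):
    """Weighted negative log-likelihood over one intervened trajectory.

    Steps whose action came from a human correction get `human_weight`;
    all other steps get weight 1.0 (assumed weighting scheme).
    """
    total, weight_sum = 0.0, 0.0
    for logits, action, human in zip(logits_per_step, actions, is_human):
        m = max(logits)
        lse = m + math.log(sum(math.exp(v - m) for v in logits))
        nll = lse - logits[action]  # -log pi(action | state)
        w = human_weight if human else 1.0
        total += w * nll
        weight_sum += w
    return total / weight_sum


if __name__ == "__main__":
    # Two steps, two hypothetical actions; the second step was corrected.
    steps = [[2.0, 0.0], [0.0, 2.0]]
    executed = [0, 0]
    corrected = [False, True]
    print(weighted_bc_loss(steps, executed, corrected, human_weight=1.0))
    print(weighted_bc_loss(steps, executed, corrected, human_weight=3.0))
```

Raising `human_weight` makes the loss more sensitive to the corrected step, where the policy currently disagrees with the human action.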