CHOP: Counterfactual Human Preference Labels Improve Obstacle Avoidance in Visuomotor Navigation Policies

Gershom Seneviratne; Jianyu An; Vaibhav Shende; Sahire Ellahy; Yaxita Amin; Kondapi Manasanjani; Samarth Chopra; Jonathan Deepak Kannan; Dinesh Manocha

TL;DR

CHOP uses Counterfactual Human Preference Labels to align visuomotor navigation policies with human intuitions about safety and obstacle avoidance. The results show that counterfactual preference supervision helps bridge the gap between large-scale visuomotor policies and human-aligned, safety-aware embodied navigation.

Abstract

Visuomotor navigation policies have shown strong perception-action coupling for embodied agents, yet they often struggle with safe navigation and dynamic obstacle avoidance in complex real-world environments. We introduce CHOP, a novel approach that leverages Counterfactual Human Preference Labels to align visuomotor navigation policies towards human intuition of safety and obstacle avoidance in navigation. In CHOP, for each visual observation, the robot's executed trajectory is included among a set of counterfactual navigation trajectories: alternative trajectories the robot could have followed under identical conditions. Human annotators provide pairwise preference labels over these trajectories based on anticipated outcomes such as collision risk and path efficiency. These aggregated preferences are then used to fine-tune visuomotor navigation policies, aligning their behavior with human preferences in navigation. Experiments on the SCAND dataset show that visuomotor navigation policies fine-tuned with CHOP reduce near-collision events by 49.7%, decrease deviation from human-preferred trajectories by 45.0%, and increase average obstacle clearance by 19.8% on average across multiple state-of-the-art models, compared to their pretrained baselines. These improvements transfer to real-world deployments on a Ghost Robotics Vision60 quadruped, where CHOP-aligned policies improve average goal success rates by 24.4%, increase minimum obstacle clearance by 6.8%, reduce collision and intervention events by 45.7%, and improve normalized path completion by 38.6% on average across navigation scenarios, compared to their pretrained baselines. Our results highlight the value of counterfactual preference supervision in bridging the gap between large-scale visuomotor policies and human-aligned, safety-aware embodied navigation.
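
The aggregation step described above, in which pairwise labels over counterfactual trajectories are reduced to a single fine-tuning target per observation, can be illustrated with a minimal sketch. The abstract does not specify the aggregation rule, so the win-count rule below is an assumption, and the function name, the trajectory identifiers, and the tie-breaking choice are all hypothetical.

```python
from collections import Counter


def most_preferred(candidates, pairwise_labels):
    """Return the candidate id with the most pairwise wins.

    candidates: list of trajectory ids for one observation.
    pairwise_labels: list of (winner_id, loser_id) tuples from annotators.
    Assumption: a simple win count; ties go to the candidate listed first.
    """
    wins = Counter({c: 0 for c in candidates})
    for winner, loser in pairwise_labels:
        if winner not in wins or loser not in wins:
            raise ValueError(f"unknown trajectory in label: {(winner, loser)}")
        wins[winner] += 1
    # max() returns the first maximal element, so ties follow list order.
    return max(candidates, key=lambda c: wins[c])


# Hypothetical labels for one observation with four candidate trajectories.
candidates = ["executed", "ccw", "cw", "target"]
labels = [
    ("target", "executed"),
    ("target", "cw"),
    ("executed", "cw"),
    ("ccw", "cw"),
    ("target", "ccw"),
]
best = most_preferred(candidates, labels)
```

In this example `best` is `"target"`, since it wins three of the five comparisons. A rank-aggregation model such as Bradley-Terry would be an alternative when annotators disagree, but whether the paper uses one is not stated here.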

Paper Structure (36 sections, 9 equations, 7 figures, 6 tables)

Introduction
Related Work
Vision-Based Navigation
Human Preference Alignment of Large Models
Background
Visuomotor Navigation Policies
Navigation Vision-Language-Action (VLA) Models
Counterfactual Data for Models
Preference Alignment
Supervised Fine-Tuning (SFT)
Reward-Model–Based Alignment
Methodology
Counterfactual Preference Dataset Generation
Counterfactual Trajectory Generation
Human-guided target annotation
...and 21 more sections

Figures (7)

Figure 1: CHOP: Counterfactual human preferences for obstacle avoidance and planning enable alignment of visuomotor navigation policies with human preference in ambiguous environments. (A) Given a single visual observation and goal, multiple distinct yet feasible navigation trajectories may exist. (B) Human annotators express preferences ($\checkmark > {?} > \times$) over these counterfactual alternatives, capturing safety and contextual cues not explicitly observable, sometimes picking trajectories which are better than the operator action sequence. (C) CHOP aligns the policy to select the human-preferred trajectory, whereas the baseline policy selects a suboptimal path. Our method shows improvements in obstacle clearance, success rate, and average collision rate in real-world scenarios.
Figure 2: Overview of CHOP. (1) Counterfactual dataset generation: Given a navigation dataset containing egocentric images and the trajectories executed by a robot, we generate multiple counterfactual trajectory alternatives for each observation. Human annotators provide binary preference labels over pairs of counterfactual trajectories, resulting in multiple comparisons per image (horizontal axis), while the process is repeated across the entire dataset (vertical axis). The aggregated preference annotations form the CHOP counterfactual preference dataset. (2) Policy training: For each observation, the most preferred trajectory (*), determined from the multiple preference labels collected for that observation, is extracted from the binary rankings and used to fine-tune pretrained visuomotor navigation policies (e.g., OmniVLA [hirose2025omnivla], ViNT [shah2023vint], GNM [shah2022gnm]) using Supervised Fine-Tuning (SFT) or Low-Rank Adaptation (LoRA), producing a preference-aligned visuomotor policy. (3) Deployment: The resulting preference-aligned policies are deployed on real robot platforms, where predicted paths are executed by a downstream controller. Evaluations on both offline datasets and real-world robots demonstrate the benefits of counterfactual preference supervision for safer and more human-aligned navigation.
Figure 3: Counterfactual trajectory generation and human preference annotation under identical egocentric observations. A: Trajectory recorded in the dataset. B–C: Structured counterfactual trajectories generated by rotating the executed trajectory counterclockwise and clockwise respectively. D: Human-guided target trajectory. E: Preference annotation interface, where annotators compare two overlaid trajectories and select the safer or more appropriate one. Trajectories are rendered in the robot’s egocentric frame; colors indicate different candidates and do not imply preference.
Figure 4: Qualitative comparison of baseline and CHOP-finetuned visuomotor policies across three navigation scenarios, with rows corresponding to methods and columns corresponding to scenarios. Green trajectories denote CHOP-finetuned policies, while red dashed trajectories denote baseline behavior. GNM [shah2022gnm] and OmniVLA [hirose2025omnivla] exhibit improved obstacle avoidance and increased clearance after fine-tuning, whereas ViNT [shah2023vint] shows more limited qualitative change.
Figure 5: Offline qualitative comparison on the SCAND test set showing successive frames (left to right) from three navigation scenarios. Each scenario is visualized across three rows corresponding to OmniVLA, GNM, and ViNT. Blue trajectories denote baseline model predictions, while red trajectories denote CHOP-aligned predictions. When a trajectory disappears from view, the policy has slowed or stopped due to conservative planning. Across most scenarios, timestamps, and methods, CHOP-aligned policies exhibit earlier obstacle avoidance, increased clearance, and more cautious behavior in the presence of dynamic agents compared to their baseline counterparts.
...and 2 more figures
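
The structured counterfactuals in Figure 3 (panels B and C) are described as rotations of the executed trajectory. A minimal sketch of that operation follows, assuming 2D waypoints in the robot's egocentric frame with x forward and y to the left, rotated about the robot's origin. The function name, the frame convention, and the 20-degree angle are hypothetical, since the captions give neither the rotation angles nor the pivot point.

```python
import math


def rotate_trajectory(waypoints, angle_deg):
    """Rotate (x, y) waypoints about the origin by angle_deg.

    Assumption: x is forward, y is left, so a positive angle is a
    counterclockwise rotation and a negative angle is clockwise.
    """
    theta = math.radians(angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in waypoints]


# Hypothetical executed trajectory: straight ahead for three meters.
executed = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
ccw = rotate_trajectory(executed, 20.0)   # counterclockwise alternative
cw = rotate_trajectory(executed, -20.0)   # clockwise alternative
```

With these conventions the counterclockwise candidate veers left (positive y) and the clockwise candidate veers right, while path length is preserved.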
