Learning a High-quality Robotic Wiping Policy Using Systematic Reward Analysis and Visual-Language Model Based Curriculum

Yihong Liu, Dongyeop Kang, Sehoon Ha

TL;DR

This work tackles the challenge of learning high-quality robotic wiping policies with reinforcement learning by formalizing wiping as a quality-critical MDP and showing the infeasibility of naive reward designs. It introduces a bounded reward formulation with concentric checkpoint regions and a Visual-Language Model (VLM) curriculum that dynamically adjusts reward weights during training. In MuJoCo-based experiments, the combined approach achieves near-perfect navigation success (up to 98%), improved force tracking around a target of $60\,\mathrm{N}$, and reduced Integral Absolute Error, demonstrating robust generalization across surfaces with varying curvature and friction. The methods promise to reduce manual reward engineering and enable more reliable policy learning for real-world wiping tasks, with future work on hardware validation and autonomous waypoint generation.
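As a rough illustration of the force-tracking metric mentioned above, Integral Absolute Error can be approximated from a sampled force trace as the time integral of the absolute deviation from the target. The sketch below is not taken from the paper: the function name, the sampling interval, and the sample trace are hypothetical, and the paper's exact definition (for example, any normalization) may differ. Only the 60 N target comes from the summary.

```python
from typing import Sequence

# Target contact force quoted in the summary; everything else here is assumed.
TARGET_FORCE_N = 60.0


def integral_absolute_error(
    forces: Sequence[float], dt: float, target: float = TARGET_FORCE_N
) -> float:
    """Trapezoidal approximation of the integral of |force - target| over time.

    forces: contact-force samples in newtons, taken every `dt` seconds.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    errors = [abs(f - target) for f in forces]
    if len(errors) < 2:
        return 0.0  # a single sample spans no time interval
    # Trapezoidal rule: interior samples get full weight, endpoints half.
    return dt * (sum(errors) - 0.5 * (errors[0] + errors[-1]))


# Hypothetical trace: the end-effector lands, overshoots slightly, then settles.
measured = [0.0, 30.0, 55.0, 60.0, 62.0, 60.0]
iae = integral_absolute_error(measured, dt=0.1)
print(f"IAE = {iae:.2f} N*s")
```

A lower value means the measured force stayed closer to the target for more of the episode, which is the sense in which the summary reports reduced Integral Absolute Error.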

Abstract

Autonomous robotic wiping is an important task in various industries, ranging from industrial manufacturing to sanitization in healthcare. Deep reinforcement learning (Deep RL) has emerged as a promising approach; however, it often demands repetitive reward engineering. Instead of relying on manual tuning, we first analyze the convergence of quality-critical robotic wiping, which requires both high-quality wiping and fast task completion, show that the problem converges poorly, and propose a new bounded reward formulation that makes it feasible. We then further improve the learning process by proposing a novel visual-language model (VLM) based curriculum, which actively monitors training progress and suggests hyperparameter adjustments. We demonstrate that the combined method can find a desirable wiping policy on surfaces with various curvatures, frictions, and waypoints, which cannot be learned with the baseline formulation. The demo of this project can be found at: https://sites.google.com/view/highqualitywiping.

Paper Structure

This paper contains 19 sections, 7 equations, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 1: The example trajectories of the learned wiping policy on surfaces with different curvatures and frictions.
  • Figure 2: Illustration of Checkpoint Regions.
  • Figure 3: Diagram of the proposed VLM algorithm, which simulates the human decision process in reward scale engineering.
  • Figure 4: Evaluation metrics on 2-points environments (line plots with standard error shadows). Force evaluations exclude episodes where the agent wiped repeatedly for the entire horizon without completion -- primarily in the unbounded reward environment -- to mitigate biased distributions. Each method is assessed over 50 episodes with 5 random seeds.
  • Figure 5: Examples of VLM-based curriculum adjustment based on training progress. Each performance segment includes navigation success rate, average landing pressure (up) and navigational pressure (down). The target pressure is 60 N.
  • ...and 1 more figure