Rewarding DINO: Predicting Dense Rewards with Vision Foundation Models

Pierre Krack, Tobias Jülg, Wolfram Burgard, Florian Walter

Abstract

Well-designed dense reward functions in robot manipulation not only indicate whether a task is completed but also encode progress along the way. Generally, designing dense rewards is challenging and usually requires access to privileged state information available only in simulation, not in real-world experiments. This makes reward prediction models that infer task state information from camera images attractive. A common approach is to predict rewards from expert demonstrations based on visual similarity or sequential frame ordering. However, this biases the resulting reward function towards a specific solution and leaves it undefined in states not covered by the demonstrations. In this work, we introduce Rewarding DINO, a method for language-conditioned reward modeling that learns actual reward functions rather than specific trajectories. The model's compact size allows it to serve as a direct replacement for analytical reward functions with comparatively low computational overhead. We train our model on data sampled from 24 Meta-World+ tasks using a rank-based loss and evaluate pairwise accuracy, rank correlation, and calibration. Rewarding DINO achieves competitive performance in tasks from the training set and generalizes to new settings in simulation and the real world, indicating that it learns task semantics. We also test the model with off-the-shelf reinforcement learning algorithms to solve tasks from our Meta-World+ training set.
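The abstract states only that training uses a rank-based loss; the exact form is not given here. As a minimal sketch, assuming a logistic (Bradley-Terry style) pairwise objective in which the label is the sign of the ground-truth reward difference, one such loss could look as follows. The function name `pairwise_ranking_loss` and the treatment of ties are illustrative assumptions, not the authors' implementation.

```python
import math


def pairwise_ranking_loss(pred_a, pred_b, reward_a, reward_b):
    """Logistic ranking loss for one pair of predicted rewards.

    Assumption (not from the paper): the target ordering is given by the
    sign of the ground-truth reward difference, and ties are skipped.
    """
    if reward_a == reward_b:
        return 0.0  # a tie carries no ranking signal in this sketch
    sign = 1.0 if reward_a > reward_b else -1.0
    # Positive margin means the prediction agrees with the true ordering.
    margin = sign * (pred_a - pred_b)
    # log(1 + exp(-margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# A correctly ordered pair is penalized less than a wrongly ordered one.
loss_correct = pairwise_ranking_loss(2.0, 0.0, reward_a=1.0, reward_b=0.0)
loss_wrong = pairwise_ranking_loss(0.0, 2.0, reward_a=1.0, reward_b=0.0)
```

A loss of this kind depends only on differences between predictions, so it constrains the ordering of states rather than the absolute scale of the predicted reward.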

Paper Structure (21 sections, 3 equations, 10 figures)

This paper contains 21 sections, 3 equations, 10 figures.

Figures (10)

  • Figure 1: We collect a dataset with multiple camera views and associated rewards across multiple tasks, formulate reward modeling as a pairwise preference learning problem, and recover dense shaped reward functions.
  • Figure 2: Model architecture overview. The text and image encoders are frozen. We train only the small head, which fuses language and image tokens. The inputs are two 512$\times$512 images and a textual goal description; the output is a single scalar.
  • Figure 3: Plot of the TCP Cartesian coordinates in the training dataset for the door-open task after binning. The histogram on the right shows the distribution of rewards in the sample.
  • Figure 4: All the scenes and objects encountered in the training set. The objects move around, and the tasks have variations. The training tasks are: assembly, disassemble, button-press-topdown, door-open, door-close, window-close, window-open, drawer-open, faucet-open, faucet-close, handle-press, handle-pull, handle-press-side, handle-pull-side, door-lock, door-unlock, stick-pull, stick-push, plate-slide, plate-slide-side, plate-slide-back, plate-slide-back-side, coffee-pull, coffee-push, and our simulated pick task.
  • Figure 5: Pairwise accuracy across all training tasks on newly collected trajectories. We calculate the pairwise accuracy stratified by reward difference. The y-axis is the likelihood that a pair of samples from the same task but potentially different episodes, using the same prompt and camera configuration, is ranked correctly, given a binned difference in ground-truth rewards on the x-axis.
  • ...and 5 more figures
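
The stratified pairwise accuracy described in the Figure 5 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation code: it assumes the two input lists already contain only samples from one task with the same prompt and camera configuration, and the function name `stratified_pairwise_accuracy` and the choice of bin edges are hypothetical.

```python
from itertools import combinations


def stratified_pairwise_accuracy(true_rewards, pred_rewards, bin_edges):
    """Return {bin_index: accuracy} for pairs binned by |true reward difference|.

    Bin k covers the half-open interval [bin_edges[k], bin_edges[k + 1]).
    Pairs with identical ground-truth rewards are skipped.
    """
    correct, total = {}, {}
    for i, j in combinations(range(len(true_rewards)), 2):
        true_diff = true_rewards[i] - true_rewards[j]
        if true_diff == 0:
            continue  # no defined ordering for this pair
        gap = abs(true_diff)
        bin_idx = next(
            (k for k in range(len(bin_edges) - 1)
             if bin_edges[k] <= gap < bin_edges[k + 1]),
            None,
        )
        if bin_idx is None:
            continue  # gap falls outside the requested bins
        pred_diff = pred_rewards[i] - pred_rewards[j]
        total[bin_idx] = total.get(bin_idx, 0) + 1
        # The pair is ranked correctly when both differences share a sign.
        if pred_diff * true_diff > 0:
            correct[bin_idx] = correct.get(bin_idx, 0) + 1
    return {k: correct.get(k, 0) / total[k] for k in sorted(total)}


true_rewards = [0.0, 0.1, 0.5, 1.0]
pred_rewards = [0.2, 0.1, 0.6, 0.9]
accuracy = stratified_pairwise_accuracy(true_rewards, pred_rewards, [0.0, 0.3, 1.01])
# accuracy == {0: 0.0, 1: 1.0}: the only small-gap pair is misordered,
# while all five large-gap pairs are ranked correctly.
```

Binning by reward difference separates easy comparisons (large gaps) from hard ones (small gaps), which a single aggregate accuracy would hide.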