[Re] FairDICE: A Gap Between Theory And Practice

Peter Adema; Karim Galliamov; Aleksey Evstratovskiy; Ross Geurts

TL;DR

Experiments extending the original paper show that FairDICE can scale to complex environments and high-dimensional rewards, though it can be reliant on (online) hyperparameter tuning. The experimental justification in the original paper requires significant revision.

Abstract

Offline Reinforcement Learning (RL) is an emerging field of RL in which policies are learned solely from demonstrations. Within offline RL, some environments involve balancing multiple objectives, but existing multi-objective offline RL algorithms do not provide an efficient way to find a fair compromise. FairDICE (see arXiv:2506.08062v2) seeks to fill this gap by adapting OptiDICE (an offline RL algorithm) to automatically learn weights for multiple objectives, for example to incentivise fairness among objectives. As this would be a valuable contribution, this replication study examines the replicability of the claims made regarding FairDICE. We find that many theoretical claims hold, but an error in the code reduces FairDICE to standard behaviour cloning in continuous environments, and many important hyperparameters were originally underspecified. After rectifying these issues, we show in experiments extending the original paper that FairDICE can scale to complex environments and high-dimensional rewards, though it can be reliant on (online) hyperparameter tuning. We conclude that FairDICE is a theoretically interesting method, but the experimental justification requires significant revision.

Paper Structure (45 sections, 5 equations, 18 figures, 5 tables)

Introduction
Background
Offline Reinforcement Learning
Multi-objective RL and FairDICE
Claims and extensions
Method
Environments
Datasets
Discrepancies in public FairDICE code
Incorrect policy loss for continuous environments
Additional gradient penalty on critic
Experimental setup and code
Results
Reproductions in discrete environments
Learning from a uniform-random policy in MO-Four-Rooms
...and 30 more sections

Figures (18)

Figure 1: Metrics on MO-FourRooms over a sweep of $\alpha$ and $\beta$ values.
Figure 2: Random MO-MDP metrics over a sweep of $\alpha$ and $\beta$ values.
Figure 3: D4MORL NSW boxplots for various losses and $\beta$, using 10 seeds with 100 evaluations per seed.
Figure 4: NSW scores and raw return trade-offs with Pareto frontiers on two multi-objective datasets. The 'Original' graphs in the second row are reproduced from Fig. 4 and Fig. 5 of the original FairDICE paper (kim2025fairdice) with permission.
Figure 5: FairDICE performance with/without reward normalisation and with piecewise-log for $u_i$, trained on Hopper and Walker2d from D4MORL. Boxplots are drawn using 10 seeds with 100 evaluations per seed.
...and 13 more figures
