Extending MONA in Camera Dropbox: Reproduction, Learned Approval, and Design Implications for Reward-Hacking Mitigation

Nathan Heath

Abstract

Myopic Optimization with Non-myopic Approval (MONA) mitigates multi-step reward hacking by restricting the agent's planning horizon while supplying far-sighted approval as a training signal~\cite{farquhar2025mona}. The original paper identifies a critical open question: how the method of constructing approval -- particularly the degree to which approval depends on achieved outcomes -- affects whether MONA's safety guarantees hold. We present a reproduction-first extension of the public MONA Camera Dropbox environment that (i)~repackages the released codebase as a standard Python project with scripted PPO training, (ii)~confirms the published contrast between ordinary RL (91.5\% reward-hacking rate) and oracle MONA (0.0\% hacking rate) using the released reference arrays, and (iii)~introduces a modular learned-approval suite spanning oracle, noisy, misspecified, learned, and calibrated approval mechanisms. In reduced-budget pilot sweeps across approval methods, horizons, dataset sizes, and calibration strategies, the best calibrated learned-overseer run achieves zero observed reward hacking but a substantially lower intended-behavior rate than oracle MONA (11.9\% vs.\ 99.9\%), consistent with under-optimization rather than re-emergent hacking. These results operationalize the MONA paper's approval-spectrum conjecture as a runnable experimental object and suggest that the central engineering challenge shifts from proving MONA's concept to building learned approval models that preserve sufficient foresight without reopening reward-hacking channels. Code, configurations, and reproduction commands are publicly available at https://github.com/codernate92/mona-camera-dropbox-repro.
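The contrast at the heart of the abstract can be stated as a pair of objectives. This is a minimal formal sketch following the MONA formulation of~\cite{farquhar2025mona}; the approval function $A$ and weight $\alpha$ are illustrative symbols chosen here, not notation from the released code.

\begin{align*}
\pi_{\text{RL}} &= \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{k \ge 0} \gamma^{k}\, r_{t+k}\right] && \text{(ordinary RL: optimize the full-horizon return)}\\
\pi_{\text{MONA}} &= \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\, r_{t} + \alpha\, A(s_{t}, a_{t}) \,\right] && \text{(MONA: myopic reward plus non-myopic approval)}
\end{align*}

Sensor tampering pays off only through future terms $r_{t+k}$, which the MONA objective never optimizes; tampering therefore earns the agent no credit unless the approval signal $A$ itself rewards it, which is why the construction of $A$ is the central question.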



Figures (5)

  • Figure 1: Taxonomy of AI alignment failure modes, highlighting where MONA and this extension sit. Blue-highlighted nodes indicate the specific failure modes addressed in this work: multi-step reward hacking (the core problem MONA mitigates) and learned-overseer fragility (the central question our extension investigates). The dotted link shows that Camera Dropbox bridges specification problems (reward hacking) and optimization problems (sensor manipulation / reward tampering). This taxonomy is simplified; real systems may exhibit multiple failure modes simultaneously, and categories interact in ways not fully captured by a tree structure.
  • Figure 2: Standard RL vs. MONA in Camera Dropbox. Under standard RL (left), the agent optimizes environment reward over the full horizon and learns sensor tampering (91.5% reward-hacking rate). Under MONA (right), myopic optimization restricts the agent's planning horizon while non-myopic approval provides a training signal aligned with intended behavior (0.0% hacking rate). Numbers are from the released public reference arrays~\cite{farquhar2025mona, heath2026repo}.
  • Figure 3: Conceptual approval-reward construction spectrum from Appendix B.3 of~\cite{farquhar2025mona}. Safer constructions (left) are more restrictive and less outcome-dependent; moving right toward achieved-outcome dependence degrades safety until MONA recreates ordinary RL (point 6). The bracket indicates the region addressed by this extension: oracle MONA, noisy and misspecified oracles, and learned outcome classifiers (a hypothetical interface sketch for these mechanisms follows this list).
  • Figure 4: Reproduction-first workflow. The project starts from the public MONA Camera Dropbox release, repackages it as a Python project, adds modular learned-approval mechanisms, and evaluates safety--capability tradeoffs across the approval-construction space.
  • Figure 5: Behavior-rate comparison across the three main conditions. The public reference preserves the original MONA contrast: ordinary PPO heavily reward-hacks while oracle MONA almost entirely performs the intended behavior. The best calibrated learned-overseer run from the reduced-budget local sweeps shows zero observed hacking but a far lower intended-behavior rate, consistent with under-optimization rather than capability recovery~\cite{heath2026repo}.
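To make the "modular learned-approval suite" concrete, the following is a hypothetical Python sketch of the kind of interface such a suite could expose. All names here (ApprovalModel, OracleApproval, NoisyApproval, CalibratedApproval, mona_step_reward) are invented for illustration and are not the API of mona-camera-dropbox-repro; the sketch assumes binary intended/unintended outcome labels.

# Hypothetical sketch of a modular approval suite (names invented for
# illustration; not the actual API of mona-camera-dropbox-repro).
from __future__ import annotations

import random
from dataclasses import dataclass
from typing import Protocol


class ApprovalModel(Protocol):
    """Maps a (state, action) pair to a scalar non-myopic approval signal."""

    def approve(self, state: tuple[int, ...], action: int) -> float: ...


@dataclass
class OracleApproval:
    """Ground-truth overseer: approves exactly the intended actions."""

    intended_actions: frozenset

    def approve(self, state: tuple[int, ...], action: int) -> float:
        return 1.0 if action in self.intended_actions else 0.0


@dataclass
class NoisyApproval:
    """Wraps another approval model and flips its signal with probability flip_prob."""

    base: ApprovalModel
    flip_prob: float = 0.1

    def approve(self, state: tuple[int, ...], action: int) -> float:
        signal = self.base.approve(state, action)
        return 1.0 - signal if random.random() < self.flip_prob else signal


@dataclass
class CalibratedApproval:
    """Thresholds a learned classifier's score into a binary approval signal."""

    classifier: ApprovalModel  # e.g. a learned outcome classifier
    threshold: float = 0.5

    def approve(self, state: tuple[int, ...], action: int) -> float:
        score = self.classifier.approve(state, action)
        return 1.0 if score >= self.threshold else 0.0


def mona_step_reward(env_reward: float, approval: float, weight: float = 1.0) -> float:
    """Per-step MONA training signal: myopic environment reward plus weighted approval."""
    return env_reward + weight * approval

Under this decomposition, swapping OracleApproval for NoisyApproval or CalibratedApproval changes only the approval term of the per-step training signal, which is what would let a sweep traverse the Figure 3 spectrum without touching the PPO training loop itself.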