Training AI to be Loyal
Sewoong Oh, Himanshu Tyagi, Pramod Viswanath
TL;DR
Problem: Open-source AI models lack reliable governance, ownership, and monetization; the paper proposes Open, Monetizable and Loyal models (OML) to reconcile openness with community governance.
Approach: It introduces the OML 1.0 protocol with embedded fingerprints and per-query permissions, and advances fingerprinting techniques (e.g., perinucleus sampling) along with alignment defenses such as bilevel optimization and adversarial training.
Contributions: A concrete pathway to open, monetizable, and loyal models; robust mechanisms for ownership, alignment, and control; and a governance framework based on smart contracts and stake-based incentives.
Significance: Enables permissionless yet community-governed AI ecosystems that can scale and sustain communal value while preventing backdoor alignment drift.
Abstract
Loyal AI is loyal to the community that builds it. An AI is loyal to a community if the community has ownership, alignment, and control. Community-owned models can only be used with the approval of the community and share the economic rewards communally. Community-aligned models have values that are aligned with the consensus of the community. Community-controlled models perform functions designed by the community. Since we would like permissionless access to the loyal AI's community, we need the AI to be open source. The key scientific question then is: how can we build models that are openly accessible (open source) and yet are owned and governed by the community? This seeming impossibility is the focus of this paper, where we outline a concrete pathway to Open, Monetizable and Loyal models (OML), building on our earlier work on OML, arXiv:2411.03887, and a representation via a cryptographic-ML library, http://github.com/sentient-agi/oml-1.0-fingerprinting.
