Training AI to be Loyal

Sewoong Oh, Himanshu Tyagi, Pramod Viswanath

TL;DR

Problem: Open-source AI models lack reliable governance, ownership, and monetization; the paper proposes Open, Monetizable and Loyal models (OML) to reconcile openness with community governance.

Approach: it introduces the OML 1.0 protocol with embedded fingerprints and per-query permissions, and advances fingerprinting techniques (e.g., perinucleus sampling) along with alignment defenses such as bilevel optimization and adversarial training.

Contributions: a concrete pathway to open, monetizable, and loyal models; robust mechanisms for ownership, alignment, and control; and a governance framework based on smart contracts and stake-based incentives.

Significance: enables permissionless yet community-governed AI ecosystems that can scale and sustain communal value while preventing backdoor alignment drift.

Abstract

Loyal AI is loyal to the community that builds it. An AI is loyal to a community if the community has ownership, alignment, and control. Community-owned models can only be used with the approval of the community and share the economic rewards communally. Community-aligned models have values that are aligned with the consensus of the community. Community-controlled models perform functions designed by the community. Since we would like permissionless access to the loyal AI's community, we need the AI to be open source. The key scientific question then is: how can we build models that are openly accessible (open source) and yet are owned and governed by the community? This seeming impossibility is the focus of this paper, where we outline a concrete pathway to Open, Monetizable and Loyal models (OML), building on our earlier work on OML, arXiv:2411.03887(1), and a representation via a cryptographic-ML library (http://github.com/sentient-agi/oml-1.0-fingerprinting).

Paper Structure

This paper contains 13 sections, 4 figures.

Figures (4)

  • Figure 1: A host initiates a download request under the OML 1.0 protocol and receives an OMLized model, $M$.oml, to be used in their services to external users.
  • Figure 2: Each user query, $q$, to the service needs to be accounted for under the Sentient protocol and this is ensured by requiring the host to obtain a signed permission string, $\sigma(q)$, from the Sentient platform. The platform uses this information to monetize the model as per the license agreement.
  • Figure 3: The prover's role is to check whether the host is using the OMLized model without signing with the platform as agreed upon; if so, the host faces a severe monetary penalty.
  • Figure 4: (Left) Performance measured by the OpenLLM benchmark as more fingerprints are added. Random tokens are scalable but out-of-distribution, hence easily detected. English Random uses randomly paired English phrases, which are in-distribution but not scalable. Perinucleus sampling is both in-distribution and scalable. (Right) Perinucleus sampling makes the fingerprints significantly more persistent against fine-tuning attacks. Fewer than 100 fingerprints survive a fine-tuning attack when randomly paired English phrases are used as fingerprints (labelled English Random), whereas 4000 fingerprints persist with Perinucleus-sampled fingerprints. Persistence is the fraction of fingerprints that survive the fine-tuning attack.
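
The per-query flow described in Figures 2 and 3 can be illustrated with a small sketch. This is not the OML 1.0 implementation: the captions do not specify the signature scheme, so a keyed HMAC from the Python standard library stands in for the signed permission string $\sigma(q)$, and every name here (`PLATFORM_KEY`, `sign_query`, `verify_permission`, `Platform`) is hypothetical. A real deployment would presumably use a public-key signature so that hosts cannot forge permissions.

```python
import hashlib
import hmac

# Hypothetical platform secret. The source does not specify the signature
# scheme; HMAC-SHA256 is only a stand-in for sigma(q).
PLATFORM_KEY = b"demo-platform-key"


def sign_query(query: str, key: bytes = PLATFORM_KEY) -> str:
    """Platform side: return a permission string sigma(q) for query q."""
    return hmac.new(key, query.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_permission(query: str, sigma: str, key: bytes = PLATFORM_KEY) -> bool:
    """Check that sigma is a valid permission string for this exact query."""
    return hmac.compare_digest(sign_query(query, key), sigma)


class Platform:
    """Toy platform: signs queries and records them for later accounting."""

    def __init__(self):
        # Queries for which a permission string was issued (assumed ledger).
        self.ledger = set()

    def request_permission(self, query: str) -> str:
        """Host calls this once per user query before serving it (Figure 2)."""
        self.ledger.add(query)
        return sign_query(query)

    def was_signed(self, query: str) -> bool:
        """Prover-side check (Figure 3): was permission obtained for query?"""
        return query in self.ledger


if __name__ == "__main__":
    platform = Platform()
    sigma = platform.request_permission("What is OML?")
    print(verify_permission("What is OML?", sigma))
    print(platform.was_signed("a query the host never reported"))
```

In this toy version the prover simply looks up whether a query was reported to the platform. How the prover detects an unreported query in practice (for example, through the embedded fingerprints mentioned in the TL;DR) is outside what this sketch models.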