Generative Artificial Intelligence Reproducibility and Consensus

Edward Kim, Isamu Isozaki, Naomi Sirkin, Michael Robson

TL;DR

The paper tackles the challenge of reproducibility in generative AI within a decentralized, trustless setting. It introduces a locality-sensitive hashing pipeline to verify outputs from image and language models, and analyzes how tolerance and consensus can be achieved with finite verifier sets. Through large-scale experiments across diffusion models and LLMs, it shows high image consensus (with minimal intra-class collisions) and near-perfect LLM consensus under deterministic decoding, while outlining probabilistic bounds for fraud detection in a distributed network. It also investigates and mitigates stochasticity in training (textual inversion) via controlled experiments and gossip-based synchronization, laying a practical foundation for verifiable GenAI in decentralized systems.

Abstract

We performed a billion locality-sensitive hash comparisons between artificially generated data samples to answer a critical question: can we reproduce the results of generative AI models? Reproducibility is one of the pillars of scientific research for verifiability, benchmarking, trust, and transparency. Furthermore, we take this research to the next level by verifying the "correctness" of generative AI output in a non-deterministic, trustless, decentralized network. We generate millions of data samples from a variety of open source diffusion and large language models and describe the procedures and trade-offs between generating more versus less deterministic output. Additionally, we analyze the outputs to provide empirical evidence of different parameterizations of tolerance and error bounds for verification. For our results, we show that with a majority vote between three independent verifiers, we can detect perceptual collisions in AI-generated images with over 99.89% probability and less than 0.0267% chance of intra-class collision. For large language models (LLMs), we are able to reach 100% consensus using greedy methods or n-way beam searches, as demonstrated on different LLMs. In the context of generative AI training, we pinpoint and minimize the major sources of stochasticity and present gossip and synchronization training techniques for verifiability. Thus, this work provides a practical, solid foundation for AI verification, reproducibility, and consensus for generative AI applications.
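The verification idea in the abstract (perceptual hashes compared within a tolerance, then a majority vote among three verifiers) can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the paper's pipeline: `average_hash` is a simplified aHash-style function over a toy 4x4 grayscale grid, `TOLERANCE` is a hypothetical threshold, and `majority_accepts` is a hypothetical helper name.

```python
import hashlib

def average_hash(pixels):
    """aHash-style bits: 1 where a pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]

def hamming(a, b):
    """Number of bit positions where two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))

def sha256_hex(pixels):
    """Cryptographic hash of the raw pixel bytes (exact-match only)."""
    return hashlib.sha256(bytes(p for row in pixels for p in row)).hexdigest()

def majority_accepts(claimed_hash, verifier_hashes, tolerance):
    """Accept if more than half of the verifiers land within the tolerance."""
    votes = sum(hamming(claimed_hash, h) <= tolerance for h in verifier_hashes)
    return votes > len(verifier_hashes) / 2

# Toy 4x4 grayscale "image" (illustrative values, not data from the paper).
base = [[10, 200, 30, 220],
        [15, 210, 25, 230],
        [12, 205, 28, 225],
        [11, 198, 33, 219]]
drift = [[p + 1 for p in row] for row in base]    # small numerical drift
other = [[255 - p for p in row] for row in base]  # a clearly different image

TOLERANCE = 2  # hypothetical Hamming tolerance for this 16-bit toy hash

# Three independent verifiers regenerate the image, two of them with drift.
verifiers = [average_hash(drift), average_hash(base), average_hash(drift)]

print(hamming(average_hash(base), average_hash(drift)))  # 0: perceptual hash is stable
print(sha256_hex(base) == sha256_hex(drift))             # False: exact hash changes
print(majority_accepts(average_hash(base), verifiers, TOLERANCE))   # True
print(majority_accepts(average_hash(other), verifiers, TOLERANCE))  # False
```

In this toy setup the one-unit drift leaves the perceptual hash unchanged while the SHA-256 digest differs, which is why a tolerance on Hamming distance, rather than an exact match, is the natural acceptance test for a verifier.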

Paper Structure

This paper contains 16 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Identical runs of the same seed and prompt to generative AI do not yield the same bit-level result, even on the same machine. Cryptographic hashes, which are designed for exact matches, exhibit the avalanche effect, while perceptual hashes (aHash, pHash, dHash, cHash) are more stable and exhibit locality-sensitive hashing.
  • Figure 2: Generated images from SD v1.5 on seven different GPUs (mix of 3060ti, 3070ti, 3080ti, and 3090) using the prompt, "A photo of {class}", where class is from the ImageNet dataset. The first three columns, class id (469) [caldron, cauldron], (361) [skunk, polecat], and (695) [padlock], have identical perceptual hashes. The last three rows, (168) [redbone], (555) [fire engine, fire truck], and (686) [oil filter], have perceptual hashes with the most extreme Hamming distances we observed in our generated data (up to 5). Visual differences can be seen in the writing, the color of the truck, and the circle in the oil filter.
  • Figure 3: Graphs of probabilities that independent verifiers can spot incorrect or fraudulent behavior given different likelihoods of deterministic generation. In the simple majority case, we assume a majority of nodes are honest, and in the super majority case, we have a stricter assumption that over two-thirds are honest.
  • Figure 4: Six major sources of stochasticity in the fine-tuning process of textual inversion. The randomness is minimized at each of these six points in order to gain verifiability in the training process. Even with control over these parameters via seeds or parameter settings, the process still remains non-deterministic.
  • Figure 5: Textual inversion training results from randomly selected objects in the sd concepts library (https://huggingface.co/sd-concepts-library). The graphs show six runs of 2000 iterations fine-tuning a new token based upon a small set of images (3-10). Three runs minimize the amount of stochasticity within the training process, "DeterR1,2,3". "StocHF0.5" includes a random horizontal flip of the image, "Shuffle" allows the dataset to be shuffled, and "Noseed" does not provide a random seed for the noise. The deterministic runs mostly overlap, with some drift, in the plot of the 768-dimensional vector projected onto the 2 principal components of the training data.
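
The kind of curve described in the Figure 3 caption can be approximated with a simple binomial model. The sketch below is not the paper's derivation: it assumes each of `n` verifiers independently reproduces a matching hash with probability `p`, and it asks how likely it is that a simple majority (more than half) or a super majority (more than two-thirds) of them agree. The function names are hypothetical.

```python
from math import comb

def prob_at_least(n, k, p):
    """P(at least k of n independent verifiers match), each with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def simple_majority_threshold(n):
    """Smallest count that is strictly more than half of n."""
    return n // 2 + 1

def super_majority_threshold(n):
    """Smallest count that is strictly more than two-thirds of n."""
    return (2 * n) // 3 + 1

n = 3  # three independent verifiers, as in the abstract
for p in (0.5, 0.9, 0.99):  # illustrative per-verifier match probabilities
    simple = prob_at_least(n, simple_majority_threshold(n), p)
    strict = prob_at_least(n, super_majority_threshold(n), p)
    print(f"p={p}: simple majority {simple:.4f}, super majority {strict:.4f}")
```

With three verifiers, a simple majority needs two matches while a super majority needs all three, so under this model the stricter rule demands a noticeably higher per-verifier determinism to reach the same agreement probability.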