Replicating ReLM Results: Validating Large Language Models with ReLM
Reece Adamson, Erin Song
TL;DR
ReLM reframes LLM evaluation by encoding desired response structures as regular expressions, compiling them into finite automata, and transforming them into LLM-specific token streams for constrained generation. Traversal over the LLM automata uses the additive cost of log probabilities, computed as $\log p(x_1, \ldots, x_n) = \sum_{i=1}^{n} \log p(x_i \mid x_1, \ldots, x_{i-1})$, with top-$k$ pruning to prioritize likely sequences. The authors reproduce memorization, zero-shot language understanding, and bias assessments, showing that ReLM can boost throughput, improve accuracy on constrained tasks, and enable principled bias measurement under controlled prompts. The work demonstrates a practical, systems-oriented approach to LLM evaluation, with potential extensions to handle misspellings via Levenshtein automata and broader deployment scenarios.
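The traversal described above can be illustrated with a minimal sketch. This is not the ReLM implementation: the automaton, the `toy_next_token_probs` lookup table (a hypothetical stand-in for an LLM), and the function names are all assumptions made for illustration. It shows a best-first search whose path cost is the negated sum of per-token log probabilities, with a transition followed only when its token falls within the model's top-$k$ candidates at that step.

```python
import heapq
import math

# Toy automaton for the pattern "the (cat|dog) sat": state -> {token: next_state}.
AUTOMATON = {
    0: {"the": 1},
    1: {"cat": 2, "dog": 2},
    2: {"sat": 3},
}
ACCEPTING = {3}


def toy_next_token_probs(prefix):
    """Hypothetical stand-in for an LLM: returns p(x_i | x_1, ..., x_{i-1})."""
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"cat": 0.5, "dog": 0.3, "bird": 0.2},
        ("the", "cat"): {"sat": 0.8, "ran": 0.2},
        ("the", "dog"): {"sat": 0.4, "ran": 0.6},
    }
    return table[tuple(prefix)]


def top_k(probs, k):
    """Keep only the k most probable next tokens."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def enumerate_matches(k):
    """Yield (tokens, log_prob) for accepted strings, most likely first."""
    # Heap entries are (cost, tokens, state), where cost = -log p(tokens).
    heap = [(0.0, (), 0)]
    while heap:
        cost, tokens, state = heapq.heappop(heap)
        if state in ACCEPTING:
            yield list(tokens), -cost
            continue
        allowed = top_k(toy_next_token_probs(tokens), k)
        for token, next_state in AUTOMATON.get(state, {}).items():
            if token in allowed:  # top-k pruning of automaton edges
                step_cost = -math.log(allowed[token])
                heapq.heappush(heap, (cost + step_cost, tokens + (token,), next_state))


for tokens, log_prob in enumerate_matches(k=2):
    print(" ".join(tokens), round(math.exp(log_prob), 3))
```

Because each step cost is non-negative, popping from the heap returns accepted strings in order of decreasing probability. With `k=1` in this toy setup, the "dog" branch is pruned and only "the cat sat" is produced.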
Abstract
Validating Large Language Models with ReLM explores the application of formal languages to evaluate and control Large Language Models (LLMs) for memorization, bias, and zero-shot performance. Current approaches for evaluating these types of behavior are often slow, imprecise, costly, or introduce biases of their own, yet such evaluation is necessary because these behaviors matter when LLMs are deployed in production. This project reproduces key results from the original ReLM paper and expands on the approach and its applications, with an emphasis on their relevance to the field of systems for machine learning.
