Replicating ReLM Results: Validating Large Language Models with ReLM

Reece Adamson, Erin Song

TL;DR

ReLM reframes LLM evaluation by encoding desired response structures as regular expressions, compiling them into finite automata, and transforming them into LLM-specific token streams for constrained generation. Traversal over the LLM automata uses the additive cost of log probabilities, computed as $\log p(x_1,\dots,x_n) = \sum_{i=1}^{n} \log p(x_i \mid x_1,\dots,x_{i-1})$, with top-$k$ pruning to prioritize likely sequences. The authors reproduce memorization, zero-shot language understanding, and bias assessments, showing that ReLM can boost throughput, improve accuracy on constrained tasks, and enable principled bias measurement under controlled prompts. The work demonstrates a practical, systems-oriented approach to LLM evaluation, with potential extensions to handle misspellings via Levenshtein automata and broader deployment scenarios.
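The traversal described above can be illustrated with a minimal sketch. It assumes a toy token automaton for the regular expression `The` and a made-up next-token probability table; the names `next_token_probs`, `AUTOMATON`, and `shortest_paths`, and all probability values, are hypothetical and are not taken from the ReLM implementation. Each edge costs $-\log p$, so path costs add up exactly as in the chain-rule formula, and edges whose token falls outside the model's top-$k$ are pruned.

```python
import heapq
import math

# Hypothetical toy "LLM": next-token probabilities given the token prefix.
# All numbers are invented for illustration only.
def next_token_probs(prefix):
    table = {
        (): {"The": 0.5, "Th": 0.3, "T": 0.15, "A": 0.05},
        ("Th",): {"e": 0.6, "is": 0.4},
        ("T",): {"he": 0.5, "h": 0.3, "o": 0.2},
        ("T", "h"): {"e": 0.7, "a": 0.3},
    }
    return table.get(tuple(prefix), {})

# Token automaton for the regex "The": state -> {token: next state}.
# A state is the number of characters matched so far; state 3 accepts.
AUTOMATON = {
    0: {"The": 3, "Th": 2, "T": 1},
    1: {"he": 3, "h": 2},
    2: {"e": 3},
}
ACCEPT = {3}

def sequence_log_prob(tokens):
    """Chain rule: log p(x_1..x_n) = sum_i log p(x_i | x_1..x_{i-1})."""
    return sum(
        math.log(next_token_probs(tokens[:i])[tok])
        for i, tok in enumerate(tokens)
    )

def shortest_paths(top_k):
    """Dijkstra-style traversal with edge cost -log p and top-k pruning.

    Returns accepted token sequences, most likely first, paired with
    their total log probability.
    """
    results = []
    heap = [(0.0, 0, ())]  # (accumulated cost, automaton state, tokens)
    while heap:
        cost, state, prefix = heapq.heappop(heap)
        if state in ACCEPT:
            results.append((prefix, -cost))
            continue
        probs = next_token_probs(prefix)
        # Keep only the k most probable next tokens under the toy model.
        allowed = set(sorted(probs, key=probs.get, reverse=True)[:top_k])
        for tok, nxt in AUTOMATON.get(state, {}).items():
            if tok in allowed:
                step = -math.log(probs[tok])
                heapq.heappush(heap, (cost + step, nxt, prefix + (tok,)))
    return results

if __name__ == "__main__":
    for tokens, logp in shortest_paths(top_k=2):
        print(tokens, round(logp, 3))
```

With `top_k=2` the edge for the token `T` is pruned at the start state, because it is only the third most probable token in this made-up table, so the tokenizations that begin with `T` are never explored. Raising `top_k` to 3 makes them reachable.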

Abstract

Validating Large Language Models with ReLM explores the application of formal languages to evaluate and control Large Language Models (LLMs) for memorization, bias, and zero-shot performance. Current approaches for evaluating these types of behavior are often slow, imprecise, costly, or introduce biases of their own, yet such evaluation is necessary because these behaviors matter when LLMs are put into production. This project reproduces key results from the original ReLM paper and expands on the approach and its applications, with an emphasis on their relevance to the field of systems for machine learning.

Paper Structure

This paper contains 8 sections, 1 equation, 8 figures, 3 tables.

Figures (8)

  • Figure 1: End-to-end ReLM process. Users define queries as regular expressions, which are parsed into a representative automaton. A transducer transforms this automaton into an LLM-specific representation based on the tokenization strategy employed by that LLM. ReLM can then traverse this LLM automaton to elicit LLM responses that conform to the user-specified regular expression. Kuchnik23
  • Figure 2: An LLM operates on an input sequence of tokens, which is used to generate conditional output probabilities for every token in its vocabulary. A token is selected from among the top-k most probable tokens and appended to the input sequence, and the process repeats to generate the output iteratively. gpt-viz
  • Figure 3: The LLM-specific finite automaton representing $The$ when tokens are present for each partition of the word. The number on each edge represents the log probability of the subsequent token being produced from that state. The dotted transitions represent paths that are unreachable when a top-k of 2 is used. The final state indicates a sequence of tokens that matches the regular expression. Kuchnik23
  • Figure 4: Reproduced experiment results illustrating cumulative validated URLs for 1000 samples over the first 5 minutes of execution. A URL is considered valid if it responds with a status code less than 400. Duplicate URLs are only counted once.
  • Figure 5: Reproduced throughput experiment results. The reproduced experiments use $\frac{1}{10}$ of the samples in the original experiment, but still show the same general performance trends and order-of-magnitude performance increases. ReLM outperforms all baseline configurations. Baseline (n=16) performs best among the baselines, likely because smaller values of n result in shorter and often duplicated URLs, while larger values of n produce lengthy token sequences that can result in nonsensical URLs.
  • ...and 3 more figures
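
The counting rule stated in the Figure 4 caption (a URL is valid if its status code is below 400, and duplicates count once) can be sketched as follows. This is an illustrative sketch only: the function name `cumulative_valid_urls`, the `(url, status)` input shape, the convention of `None` for a request that received no response, and the example URLs are all assumptions, not the authors' measurement code.

```python
def cumulative_valid_urls(samples):
    """Return the running count of unique valid URLs after each sample.

    `samples` is an iterable of (url, status_code) pairs in generation
    order. A status of None stands for a request with no response
    (an assumed convention for this sketch).
    """
    seen = set()
    counts = []
    for url, status in samples:
        # Valid means a response arrived with status code below 400.
        if status is not None and status < 400:
            seen.add(url)  # a set ignores duplicates automatically
        counts.append(len(seen))
    return counts

if __name__ == "__main__":
    # Hypothetical results; no network requests are made here.
    demo = [
        ("https://a.example", 200),
        ("https://a.example", 200),  # duplicate, not counted again
        ("https://b.example", 404),  # client error, not valid
        ("https://c.example", 301),  # redirect, below 400, valid
        ("https://d.example", None),  # no response, not valid
    ]
    print(cumulative_valid_urls(demo))
```

Plotting the returned counts against elapsed time would give a cumulative curve of the kind the caption describes.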