Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates

Avanika Narayan, Mayee F. Chen, Kush Bhatia, Christopher Ré

TL;DR

This work introduces Cookbook, a framework that programmatically generates training data consisting of simple patterns over random tokens. The approach is scalable and cost-effective, avoids legal and privacy issues, and fine-tuning on Cookbook-generated data improves performance on the corresponding task by up to 52.7 accuracy points.

Abstract

Fine-tuning large language models (LLMs) on instruction datasets is a common way to improve their generative capabilities. However, instruction datasets can be expensive and time-consuming to manually curate, and while LLM-generated data is less labor-intensive, it may violate user privacy agreements or terms of service of LLM providers. Therefore, we seek a way of constructing instruction datasets with samples that are not generated by humans or LLMs but still improve LLM generative capabilities. In this work, we introduce Cookbook, a framework that programmatically generates training data consisting of simple patterns over random tokens, resulting in a scalable, cost-effective approach that avoids legal and privacy issues. First, Cookbook uses a template -- a data generating Python function -- to produce training data that encourages the model to learn an explicit pattern-based rule that corresponds to a desired task. We find that fine-tuning on Cookbook-generated data is able to improve performance on its corresponding task by up to 52.7 accuracy points. Second, since instruction datasets improve performance on multiple downstream tasks simultaneously, Cookbook algorithmically learns how to mix data from various templates to optimize performance on multiple tasks. On the standard multi-task GPT4ALL evaluation suite, Mistral-7B fine-tuned using a Cookbook-generated dataset attains the best accuracy on average compared to other 7B parameter instruction-tuned models and is the best performing model on 3 out of 8 tasks. Finally, we analyze when and why Cookbook improves performance and present a metric that allows us to verify that the improvement is largely explained by the model's generations adhering better to template rules.
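
To make the notion of a template concrete, here is a minimal sketch of a data generating Python function in the spirit of the commonsense reasoning template described in the Figure 2 caption: two answer choices over random tokens, where the correct choice has the greater token overlap with the sentence. The function name, the `tok{i}` vocabulary, the prompt layout, and all default sizes are assumptions made for illustration and are not taken from the paper.

```python
import random


def commonsense_template(vocab_size=1000, sentence_len=12, choice_len=4, rng=None):
    """Hypothetical template: returns one formatted training sample.

    The rule the sample encodes: pick the answer choice that shares
    more tokens with the sentence.
    """
    rng = rng or random.Random()
    # Random "tokens" with no natural-language meaning (illustrative vocabulary).
    vocab = [f"tok{i}" for i in range(vocab_size)]

    # Input: a sentence of distinct random tokens.
    sentence = rng.sample(vocab, sentence_len)
    sentence_set = set(sentence)
    outside = [t for t in vocab if t not in sentence_set]

    # Correct choice is drawn from the sentence (full overlap);
    # the distractor is drawn from the rest of the vocabulary (no overlap).
    correct = rng.sample(sentence, choice_len)
    distractor = rng.sample(outside, choice_len)

    # Randomize which position holds the correct choice.
    choices = [correct, distractor]
    rng.shuffle(choices)
    answer = "A" if choices[0] is correct else "B"

    prompt = (
        f"Sentence: {' '.join(sentence)}\n"
        f"A: {' '.join(choices[0])}\n"
        f"B: {' '.join(choices[1])}\n"
        "Answer:"
    )
    return {"input": prompt, "output": answer}


if __name__ == "__main__":
    sample = commonsense_template(rng=random.Random(0))
    print(sample["input"], sample["output"])
```

Because the sample is produced by a function rather than by a person or an LLM, generating more data only requires calling the function again with fresh randomness.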

Paper Structure (52 sections, 2 theorems, 8 equations, 5 figures, 12 tables, 2 algorithms)

Key Result

Proposition 1

Define $A \in \mathbb{R}^{l \times m}$ where $A_{ij} = \mathrm{acc}(f_{G_{T_i}, n}, T^{\mathrm{eval}}_j)$. Let $\sigma_i = \exp\left(\frac{1}{m\eta} \sum_{j = 1}^m A_{ij}\right)$ for all $i \in [l]$. Then the $\bm{p}^\star$ that maximizes the objective (eq:obj) satisfies $p_i^\star = \frac{\sigma_i}{\sum_{k = 1}^l \sigma_k}$ for all $i \in [l]$.
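
The closed form in Proposition 1 is a softmax over each template's average accuracy across the $m$ evaluation tasks, scaled by $1/\eta$. A minimal sketch of computing it follows; the function name and the accuracy values in the usage example are made up for illustration and are not results from the paper.

```python
import math


def mixture_proportions(A, eta):
    """Compute p_i = sigma_i / sum_k sigma_k with
    sigma_i = exp((1 / (m * eta)) * sum_j A[i][j]).

    A[i][j] is the accuracy of the model tuned on template i,
    evaluated on task j (l rows, m columns).
    """
    m = len(A[0])
    scores = [sum(row) / (m * eta) for row in A]
    # Subtracting the max leaves every ratio unchanged but avoids overflow
    # in exp() when eta is small.
    shift = max(scores)
    sigma = [math.exp(s - shift) for s in scores]
    total = sum(sigma)
    return [s / total for s in sigma]


if __name__ == "__main__":
    # Two hypothetical templates, two hypothetical evaluation tasks.
    A = [[0.6, 0.8],
         [0.5, 0.5]]
    print(mixture_proportions(A, eta=0.1))
```

Read this way, $\eta$ behaves like a temperature: a small $\eta$ concentrates the mixture on the template with the highest average accuracy, while a large $\eta$ pushes the proportions toward uniform.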

Figures (5)

  • Figure 1: Cookbook. (1) Templates approximate a given task's "rule" and generate data consisting of patterns over random tokens. (2) Template-generated data can be mixed to improve multiple capabilities. (3) The template alignment statistic measures the extent to which the rule learned by the template is responsible for improving LLM performance.
  • Figure 2: Example templates (pseudocode). Templates construct the inputs, outputs, and then return a formatted sample. (Left) template for commonsense reasoning which generates two answer choices, where one choice (the answer) has a greater token overlap to the sentence. (Right) template for entity matching, which generates two entities which are labeled a match if their overlap exceeds a threshold.
  • Figure 3: Evaluating if the linearity assumption for data proportions holds empirically. Left: we measure the interpolation property, how often the mixture model has an accuracy in between the individual Cookbook-tuned models' accuracies. Right: we measure the mixture model deviation, the absolute difference between the mixture model's accuracy and the average of the individual Cookbook-tuned models' accuracies. Measurements are made across $6$ pairs of templates (over the $4$ templates used in Section \ref{sec:multitask}), $8$ GPT4ALL evaluation tasks, and the Mistral-7B base model.
  • Figure 4: Effects of pre-training on random token to NL generalization. Performance gains from Cookbook increase with longer pre-training, indicating that maturity of NL understanding is correlated with random-to-NL generalization.
  • Figure 5: Effects of training on more Cookbook data. Training on more Cookbook data slightly impairs performance in some, but not all cases.
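
The two measurements named in the Figure 3 caption can be sketched for a single pair of templates and a single evaluation task as below. The function names are hypothetical, the accuracies in the usage example are made up, and treating "in between" as inclusive of the endpoints is an assumption.

```python
def interpolation_holds(acc_mix, acc_a, acc_b):
    """True if the mixture model's accuracy lies between the accuracies of
    the two individually tuned models (endpoints included, by assumption)."""
    return min(acc_a, acc_b) <= acc_mix <= max(acc_a, acc_b)


def mixture_deviation(acc_mix, acc_a, acc_b):
    """Absolute difference between the mixture model's accuracy and the
    average of the two individually tuned models' accuracies."""
    return abs(acc_mix - (acc_a + acc_b) / 2)


if __name__ == "__main__":
    # Hypothetical accuracies for one template pair on one task.
    print(interpolation_holds(0.62, 0.60, 0.70))
    print(mixture_deviation(0.62, 0.60, 0.70))
```

Averaging the first quantity over all template pairs and tasks gives a rate for the interpolation property; the second can be summarized the same way.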

Theorems & Definitions (4)

  • Proposition 1
  • Definition 1
  • Proposition 1
  • proof