Adam's Law: Textual Frequency Law on Large Language Models

Hongyuan Adam Lu, Z. L., Victor Wei, Zefan Zhang, Zhao Hong, Qiqi Xiang, Bowen Cao, Wai Lam

Abstract

While textual frequency has been shown to influence human reading speed, its relevance to Large Language Models (LLMs) is seldom studied. We propose a novel research direction centered on textual data frequency, which is, to the best of our knowledge, an understudied topic. Our framework is composed of three units. First, we propose the Textual Frequency Law (TFL), which states that more frequent textual data should be preferred for LLMs in both prompting and fine-tuning. Since the training data of many LLMs is closed-source, we estimate sentence-level frequency from online resources and use an input paraphraser to rewrite the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD), which queries LLMs to perform story completion by extending the sentences in the datasets; the resulting corpora are used to adjust the initial frequency estimates. Finally, we propose Curriculum Textual Frequency Training (CTFT), which fine-tunes LLMs in increasing order of sentence-level frequency. Experiments on our curated Textual Frequency Paired Dataset (TFPD) cover math reasoning, machine translation, commonsense reasoning, and agentic tool calling. Results show the effectiveness of our framework.
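To make the frequency-estimation step concrete, below is a minimal sketch of how sentence-level frequency might be estimated from word-level frequencies, as Figure 1 describes. The `WORD_FREQ` table, the mean-log-frequency scorer, and `pick_most_frequent` are illustrative assumptions, not the paper's exact method; the paper estimates word frequencies from online resources whose details are not reproduced here.

```python
import math
import re

# Hypothetical word-level frequency table (relative frequencies); stands in
# for the word frequencies the paper estimates from online resources.
WORD_FREQ = {
    "the": 5.0e-2, "dog": 1.2e-4, "canine": 3.0e-6,
    "ran": 8.0e-5, "sprinted": 2.0e-6, "quickly": 4.0e-5,
}
MIN_FREQ = 1e-9  # floor for out-of-vocabulary words

def sentence_frequency(sentence: str) -> float:
    """Estimate sentence-level frequency as the mean log word frequency."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return math.log(MIN_FREQ)
    return sum(math.log(WORD_FREQ.get(w, MIN_FREQ)) for w in words) / len(words)

def pick_most_frequent(paraphrases: list[str]) -> str:
    """Among candidate paraphrases, keep the highest-frequency expression."""
    return max(paraphrases, key=sentence_frequency)

# Prefers the everyday phrasing over the rarer one, per TFL.
print(pick_most_frequent(["The canine sprinted.", "The dog ran quickly."]))
```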

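CTFT then orders the fine-tuning data by these estimates. Below is a minimal sketch of such a curriculum, reusing the hypothetical `sentence_frequency` scorer above and assuming a generic example type; the paper's actual scheduling and batching details are not given here.

```python
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    target: str

def ctft_order(examples: list[Example]) -> list[Example]:
    """Curriculum Textual Frequency Training (sketch): present examples in
    increasing order of estimated sentence-level frequency."""
    return sorted(examples, key=lambda ex: sentence_frequency(ex.prompt))

# Fine-tuning would then iterate over ctft_order(dataset) epoch by epoch.
```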

Paper Structure

This paper contains 48 sections, 4 theorems, 37 equations, 6 figures, and 21 tables.

Key Result

Theorem 1

Under Assumptions \ref{ass:zipf} and \ref{ass:approx}, the marginal token-level NLL loss of a token $x_k$ with frequency rank $r = r(x_k)$ satisfies
$$\ell^{\mathrm{m}}_\theta(x_k) = s \ln r + C + \eta_{x_k}, \qquad |\eta_{x_k}| \le \varepsilon(r),$$
where $s > 0$ is the Zipf exponent and $C = \ln Z > 0$. In the semi-log plane ($x$-axis: $\ln r$; $y$-axis: $\ell^{\mathrm{m}}_\theta$), the relationship is linear with slope $s$ and intercept $C$, within a rank-dependent error band of half-width $\varepsilon(r)$.
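For intuition, here is the one-line derivation of the slope and intercept under an exact Zipf law; this is a sketch in which the theorem's error term $\eta_{x_k}$ (bounded by $\varepsilon(r)$) absorbs all deviation from the idealized prior:

```latex
% Under a Zipf prior, the marginal probability of the rank-r token is
%   p(r) = r^{-s} / Z,   where   Z = \sum_{r'} (r')^{-s}   normalizes.
% Its negative log-likelihood is therefore affine in \ln r:
\[
  \ell^{\mathrm{m}}_\theta \approx -\ln p(r)
    = -\ln \frac{r^{-s}}{Z}
    = s \ln r + \ln Z
    = s \ln r + C .
\]
```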

Figures (6)

  • Figure 1: Top: A simplified example of a use case of the Textual Frequency Law, where the prompt contents are rephrased and the variants with higher frequency are selected. Middle: We achieve this by estimating sentence-level frequency from word-level frequencies. Bottom: A toy example showing the effectiveness of our framework. Real case studies are available in Appendix Figure \ref{fig:casetable}. Paraphrasing can lead to semantic drift, which is why human annotation is necessary in this process.
  • Figure 2: Overall accuracy of our proposed framework on the math-reasoning portion of TFPD. The high-frequency partition of TFPD yields higher accuracy than the low-frequency partition. High-frequency $\cap$ low-frequency denotes a model that is correct on both the low-frequency and the high-frequency partitions.
  • Figure 3: Performance of our proposed framework when using the high-frequency partition for translation. Results are reported for translating from English into other languages. Detailed numbers are given in Appendix Tables \ref{tab:language_metrics1}, \ref{tab:language_metrics2}, \ref{tab:language_metrics3}, \ref{tab:language_metrics4}, \ref{tab:language_metrics5}, and \ref{tab:language_metrics6}. Synonym is a baseline that randomly replaces words with higher-frequency rephrases using NLTK (https://www.nltk.org/); a sketch of such a baseline appears after this list.
  • Figure 4: Ablation study results of TFD on TFPD, compared on BLEU, chrF, and COMET. Bars show the winning percentages.
  • Figure 5: The relationship between performance percentage and the amount of data used for TFD. The performance improvement increases as more data is used.
  • ...and 1 more figure
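For reference, here is a rough sketch of a Synonym-style baseline as described in the Figure 3 caption. The replacement rule shown (pick a WordNet synonym with strictly higher frequency under a hypothetical `WORD_FREQ` table) is an assumption; the paper's exact rule is not specified here. Requires NLTK with the wordnet corpus downloaded.

```python
import random
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

# Hypothetical word-frequency lookup; stands in for the word-level
# frequencies the paper estimates from online resources.
WORD_FREQ = {"rapid": 9.0e-6, "speedy": 2.0e-5, "fast": 6.0e-5}

def freq(word: str) -> float:
    return WORD_FREQ.get(word.lower(), 0.0)

def synonym_baseline(sentence: str) -> str:
    """Randomly replace words with higher-frequency WordNet synonyms."""
    out = []
    for word in sentence.split():
        # Collect single-word synonyms with strictly higher frequency.
        candidates = {
            lemma.name()
            for synset in wn.synsets(word)
            for lemma in synset.lemmas()
            if "_" not in lemma.name() and freq(lemma.name()) > freq(word)
        }
        out.append(random.choice(sorted(candidates)) if candidates else word)
    return " ".join(out)

# May print e.g. "a speedy train" if WordNet lists it as a synonym.
print(synonym_baseline("a rapid train"))
```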

Theorems & Definitions (17)

  • Remark 1: Marginal vs. conditional loss
  • Remark 2: Strength and character of Assumption \ref{ass:approx}
  • Remark 3: Nature of $\eta_{x_k}$
  • Remark 4: Role of the training objective
  • Theorem 1: Token-Level Semi-Log Linearity
  • Proof
  • Remark 5: Semi-log vs. log-log
  • Theorem 2: Sufficient Condition for Strict Token-Level Monotonicity
  • Proof
  • Remark 6: When monotonicity fails
  • ...and 7 more