First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models

Naomi Saphra; Eve Fleisig; Kyunghyun Cho; Adam Lopez

TL;DR

The authors argue that disparities in scale are transient and researchers can work to reduce them; that data, rather than hardware, is still a bottleneck for many applications; that meaningful realistic evaluation is still an open problem; and that there is still room for speculative approaches.

Abstract

Many NLP researchers are experiencing an existential crisis triggered by the astonishing success of ChatGPT and other systems based on large language models (LLMs). After such a disruptive change to our understanding of the field, what is left to do? Taking a historical lens, we look for guidance from the first era of LLMs, which began in 2005 with large $n$-gram models for machine translation (MT). We identify durable lessons from the first era, and more importantly, we identify evergreen problems where NLP researchers can continue to make meaningful contributions in areas where LLMs are ascendant. We argue that disparities in scale are transient and researchers can work to reduce them; that data, rather than hardware, is still a bottleneck for many applications; that meaningful realistic evaluation is still an open problem; and that there is still room for speculative approaches.

Paper Structure (14 sections, 2 figures)

Introduction
Scale is supreme.
Follow the hardware.
Remember small-scale problems.
Evaluation is a bottleneck.
Improve the metrics.
There is no gold standard.
Specifying evaluation criteria is hard.
Individual preferences are inconsistent.
Disagreement isn't just noise.
Focus on concrete tasks.
Progress is not continuous.
Shape the hardware.
Conclusion: Do research.

Figures (2)

Figure 1: Results slide from Franz Och's keynote talk (Och, 2005) at the 2005 ACL Workshop on Building and Using Parallel Texts, a predecessor to the Conference on Machine Translation.
Figure 2: Figure from Kaplan et al. (2020) illustrating a power law relationship between dataset size and test loss for LLMs with varying numbers of parameters.
