Neural Normalized Compression Distance and the Disconnect Between Compression and Classification

John Hurwitz, Charles Nicholas, Edward Raff

TL;DR

This work develops a Neural NCD and compares LLMs to classic general-purpose compressors such as gzip. It finds that classification accuracy cannot be predicted from compression rate alone, and reports other empirical anomalies that current understanding does not predict.

Abstract

It is generally well understood that predictive classification and compression are intrinsically related concepts in information theory. Indeed, many deep learning methods are explained as learning a kind of compression, with the implication that better compression leads to better performance. We interrogate this hypothesis via the Normalized Compression Distance (NCD), which explicitly relies on compression as the means of measuring similarity between sequences and thus enables nearest-neighbor classification. By turning popular large language models (LLMs) into lossless compressors, we develop a Neural NCD and compare LLMs to classic general-purpose algorithms like gzip. In doing so, we find that classification accuracy is not predictable by compression rate alone, among other empirical aberrations not predicted by current understanding. Our results imply that what it means for a neural network to "compress" and what is needed for effective classification are not yet well understood.
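To make the NCD-based classifier concrete, the following sketch uses the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length, together with a nearest-neighbor vote. It is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`gzip_len`, `ncd`, `knn_predict`), the plain byte concatenation of the two sequences, and the toy training examples are all hypothetical. A Neural NCD would presumably replace `gzip_len` with a code length obtained from a language model; that part is not shown here.

```python
import gzip
from collections import Counter
from typing import Callable, List, Tuple


def gzip_len(data: bytes) -> int:
    """Compressed length in bytes; plays the role of C(x)."""
    return len(gzip.compress(data))


def ncd(x: bytes, y: bytes, clen: Callable[[bytes], int] = gzip_len) -> float:
    """Normalized Compression Distance between two byte sequences.

    `clen` is any function returning a compressed length, so a
    different compressor can be swapped in (assumption: sequences are
    joined by plain concatenation).
    """
    cx, cy = clen(x), clen(y)
    cxy = clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_predict(
    query: str,
    train: List[Tuple[str, str]],
    k: int = 1,
    clen: Callable[[bytes], int] = gzip_len,
) -> str:
    """Label `query` by majority vote among its k nearest training texts."""
    q = query.encode("utf-8")
    ranked = sorted(train, key=lambda ex: ncd(q, ex[0].encode("utf-8"), clen))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy few-shot training set (hypothetical examples).
    examples = [
        ("the team won the football match after scoring two late goals", "sports"),
        ("the central bank raised interest rates to curb inflation", "business"),
    ]
    print(knn_predict("the team won the football match after scoring three late goals", examples))
```

Because the compressor enters only through `clen`, the same distance and classifier can be evaluated with compressors of very different compression rates, which is the kind of comparison the abstract describes.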

Paper Structure

This paper contains 13 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Comparison of the RWKV 169M neural compressor and traditional compressors using $k$NN with NCD across the datasets AGNews, 20News, and DBpedia (NNCD = Neural NCD). Although the neural compressor achieves superior compression rates, we find cases where Neural NCD outperforms, underperforms, and performs on par with traditional compressors on the few-shot sequence classification task. This calls into question the hypothesis that the accuracy of NCD-based methods is predictable solely from compression rates.
  • Figure 2: Test accuracy plotted against compression rate (lower is better compression) for AGNews and DBpedia across different few-shot settings. Different shapes indicate different datasets, and each compressor has its own color. If compression rate and predictive performance were correlated, we would expect the points to follow a diagonal trend, but no such trend appears.
  • Figure 3: Comparison of RWKV 169M, GPT-2 117M, and OPT 125M as the neural compressors used for Neural NCD. The neural compressors outperform gzip by similar margins on AGNews and underperform it by similar margins on DBpedia.
  • Figure 4: Test accuracy difference when comparing Neural NCD to Euclidean distance on sequence latent representations, with 95% confidence intervals. Values above 0 indicate Neural NCD outperforming Euclidean distance. For RWKV, the representation is the final hidden state. For GPT-2 and OPT, we average the latent representations of all tokens. Although the models achieve comparable compression rates, the quality and usefulness of distances between latent representations vary widely across models.
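
The two sequence representations named in the Figure 4 caption (final hidden state for RWKV, token-averaged latents for GPT-2 and OPT) and the Euclidean baseline can be sketched as follows. This is only an illustration: the function names are hypothetical, the hidden states are toy lists of floats rather than real model outputs, and the caption does not say which layer the latents are taken from.

```python
import math
from typing import List, Sequence

Vector = List[float]


def last_state(hidden: Sequence[Vector]) -> Vector:
    """Use the hidden state at the final position as the sequence representation."""
    return list(hidden[-1])


def mean_pool(hidden: Sequence[Vector]) -> Vector:
    """Average the per-token hidden states into one vector."""
    n = len(hidden)
    dim = len(hidden[0])
    return [sum(h[i] for h in hidden) / n for i in range(dim)]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


if __name__ == "__main__":
    # Toy per-token hidden states for two short sequences (hypothetical values).
    seq_a = [[0.0, 0.0], [2.0, 4.0]]
    seq_b = [[1.0, 1.0], [3.0, 5.0]]
    print(euclidean(mean_pool(seq_a), mean_pool(seq_b)))
    print(euclidean(last_state(seq_a), last_state(seq_b)))
```

Either distance can stand in for NCD in the same nearest-neighbor classifier, which is how the caption's accuracy difference would be obtained.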