Perplexed: Understanding When Large Language Models are Confused

Nathan Cooper; Torsten Scholak

TL;DR

The paper introduces Perplexed, a library for per-token perplexity analysis of large language models that lets researchers diagnose where a model gets confused without training probes. Pairing Perplexed with CodeTokenizers, which aligns BPE tokens with Abstract Syntax Tree (AST) nodes, enables fine-grained analysis of code-generation LLMs at both the token and the structural level. In a case study on SantaCoder using a GPL-3.0-licensed subset of Python code, the authors show that the model performs worst on AST nodes where the code is not syntactically correct, and that internal method invocations are harder to predict than external ones: external invocations are easier by about $0.14$ in cross-entropy. The work highlights practical pitfalls in current code-generation LLMs and provides open-source tools for studying token-level perplexity, AST alignment, and invocation context, which can guide future improvements and benchmarking of code-focused language models.
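The per-token analysis described above rests on computing a cross-entropy value at every position and then averaging it per token. The following is a minimal sketch of that idea using only the Python standard library; it is not the Perplexed API, and the function names, the toy vocabulary, and the logits are all made up for illustration.

```python
import math
from collections import defaultdict


def token_cross_entropy(logits, target_index):
    """Cross-entropy (in nats) of one prediction: -log softmax(logits)[target]."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def average_loss_per_token(tokens, losses):
    """Group per-position losses by token string and average them."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for token, loss in zip(tokens, losses):
        totals[token] += loss
        counts[token] += 1
    return {token: totals[token] / counts[token] for token in totals}


# Toy data (hypothetical): three positions over a three-token vocabulary.
vocab = ["def", "(", "foo"]
logits_per_position = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 1.0]]
targets = [0, 1, 2]

losses = [token_cross_entropy(l, t) for l, t in zip(logits_per_position, targets)]
per_token = average_loss_per_token([vocab[t] for t in targets], losses)

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(sum(losses) / len(losses))
```

In this toy example the third position has uniform logits, so its loss is exactly log(3), the highest of the three. Ranking `per_token` by value is the kind of "best and worst token" view the case study reports.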

Abstract

Large Language Models (LLMs) have become dominant in the Natural Language Processing (NLP) field, causing a huge surge in progress in a short amount of time. However, their limitations are still a mystery and have primarily been explored through tailored datasets that analyze a specific human-level skill such as negation or name resolution. In this paper, we introduce perplexed, a library for exploring where a particular language model is perplexed. To show the flexibility of perplexed and the types of insights it can provide, we conducted a case study focused on LLMs for code generation, using an additional tool we built to help with the analysis of code models called codetokenizer. Specifically, we explore success and failure cases of code LLMs at the token level under different scenarios: the type of coding structure the model is predicting, e.g., a variable name or operator, and how predicting internal versus external method invocations impacts performance. From this analysis, we found that our studied code LLMs had their worst performance on coding structures where the code was not syntactically correct. Additionally, we found the models to generally perform worse at predicting internal method invocations than external ones. We have open sourced both of these tools to allow the research community to better understand LLMs in general and LLMs for code generation.
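To make the idea of aligning tokens with coding structures concrete, here is a rough sketch that maps character spans of tokens to the innermost AST node covering them. It uses Python's built-in `ast` module as a stand-in and is not how codetokenizer is implemented; the helper names and the hand-written token spans are assumptions for illustration, and ASCII source is assumed because `ast` reports column offsets in UTF-8 bytes.

```python
import ast


def iter_with_depth(node, depth=0):
    """Yield (node, depth) pairs for a node and all of its descendants."""
    yield node, depth
    for child in ast.iter_child_nodes(node):
        yield from iter_with_depth(child, depth + 1)


def innermost_node_type(source, start, end):
    """Return the type name of the deepest AST node covering source[start:end]."""
    tree = ast.parse(source)

    # Character offset at which each line of the source begins.
    line_starts = [0]
    for line in source.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))

    best_name, best_depth = None, -1
    for node, depth in iter_with_depth(tree):
        if getattr(node, "end_col_offset", None) is None:
            continue  # nodes such as Module or Load carry no position
        node_start = line_starts[node.lineno - 1] + node.col_offset
        node_end = line_starts[node.end_lineno - 1] + node.end_col_offset
        if node_start <= start and end <= node_end and depth > best_depth:
            best_name, best_depth = type(node).__name__, depth
    return best_name


# Hypothetical token spans (text, start, end) for a hello world program.
source = 'print("hello world")'
token_spans = [("print", 0, 5), ("(", 5, 6), ('"hello world"', 6, 19), (")", 19, 20)]
aligned = [(text, innermost_node_type(source, s, e)) for text, s, e in token_spans]
```

On Python 3.8 or later, `aligned` pairs `print` with `Name`, the string literal with `Constant`, and both parentheses with `Call`. Once every token carries a node type like this, per-token losses can be averaged per node type. Note that `ast.parse` raises `SyntaxError` on invalid code, so this stand-in cannot represent the syntactically incorrect structures the study discusses.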

Paper Structure (14 sections, 5 figures, 1 table)

This paper contains 14 sections, 5 figures, 1 table.

Introduction
Perplexed Overview
Implementation Details
Case Study: Analyzing LLMs for Code Generation
Data Collection
Experiments
Results and Discussion
Worst and Best BPE Tokens
Worst and Best AST Nodes
Internal vs. External Method Invocation
Related Work
LLM Evaluation Works & Tools
Evaluating LLMs for Code
Conclusion and Future Work

Figures (5)

Figure 1: Example of an AST representation for a hello world program
Figure 2: Examples of using Perplexed and CodeTokenizers.
Figure 3: Best and Worst performing BPE tokens with the y-axis being the average cross-entropy and the x-axis being the best or worst performing tokens in terms of their average cross-entropy
Figure 4: Best and Worst performing AST nodes
Figure 5: Word Clouds for Internal and External Method Names
