Statistical investigations into the geometry and homology of random programs

Jon Sporring, Ken Friis Larsen

TL;DR

The paper investigates how stochastic prompts to large language models influence the distribution of generated programs, modeling the set of programs as a metric space under tree-edit distances on abstract syntax trees. It characterizes random program configurations with geometric (geometric median, distance-based dispersion), statistical (Ripley’s $K$-function), and topological (persistent homology, Vietoris–Rips filtrations) analyses, none of which rely on low-dimensional embeddings. Applied to thresholding-based Python programs produced by ChatGPT-4 and TinyLlama, the framework shows that ChatGPT-4 outputs are more compact and clustered, whereas TinyLlama's dispersion grows with temperature. Topological descriptors yield limited insight compared to the $K$-function, which highlights both the promise and the current limits of applying topology to programming languages. The work suggests a path toward language-model diagnostics and prompt design grounded in geometric and topological summaries, with future work on improved distance definitions and scalable analyses for richer language spaces.
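The summaries named above need only a matrix of pairwise distances, which is why no embedding is required. The following is a minimal sketch under stated assumptions, not the authors' implementation: the matrix `dist` holds hypothetical values, the geometric median is restricted to the sample points themselves (a medoid), the $K$-function is replaced by an unnormalized stand-in (the fraction of ordered pairs within distance $r$), and only the connected-component part of persistent homology is computed, using the fact that components of a Vietoris–Rips filtration merge at the edge lengths of a minimum spanning tree.

```python
# Hypothetical symmetric distance matrix for four programs (illustrative values only).
dist = [
    [0, 2, 3, 10],
    [2, 0, 4, 11],
    [3, 4, 0, 9],
    [10, 11, 9, 0],
]


def medoid_index(dist):
    """Index of the sample point with the smallest summed distance to all others.

    Assumption: the geometric median is searched among the samples only.
    """
    return min(range(len(dist)), key=lambda i: sum(dist[i]))


def pair_fraction_within(dist, r):
    """Fraction of ordered pairs (i != j) with distance <= r.

    A K-function-like summary without intensity normalization (an assumption).
    """
    n = len(dist)
    close = sum(1 for i in range(n) for j in range(n)
                if i != j and dist[i][j] <= r)
    return close / (n * (n - 1))


def h0_death_scales(dist):
    """Scales at which connected components merge in a Vietoris-Rips filtration.

    Computed with Kruskal's algorithm: the merge scales are the minimum
    spanning tree edge lengths. Convention assumed here: an edge enters the
    filtration at scale d(i, j); some libraries use d(i, j) / 2 instead.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # this edge merges two components
            parent[ri] = rj
            deaths.append(d)
    return deaths


center = medoid_index(dist)
clustering_at_4 = pair_fraction_within(dist, 4)
deaths = h0_death_scales(dist)
```

In this toy matrix, three programs lie close together and one is far away, so the last merge scale is much larger than the first two; a tightly clustered sample would show small merge scales and a pair fraction that rises quickly with $r$.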

Abstract

AI-supported programming has taken giant leaps with tools such as Meta's Llama and OpenAI's ChatGPT. These are examples of stochastic sources of programs and have already greatly influenced how we produce code and teach programming. If we consider the input to such models as a stochastic source, a natural question is: what is the relation between the input and the output distributions, between the ChatGPT prompt and the resulting program? In this paper, we show how the relation between random Python programs generated from ChatGPT can be described geometrically and topologically using tree-edit distances between the programs' syntax trees and without explicit modeling of the underlying space. A popular approach to studying high-dimensional samples in a metric space is to use a low-dimensional embedding obtained by, e.g., multidimensional scaling. Such methods introduce errors that depend on the data and the dimension of the embedding space. In this article, we propose to restrict such projection methods to purely visualization purposes and instead use geometric summary statistics, methods from spatial point statistics, and topological data analysis to characterize the configurations of random programs without relying on embedding approximations. To demonstrate their usefulness, we compare two publicly available models, ChatGPT-4 and TinyLlama, on a simple problem related to image processing. Application areas include understanding how questions should be asked to obtain useful programs; measuring how consistently a given large language model answers; and comparing different large language models as programming assistants. Finally, we speculate that our approach may in the future give new insights into the structure of programming languages.
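To make the idea of distances between syntax trees concrete, the following sketch parses Python sources with the built-in `ast` module and compares the resulting trees. It is an illustration under assumptions, not the distance used in the paper: it implements a constrained, top-down (Selkow-style) edit distance with unit relabeling cost and subtree-size insertion and deletion costs, rather than a general tree-edit distance, and it labels nodes by AST node type only, so renaming a variable costs nothing. The three thresholding programs and all function names are hypothetical.

```python
import ast
from functools import lru_cache


def to_tree(node):
    """Convert an ast node into a hashable nested tuple (label, children)."""
    children = tuple(to_tree(child) for child in ast.iter_child_nodes(node))
    return (type(node).__name__, children)


def size(tree):
    """Number of nodes in a (label, children) tree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def top_down_distance(a, b):
    """Top-down ordered tree edit distance (a constrained variant).

    Roots are always matched (cost 1 if the labels differ); the child
    sequences are aligned by a string-edit recursion in which inserting or
    deleting a child costs the size of its whole subtree.
    """
    root_cost = 0 if a[0] == b[0] else 1
    xs, ys = a[1], b[1]
    # prev[j]: cost of aligning the children seen so far in xs with ys[:j].
    prev = [0]
    for y in ys:
        prev.append(prev[-1] + size(y))
    for x in xs:
        cur = [prev[0] + size(x)]
        for j, y in enumerate(ys, 1):
            cur.append(min(
                prev[j] + size(x),                      # delete subtree x
                cur[j - 1] + size(y),                   # insert subtree y
                prev[j - 1] + top_down_distance(x, y),  # match x with y
            ))
        prev = cur
    return root_cost + prev[-1]


def distance_matrix(sources):
    """Pairwise distances between the syntax trees of the given sources."""
    trees = [to_tree(ast.parse(src)) for src in sources]
    return [[top_down_distance(a, b) for b in trees] for a in trees]


# Hypothetical answers to a thresholding prompt.
programs = [
    "def f(img, t):\n    return img > t\n",
    "def g(image, thr):\n    return image > thr\n",
    "def h(img, t):\n    out = img.copy()\n    out[img <= t] = 0\n    return out\n",
]
dist = distance_matrix(programs)
```

Because only node types are compared, the first two programs have distance zero, while the third differs structurally from both. A matrix such as `dist` is all that distance-based summary statistics need as input.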

Paper Structure

This paper contains 12 sections, 5 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: The abstract syntax tree for the program "" as generated by the builtin ast module.
  • Figure 2: A much simplified diagram of Figure 1 for illustration purposes.
  • Figure 3: A microscope image of cells and its thresholded version. The circular features are cells, and thresholding mainly classifies the pixels into cell walls or not. Image courtesy of Karen Martinez and Gabriella von Scheel von Rosing, University of Copenhagen.
  • Figure 4: A summary of the ChatGPT-4 and TinyLlama figures in the Supplementary Materials section, showing the results of analyzing the responses from all the queries in a session. In the embeddings, the colors refer to the question numbers in the table of questions. ChatGPT-4's responses seem more clustered than TinyLlama's responses when the programs are projected into 2 dimensions using multidimensional scaling. The K-functions agree: ChatGPT-4's responses are more clustered than TinyLlama's, and TinyLlama's dispersion increases with temperature. In the persistence diagrams, red and blue dots mark the birth and death coordinates of the connected components and 1-cycles, respectively. For ChatGPT-4 their scales are smaller than for TinyLlama, whose dispersion increases with temperature. The log persistence diagrams additionally show the marginal histograms, and there is perhaps a tendency for the distribution of ChatGPT-4's 1-cycles to be more widely spread than TinyLlama's.
  • Figure 5: A program for adding 3 values.
  • ...and 6 more figures