GPTrace: Effective Crash Deduplication Using LLM Embeddings

Patrick Herter, Vincent Ahlrichs, Ridvan Açilan, Julian Horsch

TL;DR

Fuzzing yields massive amounts of crash data, so effective deduplication is needed to focus debugging effort. GPTrace introduces an embedding-based workflow that fuses multiple crash data sources (full and coarse stack traces and ASan reports) into LLM-derived vectors and clusters them with a density-based approach, grouping crashes by their underlying bugs. The authors provide a Python prototype for C/C++ targets that uses batch embeddings and dimensionality reduction, and they publicly release their artifacts. On a large Magma/MoonLight-derived dataset, GPTrace achieves high purity and inverse purity, outperforming stack-trace hashing and several state-of-the-art baselines while requiring less detailed execution data and offering per-target adaptability.
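The two metrics named above can be illustrated with a short sketch. It assumes the common textbook definitions (purity counts, per cluster, the items carrying that cluster's majority ground-truth label; inverse purity swaps the roles of clusters and labels), which may differ in detail from the paper's own formulation. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster.

    Assumed textbook definition, not taken from the paper.
    """
    counts = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        counts[cluster][label] += 1
    # For every cluster, keep only the size of its largest label group.
    return sum(max(c.values()) for c in counts.values()) / len(labels)


def inverse_purity(clusters, labels):
    """Purity with the roles of clusters and ground-truth labels swapped."""
    return purity(labels, clusters)


# Hypothetical toy data: six crashes, three predicted clusters, two real bugs.
predicted = [0, 0, 0, 1, 1, 2]
truth = ["A", "A", "B", "B", "B", "B"]

p = purity(predicted, truth)            # 5/6: one "B" crash sits in cluster 0
ip = inverse_purity(predicted, truth)   # 4/6: bug "B" is split over clusters
```

High purity means clusters rarely mix different bugs; high inverse purity means a single bug is rarely split across several clusters. A deduplication result needs both to be useful.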

Abstract

Fuzzing is a highly effective method for uncovering software vulnerabilities, but analyzing the resulting data typically requires substantial manual effort. This is amplified by the fact that fuzzing campaigns often find a large number of crashing inputs, many of which share the same underlying bug. Crash deduplication is the task of finding such duplicate crashing inputs and thereby reducing the data that needs to be examined. Many existing deduplication approaches rely on comparing stack traces or other information that is collected when a program crashes. Although various metrics for measuring the similarity of such pieces of information have been proposed, many do not yield satisfactory deduplication results. In this work, we present GPTrace, a deduplication workflow that leverages a large language model to evaluate the similarity of various data sources associated with crashes by computing embedding vectors and supplying those as input to a clustering algorithm. We evaluate our approach on over 300 000 crashing inputs belonging to 50 ground truth labels from 14 different targets. The deduplication results produced by GPTrace show a noticeable improvement over hand-crafted stack trace comparison methods and even more complex state-of-the-art approaches that are less flexible.
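The workflow described in the abstract (one embedding per data source, combined and passed to a clustering algorithm) can be sketched as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the small hand-written vectors stand in for real LLM embeddings, combining sources by concatenating normalised vectors is an assumed strategy, the dimensionality-reduction step is omitted, and a plain DBSCAN written in pure Python stands in for whichever density-based algorithm the paper uses. All names are hypothetical.

```python
import math
from collections import deque


def normalize(vec):
    """Scale a vector to unit length so every data source weighs equally."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combine(per_source_vectors):
    """Concatenate the normalised embeddings of one crash (assumed strategy)."""
    combined = []
    for vec in per_source_vectors:
        combined.extend(normalize(vec))
    return combined


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points, eps, min_samples):
    """Plain DBSCAN; returns one cluster id per point, -1 for noise."""
    noise = -1
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points))
                if distance(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = noise
            continue
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == noise:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # core point: keep growing the cluster
        cluster += 1
    return labels


# Toy stand-ins for (stack trace embedding, ASan report embedding) per crash.
crashes = [
    ([1.00, 0.00], [0.90, 0.10]),  # bug A
    ([0.98, 0.05], [0.88, 0.12]),  # bug A
    ([0.00, 1.00], [0.10, 0.90]),  # bug B
    ([0.05, 0.97], [0.12, 0.90]),  # bug B
]
vectors = [combine(sources) for sources in crashes]
cluster_ids = dbscan(vectors, eps=0.3, min_samples=2)
print(cluster_ids)  # [0, 0, 1, 1]
```

In a real setting the toy vectors would be replaced by embeddings returned from an embedding model, and `eps` and `min_samples` would need tuning per target, since suitable values depend on how tightly the crashes of one bug group together.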

Paper Structure

This paper contains 19 sections, 11 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of the GPTrace workflow.
  • Figure 2: Examples of data sources used by GPTrace. These were produced by the target char2svg.
  • Figure 3: Two-dimensional projections of the combined vectors for xmllint, obtained using scikit-learn's truncated singular value decomposition. The colors indicate the eight different ground truth labels.