Table of Contents
Fetching ...

Hey GPT-OSS, Looks Like You Got It -- Now Walk Me Through It! An Assessment of the Reasoning Language Models Chain of Thought Mechanism for Digital Forensics

Gaëtan Michelet, Janine Schneider, Aruna Withanage, Frank Breitinger

TL;DR

<3-5 sentence high-level summary>

Abstract

The use of large language models in digital forensics has been widely explored. Beyond identifying potential applications, research has also focused on optimizing model performance for forensic tasks through fine-tuning. However, limited result explainability reduces their operational and legal usability. Recently, a new class of reasoning language models has emerged, designed to handle logic-based tasks through an `internal reasoning' mechanism. Yet, users typically see only the final answer, not the underlying reasoning. One of these reasoning models is gpt-oss, which can be deployed locally, providing full access to its underlying reasoning process. This article presents the first investigation into the potential of reasoning language models for digital forensics. Four test use cases are examined to assess the usability of the reasoning component in supporting result explainability. The evaluation combines a new quantitative metric with qualitative analysis. Findings show that the reasoning component aids in explaining and validating language model outputs in digital forensics at medium reasoning levels, but this support is often limited, and higher reasoning levels do not enhance response quality.

Hey GPT-OSS, Looks Like You Got It -- Now Walk Me Through It! An Assessment of the Reasoning Language Models Chain of Thought Mechanism for Digital Forensics

TL;DR

<3-5 sentence high-level summary>

Abstract

The use of large language models in digital forensics has been widely explored. Beyond identifying potential applications, research has also focused on optimizing model performance for forensic tasks through fine-tuning. However, limited result explainability reduces their operational and legal usability. Recently, a new class of reasoning language models has emerged, designed to handle logic-based tasks through an `internal reasoning' mechanism. Yet, users typically see only the final answer, not the underlying reasoning. One of these reasoning models is gpt-oss, which can be deployed locally, providing full access to its underlying reasoning process. This article presents the first investigation into the potential of reasoning language models for digital forensics. Four test use cases are examined to assess the usability of the reasoning component in supporting result explainability. The evaluation combines a new quantitative metric with qualitative analysis. Findings show that the reasoning component aids in explaining and validating language model outputs in digital forensics at medium reasoning levels, but this support is often limited, and higher reasoning levels do not enhance response quality.

Paper Structure

This paper contains 38 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Chat template for the gpt-oss reasoning language model (simplified and arranged for readability). Green text represents the context submitted to the model (i.e., the system information). ROLE (sometimes referred to as the model identity) and REASONING LEVEL (low, medium, or high) are manually set, while DATE is automatically computed when the template is applied. Text depicted in blue shows the manually created user prompt. The orange text represents the model's inference (i.e., the text that the model will generate).
  • Figure 2: Averaged metric values for all experiments combined and for all experiments of each separate task. The 'general' line represents the average of every metric value for all the samples. In contrast, each task represents the average of each metric value for the samples of that particular task. The closer a metric value is to one, the better it is.
  • Figure 3: Mean of every metric value in general and for each task.
  • Figure 4: Scatter plot showing the score of the reasoning process (CoT) on the x-axis and the score of the final answer (final) on the y-axis for all experiments.