DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response

Bilel Cherif, Tamas Bisztray, Richard A. Dubniczky, Aaesha Aldahmani, Saeed Alshehhi, Norbert Tihanyi

TL;DR

DFIR-Metric introduces the first comprehensive, modular benchmark for evaluating Large Language Models in Digital Forensics and Incident Response, spanning knowledge assessment, dynamic CTF-style challenges, and NIST-based string-search tasks. It adds the Task Understanding Score ($TUS$) alongside traditional reliability metrics ($RS@k$, $TSR@k$, $Conf@k$) to capture partial correctness and multi-step reasoning in DFIR workflows. Across 14 models, the results reveal strong performance on certification-style knowledge questions but significant gaps in end-to-end forensic tasks, underscoring the need for careful attention to reliability and reproducibility when applying LLMs to high-stakes DFIR work. The framework, data, and baselines are openly published to enable reproducibility and community-driven extension, with the aim of accelerating safe and effective AI-assisted DFIR workflows.

Abstract

Digital Forensics and Incident Response (DFIR) involves analyzing digital evidence to support legal investigations. Large Language Models (LLMs) offer new opportunities in DFIR tasks such as log analysis and memory forensics, but their susceptibility to errors and hallucinations raises concerns in high-stakes contexts. Despite growing interest, there is no comprehensive benchmark to evaluate LLMs across both theoretical and practical DFIR domains. To address this gap, we present DFIR-Metric, a benchmark with three components: (1) Knowledge Assessment: a set of 700 expert-reviewed multiple-choice questions sourced from industry-standard certifications and official documentation; (2) Realistic Forensic Challenges: 150 CTF-style tasks testing multi-step reasoning and evidence correlation; and (3) Practical Analysis: 500 disk and memory forensics cases from the NIST Computer Forensics Tool Testing Program (CFTT). We evaluated 14 LLMs using DFIR-Metric, analyzing both their accuracy and consistency across trials. We also introduce a new metric, the Task Understanding Score (TUS), designed to more effectively evaluate models in scenarios where they achieve near-zero accuracy. This benchmark offers a rigorous, reproducible foundation for advancing AI in digital forensics. All scripts, artifacts, and results are available on the project website at https://github.com/DFIR-Metric.

Paper Structure

This paper contains 15 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Module I: Multiple-Choice Question Dataset Generation and LLM Evaluation
  • Figure 2: Module II: CTF-style Challenges Dataset Generation and LLM Evaluation
  • Figure 3: Module III: NIST String Search Dataset Generation and LLM Evaluation

Theorems & Definitions (4)

  • Definition: Reliability Score
  • Definition: Task Success Rate
  • Definition: Confidence Index
  • Definition: Task Understanding Score