FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis

Ekta Balkrishna Gavas, Sudipta Banerjee, Chinmay Hegde, Nasir Memon

TL;DR

FPBench introduces the first comprehensive benchmark for multimodal LLMs in the fingerprint domain, organizing eight fingerprint-focused tasks into a structured multiple-choice question (MCQ) framework across multiple datasets. The authors evaluate 20 MLLMs (18 open-source, 2 proprietary) under zero-shot and chain-of-thought prompting to probe visual understanding, spatial reasoning, and forensic-style reasoning (ACE-V) within fingerprint analysis. Several models achieve above-chance accuracy and tool retrieval reaches high performance, but many tasks, especially real/synthetic discrimination and ACE-V analysis, remain challenging, and chain-of-thought prompting offers only limited or task-dependent gains. The work identifies scaling trends, highlights domain-specific limitations, and proposes future directions (fine-tuning, tool-chaining, and interactive prompts) to advance foundation models for fingerprint forensics and biometrics.

Abstract

Multimodal LLMs (MLLMs) have gained significant traction in complex data analysis, visual question answering, generation, and reasoning. Recently, they have been used for analyzing the biometric utility of iris and face images. However, their capabilities in fingerprint understanding remain unexplored. In this work, we design a comprehensive benchmark, FPBench, that evaluates the performance of 20 MLLMs (open-source and proprietary) across 7 real and synthetic datasets on 8 biometric and forensic tasks using zero-shot and chain-of-thought prompting strategies. We discuss our findings in terms of performance and explainability, and share our insights into the challenges and limitations. We establish FPBench as the first comprehensive benchmark for fingerprint domain understanding with MLLMs, paving the way for foundation models for fingerprints.
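
As a rough illustration of how a multiple-choice benchmark of this kind can be scored, the sketch below computes per-task accuracy alongside the accuracy expected from uniform random guessing. It is not the authors' evaluation code: the record fields (`task`, `options`, `answer`, `prediction`), the option counts, and the toy rows are all hypothetical. Only the two task names are taken from the paper's figure captions.

```python
from collections import defaultdict


def score_mcq(records):
    """Return {task: (accuracy, chance_accuracy)} for a list of MCQ records.

    Each record is a dict with hypothetical keys:
      task       - task name
      options    - list of answer choices shown to the model
      answer     - the correct choice
      prediction - the choice the model selected
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    chance_sum = defaultdict(float)
    for r in records:
        task = r["task"]
        total[task] += 1
        correct[task] += int(r["prediction"] == r["answer"])
        # Uniform guessing over k options is right with probability 1/k.
        chance_sum[task] += 1.0 / len(r["options"])
    return {t: (correct[t] / total[t], chance_sum[t] / total[t]) for t in total}


# Toy records, invented for illustration only (not FPBench data).
records = [
    {"task": "real_vs_synthetic", "options": ["A", "B"], "answer": "A", "prediction": "B"},
    {"task": "real_vs_synthetic", "options": ["A", "B"], "answer": "B", "prediction": "B"},
    {"task": "tool_retrieval", "options": ["A", "B", "C", "D"], "answer": "C", "prediction": "C"},
    {"task": "tool_retrieval", "options": ["A", "B", "C", "D"], "answer": "A", "prediction": "A"},
]

scores = score_mcq(records)
for task, (acc, chance) in sorted(scores.items()):
    print(f"{task}: accuracy={acc:.2f} chance={chance:.2f}")
```

Reporting the chance level next to each accuracy makes "above-chance" claims comparable across tasks with different numbers of answer options.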

Paper Structure

This paper contains 20 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: FPBench: Overview of proposed benchmark for fingerprint analysis using MLLMs. We present examples of prompts curated for each task to evaluate the vision and language capabilities of MLLMs in fingerprint-based biometric and forensic tasks.
  • Figure 2: Distribution of questions across different categories and tasks in FPBench.
  • Figure 4: Accuracy (%) of all models across various fingerprint tasks, presented as a heat map. The 'tool_retrieval' task appears to be the best-performing task across a majority of the models, whereas all the models struggle to distinguish between real and generated fingerprint images on the 'real_vs_synthetic' task.
  • Figure 5: Accuracy (%) of top-5 best performing models along with mean performance across all tasks.
  • Figure 6: Performance (%) of top-5 best performing models across all tasks on zero-shot prompting.
  • ...and 2 more figures