FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis
Ekta Balkrishna Gavas, Sudipta Banerjee, Chinmay Hegde, Nasir Memon
TL;DR
FPBench introduces the first comprehensive benchmark for multimodal LLMs (MLLMs) in the fingerprint domain, organizing eight fingerprint-focused tasks into a structured multiple-choice question (MCQ) framework across multiple datasets. The authors evaluate 20 MLLMs (18 open-source, 2 proprietary) under zero-shot and chain-of-thought prompting to probe visual understanding, spatial reasoning, and forensic-style reasoning (ACE-V) within fingerprint analysis. Several models achieve above-chance accuracy and tool retrieval tasks reach high performance, but many tasks, especially real/synthetic discrimination and ACE-V analysis, remain challenging, and chain-of-thought prompting offers only limited or task-dependent gains. The work identifies scaling trends, highlights domain-specific limitations, and proposes future directions (fine-tuning, tool-chaining, and interactive prompts) to advance foundation models for fingerprint forensics and biometrics.
Abstract
Multimodal LLMs (MLLMs) have gained significant traction in complex data analysis, visual question answering, generation, and reasoning. Recently, they have been used for analyzing the biometric utility of iris and face images. However, their capabilities in fingerprint understanding remain unexplored. In this work, we design a comprehensive benchmark, FPBench, that evaluates the performance of 20 MLLMs (open-source and proprietary) across 7 real and synthetic datasets on 8 biometric and forensic tasks using zero-shot and chain-of-thought prompting strategies. We discuss our findings in terms of performance and explainability, and share our insights into the challenges and limitations. We establish FPBench as the first comprehensive benchmark for fingerprint domain understanding with MLLMs, paving the path for foundation models for fingerprints.
