Towards Zero-Shot Differential Morphing Attack Detection with Multimodal Large Language Models

Ria Shekhawat, Hailin Li, Raghavendra Ramachandra, Sushma Venkatesh

TL;DR

This work tackles differential morphing attack detection (D-MAD) in biometric security by leveraging multimodal large language models (LLMs). It proposes a framework that uses ChatGPT-4o and Gemini with Chain-of-Thought prompts to produce both a binary decision and a natural-language explanation when comparing a pair of facial images. A new morphing dataset with 54 subjects and three morph types (LMA, MIPGAN-2, PIPE) supports a structured evaluation protocol across 150 image pairs, with three independent inferences per pair and OR-fusion for the final decision, assessed via MACER, BPCER, and HTER. Key findings show that ChatGPT-4o generally achieves higher detection accuracy than Gemini, especially against GAN-based morphs, while Gemini offers more consistent explanations. The results underscore the potential of multimodal LLMs for D-MAD, but also the need for grounded, calibrated, and possibly human-in-the-loop systems for robust deployment.
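The evaluation protocol summarized above (three inferences per pair, OR-fusion, then MACER, BPCER, and HTER) can be sketched roughly as follows. This is a minimal illustration, not the authors' code. It assumes that OR-fusion flags a pair as a morph if any of the three runs flags it, that a failure to answer (`None`) counts as no vote, that MACER is the fraction of morph pairs accepted as bona fide, that BPCER is the fraction of bona fide pairs flagged as morphs, and that HTER is the mean of the two. The function names and the toy data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def or_fuse(votes: Sequence[Optional[bool]]) -> bool:
    """Fuse per-run decisions: True (morph) if any run said morph.

    Assumption: None (failure to answer) counts as no vote.
    """
    return any(v is True for v in votes)


def error_rates(is_morph: Sequence[bool],
                flagged: Sequence[bool]) -> Tuple[float, float, float]:
    """Return (MACER, BPCER, HTER) under the assumed definitions."""
    attack_flags = [f for m, f in zip(is_morph, flagged) if m]
    bona_fide_flags = [f for m, f in zip(is_morph, flagged) if not m]
    # MACER: morph pairs that were NOT flagged (missed attacks).
    macer = sum(not f for f in attack_flags) / len(attack_flags)
    # BPCER: bona fide pairs that WERE flagged (false alarms).
    bpcer = sum(bool(f) for f in bona_fide_flags) / len(bona_fide_flags)
    # HTER: assumed here to be the mean of the two error rates.
    hter = (macer + bpcer) / 2
    return macer, bpcer, hter


# Toy data (hypothetical): three runs for each of five image pairs.
runs = [
    [True, False, None],    # morph pair, caught by one run
    [False, False, False],  # morph pair, missed by all runs
    [None, None, True],     # morph pair, caught by one run
    [False, False, False],  # bona fide pair, correctly accepted
    [False, True, False],   # bona fide pair, falsely flagged
]
ground_truth = [True, True, True, False, False]

fused = [or_fuse(r) for r in runs]
macer, bpcer, hter = error_rates(ground_truth, fused)
print(f"MACER={macer:.3f} BPCER={bpcer:.3f} HTER={hter:.3f}")
```

Under these assumptions, OR-fusion trades a lower MACER for a higher BPCER, since a single positive vote is enough to flag a pair.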

Abstract

Leveraging the power of multimodal large language models (LLMs) offers a promising approach to enhancing the accuracy and interpretability of morphing attack detection (MAD), especially in real-world biometric applications. This work introduces the use of LLMs for differential morphing attack detection (D-MAD). To the best of our knowledge, this is the first study to apply multimodal LLMs to D-MAD using real biometric data. To effectively utilize these models, we design Chain-of-Thought (CoT)-based prompts to reduce failure-to-answer rates and enhance the reasoning behind decisions. Our contributions include: (1) the first application of multimodal LLMs for D-MAD using real data subjects, (2) CoT-based prompt engineering to improve response reliability and explainability, (3) comprehensive qualitative and quantitative benchmarking of LLM performance using data from 54 individuals captured in passport enrollment scenarios, and (4) a comparative analysis of two multimodal LLMs, ChatGPT-4o and Gemini, providing insights into their morphing attack detection accuracy and decision transparency. Experimental results show that ChatGPT-4o outperforms Gemini in detection accuracy, especially against GAN-based morphs, though both models struggle under challenging conditions. While Gemini offers more consistent explanations, ChatGPT-4o is more resilient but prone to a higher failure-to-answer rate.

Paper Structure

This paper contains 11 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: In a typical D-MAD scenario, facial images captured from a passport and an Automated Border Control (ABC) gate are used to extract facial features. These features are compared and analyzed to detect any signs of morphing.
  • Figure 2: Block diagram of the proposed D-MAD framework using multimodal Large Language Models (LLMs). The model receives a pair of facial images along with carefully designed prompts. These prompts guide the LLM to perform the detection task using a Chain-of-Thought (CoT) reasoning approach by providing structured visual and textual clues.
  • Figure 3: Example facial images corresponding to bona fide and three types of morphing employed in this work.
  • Figure 4: Kernel Density Estimate (KDE) plots illustrating the distribution of vulnerability scores for bona fide and morphed image comparisons. Results from ChatGPT-4o are shown in subfigures (a–c), and the corresponding plots for Gemini are depicted in subfigures (d–f).