Generalist Multimodal LLMs Gain Biometric Expertise via Human Salience

Jacob Piland, Byron Dowling, Christopher Sweet, Adam Czajka

Abstract

Iris presentation attack detection (PAD) is critical for secure biometric deployments, yet developing specialized models faces significant practical barriers: collecting data that represents future, unknown attacks is impossible, and collecting data that is diverse enough, yet still limited in its predictive power, is expensive. Additionally, sharing biometric data raises privacy concerns. Because the rapid emergence of new attack vectors demands adaptable solutions, we investigate whether general-purpose multimodal large language models (MLLMs) can perform iris PAD when augmented with human expert knowledge, operating under strict privacy constraints that prohibit sending biometric data to public cloud MLLM services. Through analysis of vision encoder embeddings applied to our dataset, we demonstrate that pre-trained vision transformers in MLLMs inherently cluster many iris attack types despite never being explicitly trained for this task. However, where clustering shows overlap between attack classes, we find that structured prompts incorporating human salience (verbal descriptions from subjects identifying attack indicators) enable these models to resolve ambiguities. Testing on an IRB-restricted dataset of 224 iris images spanning seven attack types, using only university-approved services (Gemini 2.5 Pro) or locally hosted models (e.g., Llama 3.2-Vision), we show that Gemini with expert-informed prompts outperforms both a specialized convolutional neural network (CNN)-based baseline and human examiners, while the locally deployable Llama achieves near-human performance. Our results establish that MLLMs deployable within institutional privacy constraints offer a viable path for iris PAD.

Paper Structure (30 sections, 7 figures, 3 tables)


Figures (7)

  • Figure 1: Experimental pipeline and paper contributions. We start with the generation of synthetic MESH salience from human annotations and corresponding image dataset. This salience, along with the control and human salience, is then combined with our novel prompts for testing. Results are obtained for both models to which we have ethical access and compared with a human and salience-guided CNN baseline.
  • Figure 2: Uniform Manifold Approximation and Projection (UMAP) visualization of iris samples encoded by (left) SigLIP vision-only embeddings and (right) SigLIP + Gemma multimodal embeddings using a simple binary prompt asking whether the iris is "real and healthy" or "synthetic/unhealthy." Despite never being trained specifically for iris PAD, SigLIP alone achieves partial separation of attack types. Adding even minimal semantic guidance through Gemma, however, yields a much clearer visual separation between live and spoof samples, with sharper cluster boundaries and reduced overlap between classes. This visual separation motivates our investigation into whether general-purpose MLLMs can address specialized biometric security tasks through appropriate prompting.
  • Figure 3: From left to right (with third-party dataset sources, where appropriate): live iris with no abnormalities [Kohli_BTAS_2016], StyleGAN2-generated sample, StyleGAN3-generated sample, iris wearing textured contact lens, then printed and re-captured in near-infrared light [Kohli_BTAS_2016], synthetic sample generated by a non-deep-learning-based algorithm [CASIA_Synth], diseased eye [Trokielewicz_BTAS_2015], glass prosthesis, post-mortem sample [Trokielewicz_TIFS_2019], iris printout [Czajka_MMAR_2013], iris wearing textured contact lens [Doyle_ICB_2013], and artificial eye [Kim_ESA_2016].
  • Figure 4: Authentic (left) and StyleGAN2-generated (right) iris images, along with the expert and non-expert descriptions ("Human Text Annotations" in Fig. 1).
  • Figure 5: Learning curve analysis of the eight Gemini experiments, showing that the MSE scores converge with fewer than 25 samples from each attack type.
  • ...and 2 more figures
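
The pipeline in Figure 1 combines human salience (verbal descriptions of attack indicators) with structured prompts before querying an MLLM. The following is a minimal sketch of what such prompt assembly might look like, not the authors' actual prompts: the `SalienceNote` structure, the `build_pad_prompt` helper, the example descriptions, and the 0-to-1 scoring scale are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SalienceNote:
    """One verbal description of an attack indicator (hypothetical structure)."""
    attack_type: str
    description: str


def build_pad_prompt(notes: list[SalienceNote]) -> str:
    """Assemble a text prompt that lists salience notes ahead of the PAD question."""
    lines = [
        "You are assisting with iris presentation attack detection (PAD).",
        "Human annotators described the following visual indicators:",
    ]
    # Number each note so the model can refer back to a specific indicator.
    for i, note in enumerate(notes, start=1):
        lines.append(f"{i}. [{note.attack_type}] {note.description.strip()}")
    # Assumed output format: a score in [0, 1]; the paper's format may differ.
    lines.append(
        "Using these indicators, rate the attached iris image from 0 (bona fide) "
        "to 1 (presentation attack) and briefly justify the score."
    )
    return "\n".join(lines)


# Placeholder descriptions written for this sketch, not taken from the dataset.
notes = [
    SalienceNote("textured contact lens", "Regular dot pattern over the iris texture."),
    SalienceNote("printout", "Flat appearance with visible printing artifacts."),
]
prompt = build_pad_prompt(notes)
print(prompt)
```

Keeping the salience notes as structured records rather than free text makes it straightforward to swap between control, human, and synthetic salience conditions while holding the rest of the prompt fixed.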