Physics-Based Benchmarking Metrics for Multimodal Synthetic Images

Kishor Datta Gupta, Marufa Kamal, Md. Mahfuzur Rahman, Fahad Rahman, Mohd Ariful Haque, Sunzida Siddique

TL;DR

PCMDE introduces a physics-constrained multimodal evaluation metric for synthetic images. It combines a custom CNN detector, multiple vision-language models, a confidence-weighted fusion module, and large-language-model reasoning with physics-guided rules to assess whether an image-caption pair satisfies structural and semantic constraints. The score $S_{final} \in [0,100]$ is derived from $S_{LLM}$ and $S_{rules}$ through $S_{final} = \tfrac{1}{2}(S_{LLM} + S_{rules})$, enabling PASS/FAIL decisions via thresholds $\tau$ and $\tau_c$ and generating interpretable diagnostics. On a domain of synthetic aircraft images, PCMDE shows higher discriminative power than embedding-based metrics and identifies specific component-level violations such as engine placement and wing counts. The approach supports domain-aware benchmarking and can be extended to other object classes and viewpoints, advancing grounded evaluation of next-generation generative models.
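The scoring rule in the TL;DR can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function and class names are hypothetical, the summary does not say what $\tau_c$ gates (it is assumed here to be a minimum confidence), and the use of `>=` for both comparisons is likewise an assumption. No default threshold values are supplied because none are given in this summary.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    score: float   # S_final on the 0-100 scale
    passed: bool   # PASS/FAIL decision


def final_score(s_llm: float, s_rules: float) -> float:
    """Return S_final = (S_LLM + S_rules) / 2, with both inputs in [0, 100]."""
    for name, value in (("s_llm", s_llm), ("s_rules", s_rules)):
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must lie in [0, 100], got {value}")
    return 0.5 * (s_llm + s_rules)


def decide(s_llm: float, s_rules: float, confidence: float,
           *, tau: float, tau_c: float) -> Verdict:
    """PASS when the score reaches tau and the confidence reaches tau_c.

    ASSUMPTION: tau_c is treated as a confidence threshold; the summary
    only states that decisions use thresholds tau and tau_c.
    """
    score = final_score(s_llm, s_rules)
    return Verdict(score=score, passed=score >= tau and confidence >= tau_c)


# Illustrative values only; they are not taken from the paper.
example = decide(80.0, 60.0, confidence=0.9, tau=65.0, tau_c=0.5)
```

Because the two sub-scores are averaged with equal weight, a low rule score can pull an otherwise plausible image below the threshold, which is consistent with the metric's stated goal of flagging structural violations.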

Abstract

Current state-of-the-art measures such as BLEU, CIDEr, VQA score, SigLIP-2, and CLIPScore often fail to capture semantic or structural accuracy, especially in domain-specific or context-dependent scenarios. To address this, the paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric that combines large language models with reasoning, knowledge-based mapping, and vision-language models. The architecture comprises three main stages: (1) extraction of spatial and semantic information as multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning using large language models to enforce structural and relational constraints (e.g., alignment, position, consistency).
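To make stage (2) concrete, the sketch below shows one plausible reading of confidence-weighted fusion: each source (for example the CNN detector or a VLM) reports a value for a component together with a confidence, and the value with the largest total confidence wins. The abstract does not specify the fusion rule, so the weighted vote, the function name `fuse_component`, and the input format are all assumptions made for illustration.

```python
from collections import defaultdict


def fuse_component(observations):
    """Fuse per-source readings of one component by confidence-weighted vote.

    observations: list of (value, confidence) pairs, confidence in [0, 1].
    Returns (fused_value, share), where share is the winning value's
    fraction of the total confidence mass.

    ASSUMPTION: a weighted vote stands in for the paper's fusion rule,
    which is not described in the abstract.
    """
    if not observations:
        raise ValueError("need at least one observation")
    weight = defaultdict(float)
    for value, confidence in observations:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence must lie in [0, 1], got {confidence}")
        weight[value] += confidence
    total = sum(weight.values())
    if total == 0.0:
        raise ValueError("all confidences are zero")
    best = max(weight, key=weight.get)
    return best, weight[best] / total


# Hypothetical readings of an engine count from three sources.
engine_count, share = fuse_component([(2, 0.9), (2, 0.6), (4, 0.5)])
```

In this example two sources agree on two engines and outweigh the single dissenting reading, so the fused count is 2. A downstream rule check could then compare that count against the caption.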

Paper Structure

This paper contains 27 sections, 13 equations, 4 figures, 3 tables, 2 algorithms.

Figures (4)

  • Figure 1: An aircraft image with engines flipped relative to the wings. Metrics such as CLIPScore, VQA Score, and SigLIP measure the similarity of image and text features, even though those features belong to different domains. Rather than comparing image and text directly, we use a rule-based approach that refines the final result using a VLM with LLM reasoning.
  • Figure 2: Overview of the proposed hybrid evaluation pipeline combining CNN-based detection, vision-language models, and LLM reasoning with physical consistency rules.
  • Figure 3: Sample images from both datasets: the left panel shows custom images used for training, and the right panel shows synthetic images used for evaluation.
  • Figure 4: Examples of PASS and FAIL configurations.