Physics-Based Benchmarking Metrics for Multimodal Synthetic Images
Kishor Datta Gupta, Marufa Kamal, Md. Mahfuzur Rahman, Fahad Rahman, Mohd Ariful Haque, Sunzida Siddique
TL;DR
PCMDE introduces a physics-constrained multimodal evaluation metric for synthetic images. It combines a custom CNN detector, multiple vision-language models, a confidence-weighted fusion module, and large-language-model reasoning with physics-guided rules to assess whether an image-caption pair satisfies structural and semantic constraints. The score $S_{final} \in [0,100]$ is derived from $S_{LLM}$ and $S_{rules}$ through $S_{final} = \tfrac{1}{2}(S_{LLM} + S_{rules})$, enabling PASS/FAIL decisions via thresholds $\tau$ and $\tau_c$ and generating interpretable diagnostics. On a domain of synthetic aircraft images, PCMDE shows higher discriminative power than embedding-based metrics and identifies specific component-level violations such as engine placement and wing counts. The approach supports domain-aware benchmarking and can be extended to other object classes and viewpoints, advancing grounded evaluation of next-generation generative models.
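As a minimal sketch of the scoring rule above, the following Python computes $S_{final}$ as the unweighted mean of $S_{LLM}$ and $S_{rules}$ and applies a PASS/FAIL threshold. The function names, the default value of `tau`, and the choice of `>=` for the comparison are assumptions made for illustration. The role of $\tau_c$ is not specified here, so the sketch leaves it out.

```python
def final_score(s_llm: float, s_rules: float) -> float:
    """Return S_final = (S_LLM + S_rules) / 2, with both inputs on a 0-100 scale."""
    for name, value in (("s_llm", s_llm), ("s_rules", s_rules)):
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must lie in [0, 100], got {value}")
    return 0.5 * (s_llm + s_rules)


def decide(s_final: float, tau: float = 70.0) -> str:
    """Map a final score to PASS/FAIL.

    tau = 70.0 is a hypothetical default, and treating a score equal to
    tau as a PASS is an assumption of this sketch.
    """
    return "PASS" if s_final >= tau else "FAIL"


if __name__ == "__main__":
    score = final_score(s_llm=80.0, s_rules=60.0)
    print(score, decide(score))
```

Because the two terms are weighted equally, a low rule-based score can be offset by a high LLM score and vice versa, so the threshold decision depends on both components together.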
Abstract
Current state-of-the-art measures such as BLEU, CIDEr, VQA score, SigLIP-2, and CLIPScore often fail to capture semantic or structural accuracy, especially in domain-specific or context-dependent scenarios. To overcome these limitations, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric that combines large language models with reasoning, knowledge-based mapping, and vision-language models (VLMs). The architecture comprises three main stages: (1) extraction of spatial and semantic multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning with large language models to enforce structural and relational constraints (e.g., alignment, position, consistency).
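The fusion stage is not specified in detail here, so the following is only an illustrative sketch of what a confidence-weighted vote over one component could look like. It assumes that each source (the detector or a VLM) reports a value for a component, such as an engine count, together with a confidence in $[0, 1]$. The function name `fuse_component`, the input format, and the voting rule are all hypothetical and should not be read as the paper's actual method.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Tuple


def fuse_component(
    observations: Iterable[Tuple[Hashable, float]],
) -> Tuple[Hashable, float]:
    """Fuse (value, confidence) pairs reported by several models.

    Returns the value with the largest summed confidence and its share of
    the total confidence mass. This voting rule is an assumption of the
    sketch, not a description of PCMDE's fusion module.
    """
    weights = defaultdict(float)
    for value, confidence in observations:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence must lie in [0, 1], got {confidence}")
        weights[value] += confidence

    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("need at least one observation with nonzero confidence")

    best = max(weights, key=weights.get)
    return best, weights[best] / total


if __name__ == "__main__":
    # Hypothetical engine-count reports from a detector and two VLMs.
    reports = [(2, 0.9), (2, 0.6), (4, 0.7)]
    value, support = fuse_component(reports)
    print(value, round(support, 3))
```

In this sketch, agreement between sources accumulates weight, so two moderately confident models can outvote a single dissenting one. The fused value and its support could then be passed to the rule-checking stage.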
