An Enhanced Large Language Model For Cross Modal Query Understanding System Using DL-KeyBERT Based CAZSSCL-MPGPT

Shreya Singh

TL;DR

The paper targets cross-modal query understanding by addressing echo chamber biases in image processing. It introduces an integrated pipeline that combines E-YOLO-based segmentation, object skeletonization, a CRKG-driven knowledge graph, FOA-based feature selection, and DL-KeyBERT word embeddings to feed into CAZSSCL-MPGPT, a cross-modal large language model with mixup regularization and zero-shot semantic consistency learning. Empirical results on COCO2017 and vqav2-val show near-perfect captioning and VQA performance (e.g., ~99% accuracy, high BLEU/METEOR, and low false rates), driven by components like PG-CLAHE for contrast, E-YOLO for segmentation, CRKG for bias mitigation, and strong cross-attention in CAZSSCL-MPGPT. The work demonstrates significant improvements in cross-modal captioning and VQA, with practical implications for search, AI assistants, and multimodal understanding in real-world settings, while highlighting future directions toward domain-specific adaptations.
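The summary names mixup regularization without showing what it does. As a minimal sketch, assuming the standard mixup formulation (a convex blend of two training examples and their labels, with the weight drawn from a Beta distribution), the operation looks like the following. The function names, the `alpha` default, and the toy vectors are illustrative assumptions. How CAZSSCL-MPGPT applies mixup internally is not stated here.

```python
import random


def mixup(x_a, y_a, x_b, y_b, lam):
    """Convexly blend two examples (features and one-hot labels) with weight lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    def blend(u, v):
        return [lam * p + (1.0 - lam) * q for p, q in zip(u, v)]

    return blend(x_a, x_b), blend(y_a, y_b)


def sample_lambda(alpha=0.2, rng=random):
    """Draw the mixing weight from Beta(alpha, alpha); alpha=0.2 is an assumed default."""
    return rng.betavariate(alpha, alpha)


# Toy example: blend two 2-dimensional feature vectors and their one-hot labels.
x_mixed, y_mixed = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 1.0], lam=0.25)
print(x_mixed, y_mixed)
```

Training on such blended pairs discourages a model from memorizing individual examples, which is the usual motivation for using mixup as a regularizer.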

Abstract

Large Language Models (LLMs) are deep-learning models designed to understand and generate human language. Paired with models that process other data, such as images, they enable cross-modal understanding. However, existing approaches often suffer from the echo chamber effect, in which redundant visual patterns reduce model generalization and accuracy. To address this limitation, this work proposes an enhanced LLM-based framework for cross-modal query understanding using DL-KeyBERT-based CAZSSCL-MPGPT. The collected dataset consists of pre-processed images and texts. The pre-processed images undergo object segmentation using Easom-You Only Look Once (E-YOLO). Object skeletons are then generated, and a knowledge graph is built using the Conditional Random Knowledge Graph (CRKG) technique. Next, features are extracted from the knowledge graph, the generated skeletons, and the segmented objects, and the optimal features are selected using the Fossa Optimization Algorithm (FOA). In parallel, the text is converted to word embeddings using DL-KeyBERT. Finally, the cross-modal query understanding system uses CAZSSCL-MPGPT to generate accurate and contextually relevant textual descriptions of images. The proposed CAZSSCL-MPGPT achieved an accuracy of 99.14187362% on the COCO 2017 dataset and 98.43224393% on the vqav2-val dataset.
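The stage ordering described in the abstract can be sketched as a simple dataflow check. This is only an illustration of which stage feeds which. The stage names come from the abstract, while the input and output keys (`segments`, `graph`, and so on) are hypothetical labels. The choice of segmented objects as the input to both skeleton generation and the knowledge graph is an assumption, since the abstract does not say. No actual model is run.

```python
# Each entry: (stage name, inputs consumed, output produced).
# Stage names follow the abstract; the input/output keys are hypothetical labels.
STAGES = [
    ("E-YOLO segmentation", ["image"], "segments"),
    ("skeleton generation", ["segments"], "skeletons"),    # assumed input
    ("CRKG knowledge graph", ["segments"], "graph"),       # assumed input
    ("feature extraction", ["graph", "skeletons", "segments"], "features"),
    ("FOA feature selection", ["features"], "selected"),
    ("DL-KeyBERT embedding", ["text"], "embedding"),
    ("CAZSSCL-MPGPT", ["selected", "embedding"], "description"),
]


def trace_dataflow(stages, initial):
    """Walk the stages in order, failing if a stage needs data not yet produced."""
    available = set(initial)
    produced = []
    for name, inputs, output in stages:
        missing = [key for key in inputs if key not in available]
        if missing:
            raise ValueError(f"{name} is missing inputs: {missing}")
        available.add(output)
        produced.append(output)
    return produced


# The image branch and the text branch meet only at the final stage.
order = trace_dataflow(STAGES, initial=["image", "text"])
print(order)
```

Dropping `"text"` from `initial` makes the check fail at the DL-KeyBERT stage, which reflects that the final model consumes both the selected visual features and the text embedding.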

Paper Structure

This paper contains 41 sections, 2 theorems, 48 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Structural Diagram of the Proposed Framework
  • Figure 2: Diagrammatic Representation of the Proposed CAZSSCL-MPGPT
  • Figure 3: Analysis of MAE, MSE, and RMSE.
  • Figure 4: IOU Analysis
  • Figure 5: Analysis of Graph Generation Time
  • ...and 3 more figures

Theorems & Definitions (7)

  • Theorem 1: Theorem subhead
  • Proposition 2
  • Example 1
  • Remark 1
  • Definition 1: Definition sub head
  • proof
  • proof: Proof of Theorem 1