
ARPA: A Novel Hybrid Model for Advancing Visual Word Disambiguation Using Large Language Models and Transformers

Aristi Papastavrou, Maria Lymperaiou, Giorgos Stamou

TL;DR

This paper tackles Visual Word Sense Disambiguation (VWSD) by embedding textual and visual inputs in a joint multimodal space for cross-modal reasoning. ARPA combines a language-model-based text encoder, a Swin Transformer visual encoder, and a Graph Convolutional Network (GCN) that refines the multimodal representations and captures relational structure across modalities. Key contributions include a GCN-based fusion mechanism, a preprocessing and augmentation pipeline, and ablations on SemEval 2023 Task 1 showing that RoBERTa+Swin+GCN is the practical configuration, with LLaMA offering only marginal gains at higher compute cost. On this benchmark, ARPA reports state-of-the-art accuracy and mean reciprocal rank (MRR), which the authors present as evidence of a robust, scalable approach to multimodal word sense disambiguation with potential for real-world vision-language applications.
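
The two reported metrics, accuracy and MRR, can be illustrated with a minimal sketch. It assumes each VWSD instance yields one score per candidate image plus the index of the gold image; the function names and the toy scores are hypothetical and are not taken from the paper. Ties are resolved optimistically, since only strictly higher scores push the gold candidate down the ranking.

```python
def rank_of_gold(scores, gold_index):
    """Return the 1-based rank of the gold candidate (higher score = better)."""
    gold_score = scores[gold_index]
    # Only candidates scoring strictly higher than the gold one outrank it.
    return 1 + sum(1 for s in scores if s > gold_score)


def accuracy_and_mrr(all_scores, gold_indices):
    """Compute top-1 accuracy and mean reciprocal rank over all instances."""
    ranks = [rank_of_gold(s, g) for s, g in zip(all_scores, gold_indices)]
    accuracy = sum(1 for r in ranks if r == 1) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    return accuracy, mrr


# Hypothetical similarity scores for three instances.
toy_scores = [
    [0.9, 0.1, 0.3],       # gold at index 0 is ranked 1st
    [0.2, 0.8, 0.5],       # gold at index 2 is ranked 2nd
    [0.4, 0.6, 0.7, 0.1],  # gold at index 0 is ranked 3rd
]
toy_gold = [0, 2, 0]

acc, mrr = accuracy_and_mrr(toy_scores, toy_gold)
print(f"accuracy={acc:.4f}, MRR={mrr:.4f}")
```

Accuracy credits only instances where the gold image is ranked first, while MRR also gives partial credit (1/2, 1/3, ...) for near misses, which is why the two figures can diverge.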

Abstract

In the rapidly evolving fields of natural language processing and computer vision, Visual Word Sense Disambiguation (VWSD) stands as a critical yet challenging task. The quest for models that can seamlessly integrate and interpret multimodal data is more pressing than ever. Imagine a system that can understand language with the depth and nuance of human cognition, while simultaneously interpreting the rich visual context of the world around it. We present ARPA, an architecture that fuses the contextual understanding of large language models with the advanced feature extraction capabilities of transformers, whose outputs then pass through a custom Graph Neural Network (GNN) layer to learn intricate relationships and subtle nuances within the data. This architecture not only sets a new benchmark in visual word disambiguation but also introduces a versatile framework for combining linguistic and visual data by harnessing the complementary strengths of its components, with the aim of robust performance even in complex disambiguation scenarios. Through a series of experiments and comparative analysis, we show the substantial advantages of our model, underscoring its potential to redefine standards in the field. Beyond its architectural design, our model benefits from experimental enrichments, including sophisticated data augmentation and multi-modal training techniques. ARPA's introduction marks a significant milestone in visual word disambiguation, offering a compelling solution that bridges the gap between linguistic and visual modalities. We invite researchers and practitioners to explore the capabilities of our model, envisioning a future where such hybrid models drive further advances in artificial intelligence.
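
To make the graph step more concrete, here is a minimal sketch of a single graph-convolution layer. The abstract does not specify the custom GNN layer, so this uses the standard GCN propagation rule (symmetric normalization with self-loops, then a linear map and ReLU) as a stand-in. The graph layout (one text node linked to two candidate-image nodes), the feature values, and all function names are assumptions for illustration only.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalized_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj, features, weight):
    """One layer: ReLU(normalized_adjacency @ features @ weight)."""
    propagated = matmul(normalized_adjacency(adj), features)
    projected = matmul(propagated, weight)
    return [[max(0.0, v) for v in row] for row in projected]


# Hypothetical graph: node 0 is the text embedding, nodes 1 and 2 are image embeddings.
adj = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
features = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]  # toy 2-dimensional embeddings
weight = [[1.0, 0.0], [0.0, 1.0]]                 # identity instead of learned weights

refined = gcn_layer(adj, features, weight)
for node, row in enumerate(refined):
    print(node, [round(v, 4) for v in row])
```

After one layer, each image node's representation mixes in the text node's features and vice versa, which is the sense in which a graph layer can capture relations across modalities. In practice the weights would be learned and the input features would come from the text and visual encoders.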

Paper Structure

This paper contains 24 sections, 2 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: ARPA Architecture
  • Figure 2: MRR measurements for different published Models
  • Figure 3: Accuracy Comparison of Different Enrichment Techniques