EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning

Mingjie Ma; Zhihuan Yu; Yichao Ma; Guohui Li

TL;DR

This work tackles Visual Commonsense Reasoning (VCR), where models must both answer complex visual questions and provide justifications, requiring external knowledge and precise cross-modal grounding. It introduces EventLens, which combines Event-Aware Pretraining to cultivate dynamic scene understanding with a Cross-modal Local Linking mechanism that ties textual references to specific image regions, aided by instruct-style prompts and adapters. The approach achieves strong results on the VCR dataset, outperforming many task-specific models and remaining competitive with Vision-Language Transformer baselines while keeping trainable parameters modest. This demonstrates a scalable path for integrating large language models with fine-grained visual reasoning and suggests practical benefits for multimodal AI systems with limited fine-tuning resources.

Abstract

Visual Commonsense Reasoning (VCR) is a cognitive task that challenges models to answer visual questions requiring human commonsense and to provide rationales explaining why the answers are correct. With the emergence of Large Language Models (LLMs), it is natural and imperative to explore their applicability to VCR. However, the VCR task demands more external knowledge to tackle its challenging questions, necessitating special designs to activate LLMs' commonsense reasoning abilities. Moreover, most existing Multimodal LLMs adopt an abstraction of the entire input image, which makes it difficult to comprehend VCR's unique co-reference tags between image regions and text and poses challenges for fine-grained alignment. To address these issues, we propose EventLens, which leverages Event-Aware Pretraining and Cross-modal Linking and EnhanceS VCR. First, by emulating the cognitive process of human reasoning, an Event-Aware Pretraining auxiliary task is introduced to better activate the LLM's global comprehension of intricate scenarios. Second, during fine-tuning, we further utilize reference tags to bridge RoI features with text while preserving the semantics of both modalities. Finally, we use instruct-style prompts to narrow the gap between pretraining and fine-tuning, and task-specific adapters to better integrate the LLM's inherent knowledge with new commonsense. Experimental results show the effectiveness of our proposed auxiliary task and fine-grained linking strategy.

Paper Structure (20 sections, 5 equations, 4 figures, 3 tables)

Introduction
Related Works
Visual Commonsense Reasoning
Multimodal Large Language Models
Methodology
Preliminaries
Problem Formulation
Intuitive Observations
Event-Aware Pretraining
Supervised Fine-tuning for VCR
Global Abstraction module
Cross-modal Local Linking module
Reasoning Core LLM
Experiments
Datasets
...and 5 more sections

Figures (4)

Figure 1: An example of VCR with two subtasks.
Figure 2: EventLens Architecture for Event-Aware Pretraining.
Figure 3: Proposed EventLens architecture, comprising (1) a Global Abstraction module that extracts instruction-related image features; (2) a Cross-modal Local Linking module that addresses the missing co-reference problem in VCR and better instructs the LLM; and (3) a downstream reasoning core LLM that predicts instruction-following answers. The numbered marks denote the steps of the EventLens workflow.
Figure 4: Hyper-parameter analysis.
