Integrating Query-aware Segmentation and Cross-Attention for Robust VQA

Wonjun Choi, Sangbeom Lee, Seungyeon Lee, Heechul Jung, Dong-Gyu Lee

TL;DR

To improve robustness on VizWiz-VQA, the paper proposes a parameter-efficient LVLM-based approach with trainable cross-attention and LoRA fine-tuning. It adds query-aware segmentation via CLIPSeg to focus on question-relevant image regions and integrates ViT features with CLIPSeg outputs to enrich visual representations. A Levenshtein-distance ensemble selects the final answer from multiple predictions, improving accuracy. Experiments on VizWiz-VQA demonstrate performance gains over baselines and show the method ranking third on the public leaderboard, underscoring practical impact for visually impaired users.

Abstract

This paper introduces a method for VizWiz-VQA that uses an LVLM with trainable cross-attention and LoRA fine-tuning. We train the model under three conditions: 1) with the original images; 2) with images enhanced by CLIPSeg to highlight or contrast regions of the original image; 3) with the output features of the Vision Transformer (ViT) integrated with CLIPSeg features of the original images. We then ensemble the results based on Levenshtein distance to improve the prediction of the final answer. In the experiments, we demonstrate and analyze the effectiveness of the proposed method.
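
The abstract does not spell out how the Levenshtein-based ensemble picks its answer, so the following is only a minimal sketch under one plausible assumption: each trained model contributes one answer string, and the ensemble returns the answer with the smallest total edit distance to all the others (a medoid). The function names `levenshtein` and `select_answer`, the lowercasing step, and the tie-breaking rule are illustrative choices, not taken from the paper.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def select_answer(candidates: Sequence[str]) -> str:
    """Return the candidate closest, in total edit distance, to all others.

    Assumption: answers are compared after stripping and lowercasing,
    and ties are broken in favour of the earliest candidate.
    """
    if not candidates:
        raise ValueError("at least one candidate answer is required")
    normalized = [c.strip().lower() for c in candidates]
    totals = [sum(levenshtein(x, y) for y in normalized) for x in normalized]
    return normalized[totals.index(min(totals))]


if __name__ == "__main__":
    # Hypothetical predictions from the three training conditions.
    print(select_answer(["Coke", "coke ", "cola"]))
```

Under this rule, an answer that two models agree on (up to small spelling differences) tends to win over an outlier, which is one reasonable reading of an ensemble "based on Levenshtein distance".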

Paper Structure (5 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Our proposed fine-tuning method using LoRA and cross-attention (FT-L/CA), with three types of ViT configurations.
  • Figure A1: Example of an original input image (bottle) and the result of applying CLIPSeg.
  • Figure A2: Example of an original input image (flower) and the result of applying CLIPSeg.
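
The text here does not say how the CLIPSeg output is applied to the image, so the following is a hedged sketch of one simple way to "highlight" a question-relevant region. It assumes a soft mask with values in [0, 1] has already been obtained (for example from CLIPSeg conditioned on the question) and resized to the image resolution. The function name `highlight` and the `background_scale` parameter are hypothetical.

```python
import numpy as np


def highlight(image: np.ndarray, mask: np.ndarray,
              background_scale: float = 0.3) -> np.ndarray:
    """Dim pixels outside the mask while keeping masked pixels unchanged.

    image: H x W x 3 uint8 array.
    mask:  H x W float array in [0, 1]; 1 means "relevant to the question".
    background_scale: brightness multiplier applied where mask == 0.
    """
    if image.shape[:2] != mask.shape:
        raise ValueError("mask must match the image height and width")
    # Per-pixel weight interpolates between background_scale and 1.0.
    weight = background_scale + (1.0 - background_scale) * np.clip(mask, 0.0, 1.0)
    out = image.astype(np.float32) * weight[..., None]
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


if __name__ == "__main__":
    # Toy 2x2 image and mask standing in for a real photo and CLIPSeg output.
    image = np.full((2, 2, 3), 200, dtype=np.uint8)
    mask = np.array([[1.0, 0.0], [0.5, 0.0]])
    print(highlight(image, mask, background_scale=0.5)[..., 0])
```

A soft weighting like this keeps some background context visible, which may matter for VizWiz images where the relevant object is poorly framed. Whether the authors used dimming, contrast adjustment, or another scheme is not stated in this section.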