Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024

Jinwoo Ahn, Junhyeok Park, Min-Jun Kim, Kang-Hyeon Kim, So-Yeong Sohn, Yun-Ji Lee, Du-Seong Chang, Yu-Jung Heo, Eun-Sol Kim

TL;DR

This work tackles multimodal reasoning on SMART-101, a set of visio-linguistic puzzles designed for children aged 6-8 that require abstract, deductive, and generalizable reasoning. It proposes a two-pronged grounding strategy: (i) ground visual cues in detailed text captions so that a large language model can reason over them, and (ii) augment the visual input with SAM-derived features so that fine-grained geometric patterns are not lost during captioning. The approach is further strengthened by training on additional datasets and by a Multi-VLM inference framework that selects either a key prediction model or a value prediction model according to the puzzle category. Under the puzzle split, the method achieves an option selection accuracy $O_{acc}$ of $29.5$ on the test set and a weighted option selection accuracy (WOSA) of $27.1$ on the challenge set. The results suggest that text-based grounding and geometry-aware visual features can help multimodal reasoning on synthetic puzzle datasets, with potential relevance to other tasks involving diagrams and geometric reasoning.

Abstract

This paper presents the solution of the HYU MLLAB KT Team to the Multimodal Algorithmic Reasoning Task: SMART-101 CVPR 2024 Challenge. Going beyond conventional visual question-answering problems, the SMART-101 challenge aims at human-level multimodal understanding by posing complex visio-linguistic puzzles designed for children in the 6-8 age group. To solve this problem, we propose two main ideas. First, to exploit the reasoning ability of a large language model (LLM), the given visual cues (images) are grounded in the text modality: we generate highly detailed text captions that describe the context of the image and use these captions as input to the LLM. Second, because puzzle images often contain various geometric visual patterns, we use an object detection algorithm to ensure that these patterns are not overlooked in the captioning process. Specifically, we employ the SAM algorithm, which can detect objects of various sizes, to capture the visual features of these geometric patterns, and we use this information as input to the LLM. Under the puzzle split configuration, we achieved an option selection accuracy $O_{acc}$ of 29.5 on the test set and a weighted option selection accuracy (WOSA) of 27.1 on the challenge set.

Paper Structure (19 sections, 1 equation, 3 figures, 1 table)

Figures (3)

  • Figure 1: Overall framework of the proposed pipeline. Image captions are extracted with a two-stage mechanism. To improve caption quality, we first generate three sets of visual question-answer (VQA) pairs. These VQA results are then used as history and included as additional prompts when generating the caption of the image. The generated caption, along with the question, is used as a prompt for the backbone model, InstructBLIP. We enhance the visual understanding ability by concatenating features from ViT and SAM. Finally, the image and text embeddings are processed through the Q-Former and the LLM, with an ensemble strategy that depends on the classified puzzle category.
  • Figure 2: Result of the text enhancement module. The left box shows the target puzzle whose text information is augmented with Qwen-VL-Chat, the middle box shows the generated visual question-answer pairs, and the right box shows the caption generated with the VQA pairs as history.
  • Figure 3: Inference workflow for Multi-VLM. A zero-shot classifier determines the puzzle category. Based on the classified puzzle type, either the key prediction model or the value prediction model is selected, each of which is specifically trained for the corresponding answer type.
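
The routing step described in the Figure 3 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the category names, the category-to-answer-type mapping, the keyword-based `toy_classifier`, and the `MultiVLMRouter` class are all hypothetical, and "key" is interpreted here as an option label while "value" is interpreted as the answer content itself. The two predictors are stand-ins for the separately trained models.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical mapping from puzzle category to answer type. The categories
# and their assignments are illustrative and are not taken from the paper.
ANSWER_TYPE_BY_CATEGORY: Dict[str, str] = {
    "counting": "value",
    "arithmetic": "value",
    "pattern": "key",
    "spatial": "key",
}

# A predictor takes (image_path, prompt) and returns an answer string.
Predictor = Callable[[str, str], str]


@dataclass
class MultiVLMRouter:
    classify: Callable[[str], str]  # stand-in for the zero-shot classifier
    key_model: Predictor            # assumed to predict an option label
    value_model: Predictor          # assumed to predict the answer value

    def predict(self, image_path: str, question: str, caption: str) -> str:
        # Step 1: classify the puzzle into a category.
        category = self.classify(question)
        # Step 2: pick the model for that category's answer type,
        # falling back to the key model for unknown categories.
        answer_type = ANSWER_TYPE_BY_CATEGORY.get(category, "key")
        model = self.key_model if answer_type == "key" else self.value_model
        # Step 3: the generated caption is passed along with the question.
        prompt = f"Caption: {caption}\nQuestion: {question}"
        return model(image_path, prompt)


def toy_classifier(question: str) -> str:
    """Keyword stub used only to make the sketch runnable."""
    return "counting" if "how many" in question.lower() else "pattern"


if __name__ == "__main__":
    router = MultiVLMRouter(
        classify=toy_classifier,
        key_model=lambda image, prompt: "B",    # dummy key predictor
        value_model=lambda image, prompt: "7",  # dummy value predictor
    )
    print(router.predict("puzzle.png", "How many circles are there?", "..."))
```

Keeping the classifier and both predictors as plain callables means each component could be swapped for a real model without changing the routing logic.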