Smart Vision-Language Reasoners

Denisa Roberts; Lucas Roberts

TL;DR

This work addresses multimodal reasoning for Math AI by evaluating vision-language models as problem solvers on the SMART benchmark. It introduces SmarterVLM, a framework that keeps pretrained backbones frozen (DinoV2+SigLIP for vision and SigLIP for language) and adds a novel QF layer that uses cross-modal attention and adaptive image representations to form a composite multimodal representation. The approach improves on the best previous baselines in all eight reasoning skill classes, with up to $48\%$ gain in accuracy, and demonstrates the importance of cross-attention and multimodal fusion for visual grounding in reasoning tasks. The study highlights practical implications for multimodal reasoning systems in Math AI and outlines directions for future work, including multitask learning and more efficient training paradigms.

Abstract

In this article, we investigate vision-language models (VLMs) as reasoners. The ability to form abstractions underlies mathematical reasoning, problem-solving, and other Math AI tasks. Several formalisms have been proposed for the abstractions and skills that humans and intelligent systems use for reasoning. Because human reasoning is inherently multimodal, we focus our investigations on multimodal AI. We employ the abstractions given in the SMART task (Simple Multimodal Algorithmic Reasoning Task) introduced in \cite{cherian2022deep} as meta-reasoning and problem-solving skills along eight axes: math, counting, path, algebra, measure, logic, spatial, and pattern. We investigate the ability of vision-language models to reason along these axes and seek avenues of improvement. Composite representations with vision-language cross-attention enabled the model to learn multimodal representations adaptively from fused frozen pretrained backbones, which improved visual grounding. Furthermore, careful hyperparameter and other training choices led to strong improvements (up to $48\%$ gain in accuracy) on the SMART task, further underscoring the power of deep multimodal learning. The smartest VLM, which includes a novel QF multimodal layer, improves upon the best previous baselines in every one of the eight fundamental reasoning skills. End-to-end code is available at https://github.com/smarter-vlm/smarter.
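
To make the idea of a composite representation built with vision-language cross-attention more concrete, the following is a minimal sketch in plain Python. It is not the QF layer from the paper or the released code: it shows only generic scaled dot-product cross-attention, in which text tokens attend over image tokens, followed by pooling and concatenation. The function names, the token counts, the feature dimension, and the random stand-ins for frozen backbone outputs are all assumptions made for illustration, and learned projections are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_attention(text_tokens, image_tokens):
    """Each text token (query) attends over image tokens (keys and values).

    Learned query/key/value projections are omitted to keep the sketch short.
    """
    dim = len(image_tokens[0])
    scale = math.sqrt(dim)
    attended = []
    for query in text_tokens:
        weights = softmax([dot(query, key) / scale for key in image_tokens])
        # Weighted average of image tokens: an image summary adapted to this query.
        mixed = [sum(w * tok[d] for w, tok in zip(weights, image_tokens))
                 for d in range(dim)]
        attended.append(mixed)
    return attended


def mean_pool(tokens):
    """Average a list of token vectors into a single vector."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def composite_representation(text_tokens, image_tokens):
    """Concatenate pooled text features with the text-conditioned image summary."""
    attended = cross_attention(text_tokens, image_tokens)
    return mean_pool(text_tokens) + mean_pool(attended)


# Random stand-ins for frozen backbone outputs (hypothetical shapes).
rng = random.Random(0)
DIM = 8
image_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
text_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]

fused = composite_representation(text_tokens, image_tokens)
print(len(fused))
```

In this sketch the image summary changes with the question text, which is one simple way to read "learning multimodal representations adaptively"; a trainable layer would add learned projections and typically multiple attention heads.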

Paper Structure (7 sections, 2 equations, 4 figures, 8 tables)

Introduction
Related Work
Benchmark, Dataset, and Challenges
Methodology
Experiments and Results
Discussion and Future Work
Further Experimental Results

Figures (4)

Figure 1: The smarterVLM reasoner architecture (right) and the novel QF layer (left). Vision (DinoV2+SigLIP) and language (SigLIP) backbones are frozen. All other layers are trained from scratch.
Figure 2: Math Question: What do we need to put in the square to get a correct diagram? Answer Options: A: -3; B: /9; C: x6; D: x2; E: 2.
Path Question with Sequence Answer: You have to block some locations in the maze so that the feline cannot reach the bird. Which of the following options to block will fail? Answer Options: A: 1, 2, and 3; B: 4; C: 5, 6, and 7; D: 8 and 9; E: 10, 11, and 12.
Counting Question: The entire pie is divided among several children. Each child receives a piece of pie, and each piece of pie looks identical. The maximum possible number of children there is: Answer Options: A: 7; B: 2; C: 1; D: 4; E: 3.
Algebra Question: The entire pie is divided among several children. Each child receives a piece of pie, and each piece of pie looks identical. The maximum possible number of children there is: Answer Options: A: 5; B: 4; C: 2; D: 0; E: 6.
Measure Question: A student had a few canes with a height of 1 cm and a length of 5 cm. Using the canes, she built the arrangement illustrated. What is the width of the arrangement? Answer Options: A: 20; B: 30; C: 15; D: 5; E: 35.
Spatial Question: Cristina made a setup using some green blocks and 94 white blocks. How many of these white blocks are not visible in the figure? Answer Options: A: 28; B: 61; C: 64; D: 90; E: 79.
Logic Question: Emily has 7 toy items: a remote, a hair brush, a truck, an eraser, a rubber duck, carrots, and a toe ring. She keeps each toy at a different row of the shelf. The carrots lower to toe ring. Remote lower to truck and toe ring higher to truck. Toe ring higher to rubber duck. She keeps carrots as shown. On which row can the rubber duck not be placed? Answer Options: A: 4; B: 3; C: 7; D: 5; E: 6.
Pattern Question: Which picture on the right matches with the left, if we invert the colors? Answer Options: A; B; C; D; E.
Figure 3: Epoch Train Loss, Validation Loss, and Validation Accuracy for five different learning rates. From CometML https://www.comet.com/droberts308/multimodalai/view/QF0ah3akqYB6IiNuyVXuRchlh/panels.
Figure 4: Validation accuracy curves per skill class (counting, math, spatial, logic, pattern, measure) for five different learning rates.
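
Per-skill accuracy of the kind plotted in Figure 4 can be computed by grouping predictions by skill class. The sketch below assumes each validation record is a tuple of skill name, predicted option, and correct option; that record layout, the function name `accuracy_by_skill`, and the sample records are hypothetical and are not taken from the paper's code or results.

```python
from collections import defaultdict


def accuracy_by_skill(records):
    """Return {skill: accuracy} for (skill, predicted_option, correct_option) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for skill, predicted, correct in records:
        totals[skill] += 1
        hits[skill] += int(predicted == correct)
    return {skill: hits[skill] / totals[skill] for skill in totals}


# Made-up predictions for illustration only; not results from the paper.
records = [
    ("counting", "A", "A"),
    ("counting", "B", "D"),
    ("math", "C", "C"),
    ("math", "E", "E"),
    ("logic", "B", "A"),
]

per_skill = accuracy_by_skill(records)
for skill, acc in sorted(per_skill.items()):
    print(f"{skill}: {acc:.2f}")
```

Reporting accuracy per skill rather than as one pooled number keeps a strong class from masking a weak one, which matters when the claim is an improvement in every skill.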
