MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models
Sayak Chakrabarty, Souradip Pal
TL;DR
The paper tackles multiple-choice visual question answering by introducing MM-PoE, a two-step method that combines a process of elimination with targeted re-scoring in a multimodal setting. MM-PoE first masks out implausible options and then re-evaluates the remaining ones in the masked context, aligning model reasoning with human test-taking strategies. Empirical results show improved zero-shot and few-shot performance on ScienceQA and AI2D across multiple VLM backbones, illustrating the approach's robustness and generality. The work extends the PoE paradigm to multimodal tasks and provides open-source code to facilitate adoption and further research in interpretable, reliable visual reasoning.
Abstract
This paper introduces Multiple Choice Reasoning via Process of Elimination using Multi-Modal Models, hereafter referred to as Multi-Modal Process of Elimination (MM-PoE). The method is designed to improve the performance of Vision-Language Models (VLMs) on multiple-choice visual reasoning tasks. Unlike conventional approaches that evaluate each option independently, MM-PoE uses a two-step scoring scheme: it first identifies and excludes implausible choices, then concentrates on the most probable remaining options. This mirrors human test-taking strategy, in which clearly incorrect answers are eliminated before the best response is selected. Our empirical evaluations, conducted across three benchmark datasets, show that MM-PoE significantly improves both the zero-shot and few-shot performance of contemporary state-of-the-art VLMs. Critically, the approach extends the elimination process to multi-modal contexts and supports few-shot experiments, thereby addressing two principal limitations of prior PoE work: its use only in zero-shot settings and only within a language-only framework. As a result, MM-PoE both refines the reasoning capabilities of VLMs and broadens their applicability to complex visual question-answering scenarios. All code and documentation supporting our work are available at https://pypi.org/project/mm-poe/, enabling researchers and practitioners to easily integrate and further develop these techniques.
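The two-step scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the mm-poe package API. The names `Scorer`, `build_prompt`, `eliminate`, `mm_poe_predict`, and `toy_scorer` are hypothetical, the "drop options scoring below the mean" rule and the `[MASK]` placeholder are assumptions, and the scorer is a toy stand-in for a VLM that would also receive the image.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer: (prompt, option) -> plausibility score, higher is better.
# A real scorer would wrap a VLM and also condition on the image.
Scorer = Callable[[str, str], float]

MASK = "[MASK]"  # assumed placeholder text for eliminated options


def build_prompt(question: str, options: Sequence[str]) -> str:
    """Format the question and lettered options as a plain-text prompt."""
    lines = [f"Question: {question}"]
    lines += [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


def eliminate(scores: Sequence[float]) -> List[bool]:
    """Step 1: keep options scoring at or above the mean (assumed rule).

    The top-scoring option is always kept, so at least one option survives.
    """
    mean = sum(scores) / len(scores)
    return [s >= mean for s in scores]


def mm_poe_predict(question: str, options: Sequence[str], scorer: Scorer) -> int:
    """Return the index of the predicted option using two-step scoring."""
    # Step 1: score every option in the full context, then eliminate.
    prompt = build_prompt(question, options)
    keep = eliminate([scorer(prompt, opt) for opt in options])

    # Step 2: mask eliminated options and re-score only the survivors.
    masked = [opt if k else MASK for opt, k in zip(options, keep)]
    masked_prompt = build_prompt(question, masked)
    rescored = {
        i: scorer(masked_prompt, options[i]) for i, k in enumerate(keep) if k
    }
    return max(rescored, key=rescored.get)


def toy_scorer(prompt: str, option: str) -> float:
    """Toy stand-in for a VLM: its preference changes once distractors are masked."""
    first_pass = {"gas": 0.30, "liquid": 0.35, "solid": 0.10, "plasma": 0.05}
    second_pass = {"gas": 0.60, "liquid": 0.40}
    table = second_pass if MASK in prompt else first_pass
    return table.get(option, 0.0)


options = ["gas", "liquid", "solid", "plasma"]
prediction = mm_poe_predict("Which state of matter is shown?", options, toy_scorer)
print(options[prediction])
```

In this toy run, single-pass scoring would choose "liquid", while the elimination step removes "solid" and "plasma" and the masked re-scoring then prefers "gas". The example shows only the control flow; whether masking helps in practice depends on the underlying model.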
