Don't Learn, Ground: A Case for Natural Language Inference with Visual Grounding
Daniil Ignatev, Ayman Santeer, Albert Gatt, Denis Paperno
TL;DR
The paper demonstrates that natural language inference (NLI) can be performed in a zero-shot setting by grounding premises in visual representations generated by text-to-image models and inferring labels from the relation between those images and the hypothesis, using either cosine similarity or visual question answering (VQA). This multimodal approach achieves meaningful accuracy without fine-tuning and shows robustness against text-derived biases, as validated on SNLI and a synthetic adversarial dataset. Across experiments, VQA-based inference generally outperforms cosine similarity, and multi-image aggregation improves certain classes, though neutral cases remain challenging. The work highlights the potential of grounded meaning representations for robust language understanding and points to future directions in broader grounding modalities and fairness-oriented evaluations.
Abstract
We propose a zero-shot method for Natural Language Inference (NLI) that leverages multimodal representations by grounding language in visual contexts. Our approach generates visual representations of premises using text-to-image models and performs inference by comparing these representations with textual hypotheses. We evaluate two inference techniques: cosine similarity and visual question answering. Our method achieves high accuracy without task-specific fine-tuning, demonstrating robustness against textual biases and surface heuristics. Additionally, we design a controlled adversarial dataset to validate the robustness of our approach. Our findings suggest that using the visual modality as a meaning representation is a promising direction for robust natural language understanding.
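To make the cosine-similarity variant concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the images generated for a premise and the textual hypothesis have already been embedded in a shared vector space by some joint image-text encoder (not shown here). The function names, the mean aggregation over images, and both thresholds are illustrative assumptions; the abstract does not specify the actual decision rule.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_label(
    image_embeddings: Sequence[Sequence[float]],
    hypothesis_embedding: Sequence[float],
    entail_threshold: float = 0.7,      # assumed value, not taken from the paper
    contradict_threshold: float = 0.3,  # assumed value, not taken from the paper
) -> str:
    """Map the mean image-hypothesis similarity to an NLI label.

    image_embeddings: one vector per image generated from the premise.
    hypothesis_embedding: vector for the textual hypothesis.
    """
    if not image_embeddings:
        raise ValueError("at least one image embedding is required")

    # Aggregate over several generated images by averaging their scores
    # (an assumed aggregation strategy for this sketch).
    scores = [cosine_similarity(img, hypothesis_embedding) for img in image_embeddings]
    mean_score = sum(scores) / len(scores)

    if mean_score >= entail_threshold:
        return "entailment"
    if mean_score <= contradict_threshold:
        return "contradiction"
    return "neutral"


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for real encoder outputs.
    premise_images = [[1.0, 0.0], [0.9, 0.1]]
    hypothesis = [1.0, 0.0]
    print(predict_label(premise_images, hypothesis))
```

In this sketch no parameters are trained, which mirrors the zero-shot setting: the only tunable quantities are the two thresholds. The VQA variant would replace the similarity score with a model's answer to a question derived from the hypothesis.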
