MMGeoLM: Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models

Kai Sun, Yushi Bai, Zhen Yang, Jiajie Zhang, Ji Qi, Lei Hou, Juanzi Li

TL;DR

This work tackles fine-grained geometric reasoning in large multimodal models by introducing hard negative contrastive learning for the vision encoder. It combines image-based negatives generated via LLM-driven diagram code perturbations with text-based negatives from retrieval and rule-based caption perturbations, integrated through a unified MMCLIP framework. The resulting MMGeoLM, built on a LLaVA-inspired architecture and trained in three stages, achieves state-of-the-art results on MathVista and MM-Math and rivals strong closed models on GeoQA, supporting the approach's effectiveness and data efficiency. Key findings: diverse hard negatives improve geometric understanding; 4K image-based negatives can outperform large retrieval-based negatives; the benefit of adding more negatives is bounded; and the model is robust to benign image transformations.

Abstract

Large Multimodal Models (LMMs) typically build on ViTs (e.g., CLIP), yet their training with simple random in-batch negatives limits the ability to capture fine-grained visual differences, particularly in geometric scenarios. To address this challenge, we propose a novel hard negative contrastive learning framework for the vision encoder, which combines image-based contrastive learning using generation-based hard negatives created by perturbing diagram generation code, and text-based contrastive learning using rule-based negatives derived from modified geometric descriptions and retrieval-based negatives selected based on caption similarity. We train a vision encoder (CLIP) using our hard negative training method, namely MMCLIP (Multimodal Math CLIP), and subsequently train an LMM for geometric problem-solving. Experiments show that our trained model, MMGeoLM, significantly outperforms other open-source models on three geometric reasoning benchmarks. Even with a size of 7B, it can rival powerful closed-source models like GPT-4o. We further conduct ablation studies to analyze three key factors: hard negative types, the efficiency of image-based negatives, and training configurations. These analyses yield important insights into optimizing the training pipeline of the vision encoder for fine-grained geometric reasoning tasks. Project repository: https://github.com/THU-KEG/MMGeoLM
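To make the idea of contrasting against hard negatives concrete, the following is a minimal sketch of an InfoNCE-style loss for a single anchor, one positive, and an explicit list of hard negatives. It is an illustration under stated assumptions, not the MMCLIP objective itself: the function name `hard_negative_info_nce`, the helper names, the toy 2-D embeddings, and the temperature value of 0.07 are all hypothetical choices made for this example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def hard_negative_info_nce(anchor, positive, hard_negatives, temperature=0.07):
    """InfoNCE-style loss for one anchor (assumed form, not the paper's exact loss).

    The positive competes against explicitly supplied hard negatives
    rather than against random in-batch samples.
    """
    a = normalize(anchor)
    candidates = [normalize(positive)] + [normalize(n) for n in hard_negatives]
    logits = [dot(a, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability assigned to the positive (index 0).
    return log_sum - logits[0]


# Toy embeddings (hypothetical): a negative close to the positive is "hard".
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
easy_loss = hard_negative_info_nce(anchor, positive, [[-1.0, 0.0]])
hard_loss = hard_negative_info_nce(anchor, positive, [[0.9, 0.2]])
print(hard_loss > easy_loss)
```

In this toy setup, a negative that lies far from the anchor contributes almost nothing to the loss, while a negative that is nearly indistinguishable from the positive produces a much larger loss, which is the intuition behind preferring hard negatives over random in-batch ones.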

Paper Structure

This paper contains 29 sections, 4 equations, 17 figures, 13 tables.

Figures (17)

  • Figure 1: Examples of hallucination: non-existent elements and relation misinterpretation.
  • Figure 2: Image-based and text-based hard negative construction and the corresponding MMCLIP training method.
  • Figure 3: Overview of the MMGeoLM training pipeline, including the configurations of the main and ablation experiments, training strategies, and datasets.
  • Figure 4: An example of vision encoder training data with or without numeric markings.
  • Figure 5: MMGeoLM performance under varying hard negative ratios.
  • ...and 12 more figures