On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks

Yannic Neuhaus, Nicolas Flammarion, Matthias Hein, Francesco Croce

TL;DR

This work presents an evaluation framework to rigorously examine how well chain-of-thought (CoT) approaches generalize on a simple planning task, and finds that reasoning traces which combine multiple text formats yield the best OOD generalization.

Abstract

Integrating reasoning in large language models and large vision-language models has recently led to significant improvement of their capabilities. However, the generalization of reasoning models is still vaguely defined and poorly understood. In this work, we present an evaluation framework to rigorously examine how well chain-of-thought (CoT) approaches generalize on a simple planning task. Specifically, we consider a grid-based navigation task in which a model is provided with a map and must output a sequence of moves that guides a player from a start position to a goal while avoiding obstacles. The versatility of the task and its data allows us to fine-tune model variants using different input representations (visual and textual) and CoT reasoning strategies, and systematically evaluate them under both in-distribution (ID) and out-of-distribution (OOD) test conditions. Our experiments show that, while CoT reasoning improves in-distribution generalization across all representations, out-of-distribution generalization (e.g., to larger maps) remains very limited in most cases when controlling for trivial matches with the ID data. Surprisingly, we find that reasoning traces which combine multiple text formats yield the best (and non-trivial) OOD generalization. Finally, purely text-based models consistently outperform those utilizing image-based inputs, including a recently proposed approach relying on latent space reasoning.
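To make the task concrete, the following is a minimal sketch of how a predicted move sequence could be checked against a text map. It is not the authors' evaluation code. The cell symbols (`S` start, `G` goal, `H` hole, `F` free), the move letters (`U`, `D`, `L`, `R`), and the rules that leaving the map counts as failure and that the plan must end exactly on the goal are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical move encoding: letter -> (row delta, column delta).
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def find(grid: List[str], symbol: str) -> Tuple[int, int]:
    """Return the (row, col) of the first cell holding `symbol`."""
    for r, row in enumerate(grid):
        c = row.find(symbol)
        if c != -1:
            return r, c
    raise ValueError(f"symbol {symbol!r} not found in grid")


def plan_succeeds(grid: List[str], plan: str) -> bool:
    """Check whether `plan` moves the player from S to G.

    Assumed failure conditions: an unknown move letter, leaving the map,
    stepping on a hole (H), or not ending on the goal cell.
    """
    r, c = find(grid, "S")
    goal = find(grid, "G")
    for move in plan:
        if move not in MOVES:
            return False
        dr, dc = MOVES[move]
        r, c = r + dr, c + dc
        if not (0 <= r < len(grid) and 0 <= c < len(grid[r])):
            return False  # walked off the map
        if grid[r][c] == "H":
            return False  # fell into a hole
    return (r, c) == goal


# A 3x3 example map in the assumed ASCII grid encoding.
example_grid = ["SFF",
                "FHF",
                "FFG"]
```

With this map, `plan_succeeds(example_grid, "RRDD")` accepts the plan, while `"DRRD"` is rejected because its second move lands on the hole. Success rate over a test set would then be the fraction of maps whose predicted plan passes this check.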

Paper Structure

This paper contains 15 sections, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Chain-of-thought (CoT) format impacts out-of-distribution (OOD) generalization. We fine-tune Qwen2.5-VL-7B-Instruct on the FrozenLake datasets with maps of size up to 6×6, in grid format. The models trained without CoT or with CoT in description format show no generalization to OOD (larger) maps, while combining grid and description reasoning steps yields non-trivial results on maps up to 10×10. This shows that generalization of reasoning is influenced by the format of the CoT traces.
  • Figure 2: Maze representations. In wu2024vsp, the planning task is introduced with several representations of the maze: as an image, a text description and a table. We also introduce an ASCII-based grid representation that requires fewer tokens to encode the map.
  • Figure 3: Reasoning traces in different formats. We illustrate an example reasoning trace, from the first step to the solution, in the various formats. While xu2025visual use the sequence of maze representations as images, we generate the corresponding steps as text-only descriptions, tables and grids. In the description format, each step discusses what the next move should be, while in the other formats each step shows the representation of the map after the next move.
  • Figure 4: OOD generalization w.r.t. optimal solution length. Top: we show the distribution of the length of the optimal (shortest) solution paths for both training and test maps (aggregated across map sizes). Bottom: we show the success rate of models using grid as input representation and various CoT formats. Even when using only training maps with optimal solution length $\leq 10$ (dashed line), the grid + description CoT yields a non-trivial success rate on maps with solutions of length $11$ and $12$.
  • Figure 5: Example of the CoT reasoning of our models on OOD maps. We show an example of the reasoning trace produced by the model trained on grid input and grid + description CoT. Even on an OOD map, the model first reasons in natural language about the next move, then produces the map after that move.
  • ...and 2 more figures
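
The optimal (shortest) solution length used in Figure 4 can be computed with a breadth-first search over the map. The sketch below is one standard way to do this and is not taken from the paper. It assumes a hypothetical ASCII encoding with `S` (start), `G` (goal), `H` (hole) and any other character as a free cell, and the helper name `optimal_solution_length` is illustrative.

```python
from collections import deque
from typing import List, Optional


def optimal_solution_length(grid: List[str]) -> Optional[int]:
    """Number of moves on the shortest hole-free path from S to G.

    Returns None if the goal cannot be reached. Assumes exactly one S.
    """
    start = next((r, row.index("S")) for r, row in enumerate(grid) if "S" in row)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if grid[r][c] == "G":
            return dist
        # Expand the four neighbours: up, down, left, right.
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
                    and grid[nr][nc] != "H" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


def filter_by_length(maps: List[List[str]], max_len: int) -> List[List[str]]:
    """Keep only solvable maps whose optimal solution has at most max_len moves."""
    kept = []
    for grid in maps:
        length = optimal_solution_length(grid)
        if length is not None and length <= max_len:
            kept.append(grid)
    return kept
```

Under these assumptions, `filter_by_length(train_maps, 10)` would mirror the restriction to training maps with optimal solution length $\leq 10$ described in the Figure 4 caption, where `train_maps` is a placeholder for a list of grids.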