Phi-4-reasoning-vision-15B Technical Report

Jyoti Aneja; Michael Harrison; Neel Joshi; Tyler LaBonte; John Langford; Eduardo Salinas

TL;DR

This work presents Phi-4-reasoning-vision-15B, a compact open-weight multimodal reasoning model that is good at common vision and language tasks and excels at scientific and mathematical reasoning and understanding user interfaces.

Abstract

We present Phi-4-reasoning-vision-15B, a compact open-weight multimodal reasoning model, and share the motivations, design choices, experiments, and learnings that informed its development. Our goal is to contribute practical insight to the research community on building smaller, efficient multimodal reasoning models and to share the result of these learnings as an open-weight model that is good at common vision and language tasks and excels at scientific and mathematical reasoning and understanding user interfaces. Our contributions include demonstrating that careful architecture choices and rigorous data curation enable smaller, open-weight multimodal models to achieve competitive performance with significantly less training and inference-time compute and tokens. The most substantial improvements come from systematic filtering, error correction, and synthetic augmentation -- reinforcing that data quality remains the primary lever for model performance. Systematic ablations show that high-resolution, dynamic-resolution encoders yield consistent improvements, as accurate perception is a prerequisite for high-quality reasoning. Finally, a hybrid mix of reasoning and non-reasoning data with explicit mode tokens allows a single model to deliver fast direct answers for simpler tasks and chain-of-thought reasoning for complex problems.

Paper Structure (30 sections, 8 figures, 7 tables)

Introduction
Focus on Smaller and Faster Vision--Language Models
Architecture and Training
Early vs. Mid Fusion
Vision Encoder and Image Processing
Open research questions.
Training Recipe
Stage 1: MLP Pretraining.
Stage 2: Instruction Tuning.
Stage 3: Long Context, Multi-Image, and RAI.
Training Data
Data Quality
Coordinate normalization.
Mathematics and Science vs. Computer-Use Data Proportion
Open research questions.
...and 15 more sections

Figures (8)

Figure 1: Phi-4-reasoning-vision-15B can help with a wide range of everyday tasks, from writing travel captions and interpreting receipts to reading garment care instructions.
Figure 2: Phi-4-reasoning-vision-15B presents a compelling option compared to existing models, pushing the Pareto frontier of the trade-off between accuracy and compute costs. We achieve performance competitive with much slower models that require more time and tokens, and higher accuracy than similarly fast models. These values were computed by averaging accuracy, time, and output token counts over a subset of four benchmarks: ChartQA_TEST, MathVista_MINI, MMMU_VAL, and ScreenSpot_v2.
Figure 3: Overview of the Phi-4-reasoning-vision-15B mid-fusion architecture. Images are processed by a SigLIP-2 vision encoder and projected into the language embedding space via a cross-modality projector (MLP). The resulting visual "soft" tokens are interleaved with text tokens and fed into the Phi-4-Reasoning language model.
Figure 4: Training data composition and examples for the Stage 2 training of Phi-4-reasoning-vision-15B. The Stage 3 data is designed to have a similar composition.
Figure 5: Phi-4-reasoning-vision-15B can interpret sequences of images, here reasoning about the changing appearance of Saturn's rings across multiple frames.
...and 3 more figures
