GENIUS: Generative Fluid Intelligence Evaluation Suite

Ruichuan An; Sihan Yang; Ziyu Guo; Wei Dai; Zijun Shen; Haodong Li; Renrui Zhang; Xinyu Wei; Guopeng Li; Wenshan Wu; Wentao Zhang

TL;DR

GENIUS formalizes Generative Fluid Intelligence ($GFI$) within the CHC framework and presents the first multimodal benchmark to quantify dynamic, rule-driven visual generation in novel contexts. It operationalizes $GFI$ into three primitives and evaluates 12 models across 510 expert-curated samples using a hybrid, model-judge pipeline with Rule Compliance, Visual Consistency, and Aesthetic Quality metrics. The study reveals a substantial gap between state-of-the-art models and true fluid intelligence, driven by an execution gap where priors overpower context, and shows that attention misalignment during inference contributes to failures. As a remedy, the authors propose a training-free Attention Adjustment Mechanism that improves performance by reweighting context signals, suggesting a viable path toward more robust, context-aware generation without additional training. GENIUS thus acts as a rigorous standard to push multimodal models from crystallized recall toward adaptive, reasoning-driven generalization, with dataset and code released for community use.

Abstract

Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess $\textit{Crystallized Intelligence}$, which relies on recalling accumulated knowledge and learned schemas. This focus overlooks $\textit{Generative Fluid Intelligence (GFI)}$: the capacity to induce patterns, reason through constraints, and adapt to novel scenarios on the fly. To rigorously assess this capability, we introduce $\textbf{GENIUS}$ ($\textbf{GEN}$ Fluid $\textbf{I}$ntelligence Eval$\textbf{U}$ation $\textbf{S}$uite). We formalize $\textit{GFI}$ as a synthesis of three primitives. These include $\textit{Inducing Implicit Patterns}$ (e.g., inferring personalized visual preferences), $\textit{Executing Ad-hoc Constraints}$ (e.g., visualizing abstract metaphors), and $\textit{Adapting to Contextual Knowledge}$ (e.g., simulating counter-intuitive physics). Collectively, these primitives challenge models to solve problems grounded entirely in the immediate context. Our systematic evaluation of 12 representative models reveals significant performance deficits in these tasks. Crucially, our diagnostic analysis disentangles these failure modes. It demonstrates that deficits stem from limited context comprehension rather than insufficient intrinsic generative capability. To bridge this gap, we propose a training-free attention intervention strategy. Ultimately, $\textbf{GENIUS}$ establishes a rigorous standard for $\textit{GFI}$, guiding the field beyond knowledge utilization toward dynamic, general-purpose reasoning. Our dataset and code will be released at: $\href{https://github.com/arctanxarc/GENIUS}{https://github.com/arctanxarc/GENIUS}$.
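To make the idea of a training-free attention intervention concrete, here is a minimal sketch of one way attention could be reweighted toward context tokens at inference time. It is an illustration under stated assumptions, not the authors' Attention Adjustment Mechanism: the function names, the additive `boost` parameter, and the boolean `context_mask` are all hypothetical, and the paper's three-stage pipeline is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_attention(logits, context_mask, boost=1.0):
    """Hypothetical intervention: add `boost` to the pre-softmax scores of
    tokens flagged as context, then renormalize. No weights are updated,
    so the adjustment is training-free."""
    if len(logits) != len(context_mask):
        raise ValueError("logits and context_mask must have the same length")
    adjusted = [
        score + boost if is_context else score
        for score, is_context in zip(logits, context_mask)
    ]
    return softmax(adjusted)


def context_mass(weights, context_mask):
    """Total attention probability assigned to context tokens."""
    return sum(w for w, is_context in zip(weights, context_mask) if is_context)


# Toy example: token 0 stands in for a strong prior, tokens 1-2 for context.
logits = [2.0, 0.5, 0.5, 1.0]
context_mask = [False, True, True, False]

before = softmax(logits)
after = reweight_attention(logits, context_mask, boost=1.5)
```

In this toy setup, `context_mass(after, context_mask)` is larger than `context_mass(before, context_mask)`, which is the qualitative effect the summary describes as reweighting context signals. How the real method selects tokens and sets the adjustment strength is not stated in this summary.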

Paper Structure (31 sections, 2 theorems, 30 equations, 12 figures, 3 tables)

Introduction
GENIUS
Benchmark Overview
Benchmark Construction
Evaluation Metric
Experiment
Main Results
Discussion and Analysis
Validity of LMM-as-a-Judge
A Potential Solution
Experimental Observation
Theoretical Analysis
Attention Adjustment Mechanism
Experimental Results
Conclusion
...and 16 more sections

Key Result

Theorem 4.1

The layer update satisfies a property stated in terms of three quantities: the bias perturbation, the upsampling operator perturbation, and the normalized attention difference. The defining equations are omitted in this summary.

Figures (12)

Figure 1: An overview of the GENIUS benchmark. It is hierarchically structured into three dimensions, five tasks, and diverse sub-tasks.
Figure 2: Diagnostic analysis and metric validation. (a) Performance comparison across different context settings. (b) Analysis of the gap between context comprehension (VQA) and generation capabilities. (c) Correlation analysis validating the LMM-as-a-Judge metric.
Figure 3: Visualization of attention scores (range [0, 1]). Left: Existing models. Right: Ours.
Figure 4: Method overview. Guided by the theoretical insight that attention magnitude dictates gradient norms (a), we implement a three-stage pipeline (b) to explicitly suppress noise tokens and rectify the implicit optimization direction.
Figure 5: Data composition pie chart. GENIUS comprises 3 dimensions, 5 tasks, and 20 sub-tasks.
...and 7 more figures

Theorems & Definitions (2)

Theorem 4.1
Theorem 4.2
