MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning?

Kai Yan, Zhan Ling, Kang Liu, Yifan Yang, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen

TL;DR

MIR-Bench introduces a large-scale, many-shot in-context reasoning benchmark for pattern recognition, leveraging an automatic data-generation pipeline to create MIR-Extended and MIR-Core datasets that stress long-context inductive/transductive reasoning. Systematic experiments across 15 LLMs reveal substantial saturation in many-shot gains, with transductive reasoning often outperforming inductive approaches and retrieval-based fixes providing limited benefit. The work uncovers robust behavior to erroneous examples and mixed results for coding-based and meta-shot interventions, offering concrete insights for designing future long-context reasoning benchmarks and guiding the development of generalist AI agents. Overall, MIR-Bench highlights core challenges in scaling in-context pattern recognition beyond classification, motivating further research into structured reasoning, memory, and data-efficient long-context inference.

Abstract

The ability to recognize patterns from examples and apply them to new ones is a primal ability for general intelligence, and is widely studied by psychology and AI researchers. Many benchmarks have been proposed to measure this ability for Large Language Models (LLMs); however, they focus on the few-shot (usually <10) setting and do not evaluate how well models aggregate many pieces of information from long contexts. On the other hand, the ever-growing context length of LLMs has brought forth the novel paradigm of many-shot In-Context Learning (ICL), which addresses new tasks with hundreds to thousands of examples without expensive and inefficient fine-tuning. However, many-shot evaluations often focus on classification, and popular long-context LLM tasks such as Needle-In-A-Haystack (NIAH) seldom require complicated intelligence for integrating many pieces of information. To fix the issues from both worlds, we propose MIR-Bench, the first many-shot in-context reasoning benchmark for pattern recognition, which asks LLMs to predict outputs from input-output examples generated by underlying functions with diverse data formats. Based on MIR-Bench, we study many novel problems for many-shot in-context reasoning and obtain many insightful findings on scaling effects, robustness, inductive vs. transductive reasoning, Retrieval-Augmented Generation (RAG), coding for inductive reasoning, cross-domain generalizability, etc.

Paper Structure

This paper contains 40 sections, 1 equation, 37 figures, 13 tables.

Figures (37)

  • Figure 1: A high-level illustration of our data generation pipeline. We first collect functions from existing coding benchmarks, then let GPT-4o-0806 write a data generator for each function; we then run the data generator to produce input shots, and combine them with the ground-truth function to produce output shots. We concatenate the input and output shots to build MIR-Extended; then, with initial tests on several models, we study the factors that make a pattern recognition problem benefit from many-shot, and build MIR-Core by selecting problems based on these factors.
  • Figure 3: The coefficients of the quadratic function fitting $D$ with the aforementioned factors normalized between $[0, 1]$. The blank row and column are for constant factors. LLM-labeled difficulty is the leading factor for $D$, while answer diversity and shot length are less important.
  • Figure 8: The performance of $5$ cutting-edge LLM models on MIR-Extended with temperature $0.7$ across $5$ runs. The result clearly shows that the standard deviation of accuracy is always below $0.01$, and thus the evaluation is highly stable.
  • Figure 10: Performance difference for $16$ LLMs on MIR-Core between forced CoT and no CoT. For long-CoT models (o1 series and DeepSeek-R1), forced CoT performs similarly to or slightly better than no CoT, but the gain diminishes with more shots. For the rest of the models, forced CoT almost always performs worse (with the exception of GPT4o-mini-0718), and this gap increases with the number of shots. Mistral-Large-2's gap decreases dramatically at 2048-shot because this setting often exceeds its context length and performance is low under both settings.
  • Figure : a) MIR-Extended
  • ...and 32 more figures
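
The shot-construction step described in the Figure 1 caption (run a data generator for input shots, apply the ground-truth function for output shots, then concatenate) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `ground_truth` stands in for a function collected from a coding benchmark, `generate_input` stands in for the LLM-written data generator, and the `Input:`/`Output:` prompt layout is hypothetical rather than the format used by MIR-Bench.

```python
import random
from typing import Any, Callable, List, Tuple

Shot = Tuple[Any, Any]


def ground_truth(xs: List[int]) -> int:
    """Hypothetical stand-in for a function taken from a coding benchmark."""
    return sum(x for x in xs if x % 2 == 0)


def generate_input(rng: random.Random) -> List[int]:
    """Hypothetical stand-in for the LLM-written data generator."""
    return [rng.randint(0, 9) for _ in range(rng.randint(1, 5))]


def build_shots(
    fn: Callable[[Any], Any],
    gen: Callable[[random.Random], Any],
    n_shots: int,
    seed: int = 0,
) -> List[Shot]:
    """Produce input shots with the generator, then label them with fn."""
    rng = random.Random(seed)
    inputs = [gen(rng) for _ in range(n_shots)]
    return [(x, fn(x)) for x in inputs]


def build_prompt(shots: List[Shot], query: Any) -> str:
    """Concatenate the shots and append an unanswered query (assumed layout)."""
    blocks = [f"Input: {x!r}\nOutput: {y!r}" for x, y in shots]
    blocks.append(f"Input: {query!r}\nOutput:")
    return "\n\n".join(blocks)


shots = build_shots(ground_truth, generate_input, n_shots=8)
prompt = build_prompt(shots, [2, 3, 4])
```

Raising `n_shots` into the hundreds or thousands gives the many-shot regime the benchmark targets; the model never sees `ground_truth` itself, only the examples in `prompt`.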