CryptoX : Compositional Reasoning Evaluation of Large Language Models

Jiajun Shi, Chaoren Wei, Liqun Yang, Zekun Moore Wang, Chenghao Yang, Ge Zhang, Stephen Huang, Tao Peng, Jian Yang, Zhoufutu Wen

TL;DR

This work addresses the challenge of quantifying compositional reasoning (CR) in large language models (LLMs) by introducing CryptoX and CryptoBench, a framework that cryptographically transforms existing benchmarks to force CR and decoding steps. It combines instruction encryption and instruction transformation to create diverse, multi-hop CR tasks and uses EM, LLM-as-judge, UnitTest, and a trapezoid-rule-based AUC to measure CR across encodings. Comprehensive experiments across 20+ open- and closed-source LLMs on five NLP benchmarks reveal a large CR gap between model families and show that AUC is a more informative CR metric than accuracy alone. Mechanistic analyses (logit lens, neuron activation, and reasoning-stage mapping) illuminate how LLMs decompose problems, decode encoded prompts, and summarize subtask results, providing guidance for targeted CR improvements and future multimodal CR extensions.
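
As a rough illustration of the trapezoid-rule-based AUC mentioned above, the sketch below integrates accuracy over encoding levels. The function name `trapezoid_auc`, the normalization by the range of encoding levels, and the sample numbers are assumptions made for illustration, not the paper's exact definition or results.

```python
from typing import Sequence


def trapezoid_auc(levels: Sequence[float], accuracies: Sequence[float]) -> float:
    """Area under the accuracy-vs-encoding-level curve (trapezoid rule).

    Assumption: the area is divided by the span of encoding levels so the
    score stays on the same 0..1 scale as accuracy. The paper may normalize
    differently.
    """
    if len(levels) != len(accuracies) or len(levels) < 2:
        raise ValueError("need at least two (level, accuracy) pairs of equal length")

    area = 0.0
    for i in range(1, len(levels)):
        width = levels[i] - levels[i - 1]
        if width <= 0:
            raise ValueError("encoding levels must be strictly increasing")
        # Each trapezoid: width times the mean of the two neighboring accuracies.
        area += width * (accuracies[i] + accuracies[i - 1]) / 2.0

    return area / (levels[-1] - levels[0])


# Hypothetical accuracies at 0, 5, and 10 encoded words.
score = trapezoid_auc([0, 5, 10], [0.9, 0.6, 0.3])
```

A model whose accuracy stays flat as more words are encoded keeps an AUC equal to its accuracy, while a model that degrades quickly scores lower. A summary of this kind separates models that a single accuracy figure would rank the same.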

Abstract

Compositional reasoning has long been regarded as critical to the generalization and emergent intelligence of large language models (LLMs). However, despite numerous reasoning-related benchmarks, the compositional reasoning capacity of LLMs is rarely studied or quantified in existing benchmarks. In this paper, we introduce CryptoX, an evaluation framework that, for the first time, combines existing benchmarks with cryptographic principles to quantify the compositional reasoning capacity of LLMs. Building upon CryptoX, we construct CryptoBench, which integrates these principles into several benchmarks for systematic evaluation. We conduct detailed experiments on widely used open-source and closed-source LLMs using CryptoBench, revealing a huge gap between open-source and closed-source LLMs. We further conduct thorough mechanistic interpretability experiments to reveal the inner mechanism of LLMs' compositional reasoning, which involves subproblem decomposition, subproblem inference, and summarizing subproblem conclusions. Through analysis based on CryptoBench, we highlight the value of independently studying compositional reasoning and emphasize the need to enhance the compositional reasoning capabilities of LLMs.

Paper Structure

This paper contains 70 sections, 2 equations, 38 figures, 8 tables, 1 algorithm.

Figures (38)

  • Figure 1: (a) shows the concept of compositional reasoning (CR), which involves combining different abilities in a single model run (e.g., A+B) instead of reasoning via an individual ability (e.g., A or B). (b) shows the evaluation results of CryptoX. Some LLMs with strong reasoning abilities on the original benchmark show low CR ability on CryptoBench.
  • Figure 2: Overview of the CryptoBench construction process. We apply instruction encryption and transformation to tasks from common NLP benchmarks and combine them to construct our CryptoBench tasks. We then use Exact Match, LLM-as-judge, UnitTest, and AUC as evaluation metrics to assess LLM performance.
  • Figure 3: The performance of different models on Crypto-HighResolution with 10 varying degrees of encoding.
  • Figure 4: The performance of models of different sizes on Crypto-MMLU-Alpha. In Crypto-MMLU-Alpha_Words, Words denotes the number of words encoded in the given question.
  • Figure 5: Comparison of performance in multi-turn and single-turn dialogues. In Crypto-MMLU-Words-Type, Words denotes the number of words encoded in the given question, and Type denotes multi-turn (mt) or single-turn (sg).
  • ...and 33 more figures
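
Several captions above refer to the number of words encoded in a question. The sketch below shows one way such a setup could look. The Caesar shift is a stand-in cipher chosen for illustration, and `caesar_shift`, `encode_words`, the `seed` parameter, and the sample question are hypothetical names and values; none of them are the encoding rules CryptoBench actually uses.

```python
import random


def caesar_shift(text: str, shift: int = 3) -> str:
    """Shift ASCII letters by `shift` positions; leave other characters as-is."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)


def encode_words(question: str, num_words: int, seed: int = 0, shift: int = 3) -> str:
    """Encode `num_words` randomly chosen words of `question` (hypothetical scheme)."""
    tokens = question.split()
    # Only tokens containing at least one ASCII letter can be encoded.
    candidates = [
        i for i, tok in enumerate(tokens)
        if any(c.isascii() and c.isalpha() for c in tok)
    ]
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    chosen = set(rng.sample(candidates, min(num_words, len(candidates))))
    return " ".join(
        caesar_shift(tok, shift) if i in chosen else tok
        for i, tok in enumerate(tokens)
    )


# Encoding both words of a two-word prompt.
print(encode_words("Hello World", 2))
```

Raising `num_words` from zero up to the full question length yields a graded series of prompts. The model must first decode the encoded words and then answer the underlying question, which is the kind of combined demand the figures measure.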