SortBench: Benchmarking LLMs based on their ability to sort lists

Steffen Herbold

TL;DR

SortBench provides a zero-shot, scalable benchmark to probe LLMs on list sorting, emphasizing faithfulness to input, lexical versus semantic sorting, and adherence to simple Python-list formatting within a constrained context window. By organizing tasks into Basic, Advanced, and Debug across numeric and lexical domains, and evaluating through Output validity, Sorting correctness, and Faithfulness, the study reveals that the o3-mini model often dominates while long sequences and semantic confusion (e.g., number words treated as numbers) degrade faithfulness. Large proprietary models perform well but are not immune to parsing errors and overthinking, particularly on longer lists, whereas smaller models show similar qualitative patterns. The results underscore the trade-offs of test-time reasoning and highlight the need for harder, multi-modal, and linguistically diverse variants to stress-test sorting in future LLM generations, with SortBench offering a practical framework for ongoing model evaluation and robust benchmarking.

Abstract

Sorting is a tedious but simple task for human intelligence and can be solved fairly easily algorithmically. However, for Large Language Models (LLMs) this task is surprisingly hard, as some properties of sorting are among the known weaknesses of LLMs: being faithful to the input data, making logical comparisons between values, and strictly differentiating between syntax (used for sorting) and semantics (typically learned by embeddings). In this paper, we describe the new SortBench benchmark for LLMs, which comes with tasks of different difficulty and can easily be scaled in difficulty. We apply this benchmark to seven state-of-the-art LLMs, including current test-time reasoning models. Our results show that while the o3-mini model is very capable at sorting in general, even this model can be fooled if strings are defined to mix syntactic and semantic aspects, e.g., by asking it to sort numbers written out as words. Furthermore, all models have problems with faithfulness to the input for long lists, i.e., they drop items and add new ones. Our results also show that test-time reasoning has a tendency to overthink problems, which leads to performance degradation. Finally, models without test-time reasoning, like GPT-4o, are not much worse than reasoning models.
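
The three properties named above (valid Python-list output, correct order, and faithfulness to the input) can be illustrated with a minimal sketch. This is not the benchmark's scoring code: the function names (`parse_list`, `is_sorted`, `is_faithful`, `check`) are hypothetical, and the checks are simplified to pass/fail, whereas the paper reports scores such as $ValidityScore$ and $SortingScore$ whose exact definitions are not given in this section.

```python
import ast
from collections import Counter


def parse_list(output):
    """Return the parsed list if `output` is a valid Python list literal, else None."""
    try:
        value = ast.literal_eval(output.strip())
    except (ValueError, SyntaxError, TypeError):
        return None
    return value if isinstance(value, list) else None


def is_sorted(items):
    """True if items are in non-decreasing order under Python's default comparison."""
    try:
        return all(a <= b for a, b in zip(items, items[1:]))
    except TypeError:
        # Mixed, non-comparable types (e.g. int and str) count as not sorted.
        return False


def is_faithful(original, result):
    """True if result holds exactly the input items: nothing dropped, nothing added."""
    return Counter(original) == Counter(result)


def check(original, output):
    """Simplified pass/fail checks; an assumption, not the paper's score definitions."""
    parsed = parse_list(output)
    if parsed is None:
        return {"valid": False, "sorted": False, "faithful": False}
    return {
        "valid": True,
        "sorted": is_sorted(parsed),
        "faithful": is_faithful(original, parsed),
    }


# Number words sorted by meaning are faithful but not lexically sorted.
print(check(["three", "one", "two"], '["one", "two", "three"]'))
# A dropped item is still sorted but no longer faithful.
print(check([3, 1, 2], "[1, 2]"))
```

Keeping the three checks separate mirrors the failure modes described in the abstract: an answer can be well-formed and ordered while still dropping or inventing items, and an answer ordered by meaning can fail a purely lexical comparison.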

Paper Structure

This paper contains 31 sections, 4 equations, 14 figures, 9 tables.

Figures (14)

  • Figure 1: Boxplot of the number of tokens during the reasoning process for the different values of the $ValidityScore$.
  • Figure 2: Different types of lists we found in the output of the LLMs that were not valid Python. The plot is split, since the first two types appear hundreds of times, while the others are rather corner cases with at most twenty instances.
  • Figure 3: Aggregated results for all basic tasks by list size.
  • Figure 4: $ValidityScore$ for all basic tasks by list size.
  • Figure 5: $SortingScore$ for all basic tasks by list size.
  • ...and 9 more figures