Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

Pengfei Hong, Navonil Majumder, Deepanway Ghosal, Somak Aditya, Rada Mihalcea, Soujanya Poria

TL;DR

This work investigates the robustness of large language models in mathematical and coding tasks by introducing an ontology-guided perturbation framework. It creates two datasets, GSMore and HumanEval-Core, by semi-automatically perturbing seed questions from GSM8K and HumanEval and validating them through a human-in-the-loop process. The authors systematically evaluate a range of closed- and open-source LLMs, revealing significant performance drops under perturbations and highlighting that current systems struggle with deeper conceptual reasoning and format changes. The proposed ontology provides a structured, extensible framework for robustness testing and data augmentation, and the open-source datasets offer a resource for ongoing evaluation of structured reasoning in LLMs.

Abstract

Recent advancements in Large Language Models (LLMs) have showcased striking results on existing logical reasoning benchmarks, with some models even surpassing human performance. However, the true depth of their competencies and robustness in reasoning tasks remains an open question. To this end, we focus on two popular reasoning tasks: arithmetic reasoning and code generation. In particular, we introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets of perturbed math and coding problems, GSMore and HumanEval-Core, respectively, to probe LLM capabilities in numeric reasoning and coding tasks. Through comprehensive evaluations of both closed-source and open-source LLMs, we show a significant performance drop across all the models on the perturbed questions, suggesting that current LLMs lack robust problem-solving skills and structured reasoning abilities in many areas, as defined by our ontology. We open-source the datasets and source code at: https://github.com/declare-lab/LLM-ReasoningTest.

Paper Structure (63 sections, 4 figures, 11 tables)

This paper contains 63 sections, 4 figures, 11 tables.

Figures (4)

  • Figure 1: A semi-automated pipeline for creating GSMore, starting from five simple questions from GSM8K. An analogous pipeline is used to create the perturbations of the coding questions from HumanEval, named HumanEval-Core.
  • Figure 2: Examples of the original questions and perturbed questions with Logic, Concept and Format as Targets. The targeted change for each question is highlighted with a yellow background.
  • Figure 3: Performance drop of LLMs for various dimensions (Level III) of perturbations in (a) GSMore and (b) HumanEval-Core as compared to the performance on the original GSM8K and HumanEval test datasets.
  • Figure 4: Model performance for each question. Blue indicates that the model answered the original question correctly; orange indicates that it did not. The labels '3', '4', '5', '7', and '8' denote the number of steps in the gold answer for the perturbed question.
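
The performance drop reported in Figure 3 can be illustrated with a minimal sketch. This is not the authors' evaluation code: it assumes the drop is simply accuracy on the original questions minus accuracy on the perturbed questions, computed per perturbation category. The record layout, the function names `accuracy` and `performance_drop`, and the toy results are all hypothetical; the category labels are borrowed from the targets named in Figure 2 purely for illustration.

```python
from collections import defaultdict


def accuracy(records):
    """Fraction of records whose 'correct' flag is True."""
    return sum(r["correct"] for r in records) / len(records)


def performance_drop(original, perturbed):
    """Return {category: original accuracy - perturbed accuracy}.

    Assumption: each perturbed record carries a 'category' label and a
    boolean 'correct' flag; this layout is hypothetical.
    """
    base = accuracy(original)
    by_category = defaultdict(list)
    for record in perturbed:
        by_category[record["category"]].append(record)
    return {cat: base - accuracy(recs) for cat, recs in by_category.items()}


# Toy, made-up results for one model (not taken from the paper).
original = [
    {"correct": True},
    {"correct": True},
    {"correct": True},
    {"correct": False},
]
perturbed = [
    {"category": "Logic", "correct": True},
    {"category": "Logic", "correct": False},
    {"category": "Format", "correct": False},
    {"category": "Format", "correct": False},
]

drops = performance_drop(original, perturbed)
for category, drop in sorted(drops.items()):
    print(f"{category}: drop of {drop:.2f} in accuracy")
```

A positive value means the model did worse on that category of perturbed questions than on the original set. Grouping by a different label, such as a finer-grained dimension of the ontology, only requires changing the key used in `by_category`.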