CatCode: A Comprehensive Evaluation Framework for LLMs On the Mixture of Code and Text

Zhenru Lin; Yiqun Yao; Yang Yuan

TL;DR

CatCode proposes a category-theory–driven evaluation framework for LLMs on tasks that mix code and natural language, aiming to address fragmentation and lack of standardization in prior methods. It formalizes programming languages and natural languages as categories, with objects as functionally equivalent programs, morphisms as edits, and functors as cross-category mappings capturing translation, generation, explanation, and reproduction. The authors implement a standardized evaluation platform and empirically compare ChatGPT, Text-Davinci, and CodeGeeX across morphism identification, code translation, and explanation/reproduction tasks, revealing strengths in local morphism reasoning and translation but gaps in preserving functional equivalence across NL/code. The approach is open-source and scalable, providing a principled basis for broader, more robust evaluation of mixed NL/code capabilities in LLMs and guiding future development of code-aware AI systems.

Abstract

Large language models (LLMs) such as ChatGPT are increasingly proficient in understanding and generating a mixture of code and text. Evaluation based on such $\textit{mixture}$ can lead to a more comprehensive understanding of the models' abilities in solving coding problems. However, in this context, current evaluation methods are either limited in task coverage or lack standardization. To address this issue, we propose using category theory as a framework for evaluation. Specifically, morphisms within a code category can represent code debugging and transformation, functors between two categories represent code translation, and functors between a code category and a natural language category represent code generation, explanation, and reproduction. We present an automatic evaluation framework called $\textbf{CatCode}$ ($\textbf{Cat}$egory $\textbf{Code}$) that can comprehensively assess the coding abilities of LLMs, including ChatGPT, Text-Davinci, and CodeGeeX.
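
To make the categorical vocabulary concrete, here is a minimal sketch; it is not the CatCode implementation. It assumes that functional equivalence can be approximated by agreement on a finite set of sample inputs, so two programs count as the same object when they agree on every sample. An edit (morphism) is labeled `eq` when it stays within that object and `neq` otherwise. The helper names `same_object` and `classify_morphism` and the toy programs are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, Iterable

Program = Callable[[Any], Any]


def same_object(p: Program, q: Program, inputs: Iterable[Any]) -> bool:
    """Approximate functional equivalence: p and q agree on every sample input.

    Assumption: agreement on a finite sample stands in for true equivalence.
    """
    return all(p(x) == q(x) for x in inputs)


def classify_morphism(before: Program, after: Program, inputs: Iterable[Any]) -> str:
    """Label an edit 'eq' if it maps an object to itself, otherwise 'neq'."""
    return "eq" if same_object(before, after, list(inputs)) else "neq"


# Toy programs (hypothetical): three ways to "sum the integers 0..n".
def sum_loop(n: int) -> int:
    total = 0
    for i in range(n + 1):
        total += i
    return total


def sum_formula(n: int) -> int:
    # Refactoring of sum_loop: same behavior, different code.
    return n * (n + 1) // 2


def sum_off_by_one(n: int) -> int:
    # Buggy edit: range(n) stops at n - 1, so n itself is never added.
    return sum(range(n))


samples = range(10)
print(classify_morphism(sum_loop, sum_formula, samples))     # eq
print(classify_morphism(sum_loop, sum_off_by_one, samples))  # neq
```

In this reading, a refactoring is a morphism that leaves the object unchanged, while a bug-introducing or bug-fixing edit moves between objects. A translation or explain-then-reproduce round trip could be checked the same way, by comparing the original program against the program obtained after the mapping.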

Paper Structure (36 sections, 2 equations, 9 figures, 7 tables)

Introduction
Methods
Comprehensive Categorical Perspective
Standardized Evaluation Platform
Experiments
Research Questions and Basic Settings
Experiment 1: Morphism Identification Within a Code Category
Categorical Perspective Settings
Implementation
Results
Experiment 2: Translation Functor Between Different PL Categories
Categorical Perspective Settings
Implementation
Results
Experiment 3: Explanation Functor and Reproduction Functor Between PL and NL Categories
...and 21 more sections

Figures (9)

Figure 1: The overall evaluation framework. We use category perspectives to reorganize and transform data, formulate different coding tasks, and conduct model evaluations.
Figure 2: Categorical framework for a mixture of code and NL. $A$, $B$ and $C$ represent different objects; $A'$ and $A''$ represent the equivalent objects of $A$ in other categories.
Figure 3: Standardized evaluation platform. The central pipeline offers a consistent approach for all evaluations. Behind the pipeline, we provide a variety of functions to automatically conduct the most important steps. With our platform released, the pipeline can easily accommodate novel datasets, tasks, and models by following the instructions outlined alongside the grey lines.
Figure 4: Morphism Identification Experiment. "1", "2" and "global" stand for the distance of the code. "Eq" and "neq" indicate whether the morphism is a self-morphism. (Left) An illustration of morphisms and the definition of object distance. (Right) Comparison of Text-Davinci and ChatGPT for morphism identification.
Figure 5: Comparison of model performance. (Left) Model as a translation functor. (Right) Model as the combination of explanation functor and reproduction functor.
...and 4 more figures

Theorems & Definitions (2)

Definition 2.1
Definition 2.2
