CatCode: A Comprehensive Evaluation Framework for LLMs On the Mixture of Code and Text
Zhenru Lin, Yiqun Yao, Yang Yuan
TL;DR
CatCode proposes a category-theory-driven evaluation framework for LLMs on tasks that mix code and natural language, aiming to address the fragmentation and lack of standardization in prior methods. It formalizes programming languages and natural languages as categories: objects are functionally equivalent programs, morphisms are edits, and functors are cross-category mappings that capture translation, generation, explanation, and reproduction. The authors implement a standardized evaluation platform and empirically compare ChatGPT, Text-Davinci, and CodeGeeX on morphism identification, code translation, and explanation/reproduction tasks. The results reveal strengths in local morphism reasoning and translation, but gaps in preserving functional equivalence between natural language and code. The approach is open-source and scalable, providing a principled basis for broader, more robust evaluation of mixed NL/code capabilities in LLMs and guiding future development of code-aware AI systems.
Abstract
Large language models (LLMs) such as ChatGPT are increasingly proficient in understanding and generating a mixture of code and text. Evaluation based on such a $\textit{mixture}$ can lead to a more comprehensive understanding of the models' abilities in solving coding problems. However, in this context, current evaluation methods are either limited in task coverage or lack standardization. To address this issue, we propose using category theory as a framework for evaluation. Specifically, morphisms within a code category can represent code debugging and transformation, functors between two code categories can represent code translation, and functors between a code category and a natural language category can represent code generation, explanation, and reproduction. We present an automatic evaluation framework called $\textbf{CatCode}$ ($\textbf{Cat}$egory $\textbf{Code}$) that can comprehensively assess the coding abilities of LLMs, including ChatGPT, Text-Davinci, and CodeGeeX.
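To make the morphism idea concrete, the following is a minimal illustrative sketch and not the CatCode implementation. It assumes programs can be modeled as Python callables and that functional equivalence can be approximated by agreement on a finite set of sample inputs. All names (`functionally_equivalent`, `original`, `refactored`, `buggy`, `SAMPLE_INPUTS`) are hypothetical. An edit that preserves behavior stays within the same equivalence class, while a bug-introducing (or bug-fixing) edit moves to a different one.

```python
from typing import Callable, Iterable

# Assumption: a "program" is modeled as a function from int to int.
Program = Callable[[int], int]


def functionally_equivalent(p: Program, q: Program, inputs: Iterable[int]) -> bool:
    """Approximate functional equivalence: p and q agree on every sampled input."""
    return all(p(x) == q(x) for x in inputs)


def original(n: int) -> int:
    """Sum of the integers 0..n, written as a loop."""
    total = 0
    for i in range(n + 1):
        total += i
    return total


def refactored(n: int) -> int:
    """Behavior-preserving edit of `original` (closed-form sum)."""
    return n * (n + 1) // 2


def buggy(n: int) -> int:
    """Behavior-changing edit of `original` (off-by-one: sums 0..n-1)."""
    return sum(range(n))


# Hypothetical test inputs; a real evaluation would use a proper test suite.
SAMPLE_INPUTS = range(0, 20)

# The refactoring stays in the same equivalence class as the original program.
same_object = functionally_equivalent(original, refactored, SAMPLE_INPUTS)

# The buggy edit lands in a different class; debugging is the edit back.
changed_object = not functionally_equivalent(original, buggy, SAMPLE_INPUTS)
```

Agreement on sampled inputs only approximates equivalence, so a check of this kind can miss differences that lie outside the sample.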
