Code LLMs: A Taxonomy-based Survey
Nishat Raihan, Christian Newman, Marcos Zampieri
TL;DR
This taxonomy-based survey analyzes Code LLMs through five interrelated areas: Tasks, Corpora, Models, Benchmarks, and Challenges. Together these form a unified framework for mapping encoder-only, encoder-decoder, and decoder-only approaches, including foundational and finetuned variants. It discusses representative models (e.g., BERT-based encoders, CodeT5, GPT-4, LLaMA-based decoders), training data strategies (code-focused vs. general corpora, synthetic data), and architecture innovations (AST/CFG integration, RoPE, MQA, SwiGLU, KV caching). The paper also reviews current benchmarks (HumanEval, MBPP, etc.), highlights critical challenges (data quality, evaluation bias, multilingual coverage, and resource constraints), and outlines open problems and potential solutions to guide future research. Overall, it offers a forward-looking perspective intended to help researchers and practitioners build robust, scalable, and fair Code LLMs with realistic benchmarks and broader language coverage.
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across various NLP tasks and have recently expanded their impact to coding tasks, bridging the gap between natural languages (NL) and programming languages (PL). This taxonomy-based survey provides a comprehensive analysis of LLMs in the NL-PL domain, investigating how these models are utilized in coding tasks and examining their methodologies, architectures, and training processes. We propose a taxonomy-based framework that categorizes relevant concepts, providing a unified classification system to facilitate a deeper understanding of this rapidly evolving field. This survey offers insights into the current state and future directions of LLMs in coding tasks, including their applications and limitations.
