Cross-Task Benchmarking and Evaluation of General-Purpose and Code-Specific Large Language Models
Gunjan Das, Paheli Bhattacharya, Rishabh Gupta
TL;DR
The paper addresses the lack of cross-domain evaluation with a unified benchmarking study of general-purpose and code-specific LLMs across linguistic, reasoning, and trustworthiness tasks, plus a code-explanation analysis on CoNaLa. It shows that code-tuned models such as CodeLlama-34B often deliver stronger cross-domain performance than general-purpose models, suggesting that domain-specific pretraining carries benefits beyond its target domain. Performance on natural language tasks improves predictably with model scale, while code-focused models also do well in reasoning and textual precision outside of coding. The results inform practical model selection for real-world tasks, indicating when to prioritize trustworthiness, complex reasoning, or code-to-text generation, and point to future work on expanding benchmarks and task types.
Abstract
Large Language Models (LLMs) have revolutionized both general natural language processing and domain-specific applications such as code synthesis, legal reasoning, and finance. However, while prior studies have explored individual model capabilities, a systematic cross-domain comparison that unifies linguistic, reasoning, and code understanding abilities remains underexplored. In this work, we present a comprehensive evaluation of five general-purpose and three code-specific state-of-the-art LLMs across six diverse benchmarks encompassing linguistic competence, mathematical reasoning, and trustworthiness. Additionally, we analyze model behavior on the CoNaLa dataset for code explanation, comparing natural language and code-specialized LLMs. Our findings reveal that models optimized for code (e.g., CodeLLaMA variants) exhibit strong reasoning and syntactic precision and can show measurable performance gains even on non-coding tasks, in contrast to general-purpose models like Mistral-7B and Llama-3-8B.
