Multi Language Models for On-the-Fly Syntax Highlighting

Marco Edoardo Palma, Pooja Rani, Harald C. Gall

TL;DR

This paper addresses real-time, on-the-fly syntax highlighting across multiple programming languages by introducing unified multi-language CNN-based highlighters and a token normalization strategy. The approach maintains near-parity with single-language baselines while reducing deployment complexity, data requirements, and maintenance costs. Token normalization improves cross-language generalization and makes few-shot adaptation effective, though fully replacing large multi-language models with few-shot techniques remains challenging. Collectively, the results demonstrate scalable, fast, and accurate syntax highlighting across languages, with practical implications for online development tools and collaborative coding platforms.

Abstract

Syntax highlighting is a critical feature in modern software development environments, enhancing code readability and developer productivity. However, delivering accurate highlighting in real time remains challenging for online and web-based development tools due to strict time and memory constraints on backend services. These systems must serve highlights rapidly and frequently, even when code is partially valid or invalid. This has led to on-the-fly syntax highlighting, where visual annotations are generated just before content is served, often at high request rates and under incomplete input conditions. To meet these demands efficiently, state-of-the-art models use deep learning to learn the behavior of brute-force syntax highlighting resolvers, tools that are easy to implement but too slow for production. Through the Deep Abstraction process, brute-force strategies are encoded into fast statistical models that achieve both high accuracy and low-latency inference. Despite their success, such models face key challenges: they support only one programming language per model, require large datasets from slow brute-force generators, and involve resource-intensive training. In multi-language environments, this means maintaining multiple independent models, increasing system complexity and operational cost. This work addresses these issues by introducing a unified model capable of highlighting up to six mainstream programming languages, reducing deployment complexity by a factor of six and improving performance on unseen languages. A novel normalization technique significantly enhances model generalization, while few-shot learning experiments show that a small number of oracle samples can replace large datasets, minimizing dependence on brute-force generators. Combined, these innovations enable efficient, scalable, and cost-effective syntax highlighting across diverse programming languages.

Paper Structure

This paper contains 27 sections, 1 figure, 6 tables.

Figures (1)

  • Figure 1: Illustration of how the original dataset of 20k samples per language is structured for training and validation within a single fold. Single-Language Task: A CNN model is trained on a single language and tested on its respective test set, repeated for each coverage task and language. Multi-Language Task: A model is trained on a merged dataset of all six languages and evaluated on each language’s test set, repeated for each coverage task. Few-Shot Task: A model is trained on a single language (Ln), fine-tuned on a small sample from other languages (FS), and tested on the same test sets as the other tasks.
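
The data layout described in the Figure 1 caption can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the language names `L1` to `L6` are placeholders, and the fold count, the few-shot sample size, and the helper names `split_fold` and `build_tasks` are hypothetical. Only the figure of 20k samples per language and the three task definitions come from the caption. The sketch also ignores the repetition over coverage tasks.

```python
import random

# Placeholder names for the six languages (the caption does not list them).
LANGUAGES = [f"L{i}" for i in range(1, 7)]
SAMPLES_PER_LANGUAGE = 20_000  # from the Figure 1 caption
N_FOLDS = 3                    # assumption: fold count is not given here
FEW_SHOT_SIZE = 10             # assumption: few-shot sample size is not given here


def split_fold(samples, n_folds, rng):
    """Shuffle one language's samples and hold out one fold as the test set."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = len(shuffled) // n_folds
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def build_tasks(datasets, base_language, few_shot_size, n_folds, seed=0):
    """Assemble training data for the three tasks within a single fold.

    All three tasks share the same per-language test sets.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for lang, samples in datasets.items():
        train[lang], test[lang] = split_fold(samples, n_folds, rng)

    # Single-Language Task: one training set per language.
    single = {lang: train[lang] for lang in datasets}

    # Multi-Language Task: one merged training set over all languages.
    multi = [sample for lang in datasets for sample in train[lang]]

    # Few-Shot Task: train on the base language (Ln), then fine-tune on a
    # small sample (FS) drawn from each of the other languages.
    few_shot = {
        "base": train[base_language],
        "fine_tune": {
            lang: rng.sample(train[lang], few_shot_size)
            for lang in datasets
            if lang != base_language
        },
    }
    return {"single": single, "multi": multi, "few_shot": few_shot, "test": test}
```

Calling `build_tasks` with a dictionary that maps each language to its list of samples returns the three training configurations together with the shared test sets, which mirrors how the caption evaluates every task on the same per-language test data.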