On-the-Fly Syntax Highlighting: Generalisation and Speed-ups

Marco Edoardo Palma, Alex Wolf, Pasquale Salza, Harald C. Gall

TL;DR

This work tackles on-the-fly syntax highlighting by generalising the approach to six mainstream languages and introducing a CNN-based model that delivers substantial GPU-accelerated speed-ups without sacrificing accuracy. It combines an oracle-generation pipeline that reuses each language's lexer and parser with a neural predictor, formulating the task as a sequence-to-sequence translation from token rules to highlighting classes. The study reports near-perfect accuracy across languages and coverage tasks for the CNN models, while the RNN and BRNN baselines retain high accuracy; GPUs yield especially large gains for CNNs, making real-time highlighting feasible at scale. The findings support adopting CNN-based syntax highlighting as a fast, scalable alternative to RNN/BRNN-based systems in practical online coding environments, with replication resources provided for reproducibility.

Abstract

On-the-fly syntax highlighting is the task of rapidly associating visual secondary notation values with each character of a language derivation. Research in this domain is driven by the prevalence of online software development tools, which frequently display source code on screen and rely heavily on syntax highlighting mechanisms. In this context, three contrasting demands confront resolvers in this space: speed, accuracy, and development costs. Speed constraints are essential to ensure tool usability, manifesting as responsiveness for end users accessing online source code and as minimal system overhead. Simultaneously, achieving precise highlighting is critical for enhancing code comprehensibility. Nevertheless, obtaining accurate results necessitates the capacity to perform grammatical analysis on the code under consideration, even in cases of varying grammatical correctness. Furthermore, addressing the development costs of such resolvers is imperative, given the multitude of programming language versions. The current state-of-the-art approach in this field leverages the original lexer and parser of programming languages to create syntax highlighting oracles, which are subsequently used for training base Recurrent Neural Network (RNN) models. As the question of the generalisation of such a solution persists, this paper addresses this aspect by extending the original work to three additional mainstream programming languages and conducting a comprehensive review of the outcomes. Moreover, the original limitations in evaluation performance and training costs are mitigated through the introduction of a novel Convolutional Neural Network (CNN) based model. This study also examines the performance gains of running the models on GPUs, finding that the new CNN implementation is much faster than previous methods while maintaining high accuracy.
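
To make the task formulation concrete, the following is a minimal sketch of how a sequence of token rules can be paired with one highlighting class per token and then expanded to one class per character. It is an illustration under stated assumptions, not the authors' pipeline: Python's standard `tokenize` module stands in for a language lexer, the class names (`ANY`, `KEYWORD`, `LITERAL`, `COMMENT`, `FUNC_DECL`) are hypothetical, and `toy_oracle` is a hand-written rule set rather than an oracle derived from a real parser. The neural model, which would learn to predict `targets` from `rule_ids`, is not shown.

```python
import io
import keyword
import tokenize

# Hypothetical highlighting classes; the paper's actual class set may differ.
ANY, KEYWORD, LITERAL, COMMENT, FUNC_DECL = range(5)

# Layout-only tokens that carry no visible text to colour.
SKIPPED = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
           tokenize.DEDENT, tokenize.ENDMARKER}


def lex(source):
    """Return (token_rule_id, text, start, end) tuples from Python's own lexer."""
    stream = tokenize.generate_tokens(io.StringIO(source).readline)
    return [(t.type, t.string, t.start, t.end)
            for t in stream if t.type not in SKIPPED]


def toy_oracle(tokens):
    """Assign one highlighting class per token (output length == input length)."""
    classes = []
    prev_text = None
    for rule, text, _, _ in tokens:
        if rule == tokenize.COMMENT:
            cls = COMMENT
        elif rule in (tokenize.NUMBER, tokenize.STRING):
            cls = LITERAL
        elif rule == tokenize.NAME and keyword.iskeyword(text):
            cls = KEYWORD
        elif rule == tokenize.NAME and prev_text == "def":
            # Needs grammatical context: the lexer alone only sees a NAME.
            cls = FUNC_DECL
        else:
            cls = ANY
        classes.append(cls)
        prev_text = text
    return classes


def per_character(source, tokens, classes):
    """Expand per-token classes to one class per character of the source."""
    offsets = [0]
    for line in source.splitlines(keepends=True):
        offsets.append(offsets[-1] + len(line))
    out = [ANY] * len(source)
    for (_, _, (srow, scol), (erow, ecol)), cls in zip(tokens, classes):
        start = offsets[srow - 1] + scol
        end = offsets[erow - 1] + ecol
        out[start:end] = [cls] * (end - start)
    return out


source = "def area(r):\n    return 3.14 * r * r  # circle\n"
tokens = lex(source)
rule_ids = [rule for rule, _, _, _ in tokens]   # model input: token rules
targets = toy_oracle(tokens)                    # model output: highlighting classes
chars = per_character(source, tokens, targets)  # per-character view for rendering
```

In this sketch `rule_ids` and `targets` always have the same length, which is what allows the problem to be treated as a token-by-token sequence translation. The `FUNC_DECL` case hints at why a lexer alone is not enough: `area` is lexed as an ordinary name, and only its grammatical context marks it as a declaration.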

Paper Structure

This paper contains 33 sections, 1 equation, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Accuracy values comparison for T4.
  • Figure 2: Execution time (ms) values trends comparison for T4.
  • Figure 3: Accuracy values comparison for incomplete language derivations.