SCALAR: A Part-of-speech Tagger for Identifiers
Christian D. Newman, Brandon Scholten, Sophia Testa, Joshua A. C. Behler, Syreen Banabilah, Michael L. Collard, Michael J. Decker, Mohamed Wiem Mkaouer, Marcos Zampieri, Eman Abdullah AlOmar, Reem Alsuhaibani, Anthony Peruma, Jonathan I. Maletic
TL;DR
This work addresses the problem of understanding and improving the semantics of source code identifiers by introducing SCALAR, a specialized part-of-speech tagger that maps identifier names to grammar patterns (part-of-speech tag sequences). SCALAR uses a GradientBoostingClassifier trained on the combined General and Closed Grammar Datasets, drawing on embedding-based and lexical features and relying only to a limited extent on external taggers, so that it captures code-specific part-of-speech usage. The approach tags identifiers faster, and often more accurately, than off-the-shelf taggers and earlier identifier taggers, helped by caching and a focus on grammar-pattern generation. The tool can be deployed with Docker and exposes a REST API, so researchers and developers can analyze, critique, and refine identifier naming in real workflows.
Abstract
The paper presents the Source Code Analysis and Lexical Annotation Runtime (SCALAR), a tool specialized for annotating source code identifier names with their corresponding part-of-speech tag sequence (grammar pattern). SCALAR's internal model is trained using scikit-learn's GradientBoostingClassifier in conjunction with a manually curated oracle of identifier names and their grammar patterns. Training on this oracle specializes the tagger to recognize the unique structure of the natural language that developers use to create all types of identifiers (e.g., function names, variable names, etc.). SCALAR's output is compared with that of a previous version of the tagger, as well as a modern off-the-shelf part-of-speech tagger, to show how it improves upon other taggers' output for annotating identifiers. The code is available on GitHub.
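
To make the notion of a grammar pattern concrete, the following is a minimal sketch of the input and output shape only, not SCALAR's implementation. It splits an identifier into words and assembles a tag sequence from a tiny hand-written lookup table, whereas SCALAR predicts tags with a trained classifier. The tag abbreviations (V for verb, N for noun, NM for noun modifier, P for preposition), the table TOY_TAGS, the function names, and the rule that a noun directly before another noun becomes a modifier are all assumptions made for illustration.

```python
import re

# Hypothetical toy lookup; SCALAR uses a trained model, not a table like this.
TOY_TAGS = {"get": "V", "set": "V", "user": "N", "name": "N", "count": "N", "to": "P"}


def split_identifier(name: str) -> list[str]:
    """Split an identifier on underscores and camelCase boundaries."""
    words = []
    for part in name.split("_"):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return words


def grammar_pattern(name: str) -> str:
    """Return a space-separated tag sequence for the identifier (toy rules)."""
    words = split_identifier(name)
    # Assumption: words missing from the toy table default to noun.
    tags = [TOY_TAGS.get(word.lower(), "N") for word in words]
    # Assumption: a noun directly followed by another noun acts as a modifier.
    for i in range(len(tags) - 1):
        if tags[i] == "N" and tags[i + 1] == "N":
            tags[i] = "NM"
    return " ".join(tags)


if __name__ == "__main__":
    for identifier in ("getUserName", "user_count", "HTTPServer"):
        print(identifier, "->", grammar_pattern(identifier))
```

Under these toy rules, getUserName is split into get, User, and Name and receives the pattern V NM N. A learned tagger replaces the lookup step with a per-word prediction based on features of the word and its context.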
