Embedding Software Intent: Lightweight Java Module Recovery

Yirui He, Yuqi Huai, Xingyu Chen, Joshua Garcia

TL;DR

The paper tackles architectural decay in large Java systems by recovering modules from monolithic code, using architectures declared through the Java Platform Module System (JPMS) as ground truth. It introduces ClassLAR, a lightweight, LM-based approach that takes fully-qualified class names (FQCNs) as input, embeds them with CodeBERT-java, and clusters the embeddings with UMAP-HDBSCAN, followed by undersized-module repair to produce well-encapsulated Java modules. Across 20 popular Java projects, ClassLAR outperforms four state-of-the-art architecture recovery techniques on architecture resemblance (a2a) and encapsulation (MQ), while also running 3.99× to 10.50× faster. The work provides a scalable benchmark dataset of ground-truth architectures and demonstrates that FQCN-based language-model embeddings can effectively infer modular structures without code-level dependencies. Future work includes incorporating runtime dynamics and interface design considerations to balance encapsulation with practical inter-module interactions.

Abstract

As an increasing number of software systems reach unprecedented scale, relying solely on code-level abstractions is becoming impractical. While architectural abstractions offer a means to manage these systems, maintaining their consistency with the actual code has been problematic. The Java Platform Module System (JPMS), introduced in Java 9, addresses this limitation by enabling explicit module specification at the language level. JPMS enhances architectural implementation through improved encapsulation and direct specification of ground-truth architectures within Java projects. Although many projects are written in Java, modularizing existing monolithic projects into JPMS modules is an open challenge due to ineffective module recovery by existing architecture recovery techniques. To address this challenge, this paper presents ClassLAR (Class- and Language model-based Architectural Recovery), a novel, lightweight, and efficient approach that recovers Java modules from monolithic Java systems using fully-qualified class names. ClassLAR leverages language models to extract semantic information from package and class names, capturing both structural and functional intent. In evaluations across 20 popular Java projects, ClassLAR outperformed all state-of-the-art techniques in architectural-level similarity metrics while achieving execution times that were 3.99 to 10.50 times faster.
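To make concrete what "semantic information from package and class names" could look like, the following minimal Python sketch splits a fully-qualified class name into its package segments and the words of its camel-case class name. This is only an illustration: the helper names `split_fqcn` and `name_tokens` and the sample class name are hypothetical, and the summary does not say how ClassLAR prepares names before passing them to the language model.

```python
import re


def split_fqcn(fqcn: str) -> tuple[str, str]:
    """Split a fully-qualified class name into (package, simple class name)."""
    package, _, simple = fqcn.rpartition(".")
    return package, simple


def name_tokens(fqcn: str) -> list[str]:
    """Return lower-cased tokens: package segments, then camel-case words.

    Assumption for illustration only; this is not ClassLAR's documented
    preprocessing.
    """
    package, simple = split_fqcn(fqcn)
    segments = package.split(".") if package else []
    # Acronym runs (XML), capitalized or lower-case words (Parser), digit runs.
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", simple)
    return [token.lower() for token in segments + words]


# Hypothetical class name, used only to exercise the helpers.
tokens = name_tokens("org.example.http.HttpClientFactory")
```

The package segments carry the structural signal (where a class lives), while the class-name words carry the functional signal (what it does); both kinds of token are visible in the result.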

Paper Structure

This paper contains 31 sections, 5 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: JPMS Structure
  • Figure 2: ClassLAR Recovery Process
  • Figure 3: Undersized Modules: A clustering with singleton modules (p2.A2 and p3.A5) and split packages p2 and p3.
  • Figure 4: Runtime Evaluation
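
The situation in the Figure 3 caption (singleton modules that leave packages split across modules) can be illustrated with a small Python sketch. It assumes one plausible repair rule: dissolve every module below a minimum size and move each of its classes into the adequately sized module that already holds the most classes from the same package. The summary does not spell out ClassLAR's actual repair step, so the function `repair_undersized`, the `min_size` parameter, and all sample data other than the class names `p2.A2` and `p3.A5` are hypothetical.

```python
from collections import Counter, defaultdict


def package_of(fqcn: str) -> str:
    """Return the package part of a fully-qualified class name."""
    return fqcn.rpartition(".")[0]


def repair_undersized(assignment: dict[str, int], min_size: int = 2) -> dict[str, int]:
    """Reassign classes from undersized modules (illustrative rule only).

    `assignment` maps each class name to a module id. A class in a module
    with fewer than `min_size` classes moves to the adequately sized module
    holding the most classes of its package; if no such module exists, the
    class stays where it is.
    """
    sizes = Counter(assignment.values())
    undersized = {module for module, size in sizes.items() if size < min_size}

    # For each package, count its classes in each adequately sized module.
    votes: dict[str, Counter] = defaultdict(Counter)
    for cls, module in assignment.items():
        if module not in undersized:
            votes[package_of(cls)][module] += 1

    repaired = dict(assignment)
    for cls, module in assignment.items():
        candidates = votes[package_of(cls)]
        if module in undersized and candidates:
            repaired[cls] = candidates.most_common(1)[0][0]
    return repaired


# Hypothetical clustering: p2.A2 and p3.A5 are singleton modules, so
# packages p2 and p3 are each split across two modules.
clusters = {"p1.A1": 0, "p2.A1": 0, "p2.A2": 1, "p3.A3": 2, "p3.A4": 2, "p3.A5": 3}
repaired = repair_undersized(clusters)
```

In this toy input the two singletons rejoin the modules that hold the rest of their packages, so neither `p2` nor `p3` remains split. A package-majority rule is only one option; a repair step could equally use embedding distance to pick the target module.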