Analysing the Behaviour of Tree-Based Neural Networks in Regression Tasks

Peter Samoaa; Mehrdad Farahani; Antonio Longa; Philipp Leitner; Morteza Haghir Chehreghani

TL;DR

This work investigates how tree-based neural networks perform in regression tasks on source code, focusing on predicting execution time from AST representations. It introduces a Dual-Transformer architecture with separate NL-Encoder and AST-Encoder modules that interact via cross-attention to fuse lexical and syntactic information for prediction. Across two real-world datasets (OSSBuilds and HadoopTests), the Dual-Transformer consistently outperforms GNN and traditional Tree-Based Neural Network baselines in MSE, MAE, and Pearson correlation, demonstrating robustness to varying data sizes and cross-dataset transfer. The study also provides an open-source framework and datasets to facilitate broader evaluation of tree-based models in regression tasks, signaling a promising direction for reliable performance estimation in software engineering analyses.
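The three reported metrics can be illustrated with a minimal pure-Python sketch. The function names and the sample execution times below are hypothetical and chosen only for illustration; they are not taken from the paper's framework or datasets.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between real and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error between real and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between real and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Hypothetical execution times in seconds (illustrative values only).
real = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.0, 2.5, 4.0]

print("MSE:", mse(real, pred))
print("MAE:", mae(real, pred))
print("Pearson:", pearson(real, pred))
```

Lower MSE and MAE indicate smaller prediction errors, while a Pearson correlation closer to 1 indicates that predictions track the real values more closely in a linear sense.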

Abstract

The landscape of deep learning has vastly expanded the frontiers of source code analysis, particularly through the utilization of structural representations such as Abstract Syntax Trees (ASTs). While these methodologies have demonstrated effectiveness in classification tasks, their efficacy in regression applications, such as execution time prediction from source code, remains underexplored. This paper endeavours to decode the behaviour of tree-based neural network models in the context of such regression challenges. We extend the application of established models--tree-based Convolutional Neural Networks (CNNs), Code2Vec, and Transformer-based methods--to predict the execution time of source code by parsing it to an AST. Our comparative analysis reveals that while these models are benchmarks in code representation, they exhibit limitations when tasked with regression. To address these deficiencies, we propose a novel dual-transformer approach that operates on both source code tokens and AST representations, employing cross-attention mechanisms to enhance interpretability between the two domains. Furthermore, we explore the adaptation of Graph Neural Networks (GNNs) to this tree-based problem, theorizing the inherent compatibility due to the graphical nature of ASTs. Empirical evaluations on real-world datasets showcase that our dual-transformer model outperforms all other tree-based neural networks and the GNN-based models. Moreover, our proposed dual transformer demonstrates remarkable adaptability and robust performance across diverse datasets.
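As a rough illustration of how cross-attention can fuse token and AST representations before a regression head, the sketch below implements single-head scaled dot-product cross-attention in pure Python. It is a simplification under stated assumptions, not the authors' implementation: there are no learned projections, token states are assumed to act as queries over AST node states, mean pooling is assumed before the linear regressor, and all names and values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


def regress(fused, weight, bias):
    """Mean-pool the fused states, then apply a linear regressor (assumed pooling)."""
    dim = len(fused[0])
    pooled = [sum(row[j] for row in fused) / len(fused) for j in range(dim)]
    return sum(w * x for w, x in zip(weight, pooled)) + bias


# Hypothetical encoder outputs: 3 source-code tokens and 2 AST nodes, dimension 2.
token_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ast_states = [[1.0, 0.0], [0.0, 1.0]]

# Assumption: tokens are the queries; AST nodes supply keys and values.
fused = cross_attention(token_states, ast_states, ast_states)
estimate = regress(fused, weight=[0.5, 0.5], bias=0.0)
print(fused, estimate)
```

In a trained model the queries, keys, and values would come from learned linear projections of the encoder outputs, and the regressor weights would be fitted to execution-time labels.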

Paper Structure (49 sections, 10 equations, 4 figures, 5 tables)

Introduction
Background
Abstract syntax trees
Motivation Example
Related Work
Analytic Framework
Dual Transformer Model
NL-Encoder
AST-Encoder
Attention Mechanisms
Regression Head
Other GNN and TBNN Models
Graph Learning Approach
GCN (Graph Convolutional Network)
GAT (Graph Attention Network)
...and 34 more sections

Figures (4)

Figure 1: Simplified abstract syntax tree (AST) representing the illustrative example presented in Listing \ref{java:example}. Package declarations, import statements, as well as the declaration in Line 15 are skipped for brevity.
Figure 2: Abstracted General Code Representation and DL Models in Software Engineering.
Figure 3: The architecture of the Dual-Transformer model. The framework features two transformer encoders: NLEncoder for source code tokens and ASTEncoder for AST nodes, each with layers for embedding, multi-head attention, and feed-forward networks, complemented by add & norm layers for stabilization. Their outputs are merged via cross-attention and passed to a linear regressor for error prediction, leveraging both textual and syntactical insights.
Figure 4: Real vs. predicted values. Each panel reports the real (y-axis) and predicted (x-axis) values for one model. Each real-predicted pair is shown as a blue point, while the dashed red line shows a linear regression model fitted to the data.
