Representation Learning for Stack Overflow Posts: How Far are We?

Junda He, Zhou Xin, Bowen Xu, Ting Zhang, Kisub Kim, Zhou Yang, Ferdian Thung, Ivana Irsan, David Lo

TL;DR

This paper probes how best to represent Stack Overflow posts for downstream software engineering (SE) tasks. It benchmarks eleven representations: the Stack Overflow–specific Post2Vec and BERTOverflow, SE-domain models (CodeBERT, GraphCodeBERT, seBERT, CodeT5, PLBart, CodeGen), and general-domain models (RoBERTa, Longformer, GPT-2). The models are evaluated on three tasks: tag recommendation, API recommendation, and relatedness prediction. The study finds that no single model consistently wins and that continued pre-training on Stack Overflow data yields consistent gains. This motivates the proposed SOBERT model, which achieves state-of-the-art performance on all three tasks. The results offer practical guidance on model selection for Stack Overflow–driven SE tasks and motivate broader, domain-informed pre-training strategies. The work also draws lessons on when external embeddings help and on the value of domain breadth over domain-specific vocabulary, with implications for future SE NLP research and tooling.

Abstract

The tremendous success of Stack Overflow has accumulated an extensive corpus of software engineering knowledge, motivating researchers to propose various solutions for analyzing its content. The performance of such solutions hinges significantly on the choice of representation model for Stack Overflow posts. The growing volume of literature on Stack Overflow highlights the need for a powerful post representation model and drives researchers' interest in developing specialized models that can capture the intricacies of Stack Overflow posts. The state-of-the-art (SOTA) Stack Overflow post representation models are Post2Vec and BERTOverflow, which are built on popular neural architectures: the convolutional neural network (CNN) and the Transformer (e.g., BERT). Despite their promising results, these representation methods have not been evaluated in the same experimental setting. To fill this research gap, we first empirically compare the performance of the representation models designed specifically for Stack Overflow posts (Post2Vec and BERTOverflow) on a range of related tasks, i.e., tag recommendation, relatedness prediction, and API recommendation. To find more suitable representation models for the posts, we further explore a diverse set of BERT-based models, including (1) general-domain language models (RoBERTa and Longformer) and (2) language models built with software engineering-related textual artifacts (CodeBERT, GraphCodeBERT, and seBERT). However, the comparison also illustrates the "No Silver Bullet" concept, as none of the models consistently wins against all the others. Inspired by these findings, we propose SOBERT, which employs a simple yet effective strategy to improve the best-performing model by continuing the pre-training phase with textual artifacts from Stack Overflow.
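To make "continuing the pre-training phase" concrete, the sketch below shows how one training example could be built if the objective is masked language modeling, as is common for BERT-style models. This is an assumption: the abstract does not name the objective. The 15% selection rate and the 80/10/10 replacement split follow the original BERT recipe. The whitespace tokenizer, the toy vocabulary, and the function name `mask_tokens` are all hypothetical simplifications, not the authors' implementation.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder token, named as in BERT-style vocabularies


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Build one masked-language-modeling example from a token list.

    Returns (masked_tokens, labels). labels[i] is the original token at each
    selected position and None elsewhere, so the loss would only be computed
    on selected positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the token unchanged
        else:
            masked.append(tok)
            labels.append(None)  # position not selected, no loss here
    return masked, labels


# Hypothetical post title, split on whitespace for illustration only;
# real models use a subword tokenizer.
post = "How do I reverse a list in Python without using reversed ?".split()
vocab = sorted(set(post))
masked, labels = mask_tokens(post, vocab, mask_prob=0.15, seed=0)
```

In continued pre-training, examples like these would be drawn from Stack Overflow text and fed to an already pre-trained model, so that its existing weights are adapted to the new corpus rather than trained from scratch.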

Paper Structure (32 sections, 8 equations, 9 tables)