RETROcode: Leveraging a Code Database for Improved Natural Language to Code Generation
Nathanaël Beau, Benoît Crabbé
TL;DR
RETROcode introduces a retrieval-augmented memory mechanism for NL-to-code generation by integrating a large code database into a seq2seq transformer. The model retrieves neighbouring code chunks via a frozen CodeBERT encoder and merges them into the decoder through either sequential or parallel cross-attention, with a hybrid database further guiding initial decoding. On CoNaLa and CodeXGLUE, RETROcode with a classic database improves over a strong baseline, and the hybrid database configuration reaches BLEU scores on CoNaLa close to the state of the art, approaching Codex with far fewer parameters and far less training data. The work shows that memory-augmented decoding can significantly improve both the efficiency and the quality of code generation, and it highlights the role of initialization, chunk size, and database quality in practical deployment.
Abstract
As text and code resources have expanded, large-scale pre-trained models have shown promising capabilities in code generation tasks, typically employing supervised fine-tuning with problem statement-program pairs. However, increasing model size and data volume for performance gains also raises computational demands and risks of overfitting. Addressing these challenges, we present RETROcode, a novel adaptation of the RETRO architecture \cite{RETRO} for sequence-to-sequence models, utilizing a large code database as an auxiliary scaling method. This approach, diverging from simply enlarging model and dataset sizes, allows RETROcode to leverage a vast code database for prediction, enhancing the model's efficiency by integrating extensive memory. Our findings indicate that RETROcode not only outperforms similar-sized traditional architectures on test sets but also approaches the effectiveness of the much larger Codex model, despite being trained from scratch on a substantially smaller dataset.
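To make the retrieval step mentioned in the TL;DR more concrete, here is a minimal sketch of nearest-neighbour chunk lookup over a small code database. It is an illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical hashed bag-of-tokens stand-in for the frozen CodeBERT encoder, the chunk size of 4 tokens is arbitrary, and all function names are invented for this example.

```python
import hashlib
import math


def toy_embed(text, dim=64):
    """Hypothetical stand-in for a frozen encoder: hashed bag of tokens, L2-normalised."""
    vec = [0.0] * dim
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def split_into_chunks(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def build_database(snippets, chunk_size=4):
    """Index every chunk of every snippet as an (embedding, chunk text) pair."""
    database = []
    for snippet in snippets:
        for piece in split_into_chunks(snippet.split(), chunk_size):
            database.append((toy_embed(piece), piece))
    return database


def retrieve(query, database, k=2):
    """Return the k chunks with the highest cosine similarity to the query."""
    q = toy_embed(query)
    # Embeddings are unit-length, so the dot product equals cosine similarity.
    scored = [(sum(a * b for a, b in zip(q, emb)), piece) for emb, piece in database]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Assumed toy data: two pre-tokenised Python snippets.
snippets = [
    "with open ( path ) as f : data = f . read ( )",
    "result = sorted ( items , key = len )",
]
database = build_database(snippets)
neighbours = retrieve("with open ( path", database, k=2)
for score, piece in neighbours:
    print(f"{score:.3f}  {piece}")
```

In the architecture described above, the retrieved neighbours would then be passed to the decoder through cross-attention; that part is not covered by this sketch. A real system would also replace the linear scan with an approximate nearest-neighbour index once the database grows large.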
