Isotropy Matters: Soft-ZCA Whitening of Embeddings for Semantic Code Search

Andor Diera, Lukas Galke, Ansgar Scherp

TL;DR

This study investigates how isotropy in the embedding space affects semantic code search performance, explores post-processing techniques to mitigate low isotropy, and proposes a modified ZCA whitening technique (Soft-ZCA) to control the isotropy level of embeddings.

Abstract

Low isotropy in an embedding space impairs performance on tasks involving semantic inference. Our study investigates the impact of isotropy on semantic code search performance and explores post-processing techniques to mitigate this issue. We analyze various code language models, examining the isotropy of their embedding spaces and its influence on search effectiveness. We propose a modified ZCA whitening technique to control isotropy levels in embeddings. Our results demonstrate that Soft-ZCA whitening improves the performance of pre-trained code language models and can complement contrastive fine-tuning.
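
The abstract does not spell out the modification to ZCA whitening, so the following is only a minimal sketch under a stated assumption: that the "soft" variant adds a regularizer epsilon to the eigenvalues of the covariance matrix before inverting them, which would match the epsilon values varied in Figure 1. The function name `soft_zca_whiten` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np


def soft_zca_whiten(X, eps=0.1):
    """Whiten the rows of X with an eigenvalue regularizer eps (assumed form).

    X   : array of shape (n_samples, dim), one embedding per row.
    eps : non-negative regularizer; larger values whiten less aggressively.
    Returns the whitened embeddings, the whitening matrix W, and the mean.
    """
    X = np.asarray(X, dtype=np.float64)

    # Center the embeddings so the covariance is taken around the mean.
    mean = X.mean(axis=0, keepdims=True)
    Xc = X - mean

    # Sample covariance matrix of shape (dim, dim).
    cov = Xc.T @ Xc / (X.shape[0] - 1)

    # The covariance is symmetric, so eigh gives real eigenvalues and
    # orthonormal eigenvectors; clip tiny negative values from round-off.
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)

    # Assumption: W = U (Lambda + eps * I)^(-1/2) U^T.
    # With eps close to 0 this approaches standard ZCA whitening.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T

    return Xc @ W, W, mean


if __name__ == "__main__":
    # Synthetic correlated "embeddings", used purely for illustration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))
    Z, W, mean = soft_zca_whiten(X, eps=0.1)
    print(np.round(np.cov(Z, rowvar=False), 2))
```

Under this assumption, epsilon acts as a dial: a value near zero yields a covariance close to the identity (full whitening), while larger values leave more of the original variance structure in place. Rotating back with the eigenvectors is what distinguishes ZCA from PCA whitening, since it keeps the whitened vectors in the original coordinate axes.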

Paper Structure

This paper contains 12 sections, 1 figure, and 5 tables.

Figures (1)

  • Figure 1: Average IsoScore (left) and MRR measures (right) at different epsilon values on the CodeSearchNet Python dataset