A Gradient Accumulation Method for Dense Retriever under Memory Constraint

Jaehee Kim, Yukyung Lee, Pilsung Kang

TL;DR

Experiments indicate that ContAccum can surpass not only existing memory reduction methods but also the high-resource scenario. Theoretical analysis and experimental results further confirm that ContAccum provides more stable dual-encoder training than current memory bank utilization methods.

Abstract

InfoNCE loss is commonly used to train dense retrievers in information retrieval tasks. It is well known that a large batch is essential for stable and effective training with InfoNCE loss, which requires significant hardware resources. This dependence on large batches creates a bottleneck for both the application of and research on dense retrievers. Recently, memory reduction methods have been broadly adopted to resolve the hardware bottleneck by decomposing the forward and backward passes or by using a memory bank. However, current methods still suffer from slow and unstable training. To address these issues, we propose Contrastive Accumulation (ContAccum), a stable and efficient memory reduction method for dense retriever training that uses a dual memory bank structure to leverage previously generated query and passage representations. Experiments on five widely used information retrieval datasets indicate that ContAccum can surpass not only existing memory reduction methods but also the high-resource scenario. Moreover, theoretical analysis and experimental results confirm that ContAccum provides more stable dual-encoder training than current memory bank utilization methods.

Paper Structure (16 sections, 9 equations, 7 figures, 3 tables)

This paper contains 16 sections, 9 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Illustrations of ContAccum and Comparative Methods. The illustrations show a total batch size ($N_\text{total}$) of 4, a local batch size ($N_\text{local}$) of 2, and a memory bank size ($N_\text{memory}$) of 4. (a) GradCache uses $N_\text{total} - 1$ negative passages. (b) GradAccum uses $N_\text{local} - 1$ negative passages. (c) ContAccum leverages $N_\text{local} + N_\text{memory} - 1$ negative samples, more than $N_\text{total} - 1$.
  • Figure 2: Training process of ContAccum at each accumulation step. The illustration shows a total batch size ($N_\text{total}$) of 4, an accumulation step ($K$) of 2, and a memory bank size ($N_\text{memory}$) of 4. The dual memory bank caches both query and passage representations. New representations are enqueued, and the oldest are dequeued at each step, maintaining the similarity matrix ($S_k$) size at ($N_\text{local} + N_\text{memory}, N_\text{local} + N_\text{memory}$).
  • Figure 3: Analysis of accumulation step and memory bank size. DPR performance in low-resource (BSZ=8) and high-resource (BSZ=128) settings is shown as baselines, along with the performance of gradient accumulation for each total batch size ($N_\text{total}$).
  • Figure 4: Comparison of the speed of one weight update for different methods as the total batch size ($N_\text{total}$) changes.
  • Figure 5: Analysis of GradNormRatio throughout the training process on the NQ dataset.
  • ...and 2 more figures
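
The captions of Figures 1 and 2 can be made concrete with a small sketch. The code below is not the authors' implementation. It is a minimal pure-Python illustration that assumes representations are plain lists of floats, that similarity is a dot product, and that each step scores the local batch together with the cached representations before enqueuing the new ones. The names `negatives_per_query`, `DualMemoryBank`, `similarity_matrix`, and `accumulation_step` are hypothetical, and gradient handling is not modeled.

```python
from collections import deque


def negatives_per_query(n_total, n_local, n_memory):
    """Negative-passage counts per query, following the Figure 1 caption."""
    return {
        "GradCache": n_total - 1,
        "GradAccum": n_local - 1,
        "ContAccum": n_local + n_memory - 1,
    }


class DualMemoryBank:
    """Hypothetical FIFO caches for query and passage representations."""

    def __init__(self, n_memory):
        # deque(maxlen=...) drops the oldest entries when new ones arrive.
        self.queries = deque(maxlen=n_memory)
        self.passages = deque(maxlen=n_memory)

    def enqueue(self, query_reps, passage_reps):
        self.queries.extend(query_reps)
        self.passages.extend(passage_reps)


def similarity_matrix(queries, passages):
    """Dot-product similarity between every query and every passage."""
    return [[sum(a * b for a, b in zip(q, p)) for p in passages] for q in queries]


def accumulation_step(bank, local_q, local_p):
    """Score local plus cached representations, then update the bank.

    Assumption: the order (score first, enqueue second) is illustrative only.
    """
    all_q = list(local_q) + list(bank.queries)
    all_p = list(local_p) + list(bank.passages)
    s_k = similarity_matrix(all_q, all_p)
    bank.enqueue(local_q, local_p)
    return s_k


# Setting from the captions: N_total = 4, N_local = 2, N_memory = 4.
counts = negatives_per_query(n_total=4, n_local=2, n_memory=4)

bank = DualMemoryBank(n_memory=4)
shapes = []
for step in range(4):
    local_q = [[float(step), 1.0], [1.0, float(step)]]  # toy query vectors
    local_p = [[1.0, 0.0], [0.0, 1.0]]                  # toy passage vectors
    s_k = accumulation_step(bank, local_q, local_p)
    shapes.append((len(s_k), len(s_k[0])))
```

In this toy run the similarity matrix grows while the bank fills and then stays at $(N_\text{local} + N_\text{memory}, N_\text{local} + N_\text{memory}) = (6, 6)$, matching the size given in the Figure 2 caption. With the caption's settings, `counts` gives 5 negatives per query for ContAccum against 3 for GradCache and 1 for GradAccum.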