Rethinking the adaptive relationship between Encoder Layers and Decoder Layers

Yubo Song

TL;DR

This study probes the adaptive interaction between the Encoder and Decoder of a transformer-based German-to-English model (Helsinki-NLP/opus-mt-de-en) by inserting a bias-free fully connected layer between them. It compares two weight initialization schemes (original-connection and Granularity Context Attention, GCA) under both fine-tuning and retraining on the wmt16/de-en dataset. Fine-tuning with the modified structure can degrade performance because the modified structure is misaligned with the pretrained weights, whereas retraining shows substantial potential and indicates that encoder layers positively inform decoder layers. The work provides reproducible code and actionable insights for architecture adaptation in neural machine translation and related NLP tasks.

Abstract

This article explores the adaptive relationship between Encoder Layers and Decoder Layers using the SOTA model Helsinki-NLP/opus-mt-de-en, which translates German to English. The method introduces a bias-free fully connected layer between the Encoder and Decoder, initializes that layer's weights in different ways, and compares the outcomes of fine-tuning with those of retraining. Four experiments were conducted in total. The results suggest that directly modifying the structure of the pre-trained model and then fine-tuning it yields suboptimal performance. The retraining experiments, however, indicate that this structural adjustment has significant potential.
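To make the idea of a bias-free fully connected layer between the Encoder and Decoder more concrete, the following is a minimal, framework-free sketch. It rests on assumptions the abstract does not confirm: that the layer holds one weight per (decoder layer, encoder layer) pair, and that the "original connection" initialization places all weight on the final encoder layer so the unmodified wiring is reproduced. The function names `original_connection_weights` and `mix_encoder_layers`, and the toy sizes (6 layers, 4 hidden values), are hypothetical and chosen only for illustration.

```python
# Sketch only, not the author's implementation.
# Assumption: the inserted layer is a weight matrix W with one entry per
# (decoder layer, encoder layer) pair and no bias term.


def original_connection_weights(num_decoder_layers, num_encoder_layers):
    """Build weights that reproduce the unmodified encoder-decoder wiring.

    Assumption: in the original model every decoder layer reads only the
    output of the last encoder layer, so each row is one-hot on that layer.
    """
    weights = []
    for _ in range(num_decoder_layers):
        row = [0.0] * num_encoder_layers
        row[-1] = 1.0  # all weight on the final encoder layer
        weights.append(row)
    return weights


def mix_encoder_layers(weights, encoder_outputs):
    """Bias-free mix: decoder_inputs[d][i] = sum_e W[d][e] * encoder_outputs[e][i].

    encoder_outputs[e] is a flat list of floats standing in for the hidden
    states produced by encoder layer e.
    """
    hidden_size = len(encoder_outputs[0])
    return [
        [
            sum(w * layer[i] for w, layer in zip(row, encoder_outputs))
            for i in range(hidden_size)
        ]
        for row in weights
    ]


# Toy stand-in: 6 encoder layers, each emitting 4 identical values (1.0 .. 6.0).
encoder_outputs = [[float(layer + 1)] * 4 for layer in range(6)]
weights = original_connection_weights(6, 6)
decoder_inputs = mix_encoder_layers(weights, encoder_outputs)
```

Under this initialization every decoder layer receives exactly the last encoder layer's output, so the modified model starts out equivalent to the original one. Any other initialization changes what each decoder layer sees from the first training step.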

Paper Structure

This paper contains 9 sections, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Weight Initialization Methods
  • Figure 2: Initialized to the weight of the original connection method
  • Figure 3: Initialized to the weight of GCA method
  • Figure 4: Training Loss of different experiments