Revisiting Funnel Transformers for Modern LLM Architectures with Comprehensive Ablations in Training and Inference Configurations

DongHyun Choi, Lucas Spangher, Chris Hidey, Peter Grabowski, Ramy Eskander

TL;DR

This work scrutinizes funnel-based compression in modern Gemma2 transformer architectures to quantify accuracy versus efficiency trade-offs. Through systematic ablations on where to pool intermediate representations, whether to pretrain and fine-tune funnel-aware models, and how to recover the full sequence length, the study shows that information bottlenecks can significantly hurt performance, especially in larger models. However, with careful selection of the funneling layer and a robust recovery method such as averaging, latency reductions of up to 44% are achievable while preserving acceptable accuracy. The results provide actionable guidance for deploying funnel-based approaches in large-scale NLP applications and highlight that design choices must balance compute savings against potential performance losses.

Abstract

Transformer-based Large Language Models, which suffer from high computational costs, advance so quickly that techniques proposed to streamline earlier iterations are not guaranteed to benefit more modern models. Building upon the Funnel Transformer proposed by Dai and Le (2020), which progressively compresses intermediate representations, we investigate the impact of funneling in contemporary Gemma2 Transformer architectures. We systematically evaluate various funnel configurations and recovery methods, comparing: (1) standard pretraining to funnel-aware pretraining strategies, (2) the impact of funnel-aware fine-tuning, and (3) the type of sequence recovery operation. Our results demonstrate that funneling creates information bottlenecks that propagate through deeper network layers, particularly in larger models (e.g., Gemma 7B), leading at times to unmanageable performance loss. However, carefully selecting the funneling layer and employing effective recovery strategies can substantially mitigate performance losses, achieving up to a 44% reduction in latency. Our findings highlight key trade-offs between computational efficiency and model accuracy, providing practical guidance for deploying funnel-based approaches in large-scale natural language applications.
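
To make the pooling and recovery steps described above concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a funnel that mean-pools consecutive pairs of token vectors, and a recovery step that repeats each pooled vector to restore the original length and then averages it with the pre-funnel hidden state at the same position. The function names (`funnel_pool`, `recover_by_repeat`, `recover_with_average`) and this reading of "averaging" are illustrative assumptions; the abstract does not define the operation precisely.

```python
from typing import List

Vector = List[float]


def funnel_pool(hidden: List[Vector], window: int = 2) -> List[Vector]:
    """Mean-pool consecutive groups of `window` token vectors.

    The last group may be shorter when the sequence length is not a
    multiple of `window`; it is averaged over the tokens it contains.
    """
    pooled = []
    for start in range(0, len(hidden), window):
        group = hidden[start:start + window]
        dim = len(group[0])
        pooled.append(
            [sum(vec[d] for vec in group) / len(group) for d in range(dim)]
        )
    return pooled


def recover_by_repeat(
    pooled: List[Vector], target_len: int, window: int = 2
) -> List[Vector]:
    """Upsample by repeating each pooled vector `window` times, then trim."""
    repeated = [list(vec) for vec in pooled for _ in range(window)]
    return repeated[:target_len]


def recover_with_average(
    pooled: List[Vector], pre_funnel: List[Vector], window: int = 2
) -> List[Vector]:
    """Assumed 'averaging' recovery: mean of the upsampled states and the
    hidden states saved just before the funnel (hypothetical reading)."""
    upsampled = recover_by_repeat(pooled, len(pre_funnel), window)
    return [
        [(u + p) / 2 for u, p in zip(up_vec, pre_vec)]
        for up_vec, pre_vec in zip(upsampled, pre_funnel)
    ]


# Three tokens with two features each.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled_states = funnel_pool(hidden_states)
recovered_states = recover_with_average(pooled_states, hidden_states)
```

In this toy example the layers between pooling and recovery would operate on two vectors instead of three, which is where the compute savings come from; the recovered sequence has the original length but has lost some per-token detail, which corresponds to the information bottleneck the abstract describes.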

Paper Structure

This paper contains 22 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: A representation of a funnel architecture with seven encoder layers and two funneling operations -- one at the "last" pre-funneling layer, layer 3, and one at the recovery sequence layer, layer 7. (For the purposes of this caption, layers are 1-indexed.)
  • Figure 2: Performance on (a) GLUE benchmark (Average GLUE Score) and (b) WebAnswers ROC AUC as a function of the funnel recovery layer. The x=0 point corresponds to the model without funneling. Solid lines represent models trained with normal pretraining ("Without Funnel Aware Pretraining") and funnel-aware pretraining ("With Funnel Aware Pretraining").
  • Figure 3: Performance of Gemma2 2B (left) and Gemma2 7B (right) on GLUE (top row) and WebAnswers ROC AUC (bottom row), plotted against the successive layers at which a 2-token funnel is applied within each architecture. Solid lines correspond to model performance with different numbers of funnel-aware fine-tuning steps, whereas dotted lines correspond to baselines in which no funneling is applied.
  • Figure 4: Comparison of latency versus performance gains.
  • Figure 5: The effect of different recovery operations on NER performance.