Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks

Bo-Ru Lu, Nikita Haduong, Chien-Yu Lin, Hao Cheng, Noah A. Smith, Mari Ostendorf

Abstract

Transformer-based NLP models are powerful but have high computational costs that limit deployment. Finetuned encoder-decoder models are popular in specialized domains and can outperform larger, more general decoder-only models such as GPT-4. We introduce a new configuration for encoder-decoder models that improves efficiency on structured-output and decomposable tasks, where multiple outputs are required for a single shared input. Our method, prompt-in-decoder (PiD), encodes the input once and decodes the outputs in parallel. This boosts both training and inference efficiency in two ways: it avoids duplicate input encoding, and it increases the operational intensity (the ratio of arithmetic operations to memory accesses) of the decoding process by sharing the input key-value cache. We achieve a computation reduction that scales roughly with the number of subtasks, gaining up to a 4.6x speed-up over state-of-the-art models for dialogue state tracking, summarization, and question-answering tasks, with comparable or better performance.

Paper Structure (47 sections, 9 equations, 2 figures, 11 tables)

This paper contains 47 sections, 9 equations, 2 figures, 11 tables.

Figures (2)

  • Figure 1: Given a task where a single input document $\mathcal{X}$ is used to generate multiple outputs $Y_u$ associated with different prompts $Z_u$, PiE creates a unique encoding $\mathbf{M}_u$ for every prompt $Z_u$. In contrast, PiD uses a single shared $\mathbf{M}$ for all prompts, thus requiring fewer memory accesses and resulting in higher computational efficiency.
  • Figure 2: An illustration of cross-attention dot product operations ($\mathbf{Q} \mathbf{K}^\top$ in Eq. \ref{eq:attn}) for PiE and PiD for a single inference step. $U$, $d$, and $n_s$ are the number of prompts, hidden layer dimension, and input length, respectively. $\odot$ is the dot product operation, and the resulting scalars of $\mathbf{Q} \mathbf{K}^\top$ are $\alpha^u_\tau$, where $\tau \in \{1,\ldots, n_s\}$, at the decoding step $\tau$ w.r.t. the prompt $Z_u$.
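
The contrast drawn in Figures 1 and 2 can be made concrete with a rough counting sketch. The Python below is an illustrative model only, not the authors' analysis: it counts arithmetic operations and elements moved for the cross-attention product $\mathbf{Q} \mathbf{K}^\top$ at a single decoding step, assuming one layer, one attention head, no hardware caching, and two operations per multiply-add. The names `CrossAttnCost` and `qk_cost`, and the sizes used (30 prompts, $d = 64$, $n_s = 512$), are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CrossAttnCost:
    """Simplified cost of Q K^T for one decoding step (assumed model)."""
    flops: int           # each multiply-add is counted as 2 operations
    elements_moved: int  # elements read plus elements written

    @property
    def intensity(self) -> float:
        # Operational intensity: arithmetic operations per element moved.
        return self.flops / self.elements_moved


def qk_cost(num_prompts: int, hidden_dim: int, input_len: int,
            shared_keys: bool) -> CrossAttnCost:
    """Count work and memory traffic for U queries against n_s keys.

    shared_keys=False models one key matrix per prompt (PiE-style);
    shared_keys=True models a single key matrix for all prompts (PiD-style).
    """
    U, d, n_s = num_prompts, hidden_dim, input_len
    flops = 2 * U * n_s * d             # U * n_s dot products of length d
    key_copies = 1 if shared_keys else U
    key_reads = key_copies * n_s * d    # key matrices read from memory
    query_reads = U * d                 # one query vector per prompt
    score_writes = U * n_s              # one score per (prompt, position)
    return CrossAttnCost(flops, key_reads + query_reads + score_writes)


pie = qk_cost(num_prompts=30, hidden_dim=64, input_len=512, shared_keys=False)
pid = qk_cost(num_prompts=30, hidden_dim=64, input_len=512, shared_keys=True)

print(f"PiE-style: {pie.flops} ops, {pie.elements_moved} elements moved")
print(f"PiD-style: {pid.flops} ops, {pid.elements_moved} elements moved")
print(f"intensity ratio: {pid.intensity / pie.intensity:.2f}")
```

Under these assumptions both configurations perform the same arithmetic, but sharing the keys removes the $U$-fold key reads, so the ratio of operations to memory traffic rises as the number of prompts grows.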