TRINE: A Token-Aware, Runtime-Adaptive FPGA Inference Engine for Multimodal AI

Hyunwoo Oh, Hanning Chen, Sanggeon Yun, Yang Ni, Suyeon Jang, Behnam Khaleghi, Fei Wen, Mohsen Imani

Abstract

Multimodal stacks that mix ViTs, CNNs, GNNs, and transformer NLP strain embedded platforms because their compute/memory patterns diverge and hard real-time targets leave little slack. TRINE is a single-bitstream FPGA accelerator and compiler that executes end-to-end multimodal inference without reconfiguration. Layers are unified as DDMM/SDDMM/SpMM and mapped to a mode-switchable engine that toggles at runtime among weight/output-stationary systolic, $1{\times}C_S$ SIMD, and a routable adder tree (RADT) on a shared PE array. A width-matched, two-stage top-$k$ unit enables in-stream token pruning, while dependency-aware layer offloading (DALO) overlaps independent kernels across reconfigurable processing units to sustain utilization. Evaluated on Alveo U50 and ZCU104, TRINE reduces latency by up to 22.57x vs. RTX 4090 and 6.86x vs. Jetson Orin Nano at 20-21 W; token pruning alone yields up to 7.8x on ViT-heavy pipelines, and DALO contributes up to 79% throughput improvement. With int8 quantization, accuracy drops stay below 2.5% across representative tasks, delivering state-of-the-art latency and energy efficiency for unified vision, language, and graph workloads, all in one bitstream.
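
As a point of reference for the three kernel classes named in the abstract, the sketch below gives their usual textbook semantics in plain Python. It is not TRINE's implementation: the function names, the list-of-lists matrices, and the dictionary encoding of sparse operands are assumptions made for illustration only.

```python
def ddmm(a, b):
    """Dense x dense: every output element is computed."""
    k, m = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(k)) for j in range(m)] for row in a]


def sddmm(a, b, mask):
    """Sampled dense-dense: compute (a @ b)[i][j] only at the (i, j) pairs in mask."""
    k = len(b)
    return {(i, j): sum(a[i][t] * b[t][j] for t in range(k)) for (i, j) in mask}


def spmm(s, b, n_rows):
    """Sparse x dense: s maps (row, col) -> value; the result is a dense matrix."""
    m = len(b[0])
    out = [[0] * m for _ in range(n_rows)]
    for (i, t), v in s.items():
        for j in range(m):
            out[i][j] += v * b[t][j]
    return out


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
dense = ddmm(a, b)                      # all four outputs
sampled = sddmm(a, b, {(0, 1), (1, 0)}) # only the two masked outputs
mixed = spmm(sampled, b, n_rows=2)      # sparse result reused as a left operand
```

The chaining at the end mirrors a common pattern (for example, sparse attention scores multiplied by dense values), where an SDDMM output becomes the sparse input of an SpMM.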

Paper Structure (25 sections, 7 figures, 5 tables)

Figures (7)

  • Figure 1: TRINE overview and mode-switchable engine (MSE). (a) Accelerator with an RPU grid and local inter-RPU buffers to localize traffic and pipeline inter-tile exchange. (b) Each RPU integrates a shared-datapath MSE, a width-matched two-stage top-$k$, compact nonlinear units, and a lightweight feed scheduler. (c) One PE array time-shares four dataflows via small interconnect muxes and per-PE op control: (1) Systolic (WS/OS) for dense DDMM; (2) $1{\times}C_S$ SIMD for moderately sparse SDDMM/SpMM; (3) RADT for highly sparse/irregular reductions; (4) normal SIMD for element-wise ops. Selection policy (runtime): DDMM $\rightarrow$ WS/OS (WS for small-token/high weight reuse; OS for wide feature maps/many tokens). SDDMM/SpMM $\rightarrow 1{\times}C_S$ when active operands per row/col are $\lesssim C_S$ and fairly uniform; switch to RADT as sparsity grows or degree skews. Feed scheduling provides systolic delay insertion and sparsity-aware indexed reads without host-side packing.
  • Figure 2: Two-stage, width-matched top-$k$. A fixed-width (up to $C_S$) pipelined bitonic stage matches array width; a lightweight $C_S{\rightarrow}k$ merge completes selection. Values/indices stream from MSE into a center buffer (CB) and sparse queue buffer (SQB), avoiding off-chip detours and scaling better than single large bitonic networks.
  • Figure 3: Feed scheduler. (a) Pipelined delay insertion aligns rows/columns for WS/OS systolic waves with no host-side data reshaping. (b) Sparsity-aware indexed reads via a small BRAM-backed sparse queue (SQB) and address generator feed only active pairs to $1{\times}C_S$ SIMD and RADT, eliminating bubbles and DRAM detours after top-$k$ pruning.
  • Figure 4: TRINE compile–run flow. The compiler parses a structured model, classifies layers (predictable vs. fuzzy), maps them to DDMM/SDDMM/SpMM, selects MSE modes via a sparsity/shape policy, and emits compact instruction blocks plus a dependency DAG. At runtime, an APU-backed controller fills fuzzy templates (e.g., token counts), configures top-$k$, and schedules ready blocks across RPUs using dependency-aware layer offloading (DALO). Pruning indices flow forward to shrink subsequent kernels.
  • Figure 5: End-to-end latency with/without DALO; labels refer to \ref{tab:model_hw_eval}. Token pruning disabled to isolate scheduling effects.
  • ...and 2 more figures
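
The two-stage selection described in the Figure 2 caption can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the hardware design: `width` stands in for $C_S$ and must be a power of two here, the function names are hypothetical, and the $C_S{\rightarrow}k$ merge is approximated with `heapq.merge` from the Python standard library.

```python
import heapq
from itertools import islice

PAD = (float("-inf"), -1)  # filler for a partially filled last chunk


def bitonic_sort_desc(items):
    """Sort a power-of-two-length list in descending order with a bitonic network."""
    x = list(items)
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:
        j = k // 2
        while j > 0:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    descending = (i & k) == 0
                    # Swap when the pair is out of order for this block's direction.
                    if x[i] != x[partner] and (x[i] < x[partner]) == descending:
                        x[i], x[partner] = x[partner], x[i]
            j //= 2
        k *= 2
    return x


def two_stage_topk(scores, k, width=4):
    """Return the k largest (score, index) pairs, processing `width` scores at a time."""
    best = []
    for start in range(0, len(scores), width):
        chunk = [(s, start + off) for off, s in enumerate(scores[start:start + width])]
        chunk += [PAD] * (width - len(chunk))
        ranked = bitonic_sort_desc(chunk)  # stage 1: fixed-width sort
        # Stage 2: merge the sorted chunk into the running top-k and truncate.
        best = list(islice(heapq.merge(best, ranked, reverse=True), k))
    return [(v, i) for v, i in best if i >= 0]


scores = [0.1, 0.9, 0.3, 0.7, 0.5, 0.8]
top3 = two_stage_topk(scores, k=3, width=4)
```

Keeping the sorting network at a fixed width means its size does not grow with the number of tokens; only the small running list of `k` candidates carries state between chunks, which is the property the caption contrasts with a single large bitonic network.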