TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition

Prabhu Vellaisamy, Shreesh Tripathi, Vignesh Natarajan, Surya Santhan Thenarasu, Shawn Blanton, John P. Shen

Abstract

Large Language Model (LLM) inference is widely used in interactive assistants and agentic systems. In latency-sensitive deployments, inference time can become dominated by host-side overheads. Existing approaches typically expose this cost only as an aggregate residual or a launch/queue metric, which is often insufficient to identify which execution layer should be optimized. This work presents TaxBreak, a trace-driven methodology for decomposing host-visible orchestration overhead into three components: framework translation time, CUDA library translation time, and kernel launch-path time. We validate TaxBreak on NVIDIA H100 and H200 systems and use it to derive our proposed Host-Device Balance Index (HDBI), a boundedness summary index that relates device-active execution to host-visible orchestration. Across representative dense and mixture-of-experts workloads in both prefill and decode, we show that aggregate latency, GPU inactivity, or boundedness ratios alone can obscure the dominant optimization target. TaxBreak instead distinguishes cases where optimization should reduce software-stack overhead from cases where the primary win comes from reducing device-side work. We further show that MoE models dispatch 8-11x more kernels per output token than dense models, and that for such host-bound workloads, CPU single-thread performance is a first-order parameter: a faster host CPU reduces orchestration overhead by 10-29% and improves end-to-end latency by up to 14%, even when paired with a slower-clocked GPU. These results position TaxBreak as a diagnostic tool for assessing whether optimization effort should target the software stack or the device-side workload execution.

Paper Structure (21 sections, 9 equations, 11 figures, 4 tables)

This paper contains 21 sections, 9 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: TaxBreak Methodology. TaxBreak decomposes overall host-side orchestration overhead into three components: (i) framework execution, (ii) CUDA library front-end execution, and (iii) kernel launch invocation. We also introduce a new Host-Device Balance Index (HDBI) to characterize the relative boundedness between the host (CPU) and the device (GPU). Prior work A refers to the aggregate framework tax [fernandez2023framework]; B refers to the kernel launch/queue tax (TKLQT) [vellaisamy2025characterizing].
  • Figure 2: Previous characterizations of GPT-2 inference across batch sizes. Left: end-to-end latency (ms) shows a transition from framework-bound to compute-bound at small batch sizes [fernandez2023framework]. Right: TKLQT in $\mu$s highlights a similar CPU-bound to GPU-bound transition as kernel queuing increases with batch size and utilization [vellaisamy2025characterizing].
  • Figure 3: Dispatch chains for library-mediated and framework-native kernels. For library-mediated kernels, the cuBLAS front-end contributes $\Delta{CT}$ when $\mathbb{I}_{lib}=1$.
  • Figure 4: Annotated NVTX ranges around an operator dispatch and kernel execution.
  • Figure 5: End-to-end latency across dense and MoE LLM workloads. Heatmaps show end-to-end latency across batch sizes and input sequence lengths during prefill ($m=1$) and decode ($m=10$) on H100/H200 systems, where $m$ denotes the number of generated tokens. The decode heatmaps report total latency aggregated over a 10-token decode window. OLMoE-1B/7B does not support SL=8192 context length.
  • ...and 6 more figures
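
The three-way decomposition named in the abstract and in the Figure 1 and Figure 3 captions can be illustrated with a minimal Python sketch. It assumes the three components simply add per kernel and that the library front-end term counts only for library-mediated kernels (the indicator $\mathbb{I}_{lib}=1$). The additive form, the `KernelRecord` type, the `decompose` helper, and all timing values are hypothetical illustrations, not the paper's definitions or measurements.

```python
from dataclasses import dataclass


@dataclass
class KernelRecord:
    """One dispatched kernel with hypothetical host-side timings in microseconds."""
    name: str
    framework_us: float      # framework translation time
    library_us: float        # CUDA library front-end time
    launch_us: float         # kernel launch-path time
    library_mediated: bool   # plays the role of the indicator I_lib


def decompose(records):
    """Sum each overhead component across kernels (assumed additive)."""
    totals = {"framework": 0.0, "library": 0.0, "launch": 0.0}
    for r in records:
        totals["framework"] += r.framework_us
        # The library term is counted only when the kernel goes through a library.
        if r.library_mediated:
            totals["library"] += r.library_us
        totals["launch"] += r.launch_us
    totals["total"] = totals["framework"] + totals["library"] + totals["launch"]
    return totals


# Made-up example values, chosen only to exercise both dispatch chains.
records = [
    KernelRecord("gemm", 12.0, 5.0, 4.0, library_mediated=True),
    KernelRecord("elementwise_add", 8.0, 0.0, 4.0, library_mediated=False),
]
breakdown = decompose(records)
print(breakdown)
```

Keeping the components in separate buckets, rather than reporting only the total, is the point of the sketch: the largest bucket suggests which layer of the stack to inspect first.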