CUCo: An Agentic Framework for Compute and Communication Co-design

Bodun Hu; Yoga Sri Varshan; Saurabh Agarwal; Aditya Akella

TL;DR

CUCo is a training-free, agent-driven workflow that automatically generates high-performance CUDA kernels that jointly orchestrate computation and communication, unlocking optimization opportunities unavailable to existing approaches.

Abstract

Custom CUDA kernel development is essential for maximizing GPU utilization in large-scale distributed LLM training and inference, yet manually writing kernels that jointly leverage computation and communication remains labor-intensive and error-prone. Prior work on kernel optimization has focused almost exclusively on computation, leaving communication kernels largely untouched even though they account for a significant share of total execution time. We introduce CUCo, a training-free, agent-driven workflow that automatically generates high-performance CUDA kernels that jointly orchestrate computation and communication. By co-optimizing these traditionally disjoint components, CUCo unlocks optimization opportunities unavailable to existing approaches, outperforming state-of-the-art baselines and reducing end-to-end latency by up to $1.57\times$.

Paper Structure (17 sections, 1 equation, 10 figures, 4 tables, 2 algorithms)

Introduction
Related Work
CUDA Kernel Fusion
Device-initiated Communication
Agents for Kernel Generation
CULink
Design Space Specification
Fast-Path Agent
Slow-Path Agent
Evaluation
CUCo's End-to-End Evaluation
Case Study: Flash Attention with Context Parallelism
Ablation Studies
Conclusion
Static Analysis Example
...and 2 more sections

Figures (10)

Figure 1: NCCL device-initiated API: The code above shows how to implement an All-to-All CUDA kernel using the GPU-Initiated Networking (GIN) and Load/Store Accessible (LSA) API.
Figure 2: Overall workflow of CUCo
Figure 3: Flash Attention with Context Parallelism with varying sequence length and attention head dimension.
Figure 4: DeepSeek-V3 MoE layer across inter-node RoCE links with expert skewness.
Figure 5: Intra-node KV cache transfer latency across varying sequence lengths and KV dimensions.
...and 5 more figures
