CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

Ahmed Heakl, Sarim Hashmi, Gustavo Bertolo Stahl, Seung Hun Eddie Han, Salman Khan, Abdulrahman Mahmoud

TL;DR

CASS tackles the GPU portability problem by delivering a large-scale dataset and model suite for cross-architecture translation between Nvidia CUDA and AMD HIP, plus low-level Nvidia SASS ↔ AMD RDNA3 assembly mappings. The approach builds a fully open data-to-model pipeline, including 70,694 aligned CUDA ↔ HIP source samples and compiled SASS ↔ RDNA3 assemblies, and introduces CASS-Bench for execution-verified evaluation across 16 GPU domains. The CASS-Instruct models demonstrate state-of-the-art performance, achieving up to 95% source-translation accuracy and 37.5% assembly-translation accuracy, with over 85% of translated assemblies preserving runtime and memory behavior relative to native code. This work enables rigorous, open research into cross-vendor GPU tooling and paves the way for practical, performance-preserving hardware translation and interoperability.

Abstract

We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA ↔ HIP) and assembly-level (Nvidia SASS ↔ AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and HIPIFY. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation.

Paper Structure

This paper contains 34 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: CASS Pipeline: We collect CUDA code from public repositories and synthesize additional samples via prompt-based LLM generation. After filtering and deduplication, all CUDA files are translated to HIP using HIPIFY, then compiled to extract host and device assembly. Matched outputs form the CASS dataset with aligned source and assembly pairs across Nvidia and AMD stacks.
  • Figure 2: The Nvidia (left) and AMD (right) stacks illustrate the compilation process for CUDA and HIP. Blue denotes device-side steps; green denotes host-side steps. Nvidia’s stack is opaque; accessing device assembly (SASS) requires first compiling to binary, then using cuobjdump. In contrast, AMD’s process is transparent, allowing direct inspection and modification of device assembly (RDNA3) before host integration.
  • Figure 3: CASS coverage across dataset and benchmark: (left) domain distribution of training samples; (right) category distribution in CASS-Bench.
  • Figure 4: Comparison of structural and syntactic patterns in CASS: (a) verbosity across subsets and backends; (b) syntactic similarity of translated code.
  • Figure 5: Source and assembly-level accuracy across categories.
  • ...and 6 more figures
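
The per-file steps described in the captions for Figures 1 and 2 (translate CUDA to HIP, compile to a binary and dump SASS on the Nvidia side, emit device assembly directly on the AMD side) can be sketched as a set of command lines. This is a minimal illustration and not the authors' pipeline: the helper name `build_pipeline_commands`, the file-naming scheme, the choice of `hipify-perl` as the HIPIFY variant, and the default targets `sm_80` and `gfx1100` (an RDNA3 part) are all assumptions. The sketch only builds the argument lists; it does not run any tool.

```python
from pathlib import Path


def build_pipeline_commands(cuda_file, sm_arch="sm_80", amd_arch="gfx1100"):
    """Return illustrative command lines for one CUDA file (hypothetical helper).

    Nothing is executed here; the caller decides how to run each command.
    Architectures and output file names are assumptions, not values from the paper.
    """
    src = Path(cuda_file)
    hip_file = src.with_suffix(".hip")   # assumed name for the HIPIFY output
    cubin = src.with_suffix(".cubin")    # Nvidia device binary
    rdna_asm = src.with_suffix(".s")     # AMD device assembly

    return {
        # hipify-perl prints the translated HIP source to stdout;
        # the caller would redirect it into hip_file.
        "hipify": ["hipify-perl", str(src)],
        # Nvidia side, step 1: compile the device code to a binary (cubin).
        "nvcc_cubin": ["nvcc", f"-arch={sm_arch}", "-cubin", "-o", str(cubin), str(src)],
        # Nvidia side, step 2: SASS is only visible by disassembling that binary.
        "dump_sass": ["cuobjdump", "-sass", str(cubin)],
        # AMD side: device assembly can be emitted directly, before host integration.
        "hip_device_asm": [
            "hipcc", f"--offload-arch={amd_arch}", "--cuda-device-only",
            "-S", "-o", str(rdna_asm), str(hip_file),
        ],
    }


if __name__ == "__main__":
    for step, cmd in build_pipeline_commands("vec_add.cu").items():
        print(f"{step}: {' '.join(cmd)}")
```

The asymmetry noted in Figure 2 shows up in the command count: the Nvidia path needs two steps (build a binary, then disassemble it), while the AMD path reaches device assembly in one.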