CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark
Ahmed Heakl, Sarim Hashmi, Gustavo Bertolo Stahl, Seung Hun Eddie Han, Salman Khan, Abdulrahman Mahmoud
TL;DR
CASS tackles the GPU portability problem by delivering a large-scale dataset and model suite for cross-architecture translation between Nvidia CUDA and AMD HIP, plus low-level Nvidia SASS↔AMD RDNA3 assembly mappings. The approach builds a fully open data-to-model pipeline, including 70,694 aligned CUDA↔HIP source samples and compiled SASS↔RDNA3 assemblies, and introduces CASS-Bench for execution-verified evaluation across 16 GPU domains. The CASS-Instruct models demonstrate state-of-the-art performance, achieving up to 95% source-translation accuracy and 37.5% assembly-translation accuracy, with over 85% of translated assemblies preserving runtime and memory behavior relative to native code. This work enables rigorous, open research into cross-vendor GPU tooling and paves the way for practical, performance-preserving hardware translation and interoperability.
Abstract
We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA ↔ HIP) and assembly-level (Nvidia SASS ↔ AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific language models, achieving 95% source translation accuracy and 37.5% assembly translation accuracy, substantially outperforming commercial baselines such as GPT-4o, Claude, and Hipify. Our generated code matches native performance in over 85% of test cases, preserving runtime and memory behavior. To support rigorous evaluation, we introduce CASS-Bench, a curated benchmark spanning 16 GPU domains with ground-truth execution. All data, models, and evaluation tools are released as open source to foster progress in GPU compiler tooling, binary compatibility, and LLM-guided hardware translation.
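To make the source-level CUDA ↔ HIP task concrete, the sketch below shows the kind of rule-based identifier renaming that a Hipify-style baseline relies on. It is not the CASS method and not Hipify's actual implementation: the function name `naive_cuda_to_hip` and the small mapping table are hypothetical and cover only a handful of well-known runtime calls. Real translation also has to handle libraries, build flags, and architecture-specific behavior, which is where learned models are expected to help.

```python
import re

# Illustrative subset of CUDA -> HIP runtime name mappings (assumption: far from exhaustive).
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Match whole identifiers only, trying longer names first.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(name) for name in sorted(CUDA_TO_HIP, key=len, reverse=True))
    + r")\b"
)


def naive_cuda_to_hip(source: str) -> str:
    """Rename known CUDA runtime identifiers to their HIP counterparts.

    Kernel bodies and the <<<grid, block>>> launch syntax are left untouched,
    since HIP accepts them as written.
    """
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)


cuda_src = (
    "#include <cuda_runtime.h>\n"
    "cudaMalloc(&d, n);\n"
    "cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);\n"
    "k<<<1, 32>>>(d);\n"
    "cudaFree(d);\n"
)
hip_src = naive_cuda_to_hip(cuda_src)
print(hip_src)
```

A purely lexical pass like this has no counterpart at the assembly level, where SASS and RDNA3 differ in instruction set and register model rather than in naming.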
