TFHE-Coder: Evaluating LLM-agentic Fully Homomorphic Encryption Code Generation

Mayank Kumar, Jiaqi Xue, Mengxin Zheng, Qian Lou

TL;DR

This work addresses the barrier of generating correct TFHE code from natural language by introducing a compiler-in-the-loop framework that iteratively refines LLM outputs using compiler feedback. It compares baseline and agentic workflows, notably incorporating Retrieval-Augmented Generation (RAG) and few-shot prompting to improve API usage and structural fidelity for gate-level TFHE operations and ReLU activation. The study finds that GPT-4o consistently outperforms open-source models, while few-shot prompting substantially enhances correctness, and combining RAG with few-shot prompting yields the strongest results for capable models; ReLU remains the most challenging task. Overall, the paper provides the first benchmark for TFHE-code generation and demonstrates that domain-specific feedback can bridge much of the expertise gap in secure computation code synthesis, with implications for broader adoption of privacy-preserving technologies.
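As a rough illustration of the compiler-in-the-loop idea, the Python sketch below shows one way such a refinement loop could be organized. It is not the authors' implementation: `CompileResult`, `refine_until_compiles`, the `generate` and `compile_code` callables, and the `max_rounds` cap are all hypothetical names introduced here, and the LLM and the TFHE compiler are assumed to be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CompileResult:
    """Hypothetical container for one compiler run."""
    ok: bool      # True when the candidate compiled
    report: str   # compiler diagnostics; empty on success


def refine_until_compiles(
    prompt: str,
    generate: Callable[[str, str], str],           # (prompt, feedback) -> candidate code
    compile_code: Callable[[str], CompileResult],  # candidate code -> compile result
    max_rounds: int = 5,                           # assumed cap; the source gives no limit
) -> Tuple[Optional[str], int]:
    """Return (code, rounds_used) for the first compilable candidate,
    or (None, max_rounds) if no candidate compiles within the cap."""
    feedback = ""  # no compile report exists before the first attempt
    for round_idx in range(1, max_rounds + 1):
        code = generate(prompt, feedback)
        result = compile_code(code)
        if result.ok:
            return code, round_idx
        # Feed the compile report back so the next attempt can revise.
        feedback = result.report
    return None, max_rounds
```

Passing the model and the compiler in as callables keeps the loop independent of any particular LLM API or TFHE toolchain, so the same skeleton could wrap a baseline prompt or one augmented with retrieved documentation and few-shot examples.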

Abstract

Fully Homomorphic Encryption over the torus (TFHE) enables computation on encrypted data without decryption, making it a cornerstone of secure and confidential computing. Despite its potential in privacy-preserving machine learning, secure multi-party computation, private blockchain transactions, and secure medical diagnostics, its adoption remains limited due to cryptographic complexity and usability challenges. While various TFHE libraries and compilers exist, practical code generation remains a hurdle. We propose a compiler-integrated framework to evaluate LLM inference and agentic optimization for TFHE code generation, focusing on logic gates and ReLU activation. Our methodology assesses error rates, compilability, and structural similarity across open- and closed-source LLMs. Results highlight significant limitations in off-the-shelf models, while agentic optimizations such as retrieval-augmented generation (RAG) and few-shot prompting reduce errors and enhance code fidelity. This work establishes the first benchmark for TFHE code generation, demonstrating how LLMs, when augmented with domain-specific feedback, can bridge the expertise gap in FHE code generation.

Paper Structure

This paper contains 14 sections, 1 equation, 7 figures.

Figures (7)

  • Figure 1: Overview of the compiler-in-the-loop evaluator. The LLM generates TFHE code based on a user prompt, which is then compiled. If compilation fails, the model receives a compile report and revises its output iteratively until a compilable solution is produced.
  • Figure 2: Agentic-optimized evaluator loop incorporating RAG and few-shot prompting. The LLM receives both retrieval-augmented documentation and few-shot examples alongside the user prompt, refining its output iteratively based on compile reports until a compilable solution is achieved.
  • Figure 3: Baseline performance comparison of all models across four tasks (NOT, AND, OR, ReLU) using: (a) CrystalBLEU, (b) Pass@k (comp), and (c) Pass@k (func). Higher values indicate better alignment with reference implementations, with GPT-4o consistently outperforming other models across all tasks.
  • Figure 4: Impact of Baseline, RAG, Few-shot Prompting, and their combination (RAG+Prompt) on (a) Wrong Format and (b) Repetition Error.
  • Figure 5: Comparison of all models across four tasks with RAG, measured by: (a) CrystalBLEU, (b) Pass@k (comp), and (c) Pass@k (func). While minor improvements in CrystalBLEU are observed for some models, overall functional correctness remains low, with GPT-4o maintaining the highest performance across all tasks.
  • ...and 2 more figures
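
The captions above report Pass@k for compilation and for functional correctness, but this listing does not define the metric. Assuming it follows the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k) for n samples of which c pass, a minimal Python sketch would be the following. The function name `pass_at_k` and its parameters are illustrative, and the paper's exact formulation may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes,
    given that c of n generated samples passed.

    Assumed formula (not confirmed by the source): 1 - C(n-c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this assumption, "pass" would mean "compiles" for Pass@k (comp) and "produces the expected result" for Pass@k (func); only the counting of c changes between the two.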