Making 'syscall' a Privilege not a Right

Fangfei Yang, Anjo Vahldiek-Oberwagner, Chia-Che Tsai, Kelly Kaoudis, Nathan Dautenhahn

TL;DR

This paper tackles the challenge of securely intercepting syscalls in in-process sandboxes, LibOSes, and emulators where traditional methods like ptrace and seccomp-bpf either incur high overhead or offer limited policy flexibility. It introduces nexpoline, a trampoline-based mechanism that partitions the process into a trusted interceptor and an untrusted application, using Memory Protection Keys (MPK) along with Syscall User Dispatch (SUD) or Seccomp-bpf to make syscall instructions a privileged operation within user space. The authors provide a detailed design and implementation that requires no kernel modifications, demonstrate secure signal handling, and show favorable interception overheads in microbenchmarks and realistic benchmarks with nginx and Redis compared to ptrace-based approaches and existing sandboxes. The work offers a practical path to high-assurance syscall control with flexible policies, enabling more capable sandboxes and LibOSes while maintaining strong performance. Overall, nexpoline advances secure, efficient, and policy-rich syscall interception for contemporary Linux systems.

Abstract

Browsers, Library OSes, and system emulators rely on sandboxes and in-process isolation to emulate system resources and securely isolate untrusted components. All access to system resources, such as through system calls (syscalls), needs to be securely mediated by the application. Otherwise, system calls may allow untrusted components to evade the emulator or sandbox monitor and hence escape and attack the entire application or system. Existing approaches such as ptrace require additional context switches between kernel and userspace, which introduce high performance overhead. Seccomp-bpf, in turn, supports only limited policies, which restricts its functionality unless it falls back on ptrace for assistance. In this paper, we present nexpoline, a secure syscall interception mechanism combining Memory Protection Keys (MPK) and Seccomp or Syscall User Dispatch (SUD). Our approach transforms an application's syscall instruction into a privilege reserved for the trusted monitor within the address space, allowing flexible user-defined policies. To execute a syscall, the application must switch contexts via nexpoline. It offers better efficiency than secure interception techniques like ptrace, as nexpoline can securely intercept syscalls through binary rewriting. Consequently, nexpoline ensures safety, flexibility, and efficiency for syscall interception. Notably, it operates without kernel modifications, making it viable on current Linux systems without needing root privileges. Our benchmarks demonstrate improved performance over ptrace in interception overhead while achieving the same security guarantees. Compared to the similarly performing firejail, nexpoline supports more complex policies and makes it possible to emulate system resources.
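To make the idea of a "flexible user-defined policy" more concrete, the following is a minimal conceptual sketch in Python. It is not nexpoline's implementation and involves no MPK, SUD, or Seccomp. It only models a trusted monitor that receives every syscall request and decides whether to forward, deny, or emulate it. All names here (Action, Syscall, Monitor, example_policy) are hypothetical and chosen for illustration. The path check on openat shows the kind of decision a seccomp-bpf filter cannot make by itself, since such filters cannot dereference pointer arguments.

```python
import errno
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Tuple


class Action(Enum):
    ALLOW = "allow"      # forward the call to the (modelled) kernel
    DENY = "deny"        # refuse and return a negative errno
    EMULATE = "emulate"  # answer from the monitor without entering the kernel


@dataclass(frozen=True)
class Syscall:
    name: str
    args: Tuple = ()


# A policy maps a syscall request to an action plus an optional return value.
Policy = Callable[[Syscall], Tuple[Action, Optional[int]]]


def example_policy(call: Syscall) -> Tuple[Action, Optional[int]]:
    """Hypothetical policy: inspects argument contents, not just numbers."""
    if call.name == "openat":
        path = call.args[1]
        if path.startswith("/etc/"):
            return Action.DENY, -errno.EACCES
        return Action.ALLOW, None
    if call.name == "getpid":
        # Emulated resource: the application always sees PID 1.
        return Action.EMULATE, 1
    if call.name in {"read", "write", "close"}:
        return Action.ALLOW, None
    return Action.DENY, -errno.EPERM


class Monitor:
    """Models the trusted side: the only code allowed to reach the kernel."""

    def __init__(self, policy: Policy, kernel: Callable[[Syscall], int]):
        self.policy = policy
        self.kernel = kernel

    def intercept(self, call: Syscall) -> int:
        action, value = self.policy(call)
        if action is Action.ALLOW:
            return self.kernel(call)
        return value
```

In this model the untrusted application never calls the kernel function directly; it can only call Monitor.intercept, which mirrors, at a very high level, the requirement that the application switch contexts before a syscall can execute.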

Paper Structure (29 sections, 6 figures, 3 tables)

This paper contains 29 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Secure Syscall Interception Mechanism
  • Figure 2: Nexpoline Overview
  • Figure 3: User Stack Manipulation for Signal Re-Delivery during interception
  • Figure 4: Cycles on signal register, deliver and return
  • Figure 5: NGINX Benchmark [max. std. $<1.34\%$]
  • ...and 1 more figure