Reclaiming Idle CPU Cycles on Kubernetes: Sparse-Domain Multiplexing for Concurrent MPI-CFD Simulations

Tianfang Xie

Abstract

When MPI-parallel simulations run on shared Kubernetes clusters, conventional CPU scheduling leaves the vast majority of provisioned cycles idle at synchronization barriers. This paper presents a multiplexing framework that reclaims this idle capacity by co-locating multiple simulations on the same cluster. PMPI-based duty-cycle profiling quantifies the per-rank idle fraction; proportional CPU allocation then allows a second simulation to execute concurrently with minimal overhead, yielding a 1.77x throughput gain. A Pareto sweep up to N=5 concurrent simulations shows throughput scaling to 3.74x, with a knee at N=3 offering the best efficiency-cost trade-off. An analytical model with a single fitted parameter predicts these gains to within +/-4%. A dynamic controller automates the full pipeline, from profiling through In-Place Pod Vertical Scaling (KEP-1287) to packing and fairness monitoring, achieving a 3.25x throughput gain for four simulations without manual intervention or pod restarts. To our knowledge, this is the first application of In-Place Pod Vertical Scaling to adjust the CPU allocations of running MPI processes. Experiments on an AWS cluster with OpenFOAM CFD confirm that the results hold under both concentric and standard graph-based (Scotch) mesh partitioning.
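
To make the profiling step concrete, the sketch below shows one way such a PMPI interposer can be written; it is a minimal example, not the authors' tool, and it assumes that rank idle time is dominated by blocking collectives and request completion (only MPI_Allreduce and MPI_Waitall are intercepted here). Compiled into a shared library and preloaded into the unmodified solver (e.g., via LD_PRELOAD), it prints each rank's duty cycle at MPI_Finalize.

```c
/*
 * Minimal PMPI interposition sketch for duty-cycle profiling.
 * Illustrative only: wrapper coverage and output format are assumptions,
 * not the paper's actual tooling.
 */
#include <mpi.h>
#include <stdio.h>

static double wait_time = 0.0;  /* seconds spent blocked inside MPI calls */
static double t_init    = 0.0;  /* wall-clock timestamp taken at MPI_Init */

int MPI_Init(int *argc, char ***argv)
{
    int rc = PMPI_Init(argc, argv);
    t_init = PMPI_Wtime();
    return rc;
}

/* Wrap the blocking calls that dominate synchronization time; a fuller
 * profiler would also cover MPI_Recv, MPI_Wait, MPI_Barrier, etc. */
int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    double t0 = PMPI_Wtime();
    int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    wait_time += PMPI_Wtime() - t0;
    return rc;
}

int MPI_Waitall(int count, MPI_Request requests[], MPI_Status statuses[])
{
    double t0 = PMPI_Wtime();
    int rc = PMPI_Waitall(count, requests, statuses);
    wait_time += PMPI_Wtime() - t0;
    return rc;
}

int MPI_Finalize(void)
{
    double total = PMPI_Wtime() - t_init;
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Duty cycle: fraction of wall time spent computing rather than waiting. */
    printf("rank %d: duty cycle %.1f%%\n",
           rank, 100.0 * (total - wait_time) / total);
    return PMPI_Finalize();
}
```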

Figures (7)

  • Figure 1: Per-rank CPU duty cycle (498 834-cell mesh, 16 ranks, 200 iterations, 12-node cluster, concentric decomposition). Top: absolute time split between computation (solid) and MPI wait (light). Bottom: duty cycle percentage. Sparse ranks (blue, weight 1) average 5.0%, dense ranks (red, weight 15) average 19.4%.
  • Figure 2: (a) C-mesh block topology (6 hex blocks, 498 834 cells). (b) Concentric weight zones for 16-rank decomposition: dense ranks (weight 15) receive near-wall cells, sparse ranks (weight 1) receive far-field cells.
  • Figure 3: Manual concentric decomposition of the NACA 0012 mesh into 16 ranks. Cell centers are colored by processor assignment. Dense ranks (red, 68% of cells) occupy the near-wall region, medium ranks (orange, 23%) the intermediate zone, and sparse ranks (blue, 9%) the far-field.
  • Figure 4: (a) Wall-clock time per configuration (solid: Sim A, hatched: Sim B). (b) Total time to complete two simulations: sequential ($2\times$ single-sim, stacked) vs. concurrent (overlapping). Black line shows throughput gain (right axis). Concurrent execution nearly halves total time: C-2P $1.77\times$, C-2E $1.83\times$.
  • Figure 5: Pareto analysis for $N=1\ldots5$ concurrent proportional simulations. (a) Throughput gain (red line) and scheduling efficiency (blue bars). The dashed line shows ideal linear scaling. The knee point at $N=3$ offers 86% efficiency. (b) Per-simulation degradation (orange bars) and makespan (green line). (These metrics are sketched after this list.)
  • ...and 2 more figures
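
For reference, the throughput figures quoted above are consistent with the following definitions; this is a hedged reading of the captions, and the paper's exact formulas may differ. With $T_1$ the single-simulation wall-clock time and $T_N$ the makespan of $N$ concurrent simulations:

$$\text{gain}(N) = \frac{N\,T_1}{T_N}, \qquad \text{efficiency}(N) = \frac{\text{gain}(N)}{N} = \frac{T_1}{T_N}, \qquad \text{degradation}(N) = \frac{T_N}{T_1} - 1.$$

Under this reading, the reported $N=5$ gain of 3.74x corresponds to roughly 75% scheduling efficiency, and the 86% efficiency at the $N=3$ knee to a gain of about 2.6x.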