Dockerfile Flakiness: Characterization and Repair
Taha Shabani, Noor Nashid, Parsa Alian, Ali Mesbah
TL;DR
The paper investigates Dockerfile flakiness, defined as unpredictable temporal build failures caused by evolving external dependencies, and presents the first longitudinal study of the problem across 8,132 Dockerized projects, finding that 9.81% of them exhibit flaky behavior. It introduces FlakiDock, a retrieval-augmented, LLM-driven repair framework that combines static Dockerfile analysis with dynamic build outputs, using similarity retrieval and an iterative feedback loop to generate repairs. Evaluated against Parfum, Shipwright, and GPT-4 prompting baselines, FlakiDock achieves a repair accuracy of 73.55%, significantly outperforming all of them and showing the value of context-aware, iterative, data-driven repair for flaky Dockerfiles. The work also provides a taxonomy of flakiness causes, a demonstration dataset, and publicly available data and tools, with practical relevance for keeping CI/CD pipelines reliable as their environments evolve.
Abstract
Dockerfile flakiness (unpredictable temporal build failures caused by external dependencies and evolving environments) undermines deployment reliability and increases debugging overhead. Unlike traditional Dockerfile issues, flakiness occurs without modifications to the Dockerfile itself, complicating its resolution. In this work, we present the first comprehensive study of Dockerfile flakiness, featuring a nine-month analysis of 8,132 Dockerized projects, revealing that around 10% exhibit flaky behavior. We propose a taxonomy categorizing common flakiness causes, including dependency errors and server connectivity issues. Existing tools fail to effectively address these challenges due to their reliance on pre-defined rules and limited generalizability. To overcome these limitations, we introduce FLAKIDOCK, a novel repair framework combining static and dynamic analysis, similarity retrieval, and an iterative feedback loop powered by Large Language Models (LLMs). Our evaluation demonstrates that FLAKIDOCK achieves a repair accuracy of 73.55%, significantly surpassing state-of-the-art tools and baselines.
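
The repair idea summarized above (retrieve similar past failures, propose a fix, rebuild, and feed the new build output back in) can be illustrated with a small Python sketch. This is not FLAKIDOCK's implementation. The names RepairExample, retrieve_similar, repair_loop, fake_build, and fake_propose are hypothetical, textual similarity from the standard difflib module stands in for whatever retrieval the framework actually uses, and the Docker build and LLM steps are replaced by stubs so the example runs without Docker or a model.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple


@dataclass
class RepairExample:
    """A past failure and a label for the fix that resolved it (hypothetical schema)."""
    error_log: str
    fix_hint: str


# Hypothetical signatures for the two pluggable steps.
BuildFn = Callable[[str], Tuple[bool, str]]  # Dockerfile text -> (success, build log)
ProposeFn = Callable[[str, str, List[RepairExample], List[str]], str]


def retrieve_similar(error_log: str, examples: List[RepairExample], k: int = 2) -> List[RepairExample]:
    """Return the k stored examples whose error logs are textually closest to error_log."""
    ranked = sorted(
        examples,
        key=lambda ex: SequenceMatcher(None, error_log, ex.error_log).ratio(),
        reverse=True,
    )
    return ranked[:k]


def repair_loop(dockerfile: str, build: BuildFn, propose: ProposeFn,
                examples: List[RepairExample], max_iters: int = 3) -> Optional[str]:
    """Build, and on failure propose a repair and rebuild, at most max_iters times.

    Returns the first Dockerfile that builds, or None if none does.
    """
    history: List[str] = []  # logs of earlier failed attempts, passed back as feedback
    ok, log = build(dockerfile)
    attempts = 0
    while not ok and attempts < max_iters:
        similar = retrieve_similar(log, examples)
        dockerfile = propose(dockerfile, log, similar, history)
        history.append(log)
        ok, log = build(dockerfile)
        attempts += 1
    return dockerfile if ok else None


# --- Stubs standing in for a real Docker build and a real LLM (assumptions) ---

def fake_build(dockerfile: str) -> Tuple[bool, str]:
    """Pretend the build fails unless the package index is refreshed first."""
    if "apt-get update" in dockerfile:
        return True, ""
    return False, "E: Unable to locate package libfoo-dev"


def fake_propose(dockerfile: str, log: str, similar: List[RepairExample],
                 history: List[str]) -> str:
    """Apply the fix suggested by the most similar past failure, if it is one we know."""
    if similar and similar[0].fix_hint == "apt-update":
        return dockerfile.replace("RUN apt-get install", "RUN apt-get update && apt-get install")
    return dockerfile


EXAMPLES = [
    RepairExample("E: Unable to locate package curl", "apt-update"),
    RepairExample("npm ERR! 404 Not Found", "pin-npm"),
]
FLAKY = "FROM debian:stable\nRUN apt-get install -y libfoo-dev\n"

if __name__ == "__main__":
    print(repair_loop(FLAKY, fake_build, fake_propose, EXAMPLES))
```

In a real setting, build would invoke an actual image build and return its log, and propose would prompt a model with the Dockerfile, the failing log, the retrieved examples, and the history of earlier attempts. The loop structure stays the same.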
