Finding the Reflection Point: Unpadding Images to Remove Data Augmentation Artifacts in Large Open Source Image Datasets for Machine Learning
Lucas Choi, Ross Greer
TL;DR
This work tackles reflective padding artifacts, which symmetric image padding introduces into large datasets and which can distort evaluation when the data is repurposed across tasks. It proposes an image unpadding algorithm that locates the reflection boundary by minimizing the mean squared error between a top crop and its mirrored region, with a thresholding step to decide whether padding is present and a fixed offset to avoid border artifacts. On the SHEL5k dataset, unpadding substantially improves zero-shot detection performance with OWLv2, raising average precision for hard hats from 0.467 to 0.612 and for persons from 0.677 to 0.735, which reflects cleaner annotations and more reliable evaluation. The method improves dataset integrity for cross-domain machine learning by removing padded regions that contain distorted objects and inconsistent annotations.
Abstract
In this paper, we address a novel image restoration problem relevant to machine learning dataset curation: the detection and removal of noisy mirrored padding artifacts. While data augmentation techniques like padding are necessary for standardizing image dimensions, they can introduce artifacts that degrade model evaluation when datasets are repurposed across domains. We propose a systematic algorithm that locates the reflection boundary using a minimum mean squared error criterion with thresholding, then removes the reflective padding. Our method identifies the transition between authentic content and its mirrored counterpart, even in the presence of compression or interpolation noise. We demonstrate our algorithm's efficacy on the SHEL5k dataset, showing significant performance improvements in zero-shot object detection using OWLv2, with average precision increasing from 0.47 to 0.61 for hard hat detection and from 0.68 to 0.73 for person detection. By addressing annotation inconsistencies and distorted objects in padded regions, our approach enhances dataset integrity, enabling more reliable model evaluation across computer vision tasks.
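The boundary search described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `min_pad`, `threshold`, and `offset` values, and the restriction to top padding are all assumptions made for illustration. For each candidate row, the sketch compares the crop above that row with the vertically flipped block of equal height below it, keeps the row with the lowest mean squared error, and treats the image as unpadded if that error exceeds the threshold.

```python
import numpy as np


def find_reflection_row(image, min_pad=8, threshold=20.0):
    """Return (row, mse) for the best top reflection boundary, or (None, mse).

    min_pad and threshold are hypothetical values chosen for illustration;
    they are not taken from the paper.
    """
    img = np.asarray(image, dtype=np.float64)
    height = img.shape[0]
    best_row, best_mse = None, np.inf
    # A candidate boundary b needs b rows above it and b rows below it.
    for b in range(min_pad, height // 2 + 1):
        top = img[:b]
        mirrored = img[b:2 * b][::-1]  # block below the boundary, flipped
        mse = float(np.mean((top - mirrored) ** 2))
        if mse < best_mse:
            best_row, best_mse = b, mse
    # Thresholding step: a large minimum error means no padding was found.
    if best_row is None or best_mse > threshold:
        return None, best_mse
    return best_row, best_mse


def unpad_top(image, min_pad=8, threshold=20.0, offset=2):
    """Crop reflective top padding, skipping `offset` extra rows (assumed value)."""
    row, _ = find_reflection_row(image, min_pad, threshold)
    if row is None:
        return image
    return image[row + offset:]
```

Very small candidate heights are skipped because adjacent rows of a natural image are often similar enough to produce a low error by chance. Bottom padding could be handled the same way by applying the search to a vertically flipped copy of the image, and left or right padding by transposing it first.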
