Fusing Forces: Deep-Human-Guided Refinement of Segmentation Masks

Rafael Sterzinger; Christian Stippel; Robert Sablatnig

TL;DR

This paper tackles the costly task of tracing intricate engravings on Etruscan mirrors by introducing a human-in-the-loop refinement network that updates a photometric-stereo–based initial segmentation using user hints. The method combines a dataset of PS scans, stroke-width statistics, and a two-step correction process (add/erase) with a second network conditioned on human input to achieve faster, higher-quality annotations. Ablation studies quantify the benefits of human guidance, the impact of stroke-width treatment, and the necessity of both addition and erasure operations, reporting up to 75% reduction in manual labor and up to 26% relative improvement over manual refinement. The approach yields efficient, scalable segmentation of complex lines and is demonstrated on multiple mirrors, with public release of code and data to support reproducibility and extension to other domains.

Abstract

Etruscan mirrors constitute a significant category in Etruscan art, characterized by elaborate figurative illustrations on their backs. A laborious and costly aspect of their analysis and documentation is the task of manually tracing these illustrations. Previous work proposed a methodology to automate this process by combining photometric-stereo scanning with deep neural networks. While that approach achieves quantitative performance akin to an expert annotator, some results still lack qualitative precision and therefore require annotators for inspection and potential correction, so the process remains resource-intensive. In response, we propose a deep neural network trained to interactively refine existing annotations based on human guidance. Our human-in-the-loop approach streamlines annotation, achieving equal quality with up to 75% less manual input. Moreover, during the refinement process, the relative improvement of our methodology over pure manual labeling peaks at up to 26%, reaching drastically better quality sooner. Because it is tailored to the complex task of segmenting intricate lines, which distinguishes it from previous methods, our approach offers drastic improvements in efficacy that are transferable to a broad spectrum of applications beyond Etruscan mirrors.

Paper Structure (22 sections, 3 equations, 6 figures, 2 tables, 3 algorithms)

Introduction
Related Work
Segmentation
Photometric Stereo
Interactive Segmentation
Methodology
Dataset
Preprocessing
Simulation of Human Interaction
Statistics
Operations
Interaction
Architecture
Evaluation
Training
...and 7 more sections

Figures (6)

Figure 1: Interactive refinement of segmentation masks: Starting from an initial segmentation $\mathbf{Y}$, the user can add ($\mathbf{\Delta}^+$) or erase ($\mathbf{\Delta}^-$) parts to bring it closer to the ground truth $\mathbf{Y}^*$ (in blue), creating an updated mask $\mathbf{Y}^\mathbf{\Delta}$. Next, a separate model conditioned on the human input $\mathbf{\Delta}$ and on $\mathbf{Y}$ produces a refined segmentation $\mathbf{Y}'$, for which the goal is $||\mathbf{Y}'-\mathbf{Y}^*||_1<||\mathbf{Y}^\mathbf{\Delta}-\mathbf{Y}^*||_1$.
Figure 2: Etruscan mirrors typically feature scenes from Greek mythology. During their examination, archaeologists seek to extract the drawings for visualization.
Figure 3: Illustration of the distribution of stroke widths: After removing outliers from our data, using the two-sigma rule, we fit a Gamma distribution (shape-parameter $a=49.13$, loc$=-4.28$, scale$=0.21$).
Figure 4: An illustration of the overall methodology: In general, segmentation is performed on a per-patch level ($512\times512$, resized to $256\times256$; red denotes patches that are filtered a priori using SAM (Kirillov et al., 2023)). In an interactive paradigm, starting from the initial prediction $\mathbf{Y}$ at timestep $t_0$, based on input $\mathbf{X}$, a human provides hints in the form of $\mathbf{\Delta}$ (the "union" between $\mathbf{Y}$ and $\mathbf{\Delta}$ is denoted with $\mathbf{Y}^\mathbf{\Delta}$), on which a separately trained network $f_{iter}$ is conditioned to produce a refined mask at timestep $t_1$.
Figure 5: An illustration of $\text{pFM}_\mathbf{\Delta}$, i.e., the relative pFM improvement of our method over pure manual refinement; $n$ denotes the number of human interactions: With the relative improvement peaking at values between ca. +12% and +26%, our human-in-the-loop approach immediately overtakes manual labeling, leading to drastically better annotations earlier.
...and 1 more figure
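
The notation in the Figure 1 caption can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: masks are tiny binary grids, the "union" $\mathbf{Y}^\mathbf{\Delta}$ is assumed to mean "set the pixels in $\mathbf{\Delta}^+$, then clear the pixels in $\mathbf{\Delta}^-$", and `y_prime` is a hand-written stand-in for the output of the refinement network, which is not modeled here. The helper names `apply_hints` and `l1_distance` are hypothetical and do not come from the paper's code.

```python
# Toy illustration of the Figure 1 notation with plain Python lists.
# 1 = stroke pixel, 0 = background. All values are made up.

def apply_hints(y, add, erase):
    """Assumed fusion rule for Y^Delta: (Y OR Delta+) AND NOT Delta-."""
    return [
        [int((p or a) and not e) for p, a, e in zip(y_row, a_row, e_row)]
        for y_row, a_row, e_row in zip(y, add, erase)
    ]

def l1_distance(a, b):
    """L1 distance between two binary masks (number of differing pixels)."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

# Ground truth Y* and an imperfect initial segmentation Y.
y_star = [[0, 1, 1, 1, 0],
          [0, 0, 0, 1, 0],
          [0, 0, 0, 1, 0]]
y = [[0, 1, 0, 0, 0],
     [1, 0, 0, 1, 0],
     [0, 0, 0, 0, 0]]

# One "add" hint and one "erase" hint from the user.
add = [[0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]]
erase = [[0, 0, 0, 0, 0],
         [1, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]

y_delta = apply_hints(y, add, erase)

# Hand-written stand-in for the refined mask Y'; a real system would obtain
# this from a network conditioned on Y and the hints.
y_prime = [[0, 1, 1, 1, 0],
           [0, 0, 0, 1, 0],
           [0, 0, 0, 0, 0]]

err_initial = l1_distance(y, y_star)
err_manual = l1_distance(y_delta, y_star)
err_refined = l1_distance(y_prime, y_star)
print(err_initial, err_manual, err_refined)
```

In this toy case the two hints reduce the error from 4 to 2 differing pixels, and the stand-in refined mask reaches 1, so the inequality from the caption holds. Whether a trained model achieves this on real data is an empirical question that the sketch does not address.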
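
The Gamma parameters reported in the Figure 3 caption can be turned into summary statistics and samples. The following is a rough sketch, assuming the caption uses the common shape/loc/scale convention in which a sample is `loc + Gamma(shape, scale)`. The function names are hypothetical, and using such samples to draw simulated user strokes is one plausible application rather than something this excerpt confirms.

```python
import math
import random

# Parameters copied from the Figure 3 caption.
SHAPE, LOC, SCALE = 49.13, -4.28, 0.21

def gamma_mean(shape, loc, scale):
    """Mean of a shifted Gamma distribution: shape * scale + loc."""
    return shape * scale + loc

def gamma_std(shape, scale):
    """Standard deviation of a Gamma distribution: sqrt(shape) * scale."""
    return math.sqrt(shape) * scale

def sample_stroke_width(rng):
    """Draw one stroke width (assumed convention: loc + Gamma(shape, scale))."""
    return rng.gammavariate(SHAPE, SCALE) + LOC

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_stroke_width(rng) for _ in range(10_000)]
empirical_mean = sum(samples) / len(samples)

print(f"analytic mean:  {gamma_mean(SHAPE, LOC, SCALE):.2f}")
print(f"analytic std:   {gamma_std(SHAPE, SCALE):.2f}")
print(f"empirical mean: {empirical_mean:.2f}")
```

Under this reading of the parameters, the fitted distribution has a mean of about 6.04 and a standard deviation of about 1.47, in whatever unit the figure uses for stroke width.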
