STRinGS: Selective Text Refinement in Gaussian Splatting

Abhinav Raundhal, Gaurav Behera, P J Narayanan, Ravi Kiran Sarvadevabhatla, Makarand Tapaswi

TL;DR

3D Gaussian Splatting often loses fine text details, which hinders understanding of text-rich scenes. STRinGS introduces a text-aware, two-phase refinement that isolates and densifies text Gaussians before full-scene optimization, yielding sharper, more readable text early in training. The approach is validated by improvements in OCR Character Error Rate (CER) across multiple datasets, and the authors introduce the STRinGS-360 dataset to benchmark text readability in 3D reconstructions. The work demonstrates that targeted text refinement can achieve high semantic fidelity without sacrificing global visual quality, which matters when text-rich 3D scenes must be reconstructed under a limited training budget.

Abstract

Text in the form of signs, labels, or instructions is a critical element of real-world scenes, as it can convey important contextual information. 3D representations such as 3D Gaussian Splatting (3DGS) struggle to preserve fine-grained text details even while achieving high visual fidelity. Small errors in the reconstruction of textual elements can lead to significant semantic loss. We propose STRinGS, a text-aware, selective refinement framework that addresses this issue for 3DGS reconstruction. Our method treats text and non-text regions separately, refining text regions first and merging them with non-text regions later for full-scene optimization. STRinGS produces sharp, readable text even in challenging configurations. We introduce a text readability measure, OCR Character Error Rate (CER), to evaluate efficacy on text regions. STRinGS results in a 63.6% relative improvement over 3DGS at just 7K iterations. We also introduce a curated dataset, STRinGS-360, with diverse text scenarios to evaluate text readability in 3D reconstruction. Our method and dataset together push the boundaries of 3D scene understanding in text-rich environments, paving the way for more robust text-aware reconstruction methods.
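
To make the readability measure concrete, the following is a minimal sketch of how a character error rate is commonly computed: the Levenshtein edit distance between the OCR output and the ground-truth string, divided by the length of the ground truth. The function names (`edit_distance`, `cer`, `relative_improvement`) are illustrative and not taken from the paper, the OCR engine itself is outside the sketch, and it is an assumption here that the reported relative improvement is measured on CER.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of an OCR hypothesis against ground-truth text."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_improvement(baseline: float, ours: float) -> float:
    """Fractional reduction of an error metric relative to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline - ours) / baseline


if __name__ == "__main__":
    # Illustrative strings and values only, not results from the paper.
    print(cer("EXIT", "EX1T"))
    print(relative_improvement(0.5, 0.25))
```

Because CER is normalized by the reference length, it can exceed 1.0 when the OCR output contains many spurious characters, so lower is better but the value is not bounded above by 1.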

Paper Structure

This paper contains 33 sections, 3 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: Qualitative and quantitative comparison of Gaussian Splatting methods on text reconstruction at 7K iterations. Left: On a novel view from the Shelf dataset, which features library books on a shelf, our approach STRinGS (bottom) produces sharper, more readable text than vanilla 3DGS (top). Right: We quantify text reconstruction using the Character Error Rate (CER) from Optical Character Recognition (OCR). The accompanying scatter plot presents readability (CER, lower is better) vs. training time. STRinGS achieves the best performance in terms of both lowest error and fastest training time.
  • Figure 2: Overview of the scenes in our STRinGS-360 dataset. Each scene contains semantically meaningful text elements: (A) Extinguisher, (B) Books, (C) Chemicals, (D) Globe, and (E) Shelf. The dataset is designed to evaluate text reconstruction performance under diverse layouts and text orientations.
  • Figure 3: STRinGS overview. Given $n$ input images, we use COLMAP to obtain a point cloud $\mathcal{P}$ and undistorted images, which are passed to Hi-SAM \cite{ye2024hi} to obtain text masks $\mathcal{M}$. $\mathcal{P}$ and $\mathcal{M}$ are passed to the Text Segmentation in 3D module (\ref{subsec:localization}, \ref{alg:localization}) to obtain partitioned text and non-text point clouds. These are processed through a two-phase pipeline. In phase 1 (\ref{subsec:phase1}), we perform targeted densification and reconstruction of text Gaussians. In phase 2 (\ref{subsec:phase2}), we perform full scene refinement, where text and non-text Gaussians are optimized with distinct learning strategies, enabling targeted enhancement of text without compromising scene quality. The final output is a text-refined Gaussian Splat representation with enhanced text readability while preserving overall scene fidelity.
  • Figure 4: Learning rate (LR) of the position parameter for Gaussians in STRinGS (see \ref{eq:pos_lr}). Left: Learning rate scaling factor $\eta_r(t)$ for text and non-text Gaussians. Right: Effective LR obtained by modulating a shifted base exponential decay schedule $\eta_\text{opt}(t)$ from 3DGS with these factors. $\alpha{=}0.5$, $\beta{=}0.0005$, $\gamma{=}15000$. Note that phase 1 sets the position learning rate of $\mathcal{G}_{\text{text}}$ to 0, while $\mathcal{G}_{\text{non-text}}$ is not optimized. In phase 2, we introduce differentiated learning for text and non-text content.
  • Figure 5: Qualitative comparison of different methods at 7K training iterations on scenes from the DL3DV-10K Benchmark \cite{ling2024dl3dv} (rows 1, 2) and our STRinGS-360 (rows 3-5) datasets. While existing methods struggle to reconstruct text accurately at this early stage, our STRinGS framework produces significantly sharper and more legible text regions. (Best viewed on screen.)
  • ...and 4 more figures
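
The Figure 3 caption describes partitioning the COLMAP point cloud $\mathcal{P}$ into text and non-text points using the 2D text masks $\mathcal{M}$, but the exact procedure is not given in this excerpt. As a rough illustration only, one plausible way to lift 2D masks to 3D is to project each point into every view with a pinhole camera and label it as text when a sufficient fraction of the views that see it place it inside a mask. Everything below is an assumption made for illustration: the `View` container, the `min_ratio` voting threshold, and the omission of occlusion handling are hypothetical and should not be read as the authors' algorithm.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class View:
    """Hypothetical container for one posed image and its text mask."""
    R: Sequence[Sequence[float]]  # 3x3 world-to-camera rotation
    t: Sequence[float]            # world-to-camera translation
    fx: float
    fy: float
    cx: float
    cy: float
    mask: List[List[bool]]        # mask[v][u] is True where the pixel is text


def project(view: View, p: Point) -> Optional[Tuple[int, int]]:
    """Pinhole projection of a world point; None if it lies behind the camera."""
    xc = [sum(view.R[i][k] * p[k] for k in range(3)) + view.t[i] for i in range(3)]
    if xc[2] <= 0:
        return None
    u = view.fx * xc[0] / xc[2] + view.cx
    v = view.fy * xc[1] / xc[2] + view.cy
    return int(round(u)), int(round(v))


def partition_points(
    points: Sequence[Point], views: Sequence[View], min_ratio: float = 0.5
) -> Tuple[List[Point], List[Point]]:
    """Split points into (text, non_text) by voting over the views that see them."""
    text, non_text = [], []
    for p in points:
        seen = hits = 0
        for view in views:
            px = project(view, p)
            if px is None:
                continue
            u, v = px
            height, width = len(view.mask), len(view.mask[0])
            if 0 <= u < width and 0 <= v < height:
                seen += 1
                hits += int(view.mask[v][u])
        # Points never observed inside any image default to non-text.
        if seen > 0 and hits / seen >= min_ratio:
            text.append(p)
        else:
            non_text.append(p)
    return text, non_text
```

Voting across views makes the labeling tolerant of a few noisy masks, at the cost of one extra threshold to tune. A fuller version would also need a visibility test, since a point hidden behind another surface can still project into a text mask.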