A Log-Linear Analytics Approach to Cost Model Regularization for Inpatient Stays through Diagnostic Code Merging

Chi-Ken Lu, David Alonge, Nicole Richardson, Bruno Richard

TL;DR

This work tackles instability in high-dimensional cost modeling for inpatient stays by examining how ICD-10 code granularity affects OLS coefficient stability. It pairs explicit regularization (Ridge) with an implicit regularization mechanism achieved by truncating ICD-10 codes to fewer characters, which increases the Hessian trace $\operatorname{tr}(\tilde{X}^{\prime}\tilde{X})$ and reduces coefficient variance. A new coefficient-consistency metric $\eta$ based on Spearman correlation is introduced to quantify stability across data splits. Empirically, finer ICD-10 granularity yields higher predictive accuracy ($R^2_{\text{test}} \approx 0.41$) but lower coefficient stability, while reducing granularity improves consistency and robustness, with DRG/HCC groupings offering additional stability but varying in predictive performance. The findings provide a practical, interpretable approach to robust risk adjustment in healthcare cost modeling by leveraging implicit regularization through code aggregation, with implications for policy and clinical coding practices.
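To make the idea of a Spearman-based consistency score concrete, here is a minimal sketch. The exact definition of $\eta$ is not given in this summary, so the code assumes a simple stand-in: the Spearman rank correlation between OLS coefficient vectors fitted on two disjoint halves of the data. The synthetic data, the function names `spearman` and `split_consistency`, and the variable `eta_like` are all hypothetical.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation for vectors without ties."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

def split_consistency(X, y, rng):
    """Fit OLS on two disjoint halves and compare the coefficient rankings."""
    idx = rng.permutation(len(y))
    half = len(y) // 2
    coefs = []
    for part in (idx[:half], idx[half:]):
        # Least-squares fit; lstsq also copes with rank-deficient designs.
        beta, *_ = np.linalg.lstsq(X[part], y[part], rcond=None)
        coefs.append(beta)
    return spearman(coefs[0], coefs[1])

rng = np.random.default_rng(0)
n, p = 2000, 30
# Synthetic 0/1 indicators; prevalence ranges from rare (0.5%) to common (30%).
prevalence = np.linspace(0.005, 0.3, p)
X = (rng.random((n, p)) < prevalence).astype(float)
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# Assumed stand-in for the paper's eta, not its actual definition.
eta_like = split_consistency(X, y, rng)
print(f"Spearman consistency across splits: {eta_like:.3f}")
```

A value near 1 means the two fits rank the predictors almost identically. Rare indicators tend to pull the score down because their coefficients are estimated from only a handful of stays in each half.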

Abstract

Cost models in healthcare research must balance interpretability, accuracy, and parameter consistency. However, interpretable models often struggle to achieve both accuracy and consistency. Ordinary least squares (OLS) models for high-dimensional regression can be accurate but fail to produce stable regression coefficients over time when using highly granular ICD-10 diagnostic codes as predictors. This instability arises because many ICD-10 codes are infrequent in healthcare datasets. While regularization methods such as Ridge can address this issue, they risk discarding important predictors. Here, we demonstrate that reducing the granularity of ICD-10 codes is an effective regularization strategy within OLS while preserving the representation of all diagnostic code categories. By truncating ICD-10 codes from seven characters to six or fewer, we reduce the dimensionality of the regression problem while maintaining model interpretability and consistency. Mathematically, the merging of predictors in OLS leads to increased trace of the Hessian matrix, which reduces the variance of coefficient estimation. Our findings explain why broader diagnostic groupings like DRGs and HCC codes are favored over highly granular ICD-10 codes in real-world risk adjustment and cost models.

Paper Structure

This paper contains 16 sections, 2 theorems, 17 equations, 10 figures, 1 table.

Key Result

Lemma 1

The effective dimension for the Ridge regression model trained on $n$ data points has an upper bound $\rho_B$, where $\bar{s}$ stands for the average of the eigenvalues of $\tilde{X}^{\prime}\tilde{X}/n$.
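The formula for $\rho_B$ is not reproduced in this summary. As a hedged illustration of how a bound in terms of $\bar{s}$ can arise, the sketch below uses the standard ridge effective dimension $\sum_i s_i/(s_i+\lambda)$ and the Jensen bound $p\,\bar{s}/(\bar{s}+\lambda)$, which holds because $s \mapsto s/(s+\lambda)$ is concave for $s \geq 0$. Whether this matches the paper's $\rho_B$ is an assumption, as are the penalty scaling `lam`, the synthetic design matrix, and the helper names.

```python
import numpy as np

def effective_dimension(eigvals, lam):
    """Standard ridge effective dimension: sum_i s_i / (s_i + lam)."""
    return float(np.sum(eigvals / (eigvals + lam)))

def jensen_bound(eigvals, lam):
    """Jensen upper bound p * s_bar / (s_bar + lam), s_bar = mean eigenvalue."""
    s_bar = eigvals.mean()
    return float(len(eigvals) * s_bar / (s_bar + lam))

rng = np.random.default_rng(1)
n, p = 500, 20
# Hypothetical sparse indicator design, loosely mimicking diagnosis flags.
X = (rng.random((n, p)) < 0.1).astype(float)

# Eigenvalues of X'X / n; clip tiny negative values caused by round-off.
eigvals = np.clip(np.linalg.eigvalsh(X.T @ X / n), 0.0, None)

lam = 0.05  # assumed ridge penalty on the same scale as X'X / n
d_eff = effective_dimension(eigvals, lam)
bound = jensen_bound(eigvals, lam)
print(f"effective dimension = {d_eff:.3f}, Jensen bound = {bound:.3f}")
```

Since $\bar{s} = \operatorname{tr}(\tilde{X}^{\prime}\tilde{X}/n)/p$, a bound of this form depends on the design matrix only through its trace and the number of predictors.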

Figures (10)

  • Figure 1: Panel A: the predicted log costs from OLS models against their true values. The OLS models include explanatory variables: indicators of ICD-10 diagnostic codes and demographics (age, sex, and race). Panel B: the inconsistency among a few regression coefficients from OLS fits to different training data. In Set 3, the corresponding training samples do not contain H18.421, so the fitted coefficient is zero.
  • Figure 2: A toy example illustrating how code granularity is reduced by merging similar codes. The design matrix $X^{(4)}$ on the left records the diagnoses using codes with $CL\leq4$. The granularity is then lowered by truncating four-character codes to three characters. The predictors for A001 and A002 are added to form the new predictor A00 in the matrix $X^{(3)}$ on the right. The corresponding Hessian matrices are displayed at the bottom. Lowering granularity (left to right) increases the trace due to the co-occurrence of merged codes in the same stay. If the marked 1 in $X^{(4)}$ is set to 0, then the rest of the marked numbers change and the traces on both sides become identical.
  • Figure 3: A. Map of the hospitals that make up the downstate New York subset we use in our analysis. B. Association between the average cost of a stay and the number of diagnostic codes attached to the stay for hospitals in Downstate New York. The more diagnostic codes are attached to a stay, the more expensive the stay. C. Distributions over the Age, Sex, and Race variables in the MedPAR subset.
  • Figure 4: Evolving histograms of the diagonal entries of the Hessian matrix for different code granularities $CL\leq l$, with $l=2,3,4,5,6,7$ (panels A to F). In panels D-F, for higher granularity, the diagonal entries display power-law distributions.
  • Figure 5: Left panel: Histogram of HCC code frequencies. Right panel: Histogram of DRG code frequencies. Code groupings reduce the prevalence of rare codes in the ICD-10 representation of diagnoses.
  • ...and 5 more figures
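The merging step described in the Figure 2 caption can be reproduced numerically. The sketch below uses a small hypothetical design matrix (not the one in the paper's figure) in which one stay carries both A001 and A002. Summing those two columns into A00 raises the trace of the Hessian $X^{\prime}X$ by twice the co-occurrence count, and the traces coincide once the co-occurrence is removed. The helper `merge_columns` and the code B01 are illustrative names only.

```python
import numpy as np

def merge_columns(X, groups):
    """Sum the columns listed in each group to form one merged predictor."""
    return np.column_stack([X[:, idx].sum(axis=1) for idx in groups])

def hessian_trace(X):
    """Trace of X'X, i.e. the sum of squared entries of X."""
    return int(np.trace(X.T @ X))

# Hypothetical design matrix: rows are stays, columns are A001, A002, B01.
X4 = np.array([
    [1, 1, 0],   # this stay carries both A001 and A002 (co-occurrence)
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
])

# Truncate to three characters: A001 and A002 collapse into A00.
groups = [[0, 1], [2]]
X3 = merge_columns(X4, groups)

trace4 = hessian_trace(X4)
trace3 = hessian_trace(X3)
cooccurrence = int(X4[:, 0] @ X4[:, 1])
print(trace4, trace3, cooccurrence)

# Remove the co-occurrence and merge again: the traces now agree.
X4_no = X4.copy()
X4_no[0, 1] = 0
trace4_no = hessian_trace(X4_no)
trace3_no = hessian_trace(merge_columns(X4_no, groups))
print(trace4_no, trace3_no)
```

The gap between the two traces comes from the cross term in $(x_a + x_b)^{\prime}(x_a + x_b) = x_a^{\prime}x_a + x_b^{\prime}x_b + 2\,x_a^{\prime}x_b$, which vanishes when the merged codes never appear in the same stay.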

Theorems & Definitions (8)

  • Remark 1
  • Proof
  • Remark 2
  • Proof
  • Lemma 1
  • Proof
  • Lemma 2
  • Proof