
Currently, the severity of CLBP and its associated disability are commonly assessed using patient-reported outcome measures (PROMs)22,25,26, such as the numerical rating scale (NRS)27 and visual analog scale (VAS)27 for pain intensity. Establishing a threshold that represents the smallest change in PROM scores perceived as beneficial by patients is essential to determine the clinical significance of a therapy28,29. While previous studies have explored such thresholds in surgical contexts30,31,32,33, there is a lack of studies addressing this need for lumbar steroid injection therapy. To bridge this gap, we leveraged data-driven methodologies to develop a comprehensive predictive framework for lumbar injection therapy in patients with CLBP by integrating clinical data and patient-specific demographics. First, we developed a predictive model to identify key factors influencing the effectiveness of lumbar steroid injection therapy, enabling the identification of patients most likely to experience improvements in pain perception. Treatment success was evaluated based on self-reported pain satisfaction following therapy. Next, we aimed to establish clinically relevant thresholds for pain reduction specific to infiltration therapy, focusing on the minimal reduction in pain scores required for patients to perceive treatment outcomes as satisfactory.
A retrospective secondary analysis was performed using data from the Treatment Expectation and their Influence on Infiltration outcome (TREXI) study. The TREXI study was a prospective observational longitudinal investigation carried out between February 2019 and December 2020 at the Department of Neurology, Schulthess Clinic in Zurich, Switzerland. The original cohort included 306 adult patients, aged 18 to 93 years, diagnosed with CLBP. For this secondary analysis, a subset of 212 patients who provided informed consent for the additional use of their data in research was included (Fig. 1a). The study was approved by the Cantonal Ethics Committee of Zurich (BASEC-NR 2023-02210) and complied with the ethical principles outlined in the Declaration of Helsinki.
Our study focused on patient-specific clinical and demographic characteristics as potential predictors of treatment response, excluding the measures of patient expectations that were the main focus of the original analysis. The experimental protocol comprised questionnaires administered in German at three time points: on the day of the lumbar steroid injection immediately prior to treatment, immediately after the injection, and two weeks after treatment.
To align with the study’s objective of evaluating predictors of treatment response, data collected immediately after the treatment were excluded. This exclusion ensured a clear separation between the baseline data and the post-treatment data. Accordingly, the baseline was redefined as the period prior to injection therapy, and the post-treatment time point as two weeks after injection.
Data collection encompassed a comprehensive set of questionnaire items addressing patients’ demographics, pain characteristics, and self-reported health status (Fig. 1b).
Demographics: Demographic information, including age, sex, education level, and professional status (self-employed, student, homemaker, retired, incapacitated, or unemployed), was collected through questionnaires.
Pain characteristics: The duration of back complaints was recorded in intervals, namely less than 4 weeks, 4 to 8 weeks, 8 to 12 weeks, and more than 12 weeks. Current back pain was assessed using a numerical rating scale (NRS), on which individuals rate their pain intensity from 0 (no pain) to 10 (worst pain imaginable). For participants who had previously undergone lumbar steroid injections, additional data were collected, including whether they experienced improvement after the last injection, the time elapsed since the previous treatment (less than 1 year, 1 to 2 years, or more than 2 years), and whether the procedure was performed by the same doctor or clinic. Motivation for treatment was assessed through sources of influence, including friends, family, the doctor performing the infiltration, general practitioner, internet, personal experience, and the importance of others’ opinions.
Self-reported health status: A comprehensive set of validated PROMs covering medication beliefs, expectations, empathy in care, and self-efficacy was collected when infiltration therapy was administered to gain a multidimensional understanding of the patient’s pain experience and its impact. The items were extracted from widely used questionnaires in clinical research and consisted of the following. The Perceived Sensitivity to Medicine (PSM) scale was used to evaluate patients’ perceived responsiveness to medication in general. This questionnaire includes items assessing perceived susceptibility to medications, beliefs about experiencing strong reactions, perceptions of having stronger reactions than others, and concerns about side effects from regular medication use. Responses were recorded on a 5-point Likert scale ranging from “strongly disagree” to “strongly agree”. Furthermore, the Consultation and Relational Empathy (CARE) measure was employed to record patients’ evaluation of the overall care experience. This questionnaire assesses various aspects of the patient-provider interaction, such as the provider’s ability to make the patient feel at ease, let them tell their story, make them feel understood, be interested in them as a whole person, fully understand their concerns, show care and compassion, and explain things clearly. Responses are given on a 5-point Likert scale ranging from “poor” to “excellent”. For a multidimensional assessment of pain and disability, the Core Outcome Measures Index (COMI) back score was used. The COMI is a concise 7-item questionnaire that assesses LBP-related disability, quality of life, and pain perception, including questions on back and leg pain intensity (0-10 NRS), function, symptom-specific well-being, general quality of life, and disability at work and in social situations (5-point Likert scale).
The COMI has been extensively validated against well-established longer questionnaires such as the Roland Morris Disability Questionnaire and the 36-item Short-Form Health Survey (SF-36). The Pain Self-Efficacy Questionnaire (PSEQ) was used to evaluate patients’ beliefs in their ability to cope with and manage pain despite its presence. This questionnaire includes 10 items assessing the patient’s confidence in performing various activities despite pain, such as enjoying things, doing household chores, socializing, coping with pain without medication, achieving goals, engaging in leisure activities, coping with pain in general, accomplishing work tasks, living a normal lifestyle, and becoming more active. Responses are provided on a 7-point scale ranging from 0 (“not at all confident”) to 6 (“completely confident”).
The primary focus of this study was to assess the effectiveness of lumbar steroid injections by evaluating both patient satisfaction and clinically meaningful improvements in pain intensity two weeks after treatment. Accordingly, the main outcome was pain level satisfaction. A dichotomized variable was created based on the question: “Are you satisfied with the current pain level?” (“Sind Sie mit dem aktuellen Schmerzniveau zufrieden?”). Patients rated their satisfaction on a scale from 0 to 10, where 0 indicated complete satisfaction and 10 indicated no satisfaction at all. To align with clinical success criteria, a cut-off point of 6 was established: patients scoring between 0 and 6 were classified as satisfied, reflecting a successful treatment outcome, while those scoring between 7 and 10 were classified as dissatisfied. Secondary outcomes focused on changes in pain intensity to objectively assess improvements in maximum pain levels. The baseline pain level was computed as the maximum of the pain intensities reported in the first two questions of the COMI questionnaire, which refer to back and leg pain rated on a 0-10 NRS. The absolute change in pain was calculated as the difference between baseline pain and pain two weeks after treatment, the latter likewise computed as the maximum of back and leg pain. To account for individual variability in baseline pain levels, a relative change in pain was calculated by normalizing the absolute change to the baseline value, as follows:
$$\text{relative change in pain} = \frac{\text{baseline pain} - \text{pain at two weeks}}{\text{baseline pain}}$$
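For illustration, the outcome definitions above can be sketched in Python. This is a minimal sketch under stated assumptions rather than the study code: the function name, argument names, and dictionary keys are hypothetical, and the handling of a baseline pain of zero (returning NaN) is an assumption that the text does not specify.

```python
def pain_outcomes(back_pre, leg_pre, back_post, leg_post, satisfaction_score, cutoff=6):
    """Derive the primary and secondary outcomes for one patient (hypothetical helper)."""
    # Baseline and follow-up pain: maximum of back and leg pain (0-10 NRS).
    baseline = max(back_pre, leg_pre)
    follow_up = max(back_post, leg_post)
    # Absolute change: positive values mean less pain after treatment.
    absolute_change = baseline - follow_up
    # Relative change: absolute change normalized to baseline (assumption: NaN if baseline is 0).
    relative_change = absolute_change / baseline if baseline > 0 else float("nan")
    # Scores of 0-6 on the satisfaction item are classified as satisfied.
    satisfied = satisfaction_score <= cutoff
    return {
        "baseline": baseline,
        "follow_up": follow_up,
        "absolute_change": absolute_change,
        "relative_change": relative_change,
        "satisfied": satisfied,
    }


example = pain_outcomes(back_pre=7, leg_pre=5, back_post=3, leg_post=4, satisfaction_score=2)
```

In this example the baseline pain is 7, the follow-up pain is 4, and the relative change is 3/7.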
Figure 1c illustrates the comprehensive statistical analysis workflow, from initial data preprocessing through descriptive statistics to feature analysis of predictive models. Descriptive statistics were used to summarize the general characteristics of the participants across the two groups created according to the “pain level satisfaction” variable.
Continuous variables were reported as mean ± standard deviation, and categorical variables as frequencies and percentages. For group comparisons, Student’s t-test (or Wilcoxon signed-rank tests when appropriate) and the chi-square test were used for continuous and categorical variables, respectively. To control for multiple comparisons and maintain a false discovery rate of 5%, all statistical comparisons were adjusted using the Benjamini-Hochberg correction method. All statistical analyses were performed using the statsmodels library in Python version 3.10.
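The comparison and correction steps can be sketched as follows. The data are synthetic and the variable names (`age`, `sex`, `satisfied`) are placeholders, not the study variables; the use of SciPy for the individual tests is an assumption, since the text names only statsmodels.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Synthetic stand-in data (assumption): one continuous and one categorical variable.
rng = np.random.default_rng(0)
satisfied = rng.integers(0, 2, size=120).astype(bool)
age = rng.normal(60, 12, size=120)
sex = rng.integers(0, 2, size=120)

# Continuous variable: two-sample t-test between satisfied and dissatisfied groups.
p_age = stats.ttest_ind(age[satisfied], age[~satisfied])[1]

# Categorical variable: chi-square test on the 2x2 contingency table.
table = np.array(
    [[np.sum((sex == s) & (satisfied == g)) for g in (False, True)] for s in (0, 1)]
)
p_sex = stats.chi2_contingency(table)[1]

# Benjamini-Hochberg adjustment at a false discovery rate of 5%.
p_values = [p_age, p_sex]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
```

Here `p_adjusted` holds the corrected p-values in the same order as `p_values`, and `reject` flags the comparisons that remain significant after correction.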
A data-driven predictive model was developed to classify treatment outcomes based on the dichotomized pain level satisfaction variable, distinguishing between “satisfied” and “dissatisfied” patients. This section outlines the methodological approach used to design and benchmark predictive models. The predictive models were developed and implemented in Python version 3.10 with the scikit-learn and PyCaret libraries.
To mitigate the potential risk of overfitting associated with the limited sample size, a stratified nested cross-validation (CV) methodology was employed. Nested CV involves two loops over the data, one nested inside the other. The outer loop applied a 10-fold stratified CV scheme to divide the dataset into training and testing sets, preserving the proportions of the two classes (i.e., satisfied and dissatisfied) across folds and providing unbiased estimates of model generalization performance on completely unseen data.
Prior to predictive model training, several preprocessing steps were performed to ensure the quality and relevance of the features. Features with missing values exceeding 15% were removed, given the limited dataset size, to prevent bias and ensure reliable analysis. The remaining missing data were imputed using an iterative approach: Random Forest (RF) was used for numeric features, and K-Nearest Neighbors (KNN) for categorical features, iterating five times for optimal imputation. Importantly, data imputation was completed prior to the outer cross-validation split to avoid any information leakage. Features were encoded based on their data type to ensure effective model training: numeric features were standardized using z-scores for consistent scaling, categorical variables were one-hot encoded to convert them into a numerical format, and ordinal features were label encoded to preserve their inherent order. To address potential multicollinearity and improve classification performance, features with a variance less than 0.01, as well as those with a correlation coefficient greater than 0.7, were removed.
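The variance and correlation filters can be illustrated with a short NumPy sketch. The function name is hypothetical, and two details are assumptions not stated in the text: the filter is applied to unstandardized columns (z-scored columns all have unit variance), and of two highly correlated features the one appearing first is kept.

```python
import numpy as np


def filter_features(X, var_min=0.01, corr_max=0.7):
    """Return indices of columns kept after variance and correlation filtering."""
    X = np.asarray(X, dtype=float)
    # Step 1: drop near-constant columns (variance below var_min).
    keep = [j for j in range(X.shape[1]) if X[:, j].var() >= var_min]
    if not keep:
        return []
    # Step 2: absolute Pearson correlations among the remaining columns.
    corr = np.abs(np.atleast_2d(np.corrcoef(X[:, keep], rowvar=False)))
    selected = []  # pairs of (position in `keep`, original column index)
    for pos, j in enumerate(keep):
        # Keep a column only if it is not highly correlated with any kept column.
        if all(corr[pos, prev_pos] <= corr_max for prev_pos, _ in selected):
            selected.append((pos, j))
    return [j for _, j in selected]


rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = a + 0.01 * rng.normal(size=100)  # nearly a duplicate of column a
c = rng.normal(size=100)             # independent column
d = np.ones(100)                     # constant column
kept = filter_features(np.column_stack([a, b, c, d]))
```

In this example the near-duplicate column `b` and the constant column `d` are removed, leaving columns 0 and 2.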
During each outer loop cycle, the training data were further split using a 10-fold stratified inner CV to train the classifier. This step ensured the selection of optimal model configurations without introducing information leakage from the test set. This nested approach kept model training and hyperparameter tuning separate from the final performance evaluation, thereby improving the reliability of the generalization assessments.
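A nested stratified CV of this kind can be sketched with scikit-learn as follows. The synthetic dataset, the choice of logistic regression, and the small grid over `C` are illustrative assumptions and do not reproduce the study configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict

# Synthetic, mildly imbalanced stand-in for the patient dataset (assumption).
X, y = make_classification(
    n_samples=200, n_features=10, weights=[0.7, 0.3], random_state=0
)

# Outer loop: 10-fold stratified split used only for performance estimation.
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
# Inner loop: 10-fold stratified split used only for model selection.
inner_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# The grid search is refit on each outer training set, so tuning never sees
# the corresponding outer test fold.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="f1",
)

# Out-of-fold predictions from all outer folds, pooled for a single estimate.
y_pred = cross_val_predict(search, X, y, cv=outer_cv)
pooled_f1 = f1_score(y, y_pred)
```

Passing the search object to `cross_val_predict` is what makes the procedure nested: every outer fold triggers a full inner search on its training portion only.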
Multiple baseline classifiers were trained on all the features and benchmarked using various classification algorithms: Logistic Regression (LR), KNN, Support Vector Machines (SVM) with a linear kernel, Ridge Classifier (RC), Naive Bayes (NB), Linear (LDA) and Quadratic Discriminant Analysis (QDA), Decision Trees (DT), and ensemble methods including Extra Trees (ET), RF, AdaBoost, Gradient Boosting Machine (GBM), XGBoost, and LightGBM. This baseline comparison provided a reference point for subsequent optimization.
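A reduced version of such a benchmark is sketched below for a handful of scikit-learn classifiers with default settings. The synthetic data, the subset of algorithms, and the ranking by F1-score are assumptions made for illustration; XGBoost and LightGBM are omitted to keep the example dependent on scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=200, n_features=10, weights=[0.7, 0.3], random_state=0
)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # Pool out-of-fold predictions, then score them once per model.
    y_pred = cross_val_predict(model, X, y, cv=cv)
    results[name] = {
        "f1": f1_score(y, y_pred),
        "mcc": matthews_corrcoef(y, y_pred),
    }

# Rank the baselines by F1-score, best first.
ranking = sorted(results, key=lambda name: results[name]["f1"], reverse=True)
```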
Predictions from all outer loop iterations were concatenated to calculate the final performance metrics, providing a robust estimate of the model’s generalization ability while maintaining strict train-test separation. Model performance was primarily assessed based on F1-score, average precision (AP), and Matthews correlation coefficient (MCC), which are classification metrics particularly useful for imbalanced datasets. The F1-score represents the harmonic mean of precision and recall, calculated as:
$$F1 = \frac{2\,TP}{2\,TP + FP + FN}$$
where TP, FP, and FN denote true positives, false positives, and false negatives, respectively. The F1-score ranges from 0 to 1, with 1 indicating perfect precision and recall and 0 indicating that no positive instance was correctly identified. The AP for a given class is calculated as the area under the precision-recall curve:
$$AP = \sum_{n} P(n)\,\Delta R(n)$$
where P(n) denotes precision at the n-th recall level, while $\Delta R(n)$ measures the change between consecutive recall levels. AP ranges from 0 to 1, with 1 indicating optimal precision and recall; a random classifier attains an AP approximately equal to the prevalence of the positive class. MCC is a robust summary metric computed as the correlation coefficient between observed and predicted binary classifications, with TN denoting true negatives:
$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
MCC values range from -1 to +1, where +1 represents a perfect prediction, 0 indicates that the prediction is no better than a random prediction, and -1 represents total disagreement between prediction and observation. In addition, standard classification metrics including precision, recall, accuracy, Cohen’s kappa coefficient, and the area under the receiver operating characteristic curve (AUC) were used for a comprehensive evaluation.
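The three primary metrics can be computed directly from their definitions, as in the following sketch. The helper names are hypothetical, the AP helper assumes that no two scores are tied, and in practice the equivalent scikit-learn functions (`f1_score`, `average_precision_score`, `matthews_corrcoef`) would normally be used instead.

```python
from math import sqrt


def f1_from_counts(tp, fp, fn):
    """F1-score: harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


def mcc_from_counts(tp, fp, fn, tn):
    """Matthews correlation coefficient from the confusion matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


def average_precision(y_true, scores):
    """AP as the sum of precision times recall increment (assumes no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            # Recall increases by 1 / n_pos at each positive; precision is hits / rank.
            ap += (hits / rank) * (1 / n_pos)
    return ap


f1 = f1_from_counts(tp=40, fp=10, fn=20)
mcc = mcc_from_counts(tp=40, fp=10, fn=20, tn=30)
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

For these counts, F1 equals 80/110 and MCC equals 1000 divided by the square root of 6,000,000; the AP example evaluates to (1 + 2/3)/2.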
Building on the baseline comparison, the best-performing model from the cross-validation splits was further tuned (Fig. 1d) with the objective of identifying the most informative clinical and demographic characteristics through statistical techniques. The data encoding procedures, i.e., encoding and standardization of variables, remained consistent with the approach used in the baseline classifiers, ensuring methodological uniformity.
Next, feature selection was performed according to an importance ranking obtained from an XGBoost estimator, retaining the top 70% of features (35 features). To address the class imbalance in the dataset, random oversampling was applied during model training. This technique involved duplicating samples from the minority class to balance the class distribution.
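Random oversampling can be sketched in a few lines of NumPy. The function below is a hypothetical helper, not the implementation used in the study; it resamples every smaller class with replacement until it matches the largest class, and it is meant to be applied to training data only so that duplicated samples never reach a test fold.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes are balanced."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    indices = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw the missing number of samples with replacement (none for the majority class).
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        indices.append(np.concatenate([cls_idx, extra]))
    indices = np.concatenate(indices)
    return X[indices], y[indices]


X_train = np.arange(20).reshape(10, 2)
y_train = np.array([0] * 8 + [1] * 2)
X_balanced, y_balanced = random_oversample(X_train, y_train)
```

After resampling, both classes in `y_balanced` contain eight samples, and every added row is an exact copy of an existing minority-class row.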
We optimized hyperparameters for the best-performing baseline model, RF, using Bayesian and random grid search. This optimization aimed to identify the most informative clinical and demographic characteristics through systematic exploration of the model’s parameter space while minimizing overfitting. The hyperparameters were tuned through 5-fold stratified cross-validation (inner folds) for each outer fold, with predefined hyperparameter space ranges.
The RF model was configured with 10 to 1000 trees; a higher number of trees can improve performance but increases computational time. The tree depth ranged from 1 to 32, allowing the model to capture more complex patterns, although deeper trees carry a higher risk of overfitting. This risk was mitigated by tuning the minimum number of samples required to split a node (2 to 20) and the minimum number of samples per leaf (1 to 20). The fraction of features considered at each split was varied between 0.1 and 1.0, with smaller values introducing randomness to reduce overfitting.
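The search space described above could be expressed for a scikit-learn random search as sketched below. The mapping of the described ranges to the `RandomForestClassifier` parameters `min_samples_split`, `min_samples_leaf`, and `max_features` is our reading of the text, and the number of iterations, the scoring metric, and the random seeds are assumptions. Only the random search is shown, not the Bayesian search.

```python
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterSampler, RandomizedSearchCV, StratifiedKFold

# Hyperparameter ranges taken from the text; randint excludes its upper bound.
param_space = {
    "n_estimators": randint(10, 1001),      # 10 to 1000 trees
    "max_depth": randint(1, 33),            # depth 1 to 32
    "min_samples_split": randint(2, 21),    # 2 to 20 samples to split a node
    "min_samples_leaf": randint(1, 21),     # 1 to 20 samples per leaf
    "max_features": uniform(0.1, 0.9),      # fraction of features in [0.1, 1.0]
}

# 5-fold stratified inner CV used for tuning within each outer training set.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_space,
    n_iter=20,          # assumption: number of sampled configurations
    scoring="f1",       # assumption: tuning metric
    cv=inner_cv,
    random_state=0,
)

# Draw a few candidate configurations to inspect the space without fitting.
sampled = list(ParameterSampler(param_space, n_iter=5, random_state=0))
```

Calling `search.fit(X_train, y_train)` on an outer training fold would then evaluate the sampled configurations and expose the best one as `search.best_params_`.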
Feature importance was assessed using SHapley Additive exPlanations (SHAP), which identified the principal predictors for the best-performing classifier. SHAP values provide a quantitative measure of the influence of individual features, representing the average marginal contribution of each feature to the model’s prediction for a given instance. These values are computed by comparing the model’s predictions with and without each feature, considering all possible feature combinations. Larger absolute SHAP values indicate stronger effects, while the sign of the value shows whether the feature increases (positive) or decreases (negative) the predicted outcome.
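The underlying idea can be made concrete with an exact Shapley value computation on a toy model. This sketch does not use the SHAP library and is not the procedure applied to the RF classifier; it enumerates all feature subsets, which is feasible only for a few features, and it represents an absent feature by a fixed baseline value, which is one common convention and an assumption here. The toy model and all names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance by enumerating all feature subsets."""
    n = len(x)

    def value(subset):
        # Features outside the subset are replaced by their baseline value.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                # Weight of a subset of this size in the Shapley average.
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                # Marginal contribution of feature i when added to the subset.
                total += weight * (value(set(subset) | {i}) - value(set(subset)))
        phi.append(total)
    return phi


def toy_model(z):
    # Hypothetical model with two main effects and one interaction.
    return 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[0] * z[2]


phi = shapley_values(toy_model, x=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0])
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline (1.5 in this example). The second feature receives a negative value because it lowers the prediction, and the interaction term is shared equally between the first and third features.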
ROC analysis (Fig. 2) was performed to identify meaningful thresholds for both absolute and relative reductions in patient-reported pain scores after lumbar infiltration, using ‘pain level satisfaction’ as the reference variable. The ideal point on an ROC curve is the upper left corner (0, 1), where both specificity (Sp) and sensitivity (Se) reach 100%. In our analysis, the optimal cut-off points on these curves represent the smallest change in pain score, absolute and relative, that best distinguishes between satisfied and dissatisfied patients. The optimal cut-off point can be determined using several approaches.
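One widely used approach is to maximize Youden’s J statistic (Se + Sp - 1) along the ROC curve. The sketch below illustrates this criterion on made-up data; the choice of Youden’s J is an assumption for illustration and is not necessarily the criterion used in this study, and the values in `relative_change` and `satisfied` are not study data.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up relative pain reductions and satisfaction labels (1 = satisfied).
relative_change = np.array([0.0, 0.05, 0.1, 0.2, 0.45, 0.3, 0.5, 0.6, 0.8])
satisfied = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])

# Each threshold classifies a patient as satisfied if the change is >= threshold.
fpr, tpr, thresholds = roc_curve(satisfied, relative_change, drop_intermediate=False)

# Youden's J = sensitivity + specificity - 1 = TPR - FPR.
youden_j = tpr - fpr
best = int(np.argmax(youden_j))

best_cutoff = thresholds[best]
sensitivity = tpr[best]
specificity = 1 - fpr[best]
```

In this example the selected cut-off is a relative reduction of 0.3, at which all satisfied patients are identified and four of the five dissatisfied patients are correctly excluded.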
To further assess the robustness of our classification approach under a clinically meaningful definition of success, we conducted an additional evaluation using the smallest relative change in baseline pain scores as a threshold. Specifically, based on the optimal cut-off identified in our ROC analysis, we stratified participants into “satisfied” or “dissatisfied” groups according to whether their relative reduction in self-reported pain two weeks after treatment exceeded this threshold. This aims to reflect a more clinically relevant perspective of improvement, since percentage-based reductions in pain often better capture individual differences than absolute changes alone.
In this simplified analysis, we applied the same preprocessing and cross-validation schemes but focused exclusively on the set of baseline classifiers without hyperparameter optimization (see previous section). By benchmarking these models, we obtained a clear view of how different algorithms perform when using a clinically significant cut-off for pain relief, rather than the original dichotomous satisfaction variable.

