How to perform prespecified subgroup analyses when using propensity score methods in the case of imbalanced subgroups | BMC Medical Research Methodology

We aim to evaluate, from observational data, a subset-by-treatment interaction on right-censored data using propensity score methods. To evaluate the empirical performance of the "across subsets" and "within subsets" strategies, we performed a Monte Carlo simulation study. We generated data close to the REFCOR setting, where patients with facial palsy, the smaller group of the sample, mostly (in 80% of cases) received a facial nerve resection and may have benefited more from that resection than patients without facial palsy (the majority of the sample, of whom only 40% received a nerve resection).

We thus considered a population partitioned into two subsets \((S=1,2)\) of different sizes, with a potential heterogeneity in treatment effect across the subsets. Similarly to the REFCOR study, we considered the treatment effect possibly restricted to the smaller subset, but where the treatment has been widely preferred.

Data-generating mechanisms

We considered a population partitioned into two subsets \(S (=1,2)\) of potential differential influence on a right-censored outcome (where large times indicate improved outcomes), with a proportion of \(p(S=1)=0.25\) patients in the subset 1 (the smaller subset).

We simulated samples of n=3,000 patients, with two continuous covariates (\(X_1\) and \(X_2\)) drawn from independent normal distributions with mean 0 and standard deviation 1, and eight binary covariates (\(X_3, \ldots , X_{10}\)) drawn from independent Bernoulli distributions with parameter 0.5. Covariates had a strong, moderate or no association, first with the outcome, and second with treatment allocation (Table 3).
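As an illustration, the covariate-generating step above can be sketched as follows (a minimal sketch in Python rather than the authors' R; the function and variable names are our own):

```python
import random

def draw_covariates(rng):
    """One patient's covariates: X1, X2 ~ N(0, 1); X3..X10 ~ Bernoulli(0.5)."""
    continuous = [rng.gauss(0.0, 1.0) for _ in range(2)]    # X1, X2
    binary = [float(rng.random() < 0.5) for _ in range(8)]  # X3..X10
    return continuous + binary

rng = random.Random(2023)  # arbitrary seed for reproducibility
patients = [draw_covariates(rng) for _ in range(3000)]
```

Each patient is thus a vector of ten covariates, two continuous and eight binary.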

Table 3 Covariates included in the simulation as a function of their association with treatment allocation and outcome

For subject \(i=1,\ldots ,n\), we first generated membership in subset \(S=1\) from a Bernoulli distribution, \(S_i \sim B(0.25)\), then generated the treatment group \(Z_{i} \sim B(p_{i})\) with \(logit(p_{i})= \beta _{0|S} + \sum _{j=1}^{10}\beta _{j|S}.x_{j,i}\), where \(\beta _{0|S=1}\) was set to 0.3 and \(\beta _{0|S=2}\) to \(-1.9\) to obtain \(P(Z_i=1|S=1) \approx 0.8\) and \(P(Z_i=1|S=2)\approx 0.4\) (close to the REFCOR proportions of treated patients in each subset), and where the \(\beta _{j|S}\) denote the covariate effects on treatment allocation.
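The treatment-allocation step can be sketched as below. The coefficient vector `BETA` is a placeholder: the actual \(\beta _{j|S}\) values come from Table 3 and are not reproduced here.

```python
import math
import random

# Illustrative covariate coefficients; the true beta_{j|S} are given in Table 3.
BETA = [math.log(2)] + [0.0] * 9

def assign_treatment(x, s, rng, beta=BETA, beta0=(0.3, -1.9)):
    """Draw Z ~ Bernoulli(p) with logit(p) = beta_{0|S} + sum_j beta_{j|S} * x_j."""
    logit_p = beta0[s - 1] + sum(b * xj for b, xj in zip(beta, x))
    p = 1.0 / (1.0 + math.exp(-logit_p))
    return int(rng.random() < p)
```

With all covariates at zero, the intercepts alone give treatment probabilities of about 0.57 in subset 1 and 0.13 in subset 2; the covariate terms then shift these toward the target marginal proportions of 0.8 and 0.4.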

Each survival outcome \(T_{i}\) was then generated from an exponential distribution with a hazard depending on the treatment \(Z_i\), the covariates \((x_{j,i}, j=1,\ldots ,10)\) and the subset \(S_i\) of the patient, given by \(\lambda _{i} = \lambda _0.exp [\theta _S.Z_{i} + \alpha _S.S_i + \sum _{j=1}^{10} \alpha _j.x_{j,i}]\), where the baseline hazard \(\lambda _0\) was set to 0.005; the conditional treatment effects in subsets 1 and 2 were \(\theta _{S=1}\) and \(\theta _{S=2}\), respectively; \(\alpha _S\) denotes the effect of the subset on the outcome; and the \(\alpha _j\) denote the covariate effects on the outcome (Table 3). Strong, moderate and no impact corresponded to parameter values of \(\log 2\), \(\log 1.3\) and \(\log 1\), respectively.

We simulated an independent censoring time for each patient from a uniform distribution \(U[1,150]\); patients whose censoring time fell below their time-to-event, or above the administrative cutoff of 60, were right-censored (in the latter case at t = 60).
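The outcome and censoring steps can be sketched together as follows (again a hedged Python sketch; coding the subset term \(\alpha _S.S_i\) with the numeric label \(S_i \in \{1,2\}\) is our reading of the formula, and all names are our own):

```python
import math
import random

def simulate_outcome(z, s, x, theta, alpha_s, alpha, rng,
                     lam0=0.005, admin_cutoff=60.0):
    """Exponential event time with hazard
    lam0 * exp(theta_S * Z + alpha_S * S + sum_j alpha_j * x_j),
    censored by C ~ U[1, 150] truncated at the administrative cutoff t = 60.
    Returns (observed time, event indicator)."""
    lam = lam0 * math.exp(theta[s - 1] * z + alpha_s * s
                          + sum(a * xj for a, xj in zip(alpha, x)))
    t_event = rng.expovariate(lam)
    t_cens = min(rng.uniform(1.0, 150.0), admin_cutoff)
    return min(t_event, t_cens), int(t_event <= t_cens)
```

The observed follow-up time is therefore always at most 60, and the event indicator records whether the event time preceded the censoring time.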

Several scenarios were investigated, depending on the impact of the subsets on the outcome (\(\alpha _{S=1}\) and \(\alpha _{S=2}\)), that is, without any treatment effect or treatment-by-subset interaction (Table 4). We then assessed the influence of the sample size (n, from 500 to 5,000), the proportion of patients in each subset (\(p(S=1)\)), the relative risks of being treated in each subset (\(\beta _{0|S=1}\) and \(\beta _{0|S=2}\)), and the impact of the covariates on the outcome (\(\alpha _j\)).

Table 4 Summary of the main scenarios

Estimand/target of analysis

The true causal marginal treatment effects in the treated, measured on the log HR scale, were computed for each scenario in each subset \(S=1,2\) using a sample of 1,000,000 individuals.


In looking for a treatment-by-subset interaction, two strategies of analysis regarding the PS estimation were considered and applied to each dataset. First, the “across subsets” strategy consisted of estimating a single PS from the whole sample while incorporating the subset indicator and potential interaction terms into the PS model. Second, the “within subsets” strategy consisted of estimating the PS in each subset separately.

Regardless of the strategy, several PS methods were applied, targeting the average treatment effect in the treated group (ATT) and then the average treatment effect in the overlap (ATO). Thus, we first performed PS matching without replacement using a 1-to-1 nearest neighbor matching algorithm, with a caliper set to 0.2, then to 0.1, standard deviations of the logit of the PS [19]. The hazard ratio (HR) of an event was then estimated from a Cox model with a robust estimator of the variance. In a second approach, we allowed replacement in the untreated group, calculating the variance of the estimator with the Austin and Cafri estimator [20]. We also used a PS weighting approach, first with standardized mortality ratio weights (SMRWs) [21] after stabilization [22] and with a bootstrapped variance estimation [23], and then with overlap weights [24], with a robust estimation of the standard errors to account for weighting.
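The two weighting schemes reduce to simple functions of the estimated PS. A minimal sketch (stabilization and variance estimation are omitted, and the function name is our own):

```python
def ps_weights(ps, z):
    """ATT (SMR) weights: treated get 1, untreated get e/(1-e).
    Overlap (ATO) weights: treated get 1-e, untreated get e."""
    smrw = [1.0 if zi == 1 else e / (1.0 - e) for e, zi in zip(ps, z)]
    ato = [1.0 - e if zi == 1 else e for e, zi in zip(ps, z)]
    return smrw, ato
```

SMR weights reweight the untreated to resemble the treated (targeting the ATT), while overlap weights upweight patients whose PS is near 0.5, targeting the population of clinical equipoise (ATO).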

To evaluate the influence of the PS model, we used different models for PS estimation. “True PS” was defined as multivariable logistic regression including true confounders (variables affecting both treatment allocation and outcome) and interaction terms between the subset and variables with different effects on treatment allocation (X4 and X9). We further included X6 for the “PS with a prognostic variable”, X1 for the “PS with an instrumental variable”, interaction terms with X2, X5, X7 and X10 for the “PS with all interaction terms”, and we omitted all interaction terms for the “PS without interaction term”.

Performance measures

To assess the performance of these methods, \(n_{sim}=\) 1,000 independent replications of each scenario were performed, corresponding to a Monte Carlo standard error below 1% for a coverage of 95% [25].

Over those replications, we computed the mean bias, defined as the average difference between the estimated treatment effect and the true marginal treatment effect, and the coverage of the 95% confidence interval, defined as the percentage of replications in which the true treatment effect was included in the 95% confidence interval. We also reported the 95% confidence interval of the bias estimate, using the Monte Carlo standard error, and recorded the frequency of non-convergence issues.
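These performance measures can be computed as below (a sketch with our own names; note that for a true coverage of 0.95 and 1,000 replications the Monte Carlo standard error of the coverage is \(\sqrt{0.95 \times 0.05/1000} \approx 0.7\%\), consistent with the stated bound):

```python
import math

def summarize(estimates, ci_low, ci_high, truth):
    """Mean bias, its Monte Carlo 95% CI and the empirical coverage
    of the per-replication confidence intervals."""
    n = len(estimates)
    bias = sum(e - truth for e in estimates) / n
    sd = math.sqrt(sum((e - truth - bias) ** 2 for e in estimates) / (n - 1))
    mcse = sd / math.sqrt(n)  # Monte Carlo standard error of the mean bias
    cover = sum(lo <= truth <= hi for lo, hi in zip(ci_low, ci_high)) / n
    return bias, (bias - 1.96 * mcse, bias + 1.96 * mcse), cover
```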

The simulation study and analyses for the applied example were performed in R version 4.1.3 using the “survival”, “survey”, “simsurv”, “ggplot2”, “survminer”, “tableone”, “mice”, “MatchIt”, “MatchThem”, “WeightIt”, “cobalt”, “boot”, “VIM” and “forestplot” packages.


We first considered samples of \(n=3,000\) individuals. In Scenario 1, where some treatment-by-subset interaction was introduced, both the “across subsets” and “within subsets” strategies yielded similar results, except when the “across subsets” approach was used with no interaction term in the PS model. With this PS model, a substantial bias in the estimation of the treatment effect and impaired coverage were observed in subset 1 (Fig. 1). As expected, the inclusion of an instrumental variable in the PS model increased the variance of the estimate (Fig. 1 and Supplementary Fig. 3, Additional file 3). Bias in the estimated effect was higher in subset 1 than in subset 2 and was proportional to the treatment effect (Fig. 2).

Fig. 1

Comparison of strategies according to the PS model in Scenario 1. Comparison of the “across subsets” and “within subsets” strategies in terms of the mean absolute bias (A), variance (B) and the coverage of the 95% CI (C) according to the PS model. S1 = subset 1; S2 = subset 2

Fig. 2

Bias according to the treatment effect in subset 1 (Scenario 2). Comparison of the “across subsets” and “within subsets” strategies in terms of the bias (A), variance (B) and the coverage of the 95% CI (C) for the estimation of the treatment effect in each subset

Using PS matching, a caliper set at 0.1 gave similar results, so we show only results with the caliper set at 0.2 standard deviations of the logit of the PS. PS matching without replacement resulted in greater bias than the other approaches. This result can be explained by the discarding of treated patients for lack of comparable untreated patients. This bias was thus inversely proportional to the proportion of treated patients who could be matched under both strategies and to the relative number of comparable untreated patients (Fig. 3). When the relative risk of being treated increased in the small subset (subset 1), the “across subsets” strategy was markedly biased compared with the “within subsets” strategy under PS matching without replacement; this bias was, however, controlled by matching with replacement or by PS weighting methods (Supplementary Fig. 4B, Additional file 3). Given the magnitude of this bias, PS matching without replacement is not represented in the following simulations; the results for this method can be found in Supplementary Fig. 5, Additional file 3.
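The discard mechanism driving this bias can be made concrete with a greedy 1:1 nearest-neighbor matcher on the logit of the PS (a sketch with our own names; production implementations such as MatchIt use more refined algorithms):

```python
import math

def logit(e):
    return math.log(e / (1.0 - e))

def match_without_replacement(ps_treated, ps_control, caliper_sd=0.2):
    """Greedy 1:1 nearest-neighbor matching on logit(PS) without replacement.
    Treated patients with no available control within the caliper are
    discarded, which is the source of the bias discussed above."""
    all_logits = [logit(e) for e in ps_treated + ps_control]
    mean = sum(all_logits) / len(all_logits)
    sd = math.sqrt(sum((l - mean) ** 2 for l in all_logits)
                   / (len(all_logits) - 1))
    caliper = caliper_sd * sd
    available = {j: logit(e) for j, e in enumerate(ps_control)}
    pairs, discarded = [], []
    for i, e in enumerate(ps_treated):
        li = logit(e)
        best = min(available, key=lambda j: abs(available[j] - li), default=None)
        if best is not None and abs(available[best] - li) <= caliper:
            pairs.append((i, best))
            del available[best]  # without replacement: each control used once
        else:
            discarded.append(i)
    return pairs, discarded
```

When treated patients greatly outnumber comparable controls, as in subset 1, many treated patients end up in `discarded`, so the matched sample no longer represents the treated population and the ATT estimate is biased.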

Fig. 3

Bias in the estimation of the treatment effect under PS matching without replacement using the “across subsets” or the “within subsets” strategy, according to the treatment prevalence (A), the relative risk to be treated in subset 1 (B) and the treatment effect in subset 1 (C) (Scenario 1)

When the sample size decreased to \(n=300\), convergence issues occurred, notably with SMRW, while the variance inflated with the other methods, especially under the “within subsets” strategy, which also reflects a convergence problem even when an estimate of the treatment effect could be obtained (Supplementary Figs. 6-8, Additional file 3). When the sample size increased from 300 to 5,000, the results were little affected, except that PS matching without replacement achieved a decrease in variance while the bias persisted, resulting in a lowered coverage probability of the confidence interval (Supplementary Fig. 9, Additional file 3), while the type I error rate slightly decreased (Supplementary Fig. 5, Additional file 3). Otherwise, the results were not markedly affected by the size of the subsets (Supplementary Fig. 10, Additional file 3), the prognostic value of the subsets (Supplementary Fig. 11, Additional file 3), or the treatment prevalence (Supplementary Fig. 4, Additional file 3).

When an unobserved confounder was generated, all methods were biased, as expected. The bias was proportional to the confounder's impact on the outcome and inversely proportional to its correlation with an observed covariate (Fig. 4). This also resulted in a decrease in the coverage probability of the confidence interval, more pronounced with the “within subsets” approach. Overall, overlap weighting and matching with replacement were slightly more robust than SMRW weighting.

Fig. 4

Simulations with an unknown confounder. Comparison of the “across subsets” or “within subsets” strategy in terms of bias (A), variance (B) and coverage (C) in the estimation of the treatment’s effect according to the presence of an unknown confounder (Scenario 1)

In the absence of a treatment-by-subset interaction, whether or not there was a treatment effect (Scenarios 3 and 4), the type I error of the Gail and Simon quantitative interaction test was maintained (Supplementary Fig. 12, Additional file 3). However, the “across subsets” strategy appeared slightly more powerful for detecting an interaction in small samples (Fig. 5C). Weighting methods (overlap weighting and SMRW weighting) also appeared more powerful than PS matching with replacement (Fig. 5).
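The Gail and Simon procedure itself is not reproduced here; as a simpler stand-in for testing a quantitative interaction between two subsets, one can use a two-sided Wald test comparing the subset-specific log-HR estimates (a sketch under that substitution, with our own names):

```python
import math

def wald_interaction(theta1, se1, theta2, se2):
    """Two-sided Wald test of H0: theta1 == theta2 across the two subsets.
    Returns the z statistic and its p-value under a standard normal reference."""
    z = (theta1 - theta2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p
```

With identical subset estimates the statistic is zero and the p-value is 1; a large standardized difference between the two log HRs yields a small p-value, flagging a quantitative interaction.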

Fig. 5

Power of the interaction test. Comparison of the power of the Gail and Simon quantitative interaction test by the number of patients (A) (Scenario 1) or the treatment effect (B and C) (Scenario 2). The number of patients is set to n=3,000 in B and n=300 in C (Scenario 1). A robust estimate of the variance was used for SMRW weighting when \(n=300\), rather than bootstrapping, because convergence problems made bootstrapping impractical
