Guidelines for constructing propensity score models that account for clustering and hierarchical data structures.
This evergreen guide outlines practical, theory-grounded strategies to build propensity score models that recognize clustering and multilevel hierarchies, improving balance, interpretation, and causal inference across complex datasets.
July 18, 2025
In observational studies, propensity score methods aim to balance observed covariates between treated and untreated groups, approximating randomization. When data exhibit clustering or hierarchical structure—such as patients nested within clinics, students within schools, or repeated measures within individuals—standard propensity score models may fail to capture dependence, leading to biased estimates and overstated precision. The first practical step is to define the level at which treatment assignment occurs and identify the clustering units that influence both treatment and outcomes. This framing informs the modeling choice, helps avoid erroneous independence assumptions, and sets the stage for robust causal estimation that respects the data’s structure.
A foundational recommendation is to incorporate random effects or stratified blocks that reflect the clustering. Mixed-effects propensity score models, which include random intercepts (and potentially random slopes), can absorb unobserved heterogeneity across clusters. By allowing the propensity score to vary by cluster, researchers acknowledge that enrollment practices, access to care, or clinician preferences may differ across sites. These approaches also improve balance diagnostics, because standardized differences are assessed within or across clusters rather than assuming a single global distribution. However, one must guard against overfitting when clusters are small or sparse, which can undermine stability.
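As a concrete starting point, the sketch below fits a cluster-aware treatment model in Python with statsmodels. A true random-intercept logit would use a dedicated mixed-model tool (for example, statsmodels' BinomialBayesMixedGLM, or lme4::glmer in R); here, clinic fixed effects stand in for random intercepts, a reasonable approximation when clusters are moderately sized. All column names (treated, age, severity, clinic) and the file cohort.csv are illustrative assumptions, not part of any specific study.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical cohort: one row per patient, with a binary treatment
# flag, two baseline covariates, and a clinic identifier.
df = pd.read_csv("cohort.csv")  # columns assumed: treated, age, severity, clinic

# Treatment model with a separate intercept per clinic, absorbing
# site-level differences in enrollment practices and access to care.
# A random-intercept logit (mixed model) is the fuller alternative.
ps_fit = smf.logit("treated ~ age + severity + C(clinic)", data=df).fit()
df["pscore"] = ps_fit.predict(df)
```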
Use hierarchical strategies to capture dependence and context.
An explicit modeling strategy is to fit a hierarchical logistic regression for the treatment indicator, with fixed covariates plus random effects for the relevant clusters. This yields cluster-specific propensity scores that reflect local conditions while maintaining comparability across units. Crucially, the random effects help capture unmeasured context-specific factors that could confound the treatment–outcome relationship. After estimating these scores, researchers typically perform matching, weighting, or stratification based on the estimated probabilities. The key is to ensure that the method of balancing respects the multilevel structure, thereby avoiding biased comparisons and inflated variance.
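Continuing the running sketch, the estimated scores can be turned into inverse probability weights or into strata formed within each cluster; the clipping bounds and quintile cut points below are illustrative choices, not prescriptions.

```python
import numpy as np

# ATE-style inverse probability weights built from the cluster-aware
# scores estimated above.
p = df["pscore"].clip(0.01, 0.99)  # guard against extreme scores
df["iptw"] = np.where(df["treated"] == 1, 1.0 / p, 1.0 / (1.0 - p))

# Stratification variant: score quintiles formed within each clinic,
# so treated-control contrasts stay local to a site.
df["ps_stratum"] = df.groupby("clinic")["pscore"].transform(
    lambda s: pd.qcut(s, 5, labels=False, duplicates="drop"))
```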
In addition to hierarchical models, generalized estimating equations (GEE) offer a population-averaged perspective that can be appropriate when cluster sizes vary greatly or when correlation structures are complex. GEEs provide robust standard errors and avoid some convergence issues inherent to random-effects specifications. Whenever possible, report both marginal balance metrics and cluster-level diagnostics to convey how well the approach handles within-cluster dependence. When applying weighting, consider stabilized weights to prevent extreme values that could destabilize the analysis. The ultimate aim is to achieve balance that remains credible under the study’s clustering assumptions.
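A minimal GEE version of the treatment model, again with illustrative names and continuing the sketch above, might look as follows; the stabilized weights place the marginal treatment prevalence in the numerator so that the weights stay near one.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Population-averaged treatment model: GEE with an exchangeable
# working correlation over clinics.
gee_fit = smf.gee("treated ~ age + severity", groups="clinic", data=df,
                  family=sm.families.Binomial(),
                  cov_struct=sm.cov_struct.Exchangeable()).fit()
p_gee = gee_fit.predict(df)

# Stabilized weights: the marginal prevalence in the numerator
# tempers extreme values relative to plain inverse-probability weights.
pt = df["treated"].mean()
df["sw"] = np.where(df["treated"] == 1, pt / p_gee, (1 - pt) / (1 - p_gee))
```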
Balancing approaches must respect data structure and overlap.
A practical step is to examine covariate balance after computing propensity scores with cluster-aware models. Conduct balance checks within clusters to determine whether treated and control units are comparable in each context. If substantial imbalance persists in some clusters, consider site-specific matching or trimming procedures to focus inference on regions with adequate overlap. Document the proportion of units dropped and the remaining effective sample size to avoid overgeneralization. Transparent reporting of balance by cluster helps readers gauge the generalizability of findings and the reliability of causal conclusions drawn from the propensity-adjusted analysis.
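One way to operationalize within-cluster balance checks, continuing the sketch: compute a weighted standardized mean difference per clinic for each covariate and inspect the distribution across sites. The 0.1 threshold mentioned in the comments is a common rule of thumb, not a law.

```python
# Weighted standardized mean difference for one covariate, computed
# clinic by clinic; |SMD| > 0.1 is a common flag for imbalance.
def smd(x, t, w):
    if t.sum() == 0 or t.sum() == len(t):
        return np.nan  # clinic lacks one arm: no within-site contrast
    m1 = np.average(x[t == 1], weights=w[t == 1])
    m0 = np.average(x[t == 0], weights=w[t == 0])
    v1 = np.average((x[t == 1] - m1) ** 2, weights=w[t == 1])
    v0 = np.average((x[t == 0] - m0) ** 2, weights=w[t == 0])
    return (m1 - m0) / np.sqrt((v1 + v0) / 2)

balance = df.groupby("clinic").apply(
    lambda g: smd(g["age"].to_numpy(), g["treated"].to_numpy(),
                  g["sw"].to_numpy()))
print(balance.describe())  # spread of per-clinic SMDs for 'age'
```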
When clusters vary in size, weighting schemes can be tuned to reflect both within-cluster heterogeneity and the desire for overall balance. Calibration or entropy balancing extensions can help align covariate moments across treatment groups while respecting cluster boundaries. Researchers should be mindful of the potential for weighting to amplify noise in small clusters. In such cases, pragmatic thresholds—such as minimum cluster sample sizes or conservative trimming rules—can preserve statistical stability. The combination of hierarchical modeling and thoughtful weighting often yields more credible causal effects in clustered settings.
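A hedged sketch of such pragmatic thresholds, continuing the running example: truncate the stabilized weights at their 1st and 99th percentiles and restrict within-cluster analyses to clinics with a minimum number of units in each arm. Both cut points are illustrative and should be justified for the application at hand.

```python
# Pragmatic stabilizers: truncate weights at the 1st/99th percentiles
# and keep only clinics with at least min_n units in each arm; both
# thresholds are illustrative, not prescriptive.
lo, hi = df["sw"].quantile([0.01, 0.99])
df["sw_trunc"] = df["sw"].clip(lo, hi)

min_n = 10
arm_counts = df.groupby(["clinic", "treated"]).size().unstack(fill_value=0)
keep = arm_counts[(arm_counts >= min_n).all(axis=1)].index
analytic = df[df["clinic"].isin(keep)].copy()  # report what was dropped
```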
Explore interactions and heterogeneity with care.
An essential consideration is the choice of covariates included in the propensity score model. Include variables that predict both treatment assignment and the outcome, while avoiding highly collinear or post-treatment variables. In hierarchical data, some covariates operate at different levels; for example, patient demographics at the individual level and clinic quality indicators at the cluster level. The model should reflect this multilevel architecture, adding cross-level interactions where theory or prior evidence suggests that individual-level predictors of treatment operate differently across contexts. Sensitivity analyses can explore how alternative specifications affect balance and subsequent causal estimates.
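For instance, a specification that mixes levels might look like the following, where clinic_quality is a hypothetical cluster-level covariate assumed to be merged onto each patient row.

```python
# Patient-level terms (age, severity), a clinic-level covariate
# (clinic_quality, assumed pre-merged onto each row), and a
# theory-driven cross-level interaction. Design note: clinic fixed
# effects would absorb any cluster-level covariate, so the two are
# alternatives rather than companions in one model.
spec = "treated ~ age + severity + clinic_quality + severity:clinic_quality"
ps_multilevel = smf.logit(spec, data=df).fit()
```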
Interaction terms between treatment indicators and cluster identifiers can reveal whether treatment effects are heterogeneous across sites. If heterogeneity is detected, stratified reporting by cluster or random-slope models can illuminate where and why effects differ. However, too many interactions may exhaust degrees of freedom in small samples. In such cases, pre-specification based on substantive knowledge or prior research helps maintain interpretability. While exploring complexity is valuable, maintaining a parsimonious and robust model often yields clearer, more actionable insights.
Report uncertainty with appropriate clustering-aware methods.
A critical diagnostic is the assessment of overlap or common support across the propensity score distribution within and across clusters. Without sufficient overlap, comparisons may rely on extrapolation, compromising validity. Visual tools such as density plots by cluster and standardized mean differences before and after weighting can highlight regions of poor overlap. If overlap is limited, consider redefining the target population, focusing on regions with common support, or employing alternative estimators that better accommodate sparse data in certain clusters. Explicitly stating the extent of overlap informs readers about the reliability of causal claims.
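As one rough numerical complement to density plots, the sketch below computes, per clinic, the share of treated units whose score falls within the control group's observed score range. This is a crude illustrative heuristic, not a standard estimator, but it quickly surfaces the sites where overlap is weakest.

```python
# Crude common-support check per clinic: the share of treated units
# whose score falls inside the control group's observed score range.
def overlap_share(g):
    t = g.loc[g["treated"] == 1, "pscore"]
    c = g.loc[g["treated"] == 0, "pscore"]
    if t.empty or c.empty:
        return np.nan
    return t.between(c.min(), c.max()).mean()

support = df.groupby("clinic").apply(overlap_share)
print(support.sort_values().head())  # clinics with the weakest overlap first
```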
In clustered designs, variance estimation requires attention to correlation. Standard errors that neglect within-cluster dependence routinely underestimate uncertainty, yielding overly optimistic confidence intervals. Bootstrap methods that resample at the cluster level, or sandwich-robust variance estimators tailored to hierarchical structures, are common remedies. When reporting results, present both point estimates and appropriately adjusted uncertainty. Transparently communicating the method used to handle clustering strengthens the credibility of conclusions and supports replication efforts across studies with similar data architectures.
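A cluster bootstrap in this setting resamples whole clinics with replacement, so within-site dependence travels with each draw. In the sketch below, the outcome column y and the simple weighted mean difference are placeholders standing in for the full estimation pipeline.

```python
# Cluster bootstrap: resample whole clinics with replacement so that
# within-site correlation is preserved in every bootstrap sample.
rng = np.random.default_rng(2025)
clinics = df["clinic"].unique()

def weighted_diff(d):
    # Placeholder effect estimate: truncated-weight mean difference in
    # a hypothetical outcome column y.
    w = d["sw_trunc"]
    y1 = np.average(d.loc[d["treated"] == 1, "y"], weights=w[d["treated"] == 1])
    y0 = np.average(d.loc[d["treated"] == 0, "y"], weights=w[d["treated"] == 0])
    return y1 - y0

boots = []
for _ in range(500):
    draw = rng.choice(clinics, size=len(clinics), replace=True)
    sample = pd.concat([df[df["clinic"] == c] for c in draw])
    boots.append(weighted_diff(sample))
se = np.std(boots, ddof=1)  # cluster-level bootstrap standard error
```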
Finally, consider the practical implications of your modeling choices for policy or clinical recommendations. Propensity scores that account for clustering may shift estimated effects, alter conclusions about effectiveness, and influence decisions about resource allocation. Stakeholders value analyses that reflect real-world settings, where institutions and communities shape treatment practices. Provide clear explanations of how clustering was addressed, what assumptions were made, and how sensitive results are to alternative specifications. A well-documented, cluster-conscious approach helps bridge methodological rigor and actionable insight.
To close, adopt a disciplined, transparent workflow for propensity score modeling in hierarchical data. Start with a clear definition of the treatment and clustering levels, then select a modeling framework that captures dependence without compromising interpretability. Validate balance at multiple levels, assess overlap rigorously, and report uncertainty with cluster-aware standard errors. Where feasible, conduct sensitivity analyses that test the robustness of findings to alternative random effects structures and weighting schemes. By adhering to these guidelines, researchers can draw credible causal inferences from complex datasets and advance evidence-based practice in fields with nested data.