Assessing techniques for dealing with missing not at random data when conducting causal analyses.
This evergreen overview surveys strategies for MNAR data challenges in causal studies, highlighting assumptions, models, diagnostics, and practical steps researchers can apply to strengthen causal conclusions amid incomplete information.
July 29, 2025
When researchers confront data that are missing not at random (MNAR), the central challenge is that the absence of observations carries information about the outcome or treatment. Unlike missing completely at random or missing at random, MNAR mechanisms depend on unobserved factors, complicating both estimation and interpretation. A disciplined approach begins with clarifying the causal question and mapping the data-generating process through domain knowledge. Analysts must then specify a plausible missingness model that links the probability of missingness to observed and unobserved variables, often leveraging auxiliary data or instruments. Transparent documentation of assumptions, and sensitivity analysis for departures from them, is critical for credible causal inference under MNAR conditions.
One foundational tactic for MNAR scenarios is to adopt a selection model that jointly specifies the outcome process and the missing data mechanism. This approach, while technical, formalizes how the likelihood of observing a given data pattern depends on unobserved attributes. By integrating over latent variables, researchers can estimate causal effects with explicit uncertainty that reflects missingness. However, identifiability becomes a key concern; without strong prior information or instrumental constraints, multiple parameter configurations can yield indistinguishable fits. Practitioners often complement likelihood-based methods with bounds analysis, showing how conclusions would shift under extreme but plausible missingness patterns.
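The bounds analysis mentioned above can be sketched in a few lines. The simulation below is a hypothetical illustration, not the article's own method: a binary outcome goes missing more often when it equals one, and worst-case (Manski-style) bounds bracket the estimand no matter how the unseen values behave.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated binary outcome with nonignorable missingness: units with
# Y = 1 are more likely to go unobserved, so missingness depends on the
# (sometimes unseen) outcome itself, an MNAR pattern.
y = rng.binomial(1, 0.4, size=1000)
p_miss = np.where(y == 1, 0.5, 0.1)
observed = rng.random(1000) > p_miss

obs_frac = observed.mean()
miss_frac = 1 - obs_frac

# Naive complete-case estimate of P(Y = 1); biased downward here because
# the ones are preferentially missing.
naive = y[observed].mean()

# Worst-case bounds: decompose E[Y] over observed and missing units and
# let the unseen mean range over its logical extremes [0, 1].
lower = y[observed].mean() * obs_frac + 0.0 * miss_frac
upper = y[observed].mean() * obs_frac + 1.0 * miss_frac
```

By construction the interval `[lower, upper]` contains the full-data mean under any MNAR mechanism; its width, driven by the missingness rate, is an honest measure of what the data alone cannot resolve.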
Designing robust strategies without overfitting to scarce data.
An alternative path relies on doubly robust methods that blend outcome modeling with models of the missing data indicators. In MNAR contexts, one can impute missing values using predictive models that incorporate treatment indicators, covariates, and plausible interactions, then estimate causal effects on each imputed dataset and pool results. Crucially, the doubly robust property implies that consistency is achieved if either the outcome model or the missingness model is correctly specified, offering resilience against misspecification. Yet the quality of imputation hinges on the relevance and richness of observed predictors. When MNAR missingness arises from unmeasured drivers, imputation provides only partial protection.
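One common doubly robust construction is the augmented inverse-probability-weighted (AIPW) estimator. The sketch below, on simulated data where missingness depends only on an observed covariate, is an illustrative assumption rather than the only doubly robust recipe; the response propensity is taken as known for clarity, where in practice it would be estimated (e.g., by logistic regression).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)        # true E[Y] = 2.0

# Missingness depends only on the observed covariate x in this toy setup,
# so both working models below can in principle be correct.
p_obs = 1.0 / (1.0 + np.exp(-(0.5 + x)))
r = rng.random(n) < p_obs                     # r = True when y is observed

# Outcome model: regress observed y on x, predict for everyone.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X[r], y[r], rcond=None)
y_hat = X @ beta

# Response propensity: taken as known here; estimate it in practice.
pi_hat = p_obs

# AIPW: model prediction plus an inverse-probability-weighted residual
# correction on the observed units.
aipw = np.mean(y_hat + r * (y - y_hat) / pi_hat)
```

The correction term vanishes in expectation when either working model is right, which is exactly the resilience to misspecification described above.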
Sensitivity analysis plays a pivotal role in MNAR discussions because identifiability hinges on untestable assumptions. Analysts explore how conclusions change as the presumed relationship between missingness and the unobserved data varies. Techniques include pattern-mixture models, tipping-point analyses, and bounding strategies that quantify the range of plausible causal effects under different missingness regimes. Presenting these results helps stakeholders gauge the robustness of findings and prevents overconfidence in a single estimated effect. Sensitivity should be a routine part of reporting, not an afterthought, especially when decisions depend on fragile information about nonresponse.
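A tipping-point analysis in the pattern-mixture style can be implemented directly: shift the imputed values for nonrespondents by a sensitivity parameter delta and report where the conclusion would flip. Everything below, the simulated data, the delta grid, and the positivity criterion, is a hypothetical sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=0.3, scale=1.0, size=400)   # effect scores; true mean 0.3
observed = rng.random(400) > 0.3               # roughly 30% nonresponse

# Pattern-mixture delta adjustment: assume nonrespondents' values sit
# `delta` units away from the observed mean, then sweep delta over a
# range of pessimistic scenarios.
deltas = np.linspace(-3.0, 0.0, 61)
estimates = []
for delta in deltas:
    y_filled = y.copy()
    y_filled[~observed] = y[observed].mean() + delta
    estimates.append(y_filled.mean())

# Tipping point: the most pessimistic delta at which the estimated
# mean effect is still positive.
tipping = next((d for d, e in zip(deltas, estimates) if e > 0), None)
```

Reporting the tipping point lets readers judge for themselves whether a missingness mechanism severe enough to overturn the finding is plausible in the application at hand.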
Utilizing auxiliary information to illuminate missingness.
When MNAR data arise in experiments or quasi-experiments, causal inference benefits from leveraging external information and structural assumptions. Researchers may incorporate population-level priors or meta-analytic evidence about the treatment effect to stabilize estimates in the presence of missingness. Hierarchical models, for instance, allow borrowing strength across similar units or time periods, reducing variance without prescribing unrealistic homogeneity. Care is required to avoid circular reasoning, ensuring that priors reflect genuine external knowledge rather than convenient fits. The objective remains to produce credible, transportable inferences that hold up across plausible missingness scenarios.
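Borrowing strength across units can be illustrated with a simple empirical-Bayes shrinkage step; a full hierarchical model (e.g., fit by MCMC) would be the more complete version, and the site-level estimates and standard errors below are invented for illustration.

```python
import numpy as np

# Site-level effect estimates with differing precision; sites with heavy
# nonresponse contribute noisier estimates (larger standard errors).
site_effects = np.array([0.8, 0.1, 0.5, -0.2, 0.4])
site_se = np.array([0.10, 0.40, 0.15, 0.50, 0.20])

# Precision-weighted grand mean, plus a crude moment estimate of the
# between-site variance tau^2.
w = 1.0 / site_se**2
grand_mean = np.sum(w * site_effects) / np.sum(w)
tau2 = max(np.var(site_effects) - np.mean(site_se**2), 0.0)

# Shrinkage factor in [0, 1): noisier sites are pulled harder toward the
# grand mean, reducing variance without forcing full homogeneity.
shrink = tau2 / (tau2 + site_se**2)
pooled = shrink * site_effects + (1 - shrink) * grand_mean
```

Note how the shrinkage is adaptive: the partial pooling stabilizes exactly the sites where missingness has eroded precision, while leaving well-measured sites nearly untouched.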
A practical tactic is to collect and integrate auxiliary data specifically designed to illuminate the MNAR mechanism. For example, passive data streams, administrative records, or validator datasets can reveal correlations between nonresponse and outcomes that are otherwise hidden. Linking such information to the primary dataset enables more informative models of missingness and improves identification. When feasible, researchers should predefine plans for auxiliary data collection and specify how these data will update the causal estimates under different missingness assumptions. This proactive approach often yields clearer conclusions than retroactive adjustments alone.
Emphasizing diagnostics and model verification.
In some contexts, instrumental variables can mitigate MNAR concerns when valid instruments exist. An instrument that affects treatment assignment but not the outcome directly (except through treatment) can help disentangle the treatment effect from the bias introduced by missing data. Implementing an IV strategy requires rigorous checks for relevance, exclusion, and monotonicity. When missingness is itself correlated with unobserved determinants of the outcome, IV estimates may still be biased, so researchers must examine the extent to which the instrument strengthens identification relative to baseline analyses. Transparent reporting of instrument validity and diagnostic statistics is essential for credible causal conclusions.
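With a single binary instrument, the IV logic reduces to the Wald estimator: the reduced-form contrast divided by the first-stage contrast. The simulation below is a hedged illustration with an assumed valid instrument and a homogeneous true effect of 1.0; nothing about it comes from the article itself.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20000
u = rng.normal(size=n)                          # unobserved confounder
z = rng.binomial(1, 0.5, size=n)                # instrument: shifts t only
t = (0.5 * z + 0.8 * u + rng.normal(size=n) > 0.4).astype(float)
y = 1.0 * t + 1.0 * u + rng.normal(size=n)      # true effect of t is 1.0

# Naive contrast is confounded upward because u drives both t and y.
naive = y[t == 1].mean() - y[t == 0].mean()

# Wald / 2SLS with a binary instrument: reduced form over first stage.
first_stage = t[z == 1].mean() - t[z == 0].mean()    # relevance check
reduced_form = y[z == 1].mean() - y[z == 0].mean()
iv_est = reduced_form / first_stage
```

Reporting `first_stage` alongside `iv_est` operationalizes the relevance check the text calls for; a weak first stage would make the ratio unstable even with a formally valid instrument.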
Model diagnostics matter just as much as model specifications. In MNAR settings, checking residuals, compatibility with observed data patterns, and the coherence of imputed values with known relationships helps detect misspecifications. Posterior predictive checks or out-of-sample validation can reveal whether the chosen missingness model reproduces essential features of the data. Robust diagnostics also include assessing the stability of treatment effects across alternative model forms and subsets of the data. When diagnostics flag inconsistencies, researchers should revisit assumptions rather than push forward with a potentially biased estimate.
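One concrete out-of-sample check is to mask a slice of values that were actually observed, impute them from the working model, and score the imputations against the held-out truth. The linear model and noise scale below are illustrative assumptions, not the article's specification.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)          # residual sd is 1 by design

# Hold out a random ~20% of genuinely observed outcomes, impute them from
# a model fit on the rest, and compare against the known truth.
mask = rng.random(n) < 0.2
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X[~mask], y[~mask], rcond=None)
y_imp = X[mask] @ beta

# If the working model is adequate, imputation error should sit near the
# irreducible noise level (sd = 1 here); a much larger rmse flags
# misspecification of the imputation model.
rmse = np.sqrt(np.mean((y_imp - y[mask]) ** 2))
```

This validates only the model's fit to the observed-data region; it cannot, by itself, certify behavior on truly MNAR records, which is why it complements rather than replaces sensitivity analysis.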
A disciplined, phased approach to MNAR causal inference.
A principled evaluation framework for NNAR analyses combines narrative argument with quantitative evidence. Researchers should articulate a clear causal diagram that depicts assumptions about missingness, followed by a plan for identifying the effect under those assumptions. Then present a suite of results: primary estimates, sensitivity analyses, and bounds or confidence regions that reflect plausible variations in the missing data mechanism. Clear communication is vital for stakeholders who must make decisions under uncertainty. By organizing results around explicit assumptions and their consequences, analysts foster accountability and trust in the causal conclusions.
Finally, practitioners can adopt a phased workflow that builds confidence incrementally. Start with simple models and transparent assumptions, document limitations, and incrementally incorporate more sophisticated methods as data permit. Each phase should yield interpretable insights, even when NNAR remains a salient feature of the dataset. In practice, this means reporting how conclusions would change under alternative missingness scenarios and demonstrating convergence of results across methods. A disciplined, phased approach reduces the risk of overclaiming and supports sound, evidence-based decision-making in the presence of nonignorable missing data.
Beyond technical choices, organizational culture shapes how MNAR analyses are conducted and communicated. Encouraging skepticism about a single “best” model and rewarding thorough sensitivity exploration helps teams avoid premature certainty. Documentation standards should require explicit statements about missingness mechanisms, data limitations, and the rationale for chosen methods. Collaboration with subject matter experts ensures that domain knowledge informs assumptions and interpretation. Moreover, aligning results with external benchmarks and prior studies strengthens credibility. A culture that values transparency about uncertainty ultimately produces more trustworthy causal conclusions in the face of MNAR challenges.
In sum, addressing missing not at random data in causal analyses demands a blend of principled modeling, sensitivity assessment, auxiliary information use, diagnostics, and clear reporting. There is no universal remedy; instead, robust analyses hinge on transparent assumptions, verification across multiple approaches, and thoughtful communication of uncertainty. By combining selection models, doubly robust methods, and well-justified sensitivity checks, researchers can derive causal insights that survive scrutiny even when missingness cannot be fully controlled. The enduring goal is to illuminate causal relationships while honestly representing what the data can—and cannot—tell us about the world.