Estimating firm-level production and markups with machine learning-imputed inputs while preserving identification.
This article explores robust strategies for estimating firm-level production functions and markups when inputs are partially unobserved. It shows how machine learning imputations can preserve identification and limit biases from missing data, and offers practical guidance for researchers and policymakers seeking credible, granular insights.
August 08, 2025
In empirical production analysis, researchers regularly confront incomplete input data at the firm level. The core objective is to quantify how firms transform inputs into outputs and to infer the markup, the wedge between price and marginal cost, embedded in observed decisions. When some inputs are not directly observed, naive imputation can distort parameter estimates, undermining both inference and policy relevance. A rigorous approach must couple an accurate imputation mechanism with a stable identification strategy that ensures the estimated production function reflects causal input-output relationships rather than spurious correlations. This balance of imputation accuracy and identification fidelity defines the practical challenge of modern econometric practice.
One viable route combines machine learning imputations with structural estimation. The idea is to first predict missing inputs using rich observational data and flexible algorithms, then feed those predictions into a structural production-function estimation that tracks marginal products and markups. The imputation model benefits from cross-sectional and time-series variation, regularization, and interpretable feature engineering to avoid overfitting. Crucially, the subsequent estimation step must guard against bias arising from imputations by propagating uncertainty and maintaining consistency conditions that tie inputs to outputs in a theoretically sound way. This two-stage framework, when correctly implemented, preserves essential identification.
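As a minimal sketch of the two-stage idea, the snippet below imputes a missing materials input with a random forest and then fits a log-linear Cobb-Douglas production function on the completed data. The column names, the forest imputer, and the least-squares second stage are illustrative assumptions, stand-ins for whatever structural estimator the application demands.

```python
# Minimal two-stage sketch: impute a missing input, then estimate a
# production function on the completed data. Column names are assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_then_estimate(df: pd.DataFrame) -> np.ndarray:
    observed = df["materials"].notna()
    features = ["labor", "capital", "firm_age", "lag_output"]  # hypothetical correlates

    # Stage 1: learn the missing input from observed correlates.
    imputer = RandomForestRegressor(n_estimators=500, min_samples_leaf=5)
    imputer.fit(df.loc[observed, features], df.loc[observed, "materials"])
    df = df.copy()
    df.loc[~observed, "materials"] = imputer.predict(df.loc[~observed, features])

    # Stage 2: log-linear Cobb-Douglas as a stand-in for the structural
    # estimator (control function, GMM, etc.).
    X = np.column_stack([np.ones(len(df)),
                         np.log(df["labor"]),
                         np.log(df["capital"]),
                         np.log(df["materials"])])
    y = np.log(df["output"])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta  # constant, then labor, capital, materials elasticities
```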
Leveraging rich data while controlling for uncertainty
A central concern is ensuring that imputations do not erase the economic signals that reveal how production decisions respond to input changes. When imputations introduce information not present in the underlying data-generating process, estimates of elasticities and marginal products can drift. To counter this, researchers should treat imputations as latent variables with associated uncertainty rather than as fixed truths. Methods that incorporate prediction intervals, multiple imputation cycles, and Bayesian revisions help keep the estimation honest about what is known and what remains guesswork. The result is a more faithful reflection of the underlying production technology.
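One concrete way to treat imputations as latent rather than fixed, assuming an imputer with out-of-bag predictions such as a random forest, is to return a set of stochastic draws instead of a single point prediction. The resampled out-of-bag residuals below are a rough stand-in for a full posterior predictive distribution.

```python
# Imputations as draws, not fixed truths: produce M completed versions of
# the missing input by adding resampled out-of-bag residuals to the point
# predictions (a rough stand-in for a posterior predictive distribution).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def draw_imputations(X_obs, y_obs, X_mis, M=20, seed=0):
    rng = np.random.default_rng(seed)
    model = RandomForestRegressor(n_estimators=500, oob_score=True)
    model.fit(X_obs, y_obs)
    residuals = np.asarray(y_obs) - model.oob_prediction_  # held-out errors
    point = model.predict(X_mis)
    return [point + rng.choice(residuals, size=len(point), replace=True)
            for _ in range(M)]
```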
Another key principle is leveraging economic constraints to regularize imputations. Production functions possess monotonicity, convexity, and returns-to-scale properties that can be encoded into learning objectives. By embedding these properties into the imputation model—through constrained optimization, monotone neural networks, or shape-preserving transformations—one can reduce implausible imputations without sacrificing predictive power. The combination of data-driven imputations with theory-grounded restrictions strengthens both the plausibility of predicted inputs and the credibility of the subsequent production estimates, especially for firms with sparse observations.
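For tree ensembles, one readily available option is scikit-learn's monotonic constraint argument, which forces predictions to be non-decreasing in chosen features. The feature ordering below is an assumption for illustration only.

```python
# Shape-constrained imputation: require the predicted input to be
# non-decreasing in firm-scale proxies, consistent with a monotone
# technology. Feature order [log_labor, log_capital, firm_age] is assumed.
from sklearn.ensemble import HistGradientBoostingRegressor

# 1 = non-decreasing in that feature, 0 = unconstrained, -1 = non-increasing
imputer = HistGradientBoostingRegressor(monotonic_cst=[1, 1, 0], max_depth=4)
# fit/predict as in the earlier sketches: imputer.fit(X_obs, y_obs), etc.
```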
Data richness matters because imputations rely on correlates available in the observed features. Details such as firm size, sectoral dynamics, regional conditions, asset tangibility, and historical production patterns often determine missing input values. A well-designed imputation model uses cross-sectional heterogeneity and temporal autocorrelation to infer likely input levels. Importantly, the model should quantify uncertainty about each imputed value, enabling standard errors of production parameters to reflect both sampling variation and imputation risk. This dual accounting helps avoid overstated confidence in production elasticities and markup estimates.
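A simple way to attach uncertainty to each imputed value, sketched below with the same hypothetical X_obs/y_obs arrays as before, is to fit quantile regressions at a lower, middle, and upper quantile, so every imputation carries an explicit interval into the estimation stage.

```python
# Per-value prediction intervals via quantile regression: each imputed
# input arrives with an explicit band, not just a point estimate.
from sklearn.ensemble import GradientBoostingRegressor

def quantile_imputations(X_obs, y_obs, X_mis, quantiles=(0.1, 0.5, 0.9)):
    bands = {}
    for q in quantiles:
        model = GradientBoostingRegressor(loss="quantile", alpha=q)
        model.fit(X_obs, y_obs)
        bands[q] = model.predict(X_mis)
    return bands  # bands[0.9] - bands[0.1] gives an 80% interval width
```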
Beyond prediction accuracy, interpretability plays a vital role. Stakeholders prefer transparent imputation mechanisms that reveal why a particular input is predicted to take a given value. Techniques such as SHAP values, partial dependence plots, or local interpretable approximations can illuminate which features drive imputations. When researchers communicate which inputs were most influential and how imputed values align with observed patterns, the resulting narrative strengthens trust in the estimates. Interpretability thus complements identification by clarifying the pathways through which inputs influence production.
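For tree-based imputers like those sketched above, permutation importance and partial dependence are available directly in scikit-learn. The snippet assumes a fitted `imputer` and a feature DataFrame `X_obs` containing a hypothetical `log_capital` column.

```python
# Which features drive the imputations? Permutation importance ranks them;
# partial dependence shows how predictions move with a single feature.
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

result = permutation_importance(imputer, X_obs, y_obs, n_repeats=10)
for name, score in zip(X_obs.columns, result.importances_mean):
    print(f"{name}: importance {score:.3f}")

PartialDependenceDisplay.from_estimator(imputer, X_obs, features=["log_capital"])
plt.show()
```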
Integrating imputation with a structural markup analysis
To estimate markups in a production framework, one must separate price effects from quantity decisions. A common tactic is to model output as a function of inputs while allowing a simultaneous equation for revenue or price that captures markup behavior. Imputed inputs enter both equations, but with proper identification restrictions, researchers can disentangle marginal productivity from pricing power. The identification often relies on instruments, functional form restrictions, or timing assumptions that link input choices to costs and output. When imputations are handled with care, the inferred markups reflect genuine firm-level pricing power rather than artifacts of missing data.
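One widely used formulation from the production approach recovers the markup as the output elasticity of a flexible input divided by that input's share of revenue. The toy numbers below are illustrative only.

```python
# Production-approach markup: elasticity of a flexible input (from the
# estimated production function) over that input's revenue share (from data).
def markup(input_elasticity, input_expenditure, revenue):
    revenue_share = input_expenditure / revenue
    return input_elasticity / revenue_share

# Illustrative: elasticity 0.45 with a 30% revenue share implies markup 1.5
print(markup(0.45, input_expenditure=30.0, revenue=100.0))
```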
A practical strategy is to use a control-function approach augmented with imputed inputs. In this setup, the residual variation in input choices that is unexplained by observed predictors is captured by a control term that absorbs endogeneity and measurement error. The imputed inputs contribute to both the production function and the cost structure, but the control function isolates the portion of variation attributable to unobserved factors. The method yields more reliable estimates of both production elasticity and markup, provided that the control term remains well-specified and that the imputation uncertainty is propagated through the inference.
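A stylized two-step version of that idea, assuming instruments Z such as lagged input choices, is sketched below. Practical implementations typically let the control term enter flexibly, as in Olley-Pakes-style polynomial controls, rather than linearly.

```python
# Stylized control function: first-stage residuals from the input-choice
# equation enter the production equation to absorb unobserved factors.
import numpy as np
import statsmodels.api as sm

def control_function(y, x_input, Z):
    # Stage 1: (imputed) input choice regressed on instruments Z
    first = sm.OLS(x_input, sm.add_constant(Z)).fit()
    control = first.resid  # variation unexplained by observed predictors

    # Stage 2: production function with the control term included;
    # real applications let `control` enter flexibly, not just linearly.
    X = sm.add_constant(np.column_stack([x_input, control]))
    return sm.OLS(y, X).fit()  # coefficient on x_input is the elasticity
```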
Robust inference under imputation uncertainty
A robust inference framework treats imputations as stochastic components. Analysts should use multiple imputation to create several data sets, each containing imputations consistent with the observed data and the economic model. Estimation is then performed across these data sets, and results are combined to produce pooled estimates and standard errors that reflect imputation variability. This approach guards against underestimating uncertainty and reduces the risk of overconfident conclusions about production elasticities and markups. In practice, it also helps diagnose sensitivity to different imputation specifications.
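The standard pooling recipe is Rubin's rules: estimate on each completed data set, then combine so that the reported variance includes both within- and between-imputation components. A minimal implementation:

```python
# Rubin's rules: pooled estimate is the mean across M imputed data sets;
# total variance = within-imputation variance + (1 + 1/M) * between.
import numpy as np

def pool_rubin(estimates, variances):
    estimates = np.asarray(estimates)   # shape (M, n_params)
    variances = np.asarray(variances)   # squared standard errors, same shape
    M = len(estimates)
    q_bar = estimates.mean(axis=0)               # pooled point estimate
    W = variances.mean(axis=0)                   # within-imputation variance
    B = estimates.var(axis=0, ddof=1)            # between-imputation variance
    T = W + (1 + 1 / M) * B                      # total variance
    return q_bar, np.sqrt(T)                     # estimates and std. errors
```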
Computational considerations matter because machine learning imputations can be resource-intensive. Researchers should balance model complexity with stability, avoiding black-box pitfalls by preferring models that are interpretable or at least offer transparent uncertainty quantification. Cross-validation helps select models that generalize beyond the sample, while bootstrap methods can complement multiple imputation for variance estimation. Documenting the imputation procedure, including data preprocessing, feature selection, and hyperparameter choices, enhances replicability and allows others to assess the robustness of the identified production mechanism.
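Cross-validated comparison of candidate imputers is one cheap safeguard. The sketch below, again with hypothetical X_obs/y_obs arrays, scores a linear and a tree-based model out of sample before committing to either.

```python
# Model selection for the imputer: prefer the candidate that generalizes,
# not the one that fits the training firms best.
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

candidates = {"ridge": Ridge(alpha=1.0),
              "forest": RandomForestRegressor(n_estimators=300)}
for name, model in candidates.items():
    scores = cross_val_score(model, X_obs, y_obs, cv=5, scoring="r2")
    print(f"{name}: mean out-of-sample R^2 = {scores.mean():.3f}")
```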
Practical guidance for researchers and policymakers
For practitioners, a practical workflow begins with a careful data audit to catalog missingness patterns and their potential economic implications. Then, choose an imputation strategy informed by the theoretical structure of the production process. Where possible, integrate economic constraints into the learning stage, ensuring the imputations align with monotonicity and returns to scale. After imputations, implement a structural estimation that explicitly models production and price decisions, using instruments or restrictions that preserve identification. Finally, report imputation uncertainty alongside point estimates, so readers can gauge the reliability of the production-and-markup narrative.
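The audit step can be as simple as tabulating missingness shares overall and by group; the column and group names below are placeholders.

```python
# Missingness audit: shares of missing values overall and by sector help
# flag gaps that are unlikely to be missing at random.
import pandas as pd

def audit_missingness(df: pd.DataFrame, column="materials", group="sector"):
    overall = df.isna().mean().rename("share_missing")
    by_group = df.groupby(group)[column].apply(lambda s: s.isna().mean())
    return overall, by_group  # inspect both before choosing an imputer
```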
The payoff of this integrated approach is a more credible, granular view of firm behavior under incomplete information. By marrying machine learning imputations with solid identification strategies, researchers can recover nuanced insights into how firms transform inputs into outputs and how they exercise pricing power. The combination yields policy-relevant evidence about efficiency, competition, and innovation across industries. While challenging, the discipline of transparent imputation and rigorous inference ultimately strengthens the empirical foundations for understanding firm-level production and market dynamics in an increasingly data-rich, imperfect-information world.