Causal Inference with Quasi-Experimental Data

Kathleen T. Li, Lan Luo and Adithya Pattabhiramaiah

In this article, we survey the methodological toolkit available to empirical researchers who are interested in making causal inferences using quasi-experimental data. In particular, Figure 1 provides an overview of the type of data available to researchers (e.g., randomized treatment, rich or constrained availability of observables, small or large number of time periods or treatment units) and describes corresponding suitable approaches, along with some pros and cons involved in their tactical use.

In marketing, randomized experimentation represents the gold standard for making causal inference using empirical data. In an ideal setting, we would randomly assign participants to different groups to receive varying types or levels of treatment. A rich history of research, including work published in the Journal of Marketing Research, has used randomized experimental designs for causal inference (see Ghose et al. [2024] and Cao, Chintagunta, and Li [2023] for recent examples). Nonetheless, there are many marketing-relevant settings where researchers do not have access to experimental data, or where running such experiments is too expensive or infeasible. Additionally, ethical considerations frequently preclude randomly assigning treatments, such as instances where it could lead to harm or deprive participants of necessary care, as in the case of life-saving treatments or medications. When randomization of treatment is not possible, one may have to rely on enhanced “statistical rigor” to compensate for the deficiencies in “design rigor.”


In the remainder of this article, we review the common challenges pertaining to making causal inference with quasi-experimental data and discuss recent advances in helping alleviate them. Specifically, we focus on discussing methods that emphasize matching units based on the outcome variable (Y). Then, we provide an overview of methods that focus on matching on observable covariates (X). Lastly, we conclude by reflecting on our recommendations and discussing future research in related areas.

Difference-in-Differences

In (the many) situations where researchers do not have the luxury of assigning units into treatment and control groups, they can still understand causal effects by leveraging quasi-experimental methods. The difference-in-differences (DID) method is the most widely used quasi-experimental method. It can be used in data settings with treatment and control units and pre- and post-treatment time periods.

Figure 1: Overview of Design Choices in Quasi-Experimental Settings

Here, we begin by describing the simplest DID design, where all treatment units are treated at the same time. We observe treatment and control units over time, so $y_{it}$ denotes the outcome for unit $i$ at time $t$. The DID model can be estimated using the following regression:

$$y_{it} = \beta_1 + \beta_2\,\mathrm{Treat}_i + \beta_3\,\mathrm{Post}_t + \beta_4\,\mathrm{Treat}_i \times \mathrm{Post}_t + x_i'\tilde{\beta} + \epsilon_{it}, \quad (1)$$

where $\mathrm{Treat}_i$ is a treatment indicator that takes a value of 1 if unit $i$ belongs to the treatment group and 0 if it belongs to the control group, $\mathrm{Post}_t$ is a posttreatment time period indicator that takes a value of 1 if time period $t$ is in the posttreatment period and 0 otherwise, $x_i$ is a $k$-dimensional vector of time-invariant observable covariates, $\tilde{\beta} = (\beta_5, \ldots, \beta_{4+k})'$ is the corresponding coefficient vector, and $\epsilon_{it}$ is an error term. The coefficient $\beta_4$ is the causal effect of interest, which is the average treatment effect (ATE) or the average treatment effect on the treated (ATT).
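To make estimation concrete, below is a minimal sketch of Equation 1 in Python with statsmodels. The file name and column names ("y", "treat", "post", "x1", "unit") are hypothetical, and standard errors are clustered by unit, a common choice in DID applications.

```python
# Hedged sketch: estimate Equation 1 on a hypothetical long-format panel
# (one row per unit-period) with columns y, treat, post, x1, and unit.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("panel.csv")  # hypothetical data file

did = smf.ols("y ~ treat + post + treat:post + x1", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]}  # cluster SEs by unit
)
print(did.params["treat:post"])  # beta_4: the DID estimate of the treatment effect
```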

Assumptions

Before choosing the DID identification strategy, researchers first need to assess whether the identifying parallel trends assumption holds. The DID parallel trends assumption states that the treatment unit would have followed a path parallel to the control units in the absence of treatment. We make two observations about this assumption. First, the DID method can be interpreted as a method that primarily matches on outcomes. Although covariates can be included in the DID regression model, the main goal is to use the control units’ outcomes to match the treatment unit’s outcome during the pretreatment period, and then predict the treatment counterfactual and the ATT. Second, since the parallel trends assumption is a statement about the treatment counterfactual, we cannot directly test the parallel trends assumption. However, what we can do is test whether the treatment and control units followed parallel trends in the pretreatment period (parallel pretrends assumption). This is essentially the testable part of the parallel trends assumption. There are two popular approaches to check the testable part of the parallel trends assumption: (1) visual inspection and (2) statistical tests. Visual inspection involves plotting the treatment and control trends in the pretreatment period and inspecting whether they look parallel. Statistical tests “formalize” this evaluation somewhat, by testing whether the differences in mean outcomes between the treatment and control groups, computed for each pretreatment period, are statistically different from a constant. However, statistical tests often have low power (i.e., they may fail to reject the null hypothesis of no difference even when the parallel pretrends assumption is violated), implying that visual inspection can be the only viable way of assessing whether the parallel pretrends assumption holds.
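As an illustration, the sketch below implements both checks on the hypothetical panel from the previous snippet: a plot of pretreatment group means and a simple test of whether the treatment-control gap trends over the pretreatment window. The treatment start period T0 is an assumed value.

```python
# Hedged sketch: visual and statistical checks of parallel pretrends.
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

T0 = 10                                   # assumed first post-treatment period
pre = df[df["period"] < T0]

# (1) Visual inspection: mean outcome by group in each pretreatment period.
pre.groupby(["period", "treat"])["y"].mean().unstack().plot(marker="o")
plt.xlabel("period"); plt.ylabel("mean outcome"); plt.show()

# (2) Statistical check: does the treatment-control gap trend before treatment?
gap = (pre[pre["treat"] == 1].groupby("period")["y"].mean()
       - pre[pre["treat"] == 0].groupby("period")["y"].mean()).rename("gap")
trend = smf.ols("gap ~ period", data=gap.reset_index()).fit()
print(trend.pvalues["period"])            # small p-value suggests non-parallel pretrends
```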

If the parallel pretrends assumption of the DID method is violated, researchers can attempt to use matching methods (described in more detail in the “Selection on Observables” section) to first identify a subset of control units that are more similar to the treatment unit on covariates and then apply the DID method. Alternatively, if the number of treatment units is not very large (e.g., less than 100), researchers can apply the synthetic control or related methods (described in more detail in the “Synthetic Control and Related Methods” section).

Another assumption underlying the use of DID methods (and many causal inference methods) is the stable unit treatment value assumption (SUTVA). SUTVA can be decomposed into two parts. The first part of SUTVA (no interference) requires that treatment applied to one unit does not affect the outcome of other units. To understand this better, let us consider the case where a state experiences a treatment (e.g., enactment of a local tax law). This should mean that the treatment should not affect outcomes in the control states or other treatment states (or vice versa). Researchers can use logical argumentation based on institutional knowledge to justify this assumption (e.g., geographical variation in treatment and control units makes interference unlikely). To the extent that researchers have access to more granular data, they can also check and confirm that there is no movement of individuals across treatment and control states, patterns that support the case for no interference. The second part of SUTVA (no hidden variations of treatment) requires that for each unit, there are no different forms or versions of each treatment level that may lead to different potential outcomes. Researchers can use institutional knowledge about the treatment itself as justification for this assumption.

Staggered Treatment Timing

As researchers, we are commonly faced with treatments that apply to different units at different times. This is referred to as differential timing of treatment or staggered treatment timing. When treatment effects are homogeneous over time, the following regression equation can be used to estimate the DID model:

$$y_{it} = \beta_1 + \beta_2\,\mathrm{Treat}_i \times \mathrm{Post}_t + x_i'\tilde{\beta} + \mathrm{FE}_i + \mathrm{FE}_t + \epsilon_{it}. \quad (2)$$

This model is called a two-way fixed effects (TWFE) model because it contains fixed effects for both unit ($\mathrm{FE}_i$) and time ($\mathrm{FE}_t$). However, the homogeneity in treatment response can be a restrictive assumption, as it is common that the treatment effect may change over time or differ across treatment units.
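Below is a minimal sketch of the TWFE specification in Equation 2, continuing with the hypothetical panel. The indicator d switches on once a unit is treated, and the unit fixed effects absorb time-invariant covariates such as $x_i$.

```python
# Hedged sketch: TWFE regression with unit and time fixed effects entered as
# dummies (fine for small panels; use absorbing regression routines for large ones).
df["d"] = df["treat"] * df["post"]       # Treat_i x Post_t indicator

twfe = smf.ols("y ~ d + C(unit) + C(period)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]}
)
print(twfe.params["d"])                  # the TWFE treatment effect estimate
```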

Goodman-Bacon (2021) identified that for the case of heterogeneous treatment effects and staggered treatment timing, the conventional TWFE model breaks down (i.e., yields a biased average treatment effect). The intuition is that the conventional TWFE model uses a weighted average of all potential DID estimates obtained using different combinations of groups of treatment and control units (treated, not yet treated, and never treated groups). However, a group that has already been treated can only be used as a control group for a group that is treated later, in the case of time-invariant treatment effects. In other words, the already treated group is not a “clean” control when the treatment effect varies over time (we also refer to this as heterogeneous or time-variant treatment effects). Because the standard TWFE model uses control units that are not “clean,” it is biased when the treatment effect varies over time. While Goodman-Bacon (2021) helps the researcher identify the specific sources of bias in their setting by decomposing the standard DID estimator into different underlying comparisons (e.g., early vs. late treated, treated vs. never treated), this paper does not provide a single, unbiased estimate of the treatment effect.

To solve the problem identified in Goodman-Bacon (2021), scholars have proposed many estimators. We discuss three proposed solutions. The first solution is an estimator proposed by Callaway and Sant’Anna (2021). It is perhaps the most popular and widely used. The main idea is to first estimate the ATT for each treatment group cohort and then use a weighted average of those ATTs. Specifically, ATT(g) is the ATT for the treatment group cohort that first receives treatment at time period g. To estimate ATT(g), we can use a simple DID setup, where the treatment group is treatment cohort g and the control group consists of units that are not yet treated or never treated (excluding units that are already treated). Then, the overall ATT is a weighted average of all the ATT(g)’s estimated over different time periods. Callaway and Sant’Anna (2021) can accommodate an outcome-regression-based estimator (Heckman, Ichimura, and Todd 1997), an inverse probability weighted estimator (Abadie 2005), and a doubly robust estimator (Sant’Anna and Zhao 2020).
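The following back-of-the-envelope sketch illustrates this cohort logic; it is not the full Callaway and Sant’Anna (2021) estimator (their R package did implements it with covariates and inference). It assumes a hypothetical column g holding each unit’s first treatment period, with never-treated units coded as np.inf.

```python
# Hedged sketch: cohort-by-cohort ATT(g) using only not-yet-treated (or never
# treated) units as controls, then a size-weighted average across cohorts.
import numpy as np

def att_gt(df, g, t):
    """Simple 2x2 DID for cohort g at period t (baseline period g - 1)."""
    def change(mask):
        grp = df[mask]
        return (grp.loc[grp["period"] == t, "y"].mean()
                - grp.loc[grp["period"] == g - 1, "y"].mean())
    return change(df["g"] == g) - change(df["g"] > t)   # controls: not yet treated at t

cohorts = sorted(df.loc[np.isfinite(df["g"]), "g"].unique())
last = int(df["period"].max())
att_g = {g: np.mean([att_gt(df, g, t) for t in range(int(g), last + 1)]) for g in cohorts}

sizes = df[np.isfinite(df["g"])].groupby("g")["unit"].nunique()
overall_att = sum(att_g[g] * sizes[g] for g in cohorts) / sizes.sum()
print(att_g, overall_att)
```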

The second solution, which is closely related to Callaway and Sant’Anna (2021), is an estimator proposed by Sun and Abraham (2021). This estimator is an extension of Callaway and Sant’Anna (2021) to the case of dynamic treatment effects, where researchers are interested in estimating separate treatment effects for each posttreatment time period. This setting experiences the same problem that already treated units are not “clean” controls when the treatment effect is heterogeneous. If researchers need to calculate a treatment effect estimate for every posttreatment period, then they should use Sun and Abraham (2021).

The third solution is the “stacked regression” proposed by applied researchers (Cengiz et al. 2019; Gormley and Matsa 2011). The main idea is to create separate clean datasets of treatment groups and “clean” control groups, stack them by aligning the intervention time period, and use the DID TWFE regression with dataset and time fixed effects. While in practice, this solution is the simplest to implement, one may need to be cautious, as the sample average ATT may be inconsistent (Baker, Larcker, and Wang 2022).
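A minimal sketch of the stacking idea follows, using the same hypothetical panel and cohort column g: each treated cohort is paired with never-treated (“clean”) controls, event time is recentered on the cohort’s treatment date, and a single regression with stack-specific unit and period fixed effects is run on the stacked data.

```python
# Hedged sketch: stacked DID with clean (never-treated) controls.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

stacks = []
for g in cohorts:                                        # cohorts from the previous sketch
    sub = df[(df["g"] == g) | (~np.isfinite(df["g"]))].copy()
    sub["event_time"] = sub["period"] - g
    sub["d"] = ((sub["g"] == g) & (sub["event_time"] >= 0)).astype(int)
    sub["stack"] = int(g)
    stacks.append(sub)
stacked = pd.concat(stacks, ignore_index=True)

m = smf.ols("y ~ d + C(stack):C(unit) + C(stack):C(period)", data=stacked).fit(
    cov_type="cluster", cov_kwds={"groups": stacked["unit"]}
)
print(m.params["d"])
```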

Given the discussions above, we provide the following recommendations. First, if all treatment units receive treatment at the same time (no staggered timing), use the standard DID TWFE model. However, if there is staggered treatment timing, use one of the proposed solutions and justify the choice of clean controls. Finally, as a robustness check, researchers can separately estimate a DID TWFE for each treatment cohort group using clean controls (e.g., the never treated group) (Baker, Larcker, and Wang 2022). Note that each of these separate analyses does not suffer from the issues that affect staggered DID with heterogeneous treatment effects because each analysis only examines one treatment cohort at a time (no staggered treatment timing). While this robustness check does not result in an aggregated treatment effect, it may still be informative to show what the separate ATTs are for each treatment group cohort.[1]

For a more detailed discussion of these three solutions among many others, we refer readers to two review papers: “What’s Trending in Difference-in-Differences? A Synthesis of the Recent Econometric Literature” by Roth et al. (2023) and “How Much Should We Trust Staggered Difference-in-Differences Estimates?” by Baker, Larcker, and Wang (2022). Note that all of the methods discussed in these two articles deal with settings characterized by a large number of treatment and control units and relatively short time periods.

Synthetic Control and Related Methods

While the DID method is the most popular quasi-experimental method, it is often not viable because of observed violations of the parallel pretrends assumption. To overcome this, there has been a recent surge in flexible alternative estimators that are more widely applicable than DID. The best known of these methods is the synthetic control method proposed by Abadie and Gardeazabal (2003) and Abadie, Diamond, and Hainmueller (2010). The synthetic control method has been called “arguably the most important innovation in the evaluation literature in the last fifteen years” by Susan Athey and Guido Imbens (the latter a co-recipient of the 2021 Nobel Prize in Economics for “methodological contributions to the analysis of causal relationships”).

The synthetic control method uses a weighted average of the control units (instead of the simple average used in DID) to predict the treatment counterfactual and the ATT. The synthetic control method, like DID, primarily matches on outcomes. It achieves this by using a weighted average of control units’ outcomes to match the treatment unit’s outcomes during the pretreatment period. This approach better matches the treatment outcomes in the pretreatment period, which in turn improves the prediction of the counterfactual and, consequently, the estimate of the ATT. Historically, however, guidance for conducting proper inference with the synthetic control method was limited: researchers had to rely on placebo tests, which entail making restrictive assumptions that are often violated. Li (2020) developed the inference theory for the synthetic control method, which allows researchers to calculate confidence intervals and quantify uncertainty using a subsampling procedure.
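The core weighting step can be sketched as a constrained least-squares problem: find nonnegative control weights summing to one that best reproduce the treated unit’s pretreatment outcome path. The arrays below are hypothetical (Y0_pre and Y0_post hold control outcomes with periods in rows and control units in columns; y1_pre and y1_post hold the treated unit’s outcomes); the original method can also incorporate covariates, and inference follows Li (2020).

```python
# Hedged sketch: synthetic control weights via constrained least squares.
import numpy as np
from scipy.optimize import minimize

def synth_weights(Y0_pre, y1_pre):
    """Nonnegative weights summing to one that match the pretreatment path."""
    J = Y0_pre.shape[1]
    objective = lambda w: np.sum((y1_pre - Y0_pre @ w) ** 2)
    constraint = {"type": "eq", "fun": lambda w: np.sum(w) - 1.0}
    res = minimize(objective, np.full(J, 1.0 / J),
                   bounds=[(0.0, 1.0)] * J, constraints=[constraint])
    return res.x

w = synth_weights(Y0_pre, y1_pre)            # hypothetical arrays, shapes (T_pre, J) and (T_pre,)
att = np.mean(y1_post - Y0_post @ w)         # average post-period gap = estimated ATT
```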

While the synthetic control method is a very powerful new tool, it still has some associated restrictions. First, it is unable to easily accommodate settings involving a large number of treated units. Second, it is less suited for handling situations wherein the treatment and control units are very different from one another (e.g., situations where the outcome for the treatment unit is outside the range of that for the control units). Such settings call for more flexible methods, the most popular of which is the factor model.

In many marketing contexts, researchers have access to a large number of control units. The factor model in particular has gained traction in marketing due to its ability to elegantly handle such situations via dimension reduction. Conveniently, the dimension reduction of the factor model also serves as implicit regularization to prevent overfitting. The factor model is also known as generalized synthetic control (Xu 2017) or interactive fixed effects model (Chan and Kwok 2016; Gobillon and Magnac 2016).

Marketing researchers have applied this method in a variety of settings ranging from policy evaluation (Guo, Sriram, and Manchanda 2020; Pattabhiramaiah, Sriram, and Manchanda 2019) to advertising effects measurement (Lovett, Peres, and Xu 2019). How should researchers quantify uncertainty when using the synthetic control in combination with the factor model? Past research has recommended the use of a bootstrap procedure (Xu 2017), which can be restrictive. Li and Sonnier (2023) show that the bootstrap procedure provided in Xu (2017) often results in biased confidence intervals that are either too narrow or too wide, leading to false precision or false imprecision. False precision may lead researchers to erroneously conclude that they detected a true effect, whereas false imprecision may lead researchers to erroneously conclude that there was no detectable true effect. Both mistakes—false positives and false negatives—can lead to incorrect business decisions. Following the inference theory in Li and Sonnier (2023), researchers can correctly quantify uncertainty of causal effects to make more informed business decisions.

The synthetic DID method proposed by Arkhangelsky et al. (2021) is another flexible quasi-experimental method that has gained traction in marketing. Marketing scholars have used synthetic DID to study the effect of TV advertising on online browsing and sales (Lambrecht, Tucker, and Zhang 2024) and the effect of soda taxes on marketing effectiveness (Keller, Guyt, and Grewal 2024). The synthetic DID method proposes a general framework that uses both individual weights and time weights for additional flexibility. To conduct inference and compute standard errors, Arkhangelsky et al. offer three alternative procedures: (1) block bootstrap, (2) jackknife, and (3) permutation. Block bootstrap and jackknife require a large number of treatment units, without which the estimated confidence intervals may be unreliable (Clarke et al. forthcoming). Permutation does not have any restriction on the number of treatment units, but it requires a moderate to large number of control units and requires that the treatment unit and control units’ variances be similar.

Next, we overview two additional quasi-experimental methods. The first is the ordinary least squares (OLS) method proposed by Hsiao, Ching, and Wan (2012) (also called the HCW method). The OLS method can be used when the number of control units is (much) smaller than the number of pretreatment time periods. Due to the increased flexibility of both the OLS and synthetic DID methods, it is even more important when using these methods to check for overfitting using the backdating exercise described in the last paragraph of this section. Another method is the matrix completion method (for additional details, see Athey and Imbens [2019] and Bai and Ng [2021]). The matrix completion method imputes missing values in a panel data setup when potential outcomes have a factor structure and uses the imputed counterfactuals to estimate the ATT. In other words, this approach estimates missing values in a dataset using principal components analysis, which can be used to estimate the effects of a treatment when some of the potential outcomes are missing (for a recent marketing application of this method, see Bronnenberg, Dubé, and Sanders [2020]).
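A minimal sketch of the HCW/OLS idea, reusing the hypothetical outcome arrays from the synthetic control sketch: regress the treated unit’s pretreatment outcomes on the control units’ outcomes (feasible when the number of controls is well below the number of pretreatment periods), then project the posttreatment counterfactual.

```python
# Hedged sketch: the OLS (HCW) approach of Hsiao, Ching, and Wan (2012).
import numpy as np
import statsmodels.api as sm

X_pre = sm.add_constant(Y0_pre)           # control outcomes as regressors, periods as rows
fit = sm.OLS(y1_pre, X_pre).fit()

y1_counterfactual = fit.predict(sm.add_constant(Y0_post))
att_hcw = np.mean(y1_post - y1_counterfactual)
print(att_hcw)
```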

All the methods described so far are frequentist methods. However, each of these methods can be estimated using a Bayesian framework. Kim, Lee, and Gupta (2020) propose a Bayesian synthetic control method that uses Bayesian shrinkage priors to solve the sparsity problem and conduct inference. Pang, Liu, and Xu (2022) propose a Bayesian factor model that uses a Bayesian shrinkage method for model searching and factor selection.

The synthetic control method and the related flexible alternatives we have discussed thus far (e.g., the factor model, synthetic DID, OLS, and matrix completion methods) require access to a sufficiently long pretreatment time window (e.g., at least ten pretreatment time periods). However, what if researchers need a flexible alternative to DID but do not have access to a sufficient number of time periods before the treatment occurs? To fill this gap, researchers can consider using the augmented DID or forward DID methods. Specifically, if the outcome for the treatment unit is outside the range of that of the control units, researchers can use the augmented DID method, which uses a scaled average of the control units to construct the treatment counterfactual (Li and Van den Bulte 2023). On the other hand, if the outcome for the treatment unit is within the range of that of the control units, researchers can consider using the forward DID method, which uses a forward selection algorithm to select a relevant subset of control units and then applies the DID method (Li 2024).
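To illustrate the flavor of forward selection, the sketch below greedily adds the control unit that most improves the pretreatment fit of a simple average-of-controls DID counterfactual and keeps the best-fitting subset; this is a simplified illustration only, and Li (2024) should be consulted for the actual selection and stopping rules.

```python
# Hedged sketch: greedy forward selection of control units for DID
# (simplified illustration of the forward DID idea; not Li 2024's exact algorithm).
import numpy as np

def forward_select_controls(Y0_pre, y1_pre):
    remaining = list(range(Y0_pre.shape[1]))
    selected, candidates = [], []
    while remaining:
        errors = []
        for j in remaining:
            gap = y1_pre - Y0_pre[:, selected + [j]].mean(axis=1)
            errors.append(np.mean((gap - gap.mean()) ** 2))   # DID pretreatment fit
        best = remaining[int(np.argmin(errors))]
        selected.append(best)
        remaining.remove(best)
        candidates.append((list(selected), min(errors)))
    return min(candidates, key=lambda c: c[1])[0]              # subset with best pretreatment fit

controls = forward_select_controls(Y0_pre, y1_pre)
gap_pre = y1_pre - Y0_pre[:, controls].mean(axis=1)
gap_post = y1_post - Y0_post[:, controls].mean(axis=1)
att_fdid = gap_post.mean() - gap_pre.mean()                    # DID with selected controls
```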

Building on recent advances in the literature studying flexible alternatives to DID, we recommend following two best practices when implementing the synthetic control and related methods. First, after applying the method, visually inspect whether the parallel trends assumption of the corresponding method holds in the pretreatment window by plotting the outcome variable corresponding to the treatment unit(s) and that of the fitted in-sample curve, which is created using the control units. If the parallel pretrends assumption does not hold, do not adopt the method. If the parallel pretrends assumption does hold, then continue to conduct a backdating (out-of-sample prediction) exercise to check for overfitting (Abadie 2021; Li 2020; Li and Sonnier 2023). We recommend only using the methods that satisfy both best practices.
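The backdating check can be sketched as follows, reusing the synth_weights helper and the hypothetical arrays from above: refit the model on an earlier portion of the pretreatment window and verify that the predicted counterfactual tracks the treated unit in the held-out pretreatment periods (a placebo "effect" near zero).

```python
# Hedged sketch: backdating (out-of-sample prediction) check for overfitting.
import numpy as np

holdout = 4                                          # assumed number of held-out pretreatment periods
w_backdated = synth_weights(Y0_pre[:-holdout], y1_pre[:-holdout])

placebo_gap = y1_pre[-holdout:] - Y0_pre[-holdout:] @ w_backdated
print(placebo_gap.mean())                            # should be close to zero if no overfitting
```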

Our discussion thus far has focused on methods aimed primarily at matching treated and control units on the outcome variable (Y). In the next section, we discuss approaches focused on matching on observable covariates (X).

Selection on Observables (and Consequently, on Unobservables)

As we have noted, one main challenge in estimating causal effects from observational studies is the presence of confounding factors that simultaneously affect the treatment status and the outcome of interest. In some observational studies, the allocation of treatment can be presumed to resemble random assignment (e.g., a policy change determined independently from the outcomes being measured). However, such situations are rare because forces such as consumer incentives, firm objectives, or regulation can threaten such a pure treatment exogeneity argument. For such reasons, researchers typically find themselves in one of two situations: one where suitable covariates are not available and one where they are. In situations where suitable covariates are not available, researchers may consider methods such as instrumental variables, copulas, and control functions to infer causality using observational data. We refer readers to Wooldridge (2019), Petrin and Train (2010), Park and Gupta (2012), and Danaher and Smith (2010) for additional details.

However, in our information-rich era, researchers generally have abundant covariates at hand. Oftentimes, researchers observe many, if not most, of the confounding factors that could simultaneously affect the treatment status and the outcome of interest. For example, Ellickson, Kar, and Reeder (2023) consider the case of observational studies using data from targeted marketing campaigns, where, by definition of the targeting rule, treatment assignment is determined by observable demographic or behavioral covariates. Thus, there are many observational studies where the unconfoundedness (or selection on observables) assumption is satisfied, and researchers can adopt methods to adjust for the observed confounders.

Matching on Covariates

Matching methods, in simple terms, aim to pair/match units with similar covariates but different treatment statuses to estimate the treatment effects by comparing their outcomes. Some well-known traditional methods aimed at adjusting for observed confounders are parametric matching, propensity scores, and weighting methods. All these methods require that the researcher has knowledge about which covariates are important a priori. The identification of the observed confounders and the selection of the variables that represent them is usually based on economic theories, institutional knowledge, or intuition (e.g., targeting ads depends on consumer engagement on a website).

Propensity score matching has been a commonly used matching method in marketing for decades, although its viability has recently been called into question due to the technique’s sensitivity to parametric assumptions (Athey and Imbens 2017). These methods start with the estimation of the propensity score (i.e., the probability of receiving the treatment conditional on covariates, $e_i = P(T_i = 1 \mid X_i)$), which can later be combined with matching, stratification, inverse probability weighting, or covariate adjustment (Austin 2011). Another popular method to estimate treatment effects under the unconfoundedness assumption is the augmented inverse probability weighting approach (Robins, Rotnitzky, and Zhao 1994), which combines regression models for the potential outcomes with inverse propensity score weighting. One attractive property of this estimator is its robustness to bias or misspecification in either the potential outcome model or the propensity score model (Bang and Robins 2005). See Gordon et al. (2019) for a recent example in marketing that illustrates such applications in the context of online advertising.
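A minimal sketch of these steps follows: estimate the propensity score with a logistic regression and compute an inverse-probability-weighted ATT. The cross-sectional dataframe and column names (binary treatment T, outcome y, covariate list X_cols) are hypothetical.

```python
# Hedged sketch: propensity score estimation and an IPW estimate of the ATT.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = df[X_cols].to_numpy()                      # hypothetical covariate columns
T = df["T"].to_numpy()                         # hypothetical binary treatment indicator
y = df["y"].to_numpy()                         # hypothetical outcome

e = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]

# ATT weighting: treated units get weight 1; controls get the odds e / (1 - e).
w_ctrl = e[T == 0] / (1.0 - e[T == 0])
att_ipw = y[T == 1].mean() - np.average(y[T == 0], weights=w_ctrl)
print(att_ipw)
```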

Despite their popularity, matching methods have limitations, especially when the number of covariates is very large. In such cases, conventional methods such as exact-matching might become infeasible, and nearest-neighbor matching can result in a biased estimate of the ATT (Abadie and Imbens 2006). To obtain a flexible specification of the propensity score when the number of covariates is large, we can apply variable reduction methods such as Lasso (e.g., Gordon et al. 2019) or penalized logistic regression (e.g., Eckles and Bakshy 2021). Additionally, we can use machine learning (ML) methods to reduce the dimension of the covariate space. For example, Li et al. (2016) illustrate the usefulness of using linear dimensionality reduction ML algorithms such as principal component analysis (PCA), locality preserving projections (LPP), and random projections before matching the treatment and control units. Ramachandra (2018) explores the use of auto-encoders as a dimensionality reduction technique prior to neighbor matching on simulated data. In a similar vein, Yao et al. (2018) develop a method based on deep representation learning that jointly preserves the local similarity information and balances the distributions of the control and the treated groups.
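As a simple illustration of dimensionality reduction before matching, the sketch below projects the hypothetical covariates from the previous snippet onto a few principal components and matches each treated unit to its nearest control in the reduced space; the number of components is an arbitrary choice.

```python
# Hedged sketch: PCA before nearest-neighbor matching on covariates.
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

Z = PCA(n_components=5).fit_transform(X)             # X, T, y as in the previous sketch
nn = NearestNeighbors(n_neighbors=1).fit(Z[T == 0])
_, idx = nn.kneighbors(Z[T == 1])

att_match = (y[T == 1] - y[T == 0][idx.ravel()]).mean()
print(att_match)
```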

Additionally, Diamond and Sekhon (2013) propose GenMatch, a multivariate matching method based on a genetic algorithm that iteratively checks and improves covariate balance between the treated and the control groups. Zubizarreta (2015) also proposes a weighting method that allows researchers to prespecify the level of desired balance between the treated and the control groups. One advantage of this weighting method is that it runs in polynomial time, so large datasets can be handled quickly.

Causal Machine Learning Methods on Flexible Matching

While matching on covariates can be highly powerful in many research settings, the methods discussed in the previous section often require that researchers have prior knowledge about which covariates are important and which functional form is most suitable for capturing their influence on the outcome variables. However, when working in high-dimensional settings, it might become difficult for researchers to identify which specific covariates are important (e.g., number of clicks, time spent on different sections of the website) or which functional form is appropriate for modeling their influence on outcomes (linear, quadratic, or more flexible specifications). Meanwhile, including all the covariates or allowing for flexible functional forms may reduce the power available in the dataset for learning about the treatment effect of interest (Chernozhukov et al. 2018). In such cases, researchers can benefit from adopting causal ML methods for flexible matching. Next, we discuss some causal ML methods commonly used on observational data for inferring causality. These methods are especially helpful in settings that involve high-dimensional covariates and/or when the relationship between them cannot be satisfactorily modeled in a parametric way. In such cases, ML methods will arguably provide a better specification of the propensity score and outcome models than more traditional methods.

One such example is the doubly robust estimator (Bang and Robins 2005), which leverages ML methods for predicting both the propensity score and the potential outcome variables. The method then allows doubly robust estimation of the potential outcomes while preserving the favorable statistical properties that permit rigorous causal inference. Another recent development in the estimation of treatment effects under the unconfoundedness assumption is the use of ML methods to directly make inference about the parameters using the double ML approach (Chernozhukov et al. 2018). This method uses ML models to residualize any potential impact that the covariates may have on both the treatment and the outcome variables. The double ML framework can be combined with doubly robust estimators. It can also be readily extended to estimate heterogeneous treatment effects.
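The residualization logic can be sketched with off-the-shelf ML models and cross-fitting; this is a bare-bones partialling-out estimator, not a full implementation (packages such as DoubleML and econml provide complete estimators with valid inference). X, T, and y are the hypothetical arrays defined earlier.

```python
# Hedged sketch: double ML partialling-out with cross-fitting.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_predict

y_hat = cross_val_predict(RandomForestRegressor(random_state=0), X, y, cv=5)
t_hat = cross_val_predict(RandomForestClassifier(random_state=0), X, T, cv=5,
                          method="predict_proba")[:, 1]

y_res, t_res = y - y_hat, T - t_hat
theta = np.sum(t_res * y_res) / np.sum(t_res * t_res)   # effect of T on y after partialling out X
print(theta)
```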

We can also use tree-based ML approaches such as the causal forest (Wager and Athey 2018) for estimating both average and heterogeneous treatment effects in observational studies where the unconfoundedness assumption is satisfied. As discussed in Athey and Imbens (2016) and Wager and Athey (2018), the causal forest model can be particularly suitable for inferring treatment effects from rich observational data containing a large number of covariates. In contrast with conventional propensity score matching, causal forests utilize a flexible, nonparametric, data-driven approach to determine similarity across observations. Additionally, the estimation of traditional propensity score methods is often sensitive to the model specification (Fong, Hazlett, and Imai 2018), especially when the treatment variable is continuous. Causal forests are immune to such problems because the building of an honest tree (the building block of causal forests) does not rely on any particular functional form. Some recent examples in marketing of the use of causal forests for inferring causality from observational data are Guo, Sriram, and Manchanda (2021), Ellickson, Kar, and Reeder (2023), Pattabhiramaiah, Overby, and Xu (2022), and Zhang and Luo (2023).
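A hedged sketch using the econml package’s CausalForestDML is shown below; the package, class name, and arguments are stated to the best of our knowledge and should be checked against the econml documentation (the R package grf is the reference implementation of Wager and Athey 2018).

```python
# Hedged sketch: causal forest for heterogeneous effects under unconfoundedness.
# Assumes the econml package is installed; verify the API against its docs.
from econml.dml import CausalForestDML

cf = CausalForestDML(discrete_treatment=True, random_state=0)
cf.fit(y, T, X=X)                    # X, T, y as in the earlier sketches
cate = cf.effect(X)                  # unit-level treatment effect estimates
print(cate.mean())                   # average of the estimated effects
```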

For any of the quasi-experimental methods discussed above, researchers can also conduct sensitivity analyses, which involve assessing the extent of unobserved confounding necessary for nullifying the causal effect (Altonji, Elder, and Taber 2008; Imbens 2003; Rosenbaum and Rubin 1983). Liu, Kuramoto, and Stuart (2013) provide a nice introduction to sensitivity analysis. Recent work (e.g., Cinelli and Hazlett 2020; Oster 2019) has expanded on this idea to formally bound the strength of unobserved confounders by comparing them with observed covariates. Oster (2019) argues that the robustness of estimates to omitted variable bias can be examined by observing movements in (1) the coefficient of interest and (2) model R2 from specifications that either include or exclude control variables in a regression. Masten and Poirier (2022) point out that unobserved confounders can either drive baseline estimates to zero or reverse their sign, with the latter actually being easier. They recommend several best practices for sensitivity assessment and even offer a companion Stata package to help researchers adopt these tools.

Last but not least, in addition to using sophisticated statistical/econometric/ML methods for mitigating such concerns, we can also consider using field or lab experiments to complement causal conclusions drawn from field data, especially for forming a deeper understanding of the underlying mechanisms (for some recent examples, see Nickerson et al. [2023] and Anderson et al. [2024]).

Conclusion

The main purpose of this article is to offer some guidance to help marketing researchers choose the most appropriate method for understanding causal relationships from quasi-experimental data. We begin with the basic DID method that is widely used in settings with treatment and control units and pre- and posttreatment periods. We thereafter discuss advances in using DID for contexts characterized by staggered treatment timing and heterogeneous treatment effects. We then explore flexible alternatives to DID, such as the synthetic control method, which is predicated on the researcher’s access to a relatively large number of pretreatment periods, and other alternative methods that do not require a large number of pretreatment periods. We cover how to estimate causal effects, conduct inference, and recommend best practices for these alternatives. Additionally, we review quasi-experimental methods with covariates, such as matching, and recent advances in causal ML methods for flexible matching. Given the rapid development of new methods, this article is not meant to be an exhaustive review of the literature on causal inference using observational data, but rather a useful starting point. Recent research has introduced novel ways of combining ML with instrumental variables (e.g., Hartford et al. 2017; Singh, Hosanagar, and Gandhi 2020) and incorporating natural language processing techniques within a causal framework (Feder et al. 2022) with the goal of improving causal inference. Thus, the researcher’s methodological toolkit is ever expanding. We hope that this article helps researchers identify the right set of tools for answering causal research questions based on the data characteristics of their problem.

References

Abadie, Alberto (2005), “Semiparametric Difference-in-Differences Estimators,” Review of Economic Studies, 72 (1), 1–19.

Abadie, Alberto (2021), “Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects,” Journal of Economic Literature, 59 (2), 391–425.

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller (2010), “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program,” Journal of the American Statistical Association, 105 (490), 493–505.

Abadie, Alberto and Javier Gardeazabal (2003), “The Economic Costs of Conflict: A Case Study of the Basque Country,” American Economic Review, 93 (1), 113–32.

Abadie, Alberto and Guido W. Imbens (2006), “Large Sample Properties of Matching Estimators for Average Treatment Effects,” Econometrica, 74 (1), 235–67.

Altonji, Joseph G., Todd E. Elder, and Christopher R. Taber (2008), “Using Selection on Observed Variables to Assess Bias from Unobservables When Evaluating Swan-Ganz Catheterization,” American Economic Review , 98 (2), 345–50.

Anderson, Eric, Chaoqun Chen, Ayelet Israeli, and Duncan Simester (2024), “Canary Categories,” Journal of Marketing Research, 61 (5), 872–90.

Arkhangelsky, Dmitry, Susan Athey, David A. Hirshberg, Guido W. Imbens, and Stefan Wager (2021), “Synthetic Difference-in-Differences,” American Economic Review , 111 (12), 4088–4118.

Athey, Susan and Guido Imbens (2017), “The State of Applied Econometrics: Causality and Policy Evaluation,” Journal of Economic Perspectives, 31 (2), 3–32.

Athey, Susan and Guido W. Imbens (2019), “Machine Learning Methods That Economists Should Know About,” Annual Review of Economics, 11, 685–725.

Austin, Peter C. (2011), “An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies,” Multivariate Behavioral Research, 46 (3), 399–424.

Bai, Jushan and Serena Ng (2021), “Matrix Completion, Counterfactuals, and Factor Analysis of Missing Data,” Journal of the American Statistical Association, 116 (536), 1746–63.

Baker, Andrew C., David F. Larcker, and Charles C.Y. Wang (2022), “How Much Should We Trust Staggered Difference-in-Differences Estimates?” Journal of Financial Economics, 144 (2), 370–95.

Bang, Heejung and James M. Robins (2005), “Doubly Robust Estimation in Missing Data and Causal Inference Models,” Biometrics, 61 (4), 962–73.

Bronnenberg, Bart J., Jean-Pierre Dubé, and Robert E. Sanders (2020), “Consumer Misinformation and the Brand Premium: A Private Label Blind Taste Test,” Marketing Science, 39 (2), 382–406.

Callaway, Brantly and Pedro H.C. Sant’Anna (2021), “Difference-in-differences with Multiple Time Periods,” Journal of Econometrics, 225 (2), 200–230.

Cao, Jingcun, Pradeep Chintagunta, and Shibo Li (2023), “From Free to Paid: Monetizing a Non-Advertising-Based App,” Journal of Marketing Research, 60 (4), 707–27.

Cengiz, Doruk, Arindrajit Dube, Attila Lindner, and Ben Zipperer (2019), “The Effect of Minimum Wages on Low-Wage Jobs,” Quarterly Journal of Economics, 134 (3), 1405–54.

Chan, Marc K. and Simon C.M. Kwok (2016). “Policy Evaluation with Interactive Fixed Effects,” working paper, University of Sydney.

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins (2018), “Double/Debiased Machine Learning for Treatment and Structural Parameters,” Econometrics Journal , 21 (1), C1–C68.

Cinelli, Carlos and Chad Hazlett (2020), “Making Sense of Sensitivity: Extending Omitted Variable Bias,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82 (1), 39–67.

Clarke, Damian, Daniel Pailanir, Susan Athey, and Guido Imbens (forthcoming), “On Synthetic Difference-in-Differences and Related Estimation Methods in Stata,” Stata Journal. 

Danaher, Peter J. and Michael S. Smith (2010). “Modeling Multivariate Distributions Using Copulas: Applications in Marketing,” Marketing Science, 30 (1), 4–21.

Diamond, Alexis and Jasjeet Sekhon (2013). “Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies,” Review of Economics and Statistics, 95 (3), 932–45.

Eckles, Dean and Eytan Bakshy (2021), “Bias and High-Dimensional Adjustment in Observational Studies of Peer Effects,” Journal of the American Statistical Association, 116 (534), 507–17.

Ellickson, Paul B., Wreetabrata Kar, and James C. Reeder III (2023), “Estimating Marketing Component Effects: Double Machine Learning from Targeted Digital Promotions,” Marketing Science, 42 (4), 704–28.

Feder, Amir, Katherine A. Keith, Emaad Manzoor, Reid Pryzant, Dhanya Sridhar, Zach Wood-Doughty, et al. (2022). “Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond,” Transactions of the Association for Computational Linguistics, 10, 1138–58.

Fong, Christian, Chad Hazlett, and Kosuke Imai (2018). “Covariate Balancing Propensity Score for a Continuous Treatment: Application to the Efficacy of Political Advertisements,” Annals of Applied Statistics, 12 (1), 156–77.

Ghose, Anindya, Heeseung Andrew Lee, Kihwan Nam, and Wonseok Oh (2024), “The Effects of Pressure and Self-Assurance Nudges on Product Purchases and Returns in Online Retailing: Evidence from a Randomized Field Experiment,” Journal of Marketing Research, 61 (3), 517–35.

Gobillon, Laurent and Thierry Magnac (2016), “Regional Policy Evaluation: Interactive Fixed Effects and Synthetic Controls,” Review of Economics and Statistics, 98 (3), 535–51.

Goodman-Bacon, Andrew (2021), “Difference-in-Differences with Variation in Treatment Timing,” Journal of Econometrics, 225 (2), 254–77.

Gordon, Brett R., Florian Zettelmeyer, Neha Bhargava, and Dan Chapsky (2019), “A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook,” Marketing Science, 38 (2), 193–225.

Gormley, Todd A. and David A. Matsa (2011), “Growing Out of Trouble? Corporate Responses to Liability Risk,” Review of Financial Studies, 24 (8), 2781–2821.

Guo, Tong, Srinivasaraghavan Sriram, and Puneet Manchanda (2020), “‘Let the Sunshine In’: The Impact of Industry Payment Disclosure on Physician Prescription Behavior,” Marketing Science, 39 (3), 516–39.

Guo, Tong, Srinivasaraghavan Sriram, and Puneet Manchanda (2021), “The Effect of Information Disclosure on Industry Payments to Physicians,” Journal of Marketing Research, 58 (1), 115–40.

Hartford, Jason, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy (2017), “Deep IV: A Flexible Approach for Counterfactual Prediction,” in Proceedings of the 34th International Conference on Machine Learning, Vol. 70, D. Precup and Y.W. Teh, eds. PMLR, 1414–23.

Heckman, James J., Hidehiko Ichimura, and Petra E. Todd (1997), “Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme,” Review of Economic Studies, 64 (4), 605–54.

Hsiao, Cheng, H. Steve Ching, and Shui Ki Wan (2012), “A Panel Data Approach for Program Evaluation: Measuring the Benefits of Political and Economic Integration of Hong Kong with Mainland China,” Journal of Applied Econometrics, 27 (5), 705–40.

Imbens, Guido W. (2003), “Sensitivity to Exogeneity Assumptions in Program Evaluation,” American Economic Review, 93 (2), 126–32.

Keller, Kristopher O., Jonne Y. Guyt, and Rajdeep Grewal (2024), “Soda Taxes and Marketing Conduct,” Journal of Marketing Research, 61 (3), 393–410.

Kim, Sungjin, Clarence Lee, and Sachin Gupta (2020), “Bayesian Synthetic Control Methods,” Journal of Marketing Research, 57 (5), 831–52.

Lambrecht, Anja, Catherine Tucker, and Xu Zhang (2024), “TV Advertising and Online Sales: A Case Study of Intertemporal Substitution Effects for an Online Travel Platform,” Journal of Marketing Research, 61 (2), 248–70.

Li, Kathleen T. (2020), “Statistical Inference for Average Treatment Effects Estimated by Synthetic Control Methods,” Journal of the American Statistical Association, 115 (532), 2068–83.

Li, Kathleen T. (2024), “Frontiers: A Simple Forward Difference-in-Differences Method,” Marketing Science, 43 (2), 267–79.

Li, Kathleen T. and Garrett P. Sonnier (2023), “Statistical Inference for Factor Model Approach to Estimate Causal Effects in Quasi-Experimental Settings,” Journal of Marketing Research, 60 (3), 449–72.

Li, Kathleen T. and Christophe Van den Bulte (2023), “Augmented Difference-in-Differences,” Marketing Science, 42 (4), 746–67.

Li, Sheng, Nikos Vlassis, Jaya Kawale, and Yun Fu (2016), “Matching via Dimensionality Reduction for Estimation of Treatment Effects in Digital Marketing Campaigns,” in Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI). IJCAI, 3768–74.

Liu, Weiwei, S. Janet Kuramoto, and Elizabeth A. Stuart (2013), “An Introduction to Sensitivity Analysis for Unobserved Confounding in Nonexperimental Prevention Research,” Prevention Science, 14, 570–80.

Lovett, Mitchell J., Renana Peres, and Linli Xu (2019), “Can Your Advertising Really Buy Earned Impressions? The Effect of Brand Advertising on Word of Mouth,” Quantitative Marketing and Economics, 17, 215–55.

Masten, Matthew A. and Alexandre Poirier (2022), “The Effect of Omitted Variables on the Sign of Regression Coefficients,” arXiv, https://doi.org/10.48550/arXiv.2208.00552.

Nickerson, Dionne, Michael Lowe, Adithya Pattabhiramaiah, and Alina Sorescu (2023), “The Impact of Corporate Social Responsibility on Brand Sales: An Accountability Perspective,” Journal of Marketing, 87 (1), 5–28.

Oster, Emily (2019), “Unobservable Selection and Coefficient Stability: Theory and Evidence,” Journal of Business & Economic Statistics, 37 (2), 187–204.

Pang, Xun, Licheng Liu, and Yiqing Xu (2022), “A Bayesian Alternative to Synthetic Control for Comparative Case Studies,” Political Analysis, 30 (2), 269–88.

Park, Sungho and Sachin Gupta (2012), “Handling Endogenous Regressors by Joint Estimation Using Copulas,” Marketing Science, 31 (4), 567–86.

Pattabhiramaiah, Adithya, S. Sriram, and Puneet Manchanda (2019), “Paywalls: Monetizing Online Content,” Journal of Marketing, 83 (2), 19–36.

Pattabhiramaiah, Adithya., Eric Overby, and Lizhen Xu (2022), “Spillovers from Online Engagement: How a Newspaper Subscriber’s Activation of Digital Paywall Access Affects Her Retention and Subscription Revenue,” Management Science, 68 (5), 3528–48.

Petrin, Amil and Kenneth Train (2010), “A Control Function Approach to Endogeneity in Consumer Choice Models,” Journal of Marketing Research, 47 (1), 3–13.

Ramachandra, Vikas (2018), “Deep Learning for Causal Inference,” arXiv.org, https://doi.org/10.48550/arXiv.1803.00149.

Robins, James, Andrea Rotnitzky, and Lue Ping Zhao (1994), “Estimation of Regression Coefficients When Some Regressors Are Not Always Observed,” Journal of the American Statistical Association, 89 (427), 846–66.

Rosenbaum, P.R. and D.B. Rubin (1983), “Assessing Sensitivity to an Unobserved Binary Covariate in an Observational Study with Binary Outcome,” Journal of the Royal Statistical Society: Series B (Methodological), 45 (2), 212–18.

Roth, Jonathan, Pedro H.C. Sant’Anna, Alyssa Bilinski, and John Poe (2023), “What’s Trending in Difference-in-Differences? A Synthesis of the Recent Econometrics Literature,” Journal of Econometrics, 233 (2), 345–66.

Sant’Anna, Pedro H.C. and Jun Zhao (2020), “Doubly Robust Difference-in-Differences Estimators,” Journal of Econometrics, 219 (1), 101–22.

Singh, Amandeep, Kartik Hosanagar, and Amit Gandhi (2020), “Machine Learning Instrumental Variables for Causal Inference,” in Proceedings of the 21st ACM Conference on Economics and Computation (EC ’20). ACM, 400–417.

Sun, Liyang and Sarah Abraham (2021). “Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects,” Journal of Econometrics, 225 (2), 175–99.

Wager, Stefan and Susan Athey (2018), “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests,” Journal of the American Statistical Association, 113 (523), 1228–42.

Wooldridge, Jeffrey M. (2019). Introductory Econometrics: A Modern Approach, 7th ed. Cengage Learning.

Xu, Yiqing (2017), “Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models,” Political Analysis, 25 (1), 57–76.

Yao, Liuyi, Sheng Li, Yaliang Li, Mengdi Huai, Jing Gao, and Aidong Zhang (2018), “Representation Learning for Treatment Effect Estimation from Observational Data,” in Advances in Neural Information Processing Systems. NeurIPS, 2633–43.

Zhang, Mengxia and Lan Luo (2023), “Can Consumer-Posted Photos Serve as a Leading Indicator of Restaurant Survival? Evidence from Yelp,” Management Science, 69 (1), 25–50.

Zubizarreta, José R. (2015), “Stable Weights That Balance Covariates for Estimation with Incomplete Outcome Data,” Journal of the American Statistical Association, 110 (511), 910–22.


[1]Researchers looking to overcome issues related to heterogeneous treatment effects in staggered treatment settings and interpret the composite treatment effect across all cohorts might also consider the synthetic control method that we discuss in the next section. Synthetic control methods do not suffer from this problem, as they estimate a separate ATT for each treatment group and use “clean” controls (units that have never been treated) for computing the ATT for the entire sample.


Kathleen T. Li is Assistant Professor of Marketing, University of Texas at Austin, USA.

Lan Luo is Professor of Marketing, University of Southern California, USA.

Adithya Pattabhiramaiah is Associate Professor of Marketing, Georgia Institute of Technology, USA.
