Can You Trust Your Ad Data? A New Study Exposes a Hidden Flaw in A-B Testing on Digital Ad Platforms

Michael Braun and Eric M. Schwartz

Consider a landscaping company whose designs focus on native plants and water conservation. The company creates two advertisements: one focused on sustainability (ad A) and another on aesthetics (ad B). As platforms personalize the ads that different users receive, ads A and B will be delivered to audiences with diverging mixes of users. Users interested in outdoor activities may see the sustainability ad, whereas users interested in home decor may see the aesthetics ad. Targeting ads to specific consumers is a major part of the value that platforms offer to advertisers because it aims to place the “right” ads in front of the “right” users.

In a new Journal of Marketing study, we find that online A-B testing in digital advertising may not be delivering the reliable insights marketers expect. Our research uncovers significant limitations in the experimentation tools provided by online advertising platforms, potentially creating misleading conclusions about ad performance.

The Issue with Divergent Delivery

We highlight a phenomenon called “divergent delivery,” in which the targeting algorithms used by online advertising platforms like Meta and Google show different ad content to different types of users. The problem arises when this happens within an A-B test, an experiment designed to compare the effectiveness of two ads: the algorithm delivers each ad to a distinct mix of users. The “winning” ad may have performed better simply because the algorithm showed it to users who were more prone to respond than the users who saw the other ad. The same ad could appear to perform better or worse depending on the mix of users who see it rather than on the creative content of the ad itself.
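
The arithmetic behind this confound can be sketched in a few lines of Python. This is only an illustration, not the simulation from the study: the segment names, response rates, and audience mixes are made-up assumptions, and both ads are assumed to be exactly equally effective within each segment.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    response_rate: float  # assumed rate at which this segment responds to EITHER ad


# Hypothetical segments; within each one, ads A and B perform identically.
SEGMENTS = {
    "outdoor": Segment("outdoor", 0.02),
    "home_decor": Segment("home_decor", 0.06),
}


def expected_rate(mix: dict[str, float]) -> float:
    """Expected response rate of an ad shown to the given audience mix.

    `mix` maps a segment name to the share of the ad's impressions
    that went to that segment (shares should sum to 1).
    """
    return sum(share * SEGMENTS[name].response_rate for name, share in mix.items())


# Balanced (randomized) delivery: both ads reach the same mix of users.
randomized_mix = {"outdoor": 0.5, "home_decor": 0.5}

# Divergent delivery: the platform steers each ad toward a different mix.
mix_a = {"outdoor": 0.8, "home_decor": 0.2}
mix_b = {"outdoor": 0.2, "home_decor": 0.8}

rate_a_randomized = expected_rate(randomized_mix)
rate_b_randomized = expected_rate(randomized_mix)
rate_a_divergent = expected_rate(mix_a)
rate_b_divergent = expected_rate(mix_b)

print(f"Randomized: A={rate_a_randomized:.3f}  B={rate_b_randomized:.3f}")
print(f"Divergent:  A={rate_a_divergent:.3f}  B={rate_b_divergent:.3f}")
```

Under these assumed numbers, balanced delivery gives both ads the same 4.0% response rate, while divergent delivery reports 2.8% for ad A and 5.2% for ad B. Ad B appears to “win” even though, by construction, the creative content makes no difference.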

For an advertiser, especially one with a large audience to choose from and a limited budget, targeting provides plenty of value. Large companies like Google and Meta therefore use algorithms that allocate ads to specific users. On these platforms, advertisers bid for the right to show ads to users in an audience. However, the winner of an auction for the right to place an ad on a particular user’s screen is determined not by the monetary value of the bids alone but also by the ad content and user–ad relevance. The precise inputs and methods that determine the relevance of ads to users, how relevance influences auction results, and, thus, which users are targeted with each ad are proprietary to particular platforms and are not observable to advertisers. Exactly how the algorithms determine relevance for different types of users is not known, and the platforms themselves may not be able to enumerate or reproduce it.

Our findings have profound implications for marketers who rely on A-B testing of their online ads to inform their marketing strategies. Because these online ad tests are low cost and have seemingly scientific appeal, marketers use them to develop strategies that go beyond deciding which ad to include in the next campaign. When platforms do not explicitly state that these experiments are not truly randomized, marketers are left with a false sense of security about their data-driven decisions.

A Fundamental Problem with Online Advertising

We argue that this issue is not just a technical flaw in this tool but a fundamental characteristic of how the online advertising business operates. The platform’s primary goal is to maximize ad performance, not to provide experimental results for marketers. Therefore, these platforms have little incentive to let advertisers untangle the effect of ad content from the effect of their proprietary targeting algorithms. Marketers are left in a difficult position in that they must either accept the confounded results from these tests or invest in more complex and costly methods to truly understand the impact of creative elements in their ads.

Our study makes its case using simulation, statistical analysis, and a demonstration of divergent delivery from an actual A-B test run in the field. We challenge the common belief that results from A-B tests that compare multiple ads provide the same ability to draw causal conclusions as do randomized experiments. Marketers should be aware that the differences in effects of ads A and B that are reported by these platforms may not fully capture the true impact of their ads. By recognizing these limitations, marketers can make more informed decisions and avoid the pitfalls of misinterpreting data from these tests.

Advice for Advertisers

We offer the following recommendations for those using A-B testing tools:

  • If your goal is to predict which ad creatives will perform best in a targeted environment (under the same conditions, on the same ad platform, with the same campaign settings), our advice is to carry on using the available A-B testing tools. Experimenters with this goal may not mind, and may even prefer, that their A-B tests lack balance across ad creative treatments and lack representativeness of the subjects.
  • If the goal is to learn how different ad creatives generate different responses more generally, the report of the test should include the disclaimer that the A-B comparisons were made on a subset of the audience, across different mixes of users optimized for each ad separately, where subjects were selected by the proprietary algorithm.
  • If the marketing objective is to extrapolate comparisons between ad content for use outside of the current platform (e.g., marketing strategy development, or offline advertising where randomized experimentation and user tracking are more challenging), our advice is to not rely on these A-B tests for causal evidence about the effects of creative content across ads. The analytics team, for instance, should warn that results are confounded by how the algorithm determined which ad treatments were most relevant to different experimental subjects. These disclosures should also be made by academic researchers who use A-B test results for scientific inference.
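
A team preparing such a disclosure might want a rough number for how far the two audiences diverged. The sketch below is one possible diagnostic, not a method from the study. It assumes the platform's reporting exposes impression counts for each ad broken down by some observable dimension (the age brackets and counts here are hypothetical), and it uses total variation distance, which is our choice of summary measure.

```python
def audience_shares(impressions: dict[str, int]) -> dict[str, float]:
    """Convert impression counts per segment into shares of the ad's total."""
    total = sum(impressions.values())
    return {segment: count / total for segment, count in impressions.items()}


def mix_divergence(impressions_a: dict[str, int], impressions_b: dict[str, int]) -> float:
    """Total variation distance between the audience mixes of two ads.

    Returns 0.0 when both ads reached identical mixes and 1.0 when the
    mixes do not overlap at all on the reported dimension.
    """
    shares_a = audience_shares(impressions_a)
    shares_b = audience_shares(impressions_b)
    segments = set(shares_a) | set(shares_b)
    return 0.5 * sum(
        abs(shares_a.get(s, 0.0) - shares_b.get(s, 0.0)) for s in segments
    )


# Hypothetical delivery breakdown by age bracket for each ad.
ad_a_impressions = {"18-34": 6000, "35-54": 3000, "55+": 1000}
ad_b_impressions = {"18-34": 2000, "35-54": 3000, "55+": 5000}

divergence = mix_divergence(ad_a_impressions, ad_b_impressions)
print(f"Audience mix divergence: {divergence:.2f}")
```

With these made-up counts the divergence is 0.40, meaning 40% of one ad's impressions would have to move to other age brackets to match the other ad's mix. A value near zero is not proof of balance: the algorithm may still sort users on characteristics that advertisers cannot observe.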

To summarize, an A-B test may appear to be an easy way to run field experiments to learn about the effects of ads, imagery, and messaging. But experimenters who run A-B tests in targeted online advertising environments should know what they are really getting. Our concern is not the mere usage of certain types of A-B tests. Rather, it is the presentation of results as if they came from balanced experiments, and the conclusions and managerial decisions that are then based on those results.

Read the Full Study for Complete Details

Source: Michael Braun and Eric M. Schwartz, “Where A-B Testing Goes Wrong: How Divergent Delivery Affects What Online Experiments Cannot (and Can) Tell You About How Customers Respond to Advertising,” Journal of Marketing.

Michael Braun is Associate Professor, Marilyn and Leo F. Corrigan Research Professor, Southern Methodist University, USA.

Eric M. Schwartz is Arnold M. and Linda T. Jacob Faculty Fellow, Associate Professor of Marketing, University of Michigan, USA.
