
The Optimizer’s Curse & Wrong-Way Reductions

A copy of this post can also be found on my personal blog


Following an idea to its logical conclusion might be extrapolating a model beyond its valid range.
John D. Cook

Summary

I spent about two and a half years as a research analyst at GiveWell. For most of my time there, I was the point person on GiveWell’s main cost-effectiveness analyses. I’ve come to believe there are serious, underappreciated issues with the methods the effective altruism (EA) community at large uses to prioritize causes and programs. While effective altruists approach prioritization in a number of different ways, most approaches involve (a) roughly estimating the possible impacts funding opportunities could have and (b) assessing the probability that possible impacts will be realized if an opportunity is funded.

I discuss the phenomenon of the optimizer’s curse: when assessments of activities’ impacts are uncertain, engaging in the activities that look most promising will tend to have a smaller impact than anticipated. I argue that the optimizer’s curse should be extremely concerning when prioritizing among funding opportunities that involve substantial, poorly understood uncertainty. I further argue that proposed Bayesian approaches to avoiding the optimizer’s curse are often unrealistic. I maintain that it is a mistake to try to understand all uncertainty in terms of precise probability estimates.

This post is long, so I’ve separated it into several sections:

  1. The optimizer’s curse
  2. Models, wrong-way reductions, and probability
  3. Hazy probabilities and prioritization
  4. Bayesian wrong-way reductions
  5. Doing better

Part 1: The optimizer’s curse

The counterintuitive phenomenon of the optimizer’s curse was first formally recognized in Smith & Winkler 2006.

Here’s a rough sketch:

  • Optimizers start by calculating the expected value of different activities.
  • Estimates of expected value involve uncertainty.
  • Sometimes expected value is overestimated, sometimes expected value is underestimated.
  • Optimizers aim to engage in activities with the highest expected values.
  • Result: Optimizers tend to select activities with overestimated expected value.

Smith and Winkler refer to the difference between the expected value of an activity and its realized value as “postdecision surprise.”

The optimizer’s curse occurs even in scenarios where estimates of expected value are unbiased (roughly, where any given estimate is as likely to be too optimistic as it is to be too pessimistic1). When estimates are biased—which they typically are in the real world—the magnitude of the postdecision surprise may increase.
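To see the mechanism at work, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption rather than something taken from Smith & Winkler: ten options that all have a true value of 0, independent Gaussian noise with a standard deviation of 1 on each estimate, and the hypothetical function name simulate_curse. The estimates are unbiased by construction, yet the option that looks best is overestimated on average.

```python
import random
import statistics


def simulate_curse(n_options=10, noise_sd=1.0, n_trials=20_000, seed=0):
    """Return (mean error over all estimates, mean error of the chosen option)."""
    rng = random.Random(seed)
    all_errors = []
    chosen_errors = []
    for _ in range(n_trials):
        # Assumption: every option has the same true value, so any
        # apparent winner is produced by estimation noise alone.
        true_values = [0.0] * n_options
        estimates = [t + rng.gauss(0.0, noise_sd) for t in true_values]
        errors = [e - t for e, t in zip(estimates, true_values)]
        # The optimizer picks whichever option has the highest estimate.
        best = max(range(n_options), key=lambda i: estimates[i])
        all_errors.extend(errors)
        chosen_errors.append(errors[best])
    return statistics.mean(all_errors), statistics.mean(chosen_errors)


mean_error_all, mean_error_chosen = simulate_curse()
print(f"Mean error across all estimates:  {mean_error_all:+.3f}")
print(f"Mean error of the chosen option:  {mean_error_chosen:+.3f}")
```

Under these assumptions, the average error across all estimates should come out near zero, while the average error of the selected option comes out clearly positive. The gap is introduced by the act of picking the maximum, not by any bias in the individual estimates.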

A huge problem for effective altruists facing uncertainty

In a simple model, I show how an optimizer with only moderate uncertainty about factors that determine opportunities’ cost-effectiveness may dramatically overestimate the cost-effectiveness of the opportunity that appears most promising. As uncertainty increases, the degree to which the cost-effectiveness of the optimal-looking program is overstated grows wildly.
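The sketch below is not the model referred to above. It is a separate toy simulation, under assumptions chosen purely for illustration: 20 hypothetical opportunities whose true values are drawn from a standard normal distribution, estimates equal to the true value plus Gaussian noise, and a function name (mean_overestimate) invented here. It only shows the qualitative pattern: as the noise grows, so does the average overstatement of the best-looking option.

```python
import random
import statistics


def mean_overestimate(noise_sd, n_options=20, n_trials=5_000, seed=1):
    """Average (estimate - true value) for the option with the top estimate."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_trials):
        # Assumption: true values vary across opportunities (standard normal).
        true_values = [rng.gauss(0.0, 1.0) for _ in range(n_options)]
        # Each estimate is unbiased but noisy.
        estimates = [t + rng.gauss(0.0, noise_sd) for t in true_values]
        best = max(range(n_options), key=lambda i: estimates[i])
        gaps.append(estimates[best] - true_values[best])
    return statistics.mean(gaps)


noise_levels = (0.5, 1.0, 2.0, 4.0)
overestimates = {sd: mean_overestimate(sd) for sd in noise_levels}
for sd, gap in overestimates.items():
    print(f"noise sd = {sd:>3}: chosen option overestimated by {gap:.2f} on average")
```

The absolute numbers depend entirely on the assumed distributions, so they should not be read as estimates for any real funding decision.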

I believe effective altruists should find this extremely concerning. They’ve considered a large number of causes. They often have massive uncertainty about the true importance of causes they’ve prioritized. For example, GiveWell acknowledges substantial uncertainty about the impact of deworming programs it recommends, and the Open Philanthropy Project pursues a high-risk, high-reward grantmaking strategy.

The optimizer’s curse can show up even in situations where effective altruists’ prioritization decisions don’t involve formal models or explicit estimates of expected value. Someone informally assessing philanthropic opportunities in a linear manner might have a thought like:

Thing X seems like an awfully big issue. Funding Group A would probably cost only a little bit of money and have a small chance leading to a solution for Thing X. Accordingly, I feel decent about the expected cost-effectiveness of funding Group A.

Let me compare that to how I feel about some other funding opportunities…

Although the thinking is informal, there’s uncertainty, potential for bias, and an optimization-like process.2

Previously proposed solution

The optimizer’s curse hasn’t gone unnoticed by impact-oriented philanthropists. Luke Muehlhauser, a senior research analyst at the Open Philanthropy Project and the former executive director of the Machine Intelligence Research Institute, wrote an article titled The Optimizer’s Curse and How to Beat It. Holden Karnofsky, the co-founder of GiveWell and the CEO of the Open Philanthropy Project, wrote Why we can’t take expected value estimates literally. While Karnofsky didn’t directly mention the phenomenon of the optimizer’s curse, he covered closely related concepts.

Both Muehlhauser and Karnofsky suggested that the solution to the problem is to make Bayesian adjustments. Muehlhauser described this solution as “straightforward.”3 Karnofsky seemed to think Bayesian adjustments should be made, but he acknowledged serious difficulties involved in making explicit, formal adjustments.4 Bayesian adjustments are also proposed in Smith & Winkler 2006.5

Here’s what Smith & Winkler propose (I recommend skipping it if you’re not a statistics buff):6

“The key to overcoming the optimizer’s curse is conceptually quite simple: model the uncertainty in the value estimates explicitly and use Bayesian methods to interpret these value estimates. Specifically, we assign a prior distribution on the vector of true values μ = (μ1,…,μn) and describe the accuracy of the value estimates V = (V1,…,Vn) by a conditional distribution V|μ. Then, rather than ranking alternatives based on the value estimates, after we have done the decision analysis and observed the value estimates V, we use Bayes’ rule to determine the posterior distribution for μ|V and rank and choose among alternatives based on the posterior means, v̂i = E[μi|V] for i = 1,…,n.”
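For readers who prefer code to notation, here is a minimal sketch of the simplest special case of that proposal. The assumptions are mine and are not part of the quote: each true value has an independent normal prior with mean 0 and standard deviation 1, each estimate is the true value plus normal noise with a known standard deviation, and the program names and numbers are made up. In that case the posterior mean shrinks each estimate toward the prior mean, and noisier estimates are shrunk more.

```python
def posterior_mean(estimate, noise_sd, prior_mean=0.0, prior_sd=1.0):
    """Posterior mean of a true value under a normal prior and normal noise.

    Normal-normal conjugacy gives:
        E[mu | V] = prior_mean + w * (V - prior_mean)
        w = prior_sd**2 / (prior_sd**2 + noise_sd**2)
    """
    weight = prior_sd**2 / (prior_sd**2 + noise_sd**2)
    return prior_mean + weight * (estimate - prior_mean)


# Hypothetical options: (value estimate, standard deviation of its noise).
options = {
    "well-measured program": (2.0, 0.5),
    "speculative program": (4.0, 3.0),
}

adjusted = {
    name: posterior_mean(estimate, noise_sd)
    for name, (estimate, noise_sd) in options.items()
}

for name, (estimate, _) in options.items():
    print(f"{name}: raw estimate {estimate:.1f}, posterior mean {adjusted[name]:.1f}")
```

With these made-up inputs, the speculative program has the higher raw estimate but the lower posterior mean, so the ranking flips. Note that the whole calculation rests on the prior and the noise levels being known, which is exactly the input the procedure requires.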

For entities with lots of past data on both the (a) expected values of activities and (b) precisely measured, realized values of the same activities, this may be an excellent solution.

In most scenarios where effective altruists encounter the optimizer’s curse, this solution is unworkable. The necessary data doesn’t exist.7 The impact of most philanthropic programs has not been rigorously measured. Most funding decisions are not made on the basis of explicit expected value estimates. Many causes effective altruists are interested in are novel: there have never been opportunities to collect the necessary data.

The alternatives I’ve heard effective altruists propose involve attempts to approximate data-driven Bayesian adjustments as well as possible given the lack of data. I believe these alternatives either don’t generally work in practice or aren’t worth calling Bayesian.

To make my case, I’m going to first segue into some other topics.

Part 2: Models, wrong-way reductions, and probability

Models

In my experience, members of the effective altruism community are far more likely than the typical person to try to understand the world (and make decisions) on the basis of abstract models.8 I don’t think enough effort goes into considering when (if ever) these abstract models cease to be appropriate for application.

This post’s opening quote comes from a great blog post by John D Cook. In the post, Cook explains how Euclidean geometry is a great model for estimating the area of a football field—multiply field_length * field_width and you’ll get a result that’s pretty much exactly the field’s area. However, Euclidean geometry ceases to be a reliable model when calculating the area of truly massive spaces—the curvature of the earth gets in the way.9 Most models work the same way. Here’s how Cook ends his blog post:10

Models are based on experience with data within some range. The surprising thing about Newtonian physics is not that it breaks down at a subatomic scale and at a cosmic scale. The surprising thing is that it is usually adequate for everything in between.

Most models do not scale up or down over anywhere near as many orders of magnitude as Euclidean geometry or Newtonian physics. If a dose-response curve, for example, is linear for observations in the range of 10 to 100 milligrams, nobody in his right mind would expect the curve to remain linear for doses up to a kilogram. It wouldn’t be surprising to find out that linearity breaks down before you get to 200 milligrams.

Wrong-way reductions

In a brilliant article, David Chapman coins the term “wrong-way reduction” to describe an error people commit when they propose tackling a complicated, hard problem with an apparently simple solution that, on further inspection, turns out to be more problematic than the initial problem. Chapman points out that regular people rarely make this kind of error. Usually, wrong-way reductions are motivated errors committed by people in fields like philosophy, theology, and cognitive science.

The problematic solutions wrong-way reductions offer often take this form:


“If we had [a thing we don’t usually have], then we could [apply a simple strategy] to authoritatively solve all instances of [a hard problem].”


People advocating wrong-way reductions often gloss over the fact that their proposed solutions require something we don’t have or engage in intellectual gymnastics to come up with something that can act as a proxy for the thing we don’t have. In most cases, these intellectual gymnastics strike outsiders as ridiculous but come off as more convincing to people who’ve accepted the ideology that motivated the wrong-way reduction.

A wrong-way reduction is often an attempt to universalize an approach that works in a limited set of situations. Put another way, wrong-way reductions involve stretching a model way beyond the domains it’s known to work in.

An example

I spent a lot of my childhood in evangelical, Christian communities. Many of my teachers and church leaders subscribed to the idea that the Bible was the literal, infallible word of God. If you presented some of these people with questions about how to live or how to handle problems, they’d encourage you to turn to the Bible.11

In some cases, the Bible offered fairly clear guidance. When faced with the question of whether one should worship the Judeo-Christian God, the commandment, “You shall have no other gods before me”12 gives a clear answer. Other parts of the Bible are consistent with that commandment. However, “follow the Bible” ends up as a wrong-way reduction because the Bible doesn’t give clear answers to most of the questions that fall under the umbrella of “How should one live?”

Is abortion OK? One of the Ten Commandments states, “You shall not murder.”13 But then there are other passages that advocate execution.14 How similar are abortion, execution, and murder anyway?

Should one continue dating a significant other? Start a business? It’s not clear where to start with those questions.

I intentionally used an example that I don’t think will ruffle too many readers’ feathers, but imagine for a minute what it’s like to be a person who subscribes to the idea that the Bible is a complete and infallible guide:

You see the hard problem of deciding how to live has a demanding but straightforward solution! You frequently observe people (including plenty of mainstream Christians) experience failure and suffering when their actions don’t align with the Bible’s teachings.

You’re likely in a close-knit community with like-minded people. Intelligent and respected members of the community regularly turn to the Bible for advice and encourage you to do the same.

When you have doubts about the coherence of your worldview, there’s someone smarter than you in the church community you can consult. The wise church member has almost certainly heard concerns similar to yours before and can explain why the apparent issues or inconsistencies you’ve run into may not be what they seem.

A mainstream Christian from outside the community probably wouldn’t find the rationales offered by the church member compelling. An individual who’s already in the community is more easily convinced.15

Probability

The idea that all uncertainty must be explainable in terms of probability is a wrong-way reduction. Getting more detailed, the idea that if one knows the probabilities and utilities of all outcomes, then she can always behave rationally in pursuit of her goals is a wrong-way reduction.

My objection isn’t novel. People have been making versions of it for a long time. The term Knightian uncertainty is often used to distinguish quantifiable risk from unquantifiable uncertainty.

As I’ll illustrate later, we don’t need to assume a strict dichotomy separates quantifiable risk from unquantifiable uncertainty. Instead, real-world uncertainty falls on something like a spectrum.

Nate Soares, the executive director of the Machine Intelligence Research Institute, wrote a post on LessWrong that demonstrates the wrong-way reduction I’m concerned about. He writes:16

It doesn’t really matter what uncertainty you call ‘normal’ and what uncertainty you call ‘Knightian’ because, at the end of the day, you still have to cash out all your uncertainty into a credence so that you can actually act.

I don’t think ignorance must cash out as a probability distribution. I don’t have to use probabilistic decision theory to decide how to act.

Here’s the physicist David Deutsch tweeting on a related topic:

What is probability?

Probability is, as far as we know, an abstract mathematical concept. It doesn’t exist in the physical world of our everyday experience.17 However, probability has useful, real-world applications. It can aid in describing and dealing with many types of uncertainty.

I’m not a statistician or a philosopher. I don’t expect anyone to accept that position based on my authority. That said, I believe I’m in good company. Here’s an excerpt from Bayesian statistician Andrew Gelman on the same topic:18

Probability is a mathematical concept. To define it based on any imperfect real-world counterpart (such as betting or long-run frequency) makes about as much sense as defining a line in Euclidean space as the edge of a perfectly straight piece of metal, or as the space occupied by a very thin thread that is pulled taut. Ultimately, a line is a line, and probabilities are mathematical objects that follow Kolmogorov’s laws. Real-world models are important for the application of probability, and it makes a lot of sense to me that such an important concept has many different real-world analogies, none of which are perfect.

Consider a handful of statements that involve probabilities:


  1. A hypothetical fair coin tossed in a fair manner has a 50% chance of coming up heads.

  2. When two buddies at a bar flip a coin to decide who buys the next round, each person has a 50% chance of winning.

  3. Experts believe there’s a 20% chance the cost of a gallon of gasoline will be higher than $3.00 by this time next year.

  4. Dr. Paulson thinks there’s an 80% chance that Moore’s Law will continue to hold over the next 5 years.

  5. Dr. Johnson thinks there’s a 20% chance quantum computers will commonly be used to solve everyday problems by 2100.

  6. Kyle is an atheist. When asked what odds he places on the possibility that an all-powerful god exists, he says “2%.”

I’d argue that the degree to which probability is a useful tool for understanding uncertainty declines as you descend the list.

  • The first statement is tautological. When I describe something as “fair,” I mean that it perfectly conforms to abstract probability theory.
  • In the early statements, the probability estimates can be informed by past experiences with similar situations and explanatory theories.
  • In the final statement, I don’t know what to make of the probability estimate.

The hypothetical atheist from the final statement, Kyle, wouldn’t be able to draw on past experiences with different realities (i.e., Kyle didn’t previously experience a bunch of realities and learn that some of them had all-powerful gods while others didn’t). If you push someone like Kyle to explain why they chose 2% rather than 4% or 0.5%, you almost certainly won’t get a clear explanation.

If you gave the same “What probability do you place on the existence of an all-powerful god?” question to a number of self-proclaimed atheists, you’d probably get a wide range of answers.19

I bet you’d find that some people would give answers like 10%, others 1%, and others 0.001%. While these probabilities can all be described as “low,” they differ by orders of magnitude. If probabilities like these are used alongside probabilistic decision models, they could have extremely different implications. Going forward, I’m going to call probability estimates like these “hazy probabilities.”
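A quick arithmetic sketch shows how much the choice among such answers can matter once the numbers are fed into an expected value calculation. The payoff of 1,000,000 units and the alternative worth 5,000 units are hypothetical figures chosen only to make the comparison concrete.

```python
# The three answers mentioned above: 10%, 1%, and 0.001%.
hazy_probabilities = {"10%": 0.10, "1%": 0.01, "0.001%": 0.00001}

# Hypothetical figures, chosen only for illustration.
payoff_if_true = 1_000_000        # value realized if the hazy scenario holds
well_grounded_alternative = 5_000  # expected value of a better-understood option

expected_values = {
    label: p * payoff_if_true for label, p in hazy_probabilities.items()
}
# Which answers make the hazy option look better than the alternative?
hazy_option_wins = {
    label: ev > well_grounded_alternative for label, ev in expected_values.items()
}
spread = max(expected_values.values()) / min(expected_values.values())

for label, ev in expected_values.items():
    verdict = "beats" if hazy_option_wins[label] else "loses to"
    print(f"p = {label:>6}: expected value {ev:>9,.0f}, {verdict} the alternative")
print(f"Largest expected value is about {spread:,.0f} times the smallest")
```

All three answers count as “low,” yet two of them favor the hazy option and one does not, and the expected values differ by a factor of roughly ten thousand.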

Placing hazy probabilities on the same footing as better-grounded probabilities (e.g., the odds a coin comes up heads) can lead to problems.

Part 3: Hazy probabilities and prioritization

Probabilities that feel somewhat hazy show up frequently in prioritization work that effective altruists engage in. Because I’m especially familiar with GiveWell’s work, I’ll draw on it for an illustrative example.20 GiveWell’s rationale for recommending charities that treat parasitic worm infections hinges on follow-ups to a single study. Findings from these follow-ups are suggestive of large, long-term income gains for individuals who received deworming treatments as children.21

There were a lot of odd things about the study that make extrapolating to form expectations about the effect of deworming in today’s programs difficult.22 To arrive at a bottom-line estimate of deworming’s cost-effectiveness, GiveWell assigns explicit, numerical values in multiple hazy-feeling situations. GiveWell faces similar haziness when modeling the impact of some other interventions it considers.23

While GiveWell’s funding decisions aren’t made exclusively on the basis of its cost-effectiveness models, they play a significant role. Haziness also affects other, less-quantitative assessments GiveWell makes when deciding what programs to fund. That said, the level of haziness GiveWell deals with is minor in comparison to what other parts of the effective altruism community encounter.

Hazy, extreme events

There are a lot of earth-shattering events that could happen and revolutionary technologies that may be developed in my lifetime. In most cases, I would struggle to place precise numbers on the probability of these occurrences.

Some examples:

  • A pandemic that wipes out the entire human race
  • An all-out nuclear war with no survivors
  • Advanced molecular nanotechnology
  • Superhuman artificial intelligence
  • Catastrophic climate change that leaves no survivors
  • Whole-brain emulations
  • Complete ability to stop and reverse biological aging
  • Eternal bliss that’s granted only to believers in Thing X

You could come up with tons more.

I have rough feelings about the plausibility of each scenario, but I would struggle to translate any of these feelings into precise probability estimates. Putting probabilities on these outcomes seems a bit like the earlier example of an atheist trying to precisely state the probability he or she places on a god’s existence.

If I force myself to put numbers on things, I have thoughts like this:

Maybe whole-brain emulations have a 1 in 10,000 chance of being developed in my lifetime. Eh, on second thought, maybe 1 in 100. Hmm. I’ll compromise and say 1 in 1,000.

An effective altruist might make a bunch of rough judgments about the likelihood of scenarios like those above, combine those probabilities with extremely hazy estimates about the impact she could have in each scenario and then decide which issue or issues should be prioritized. Indeed, I think this is more or less what the effective altruism community has done over the last decade.

When many hazy assessments are made, I think it’s quite likely that some activities that appear promising will only appear that way due to ignorance, inability to quantify uncertainty, or error.

Part 4: Bayesian wrong-way reductions

I believe the proposals effective altruists have made for salvaging general, Bayesian solutions to the optimizer’s curse are wrong-way reductions.

To make a Bayesian adjustment, it’s necessary to have a prior (roughly, a probability distribution that captures initial expectations about a scenario). As I mentioned earlier, effective altruists will rarely have the information necessary to create well-grounded, data-driven priors. To get around the lack of data, people propose coming up with priors in other ways.

For example, when there is serious uncertainty about the probabilities of different outcomes, people sometimes propose assuming that each possible outcome is equally probable. In some scenarios, this is a great heuristic.24 In other situations, it’s a terrible approach.25 To put it simply, a state of ignorance is not a probability distribution.

Karnofsky suggests a different approach (emphasis mine):26

It’s my view that my brain instinctively processes huge amounts of information, coming from many different reference classes, and arrives at a prior; if I attempt to formalize my prior, counting only what I can name and justify, I can worsen the accuracy a lot relative to going with my gut…Rather than using a formula that is checkable but omits a huge amount of information, I’d prefer to state my intuition – without pretense that it is anything but an intuition – and hope that the ensuing discussion provides the needed check on my intuitions.

I agree with Karnofsky that we should take our intuitions seriously, but I don’t think intuitions need to correspond to well-defined mathematical structures. Karnofsky maintains that Bayesian adjustments to expected value estimates “can rarely be made (reasonably) using an explicit, formal calculation.” I find this odd, and I think it may indicate that Karnofsky doesn’t really believe his intuitions cash out as priors. To make an explicit, Bayesian calculation, a prior doesn’t need to be well-justified. If one is capable of drawing or describing a prior distribution, a formal calculation can be made.

I agree with many aspects of Karnofsky’s conclusions, but I don’t think what Karnofsky is advocating should be called Bayesian. It’s closer to standard reasonableness and critical thinking in the face of poorly understood uncertainty. Calling Karnofsky’s suggested process “making a Bayesian adjustment” suggests that we have something like a general, mathematical method for critical thinking. We don’t.

Similarly, taking our hunches about the plausibility of scenarios we have a very limited understanding of and treating those hunches like well-grounded probabilities can lead us to believe we have a well-understood method for making good decisions related to those scenarios. We don’t.

Many people have unwarranted confidence in approaches that appear math-heavy or scientific. In my experience, effective altruists are not immune to that bias.

Part 5: Doing better

When discussing these ideas with members of the effective altruism community, I felt that people wanted me to propose a formulaic solution—some way to explicitly adjust expected value estimates that would restore the integrity of the usual prioritization methods. I don’t have any suggestions of that sort.

Below I outline a few ideas for how effective altruists might be able to pursue their goals despite the optimizer’s curse and difficulties involved in probabilistic assessments.

Embrace model skepticism

When models are pushed outside the domains where they were built and tested, caution is warranted. Particular skepticism is called for when a model is presented as a universal method for handling problems.

Entertain multiple models

If an opportunity looks promising under a number of different models, it’s more likely to be a good opportunity than one that looks promising under a single model.27 It’s worth trying to foster several different mental models for making sense of the world. For the same reason, surveying experts about the value of funding opportunities may be extremely useful. Some experts will operate with different models and thinking styles than I do. Where my models have blind spots, their models may not.

Test models

One of the ways we figure out how far models can reach is through application in varied settings. I don’t believe I have a 50-50 chance of winning a coin flip with a buddy for exclusively theoretical reasons. I’ve experienced a lot of coin flips in my life. I’ve won about half of them. By funding opportunities that involve feedback loops (allowing impact to be observed and measured in the short term), a lot can be learned about how well models work and when probability estimates can be made reliably.

Learn more

When probability assessments feel hazy, the haziness often stems from lack of knowledge about the subject under consideration. Acquiring a deep understanding of a subject may eliminate some haziness.

Position society

Since it isn’t possible to know the probability of all important developments that may happen in the future, it’s prudent to put society in a good position to handle future problems when they arise.28

Acknowledge difficulty

I know the ideas I’m proposing for doing better are not novel or necessarily easy to put into practice. Despite that, recognizing that we don’t have a reliable, universal formula for making good decisions under uncertainty has a lot of value.

In my experience, effective altruists are unusually skeptical of conventional wisdom, tradition, intuition, and similar concepts. Effective altruists correctly recognize deficiencies in decision-making based on these concepts. I hope that they’ll come to accept that, like other approaches, decision-making based on probability and expected value has limitations.


Huge thanks to everyone who reviewed drafts of this post or had conversations with me about these topics over the last few years!

Added 4/6/2019: There’s been discussion and debate about this post over on the Effective Altruism Forum.

Become Certified Awesome TODAY!

Hello friends!

Today I’m happy to announce a new, innovative project! The seeds of this idea were planted several months ago when I published Third-party Evaluation: Trophies for Everyone! In that post, I mentioned how legitimate companies seem surprisingly comfortable advertising awards from entities that totally lack credibility.

Since then, I’ve noticed more forms of bogus website endorsements. For example, Comodo Group’s trusted site seals:

A Comodo SSL trust seal indicates that the website owner has made customer security a top priority by securely encrypting all their transactions. This helps build confidence in the site and increases customer conversion rates…For a site seal to be effective, customers have to have confidence in the ‘endorsement brands’ that are on your site. If visitors are to trust you, they must trust the companies behind the logos on your site…Comodo is now the world’s largest SSL certificate authority and over 80 million PC’s and mobile devices are protected using Comodo desktop security solutions. That adds up to a lot of online visitors trusting you because they trust us.1

You can get these seals for free here. You don’t even have to verify that you’re using any kind of security! I indicated that I have a UCC SSL certificate. I don’t have one of those, but look at the cool seal I got!


UC SSL Certificate

SiteLock also offers cool security seals. They look like this:

That’s just an image for illustrative purposes. It’s not a real, verified seal. Getting an actual seal costs money and involves verification. The verification component is interesting. If SiteLock realizes a site is not safe for visitors, will the seal make that clear?

Nope!

If a scan fails site visitors will not be alerted to any problem. The SiteLock Trust Seal will simply continue to display the date of the last good scan of the website. If the site owner fails to rectify the problem SiteLock will remove the seal from the site and replace it with a single pixel transparent image within a few days. At no point will SiteLock display any indication to visitors that a website has failed a scan.2

All this got me thinking. What if I offered free, honest endorsement seals?

This idea had an obvious flaw: a total lack of credibility or credentials on my part. I decided it was time I got myself some credentials. I went to the Universal Life Church (ULC) website and began the arduous process of becoming an ordained minister. After painstakingly entering my personal details and clicking the “Get Ordained Instantly” button, I had my first credential:

My Universal Life Church Ordination

A few days later, I had physical proof:

A lot of people have been ordained by the ULC. To make sure people could know I’m really trustworthy, I went ahead and got a few less common credentials:

After acquiring my credentials, I spent an intense eight minutes creating a professional endorsement seal:

You can get one of these seals for your own website if you certify its awesomeness. If you’re not sure if your website is awesome, the book On Being Awesome: A Unified Theory of How Not to Suck might be able to help. Click below if you’re ready:

✔ Yes, my website is awesome!
Congratulations! Your website is now certified awesome, and you have permission to use the Confusopoly Endorsement Seal™ displayed below. The seal can be shared with the following code:

<a href="https://coveragecritic.com/2019/04/01/become-certified-awesome-today/"><img src="https://coveragecritic.com/wp-content/uploads/2019/03/ConfusopolyEndorsment.png" width="800" height="600" class="aligncenter size-full wp-image-2338" /></a>

May the force be with you,
Dr. Christian Smith, PhD

Misleading Gimmicks from Consumer Reports

You better cut the pizza in four pieces because I’m not hungry enough to eat six.
Yogi Berra (allegedly)

The other day, I received a mailing from Consumer Reports. It was soliciting contributions for a raffle fundraiser. The mailing had nine raffle tickets in it. Consumer Reports was requesting that I send back the tickets with a suggested donation of $9 (one dollar for each ticket). The mailing had a lot of paper:

The raffle had a grand prize that would be the choice of an undisclosed, top-rated car or $35,000. There were a number of smaller prizes bringing the total amount up for grabs to about $50,000.

The materials included a lot of gimmicky text:

  • “If you’ve been issued the top winning raffle number, then 1 of those tickets is definitely the winner of a top-rated car — or $35,000 in cash.”
  • “Why risk throwing away what could be a huge pay day?”
  • “There’s a very real chance you could be the winner of our grand prize car!”

Consumer Reports also indicates that they’ll send a free, surprise gift to anyone who donates $10 or more. It feels funny to donate money hoping that I might win more than I donate, but I get it. Fundraising gimmicks work. That said, I get frustrated when fundraising gimmicks are dishonest.

One of the papers in the mailing came folded with print on each side. Here’s the front:

On the other side, I found a letter from someone involved in Consumer Reports’ marketing. The letter argues that it would be silly for me not to find out if I received winning tickets:

It amazes me that among the many people who receive our Consumer Reports Raffle Tickets — containing multiple tickets, mind you, not just one — some choose not to mail them in. And they do this, despite the fact there is no donation required for someone to find out if he or she has won…So when people don’t respond it doesn’t make any sense to me at all.

The multiple tickets bit is silly. It’s like the Yogi Berra line at the opening of the post; cutting a pizza into more slices doesn’t create more food. It doesn’t matter how many tickets I have unless I get more tickets than the typical person.

Come on. Consumer Reports doesn’t care if a non-donor decides not to turn in tickets. What’s the most plausible explanation for why Consumer Reports includes the orange letter? People who would otherwise ignore the mailing sometimes end up feeling guilty enough to make a donation. Checking the “I choose not to donate at this time, but please enter me in the Raffle” box on the envelope doesn’t feel great.

Writing my name on each ticket, reading the materials, and mailing the tickets takes time. My odds of winning are low. Stamps cost money.

Let’s give Consumer Reports the benefit of the doubt and pretend that the only reason not to participate is that stamps cost money. The appropriate stamp costs 55 cents at the moment.1 Is the expected reward for sending in the tickets greater than 55 cents?

Consumer Reports has about 6 million subscribers.2 Let’s continue to give Consumer Reports the benefit of the doubt and assume it can print everything, send mailings, handle the logistics of the raffle, and send gifts back to donors for only $0.50 per subscriber. That puts the promotion’s cost at about 3 million dollars. The $50,000 of prizes is trivial in comparison. Let’s further assume that Consumer Reports runs the promotion based on the expectation that additional donations brought in will cover the promotion’s cost.

The suggested donation is $9. Let’s say the average additional funding brought in by this campaign comes out to $10 per respondent.3 To break even, Consumer Reports needs to have 300,000 respondents.

With 300,000 respondents, nine tickets each, and $50,000 in prizes, the expected return is about 1.9 cents per ticket.4 About 17 cents per person.5 Not even close to the cost of a stamp.
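Here’s a rough sketch of that arithmetic in Python. Every input is one of the assumptions stated above. The one extra simplification is mine: I treat the whole prize pool as being spread evenly, in expectation, across respondents. The constant names are just labels I made up for the example.

```python
# Back-of-the-envelope raffle math. All inputs are the post's assumptions,
# not figures published by Consumer Reports.
SUBSCRIBERS = 6_000_000        # approximate subscriber count
COST_PER_SUBSCRIBER = 0.50     # assumed all-in cost of the promotion per mailing
AVG_DONATION = 10.00           # assumed additional funding per respondent
PRIZE_POOL = 50_000            # approximate total value of prizes
TICKETS_PER_MAILING = 9
STAMP_COST = 0.55

promotion_cost = SUBSCRIBERS * COST_PER_SUBSCRIBER
break_even_respondents = promotion_cost / AVG_DONATION

# Simplification: the prize pool is split evenly, in expectation,
# across everyone who responds.
return_per_person = PRIZE_POOL / break_even_respondents
return_per_ticket = return_per_person / TICKETS_PER_MAILING

print(f"Promotion cost:         ${promotion_cost:,.0f}")
print(f"Break-even respondents: {break_even_respondents:,.0f}")
print(f"Expected return/ticket: {return_per_ticket * 100:.2f} cents")
print(f"Expected return/person: {return_per_person * 100:.1f} cents")
print(f"Worth the stamp?        {return_per_person > STAMP_COST}")
```

Even if the assumed cost per subscriber or average donation is off by a fair amount, the expected return stays far below the price of a stamp.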


4/12/2019 Update: I received a second, almost-identical mailing in early April.

10/3/2019 Update: I received a few more of these mailings.

Average Download Speed Is Overrated

I’ve started looking into the methodologies used by entities that collect cell phone network performance data. I keep seeing an emphasis on average (or median) download and upload speeds when data-service quality is discussed.

  • Opensignal bases its data-experience rankings exclusively on download and upload speeds.1
  • Tom’s Guide appears to account for data-quality using average download and possibly upload speeds.2
  • RootMetrics doesn’t explicitly disclose how it arrives at final data-performance scores, but emphasis is placed on median upload and download speeds.3

It’s easy to understand what average and median speeds represent. Unfortunately, these metrics fail to capture something essential—variance in speeds.

For example, Opensignal’s latest report for U.S. networks shows that Verizon has the fastest average download speed in the Chicago area at 31 Mbps. AT&T’s average download speed is only 22 Mbps in the same area. Both of those speeds are easily fast enough for typical activities on a phone. At 22 Mbps, I could stream video, listen to music, or browse the internet seamlessly. On the rare occasion when I download a 100 MB file, Verizon’s network at the average speed would beat AT&T’s by about 10.6 seconds.4 Not a big deal for something I do maybe once a month.
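The download-time comparison is easy to check. This is a minimal sketch that assumes 1 MB equals 8 megabits and that each network holds its average speed for the whole transfer. The function name is my own.

```python
def download_seconds(file_megabytes: float, speed_mbps: float) -> float:
    """Seconds to transfer a file at a constant speed (1 MB = 8 megabits)."""
    return file_megabytes * 8 / speed_mbps


FILE_MB = 100
verizon_seconds = download_seconds(FILE_MB, 31)  # average speed cited above
att_seconds = download_seconds(FILE_MB, 22)      # average speed cited above

print(f"Verizon: {verizon_seconds:.1f} s")
print(f"AT&T:    {att_seconds:.1f} s")
print(f"Gap:     {att_seconds - verizon_seconds:.1f} s")
```

Real transfers won’t hold a constant speed, so treat this as a ballpark figure rather than a prediction.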

On the other hand, variance in download speeds can matter quite a lot. If I have 31 Mbps speeds on average, but I occasionally have sub-1 Mbps speeds, it may sometimes be annoying or impossible to use my phone for browsing and streaming. Periodically having 100+ Mbps speeds would not make up for the inconvenience of sometimes having low speeds. I’d happily accept a modest decrease in average speeds in exchange for a modest decrease in variance.5
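To make the point concrete, here’s a small sketch with two made-up sets of speed samples. The numbers are hypothetical and do not come from any carrier or report. The “spiky” network has the higher average, but it spends a quarter of its samples below 1 Mbps. The 1 Mbps cutoff is also just an assumption for illustration.

```python
from statistics import mean, pstdev

# Hypothetical download-speed samples in Mbps (invented for illustration).
spiky = [0.5, 0.5, 10, 20, 30, 40, 47, 100]   # high average, high variance
steady = [18, 20, 22, 24, 26, 28, 30, 32]     # lower average, low variance

USABLE_MBPS = 1.0  # assumed threshold below which browsing gets painful


def share_below(samples, threshold):
    """Fraction of samples that fall below the threshold."""
    return sum(1 for s in samples if s < threshold) / len(samples)


for name, samples in (("spiky", spiky), ("steady", steady)):
    print(
        f"{name:>6}: mean={mean(samples):.1f} Mbps, "
        f"std dev={pstdev(samples):.1f}, "
        f"share below {USABLE_MBPS} Mbps={share_below(samples, USABLE_MBPS):.0%}"
    )
```

A ranking based only on the mean would put the spiky network first, even though it’s the one more likely to leave me staring at a loading screen.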


I’m Not Unbiased

Warning: This post is a rant and contains foul language. Enjoy!


Tons of research suggests that people engage in deception and self-deception all the damn time. People are biased.

Despite this, pretty much every website offering reviews makes claims of objectivity and independence. These websites don’t claim that they try to minimize bias. They claim to actually be unbiased.

Let’s take a look at an excerpt from TopTenReviews, a high-traffic review site:

Methods of monetization in no way affect the rankings of the products, services or companies we review. Period.

Bullshit. Total bullshit.

I’ve ranted enough in the past about run-of-the-mill websites offering bogus evaluations. What about websites that have reasonably good reputations?

NerdWallet

NerdWallet publishes reviews and recommendations related to financial services.

Looking through NerdWallet’s website, I find this (emphasis mine):1

The guidance we offer, info we provide, and tools we create are objective, independent, and straightforward. So how do we make money? In some cases, we receive compensation when someone clicks to apply, or gets approved for a financial product through our site. However, this in no way affects our recommendations or advice. We’re on your side, even if it means we don’t make a cent.

Again, bullshit.

NerdWallet meets Vanguard

Stock brokerages are one of the types of services that NerdWallet evaluates.

One of the most orthodox pieces of financial advice—with widespread support from financial advisors, economists, and the like—is that typical individuals who invest in stocks shouldn’t actively pick and trade individual stocks.2 This position is often expressed with advice like: “Buy and hold low-cost index funds from Vanguard.”

Vanguard has optimized for keeping fees low and giving its clients a rate of return very close to the market’s rate of return.3 Since Vanguard keeps costs low, it cannot pay NerdWallet the kind of referral commissions that high-fee investment platforms offer.

What happens when NerdWallet evaluates brokers? Vanguard gets 3 out of 5 stars.4 It’s the worst rating for a broker I’ve seen on the site.5

NerdWallet slams Vanguard for not offering the sort of stuff Vanguard’s target audience doesn’t want. Vanguard gets the worst-possible ratings in the “Promotions” and “Trading platform” categories. Why? Vanguard doesn’t offer those things.6

Imagine someone going to a nice restaurant and complaining that the restaurant’s steak doesn’t come with cake frosting. NerdWallet is doing something similar.

The following excerpt comes from NerdWallet’s Vanguard review (emphasis mine):

Ask yourself this question: Are you part of Vanguard’s target audience of retirement investors with a relatively high account balance? If so, you’ll likely find no better home. You really can’t beat the company’s robust array of low-cost funds.

Investors who fall outside of that audience — those who can’t meet the fund minimums or want to regularly trade stocks — should look for a broker that better caters to those needs.

Vanguard’s minimum is $1,000.7 You shouldn’t buy stocks if you have less than $1,000 to put into stocks! If you invest in stocks, you shouldn’t regularly trade individual stocks!8

From my perspective, NerdWallet is saying that if you are (a) the typical kind of person that should be buying stocks and (b) you don’t use a stupid strategy, then “you really can’t beat [Vanguard].”

So there we have it. Despite the lousy review, NerdWallet correctly recognizes that Vanguard is awesome.

NerdWallet didn’t really lie, but NerdWallet is definitely biased.9


WireCutter

Sometimes evaluators aim to create divisions between editorial content (e.g., review writing) and revenue generation. I think divisions of this sort are a good idea, but they are not magic bullets.

WireCutter is one of my favorite review sites, but it makes the mistake of overemphasizing how much divisions can do to reduce bias:10

We pride ourselves on following rigorous journalistic standards and ethics, and we maintain editorial independence from our business operations. Our recommendations are always made entirely by our editorial team without input from our revenue team, and our writers and editors are never made aware of any business relationships.

I believe WireCutter takes actions to encourage editorial independence. However, I’m skeptical of how the commitment to editorial integrity is described. Absent extreme precautions, people talk. Information flows between coworkers. Even if editors aren’t explicitly informed about financial arrangements, it’s easy for editors to make educated guesses.11

Bias is sneaky

Running Coverage Critic, I face all sorts of decisions unrelated to accuracy or honesty where bias still has potential to creep in. For example, in what order should cell phone plans I recommend be displayed? Alphabetically? Randomly? One of those options will be more profitable than the other.

I don’t have perfect introspective access to what happens in my head. A minute ago, I scratched my nose. I can’t explain exactly how or why I chose to do that. It just happened. Similarly, I don’t always know when and how biases affect my decisions.

I’m biased

I have conflicts of interest. Companies I recommend sometimes pay me commissions. You can take a look at the arrangements here.

I’ve tried to align my incentives with consumers by building my brand around commitments to transparency and rigor. I didn’t make these commitments for purely altruistic reasons. If the branding strategy succeeds, I stand to benefit.

Even with my branding strategy, my alignment with consumers will never be perfect. I’ll still be biased. If you ever think I could be doing better, please let me know.

Beware of Scoring Systems

When a third-party evaluator uses a formal scoring system or rubric, it’s a mistake to assume that the evaluator is necessarily being objective, rigorous, or thoughtful about its methodology.

I’ll use Forbes’ college rankings to illustrate.

Forbes argues that most college rankings (e.g., U.S. News) fail to focus on what “students care about most.” Forbes’ rankings are based on what it calls “outputs” (e.g., salaries after graduation) rather than “inputs” (e.g., acceptance rates or SAT scores of admitted applicants).1

Colleges are ranked based on weighted scores in five categories, illustrated in this infographic from Forbes:2

This methodology requires drawing on data to create scores for each category. That doesn’t mean the methodology is good (or unbiased).

Some students are masochists who care almost exclusively about academics. Others barely care about academics and are more interested in the social experiences they’ll have.

Trying to collapse all aspects of the college experience into a single metric is silly—as is the case for most other products, services, and experiences. If I created a rubric to rank foods based on a weighted average of tastiness, nutritional value, and cost, most people would rightfully ignore the results of my evaluation. Sometimes people want salad. Sometimes they want ice cream.

To be clear, my point isn’t that Forbes’ list is totally useless—just that it’s practically useless. My food rubric would come out giving salads a better score than rotten steak. That’s the correct conclusion, but it’s an obvious one. No one needed my help to figure that out. Ranking systems are only useful if they can help people make good decisions when they’re uncertain about their options.

Where do the weights for each category even come from? Forbes doesn’t explain.

Choices like what weights to use are sometimes called researcher degrees of freedom. The choice of what set of weights to use is important to the final results, but an alternative set of reasonable weights could have been used.

When researchers have lots of degrees of freedom, it’s advisable to be cautious about accepting the results of their analyses. It’s possible for researchers to select a methodology that gives one result while other defensible methodologies could have given different results. (See the paper Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results for an excellent demonstration of this phenomenon.)
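Here’s a toy illustration of how much the choice of weights can matter. The schools, category names, scores, and both sets of weights are hypothetical. None of them come from Forbes, whose exact process isn’t public.

```python
# Hypothetical category scores on a 0-100 scale (invented for illustration).
SCHOOLS = {
    "School A": {"salary": 90, "debt": 50, "graduation": 70},
    "School B": {"salary": 60, "debt": 90, "graduation": 75},
    "School C": {"salary": 75, "debt": 70, "graduation": 85},
}

# Two weightings that both look defensible on paper.
WEIGHTS_SALARY_HEAVY = {"salary": 0.6, "debt": 0.2, "graduation": 0.2}
WEIGHTS_DEBT_HEAVY = {"salary": 0.2, "debt": 0.6, "graduation": 0.2}


def rank(schools, weights):
    """Return school names sorted from best to worst weighted score."""
    def weighted_score(name):
        return sum(weights[cat] * score for cat, score in schools[name].items())
    return sorted(schools, key=weighted_score, reverse=True)


print("Salary-heavy weights:", rank(SCHOOLS, WEIGHTS_SALARY_HEAVY))
print("Debt-heavy weights:  ", rank(SCHOOLS, WEIGHTS_DEBT_HEAVY))
```

The underlying data is identical in both runs. Only the weights change, and the first-place and last-place schools swap.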

Creating scores for each category introduces additional researcher degrees of freedom into Forbes’ analysis. Should 4-year or 6-year graduation rate be used? What data sources should be drawn on? Should debt be assessed based on raw debt sizes or loan default rates? None of these questions have clear-cut answers.

Additional issues show up in the methods used to create category-level scores.

A college ranking method could assess any one of many possible questions. For example:

  • How impressive is the typical student who attends a given school?
  • How valuable will a given school be for the typical student who attends?
  • How valuable will a school be for a given student if she attends?

Which question is being answered matters. Depending on the question, selection bias may become an issue. Kids who go to Harvard would probably end up as smart high-achievers even if they went to a different school. If you’re trying to figure out how much attending Harvard benefits students, it’s important to account for students’ aptitudes before entering. Initial aptitudes will be less important if you’re only trying to assess how prestigious Harvard is.

Forbes’ methodological choices suggest it doesn’t have a clear sense of what question its rankings are intended to answer.

Alumni salaries get 20% of the overall weight.3 This suggests that Forbes is measuring something like the prestige of graduates (rather than the value added from attending a school).4

Forbes also places a lot of weight on the number of impressive awards received by graduates and faculty members.5 This again suggests that Forbes is measuring prestige rather than value added.

When coming up with scores for the debt category, Forbes considers default rates and the average level of federal student debt for each student.6 This suggests Forbes is assessing how a given school affects the typical student that chooses to attend that school. Selection bias is introduced. The typical level of student debt is not just a function of a college’s price and financial aid. It also matters how wealthy students who attend are. Colleges that attract students with rich families will tend to do well in this category.

Forbes switches to assessing something else in the graduation rates category. Graduation rates for Pell Grant recipients receive extra weight. Forbes explains:

Pell grants go to economically disadvantaged students, and we believe schools deserve credit for supporting these students.7

Forbes doubles down on its initial error. First, Forbes makes the mistake of aggregating a lot of different aspects of college life into a single metric. Next, Forbes makes a similar mistake by mashing together several different purposes college rankings could serve.

Many evaluators using scoring systems with multiple categories handle the aggregation from category scores to overall scores poorly.8 Forbes’ methodology web page doesn’t explain how Forbes handled this process, so I reached out asking if it would be possible to see the math behind the rankings. Forbes responded telling me that although most of the raw data is public, the exact process used to churn out the rankings is proprietary. Bummer.

Why does Forbes produce such a useless list? It might be that Forbes or its audience doesn’t recognize how silly the list is. However, I think a more sinister explanation is plausible. Forbes has a web page where schools can request to license a logo showing the Forbes endorsement. I’ve blogged before about how third-party evaluation can involve conflicts of interest and lead to situations where everything under the sun gets an endorsement from at least one evaluator. Is it possible that Forbes publishes a list using an atypical methodology because that list will lead to licensing agreements with schools that don’t get good ratings from better-known evaluators?

I reached out to the licensing contact at Forbes with a few questions. One was whether any details could be shared about the typical financial arrangement between Forbes and colleges licensing the endorsement logo. My first email received a response, but the question about financial arrangements was not addressed. My follow-up email did not get a response.

While most students probably don’t care about how many Nobel Prizes graduates have won, measures of prestige work as pretty good proxies for one another. Schools with lots of prize-winning graduates probably have smart faculty and high-earning graduates. Accordingly, it’s possible to come up with a reasonable, rough ranking of colleges based on prestige.

While Forbes correctly recognizes that students care about things other than prestige, it fails to provide a useful resource about the non-prestige aspects of colleges.

The old College Prowler website did what Forbes couldn’t. On that site, students rated different aspects of schools. Each school had a “report card” displaying its rating in diverse categories like “academics,” “safety,” and “girls.” You could even dive into sub-categories. There were separate scores for how hot guys at a school were and how creative they were.9

Forbes’ college rankings were the first college rankings I looked into in depth. While writing this post, I realized that rankings published by U.S. News & World Report and Wall Street Journal/Times Higher Education both use weighted scoring systems and have a lot of the same methodological issues.

To its credit, Forbes is less obnoxious and heavy-handed than U.S. News. In the materials I’ve seen, Forbes doesn’t make unreasonable claims about being unbiased or exclusively data-driven. This is in sharp contrast to U.S. News & World Report. Here’s an excerpt from the U.S. News website under the heading “How the Methodology Works:”

Hard objective data alone determine each school’s rank. We do not tour residence halls, chat with recruiters or conduct unscientific student polls for use in our computations.

The rankings formula uses exclusively statistical quantitative and qualitative measures that education experts have proposed as reliable indicators of academic quality. To calculate the overall rank for each school within each category, up to 16 metrics of academic excellence below are assigned weights that reflect U.S. News’ researched judgment about how much they matter.10

As a general rule, I suggest running like hell anytime someone says they’re objective because they rely on data.

U.S. News’ dogmatic insistence that there’s a clear dichotomy separating useful data from unscientific, subjective data is misguided. The excerpt also contradicts itself. “Hard objective data alone” do not determine the schools’ ranks. Like Forbes, U.S. News uses category weights. Weights “reflect U.S. News’ researched judgment about how much they matter.” Researched judgments are absolutely not hard data.

It’s good to be skeptical of third-party evaluations that are based on evaluators’ whims or opinions. Caution is especially important when those opinions come from an evaluator who is not an expert about the products or services being considered. However, skepticism should still be exercised when evaluation methodologies are data-heavy and math-intensive.

Coming up with scoring systems that look rigorous is easy. Designing good scoring systems is hard.

Thoughts on TopTenReviews

I’m not a fan.

TopTenReviews ranks products and services in a huge number of industries. Stock trading platforms, home appliances, audio editing software, and hot tubs are all covered.

TopTenReviews’ parent company, Purch, describes TopTenReviews as a service that offers, “Expert reviews and comparisons.”1

Many of TopTenReviews’ evaluations open with lines like this:

We spent over 60 hours researching dozens of cell phone service providers to find the best ones.2

I’ve seen numbers between 40 and 80 hours in a handful of articles. It takes a hell of a lot more time to understand an industry at an expert level.

I’m unimpressed by TopTenReviews’ rankings in industries I’m knowledgeable about. This is especially frustrating since TopTenReviews often ranks well in Google.

A particularly bad example: indoor bike trainers. These devices can turn regular bikes into stationary bikes that can be ridden indoors.

I love biking and used to ride indoor trainers a fair amount. I suspect the editor who came up with the trainer rankings at TopTenReviews couldn’t say the same.

The following paragraph is found under the heading “How we tested” on the page for bike trainers:

We’ve researched and evaluated the best roller, magnetic, fluid, wind and direct-drivebike [sic] trainers for the past two years and found the features that make the best ride for your indoor training. Our reviewers dug into manufacturers’ websites and engineering documents, asked questions of expert riders on cycling forums, and evaluated the pros and cons of features on the various models we chose for our product lineup. From there, we compared and evaluated the top models of each style to reach our conclusions. 3

There’s no mention of using physical products.

The top overall trainer is the Kinetic Road Machine. It’s expensive but probably a good recommendation. I know lots of people with either that model or similar models who really like their trainers.

However, I don’t trust TopTenReviews’ credibility. TopTenReviews has a list of pros and cons for the Kinetic Road Machine. One con is: “Not designed to handle 700c wheels.” It is.

It’s a big error. 700c is an incredibly common wheel size for road bikes. I’d bet the majority of people using trainers have 700c wheels.4 If the trainer wasn’t compatible with 700c wheels, it wouldn’t deserve the “best overall” designation.

TopTenReviews even states, “The trainer’s frame fits 22-inch to 29-inch bike wheels.” 700c wheels fall within that range. A bike expert would know that.


TopTenReviews’ website has concerning statements about its approach and methodology. An excerpt from their about page (emphasis mine):

Our tests gather data on features, ease of use, durability and the level of customer support provided by the manufacturer. Using a proprietary weighted system (i.e., a complicated algorithm), the data is scored and the rankings laid out, and we award the three top-ranked products with our Gold, Silver and Bronze Awards.5

Maybe TopTenReviews came up with an awesome algorithm no one else has thought of. I find it much more plausible that—if a single algorithm exists—the algorithm is private because it’s silly and easy to find flaws in.

TopTenReviews receives compensation from many of the companies it recommends. While this is a serious conflict of interest, it doesn’t mean all of TopTenReviews’ work is bullshit. However, I see this line on the about page as a red flag:

Methods of monetization in no way affect the rankings of the products, services or companies we review. Period.6

Avoiding bias is difficult. Totally eliminating it is almost always unrealistic.

Employees doing evaluations will sometimes have a sense of how lucrative it will be for certain products to receive top recommendations. These employees would probably be correct to bet that they’ll sometimes be indirectly rewarded for creating content that’s good for the company’s bottom line.

Even if the company is being careful, bias can creep up insidiously. Someone has to decide what the company’s priorities will be. Even if reviewers don’t do anything dishonest, the company strategy will probably entail doing evaluations in industries where high-paying affiliate programs are common.

Reviews will need occasional updates. Won’t updates in industries where the updates could shift high-commission products to higher rankings take priority?

TopTenReviews has a page on foam mattresses that can be ordered online. I’ve bought two extremely cheap Zinus mattresses on Amazon.7 I’ve recommended these mattresses to a bunch of people. They’re super popular on Amazon.8 TopTenReviews doesn’t list Zinus.9

Perhaps it’s because other companies offer huge commissions.10 I recommend The War To Sell You A Mattress Is An Internet Nightmare for more about how commissions shadily distort mattress reviews. It’s a phenomenal article.

R-Tools Technology Inc. has a great article discussing their software’s position in TopTenReviews’ rankings, misleading information communicated by TopTenReviews, and conflicts of interest.

The article suggests that TopTenReviews may have declined in quality over the years:

In 2013, changes started to happen. The two principals that had made TopTenReviews a household name moved on to other endeavors at precisely the same time. Jerry Ropelato became CEO of WhiteClouds, a startup in the 3D printing industry. That same year, Stan Bassett moved on to Alliance Health Networks. Then, in 2014, the parent company of TopTenReviews rebranded itself from TechMediaNetwork to Purch.

Purch has quite a different business model than TopTenReviews did when it first started. Purch, which boasted revenues of $100 million in 2014, has been steadily acquiring numerous review sites over the years, including TopTenReviews, Tom’s Guide, Tom’s Hardware, Laptop magazine, HowtoGeek, MobileNations, Anandtech, WonderHowTo and many, many more.11

I don’t think I would have loved the pre-2013 website, but I think I’d have more respect for it than today’s version of TopTenReviews.

I’m not surprised TopTenReviews can’t cover hundreds of product types and consistently provide good information. I wish Google didn’t let it rank so well.

Issues with Consumer Reports’ 2017 Cell Phone Plan Rankings


Consumer Reports offers ratings of cellular service providers based on survey data collected from Consumer Reports subscribers. Through subscriber surveying in 2017, Consumer Reports collected data on seven metrics:1

  1. Value
  2. Data service quality
  3. Voice service quality
  4. Text service quality
  5. Web service quality
  6. Telemarketing call frequency
  7. Support service quality

The surveys collected data from over 100,000 subscribers.2 I believe Consumer Reports would frown upon a granular discussion of the exact survey results, so I’ll remain vague about exact ratings in this post. Consumer Reports subscribers who would like to see the full survey results can do so here.

Survey results

Results are reported for 20 service providers. Most of these providers are mobile virtual network operators (MVNOs). MVNOs don’t operate their own network hardware but make use of other companies’ networks. For the most part, MVNOs use networks provided by the Big Four (Verizon, Sprint, AT&T, and T-Mobile).

Interestingly, the Big Four do poorly in Consumer Reports’ evaluation. Verizon, AT&T, and Sprint receive the lowest overall ratings and take the last three spots. T-Mobile doesn’t do much better.

This is surprising. The Big Four do terribly, even though MVNOs are using the Big Four’s networks. Generally, I would expect the Big Four to offer network access to their direct subscribers that is as good as or better than the access that MVNO subscribers receive.

It’s possible that the good ratings can be explained by MVNOs offering prices and customer service far better than the Big Four—making them deserving of the high ratings for reasons separate from network quality.

Testing the survey’s validity

To test the reliability of Consumer Reports’ methodology, we can compare MVNOs to the Big Four using only the metrics about network quality (ignoring measures of value, telemarketing call frequency, and support quality). In many cases, MVNOs use more than one of the Big Four’s networks. However, several MVNOs use only one network, allowing for easy apples-to-apples comparisons.3

  • Boost Mobile is owned by Sprint.
  • Virgin Mobile is owned by Sprint.
  • Cricket Wireless is owned by AT&T.
  • MetroPCS is owned by T-Mobile.
  • GreatCall runs exclusively on Verizon’s network.
  • Page Plus Cellular runs exclusively on Verizon’s network.

When comparing network quality ratings between these MVNOs and the companies that run their networks:

  • Boost Mobile’s ratings beat Sprint’s ratings in every category.
  • Virgin Mobile’s ratings beat Sprint’s ratings in every category.
  • Cricket Wireless’s ratings beat or tie AT&T’s ratings in every category.
  • MetroPCS’s ratings beat or tie T-Mobile’s ratings in every category.
  • GreatCall doesn’t have a rating for web quality due to insufficient data. GreatCall’s ratings match or beat Verizon in the other categories.
  • Page Plus Cellular doesn’t have a rating for web quality due to insufficient data. Page Plus’ ratings match or beat Verizon in the other categories.
World’s best stock photo.
Taken at face value, these are odd results. There are complicated stories you could tell to salvage the results, but I think it’s much more plausible that Consumer Reports’ surveys just don’t work well for evaluating the relative quality of cell phone service providers.

Why aren’t the results reliable?

I’m not sure why the surveys don’t work, but I see three promising explanations:

  • Metrics may not be evaluated independently. For example, consumers might take a service’s price into account when providing a rating of its voice quality.
  • Lack of objective evaluations. Consumers may not provide objective evaluations. Perhaps consumers are aware of some sort of general stigma about Sprint that unfairly affects how they evaluate Sprint’s quality (but that same stigma may not be applied to MVNOs that use Sprint’s network).
  • Selection bias. Individuals who subscribe to one carrier are probably, on average, different from individuals who subscribe to another carrier. Perhaps individuals who have used Carrier A tend to use small amounts of data and are lenient when rating data service quality. Individuals who have used Carrier B may get more upset about data quality issues. Consumer Cellular took the top spot in the 2017 rankings. I don’t think it’s coincidental that Consumer Cellular has pursued branding and marketing strategies to target senior citizens.4

Consumer Reports’ website gives the impression that their cell phone plan rankings will be reliable for comparison purposes.5 They won’t be.

The ratings do capture whether survey respondents are happy with their services. However, the ratings have serious limitations for shoppers trying to assess whether they’ll be satisfied with a given service.

I suspect Consumer Reports’ ratings for other product categories that rely on similar surveys will also be unreliable. However, the concerns I’m raising only apply to a subset of Consumer Reports’ evaluations. A lot of Consumer Reports’ work is based on product testing rather than consumer surveys.

Third-party Evaluation: Trophies for Everyone!

A lot of third-party evaluations are not particularly useful. Let’s look at HostGator, one of the larger players in the shared web hosting industry, for some examples. For a few years, HostGator had an awards webpage that proudly listed all of the awards it “won.”

Many of the entities issuing awards were obviously affiliate sites that didn’t provide anything even vaguely resembling rigorous evaluation:

Fortunately, HostGator’s current version of the page is less ridiculous.

Even evaluations carried out by serious, established entities often have problems. Rigorous evaluation tends to be difficult. Accordingly, third-party evaluators generally use semi-rigorous methodologies—i.e., methodologies that have merit but also serious flaws.

In many industries, there will be several semi-rigorous evaluators using different methodologies. When an evaluator enters an industry, it will have to make a lot of decisions about its methods:

  • Should products be tested directly or should consumers be surveyed?
  • What metrics should be measured? How should those metrics be measured?
  • If consumers are surveyed, how should the surveyed population be selected?
  • How should multiple metrics be aggregated into an overall rating?

These are tough questions that don’t have straightforward answers.

Objective evaluation is often impossible. Products and services may have different characteristics that matter to consumers—for example, download speed and call quality for cell phone services. There’s no defensible, objective formula you can use to assess how important one characteristic’s quality is versus another.

There’s a huge range of possible, defensible methods that evaluators can use. Different semi-rigorous methods will lead to different rankings of overall quality. This can lead to situations where every company in an industry can be considered the “best” according to at least one evaluation method.

In other words: Everyone gets a trophy!

This phenomenon occurs in the market for cell phone carriers. At the time of writing, Verizon, AT&T, T-Mobile, and Sprint all get at least one legitimate evaluator’s approval. (More details in The Mobile Phone Service Confusopoly.)
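To make the “everyone gets a trophy” effect concrete, here is a minimal sketch in Python. The carriers, scores, and weightings are all invented for illustration; none of them come from a real evaluator. Each weighting is a plausible way to aggregate the same metrics into an overall rating.

```python
# Assumption: hypothetical 0-10 scores for three made-up carriers.
scores = {
    "Carrier A": {"speed": 9, "call_quality": 5, "coverage": 6},
    "Carrier B": {"speed": 5, "call_quality": 9, "coverage": 6},
    "Carrier C": {"speed": 6, "call_quality": 6, "coverage": 9},
}

# Assumption: three hypothetical evaluators, each weighting the metrics differently.
weightings = {
    "speed-focused": {"speed": 0.6, "call_quality": 0.2, "coverage": 0.2},
    "call-focused": {"speed": 0.2, "call_quality": 0.6, "coverage": 0.2},
    "coverage-focused": {"speed": 0.2, "call_quality": 0.2, "coverage": 0.6},
}


def winner(scores, weights):
    """Return the carrier with the highest weighted-average score."""
    def overall(carrier):
        return sum(weights[metric] * scores[carrier][metric] for metric in weights)
    return max(scores, key=overall)


# Map each evaluator to the carrier it would crown "best".
winners = {name: winner(scores, weights) for name, weights in weightings.items()}
for evaluator, carrier in winners.items():
    print(f"{evaluator}: {carrier}")
```

With these made-up numbers, each evaluator crowns a different carrier, even though all three work from the same underlying scores.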

Evaluators are often compensated in exchange for permission to use their results and/or logos in advertisements. Unfortunately, details on the specific financial arrangements between evaluators and the companies they recommend are often private.

Here are a few publicly known examples:

  • Businesses must pay a fee before displaying Better Business Bureau (BBB) logos in their advertisements.1
  • J.D. Power is believed to charge automobile companies for permission to use its awards in advertisements.2
  • AARP-approved providers pay royalties to AARP.3

An organization that is advertising an endorsement from the most rigorous evaluator in its field probably won’t be willing to pay a lot to advertise an endorsement from a second evaluator. A company with no endorsements will probably be much more willing to pay for its first endorsement.

Since there are many possible, semi-rigorous evaluation methodologies, maybe we should expect at least one evaluator to look kindly upon each major company in an industry. This phenomenon could even occur without any evaluator deliberately acting dishonestly. For example, lots of evaluators might try their hand at evaluation in a given industry. Each evaluator would use its own method. If an evaluator came out in favor of a company that didn’t have an endorsement, the evaluator would be rewarded monetarily and continue to evaluate within the industry. If an evaluator came out in favor of a company that already had an endorsement, the evaluator could exit the industry.
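The entry-and-exit story above can be sketched as a small simulation. This is a toy model under strong assumptions of my own: each entrant's method favors a uniformly random company, an evaluator stays only if its favorite has no endorsement yet, and the company names and entrant count are hypothetical.

```python
import random


def simulate(companies, n_entrants, seed=0):
    """Return {company: evaluator_id} for the evaluators that stay in the industry."""
    rng = random.Random(seed)
    endorsements = {}
    for evaluator_id in range(n_entrants):
        # Assumption: the evaluator's method honestly picks some favorite;
        # we model that pick as uniformly random across companies.
        favorite = rng.choice(companies)
        if favorite not in endorsements:
            # The company has no endorsement yet, so it pays; the evaluator stays.
            endorsements[favorite] = evaluator_id
        # Otherwise the company won't pay for another endorsement; the evaluator exits.
    return endorsements


companies = ["Company W", "Company X", "Company Y", "Company Z"]  # hypothetical
surviving = simulate(companies, n_entrants=200)
print(f"{len(surviving)} of {len(companies)} companies hold an endorsement")
```

No evaluator in this model rigs its results. The filtering of who stays is enough to leave every company with an endorsement once enough evaluators have tried their hand.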

Bogus Evaluation Websites

Sturgeon’s law: Ninety percent of everything is crap.1

Rankings & reviews online

The internet is full of websites that ostensibly rank, rate, and/or review companies within a given industry. Most of these websites are crappy. Generally, these ranking websites cover industries where affiliate programs offering website owners large commissions are common.

Here are a few examples of industries and product categories where useless review websites are especially common:

  • Credit cards
  • Web hosting services
  • Online fax services
  • VoIP services
  • VPN services
  • Foam mattresses

If you Google a query along the lines of “Best [item from the list above]” you’ll likely receive a page of search results with a number of “top 10 list” type sites. At the top of your search results you will probably see ads like these:

Lack of in-depth evaluation methodologies

Generally, these “review” sites don’t go into any kind of depth to assess companies. As far as I can tell, rankings tend to be driven primarily by a combination of randomness and the size of commissions offered.

Admittedly, it’s silly to think that the evaluation websites found via Google’s ads would be reliable. Unfortunately, the regular (non-ad) search results often include a lot of garbage “review” websites. From the same query above:

Most of these websites don’t offer evaluation methodologies that deserve to be taken seriously.

Even the somewhat reputable names on the list (i.e., CNET & PCMag) don’t offer a whole lot. Neither CNET nor PCMag clearly explains its methodology, and the written content doesn’t lead me to believe either entity went in depth to evaluate the services considered.2

Fooling consumers

If consumers easily recognized these bogus evaluation websites for what they are, the websites would just be annoyances. Unfortunately, it looks like a substantial portion of consumers don’t realize these websites lack legitimacy.

Google offers a tool that presents prices that “advertisers have historically paid for a keyword’s top of page bid.” According to this tool, advertisers are frequently paying several dollars per click on the kind of queries that return ads for bogus evaluation websites:3

We should expect that advertisers will only be willing to pay for ads when the expected revenue per click is greater than the cost per click. The significant costs paid per click suggest that a non-trivial portion of visitors to bogus ranking websites end up purchasing from one of the suggested companies.
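The break-even reasoning can be written out as a quick calculation. Expected revenue per click is the conversion rate times the revenue per sale, so an ad only pays off when the conversion rate is at least the cost per click divided by the revenue per sale. The dollar figures below are hypothetical placeholders, not numbers from Google's tool or from any affiliate program.

```python
def break_even_conversion_rate(cost_per_click, revenue_per_sale):
    """Smallest share of clicks that must convert for an ad to pay for itself."""
    if revenue_per_sale <= 0:
        raise ValueError("revenue_per_sale must be positive")
    return cost_per_click / revenue_per_sale


# Assumption: a ranking site pays $5 per click and earns a $100 commission per sale.
rate = break_even_conversion_rate(cost_per_click=5.0, revenue_per_sale=100.0)
print(f"Break-even conversion rate: {rate:.1%}")
```

Under those assumed numbers, at least one in twenty visitors who click the ad would need to buy through the site for the ad spend to make sense.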

How biased are evaluation websites found via Google?

Let’s turn to another industry. The VPN industry shares a lot of features with the web hosting industry. Both VPN and web hosting services tend to be sold online with reasonably low prices and recurring billing cycles. Affiliate programs are very common in both industries.

There’s an awesome third-party website, ThatOnePrivacySite.net, that assesses VPN services and refuses to accept commissions.4 ThatOnePrivacySite has reviewed over thirty VPN services. At the time of writing, only one, Mullvad, has received a “TOPG Choice” award,5 indicating an excellent review.6

Interestingly, Mullvad doesn’t have an affiliate program. That allowed me to perform a little experiment. I Googled the query “Best VPN service” and received 15 results pointing to websites that ranked VPN services.

Six of the results came from paid ads. None of those six websites listed Mullvad.

Of the nine websites in the organic results, only three listed Mullvad:7

  • Tom’s Guide
  • TheBestVPN.com
  • PCWorld