In which we demystify the potentially divisive question of sample size in UX research.
I am sure at some point in your career you’ve heard Jacob Nielsen say, “Testing with five is enough.” You might have also heard Jared Spool say, “Eight is not enough.” Market Researchers doing multivariate, A/B testing and analytics say, “Sample sizes of 1,000 or even 50,000 are not enough.”
So, which is it and why? Before looking at the answer(s) let’s look at an analogy.
If I asked you: “What tool do I need” and showed you this picture below, what would your answer be?
Well, without context, there is no answer to the question, “What tool do I need?”
I assume you would ask, “What do you need to do” before answering the question. If I say, “I need to punch this nail into the wall,” you might reply, “Ok, use the hammer.”
It’s the same for the “What sample size do I need” question.
With that in mind, let’s look at three main categories and use cases that are common in the UX world to help us discern what sample size is typically needed for each:
If you read Jacob Nielsen’s article closely it would be clear that sample size of 5 is enough to identify 85% of the problems. What’s not always clear is that 85% applies to high probability problems of 31% or higher chance of occurring with at least one participant.
If your objective is to identify big issues (a.k.a. big rocks, lower hanging fruits) a sample size of 5 is enough. But if you want to go beyond that, and there is a business value to fixing and optimizing several workflows, go with larger sample sizes.
Small Rocks (Low probability of occurrence) vs Big Rocks (High chance of occurrence)
A sample size of 5 is enough if you are looking for big rocks. A sample size of 20 is not enough if you are looking for big rocks and several smaller rocks.
The higher the sample size the more issues you will find. In many cases, especially where design has matured and lots of optimization is needed (ecommerce, B2C websites, e.g.), its recommended to test with larger sample sizes of 20, 30 or even more per segment.
Big decisions have bigger implications. That means the person making that decision wants to be confident that they are making the right decision. That’s why I am sure everyone reading this article has met at least one business executive, one engineer, or one marketer who would “intuitively” question smaller sample size.
We might be doing a disservice to the better UX cause by stating “3 out of 8 users liked this” or “4 out of 6 users preferred this design.” “Like” is not an observed behavior, it’s an attitude. Preference can be subjective and vary by several segment attributes.
That’s where “confidence level” and “confidence interval” comes in. Simply put, CL (Confidence Level) is the amount of uncertainty you can tolerate. Confidence Interval or Margins is the amount of error that you can tolerate. Lower margins of error requires a larger sample.
Anyone making a big decision would want at least 90% Confidence Level and be able to measure things at +- 10% margins.
The higher the sample size the smaller the margin of error (at the same Confidence Level)
Comparing two or more design options usually includes comparing those designs on behavioral attributes like Success Rates, Time and/or attitudes like preference, on brand or subjective ratings.
We are trying to detect a difference and claim that Design Option A or Design Option B is better on a certain attribute or multiple attributes.
Without a clear hypothesis of what winning means, it’s hard to know who the winner is.
Sample size depends on the amount of difference you are trying to detect. The smaller the difference you’re trying to detect the larger the sample size needs to be to determine the winner. If the two options are very different (for participants), the chances of detecting significant difference is higher at even smaller sample sizes. But in most cases the differences are not dramatically different, hence the need for a larger sample size.
As an example below, a sample size of 115 at 90% confidence can detect a 10% significant difference. But if you want to detect 3% difference, a sample size of 421 or higher would be needed (at 90% confidence level).
Also, sample size depends on the experimental design: Within Subject or Between Subject. Within Subject essentially means that each participant sees both design A and Design B, in a counterbalanced order of seeing A and B. Between subject means that one group of participants see or use design A and another set of participants see or use Design B. To account for any group differences, between subject needs a higher sample size.
Most UX research we see use Within Subject experiments because it allows for a more apples-to-apples comparison of both A and B with an explicit question at the end (preference) and comparison of both options. The added benefit is smaller sample size. Within subject also helps with ideas to combine A and B into a new option C which has benefits of both A and B.
Depending on your intended outcomes, here are our recommendations about sample size. The bold and blue rows are the most common set of numbers we see across hundreds and hundreds of studies here at UserZoom.