Written by Jeff Sauro.
Benchmarking is an essential step to understanding how design changes actually improve the user experience. A reliable benchmark helps you differentiate real improvements from random noise. A lot goes into a benchmarking study, and any of those things can go wrong. It’s important to get the details right because you’ll use the benchmarks to make comparisons over time, and the wrong decisions can have an impact that lasts for years.
Here are five of the more common mistakes made when conducting benchmark studies and what you can do to prevent them.
1. Testing the wrong type of participant
Benchmarking studies need participants. A lot of services promise to deliver test participants quickly and easily. While they’re great for reaching the general population, they’re not ideal for finding participants with specific profiles; maybe you need accountants, IT administrators, radiological technicians, or people who have recently sold a home. That matters because both domain knowledge and the motivations specific to specialized tasks have a major impact on benchmark data.
What to do: Understand the essential domain knowledge and skills of your users and recruit accordingly using a good panel provider.
If you’re unsure whether you’re using the right mix of participants, record the participants’ skills and knowledge so you can account for discrepancies between the actual and ideal profiles over time. For example, if your sample in year 1 had a lot of experienced users and year 2 had more novice participants, then you’ll likely need to account for this discrepancy in the analysis.
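One possible way to account for a shift in participant mix is to reweight each group's results to a common target profile. The sketch below is an illustration only: the group names, the rates, and the `reweighted_rate` helper are hypothetical, and simple reweighting is just one of several adjustments you could apply.

```python
def reweighted_rate(rates_by_group, target_mix):
    """Weight each group's observed rate by that group's share in a target profile.

    rates_by_group: observed metric per group, e.g. completion rate (hypothetical data).
    target_mix: share of each group in the profile you want to compare against;
                the shares must sum to 1.
    """
    if abs(sum(target_mix.values()) - 1.0) > 1e-9:
        raise ValueError("target_mix shares must sum to 1")
    missing = set(target_mix) - set(rates_by_group)
    if missing:
        raise ValueError(f"no observed rate for groups: {sorted(missing)}")
    return sum(rates_by_group[group] * share for group, share in target_mix.items())


# Hypothetical year 2 completion rates, split by experience level.
year2_rates = {"experienced": 0.90, "novice": 0.60}

# Hypothetical year 1 mix: mostly experienced participants.
year1_mix = {"experienced": 0.70, "novice": 0.30}

# Year 2 completion rate as it would look with the year 1 participant mix.
adjusted = reweighted_rate(year2_rates, year1_mix)
print(f"Year 2 rate reweighted to the year 1 mix: {adjusted:.2f}")
```

This only works if you recorded experience level for every participant, which is why capturing skills and knowledge during the study matters.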
2. Using the wrong tasks
For a task-based benchmarking study, such as one for a retail website, most metrics are affected by the tasks you have participants perform. If you provide irrelevant tasks, then you’ll get irrelevant results. But it’s more complicated than just knowing what users are trying to accomplish on a site. You also have to effectively simulate these tasks and have the right type of validation. All too often I see tasks, especially in unmoderated studies, that are not representative of what users actually do on a site.
What to do: Don’t pick tasks because they’re easy or seem right. Use data from a top-tasks analysis and get stakeholder buy-in. Answer the question: If most people fail this task, will the stakeholders care? Then be sure the success criteria are realistic (not too hard or too easy). It takes pretesting and some experience to craft tasks that will provide meaningful data.
3. Not collecting the right or enough metrics
Many benchmarking studies are bloated with too many tasks and questions. While the study should be a manageable length so participants don’t get too fatigued, you still need to collect enough data to describe the user experience. This means you need to measure what participants are doing (behavioral metrics), what they think (attitudinal metrics), and who they are (experience and demographics).
4. Having too small of a sample size
When budgets are tight, sample sizes are one of the first things that get cut in benchmarking studies. It’s understandable when the cost per participant is high. However, you should look at the cost of participants as the smaller incremental cost compared to the initial fixed cost of setting up and planning the benchmark.
It’s a waste to build an expensive factory to churn out only 20 products. It likewise doesn’t make sense to go through the trouble of planning a benchmark study only to collect data from a few participants. With too few participants you won’t be able to differentiate real changes from chance. This is especially the case in competitive benchmarks.
What to do: Understand how much precision you need (based on a future comparison or standalone study) and compute the sample size needed. Call me if you get stuck.
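As a rough illustration of that computation, here is a minimal sketch for a binary metric such as task completion. It assumes the textbook normal-approximation formulas, which may not be the method you would use in practice (small samples and rates near 0% or 100% call for adjustments). The function names and default values are hypothetical.

```python
import math
from statistics import NormalDist


def n_for_standalone(margin, confidence=0.95, expected_rate=0.5):
    """Participants needed so a confidence interval around a proportion
    has roughly the given margin of error (normal approximation).

    expected_rate=0.5 is the most conservative assumption (widest interval).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * expected_rate * (1 - expected_rate) / margin ** 2
    return math.ceil(n)


def n_for_comparison(rate_a, rate_b, alpha=0.05, power=0.80):
    """Participants needed PER GROUP to detect a difference between two
    independent proportions with a two-sided test (normal approximation)."""
    if rate_a == rate_b:
        raise ValueError("rates must differ to compute a sample size")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = rate_a * (1 - rate_a) + rate_b * (1 - rate_b)
    n = (z_alpha + z_power) ** 2 * variance / (rate_a - rate_b) ** 2
    return math.ceil(n)


# Standalone study: completion rate within +/- 10 points at 95% confidence.
print(n_for_standalone(margin=0.10))

# Future comparison: detect a change from 50% to 70% completion.
print(n_for_comparison(0.50, 0.70))
```

Note how quickly the requirement grows as the margin shrinks: halving the margin of error roughly quadruples the sample size, which is why deciding on the needed precision comes first.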
5. Not accounting for sampling error
Using a large sample size (whatever that means to your organization) doesn’t mean you can ignore the very real impact of sampling error on your data. Statistical comparisons allow you to differentiate real changes (the signal) from random chance (the noise) and should be used on samples of any size.
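To make the signal-versus-noise idea concrete, here is a minimal sketch that compares completion rates from two benchmark rounds. It assumes a standard two-proportion z-test with independent samples, which is only one of several tests you might choose and is less reliable with small samples. The counts and the `two_proportion_z_test` helper are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two independent proportions.

    Returns (z, p_value). Uses the pooled proportion for the standard error.
    """
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_error == 0:
        raise ValueError("all participants succeeded or all failed; test undefined")
    z = (rate_b - rate_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 60 of 100 completed the task in year 1, 75 of 100 in year 2.
z, p = two_proportion_z_test(60, 100, 75, 100)
print(f"z = {z:.2f}, p = {p:.3f}")

# Hypothetical data: a 2-point difference with the same sample sizes.
z, p = two_proportion_z_test(60, 100, 62, 100)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these made-up counts, the 15-point change falls below the conventional 0.05 threshold while the 2-point change does not, even though both rounds used the same sample size.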