What’s the real value behind remote usability testing tools?
We have to be honest: we were surprised and quite puzzled when we read a recent discussion in the IxDA forum. The topic was new remote usability testing tools, and a well-recognized and highly respected member of the User Experience professionals’ community labeled Unmoderated Remote Usability Testing (URUT) as ‘Voo Doo measurement techniques’. The discussion was started about a new tool coming to market (Loop11), but in his reply he also mentioned another tool (Netraker) that has been on the market for many years and proved its worth when it was sold to a bigger company (Keynote Systems) in 2003.
Fortune 500 companies and other organizations such as Google, Yahoo, eBay, Monster.com, Verizon, Sprint, IBM, Orbitz.com, Pegasus, Continental Airlines, Charles River Labs, National Cancer Institute and Orange, among many others, have used URUT and currently use it as one of the techniques in their user experience toolkit. We were therefore surprised to see him reject a research technique outright. This may be a relatively new technique, but we are not all witches here.
We were late getting to the discussion, so we could not respond with our own comments and insight. We think it’s generally good for everyone, including industry professionals, to open up and welcome new ways of understanding web usability. After all, the web is a changing medium and, as such, understanding how people use it will require innovative technologies. It’s also very important to be as specific and detailed as possible when judging what’s valid and what is not, as that may vary quite a bit depending on many factors. URUT results may vary depending on the actual tool being used and, especially, on the purpose and research goals for which it’s used.
In the case of URUT, we see the need to communicate more effectively why, how and when it should be used by companies and UX professionals. In this regard, Tom Tullis, Bill Albert and Donna Tedesco (all Fidelity Investments UX team members; Bill actually heads the Usability Team at Bentley University at the moment) are about to publish a book entitled “Beyond the Usability Lab: Conducting Large-scale User Experience Studies”. They offer tried and tested methodologies for conducting online usability studies and give practitioners the guidance they need to collect a wealth of data through cost-effective, efficient, and reliable practices. www.remoteusability.com also offers interesting and detailed information about what some call ‘automated remote usability testing’, which is the same as URUT. And there are many other articles and publications on the web about this technique.
Furthermore, URUT can be both simple and quite sophisticated. With this article our mission is to explain URUT in a way that will hopefully help people better understand it. Obviously, we write from our own experience using UserZoom’s user testing software technology and methodology, which includes about 7 years specializing in remote usability testing, a large number of remote testing projects, about half of them international testing, and an endless number of participants.
To begin, here is a summary of the issues highlighted in the IxDA discussion. Below each point, we have added our own comments:
‘The pool of invited participants is critically important. Many unmoderated tools offer their own pre-recruited pools, which keeps costs down, but are often low quality participants, such as people who only participate to get the incentive and don’t really use the design’
UserZoom comment: As research professionals who work with other researchers, we understood the importance of proper user recruitment from the very beginning. That’s why UserZoom does not use its own recruiting pools. Instead, we leave that to the panel experts, such as Survey Sampling International (SSI, a company used by most major research companies in the world), who have the experience and ‘bandwidth’ to strictly manage and screen qualified participants. In addition, UserZoom has added several Sample Quality Control measures to screen out low-quality participants who might otherwise slip through. The end result is data from your target market and high-quality results. You could also argue that participants in a lab study are more likely to be taking part only for the incentive, since the cost of paying a lab participant is at least 10 to 20 times the cost of a remote participant. So in this case, with the proper process defined and the help of technology, recruiting participants is not an issue.
‘You are limited in the tasks your participants can perform. Can the system tell if all the values were properly entered?’
UserZoom comment: Participants can perform a wide variety of tasks, just as they typically would on the web. As with moderated research, you need to determine whether the answers given were “properly answered”, that is, whether they reflect the truest answer. Again, with the right technology, a researcher has control of the study even if it’s not moderated. Task validation is absolutely critical in URUT. With UserZoom, we validate whether the user actually found the answer, reached the right page, or finished the task successfully. For example, our technology can validate a task by asking the user a multiple-choice question or by checking the URL reached.
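To make validation by URL more concrete, here is a minimal sketch in Python. It is an illustration only and not UserZoom’s implementation: the function names, the normalization rules and the example.com URLs are all assumptions made for this example.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> tuple[str, str]:
    """Reduce a URL to (host, path) so trivial differences do not matter.

    Assumption for this sketch: query strings, fragments, a leading
    "www." and a trailing slash are irrelevant to task success.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    return host, path


def task_succeeded(final_url: str, success_urls: list[str]) -> bool:
    """Return True if the last page reached matches an accepted success page."""
    accepted = {normalize(u) for u in success_urls}
    return normalize(final_url) in accepted


# Hypothetical task: "complete a purchase", with one accepted end page.
SUCCESS_PAGES = ["https://example.com/checkout/confirmation"]

reached = "https://www.example.com/checkout/confirmation/?ref=nav"
print(task_succeeded(reached, SUCCESS_PAGES))
```

A real tool would likely need looser or stricter matching depending on the site, for example accepting several end pages or matching on a URL pattern, which is why the accepted pages are passed in as a list.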
‘We know from our research at UIE that participants who are actually interested in the task (for example, currently planning a vacation in Paris) will behave substantially differently than those who are asked to pretend to do a task.’
UserZoom comment: We agree and that is why we use strict screening procedures and make sure participants invited are actually the target market for the study and are interested in the tasks being asked. We take the same measure as you would do with a traditional research study in recruiting and targeting participants. No concerns there either in URUT.
‘The site reports standard analytic measures: time on task, “fail pages”, common navigation paths. Without talking to the individual, it’s hard to even know if a reported measure is good or bad, let alone the action the team should take based on the reported result.’
UserZoom comment: Quite the contrary. You can gather qualitative feedback as well as quantitative metrics in a survey after the task, which helps you understand the measure and tie it to the navigation paths. As a matter of fact, participants are very honest in their comments, because they can make them anonymously and without fear of “hurting our feelings.” They are not swayed by undetected cues in a moderator’s body language, or by the nerves that come with knowing they are being “watched” from another room.
‘In the ten years since I first started seeing these tools on the market, I’ve never seen results from a study that the team could actually interpret and act on.’
UserZoom comment: On the contrary (once again), we’ve heard from many of our clients that the data collected was quite actionable, that they actually implemented the changes, and that the result was always positive. Certainly, results can be bad in ANY study if the tasks are not set up correctly or the data is not analyzed properly. In fact, the data received has impacted many sites in a positive way and has ultimately improved their ROI.
‘In the Netraker results, 94% of the participants completed the tasks and the average time was 1m 18s. In our study, only 33% of the participants completed the task and the average time was 18 minutes.’ ‘Why do you think there were such striking differences? Which study would you pay more attention to?’
UserZoom comment: We think this is a tough point to bring up and, as such, it requires a far more detailed discussion. A variety of factors can lead to these differences, and without knowing the details it is hard to pinpoint them and come to conclusions. First, you have to look at the number of participants: 94% of what? 100? 200? And 33% of what? 8? 6? You can’t compare the percentages at face value. Also, was the lab study think-aloud? If so, times will be longer, because participants stop to verbalize their thoughts as they work. Were the tasks identical across the studies? What we have found is that when you run the same study in the lab and with URUT, the two complement each other. With a URUT study you gain quantitative metrics from a geographically spread sample, so you can concretely determine and validate your issues. With a lab-based study you gather the think-aloud data, which ties in nicely with the quantitative metrics.
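To illustrate why the percentages cannot be compared at face value, here is a small Python sketch that puts a 95% confidence interval around each completion rate. The sample sizes are not known from the discussion; the figures of 100 and 6 participants used below are purely hypothetical, taken from the guesses above. The Wilson score interval is our own choice of method for this example, as it is commonly recommended for small samples.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical sample sizes: 94 of 100 remote participants, 2 of 6 lab participants.
for label, successes, n in [("remote", 94, 100), ("lab", 2, 6)]:
    low, high = wilson_interval(successes, n)
    print(f"{label}: {successes}/{n} completed, 95% CI {low:.0%} to {high:.0%}")
```

Under these assumed sample sizes, the remote completion rate is pinned down to a band of roughly ten percentage points, while the lab figure could plausibly lie anywhere from about 10% to about 70%. The intervals do not explain the difference between the two studies, but they show how much less a percentage from a handful of participants can tell you on its own.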
URUT and (not vs.) lab-based usability testing
Here’s one hot topic, relevant to the last point mentioned in the previous paragraph:
Many people try to understand how unmoderated remote testing compares to other, more traditional and widely used research methods, such as Lab-based usability testing. To understand the difference between the two methods and the value in adding URUT to your methods toolkit, we must first break apart each method at a high level and understand the value each brings to the table. Hopefully after doing this, the differences will be highlighted and the comparison can be made:
In a traditional lab-based research study, between 6 and 10 participants (varying according to needs and perspective) are brought into a “lab” environment to run through a series of tasks. Participants work on a pre-configured PC or Mac in a pre-configured environment while being observed from a separate room, either via monitor or through a one-way mirror. During the study, participants are given tasks and asked to perform them with a researcher sitting next to them or in the other room.
If a think-aloud protocol is used, participants are asked to express their thoughts out loud, and the researcher is free to probe or ask further questions while the participant is walking through the task and afterwards. Alternatively, participants can walk through the task with no interruptions, with questions left until after the task or after the study, so that time on task can be gauged. There are variations to this method; however, we are defining the method in its traditionally used form.
The value in this method is the ability to probe users while they are walking through their tasks, gather visual cues including facial expression and body language, provide assists to stumped participants, and change your question set or even your tasks midway through your research study. These are just a few of the important values; obviously there are more, depending on the variation of the method.
In a URUT study, hundreds of participants take part from their own computers, in their own environments, often simultaneously. During the study, participants are given tasks by the URUT tool and asked to walk through them as they normally would, then provide feedback after the tasks via Likert-scale, open-ended, multiple-choice and single-choice questions (just to name a few). Participants act and respond naturally because they participate on their own time, when it is convenient for them. All participants are given the tasks in the same manner and format, virtually eliminating moderator bias.
The value in this method is the ability to have participants take part from their natural context, obtain a cross-representation of your population across the country or internationally, gather statistically significant data, and do so very cost-effectively. Again, we are naming just a few of the important values; obviously there are more, depending on how you test and what you are testing.
In outlining the differences and values of each method, you can clearly see that one is NOT meant to replace the other; rather, each meets a specific need and they complement each other. Now that we understand the differences and the value in each, we can explore why, when, and how to use URUT to gain the most value and benefit in your research.
So… why, when and how would you use URUT?
To quantify your usability research: Suppose you have a huge customer base. This customer base includes different personalities, different usage patterns, and different perspectives. Quantifying your usability is the only way you can ensure that you are reaching a true representation of your diverse population. Not only can you gain valuable data that reflects your true population, but you can also validate your lab findings or, alternatively, identify which critical tasks you need to probe in a lab-based study.
To test users in their natural context: Your computer and environment are different from your friend’s computer and environment, and most likely different from those of a good portion of the population. Testing participants in their natural context accounts for different systems, configurations, and setups. The data you gain not only reflects a mix of these various environments and setups, but the method also encourages participants to act as they normally would, as they are not being “observed.”
To understand users’ behavior: You want to understand why users are coming to your site and what they do once they come there. URUT uses a combination of web analytics (where users go) and surveys (the why) to create a complete picture and provide valuable data in providing the best user experience for your site.
To validate or define your lab-based research: You want to ensure that the research you are currently conducting is valid and a true representation. With URUT, not only do you gain valuable data that quantitatively solidifies your current research, but you can also use it to identify the key critical issues and tasks to bring into the lab for further probing.
To test internationally without traveling: International research is very expensive and at times put aside due to the cost and time commitment. URUT gives you the flexibility to conduct a study in many international locations without taking a step out of your home. Not only does it remove the expense of travel, but it also removes the need for all your collected data to be translated in order to analyze it. URUT removes the barriers that have traditionally impeded this very critical research.
To conduct benchmark studies: URUT allows researchers to obtain statistically significant usability metrics on how a website performs vs. other versions of the site or vs. competing sites. Therefore, it’s a great way to take usability testing a step further to actually measure user experience and compare results either across time or through ‘industry benchmarking’.
Other FAQs and discussions on URUT:
Should I conduct URUT always combined with moderated testing?
URUT is a very valuable method that, if used correctly, can definitely be used as a stand-alone method. However, this does not devalue other methods such as qualitative lab studies, focus groups and contextual inquiries, all of which can complement URUT nicely and each of which has its place and purpose in the UX ecosystem. We firmly believe that you must use the right method and tool for the goal you have or the data you need. In user experience, very often a combination of methods and tools yields the best results.
How do you resolve the issue of recruiting the right sample of participants and guarantee the validity of their behavior and feedback?
As with all recruiting methods, the key is a top-notch screener. Without that, no one can guarantee quality. In addition, sample quality controls help ensure the validity of your results and your sample. In this respect, recruiting for URUT is no different from recruiting for any other research method you use.
How often should I use a URUT method for research?
As with any research method, URUT should be an integrated part of your usability roadmap. URUT can easily be used for international research, iterative testing, prototype testing, pre-code freeze research, quick and dirty usability research, and much more. That said, URUT should be used regularly to make the most of your research and ultimately your ROI.
How easy is a URUT tool to use?
Actually, quite easy. As with any new software tool, you need to spend time with it to get familiar with it and all its functionality. With our tool, many of our clients are up and running with a study in a matter of hours, without assistance. URUT tools are made for researchers and, in our case, by researchers, so the tool is built with ease of use and a great user experience in mind.
Can URUT track web 2.0 sites built on Flash or Ajax?
Yes! Most sites today, including Flash and Ajax sites, are built to be trackable. That means their elements are tagged so that they can be tracked by web analytics tools, which would include URUT tools.
Necessary requirements a URUT tool must have:
We realize technology plays a huge role in URUT. Perhaps not every tool works the same way and some do things better than others. Generally speaking, we have a pretty good idea of what a tool should do and what features it should have in order to make URUT a valid and powerful research method. Here is a basic list:
The ability to collect quantitative metrics such as effectiveness, efficiency, and success and abandon ratios. UserZoom technology, for example, is based on the ISO 9241-11 standard, which defines the usability objectives to be evaluated: how effective (success and error) and efficient (time and clicks) participants were when completing a task, and whether they were successful or there was a high rate of abandonment due to frustration and/or errors.
A strict recruitment process that includes sample quality control: the ability to screen out cheaters (participants who interact very little or not at all with the task) and speeders (participants who spend less than the expected time on the task), plus manual exclusion (the ability to remove a particular participant based on the researcher’s judgment of that participant’s quality).
The ability to validate tasks by question or by URL: ensuring that a participant accurately completed the task, either by asking a question related to the task or by checking that the participant reached the correct page (URL).
Flexible and advanced scripting capabilities that put the researcher in control, such as logic, branching, piping, system text customization, sophisticated randomization, control of completes by participant profile, and much more.
Advanced analytics capabilities such as filtering, clickstream management and aggregation tool, heatmaps, graphs, charts, participant data exclusion, exporting capabilities and much more.
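As a rough illustration of how the first two requirements fit together, the Python sketch below first screens out low-quality sessions and then computes a few task metrics on what remains. It is a simplified example under our own assumptions, not a description of how any particular tool works: the Session fields, the two thresholds and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Session:
    participant: str
    seconds: float  # time spent on the task
    clicks: int     # interactions recorded during the task
    outcome: str    # "success", "error" or "abandon"


MIN_SECONDS = 10  # hypothetical cut-off below which a participant is a "speeder"
MIN_CLICKS = 2    # hypothetical cut-off below which a participant is a "cheater"


def screen(sessions, excluded=frozenset()):
    """Drop speeders, cheaters and manually excluded participants."""
    return [
        s for s in sessions
        if s.seconds >= MIN_SECONDS
        and s.clicks >= MIN_CLICKS
        and s.participant not in excluded
    ]


def summarize(sessions):
    """Effectiveness (success, abandon) and efficiency (time, clicks) metrics."""
    if not sessions:
        raise ValueError("no sessions left after screening")
    n = len(sessions)
    wins = [s for s in sessions if s.outcome == "success"]
    return {
        "participants": n,
        "success_ratio": len(wins) / n,
        "abandon_ratio": sum(s.outcome == "abandon" for s in sessions) / n,
        "mean_seconds_on_success": mean(s.seconds for s in wins) if wins else None,
        "mean_clicks_on_success": mean(s.clicks for s in wins) if wins else None,
    }


# Hypothetical raw data for one task.
raw = [
    Session("p1", 60.0, 8, "success"),
    Session("p2", 90.0, 12, "success"),
    Session("p3", 3.0, 1, "success"),   # too fast, too few clicks: screened out
    Session("p4", 45.0, 6, "abandon"),
    Session("p5", 70.0, 9, "error"),    # removed manually by the researcher
]

kept = screen(raw, excluded={"p5"})
summary = summarize(kept)
print(summary)
```

Note that screening happens before any metric is computed; in this made-up data, leaving the speeder in would inflate the success ratio and drag down the average time on task.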
Closing word on URUT by Rob Aseron, Principal Researcher at Yahoo!
Rob Aseron has used URUT for about 3 years and has conducted over 15 projects at Yahoo! Even though we have not worked with Rob or Yahoo! before, we recently shared a panel discussion at the UPA Conference in Portland dedicated to this subject. Over 100 people showed up. Here’s what he has to add to this article:
As a researcher I’ve seen remote unmoderated studies be like any other method in some ways:
- It takes some time to learn how to do them well.
- Executed with diligence and care, one can obtain worthwhile insights, even powerful insights, but execute a study without all your ducks in a row and you’ll find yourself trying to explain your results to yourself, or explaining to your customers why they should listen to you.
- When the wrong method is applied for a particular research question, well, you know what happens (see previous sub-bullet).
- The findings are most interpretable, powerful and actionable when set in a context that includes other data, particularly from complementary methods.
In other ways, URUT projects are unlike other methods and are an indispensable part of the skilled researcher’s toolbox. Consider, for example, the following:
- There are situations in an organization where only large sample size data will receive a hearing. Although I know how to effectively explain the value of a 6-8 person lab study, I also know there are times when no amount of explanation will matter.
- If one needs to create a metric or look for the relationship between user performance or self-report variables and server-side variables there’s nothing better.
- Easily observing the click target distribution on individual pages and navigation paths across an entire site provides a rich context in which to understand other observations.
Ultimately, I only see my use of URUT expanding in the future. The method allows me to be efficient in data collection in a way that I wouldn’t know how to replace.
As the web becomes a more complex, mature place and people use it and interact with it in different ways, user experience and usability testing and measurement must evolve and continuously innovate. UX professionals should be open to, and welcome, new efforts and initiatives. URUT is an example of this innovation and has proven its worth over the past 6 or 7 years, since various tools came to market.
The key to solid research lies not only in proper execution and the right technology, but also in the ability of the research team to understand that different data comes from different methods and tools, and that each should be used with a purpose and to meet specific goals (what, why, when and how). The combination of methods and tools is often the best way to go. URUT is a great choice for specific purposes and, if well executed, can become an invaluable source of data about user experience.