Back to Basics 12: Fine-tuning the Design

Here at Human Interfaces, Inc., we are happy to ring in the new year with the final installment of our “Back to Basics” blog series. In our last couple of posts, we’ve talked about formative user research methods such as moderated usability testing and diary studies. With this article, we want to focus on a couple of methods that can be employed as you near the finish line of a final product or interface design, as well as help plan for future iterations.

A/B Testing & Usability Benchmark Testing

Once a design has undergone several rounds of formative testing, various UX research methods can be employed at the summative testing stage to help fine-tune the design, or to see how a live product stacks up against its competitors. Once major usability issues and pain points have been identified and vetted, a variety of methods can be used to identify the more micro-level design elements that may influence the usability and overall experience of a system. Two commonly used and effective methods for this purpose are A/B testing and usability benchmark testing. A/B testing can help you make decisions about discrete design elements, such as the color and size of a call-to-action button or the way certain elements are organized on a webpage, which can significantly influence desired user actions. Competitor insights are also useful for understanding how a live product or interface performs against its rivals in areas that may need improvement before a future release. Such insights are best attained through usability benchmark testing.

A/B Testing

What is it?

A/B testing, also referred to as split testing, is an experimental method for comparing two versions of a digital design (e.g., a website or mobile application) to determine which one fares better among users. This method can be used to validate a new design, determine the effect of small design changes, and help decide between different versions of a design. Essentially, once two versions of a design have been created, one version is presented to one group of users and the other version is presented to a different group. The relative effectiveness of each design is then compared across user groups using pre-defined metrics (e.g., number of clicks on a CTA button, number of downloads, page views). The chosen effectiveness criteria will depend on the objectives of the project, which should be determined before testing begins.

This method of testing is particularly useful for evaluating small changes (e.g., button color or size), which may be difficult to evaluate using other methods. However, the results fail to reveal the “whys” behind the winning version; fully understanding them requires additional research methods (e.g., moderated usability testing). A/B testing also fails to reveal problems that may exist in the system outside of the discrete elements being tested. Given that A/B testing is a quantitative research method, it is important to complement the findings with qualitative studies in order to understand why users prefer, or perform better with, a certain design version.

Why do it?

  • Understand the effects of isolated design elements on user performance
  • Quick and cost-effective method for completing many rounds of testing with small isolated changes
  • The quantitative results are reliable and can be used to determine statistical significance

What are the drawbacks?

  • A/B testing alone does not solve all usability problems and should not be used in isolation from qualitative testing (e.g., moderated usability testing)!
  • Provides only quantitative measures, and does not reveal the reasons behind measured performance differences
  • Does not provide any information about the users (i.e., are you attracting the right users through your design changes?)
  • Not all rounds of A/B testing produce significant performance differences

How to do it?

Whether you are trying to optimize the user experience, increase revenue, or meet some other goal, preliminary data collection should be completed before the A/B testing is conducted. For example, be sure to determine which page on your website gets the most traffic so that you can get the most bang for your buck when you A/B test it.

The main steps involved when conducting an A/B test include:

  1. Determine the specific test objectives (e.g., improve number of downloads)
  2. Create two versions of a design to test, in which only one variable has been changed (e.g., varying the placement or size of a CTA)
  3. Determine the required sample size (i.e., conduct a power analysis to see how many participants you will need)
  4. Run the test and gather the data
  5. Analyze the results
  6. Implement the results, and continue to A/B test!
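
To make steps 3 through 5 concrete, here is a minimal Python sketch of the statistics behind a conversion-focused A/B test: a rough per-group sample size estimate, followed by a two-proportion z-test on the results. The baseline and target rates (10% vs. 12%) and the standard thresholds (z = 1.96 for a two-sided α of 0.05, z = 0.84 for 80% power) are illustrative assumptions, not values from any particular study.

```python
from math import sqrt, erf

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def sample_size_per_group(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Approximate participants needed per variant to detect a change
    from rate p1 to rate p2 (defaults: alpha = 0.05 two-sided, 80% power)."""
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p1 - p2) ** 2

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates between
    variant A (conv_a successes out of n_a) and variant B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - normal_cdf(abs(z)))
    return z, p_value

# Illustrative: detecting a lift from a 10% to a 12% conversion rate
n = sample_size_per_group(0.10, 0.12)
print(f"Need roughly {n:.0f} users per variant")
```

In practice, a dedicated statistics library or A/B testing platform will handle these calculations (and more nuanced designs) for you; the point here is simply that the sample size must be fixed before the test is run, not after.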

What is the output?

A/B tests are commonly conducted iteratively, with several tests planned in sequence following the implementation of design changes. The final test report usually takes the form of a short summary that concisely communicates the quantitative findings. It may include screenshots of the design versions being tested, with the specific design changes highlighted in call-outs. Based on the gathered metrics, determine which design version performed significantly better among users (if one did), and make this distinction visually apparent so that the findings can be clearly communicated to stakeholders.

Usability Benchmark Testing

What is it?

Benchmark studies can be conducted to evaluate how the experience associated with a single product changes over time (e.g., with subsequent version releases), or to compare the experience of a product against its competitors. Previous product designs or competitors’ products can serve as reference points against which a new design can be measured to better understand the product’s ease of use and user preferences. This type of usability testing is usually summative rather than formative, and is therefore typically conducted on live products.

When comparing the performance of a design against itself, benchmarking is usually implemented after a series of considerable design changes have been made or after a major release (or when a competitor’s product enters the market). The outcome of a benchmarking study can also be used to inform the next steps for design improvements that should be made before the next product update.  

Benchmark studies typically report on basic usability metrics related to the three main measures of usability: effectiveness, efficiency, and satisfaction. Typical metrics include task success, time on task, and subjective user ratings (e.g., satisfaction, task confidence, difficulty, etc.). Often, a goal of the benchmark study will be to establish how the product measures up against industry standards or competitive products, and to assess user preferences (e.g., which of the competitive products do users most prefer?). 
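As an illustration of how these metrics roll up into a benchmark summary, here is a small Python sketch that computes task success rate, average time on task, and average satisfaction for two products. The product names and per-task results are entirely hypothetical, and the convention of reporting time on task for successful attempts only is one common choice, not a universal rule.

```python
from statistics import mean

# Hypothetical per-task results for two products; each record is
# (task_completed, time_on_task_seconds, satisfaction_rating_1_to_7)
results = {
    "Product A": [(True, 92, 6), (True, 130, 5), (False, 210, 3), (True, 75, 6)],
    "Product B": [(True, 88, 5), (False, 195, 2), (True, 101, 4), (True, 99, 5)],
}

for product, trials in results.items():
    successes = [t for t in trials if t[0]]
    success_rate = len(successes) / len(trials)       # effectiveness
    avg_time = mean(t[1] for t in successes)          # efficiency
    avg_sat = mean(t[2] for t in trials)              # satisfaction
    print(f"{product}: {success_rate:.0%} success, "
          f"{avg_time:.0f}s avg time on task, {avg_sat:.1f}/7 satisfaction")
```

Summaries like this map directly onto the three pillars above and translate naturally into the scorecard format described later in this post.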

Although using actual users is always preferred, expert researchers can also complete a benchmark report based on their own evaluation of the product(s). For example, researchers could create a list of usability and business-objective criteria (e.g., task flow, visual appeal, monetization of behaviors) against which each product or interface could be evaluated and rated. But remember, it is always prudent to meet with all stakeholders to finalize this list prior to the evaluation so that you are sure to deliver the most value in your final report.

Why do it?

  • Establishes a “baseline” against which to measure future improvements, which is critical for calculating the ROI of future designs
  • Seeing results over time can help foster a more user-centered focus and engender stakeholder buy-in
  • Keeping an eye on similar products in the market helps ensure that you don’t fall behind your competitors
  • Determine where your product is positively differentiated in the market, and capitalize on that margin
  • Identify areas that may enhance or detract from the user’s experience

What are the drawbacks?

  • Determining scoring criteria across unlike features can be challenging; consult stakeholders (e.g., product managers) for an appropriate weighting system
  • If testing against competitor products, it can be difficult to procure access to all related systems across brands
  • If using expert evaluators, subjective ratings of systems can vary; consider validating ratings with two or more independent evaluators
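
On the last point, one simple way to check whether two independent evaluators’ ratings agree beyond chance is Cohen’s kappa (1.0 means perfect agreement; 0 means agreement no better than chance). The sketch below assumes each evaluator assigned one categorical rating per criterion; the `cohens_kappa` helper is our own illustrative function, not part of any library.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters' categorical ratings of the same items."""
    n = len(ratings_a)
    # Observed agreement: fraction of items where the raters match
    po = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement by chance, from each rater's marginal frequencies
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    pe = sum(count_a[c] * count_b[c] for c in categories) / n ** 2
    return (po - pe) / (1 - pe)
```

A low kappa suggests the scoring criteria are ambiguous and should be tightened (or the evaluators recalibrated) before the ratings go into a final report.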

How to do it? 

Typically, task-based usability benchmark studies will include measures that assess the users’ actions and experiences at both the task level and the test level (using either actual users or expert evaluators). Benchmark studies differ from more general usability studies in that the focus is less on uncovering usability issues and more on gauging the user’s current experience. A common mistake is to include too many tasks or questions within a relatively short amount of time. The challenge is to include enough tasks to sufficiently capture the users’ experience without overwhelming or fatiguing the user during the session (or giving the expert evaluator more tasks than they can reasonably complete). These studies typically contain metrics that assess both behavioral and attitudinal aspects of the user experience. Ideally, the metrics should cover each of the three pillars of usability: effectiveness, efficiency, and satisfaction.

A representative task list may be derived from previous data collection methods, such as focus groups or cognitive walkthroughs. The complexity of the tasks will determine how many the user (or expert evaluator) can complete within the given time frame, but a typical number for a two-hour usability session is around ten. Ten tasks is also a sweet spot for expert evaluators, providing enough hands-on experience for a comprehensive evaluation without exceeding the scope of product features that can reasonably be evaluated. Quantitative task-performance metrics may include task success, time on task, and the number of errors made while attempting to complete a task.

The main steps involved when conducting a benchmark study include:

  1. Set the usability goals for the study, and determine the closest competitors in the product space
  2. Create a core list of representative tasks, and be sure to include the competitive features of interest in the task flow(s)
  3. Pilot test with a few participants, and make changes
  4. Recruit and qualify representative users (if using actual users)
  5. Present the tasks to users in a counterbalanced order
  6. Gather any post-test measures (e.g., System Usability Scale)
  7. Analyze the data, and create a report that summarizes whether the predetermined usability goals were met, which areas need improvement, and how the competition stacks up
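
For step 6, the System Usability Scale is scored with a well-known formula: each odd-numbered (positively worded) item contributes its rating minus 1, each even-numbered (negatively worded) item contributes 5 minus its rating, and the 0–40 raw sum is multiplied by 2.5 to yield a 0–100 score. A small Python helper:

```python
def sus_score(responses):
    """Score one System Usability Scale questionnaire.

    `responses` is a list of ten ratings from 1 (strongly disagree)
    to 5 (strongly agree), in questionnaire order."""
    if len(responses) != 10:
        raise ValueError("SUS has exactly ten items")
    total = 0
    for i, r in enumerate(responses):
        # Items 1, 3, 5, 7, 9 (index 0, 2, ...) are positively worded:
        # contribution = rating - 1. Items 2, 4, ... are negatively
        # worded: contribution = 5 - rating.
        total += (r - 1) if i % 2 == 0 else (5 - r)
    return total * 2.5  # scale the 0-40 raw sum to 0-100

# "Strongly agree" on positive items, "strongly disagree" on negative ones
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
```

Individual scores are then averaged across participants; note that a SUS score is not a percentage, so it should be interpreted against published norms rather than read as “percent usable.”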

What is the output?

Benchmark studies typically produce a lot of raw data. When constructing the final report, it is helpful to go back to the research objectives in order to create a cogent message, and effectively communicate the findings to stakeholders and team leaders.

The final report will likely include a “scorecard” that compares products or interfaces across selected criteria/features, a combination of screenshots or pictures of the product or UI with associated call-outs, charts and tables of the quantitative data (e.g., time on task, percentage of task success, etc.), user quotes, and actionable recommendations for improvements. The results from the benchmark study can be used to influence the product roadmap, including specific design change recommendations and points of differentiation from the competition.

Final Thoughts

Integrating insights from both A/B testing and usability benchmark testing can make a positive impact on the usability and overall experience of a product or service. Fine-tuning a design during the summative testing phase, for instance, can help to optimize your product or service for launch. Similarly, seeing how a live product stacks up against its competitors can help you differentiate future versions of your product or service from the competition. Although UX researchers test interfaces at all stages of the design cycle, the methods discussed here are typically reserved for later design stages, when the system under development is fully functional, so that the design can be fine-tuned and final decisions on feature lists can be made.

At Human Interfaces, an expert team of UX professionals works with clients from a variety of industries to develop custom research solutions for any UX challenge. If you need help from a full-service UX research consultancy for a study, recruitment, or facility rental, visit our website, send us an email, or connect with us through LinkedIn.