Usability testing: a step-by-step guide using Yelp as an example. Why do we need numbers anyway?

Methods and artifacts

  • Interview / focus group
  • Mass Poll
  • Field study
  • Card sort
  • Multichannel research

Interview / focus group

A qualitative method for identifying needs and looking for behavioral patterns through a personal conversation with users according to a given plan.

Mass Poll

A quantitative method to validate hypotheses about user behavior or collect new data about user needs through online surveys.

Field study

A qualitative method for finding behavioral patterns through direct observation of user behavior.

Card sort

A method for finding the typical relationships between the concepts or objects of a digital product as they exist in the user's world view.

Multichannel research

A combination of methods to identify needs and behavioral patterns in users at various points of contact with a business product.


Artifacts

  • Business requirements
  • User portraits
  • User activity portraits
  • User requirements

Business requirements

The purpose and objectives of the business that need to be solved with the help of a digital product, as well as the resources and constraints for the implementation of the project.

User portraits

A model of the users, as a rule in the form of a set of personas, explaining their personal goals, motives and barriers, habits and other features of their thinking.

User activity portraits

The structure of users' personal goals and tasks, their knowledge and thought patterns, environment and emotional context, as a rule brought together in the form of contextual scenarios.

User requirements

Goals and objectives of designing a new interface, taking into account the needs of users and the context of using the product.

60,000 usability hours

This is how much time, on average per year, our analysts spend in the laboratory and in the "field", collecting data on users' needs and the structure of their activities, on their motivation and how they make decisions.

Usability research is the only way to focus the entire product team on the real goals and needs of users that affect the success of a digital product. This is the best way to break people's "If I want it, then others should want it" habit of thinking, which is often detrimental to a project.

It is research that reliably tells us who the people using the digital product will be, in what context they will use it, and what they will need in those moments.

It happens that customers come with an already formed idea of users as consumers from a marketing point of view (or as employees who "accurately follow the rules"). This data is critical for business success, but experience design requires looking deeper: it is necessary to take into account not only socio-demographic indicators but also to obtain specific information about how users actually behave.

We strive to understand why, how, when and where people interact with your service and its competitors. What goals and objectives arise in their personal or professional activities related to the proposed solution? What influences decision-making, attracts or repels the use of digital products to solve these problems? What emotional and logical context is the user in while solving a particular task? How does he explain to himself the process of solving the problem? In the course of research, we seek answers to the questions necessary for design.

We use many different methods depending on the business needs of our clients. Some clients come for "insights": they need an influx of new ideas about where usability falls short. Others come with ideas and want to quantify and prioritize them against user needs. Still others need to build an efficient catalog system. The Human Centered Design methodology offers a sufficient set of tools that can be combined to answer important business questions about the user-facing qualities of a product.

Through a deep understanding of user behaviors and mindsets, we identify opportunities for digital product development and help our clients base the development process on a clear description of people's goals, objectives and needs.





Do you periodically run promotions, offer discounts and use other marketing tools, yet buyers do not stay on the site? It is time to observe your customers and find out what is causing the low sales. Quite often the problem lies in usability: simply put, the resource is inconvenient to use, it does not inspire confidence, and navigating it is difficult.

What results can be achieved with website usability testing?

First of all, an influx of new users or customers. Thorough usability testing, followed by elimination of the identified shortcomings, lets you see a positive result almost immediately!

You can lure a buyer to the site in many ways, including by offering a great discount. But this is only half the battle, you still need to get him to place an order. Users easily leave the resource as soon as something becomes incomprehensible to them.

Varieties of usability testing

Depending on the goals, the following types of research are distinguished:

Qualitative testing.

The goal is to find out which of the implemented design elements do not work, and which really bring benefits and improve usability. The test result is recommendations that will help make the web resource as convenient and attractive as possible for the end user.

Quantitative testing.

The goal is to gather a large amount of statistical data and then use it to improve usability. This is a rather expensive procedure, used only when it is necessary to understand the specific reasons why a product or service sells worse than a competitor's.

Comparative testing.

The goal is to make a direct comparison with more successful competitors in the niche. This is one of the most important and effective methods to avoid repeating other people's usability mistakes.

Cross-cultural testing.

The goal is to adapt a foreign product for a Russian buyer or vice versa.

User testing

The goal is to observe visitors' behavior scenarios on the site and evaluate how well the resource performs, with real people involved. This usability testing technique shows how successfully users complete specific tasks and what problems they encounter along the way.

Eyetracking testing

The goal is to evaluate the effectiveness of the location and visibility of the site blocks. It is carried out using a special device that evaluates where and for how long the user is looking.

Persona testing

The goal is to understand how typical representatives of different target-audience groups behave on the site and, based on what is learned about their preferences, to satisfy the widest possible group of users.

Expert analysis

The goal is to identify user interface flaws and make recommendations based on research and well-established rules in the field of web design. Website testing is carried out by usability and marketing experts.

A/B testing

The goal is to offer users several design options and compare the responses. A distinction is made between sequential and parallel A/B tests. In sequential testing, changes are applied to the site right away and the metrics for the periods before and after the change are compared. In parallel testing, several variants are produced and compared against the main one.
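
For parallel tests, the comparison usually comes down to the conversion rates of the variants. Below is a minimal sketch (Python) of such a comparison using a two-proportion z-test; the visitor and conversion counts are invented for illustration, and the z-test is just one common way to compare variants, not something the method above prescribes.

    from math import sqrt, erf

    def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
        """Return (z statistic, two-sided p-value) for the difference in conversion rates."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        p_pooled = (conv_a + conv_b) / (n_a + n_b)
        se = sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
        z = (p_a - p_b) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal CDF via erf
        return z, p_value

    # Hypothetical example: original page (A) vs. redesigned variant (B).
    z, p = two_proportion_z_test(conv_a=120, n_a=4000, conv_b=156, n_b=4100)
    print(f"A: {120 / 4000:.2%}, B: {156 / 4100:.2%}, z = {z:.2f}, p = {p:.3f}")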

All of the above testing methods affect usability in one way or another, but Demis Group, drawing on its extensive experience, focuses on two areas:

  • usability evaluation;
  • comparison with competitor sites.

Preparing for usability testing and getting a report

Preparation consists of 4 main stages:

    Formation of hypotheses. Experts who already know the problem areas of the site's usability put forward hypotheses describing how the design can be qualitatively improved. Whether they are effective can be found out only after all the work has been completed.

    Definition of metrics for testing. For each hypothesis, a measurement is chosen that will make it possible to assess, once the work is done, whether usability has actually improved.

    Persona and scenario definition. A scenario is a set of instructions whose execution allows all problem areas of usability to be tested with maximum efficiency. The persona is a collective image that must match the average buyer in a number of attributes.

    Selection of respondents for testing. A group of people is selected who match the persona's description as closely as possible.

How to select respondents for usability testing?

Respondents are selected by gender, age, social status, position and interests. To analyze the behavior of new users, they must not have visited the web resource before. For a narrowly focused business with a specific target audience, people who understand the topic are chosen; this makes the conclusions as sound as possible. If a web resource offers consumer goods and services, almost any user will do. A screening questionnaire helps select respondents who match the personas.
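
As a rough illustration of such screening, here is a minimal sketch (Python) that filters a pool of candidates against a persona; the persona fields and the candidate records are hypothetical and would in practice come from the screening questionnaire mentioned above.

    from dataclasses import dataclass

    @dataclass
    class Candidate:
        name: str
        gender: str
        age: int
        has_visited_site: bool
        interests: set

    def matches_persona(c: Candidate) -> bool:
        # Hypothetical persona: women aged 30-35, new to the site, interested in pets.
        return (c.gender == "female"
                and 30 <= c.age <= 35
                and not c.has_visited_site       # only new users for this study
                and "pets" in c.interests)

    pool = [
        Candidate("Anna", "female", 32, False, {"pets", "travel"}),
        Candidate("Boris", "male", 41, False, {"cars"}),
        Candidate("Vera", "female", 34, True, {"pets"}),
    ]
    respondents = [c for c in pool if matches_persona(c)][:5]   # 3-5 people are enough
    print([c.name for c in respondents])   # -> ['Anna']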

Three to five people are enough for the procedure. Respondents walk the persona's path through the site step by step, following a pre-prepared scenario compiled from the typical tasks that visitors to the resource solve. As they go, respondents comment on their impressions, and video and audio of their actions are recorded. This makes it possible to assess how successfully users completed the tasks and how much time it took them.

How to create a scenario for respondents?

To create a user script, it is important to understand 3 things:

    What tasks visitors of the analyzed web resource are solving. For example, for clients of a hairdressing salon this will be booking a haircut, hair coloring, etc.

    What difficulties users may encounter while solving those tasks. For example, the site has a confusing menu, mandatory registration, an inconvenient form for booking an appointment with a stylist, etc.

    What problems have users already encountered. Perhaps they wrote to the support service about this.

The scenario describes the actual conditions of using the web resource, reveals the problems through a story and helps the respondents settle into the role of the personas. It is written in detail to avoid misinterpretation.

Here are two examples of assignments:

    Vacation from September 10 to 17 is approaching, and tickets have already been bought. You are flying together. The husband wants the hotel to be located near the sea, and the wife needs the Internet and service. Choose a hotel in Italy that will please everyone and book it for a week from 10 to 17 September.

    Book a hotel room in Italy.

The first description is correct. The task should describe the real situation in which the site visitor may find himself, and contain certain conditions and restrictions. But it should not contain hints and look like step-by-step instructions. For example, you don’t need to write like this: “Go to the main page and request a call back by clicking on the button in the upper right corner.”

Examples of our usability testing cases

Website usability testing

The procedure consists of two stages.

    Usability test. It requires a computer with a webcam and Eye Tracking software to record users' eye movements. The program takes into account all the actions that occur on the screen: moving the cursor, using the keyboard, switching between tabs. It is important to record the time codes of the tasks in order to quickly navigate through a long video.

    Survey of respondents. Immediately after the test, users are asked what they expected to see and what they were thinking at particular moments during the tasks. Sometimes the survey itself produces hypotheses about problems on the web resource and ways to change the interface; these assumptions can be checked in the next round.

Analysis of usability testing results

Highlight the issues that need to be addressed first. You can identify them by how many respondents ran into an error and how critical it is. People's opinions should not be taken as direct instructions for action; the analysis should help you see the problem. Just because a test participant found a bug does not mean it needs to be fixed immediately.

Based on the analysis of the results, a list of tasks is created to improve the site: add hints to the registration form, highlight the purchase button, etc. The changes made should be tested next time. The number of checks depends on the goals and budget of the web resource owner.
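
One simple way to do this prioritization is to score each finding by how many respondents ran into it and how severe it is. The sketch below (Python) shows the idea; the issues, the 1-3 severity scale and the frequency-times-severity weighting are all illustrative assumptions, not a fixed rule.

    issues = [
        {"issue": "registration form has no field hints", "respondents_affected": 4, "severity": 3},
        {"issue": "purchase button is hard to notice",     "respondents_affected": 3, "severity": 3},
        {"issue": "footer link opens in the same tab",     "respondents_affected": 1, "severity": 1},
    ]

    def priority(issue):
        # Simple score: how many respondents hit the problem times how critical it is.
        return issue["respondents_affected"] * issue["severity"]

    for issue in sorted(issues, key=priority, reverse=True):
        print(priority(issue), issue["issue"])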

Usability testing helps improve your site and increase conversions. But in order to avoid critical errors before the audit, it is advisable to immediately create a user-friendly resource. The principles of usability specialist Jakob Nielsen will help with this.

Approximate description of a convenient resource

    The user understands what is happening when they receive feedback.

    The resource speaks the language of the target audience and uses understandable terms.

    If the user performs an action by mistake, he immediately sees how to go back and do exactly what he wanted. This takes a minimum of time.

    The web resource has a well-thought-out design, where the occurrence of problems for users is excluded. If errors still appear, visitors are informed about them, and recommendations are given on how to solve them. There is no information and elements in the interface that do not make sense and only distract attention.

    The user does not have to strain their memory: possible actions, choices, objects and buttons are all visible. Instructions for using the site are right in front of them, which simplifies and speeds up the path to the goal.

    The resource is understandable for both beginners and experienced users.

    The error message is written in plain language, and the visitor understands how to solve the problem if it occurs.

    The information on the site is easy to find and meets the needs of the audience.

Friends, welcome to the UserPoint service blog. This is our first publication and before we go directly to the topic of the article, let me tell you in a nutshell about us and how this blog can be useful for you.

In August 2015, our team launched a service for testing websites and applications on real users - UserPoint. With many years of experience in Internet marketing and usability analysis behind us, and constantly studying the best world practices, we are developing a product that is one of the important tools for website analysis and conversion increase.

This blog aims to develop a professional approach to the process of increasing conversion in Russia and the CIS. It will be useful for Internet marketers, owners of online stores and other businesses, as well as usability specialists.

What knowledge will we share?

  • reviews of analytics and A/B testing tools,
  • experience and cases of Russian and foreign companies to increase conversion,
  • usability improvement techniques and user interface secrets,
  • UX research, insights and analytics,
  • interviews and webinars with market experts.

Well, now closer to the topic.

What is usability testing and why is it useful?

Usability testing is a study of how real people use your site or product. You describe a certain scenario - a set of tasks for testing, and users perform them, commenting aloud on their thoughts and actions.

For large and medium-sized businesses and Internet services, this is an important stage of professional usability analysis. Using usability testing along with standard methods and tools (analytics systems, Webvisor, heuristic analysis, etc.), you will be able to put forward the right hypotheses for A/B tests, make changes to the site and achieve a steady increase in conversion.

2. Online testing

The growth of the Internet has fostered remote online testing, which is usually much cheaper and faster. This is a study without real-time interaction with the test participant. If you write the scenario correctly, this method will save you a lot of resources. Of course, you can find a remote focus group, make a list of questions, email them, and ask people to test the site or product on their own, going through the scenario and commenting on their thoughts and actions out loud. But it is much easier to use automation services. The flagship of the foreign market is usertesting.com; on the Russian market we offer its analogue, UserPoint. Huge databases of real testers make it possible to instantly select a focus group by any parameters (for example, ten women from 30 to 35 years old who have a cat) and conduct testing within a couple of days.

The advantages of online testing using specialized services are obvious:

  • everything happens online, you do not need to control the process, users already have special software for capturing and recording the screen,
  • a large database of testers allows you to instantly select a focus group for any parameters (age, gender, country of residence, operating system, as well as any arbitrary parameters),
  • time - 1-2 days (tens of times faster than traditional testing),
  • cash costs are minimal (4900 rubles for 10 users).

Many experts also consider it a serious advantage of online testing that the user stays in their natural habitat, in a comfortable environment, rather than in a laboratory under the supervision of researchers.

Important: this article appeared 10 years ago (the date above is not the date of writing but the date of the last revision). The article is terribly outdated, and we keep it only because, for some unknown reason, no one has written anything better or newer in Russian in all these years.

Usability testing originated in highly bureaucratic fields: the military industry and high-risk civilian production (aircraft, power plants, etc.). As a consequence, usability testing itself was extremely bureaucratic: the point was not just to find a problem but, more importantly, to prove that the problem really exists. In addition, usability testing inherited the techniques and rules of scientific research, in which it is important not only to discover a phenomenon but also to make sure it is not the result of extraneous circumstances. Because the testing process became so complicated, entire teams of narrow specialists were engaged in it: one writes the test tasks, another actually runs the tests, a third analyzes the data, and yet another recruits and shepherds the respondents.

It is easy to see that such testing, for all its merits, is very time-consuming and unacceptably expensive.

In recent years the situation has begun to change: instead of extremely formal and extremely expensive tests, informal and cheap ones are becoming popular. Instead of a small crowd with a pile of expensive equipment, a single person with a minimum of gear comes to the fore. The quintessence of this approach is described in Steve Krug's book Don't Make Me Think!

Of course, simple tests are not universal. For an airplane instrument panel they probably will not do. But for ordinary software, simple tests are much better, if only because more of them can be run.

Here you will learn how to conduct usability testing using only rapid testing techniques. More complex and more formal methods can be found in the specialized literature, or you can invent them yourself.

What is usability testing

Usability testing is any experiment aimed at measuring the quality of an interface or looking for specific problems in it.

The benefits of usability testing are many. Testing allows:

  • Understand how well or badly the interface works, which may either prompt you to improve it or, if it is already good enough, let you rest easy; either way there is a benefit.
  • Compare the quality of the old and new interfaces and thereby justify changes or implementations.
  • Find and identify problematic fragments of the interface, and, with a sufficient sample size, also evaluate their frequency.

At the same time, usability testing cannot turn a bad product into a good one; it just makes the product better.

The first testing of large systems always shows that the interface works much worse than its owner or creator thinks, and much better than the tester initially believes.

Why on the cheap

Three approaches can dramatically reduce the labor intensity, and hence the cost of usability testing:

  1. Some simplification of the concept of usability.
  2. Refusal to collect quantitative data.
  3. Reducing the cost of equipment and reducing the payment for the time of respondents.

The third approach amounts to using only mobile laboratories, which is covered in detail in the "Workplace and ways of recording data" section. The remaining two approaches are described below.

Effectiveness and efficiency

In the main and most commonly used interpretation (ISO 9241-11 standard), the concept of usability is defined as

The extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use.

My own rendering of this wording:

The degree of efficiency*, effort** and satisfaction with which a product can be used by certain users in a certain context of use to achieve certain goals.
* For example, the speed of work, or the number of human errors and how long it takes to correct them.
** For example, the number of operations that must be performed to achieve a result, or the amount of information that must be processed to make a decision. The term efficiency is still often translated into Russian as "productivity"; in my opinion this is a gross mistake, since in the original ISO 9241-11 standard efficiency means something close to the concept of effort.

The main performance indicators are speed of work, speed of learning and the number of human errors (for a more detailed list of metrics, see the section "What exactly to measure?"). These indicators are both necessary and directly influence the design. What is nice is that measuring them is not particularly problematic.

With the indicators of effort, things are somewhat more complicated. This group includes:

  • Success, i.e. the ratio of completed test tasks to those not completed or completed completely incorrectly.
  • Power, i.e. the ratio of the tasks arising in the user's activity to the tasks for which the product is intended.
  • Load on the user (both mental load and, for example, the number of user actions per test task).

There are two problems here. Power is not a measurable indicator at all (how much power a product has is entirely a matter of design). The load on the user is either hard to measure or irrelevant, since it shows up in the performance indicators anyway: under intense load the speed of work decreases. Thus, of all the indicators of effort, only success remains in practice.

There is nothing heretical in this simplification. In real testing the full set of indicators is never collected anyway: only some of them matter for each specific system. However, the key payoff of this simplified wording, namely a significant simplification of the usability testing methodology, cannot be obtained in any other way.

The working definition thus becomes: the degree of efficiency, success and satisfaction with which a product can be used by specific users in a specific context of use to achieve specific goals.

No quantity!

The second possibility to reduce the labor intensity of usability testing is to stop collecting most of the quantitative data. This is done for two reasons:

  • Each specific test can be aimed at obtaining either quantitative or qualitative data. Qualitative data is, as a rule, more in demand in design work, so it is better to plan the test around it.
  • Quantitative data is unreliable anyway. It can be measured reliably, but doing so is extremely laborious.

Quality or Quantity?

Usability testing can be focused either on obtaining quantitative data (needed to measure the usability of an interface) or on obtaining qualitative data (necessary in order to understand what exactly is bad and how to fix it). As a rule, it is impossible to achieve both goals in a single test.

Suppose we measure the speed of work with the system. In this case, the test should be planned in such a way as to exclude any user slowdowns that are unusual for real work. For example, when a user makes a mistake, it will not be possible to ask him any questions to find out the reasons for this error. On the other hand, if you focus on qualitative data, all quantitative results will be questionable.

In fact, absolute quantitative data is something of a luxury in usability: it is more pleasant to have than truly necessary. What is needed much more is dynamics, i.e. the degree of change in these data, and that greatly simplifies life. It does not matter how accurately the errors were counted: if they were counted the same way before and after the interface was optimized, the dynamics will be roughly right. For example, if the counted number of human errors was halved, you can expect (just don't ask me to substantiate this judgment) that it actually halved, even though it is impossible to say exactly how many errors there were before or after in reality.

Unreliable numbers

In addition, the question of whether to trust the results of usability testing at all deserves consideration. After all, testing is not magic: once you admit that testing can be done well, you must also conclude that it can be done badly.

The answer to this question is simple and sad - there is no reason to believe in the results of usability testing at all.

Indeed, even though usability testing takes absolutely real data as input, our inevitable arbitrariness does not allow us to trust it fully. There are too many potential sources of error:

  • Actual users may differ from our selected respondents. In a small sample, even a slight fluctuation in the behavior of respondents can lead a usability specialist to false conclusions.
  • Test tasks may not adequately reflect the real activities of users in the system.
  • A usability specialist may miss some of the problems or misunderstand the essence of the problems that are noticed.

Rolf Molich regularly benchmarks usability testing itself, and the results are sobering. In his second study, in which nine teams of usability specialists of various levels tested the HotMail service with identical test tasks, the scatter of results was very large. All the teams together found 310 interface problems, but three-quarters of those problems were each found by only one team and missed by all the others (and that share includes twenty-nine really serious problems).

Generally speaking, usability testing is essentially scientific research, and all the requirements that apply to scientific research in general apply to it in full.

For example, let's compare usability testing with sociological research. A sociologist takes special measures (complicated and labor-intensive ones, I note) to ensure that respondents are chosen correctly. We do not. A sociologist uses proven, statistically sound tools both for collecting data and for analyzing it. We do not.

So whenever we try to measure something accurately, we are wrong. When we don't even try to be accurate, we are also wrong, maybe not so much, because the bar is already set almost to the floor.

What does this mean in practice? As a rule, we cannot say with certainty, for example, that we have removed all the causes of human error from the interface, simply because with other, perhaps better-matched respondents we would, again perhaps, find more errors. The same consideration holds for other interface quality indicators, and even more so for other test tasks. And what would we find if the testing were planned and conducted by someone more experienced than us? It is scary to imagine.

Thus, we should measure ergonomics only to compare against a new interface, while recognizing that our measurements are by no means exact. They are needed only for planning the next round of optimization. Only one thing can be said with certainty: no matter how much we test, there is always room for improvement, both in the interface itself and in our testing methods.

Why do we need numbers anyway?

Having scolded quantitative data as uninteresting and unreliable, I cannot help but point out its true purpose. Quantitative data is absolutely essential in comparative usability testing. If there are several solutions to choose between, there is simply no alternative to quantitative data: it must be collected, and every possible measure must be taken to ensure its reliability (especially since comparative usability testing needs no qualitative data at all). However, comparative testing is a rarity, so the topic is not covered here at all. If you need to do comparative testing, please contact .

What exactly to measure?

The number of indicators measured in a particular test can be quite large, but all of them, as a rule, come down to a set of five basic characteristics. Below is a list of these characteristics with examples of metrics based on them.

  • Speed of work. Metrics: duration of an operation; time spent detecting errors; time spent correcting errors; the number of commands executed during an operation (the assumption being that the more commands, the longer it takes to issue them); how long it takes to find information in the documentation; the number of commands more efficient than the ones the user actually used; the drop in performance during prolonged work.
  • Errors. Metrics: the percentage of operations that caused an error; the average number of errors per operation for experienced users (experienced specifically, because for inexperienced users factors from the learnability group also come into play); the number of errors not detected and not corrected by users.
  • Learnability. Metrics: the number and frequency of references to the help system; the length of the period between first use of the system and the moment when the speed of work stops increasing or the number of errors stops decreasing; the difference in the number of errors or in speed of work between users with prior experience of the system and users without it.
  • Subjective user satisfaction. Measuring this characteristic involves certain difficulties that deserve separate treatment; see the metrics for this property below.
  • Retention of skills in working with the system. Metrics: the difference in speed or errors between a user after an hour of working with the system and the same user starting to use the system again after a long break.

As you can see, measuring interface quality can be quite difficult; a skill-retention test, for example, can take more than a month. But these are only tests for the second and third components of usability, namely efficiency and satisfaction. Besides them there is also success, which is more important and much easier to measure: you just need to calculate what percentage of tasks the user either completes completely incorrectly or cannot complete at all. This greatly simplifies the life of a usability specialist.
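
As a toy illustration, success really is trivial to compute from a log of task outcomes. The sketch below (Python) assumes a hypothetical list of results where "failed" covers tasks done completely incorrectly or not completed at all; mean time on task is thrown in as a bonus.

    results = [
        {"respondent": 1, "task": "find hotel", "status": "done",   "seconds": 210},
        {"respondent": 1, "task": "book room",  "status": "failed", "seconds": 420},
        {"respondent": 2, "task": "find hotel", "status": "done",   "seconds": 180},
        {"respondent": 2, "task": "book room",  "status": "done",   "seconds": 390},
    ]

    done = [r for r in results if r["status"] == "done"]
    success_rate = len(done) / len(results)
    mean_time = sum(r["seconds"] for r in done) / len(done)
    print(f"success rate: {success_rate:.0%}, mean time on completed tasks: {mean_time:.0f} s")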

How deep is the satisfaction?

Unlike other characteristics, satisfaction is not in the real world, but in the user's head. As a result, it is impossible to “feel” it, and therefore objectively measure it. But at least it can be measured indirectly.

There are two possible courses of action. First, the respondent can be asked how satisfactory the interface seems to him. Secondly, by the behavior of the respondent it is possible to determine whether he likes or dislikes the interface at any particular moment in time; by counting the number of reactions shown, satisfaction can be assessed. Of course, these estimates are relative; their value is only shown in comparison with the new interface or in comparison with competitors.

The following are some methods for measuring satisfaction.

Questionnaire

If you try to determine satisfaction through a survey of respondents, you cannot do without formal questionnaires. Indeed, if the survey format is not fixed, there can be no certainty that the respondents are being asked the same question, which makes the answers doubtful.

Unfortunately, questionnaires have a major drawback in Russian conditions: reliable ones simply do not exist. Although many fully functional questionnaires have been created in the countries of the decaying West (SUMI, QUIS, MUMMS, IsoMetrics, etc.), none of them has been translated into Russian and re-validated. As a result, these questionnaires, which are by the way very expensive, are no more reliable than any questionnaire you could come up with yourself.

Unfortunately, the development and testing of reliable questionnaires is a very long and laborious process, so one cannot count on the rapid appearance of high-quality domestic questionnaires.

Below are two workable, albeit unreliable, questionnaires.

Questionnaire by words

This questionnaire was first proposed by researchers at the Microsoft usability laboratory as a way to measure satisfaction very quickly, albeit admittedly unreliably. The questionnaire is very simple. The respondent is given a sheet of paper with a set of adjectives in mixed order, one half of them broadly positive, the other half negative. The respondent is asked to underline the words that, in his opinion, apply to the product (the resemblance of this questionnaire to the ones used in the semantic differential method is superficial; they are completely different methods). After the questionnaire is completed, the difference between the numbers of positive and negative terms underlined is calculated.

I use the following set of adjectives:

outdated, efficient, unclean, unequal, dull, bright, clean, straight, inconsistent, unconscious, attractive, standard, controlled, good, intuitive, cheerful, amateur, dangerous, boring, joyful, hard, annoying, unpleasant, comfortable, cold, smart, useless, hackshot, warm, light, consistent, mysterious, high-quality, interesting, flexible, beautiful, ugly, unattractive, stupid, confused, convenient, unpredictable, clear, heavy, modern, light, friendly, non-standard, bad, reliable, complex, simple, dark, professional, slow, round, sad, unfriendly, predictable, incomprehensible, fast, sad, pleasant

Note that it is no accident that the words are given in mixed order; this is exactly how they should be presented to the respondents.
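
Scoring such a sheet is mechanical; here is a minimal sketch (Python). Which adjective counts as positive or negative is decided when the questionnaire is prepared; the two small sets below are illustrative only and do not reproduce the full list above.

    POSITIVE = {"efficient", "bright", "clean", "attractive", "intuitive",
                "comfortable", "friendly", "fast", "pleasant", "reliable"}
    NEGATIVE = {"outdated", "dull", "inconsistent", "dangerous", "boring",
                "annoying", "confusing", "slow", "unfriendly", "incomprehensible"}

    def word_score(underlined_words):
        # Positive underlined words minus negative underlined words.
        underlined = {w.lower() for w in underlined_words}
        return len(underlined & POSITIVE) - len(underlined & NEGATIVE)

    # Hypothetical answer sheet from one respondent:
    print(word_score(["bright", "intuitive", "slow", "annoying", "pleasant"]))   # -> 1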

Formal Questionnaire

Unlike the questionnaire by words, this questionnaire cannot be used without adaptation to a specific project. Some of its questions are sometimes irrelevant, sometimes in need of rewording. In any case, for female respondents the grammatical gender of the wording has to be changed.

The questionnaire consists of several questions, for each of which the respondent can choose one of five answers. Please note that I designed this questionnaire only as a post-test, its use in any other capacity is doubtful.

Questionnaire questions:

I made mistakes during the tasks No/Yes
The system is able to do everything I need and even more No/Yes
The system is fast enough No/Yes
I like the look of the interface No/Yes
I feel that if I get to know the system better, I can do things in it that I don't even know about now No/Yes
The system can be easily customized to my needs No/Yes
Getting started was easy; I have not encountered significant difficulties No/Yes
Whenever I made a mistake, I easily noticed and corrected my mistake No/Yes
I am satisfied with my work speed No/Yes
I felt quite confident during the tasks No/Yes
At any given time, I knew what I had to do next No/Yes
The system seems useful to me, I would be happy to use it to solve my problems No/Yes

The results are calculated according to the following algorithm: the central value gives zero points, the extreme values give either -2 points (the left-hand answer) or +2 points (the right-hand answer), and the intermediate values give -1 or +1 point respectively. The total score is the value to be compared.
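
In code this scoring looks as follows; a minimal sketch (Python) assuming answers are recorded as positions 1 to 5 on the scale, with 1 the left-hand answer and 5 the right-hand one. It implements exactly the rule above and nothing more (for instance, it does not reverse-score negatively phrased questions).

    SCORE = {1: -2, 2: -1, 3: 0, 4: +1, 5: +2}   # position on the five-point scale -> points

    def questionnaire_score(answers):
        """Sum of points; the total is what gets compared between test rounds."""
        return sum(SCORE[a] for a in answers)

    # One hypothetical respondent, twelve questions, positions 1 (left) .. 5 (right):
    answers = [2, 4, 4, 5, 3, 4, 5, 4, 4, 3, 4, 5]
    print(questionnaire_score(answers))   # -> 11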

Watching emotional reactions

In addition to the survey, you can count the emotional reactions of the respondent. For example, the respondent smiled - put a plus, swore or grimaced - put a minus. The number and sign of reactions is the desired value of the indicator.

There are problems with this method as well.

First, it is not clear how to count reactions of different strengths. How many times does the respondent have to smile to balance eight seconds of cursing? And nine seconds of scolding?

Secondly, the same person should count the reactions of all respondents, since it is impossible for several people to synchronize their ideas about what, in fact, is included in the concept of an emotional reaction. As a result, the resource intensity of the test increases greatly.

Do not attempt this method if you are even slightly unsure of your ability to pick up other people's emotions (for example, if you are Em rather than Jo).

Another problem is the vagueness of what the test actually measures. Only momentary satisfaction, that is, pleasure, is observed, while perceived satisfaction, which is almost always more important, remains out of view.

What you need for testing

Now that the fundamental issues have been sorted out in general, we can move on to practice. You should start with a list of what you need to collect in one place for usability testing (these items will be described in more detail below). So what do we need:

  • respondents
  • a test method
  • test scenarios
  • a workplace for the test and a well-established way of recording the material
  • the test itself, tried out in a pilot run.

Respondents

When choosing respondents for testing, it is first convenient to determine the general requirements for respondents, and only then select respondents from the target audience, using the generated requirements.

Keep in mind that selecting respondents who are not in the target audience is much more dangerous than it seems at first glance. You will either identify non-existent problems, or you will not identify existing ones. In the worst case, you will simplify the interface so much that it will be difficult for even average users, who are in fact the majority, to use it.

General requirements for respondents

The first item is whether respondents need experience with the system. The general rule is that if the interface of an existing system is being optimized, half of the respondents should have work experience (they can identify relearning problems during implementation), and half should not (they determine the learning rate). If there are competing systems, another proportion is better: a third with experience with the previous version, another third with experience in using competing systems, the remaining third with no experience with the system.

The second point is the level of computer literacy. Other things being equal, the preferred choice is the real level, i.e. one matching the target audience, for three-quarters of the respondents, and a low level for the remaining quarter (the latter helps identify more problems).

The level of computer literacy is conveniently determined by the following scale:

  1. High. The respondent has a computer at work and at home, most of their work is done on the computer, the respondent uses the computer as a means of self-development and actively uses Internet services (for example, regularly buys goods and services in online stores).
  2. Above average. The respondent has a computer at work and at home, most of the work is done on the computer, but the respondent does not use the computer to solve tasks that go beyond his main activity (works on the computer "from call to call" and no more).
  3. Average. Computer use has been part of normal (work or personal) activities for two years or more.
  4. Low. There is a computer either at work or at home, but the experience of working with a computer does not exceed two years and the computer is not a significant tool in the person's work.
  5. Very low. The experience of using a computer is sporadic, the duration does not exceed three years. The computer is not used either at work or at home.

The third factor is age. Optimal proportion: three-quarters of the respondents are the age of the system's target audience, the remaining quarter are older (more problems can be identified with them).

The gender of the respondents has less influence on the results, but this does not mean that it is not necessary to select respondents of the correct gender. It is worth increasing the number of women among respondents compared to the proportion in the target audience, since it is easier to identify problems during implementation on women (women, in general, learn more slowly, but, having learned, work better).

The last significant characteristic is the level of emotional openness of the respondents. The more constrained the respondent, the less of value he will be able to tell you. Even if you establish that a problem exists, you will not be able to learn from him what caused it. There is a great way to solve the problem of insufficient emotional openness: keep a base of respondents and invite them again. A respondent who already knows from experience that there is nothing frightening about usability testing makes contact far more willingly and is generally more talkative.

Finally, when the properties of users desired for the test have already been determined, it is time to select such respondents who not only meet the above requirements, but also belong to the target audience of the system.

How many respondents do you need

In 1992, Robert Virzi in Refining the test phase of usability evaluation: how many subjects is enough? assumed that five respondents were sufficient for the test. A year later, Jakob Nielsen and Thomas K. Landauer took over the baton with the article A mathematical model of the finding of usability problems, in which they argued that five respondents are enough to catch 70% of the problems and three more respondents are needed in order to bring the effectiveness to 85%.
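
The model behind these figures is commonly written as a simple formula: the share of problems found by n respondents is 1 - (1 - L)^n, where L is the probability that a single respondent uncovers a given problem. Below is a small sketch (Python); the value L = 0.22 is chosen only because it reproduces the 70% and 85% figures quoted above, while Nielsen's own often-cited average is about 0.31.

    def share_of_problems_found(n_respondents, discovery_rate=0.22):
        # Nielsen-Landauer style model: 1 - (1 - L)^n.
        return 1 - (1 - discovery_rate) ** n_respondents

    for n in (1, 3, 5, 8, 12):
        print(n, f"{share_of_problems_found(n):.0%}")
    # 5 respondents -> about 71%, 8 respondents -> about 86%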

The usability community loved these numbers with all their hearts. Since then, the phrase "5-8 respondents" has become almost a mantra. Alas, this mantra is false.

First, all three authors wrote only about testing small systems. But what if the system is too large for the whole test to fit into an hour and a half per respondent (the maximum that a person, whether respondent or experimenter, can endure; 40-minute tests are much better)? In that case you will have to run several different tests on different respondents; otherwise it will simply be impossible to cover the entire interface. How many respondents are needed then depends on the system; there can be no clear rule here. To test a large corporate site properly, for example, you need about twenty people in several series of five.

Second, eight people is too few to speak of any accuracy in measuring ergonomic characteristics. For measurement you need at least twelve.

Third, eight people cannot cover gender, age, or any other diversity of respondents. If you want to test an interface designed for several distinct audience groups, each group should get its own five to eight respondents.

On the other hand, the first few respondents do reveal the lion's share of the problems. So the only realistic way out is to test in series: the first series is run, the identified problems are fixed, then the second series, the problems are fixed again, and so on. If all types of respondents are covered in the first series, the later series can safely be cut short once the amount of new findings drops sharply. The first series should be larger, the rest smaller.

Organizational matters

In addition to the actual requirements for respondents, the question remains: how to convince a potential respondent to participate in testing?

If you are designing an interface to order, try to shift the search for respondents onto the customer. Almost always, the system has real or potential users with whom the customer has a special relationship and who, conveniently, are particularly interested in the new interface and are therefore very forthcoming.

If you are designing an interface for a system with a wide (normal) target audience, do not neglect your loved ones. They are both communicative and easily accessible.

Many recruiting companies are recruiting for focus groups, so they can recruit respondents for usability testing as well. Unfortunately, the focus group is a one-time, relatively short event. Usability testing will require scheduling interviews with respondents in turn, one person at a time, which makes the process very difficult.

Maintain a database of people you have already used for testing. As a rule, it is easier to negotiate with them than with those who do not have experience in participating in testing.

When arranging a meeting with the respondent, be as flexible and accommodating as possible. The respondent, even if his time is paid, is doing you a favor by agreeing to participate in testing.

If you found the respondent yourself, even if it is a friend or relative, they should be rewarded for the time spent (you can do without remuneration if the respondents were provided by the customer of the interface; if you used a recruiting service, the remuneration should be agreed with the service's representative). For a non-specific audience, the best incentive is money: the optimal payment is twice the hourly rate of the particular respondent, counting the time the respondent spent on the road. For a specialized audience, very large rewards are often required; in such conditions it is reasonable to reward respondents with valuable gifts on which you can get a large wholesale discount (I personally prefer to use expensive alcohol).

Test Methods

There are only three main methods of usability testing: passive observation of test tasks, stream of consciousness and active intervention; the first one is for collecting quantitative data, the last one is for qualitative data:

  • Passive monitoring of test tasks. The essence of the method is very simple: the respondent performs test tasks, his actions are analyzed (during the test or after, according to protocols), which allows both finding problematic fragments and measuring the ergonomic characteristics of the interface.
  • Stream of consciousness (think-aloud). The same as passive observation, but the respondent is also asked to comment on their actions aloud. The comments are then analyzed. The method is rather unstable but sometimes gives interesting results (much depends on how talkative the respondent is). A major minus of think-aloud is that any measurement of the interface's ergonomic characteristics becomes highly questionable.
  • Active intervention. With this method the usability specialist does not wait for favors from nature in the person of the respondent, but tries to take them himself. After each action the experimenter asks the respondent why he acted that way; on each screen the experimenter asks how exactly the respondent understands the purpose and functions of that screen. This method is closer to a focused interview than to testing proper; it can even be used without test tasks, as long as there is an interface to discuss. Clearly, with active intervention no measurements are possible at all, but the amount of qualitative data obtained is the largest.

Test Scenarios

A test scenario is a testable aspect of the system. In my opinion, well-chosen test scenarios are the most important prerequisite for quality testing.

A scenario consists of a user task and its companions:

  • significant ergonomic metrics
  • test tasks for respondents (there may be several tasks)
  • signs of successful completion of the task.

Let's analyze them in detail.

User task

The first step in defining scenarios is to identify meaningful user tasks. These tasks are the source material for scripting.

What is a user task? It is a task that users' activity poses to them and that has independent value for the user. A user task is carried out as one or more operations (an operation has no independent value of its own). For example, for an email client the tasks include:

  • writing and sending a letter
  • receiving messages from the server
  • customizing the program to suit your needs (for example, setting up automatic mail reception at specified intervals).

But choosing an addressee from the address book when writing a new letter is no longer a task, because this action is not valuable in itself. This is an operation consisting of many actions (click on the To... button > select a contact > confirm the selection).

When choosing tasks for testing, two considerations should be taken into account:

  • All tasks must be realistic, i.e. derived from the actual activities of users. The temptation to make tasks harder in order to find many problems at once should be resisted: tasks should be ordinary, since there is no point in looking for problems that nobody actually faces.
  • Since testing the entire interface on all user tasks is possible only in an ideal world, you have to limit yourself to the important tasks. Important tasks are, first, the frequent ones, i.e. those performed by all users and/or performed often; second, any other tasks that you suspect the system handles poorly; and finally, tasks whose incorrect execution leads to major trouble.

Significant ergonomic task metrics

For each task, you need to choose the characteristics of the interface that are significant for it. Of course, we have metrics at our disposal from the section “What exactly to measure”. However, these metrics are inconvenient: they are difficult to measure and difficult to understand (though they are easier to compare). From a practical point of view, more mundane characteristics are much more convenient.

For example, you can count the number of human errors. But the influence of this metric on the overall result is so complex that the analysis still cannot do without elements of arbitrariness and subjectivity. It is much easier to fix the impact of the same errors in advance, before the test, for example by setting a concrete interface requirement: "The user is able to install the program in less than five minutes while making no more than two minor errors" (where a minor error is one the user noticed and corrected himself). Of course, there is a lot of arbitrariness in such a requirement, but still less than in dubious speculation about the number of human errors identified. Thus, "mundane" metrics are more convenient than universal ones.

In addition, simple declarative metrics are also convenient because they allow you to understand when you need to continue optimizing the interface, and when you can stop. Returning to the example from the previous paragraph, we can say that we need to optimize the interface until the installation takes less than five minutes. If, after optimizing the interface, the program installs in six minutes, this is a reason to modify the interface again and test it again.

Here are examples of such metrics:

  • Success: respondents correctly perform 90% of the tasks.
  • Efficiency (speed of work): registration on the site is completed in less than 7 minutes.
  • Efficiency (errors): when filling in 10 forms, the number of input errors does not exceed two.
  • Efficiency (learnability): when performing task 9, which differs from task 2 only in the input data, respondents make no mistakes (apart from typos).
  • Satisfaction: according to the questionnaire results, the score is 20% higher than in the previous round.
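
Checked against real measurements, such declarative metrics reduce to simple pass/fail comparisons, as in the sketch below (Python); the measured values are invented, and each FAIL simply means another round of optimization and re-testing, as described above.

    measured = {
        "task_success_rate": 0.92,       # share of tasks performed correctly
        "registration_minutes": 7.5,     # time to complete registration
        "form_input_errors_per_10": 1,   # input errors across 10 forms
    }

    checks = [
        ("Success: 90% of tasks performed correctly",   measured["task_success_rate"] >= 0.90),
        ("Efficiency: registration in under 7 minutes", measured["registration_minutes"] < 7),
        ("Errors: no more than 2 per 10 forms",         measured["form_input_errors_per_10"] <= 2),
    ]

    for name, passed in checks:
        print(("PASS" if passed else "FAIL"), name)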

Test tasks

A test task is what the respondent receives from you, a task that allows you to lead the respondent through a fragment of the system interface and determine the characteristics of this fragment.

Test tasks, in addition to the fact that they must correspond to user tasks, must also have the following properties:

  • Unambiguity. Tasks should be formulated so as to rule out misinterpretation by the respondent. If the respondent misunderstands the task, you will almost certainly be unable to set him back on the right path without at the same time prompting him how to do the task.
  • Completeness. The text of the task should contain all the information needed to complete it.
  • Brevity. If you are measuring the speed of task completion, the tasks should be short enough that the time respondents spend reading them does not affect the duration of the tasks themselves (people read at different speeds). If the task text is long, you will have to subtract the reading time from each task manually, which is very laborious.
  • No hints. The text of the task should not reveal how the task is to be performed. For example, it is unacceptable to use the system's terminology: instead of each term you need to describe its meaning, otherwise the respondents will simply click the buttons carrying the same words and you will not identify any problems.
  • A starting point. The task must state the window or screen the respondent should be on when the task begins. If this is not specified, respondents will inevitably wander into other parts of the interface, which means the task will be performed differently by different respondents and all statistical calculations become meaningless. Fix the starting point of a task at the end of the previous task: if the task starts from a clean slate, the previous task should end with "return to the main screen"; if a task is to start where the previous one ended, the previous task should end with "when finished, do not close the current window / stay on this screen".

In addition to these general requirements, the following must also be taken into account:

  • It is possible that several test tasks will need to be written for one user task. Typical case - The task is too big to fit into one task. Also, if a user task is a frequent one, you shouldn't care too much about how it gets done the first time - it's much more interesting to know how users will do it the second, third, fourth (and so on) times. In this case, within the test on one respondent, this task will need to be run several times, each time varying the tasks.
  • In addition to tasks in which the respondent must perform some action, dual tasks are acceptable and desirable, in which the respondent must first decide whether he needs to perform this action at a given time. For example, if we are testing a disk defragmenter, instead of the "Defragment the computer disk" task, it is better to use the task "Check the degree of disk fragmentation and, if you find it necessary, defragment the computer disk." Such tasks should be designed in such a way that the respondent could not refuse to make a decision without looking, saying that, they say, everything is fine and defragmentation is not needed. Also, before such a test, it is reasonable to intentionally fragment the disk so that the respondent cannot avoid the task.
  • Sometimes in the course of a task, you need to forcibly change the state of the system. For example, if you want to know how users solve a particular problem, you have to create that problem. It is unacceptable to interrupt the test for this, as this will distract the respondent. In such cases, another task can be inserted before the corresponding item, in which the respondent must create the problem on their own. Of course, such a task will not provide any information about the interface.
  • Analysis of results and compiling statistics are much simpler if you use not a small number of long tasks, but a large number of short tasks, each requiring moving through only a couple of screens or filling out one or two forms.
  • The first task of the test should be an introductory one, designed solely to introduce the respondent into the process. Accordingly, it should be simple, and its results can be ignored.
Be sure to check that your scenarios can be completed by respondents within the expected test time. Most likely, the list of scenarios will have to be trimmed.

Signs of successful completion of a task

The last component of the scenario is the signs of successful task completion. The point is that the same task can often be completed in more than one way. It is wrong to run the test without knowing all of these ways, since the subsequent analysis would be questionable. Suppose respondent A completed the task by method A, and respondent B by method B. Both coped with the task, but one method is still better than the other: different methods have different efficiency, for example, method B may require one and a half times as many actions as method A. Method A is preferable in this situation, and in an ideal system (which is what you should strive for) all users would use only it.

In addition, sometimes the result the experimenter considers correct is not actually correct, especially if the subject area is complex and the usability specialist does not know it well enough. To make sure the correct result is really what the usability specialist believes it to be, find a specialist in the system and its domain and ask. Without firmly knowing all the ways to complete a task, you simply will not be able to identify the errors.

Workplace and ways of fixing data

There are two approaches to organizing a workplace for usability testing: a stationary workplace and a mobile one. Only recommendations pertaining to mobile workspaces are provided here, as mobile labs are themselves cheaper and allow for lower rewards for respondents (albeit at the cost of the usability specialist's time, who has to travel to the respondents himself).

So, what you need to have for full testing:

1. A laptop. The requirements for a laptop are simple. First, the most powerful processor possible for simultaneous screen recording (although even the weakest Intel Centrino processors allow you to record a video stream in the background, the interface under test will work faster on more powerful processors, and the video quality will be higher). Secondly, if you are going to record video with the respondent himself (see below), you will need a larger screen to fit both the tested interface and the window with the respondent's face.

2. Webcam if you are going to record facial expressions and gaze direction of the respondent. As a rule, the more expensive the camera, the better the image quality it gives. A camera with a laptop screen mount is desirable, as it is more convenient to use.

3. Microphone. Basically, any will do. Personally, I use an ordinary Genius microphone that costs seventy rubles. If your webcam has a built-in microphone, that is fine too. That said, a better microphone gives better recording quality, so there will be less hiss (although the hiss does not really get in the way).

4. Screen recorder. The de facto standard is TechSmith Camtasia, but if funds permit, invest in TechSmith Morae, which is designed specifically for usability testing (it records not only the screen contents but also logs user actions, which can greatly speed up the subsequent analysis; on the other hand, Morae is about four times more expensive than Camtasia, which is already expensive).

Before the first test, get to know your equipment as best as possible. Explore how best to position your camera and microphone for the best results. Learn the hot keys of the screen recording program, learn how to quickly launch it in any mode. Nothing undermines a respondent's confidence in testing more than the sight of a bustling experimenter trying to get the recording right at the last moment before a test.

5. If you are going to record task durations, it is useful to have a sports stopwatch with lap recording, which lets you store a series of intervals. Otherwise you will have to rewatch the videos to calculate task durations, which is very tedious.

6. Test tasks to present to the respondents. As a rule, the best option is to print each task on a separate sheet so that the respondent cannot run ahead and read tasks he has not yet reached. The first sheet should contain an introductory form. An example of such a form (variable data in square brackets):

Dear [Respondent's name],
We invite you to complete a series of tasks designed to evaluate the simplicity and ease of use of [Name of the system]. Complete the tasks calmly. The purpose of the study is to evaluate the quality of the interface under study, not you personally. If something goes wrong, it will only mean that the interface, and only the interface, needs improvement.
When performing tasks, you must act as you see fit. For example, if you choose to use Help, you can do so without asking the experimenter's permission.
Please note that your actions and words are recorded for further study, but all collected data will remain strictly confidential and will be available only to researchers.
Read the assignment carefully and follow the instructions in it exactly.
Try to complete each task to the end, but if during the task you realize that you cannot or do not want to complete it, inform the experimenter about this and move on to the next task.
Please turn the page with the task only when you complete the task on the open page.
If you do not understand a task, do not hesitate to ask the experimenter about it.

On the other hand, in some cases it is much more efficient to issue assignments to respondents not on paper, but in a way that is closer to reality. For example, when testing a POS interface, it's best to pretend to be a customer, and customers rarely state their needs in writing.

7. If you are going to survey users, you will need printed forms.

8. If you are going to sit next to the respondent and record some parameter on the spot, you will need a clipboard with paper and a pen. It is convenient to pre-print several sheets with the respondent's name and page numbers: if you run several tests in a row, this reliably spares you the torment of mixed-up papers.

As you can see, not much is needed. The cost of the necessary equipment and software (not including a laptop, in the 21st century it is not a luxury) is no more than $450 in an economical version. The advantages of such a solution are reliability and ease of operation; in addition, mobility allows testing with the respondents themselves, which significantly increases their number (many potential respondents will not go to the office of a usability specialist under any circumstances).

Recording the facial expressions of the respondents

If you are going to analyze the results after the test (rather than during it), it is extremely useful to make a video recording of the respondents' facial expressions and gestures. Without video of the respondent, you will have to analyze cursor movements (accompanied, since sound is also recorded, by sighs and exclamations). With the recording, you can analyze the person's interaction with the interface as a whole, since the prerequisites for a gestalt appear. An objectively minor difference in method turns into an objectively significant improvement in results.

The problem is that recording video of the respondent himself involves a certain difficulty: you somehow need to synchronize the video of the respondent's face with the recording of his actions.

Do not place the microphone near the test printouts, otherwise you will go deaf while watching the video.

In a stationary usability laboratory, this is achieved by hardware mixing of the video from the camera with the video stream from the computer's video card. The major drawbacks of this solution are the high cost and the low quality of the recorded screen image (the screen has several times more pixels than can be recorded on tape). In addition, such recordings are inconvenient to work with.

You can also synchronize the recording manually by recording streams from the camera and from the screen on the computer at the same time. But in this case, after each test, you will have to spend some time on the boring job of mixing two different video files.

Previously, only a clumsy, albeit workable, solution was available (a newer, better way is described below). Before testing:

  1. Turn off graphics acceleration in Windows (in the Control Panel open Display, on the Settings tab click Advanced, and in the window that opens, on the Troubleshoot tab, move the Hardware acceleration slider to the left). After the test, acceleration can be turned back on.
  2. Run any program that can display the video coming from the camera. Such programs come bundled with webcams; video chat programs also work.
  3. Place the window showing the camera feed at the lower right edge of the screen (there it distracts the respondent least of all), and position the window of the system under test so that it does not cover the camera window. On the physical screen, the camera window can be covered with a piece of paper so the respondent does not see himself (thanks to Dmitry Satin for the idea).
  4. Ask the respondent not to resize the window of the system under test.
  5. Turn on recording of the entire screen.

Due to this, during the test, the entire screen content is recorded, including both the actions of the respondent and his image from the camera.

Screen view when recording facial expressions in this way.

Addition: the third version of TechSmith Camtasia Studio introduced a picture-in-picture mode (the stream from the video camera is inserted into a corner of the screen-capture video), so now everything is much simpler.

The only problem with recording the respondent's facial expressions is that if you travel to the respondents yourself, no one can guarantee that the background of the video will be presentable: respondents are quite capable of receiving a usability specialist in rather squalid surroundings.

Testing the test

Finally, we need to validate the test itself. You need to make sure that:

  1. equipment is operational
  2. you know how to handle it confidently
  3. all default settings are correct
  4. you have enough blank tapes or disk space
  5. all the necessary papers are printed and checked for relevance and errors
  6. test tasks contain all the necessary information and do not require additional explanations
  7. there are no hidden clues in test tasks
  8. you can quickly bring the system under test to its original state so that the next respondents do not see the changes made by the previous participants
  9. your idea of ​​what constitutes proper task performance is true
  10. the test on one respondent can be conducted in a reasonable time (no more than one and a half hours).

As you can see, there are so many points where blunders can be made that checking them all is itself a source of errors (complex tasks are hard to complete without errors at all). Accordingly, a reliable verification method is needed: testing the test itself, i.e. running the test on someone you do not feel sorry for and who is easy to catch (a colleague, for example). If the trial run reveals even one error in test preparation, correct it and run the trial again.

Note, however, that a trial run does not remove the need to check the test against the list above yourself, since some of its points cannot be verified by a trial run.

Testing

So, the test is ready and you can proceed. The procedure is simple. Turn on the recording, seat the respondent at the computer, and then:

  1. introduce the respondent to the task
  2. find out his expectations of the system
  3. test the interface
  4. find out how the respondent's expectations were met
  5. complete the test.

These steps are described in detail below.

Introduction to the task

Introducing the respondent to the task means consistently explaining the rules of testing to him. All of these explanations are extremely important; if you skip even one point, the results will be distorted.

  • Explain to the respondent what usability testing is and why it is needed.
  • Explain to the respondent (here it is permissible to lie) that he and only he is needed for this testing: feeling needed, the respondent will perk up.
  • Mention that you did not develop the interface (you can and should lie), so you will not be offended if the respondent scolds the interface.
  • Before testing, do not forget to turn off your cell phone and ask the respondent to do the same.
  • Explain to the respondent that you are not testing him, but the system. Warn him that all his problems are really problems of the system, and that if he makes a mistake, no one will blame him, on the contrary, you will know that the problem is not in him, but in the system.
  • Apologize for having to record his actions. Reassure the respondent that the collected data will stay with you and that you will hand over the test results to the customer only after anonymizing them. If you are recording screen content, additionally ask the respondent not to enter his last name in on-screen forms (so that the client to whom you give the recording does not see it).
  • Explain to the respondent that they can refuse to continue the test at any time and that in this case they will still be paid a reward. Explain that the respondent can ask to stop the test at any time to rest.
  • Finally, explain to the respondent that it is useless to ask you questions about the interface, but that he can and should ask if any task is unclear to him.
Memorize a list of things to say before testing. This also affects the results.

Identification of expectations from the system

Regardless of the type and purpose of the test, when testing a new interface, it is useful to determine how well it meets user expectations. If expectations are met, implementation and initial support will be greatly facilitated; if the expectations were deceived, the system will immediately cause rejection.

Expectation elicitation should generally be done before interface design, but unfortunately it is extremely difficult in the early stages of development and requires, in addition, an extraordinary talent for listening and asking the right questions. However, at stages where there is already something to test, such as testing prototypes, it is easier to identify user expectations, so it is foolish not to take advantage of this opportunity.

The procedure for identifying expectations consists of two steps:

  1. Before conducting the test, the respondent should be asked what he expects from the system. You need to listen to the respondent with an attentive look, and everything that he says can be safely forgotten, since all his words are nothing more than fantasies. You need to ask not in order to find out something, but in order to prepare the respondent for the second stage.
  2. After the test, the respondent should be asked how the shown interface meets his expectations. Here the respondent can already be trusted: firstly, he is prepared by his previous answers, and secondly, the interface shown to him may prompt him to formulate requirements that he was not aware of before.

Testing

When testing, the following six "nevers" should be observed:

  • Never apologize for the imperfection of the system under test.
  • Never say "We'll fix it later."
  • Never blame anyone for the interface being bad ("The developers, of course, are idiots and created something awkward, but we'll fix it right now").
  • Never call the procedure "user testing": the respondent will think that he is the one being tested and will be afraid. Ideally, always call it "interface usability testing" or simply "interface testing".
  • Never interrupt a respondent. Even if he says something irrelevant, let him talk fully and only then ask your questions.
  • Never shape the respondent's behavior. Some people adjust to the experimenter's expectations: if they feel that you want to find more errors in the interface, they will keep making errors themselves, even if the interface gives no grounds for them. To avoid this, all your words should be emphatically neutral. There are two simple ways to achieve neutrality. First, do not ask leading questions. Instead of asking the respondent how simple the system seemed to him (a clearly leading question, since with a different attitude to the topic it could just as well be "how complicated did the system seem to you?"), it is better to ask whether the system's interface is simple or complex. Second, respondents often ask you questions themselves, trying to avoid making decisions on their own. It is tempting to answer such questions, but because of their spontaneity the answers will act as hints. In such cases the best answer is a counter-question: "Am I doing it right?" "What do you think?" "Did I complete the task correctly?" "What do you think?" And so on, until the respondent settles down. Impolite, but effective.

In addition, there are a few not so categorical rules:

  • If you are monitoring an interface property during a test, such as counting respondent errors, you should not monitor more than one metric. For example, if you count errors, it is not worth counting the execution time - the probability of your own error increases too much. In my opinion, during the test, you can only write down your hypotheses about potential improvements to the interface - i.e. what you see immediately. It is better to calculate interface indicators from video recordings.
  • Even with active intervention, try not to ask the respondent questions that are not directly related to their current operation. It is better to ask them after the test.
  • If possible, sit to the right behind the respondent - so that he can see your face with his head turned slightly. Your presence is burdensome for the respondent, but in this position he will at least be less tense.
  • During the test, you often cannot see problems with the interface as a whole. For example, you notice a user error. But what explains it? Is this an anomaly caused by the fact that the user is less prepared than the rest? Are you sure that everyone repeats this mistake? Because of this, you need to record a maximum of observations. Some you will discard later, but that is better than missing the problem.
  • Work speed. Between tasks, switch the stopwatch to a new lap. If the respondent gets distracted for any reason, pause the stopwatch.
  • Mistakes. On a sheet of paper, put a dash for each human error. It is convenient to put small dashes for small errors and long lines for large errors. After the test, it is enough to count the number of dashes. If you count errors of different types separately (for example, simple errors and separately incorrectly selected menu items), it is better to use different codes, for example, the same dashes for simple errors and the letters M for menu-related errors.
  • Problems that you notice right away. Briefly write down on paper the essence of the problem and the current time (time first!). If you know exactly when a problem occurred, it will be easier to find the relevant video fragment (a simple logging sketch follows this list).
  • Emotional reactions of the respondent. Put a plus sign for positive reactions and a minus sign - for negative ones. Reactions that occur at the moment of completion of test tasks are not considered.
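If you prefer to key such notes into a laptop rather than onto paper, a minimal time-stamped logger can keep the "time first" convention described above. The sketch below is only an illustration: the file name, note codes and example notes are made up, not part of any standard tooling.

# observation_log.py - a minimal sketch for time-stamped test notes.
# Assumption: one plain-text log file per respondent, notes typed during the test.
import datetime

LOG_FILE = "respondent_01_notes.txt"  # illustrative name

def log_note(text: str) -> None:
    """Append a note prefixed with the current time (time first!)."""
    stamp = datetime.datetime.now().strftime("%H:%M:%S")
    with open(LOG_FILE, "a", encoding="utf-8") as f:
        f.write(f"{stamp}  {text}\n")

if __name__ == "__main__":
    log_note("ERROR: picked wrong menu item (code M)")
    log_note("PROBLEM: long pause before the search field, cursor wandering")
    log_note("REACTION: + (smiled at the confirmation screen)")

The timestamps make it easy to jump to the matching place in the video afterwards, which is exactly what the "time first" note on paper is meant to achieve.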

Completion of the test

After finishing the test:

  • Ask the respondent questions.
  • Have the respondent complete the questionnaires if you are conducting a survey.
  • Ask the respondent whether they liked the interface; regardless of the answer, ask them to clarify what exactly they liked and what they did not.
  • Pay the respondent.
  • Thank him for taking the test. Reassure the respondent that they did a great job and that you were able to identify many interface problems thanks to them (do this even if the respondent turned out to be a withdrawn, unpleasant type who did not reveal anything new on the test).
  • If the respondent is particularly good, ask him if he can be contacted in the future for new testing tasks. A respondent with testing experience is always better than one with no experience.

Prototype testing

Particular attention should be paid to testing on prototypes. When testing on prototypes, you have two options:

  • You can limit yourself to an active intervention test, without the possibility of obtaining any quantitative data. To do this, you do not need any special prototype, since you always have the opportunity to explain to the respondent the meaning of service information and the reasons for gaps in the prototype.
  • In addition to the regular prototype, you can create a test prototype and obtain some quantitative data, at the cost of the considerable resources spent on the test prototype.

A test prototype is a type of prototype in which the respondent can perform test tasks. For example, if the prototype consists of a sequence of screen images:

  • In a typical prototype, each screen image will represent all possible interface fragments under different circumstances; instead of real data, generalized data will be displayed with notes about their possible maximum and minimum volume.
  • In the test prototype, the same interface will be presented as it is displayed at any time during the execution of test tasks: if data is edited on any screen according to the test task, in the prototype you will have to draw both versions of this screen, and possibly more (state before, during and after editing).

Creating a test prototype is always labor-intensive. It is bad enough that the screens in such a prototype have to be drawn several times. Worse still, you cannot limit yourself to creating only the test prototype, so you will have to make both a regular and a test prototype. And if you are testing iteratively, you will also have to apply corrections to both prototypes at once.

In fairness, it must be said that sometimes you still have to create test prototypes. For example, if the interaction on some screen is too complex and variable, the customer will not be able to understand it (it will be difficult for you to do it yourself). The only way out in such a situation would be to draw this screen in all possible states with all possible interaction options, i.e. create the same test prototype.

Thus, the use of non-functional test prototypes is a manageable nightmare, but a nightmare nonetheless (in fact, this is the main argument in favor of creating functional prototypes, which are test prototypes from the start). Whether you need this nightmare is for you to decide.

Analysis of results

Finally, it's time to analyze the test results. Three things are important here:

  • when to start analysis
  • how to analyze the respondent's actions
  • what can be gleaned from quantitative data.

Of particular note is the question of when to start optimizing the interface.

When to start analysis

You can analyze the results both during and after the test. Analysis during testing has advantages and disadvantages. The advantages include that it:

  • Allows you to save time at the analysis stage, because part of the analysis is done during an earlier step.
  • Gives the most direct impression of the test (gestalt), which allows you to see problems that are not noticed in any other way.

There are also disadvantages:

  • It does not allow recording more than one ergonomic indicator at a time, and even then in practice it is possible to measure only the speed of the user's work and the number of human errors (although these are the most popular indicators).
  • It is possible only with significant experience of a usability specialist.
  • It is impossible if the test is carried out by one specialist, and the analysis is done by another (on the other hand, the observations of the one who conducted the test will definitely come in handy for this other person).

Analysis after testing is devoid of these disadvantages and advantages. It allows you to carefully and thoughtfully analyze the material, regardless of the number and nature of the measured indicators. In addition, it is easily scalable to any number of performers.

Thus, the optimal strategy seems to be to start the analysis during the test. In some cases you can limit yourself to this analysis alone; if that turns out to be impossible, you can always analyze the video protocols of the tests as well.

Analysis of respondents' actions

Almost all usability testing is aimed at finding and identifying problems. But how to see the problem in the actions of the respondents?

Mistakes

Not every respondent's mistake is explained by interface problems, for example, the respondent could show elementary inattention. However, any error requires consideration:

  • If the error is critical, i.e. the respondent made a mistake due to a misunderstanding of the interface structure and the error led to other errors (for example, a site visitor went to a section he did not need and got lost there), the corresponding fragment should be redone: steps should be taken to resolve the ambiguity, add hints, etc.
  • If the error is non-critical, i.e. the respondent immediately noticed it himself and corrected it himself, you need to decide whether to correct it or leave it unattended. Correcting a problem is worth it if you feel you understand why the error occurred (only experience will help you here). If you don't feel it, leave the interface as it is. Of course, if the problem reoccurs, the error needs to be fixed - but then you will have more information about it, so it will be easier to fix.
  • Perhaps the error is due to a flaw in the test task. Be sure to rule this out: ask the respondent to retell the task in his own words. If he retells it incorrectly, he misunderstood the task; the task must be urgently rewritten, and the error itself can be ignored.

Slowdowns in work

If the respondent paused for no apparent reason, this means that he is trying to figure out what he needs to do next. The interface is probably not self-explanatory or unambiguous enough. The problem needs to be corrected.

A slowdown is not always clearly noticeable by itself; it can also be spotted by the random mouse-cursor movements that accompany it (many people, having lost the thread of an action, automatically move the cursor).

Preparation, interviews and data collection


Natalia Sprogis, head of UX research at Mail.Ru Group, spoke on the company's blog on Habrahabr about preparing and conducting usability testing: what to include in a test script, how to choose a data collection method, compose tasks and collect respondents' impressions.

The test plan is, on the one hand, a set of tasks, questions and questionnaires that you give to each respondent, and on the other hand, the methodological basis of the study: metrics and hypotheses that you test and fix, the selected toolkit.

Is testing necessary?

To begin with, you must be sure that the project needs usability testing at this stage. So clarify the purpose for which the project team is approaching you. Usability testing is not omnipotent, and already at the start you need to understand what it can bring to the product. Make clear to the project team straight away which questions you can answer and which you cannot. There have been cases when we either suggested a different method to the customer (for example, an in-depth interview or a diary study suited them better), or recommended abandoning the study altogether and running a split test instead.

For example, in qualitative research we never undertake to test the "attractiveness" of a feature or design option. We can collect user feedback, but the risk is too great that the answers will be influenced by social desirability: people always tend to say they would use even what they will never use. Moreover, the small sample size does not allow one to trust such answers. For example, we had a bad experience testing game landing pages: the landing page chosen as the most attractive in the test performed much worse in A/B testing.

Testing prototypes and concepts also has a number of limitations. When planning, you must understand what can realistically be squeezed out of such a test. It is great when a project has the opportunity to test prototypes or mockups before implementation. However, the less detailed and functional the prototype, the higher the level of abstraction for the respondent and the less data the test yields. Prototype testing is best at revealing problems with naming and icon metaphors, that is, anything to do with understandability. Whether anything beyond that can be tested depends heavily on the nature of the project and the detail of the prototype.

Basis for writing a usability test script

Test planning does not begin with writing the text of tasks, but with a detailed study of the goals and questions of the study together with the project team. Here is the basis for making a plan:

Important scenarios. These are the user scenarios (tasks, or use cases) that affect the business or are related to the purpose of the testing. Even if the team suspects problems in specific places, it is often worth checking the main cases as well. The following scenarios can be considered important for a test:

  • the most frequent (for example, sending a message in a messenger);
  • affecting business goals (for example, working with a payment form);
  • related to the update (those affected by the redesign or the introduction of new functionality).

Known issues. Often the purpose of the research is to find the causes of a service's business problems. For example, a producer is concerned about a large churn of players after the first hour of a game. Sometimes the problem areas of the interface are already known to the team, and you need to collect details and specifics. For example, the support service often receives questions about the payment form.

Questions. The team may also have questions to investigate, such as whether users notice a banner advertising additional services; whether a certain section is clearly named.

Hypotheses. This is what the team's known issues and questions translate into. It is good if the customer comes to you with ready-made hypotheses - for example: "Our clients pay only from their phones, with a commission. Perhaps users do not see that a better payment method can be chosen." If there are no hypotheses, only a desire to test the project abstractly "for usability", your task is to formulate those hypotheses.

Think with the project team about places where users do not behave as expected (if such information is available). Find out if there are design elements that have been discussed a lot and that could be problematic. Do your own audit of the product to find potential user challenges that are important to test. All this will help you make a list of those elements (tasks, questions, checks) that should be included in the final script.

Data collection method

It is important to consider how you will collect data about what happens during the test for later analysis. The following options are traditionally used:

Observation. During the execution of tasks, the respondent is left alone with the product and behaves as he sees fit. The respondent's comments are collected through questionnaires and conversation with the moderator after the test. This is the "cleanest" method: it yields the most natural behavior of the respondent and allows a number of metrics (such as task completion time) to be measured correctly.

However, a lot of useful qualitative data remains behind the scenes. Seeing this or that behavior of the respondent, you cannot understand why he acts this way. Of course, you can ask about this at the end of the test, but, most likely, the respondent will only remember the last task well. In addition, during the execution of tasks, his opinion about the system may change, and you will only get the final picture, and not first impressions.

Think Aloud (thinking out loud). For a long time this was the most frequently used method in usability testing; Jakob Nielsen once called it the main tool for assessing usability. The essence is that you ask the respondent to voice all thoughts that arise while working with the interface and to comment on all his actions. It looks something like this: "Now I'm going to add this item to my shopping cart. Where's the button? Ah, here it is. Oops, I forgot to check what color it was."

The method helps to understand why the user behaves in one way or another and what emotions the current interaction evokes in him. It is cheap and simple, even an inexperienced researcher can handle it.

However, it has its drawbacks. First, it is not natural for people to "think out loud" all the time. They will often fall silent, and you will have to keep reminding them to talk. Second, tasks take somewhat longer with this method than in real life. In addition, some respondents start using the product more deliberately: voicing the reasons for their actions, they try to act more rationally (they simply do not want to look like idiots), and you may miss some intuitive aspects of their behavior.

Active moderator intervention. The method is ideal for testing concepts and prototypes. During the execution of tasks, the moderator actively interacts with the user: finds out the reasons for his behavior at the right time and asks clarifying questions. In some cases, the moderator may even issue unplanned tasks arising from the dialogue.

This method allows you to collect the maximum amount of qualitative data. However, it can only be used if you trust the professionalism of your moderator. Incorrectly worded or inopportunely asked questions can greatly influence the behavior and impressions of the respondent, and even make the test results invalid. Also, almost no metrics can be measured using this method.

Retrospective think aloud, RTA (retrospective). This is a combination of the first two methods. The user first performs all tasks without intervention, and then a video recording of his work is played in front of him, and he comments on his behavior and answers questions from the moderator. The main drawback of the method is that the testing time is greatly increased. However, there are times when it is optimal.

For example, we were once faced with the task of testing several types of mobs (game monsters) in an RPG. Naturally, we could neither distract the respondents with questions nor force them to comment on their actions during combat: that would have made it impossible to play a game where concentration is needed to win. On the other hand, after a series of fights the user would hardly remember whether he noticed that the first rat's axe lit up red. So in this test we used the RTA method: with each user we reviewed his fights and discussed which monster effects he noticed and how he understood them.

Try to think about how to get enough data while keeping the respondent's behavior as natural as possible. Despite the simplicity and versatility of the "thinking out loud" method, which has long been the most popular in usability testing, we increasingly try to replace it with observation. If the moderator sees interesting behavior, he waits until the respondent completes the task and asks the question afterwards; immediately after the task, the respondent is still likely to remember why he acted that way.

An eye tracker helps a lot here. Seeing the current focus of the respondent's attention, you can better understand his behavior without asking too many questions. In general, an eye tracker significantly improves the quality of moderation, and in my opinion this role is no less important than the ability to build heatmaps.

Metrics

Metrics are quantitative measures of usability. As a result of testing, you always get a set of problems found in the interface. Metrics, on the other hand, allow you to understand how good or bad everything is, as well as compare it with another project or previous versions of the design.

What are the metrics

According to ISO 9241-11, the main characteristics of usability are effectiveness, efficiency and satisfaction. Different metrics may matter for different projects, but all of them are tied in one way or another to these three characteristics. I will describe the most commonly used indicators.

Successful completion of tasks. You can use a binary code: coped with the task or failed. We more often adhere to the Nielsen approach and distinguish three types of success assessments:

  • coped with the task with almost no problems - 100%;
  • ran into problems, but completed the task on their own - 50%;
  • did not cope with the task - 0%.

If out of 12 respondents 4 coped with the task easily, 6 coped but with problems, and 2 failed, the average success on this task is 58%.

Sometimes you will encounter a situation where the middle group includes respondents with very different degrees of “problemness”. For example, one respondent struggled with each field of the form, and the second made only a slight mistake at the very end. You can give a mark at your own discretion, depending on what happened on the test. For example, 25% - if the respondent has just started to complete the task, or 80% - if he made a minor mistake.

To avoid too much subjectivity, consider rating scales ahead of time rather than deciding for each respondent after the test. It is also worth thinking about what to do with errors. For example, you gave the task to buy movie tickets on the Mail.Ru Cinema project. One of the respondents accidentally bought a ticket not for tomorrow, but for today, and did not notice it. He is sure that he coped with the task and has a ticket in his hands. But his mistake is so critical that he will not get into the cinema, so I would put "0%", despite the fact that the ticket was bought.

The success rate is a very simple and visual metric, and I recommend using it if your assignments have clear goals. A glance at the success graph by tasks allows you to quickly identify the most problematic places in the interface.
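As an illustration of the three-level scoring described above, here is a short sketch that averages per-respondent scores into a per-task success rate. The task names and scores are invented, not real test data; the first task reproduces the 4/6/2 example from earlier.

# success_rate.py - sketch of the three-level task-success metric (100/50/0).
# The scores below are illustrative, not real test data.
task_scores = {
    "buy_ticket":   [100, 100, 50, 50, 50, 0, 100, 50, 100, 50, 50, 0],
    "find_filters": [100, 50, 100, 100, 0, 50, 100, 100, 50, 100, 100, 50],
}

for task, scores in task_scores.items():
    rate = sum(scores) / len(scores)
    print(f"{task}: {rate:.0f}% average success ({len(scores)} respondents)")

Running this prints roughly 58% for the first task and 75% for the second, which is the kind of per-task bar you would put on the success graph.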

Task execution time. This metric is indicative only in comparison. How do you know if it's good or bad for a user to complete a task in 30 seconds? But the fact that the time has been reduced compared to the previous version of the design is already good. Or the fact that registration on our project takes less time than competitors. There are interfaces where reducing the time to complete tasks is critical - for example, the working interface of a call center employee.

However, this metric is not applicable to all tasks. Take the task of selecting goods in an online store. Users should quickly find filters and other interface elements related to product search, but the selection process itself will take each of them a different amount of time, and that is completely normal. Women choosing shoes are prepared to look through 20 pages of search results, and this does not necessarily mean that there were no suitable products on the first pages or that they do not see the filters. Often they just want to see all the options.

Problem frequency. Any usability testing report contains a list of problems that the respondents encountered. The number of respondents who encountered a problem is an indicator of its frequency in the test. This metric can only be applied if your users have completed exactly the same tasks.

If there were variations in the test or the tasks were not rigidly fixed but composed on the basis of the interview, the frequency will be harder to calculate: you will need not only to count those who encountered the problem, but also to estimate how many respondents could have encountered it (performed a similar task, visited the same section). Nevertheless, this characteristic lets the team understand which problems should be fixed first.
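A sketch of how such a tally might be kept, including the case where not every respondent attempted the relevant task. The problem descriptions and counts are invented for illustration.

# problem_frequency.py - sketch: share of respondents who hit each problem,
# relative to those who could have hit it (i.e. attempted the relevant task).
problems = {
    # description: (respondents who encountered it, respondents who attempted the task)
    "did not notice the payment-method selector": (7, 12),
    "confused 'login' with 'registration'":       (3, 10),  # two respondents skipped this task
}

for description, (hit, attempted) in sorted(
        problems.items(), key=lambda p: p[1][0] / p[1][1], reverse=True):
    print(f"{hit}/{attempted} ({hit / attempted:.0%})  {description}")

Sorting by the resulting share gives a rough priority order for the fixes.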

Subjective satisfaction. This is the user's subjective assessment of how convenient or comfortable it is to work with the system. It is gathered with questionnaires that respondents fill out during or after testing. There are standard questionnaires, such as the System Usability Scale, the Post-Study Usability Questionnaire, or the Game Experience Questionnaire for games. You can also create your own questionnaire.
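For the System Usability Scale specifically, the published scoring rule turns ten answers on a 1-to-5 scale into a 0-100 score: odd items contribute the answer minus one, even items five minus the answer, and the sum is multiplied by 2.5. A small sketch; the sample answers are invented.

# sus_score.py - sketch of standard SUS scoring: 10 items answered 1..5,
# odd items contribute (answer - 1), even items (5 - answer), sum * 2.5.
def sus_score(answers):
    if len(answers) != 10 or not all(1 <= a <= 5 for a in answers):
        raise ValueError("SUS expects ten answers on a 1..5 scale")
    total = 0
    for i, a in enumerate(answers, start=1):
        total += (a - 1) if i % 2 == 1 else (5 - a)
    return total * 2.5

# Invented answers for one respondent, not real data:
print(sus_score([4, 2, 5, 1, 4, 2, 4, 1, 5, 2]))  # prints 85.0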

These are far from the only possible metrics - see, for example, the list of 10 UX metrics highlighted by Jeff Sauro. For your product the metrics may be different: say, how well respondents understand the rules of the game, or how many mistakes they make when filling out long forms. Remember that the decision to use many metrics imposes a number of constraints on the testing: respondents should act as naturally as possible and under identical conditions. Therefore, it is good to provide:

  • Single starting points. The same tasks for different respondents should start from the same interface point. You can ask respondents to return to the main page after each task.
  • No intervention. Any communication with the moderator can affect the performance metrics if the moderator unwittingly prompts something to the respondent, and increases the time to complete the task.
  • Order of tasks. To compensate for the learning effect of benchmarking, be sure to change the order in which you see the products being compared for different respondents. Let half start with your project, and half start with a competitive one.
  • Criteria for success. Consider in advance what kind of behavior you consider successful for the task: for example, is it acceptable for the respondent not to use filters when selecting a product in an online store.

Interpretation of metrics

Remember that classic usability testing is a qualitative study, and the metrics you get are primarily illustrative. They give an overview of the different scenarios in the product and let you see the pain points (for example, that account settings cause more difficulty than registration). They can show the dynamics of change if you measure them regularly - that is, the metrics make it possible to see that a task has become faster in the new design. Such relative comparisons are far more indicative and reliable than the absolute values of the metrics.

Jeff Sauro, a UX research statistician, advises not presenting metrics as plain averages but always using confidence intervals. This is much more correct, especially when the respondents' results are scattered. You can use his free online calculators: for success rates and for task completion time. Nor can you do without statistical processing when comparing results.
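For completion rates on small samples, one commonly recommended interval is the adjusted Wald. A plain-Python sketch of that calculation follows; the sample numbers (10 of 12 respondents) are illustrative, and for real studies the online calculators mentioned above are the safer route.

# adjusted_wald.py - sketch of a 95% adjusted Wald confidence interval
# for a task completion rate (x successes out of n respondents).
import math

def adjusted_wald(x: int, n: int, z: float = 1.96):
    n_adj = n + z * z
    p_adj = (x + z * z / 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)

x, n = 10, 12  # illustrative: 10 of 12 respondents completed the task
low, high = adjusted_wald(x, n)
print(f"completion rate {x/n:.0%}, 95% CI roughly {low:.0%}..{high:.0%}")

The wide interval this prints for 12 respondents is exactly why reporting a bare average can be misleading.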

When are metrics needed?

Not every usability testing report contains metrics. Their collection and analysis takes time and imposes limitations on test methods. Here are the cases where they are really needed:

  • Prove. Often there is a need to prove that the product needs to be changed - especially in large companies. For decision makers, the numbers are clear, understandable and familiar. When you show that 10 out of 12 respondents were unable to pay for an item, or that it takes twice as long on average to sign up as competitors, it gives more weight to the results of the study.
  • Compare. If you are comparing your product to others in the market, you also need metrics. Otherwise, you will see the advantages and disadvantages of different projects, but you will not be able to assess what place your product occupies among them.
  • See the changes. Metrics are good for regularly testing the same product after changes have been made. They allow you to see the progress after the redesign, pay attention to those places that were left without improvement. Again, you can use these indicators as an evidence base that will show management the weight of investments in the redesign. Or just to understand that you have achieved results and are moving in the right direction.
  • Illustrate, emphasize. The numbers do a good job of illustrating important issues. Sometimes we count them for the most striking and important moments of the test, even if we do not use metrics in all tasks.

However, we do not use metrics in every test. You can do without them if the researcher works closely with the project team, there is internal trust and the team is mature enough to correctly prioritize problem solving.

Data fixing method

It might seem: what is wrong with a notepad and pen, or just an open Word document? In today's agile world of development, UX researchers need to get their observations to the team as quickly as possible.

To speed up analysis, it is good to prepare a template for notes in advance. We tried doing this in specialized software (for example, Noldus Observer or Morae Manager), but in practice plain tables turned out to be the most flexible and versatile. Mark out in the table beforehand the questions you definitely plan to ask, places for entering the problems found in each task, and the hypotheses (for each respondent you will mark whether it was confirmed or not). Our tables look something like this:

What else can you use:

  • A customizable Excel template for entering observations on each respondent, with a built-in timer that measures task completion time; time and success graphs are generated automatically. (A plain-CSV alternative is sketched after this list.)
  • Rainbow Spreadsheet by Tomer Sharon of Google. A visual table for collaboration between the researcher and the team. The link leads to an article describing the method, which also links to a Google spreadsheet with a template.
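If neither specialized software nor a shared spreadsheet fits your workflow, the same structure can be prepared as a plain per-respondent CSV before the session. The sketch below is only an illustration: the column names and task list are made up, not a standard.

# note_template.py - sketch: generate a blank per-respondent observation log
# as CSV, one row per planned task. All names are illustrative.
import csv

TASKS = ["registration", "search for a product", "checkout"]
COLUMNS = ["task", "success (100/50/0)", "time, s", "problems observed",
           "hypothesis confirmed?", "quotes"]

def make_template(respondent_id: str) -> str:
    path = f"observations_{respondent_id}.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        for task in TASKS:
            writer.writerow([task, "", "", "", "", ""])
    return path

if __name__ == "__main__":
    print(make_template("R01"))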

With experience, most entries can be made right during the test. If you did not manage in time, it is better to write down everything you remember immediately after the test. If you return to the analysis a few days later, you will most likely have to rewatch the video and spend much more time.

Preparing for testing

In addition to the method, metrics and testing protocol itself, you need to decide on the following things:

Format of communication with the moderator. The moderator can be in the same room as the test participant; then it is easy for him to ask questions at the right moment. However, the presence of the moderator can influence the respondent: he will start asking the moderator questions, explicitly or implicitly provoking hints.

We try to leave the respondent alone with the product for at least part of the test; his behavior becomes more relaxed and natural. And to avoid running back and forth if something goes wrong, you can leave a messenger with audio enabled so that the moderator can contact the respondent from the observation room.

Way of setting tasks. Tasks can be voiced by the moderator. But in this case, despite the unified testing protocol, the text of the task can be pronounced a little differently each time. This is especially true if the test is run by several moderators. Sometimes even small differences in wording can put respondents in different initial conditions.

To avoid this, you can either train the moderators to always read the task text verbatim, or give the respondents the tasks on paper or on screen. The difference in wording stops being a problem if a flexible scenario is used, where tasks are formulated during the test based on an interview with the moderator.

You can also use the product's own tools to set tasks. For example, when testing ICQ, respondents received tasks through a chat window with the moderator, and when testing Mail.Ru Mail, they arrived as emails. This way of setting tasks was as natural as possible for these projects, and at the same time we exercised the basic correspondence scenarios many times.

Creating a natural context. Even in laboratory research, think about how to bring the use of the product in the test closer to real conditions. For example, if you are testing mobile devices, how will respondents hold them? For a good video image it is better when the phone or tablet is fixed on a stand or lying on a table, but then you cannot tell whether all zones are reachable and comfortable to tap, because in real life phones are often held in one hand and tablets are used lying on the couch.

It is worth thinking about the environment in which the product will be used: is something distracting a person, is it noisy, is the Internet good. All this can be simulated in the laboratory.

Test plan for the customer. This is also an important preparation stage, since it involves the project team. You do not have to tell the customer about all the methodological details of the test (how you will communicate with the respondent, record data, and so on), but be sure to show him what the tasks will be and what you are going to check with them. Perhaps you missed some feature of the project, or the project team will come up with additional ideas and hypotheses. We usually end up with something like this:

Report plan. Naturally, the report is written according to the results of the study. But there is a good practice - to plan the report even before the tests, based on the goals and objectives of the study. With this plan in front of you, you can check your script for completeness, as well as prepare the most convenient forms to capture data for later analysis. You may decide that a report is not needed and that a shared observation file is sufficient for you and the team. And if you motivate the team to complete it with you, it will be absolutely great.

Of course, you can just “let your friend use the product” and see what difficulties they have. But a well-written script will allow you not to miss important problems and not accidentally push the respondent to the answers you need. After all, usability testing is a simplified experiment, and preliminary preparation is important in any experiment.

Any usability testing protocol consists of the following parts:

  • Instruction or briefing (greeting, description of the event, signing of documents).
  • Introductory interview (screening check, short interview about product use, context and scenarios).
  • Working with the product (testing tasks).
  • Collection of final product impressions based on testing experience.

Instruction or briefing

Regardless of the subject of testing, any study begins the same way. What should be done:

Create an atmosphere. Get to know the person, offer him tea, coffee or water, show where the toilet is. Try to relax the respondent a little, because he may be nervous before the event. Find out if it was easy to find you, ask how you are.

Describe the process. Explain what kind of session awaits the respondent, how long it will take, what parts it consists of, and what you will be doing. Be sure to point out that their input will help improve the product and that you are not testing the person's abilities. If you are videotaping, warn the respondent and tell them that the recording will not appear online. I say something like this:

We are located in the office of Mail.Ru Group. Today we will talk about the project XXX. This will take about an hour. First we will have a little chat, then I will ask you to try something in the project itself, and then we will discuss your impressions. We will be videotaping what is happening in the room and on the computer screen. The record is needed solely for analysis, you will not see yourself on the Internet.

We are conducting research to make the XXX project better, to understand what needs to be fixed in it and in what direction it should develop. Therefore, I kindly ask you to openly express any comments: both positive and negative. Don't be afraid to offend us. If something doesn’t work out while studying the project, take it easy. So, you and I have found a problem that the project team needs to fix. The main thing - remember that we are not testing you, you are testing the product. If you are ready, I suggest you start.

Sign the documents. As a rule, this is consent to the processing of personal data, and sometimes also a non-disclosure agreement about the testing. Tests with minors require parental consent for the child to participate in the study; we usually send it to parents in advance and ask them to bring it along. Be sure to explain why you are asking for signatures and give people time to read the documents. In Russia, people are wary of any papers that need to be signed.

Set up equipment. If you're using eye tracking, biometric equipment, or just videotaping, it's time to turn it all on. Warn the respondent when you start recording.

Introductory interview

It solves the following tasks:

Check the recruiting. Just in case, always start with this, even if you trust the agency or the person who found the respondent. More than once we discovered during a test that the respondent had misunderstood the questions and actually uses the product not quite the way we need. Avoid asking the screening questionnaire questions formally and verbatim: the person may already know what to answer.

Scenarios and context for using the product. Even if you don't have much time to test, don't skip this step. At least in general, find out from the respondent what tasks he solves with the help of the product, whether he uses similar projects, in what conditions he interacts with them and from what devices. The answers will help you better understand the reasons for the respondent's behavior, and if you use a flexible script, then formulate appropriate tasks. If there is enough time, ask the respondent to show what and how he usually does. This can be a source of further questions and insights.

Expectations and attitude. The start of testing is a good time to find out what the respondent knows about the product, how he feels about it and what he expects from it. After the test you will be able to compare expectations with the final impression.

For most tests this structure of the introductory interview works. But if you are testing a new product, you may want to cut the introductory questions short: discussing the topic in too much detail can create particular expectations of the product. In that case leave only a couple of general questions to establish contact with the respondent and move straight on to the tasks; scenarios, attitudes and context are better discussed after the user has first explored the product.

Working with the product, compiling tasks

What are the tasks

Let's imagine that you want to test an online store. You have important scenarios (search and selection of products, the checkout process), known problems (common errors in the payment form), and even the hypothesis that the designer did something with the price filter. How to formulate assignments?

Focused tasks. It seems obvious to write something like: "Choose a dishwasher 45 centimeters wide, with a beam-on-floor feature, costing no more than 30 thousand rubles." This motivates the respondent to use filters and compare products with each other. You will be able to check the price filter on all respondents and observe the key product selection scenario. Such tasks have a right to exist and are good for testing specific hypotheses (as with the price filter).

However, if the entire test consists of them, then you risk the following:

  • Spot check of the interface. You will only find issues related to job details (filter by price and width). You won't see other problems - like product sorting or other filters - unless you point them out as well. And you can hardly do tasks for all elements of the site.
  • Lack of involvement. Users often perform such tasks mechanically: when they see the first product that matches the criteria, they stop. It is quite possible that the respondent has never chosen a dishwasher in his life and does not care what "a beam on the floor" is. The more a task resembles a real-life situation and the more understandable context it has, the higher the chances of engaging the respondent, who will imagine that he is actually choosing the product. An engaged user "lives" the interface more fully and leaves more comments, which increases the chances of finding problems and gaining useful knowledge about the audience's behavior and characteristics.
  • A narrowed range of insights. In real life, the user might have chosen the product in a completely different way: they might not have used the filters at all (whereas your task points straight at them), or they might have searched by criteria the site does not offer. By giving rigid, narrowly focused tasks, you will not learn about the real context of product use, you will not discover scenarios the project team had not foreseen, and you will not collect data on content and functionality needs.

Tasks with context. One way to engage users better is to add a realistic story and context to a dry task. For example, instead of “Find a recipe for plum pie on the website,” suggest the following: “In an hour you will have guests. Find something you can bake in that time. You have everything for a sponge cake in the refrigerator, as well as a few plums. But, unfortunately, there is no butter.”

A similar approach can be used with an online store. For example: “Imagine you are choosing a gift for your sister. Her hair dryer recently broke, and she would be happy to get a new one. You need to stay within 7 thousand rubles.” It is important that the respondent picks a real person for whom they will “buy” the gift (if there is no sister, suggest another relative or a friend). The key factor for such tasks is how real and clear the context is. It is easy to imagine that you are choosing a gift for a relative, and much harder to imagine that you are “an accountant compiling an annual report.”

A striking example of this approach is the “Bollywood method” invented by the Indian UX expert Apala Lahiri Chavan. She argues that Indians, like many Asians, find it difficult to openly express an opinion about an interface. But when they imagine themselves as heroes of fictional dramatic situations (as in their favorite films), they open up and begin to participate actively in testing. So a task for Indian respondents might look something like this:

Imagine that your favorite young niece is about to get married. And then you find out that her future husband is a fraudster and is already married. You urgently need to buy two flight tickets to Bangalore, for yourself and for the deceiver's wife, in order to stop the wedding and save the family from shame. Hurry up!

Tasks based on the respondents' experience. Recall that for successful testing, respondents must match the project's audience. So to test an online store of household appliances, we recruit people who have recently chosen appliances or are choosing them now. This is exactly what we rely on when compiling tasks based on the respondents' experience. There are two ways to use this approach:

  • Respondent parameters. In this case, you adapt fixed tasks to each respondent. For example, in the case of the household appliance store and the filter task, you ask the person what exactly they recently purchased, find out their criteria (price, features) and invite them to repeat the “purchase” on your site.
  • Respondent scenarios. The tasks are formed entirely from the participants' experience. To understand which scenarios to check, the moderator finds out exactly how the person solved the problem in real life and suggests doing the same on the site. For example, before buying, a shopper spent a long time comparing several models with each other. Even if the site has no suitable feature, invite the respondent to compare products to understand which parameters they rely on. You may get ideas about what a comparison feature should look like, and also adapt the product page to this scenario.

Tasks like these provide many real-life examples of basic product operations, which usually yields a much wider range of problems and findings. In addition, they let you test the product against new scenarios that you did not consider basic or had not thought of at all.

When we tested the Real Estate Mail.Ru project, it was precisely the tasks based on the respondents' experience that led to many discoveries. We saw that when searching for an apartment in the Moscow region, people select the terminal metro stations in the geo filter, meaning the stations that can be reached from the region, whereas we had assumed the metro filter would be used to find an apartment near a particular station. We also learned how the scenarios for searching for new buildings differ from those for resale properties, which helped move the search for new buildings into a separate section on the site, with its own filters and its own way of describing apartments. I also recommend Jared Spool's excellent book on the benefits of such tasks.

Tasks without tasks. Sometimes it is better not to give users any tasks at all, but to watch how they start getting acquainted with the product on their own. Give the respondent an introduction: “Imagine that you have decided to try this product. I'll leave you for a few minutes. Do what you would do in real life. I'm not giving you any assignments.”

It is important that the moderator actually leaves the room. Otherwise, the user is tempted to immediately start asking and clarifying: “Do I need to register? How do I do that?” and so on.

This type of task is useful for completely new products. We often use it for mobile applications and games. This is how we find out whether users read the onboarding materials, which details immediately attract attention, what people understand about the product's concept, and how they later describe its capabilities. After the free-form task come the planned, specific scenarios.

Another area where free-form tasks apply is content projects. If you want to understand how your articles are read (where readers linger, what they skip, which elements on the page they pay attention to), just leave the respondent alone with the project for a few minutes. Only without a moderator looking over their shoulder will the user relax and read the text the way they usually do. This is how we test the Mail.Ru News, Lady Mail.Ru and other projects. This approach allowed us to identify different patterns of behavior on the site and of reading articles, and to understand which types of materials should be formatted differently.

Writing good tasks

The first task should be simple. Start the test with introductory, easy tasks. The respondent needs to get used to the test format, especially if you use the think-aloud method: they need to grow accustomed to voicing their thoughts and feelings. Do not dump all the pain and suffering of the interface on them at once.

Don't give hints. Formulate tasks so that you do not prompt the respondent towards the right action. If you want to test adding products to favorites in an online store, avoid the task “Let's add this TV to favorites,” especially if that is what the button is called. Having read the task, the respondent will simply find the button with the matching label on the screen, possibly without even understanding what exactly they are doing.

It is better to explain the meaning of the task without using terms from the interface. For example: “The site lets you save products you like and later choose which of them to order. Let's try to do that with this TV.”

Watch the terminology. Do not use obscure words and symbols. It sounds obvious, but once we get used to certain terms, we often forget that few people outside the IT community know them. For example, we had a hard time testing the new threads feature (chains of emails) in Mail.Ru Mail: users unfamiliar with the feature simply have no word in their heads for threads.

In the end, we did not name them at all. We simply showed respondents a mailbox with emails linked into threads, discussed the new feature, and let users choose their own word for it. This later helped us use the most understandable wording in the educational promo materials.

Keep an eye not only on the tasks but also on the moderator's questions, especially those passed along by the team during testing. For example, when discussing features, you should not use the word “toolbar”: not everyone knows it. A few years ago, not all users even knew the word “browser.” How best to formulate tasks depends on the test audience, so do not rush to the other extreme of explaining every term in sight: experienced gamers, for example, do not need “buff,” “frag,” or “respawn” explained.

Less test data. It is often tempting to create a test account for the respondent and run the test in it. After all, you can set everything up in that account in advance, avoid collisions, and not waste time registering or logging the respondent in. It is also often technically much easier to roll out a new design on test data than on real data.

However, with this approach you risk getting much less useful results, because test actions have no real consequences. The situation becomes entirely artificial, and it is hard for users to project it onto real experience.

For example, when working in their own social network account, respondents, as in real life, are careful with everything their friends can see (posting links, sending messages). When organizing their own mailbox, they try not to delete important emails. When testing online stores, one approach is to have the reward spent during the test itself: then the respondent does not click the first product that matches the task, but picks what they really need.

With only test data, you will find problems related only to that data and will not exercise the functionality across different variations. For example, when we tested the social panel of the Amigo browser, one respondent who connected his VKontakte account to the panel immediately noted that it was inconvenient for him to read his feed this way: almost the entire feed consisted of subscriptions to groups with erotic photographs, and in the narrow panel it was simply impossible to make out anything in the pictures.

Another problem with test data is that it is harder for the respondent to get oriented in the system when everything around looks unfamiliar. For example, a social network user is used to recognizing their page by their own photo. Even when testing prototypes, we try to personalize them as much as possible: when testing clickable prototypes for Odnoklassniki, we always adapt them to each user, inserting their name and photo on the page, and sometimes their latest news.

Don't limit yourself to the interface. Remember that interaction with a product is often not confined to a single interface. If possible, test related products and services and the links between them. When testing games, we try to check not only the game itself, but also its website, the download and registration process, and searching for information on the forum. And when testing one online store, I also checked the operator's call after the order was placed, which yielded recommendations for the call center.

Think about timing. For a good script, it is important to prioritize tasks. If the system is large and the test has many goals, you will most likely want to include a lot of tasks; however, a tired respondent is no longer useful. A good test lasts no more than an hour and a half, two hours at most (games are the only exception). And remember that the session consists not only of tasks but also of interviews, questionnaires, setting up equipment and signing documents, which usually takes at least half an hour.

If there are too many tasks but you don't want to drop any, you can rotate the lowest-priority ones, that is, show each of them to only part of the respondents; a minimal sketch of such a rotation is shown below. Alternatively, make part of the test mandatory for everyone and run the remaining tasks only with respondents who have time left. Keep in mind, though, that those will most likely be the fastest and most successful respondents, so the extra tasks will be seen by a biased sample.
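Here is a minimal sketch of such a rotation, with hypothetical task names; the task lists and the number of optional tasks per session are assumptions for illustration, not part of any real test plan:

```python
from itertools import cycle

# Core tasks that every respondent performs; optional tasks are rotated
# so that each one is seen by roughly the same number of people.
core_tasks = ["choose a dishwasher", "go through checkout"]
optional_tasks = ["compare two models", "subscribe to the newsletter",
                  "edit the delivery address"]

respondents = [f"respondent {i}" for i in range(1, 7)]
optional_pool = cycle(optional_tasks)
OPTIONAL_PER_SESSION = 1  # keeps each session within the time budget

schedule = {
    person: core_tasks + [next(optional_pool) for _ in range(OPTIONAL_PER_SESSION)]
    for person in respondents
}

for person, tasks in schedule.items():
    print(person, "->", ", ".join(tasks))
```

With six respondents and three optional tasks, each optional task ends up in two sessions, so none of them is silently dropped from the study.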

Evaluate the usefulness of each task. Consider whether it really matches your hypotheses. Say you want to test the news subscription feature on a website. The task “Subscribe to the newsletter” will only check whether people who deliberately look for the subscription can find it. But people rarely come to a site specifically to subscribe to news, so the task does not reflect real life. You also need to understand whether people performing completely different tasks notice the option to subscribe.

You can check this in different ways, depending on how the feature is implemented. If the person worked on tasks during which they could have seen the option to subscribe, ask whether such an option exists on the site. Just be sure to ask where they saw it or how it works, to make sure the respondent is not simply agreeing with you.

If the subscription offer is built into registration or checkout, see whether the respondent uses it, and discuss it after the task. The chance that people will actually subscribe to mailing lists in laboratory conditions is very small, but you can check whether they noticed the option, what they would expect from the mailing list, and so on.

Collecting final impressions

The purpose of the final phase of testing is to gather impressions of working with the product, to understand what the user liked and what upset them, and to evaluate subjective satisfaction. Typically, this part of the test combines an interview with the moderator and formal questionnaires.

Interview with moderator

In the final interview, we ask respondents roughly the same questions every time: “What are your impressions?”, “What did you like and what didn't you like?”, “Was there anything that seemed difficult or inconvenient?”, “What was missing?”, “What would you like to change in the product?”. This is also the time to clarify any puzzling moments in the respondent's behavior, if you did not do so during the test. If before the test you asked about their attitude toward the brand or product and their expectations of it, find out whether anything has changed. During the interview, pay attention to the following:

Social desirability. Handle interview results very carefully. If during the test you mostly hear impulsive comments provoked by problems, then in the final interview social desirability is in full bloom.

Some people think that by talking about problems in a product they are admitting their own incompetence. Others simply do not want to upset a pleasant moderator. Very often, respondents (especially women) who have struggled through the entire test say that, on the whole, everything is fine. Negative feedback can also be driven by social desirability: if the respondent is convinced that the purpose of the test is to find flaws, they will diligently look for them.

Quotes and priorities. Although what participants say in the final interview often needs to be discounted by half, or even by a factor of ten, that does not make it useless. From the way respondents summarize their impressions, you can infer priorities. Is the product “rubbish”? What exactly led to that verdict? Which of the many problems did the respondent remember most and find most annoying?

Do make allowance, though, for the fact that the last task is remembered best. It is also very useful to note which adjectives respondents use to describe the product and what they compare their experience to.

Don't forget the good. Very often a usability test report is a long list of the problems found during the test. Finding problems is indeed one of the main goals of the study, but do not forget the positive sides of the product.

First, a report without positive findings simply demotivates the team. Second, it is useful to know what users like about the product: what if, during the next redesign, someone decides to remove the very feature everyone liked? So be sure to ask respondents about the positive aspects of the product, even if they criticized the interface throughout the test.

Dealing with wishes. Most likely, respondents will share wishes and ideas along with their impressions. Your task is to understand what problem lies behind each proposal, because the solutions users suggest will most likely not suit you: test participants are not designers and are not aware of the specifics and constraints of development. Behind any such request, however, there is a need that you must capture. If a respondent says they definitely need a big green button right here, be sure to ask why.

Measuring satisfaction

It is often hard to tell from the final interview alone whether the respondent liked the product, and it is even harder to compare the attitudes of several respondents who each noted both strengths and shortcomings. This is where questionnaires come to the rescue. First, when filling out a questionnaire (especially before talking with the moderator), the influence of the notorious social desirability is somewhat weaker, although you will never get rid of it completely. Second, a questionnaire gives you clear parameters for comparing scenarios, products or project stages.

Compiling a good questionnaire is a separate and very large topic: wording, scales and much more matter here. Ready-made, proven questionnaires can be a great help, since they have already been refined and repeatedly validated. The only problem is that almost none of them have official Russian translations. You can, of course, translate them yourself, but methodologically such translations need to be validated to check the wording. Even so, these questionnaires can serve as a guide when compiling your own.

Some questionnaires are given after each task to assess satisfaction with specific scenarios. For example:

  • After Scenario Questionnaire (ASQ). Three questions covering the ease of completing the task, the time it took, and the supporting information in the system.
  • Single Ease Question (SEQ). A single question about how difficult the scenario was.

Other questionnaires are used in the final phase of testing. Here are some examples we use when needed:

  • System Usability Scale (SUS) and Post-Study System Usability Questionnaire (PSSUQ). Two classic, popular questionnaires created more than 20 years ago. Both consist of statements with which respondents indicate their degree of agreement; together the statements characterize the product's usability from different angles. For example: “I could easily find the information I needed,” “The various features of the system are easy to access,” and so on. A minimal sketch of the standard SUS scoring appears after this list.
  • An adjective-based questionnaire that often helps us in tests. The user is given a set of adjectives and chooses the ones that describe the product. The result is a word cloud of your project's characteristics. This technique often yields very interesting findings.
  • Game Experience Questionnaire (GEQ). Classic usability questionnaires do not work for games: engagement in the gameplay matters far more than how understandable the interfaces are. So for games you should either build a special questionnaire or use the Game Experience Questionnaire, which contains several modules: a core module, an in-game block, a post-game questionnaire, and a module for the game's social features.
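For reference, the SUS is scored with a well-known public formula: each odd-numbered item contributes (score − 1), each even-numbered item contributes (5 − score), and the sum is multiplied by 2.5 to give a 0–100 score. A minimal sketch, with hypothetical answers from one respondent, might look like this:

```python
def sus_score(answers):
    """Compute the standard 0-100 SUS score.

    answers: ten ratings from 1 (strongly disagree) to 5 (strongly agree),
    given in the order of the ten SUS statements.
    """
    if len(answers) != 10:
        raise ValueError("SUS has exactly ten statements")
    total = 0
    for i, score in enumerate(answers, start=1):
        # Odd-numbered items are positively worded, even-numbered negatively worded.
        total += (score - 1) if i % 2 == 1 else (5 - score)
    return total * 2.5

# Hypothetical answers from a single respondent.
print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # -> 85.0
```

Per-respondent scores like this make it easy to compare sessions, scenarios or design iterations, which is exactly the comparison the interview alone struggles to provide.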