2.2.1. Usability Evaluation

HCI became important in the late 1970s as part of the development of the personal computer. Work at Xerox PARC, Apple, Loughborough University, Carnegie-Mellon University and the University of California at San Diego considered how people worked with existing computer systems and how to make that interaction more efficient and easier. The development of systems like the Xerox Star and the Apple Macintosh was based to a considerable extent on the evaluation of "usability" - a slippery term, frequently equated with the notion of "user-friendliness" (i.e. the initial ease of use of a system) but more recently with something like Nielsen's (1993:26) five-dimensional definition of learnability, efficiency, memorability, errors (or lack thereof) and satisfaction.

The bulk of early HCI designers, and especially evaluators, were cognitive psychologists. Cognitive models like GOMS (Card et al., 1983) were very influential, as were laboratory experiments such as those conducted in the formative evaluation of the Xerox Star's interface (Bewley et al., 1983). The link between laboratory experiments and user evaluation was quite explicit, and is seen in the editorial comment in a set of readings from the 1980s, which talks of "carrying out one's own evaluation experiments" (Preece and Keller, 1990:325).

That there has been a change from this position is due to four factors: the arrival in HCI of researchers from disciplines other than cognitive psychology (notably social psychology and sociology), bringing different methods; the mid-1980s critique of cognitive psychology and its related discipline of artificial intelligence by philosophically and sociologically informed writers such as Winograd and Flores (1986) and Suchman (1987); a pragmatic sense among various writers, notably Nielsen (1993), that full-scale evaluation of usability is too complicated in many cases, so that 'discount' methods are useful instead; and finally the focus on user participation in the design and evaluation process (Greenbaum and Kyng, 1991), which has changed the relationship between user and evaluator.

These trends have led to an enormous number of different methods in regular use for the evaluation of usability. They are reviewed in much greater detail by Crellin et al. (1990), Preece et al. (1994), Dix et al. (1993) and Holyer (1993). We might distinguish three main kinds of usability evaluation - expert, empirical and contextual. Broadly speaking, the last two correspond to the quantitative and qualitative methods of programme evaluation; the first is specific to HCI.

Expert evaluation, that is the evaluation of the system's usability by experts rather than real users, comes in two varieties: formal and heuristic. Formal evaluation relates to the 'proof' of the usability of a system using some abstract technique: mathematical, psychological, or analytic. Typical of the first are the system and user models produced by computer scientists such as Michael Harrison (Harrison and Thimbleby, 1990); of the second, the modelling of users based on abstract (usually cognitive) psychological approaches, and the comparison of some aspect of the system against those models, such as GOMS (Card et al., 1983); and of the third, the method of task analysis (Diaper, 1989), which decomposes a set of user actions from the complexity of daily activity to the level of the individual mouse-click, to determine inefficiencies and inappropriate steps in their activities.
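As a rough illustration of the second, model-based kind of formal evaluation, the sketch below follows the Keystroke-Level Model, a simplified relative of GOMS also described by Card et al. (1983). It predicts how long a practised user would take over a task by summing standard times for low-level operators. The operator times are approximate values of the kind commonly quoted for this model and should be treated as assumptions; the two task sequences and the function name are hypothetical and invented for illustration.

```python
# Approximate operator times in seconds. These are illustrative
# assumptions, not measurements taken from any study cited here.
OPERATOR_SECONDS = {
    "K": 0.2,   # press a key (skilled typist)
    "P": 1.1,   # point at a target with the mouse
    "B": 0.1,   # press or release a mouse button
    "H": 0.4,   # move a hand between keyboard and mouse
    "M": 1.35,  # mentally prepare for the next step
}


def predict_seconds(operators: str) -> float:
    """Sum the operator times for a sequence such as 'MKK'."""
    return sum(OPERATOR_SECONDS[op] for op in operators)


# Hypothetical task: issue the same command in two ways.
# Menu route: think, reach for the mouse, point at and click the menu,
# think again, point at and click the menu item.
menu_route = "MHPBBMPBB"
# Keyboard route: think, then press a two-key shortcut.
shortcut_route = "MKK"

for name, route in [("menu", menu_route), ("shortcut", shortcut_route)]:
    print(f"{name}: {predict_seconds(route):.2f} s")
```

A comparison of this kind says nothing about learnability or satisfaction; it only ranks alternative designs on the predicted speed of error-free expert use.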

These formal methods will not be considered any further here. They are all concerned with abstract models of a kind which no longer seem very easy to develop, especially since they are almost always rooted in cognitive psychology and the information processing model, which has become increasingly discredited through the work mentioned above and now tends to be found only in 'post-cognitive' approaches such as distributed cognition (Rogers and Ellis, 1994). It seems clear to me that the best way to find out whether a system is usable is to use it, and this is even more the case with the increased complexity of people and organisational systems found in CSCW.

Heuristic evaluation is more useful. It relates to the evaluation of the system by its designers according to their own intuitive understandings of quality, published guidelines for good design such as the Apple Human Interface Guidelines, or checklists of broad signs of usability. The two key points about it are that it is what most designers do anyway and that it is easy. On the first point, most designers of modern computer systems follow user interface guidelines published by Apple, IBM, Microsoft, Sun and others; they may have to hand usability criteria representing the collected wisdom of many researchers (as found in a book like Nielsen, 1993); and almost all designers make decisions as they go along as to what interface options would be best for the users they're designing for. On the second point, it can easily be done by an individual at a desk, by a colleague strolling over and saying "that doesn't look very good" at a particular bit, or by a meeting in which a group of designers working together argue out particular options. And as Nielsen discusses in his concept of "discount usability engineering", using some method, however inadequate, is better than using none at all.

Moving on, then, to the second broad category of HCI methods, we find the empirical techniques - that is, those which are intended to test the usability of systems with real users but in laboratory settings rather than real-world settings (the term sometimes used is "user testing"). This is typically part of a formative evaluation to redesign the interface, or the whole application, based on users' responses, experiences and problems. These methods tend to derive from cognitive and social psychology, but their defining characteristic is their place of study: the laboratory rather than the workplace. The semi-situated studies that are found in HCI as much as in CSCW evaluation (i.e. of people doing their real work but in laboratories, which is easier for single-user systems as desituation is less problematic) can be of either this type or the following, contextual, type. These methods are popular among commercial software companies (organisations such as Microsoft have quite elaborate usability testing labs) but are by no means ubiquitous there.

There are many methods in use in laboratory studies of usability. These include: observation of users performing set tasks (in person, through one-way glass or on video-tape), think-aloud, and cooperative evaluation. The first of these is the classic laboratory study - get a user to perform a task, real or made-up to prove a point, measure what they do and draw conclusions (often statistical), based on times, results or logs of keystrokes and the like. One problem is that it may not be clear what they are doing, or think they are doing - this is addressed by having them "think aloud", talking through what they're doing as they do it - which provides the context for their actions but often radically affects what they do. Cooperative evaluation attempts to make this more natural by turning the talking into a conversation with the experimenter, who may explain why things are happening the way they are or gently guide the subject to solutions.
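The "measure what they do and draw conclusions (often statistical)" step of the classic laboratory study can be made concrete with a minimal sketch. The timings below are invented, the two designs are hypothetical, and Welch's t statistic is simply one reasonable way of comparing two independent groups of users; it is not taken from any of the studies cited here.

```python
from math import sqrt
from statistics import mean, stdev


def welch_t(a, b):
    """Welch's t statistic for two independent samples of task times."""
    var_a, var_b = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / sqrt(var_a / len(a) + var_b / len(b))


# Invented completion times in seconds for the same set task,
# one group of users per (hypothetical) interface design.
design_a = [41.0, 38.5, 45.2, 39.8, 43.1]
design_b = [35.2, 33.9, 37.4, 36.0, 34.5]

print(f"mean A = {mean(design_a):.1f} s, mean B = {mean(design_b):.1f} s")
print(f"Welch t = {welch_t(design_a, design_b):.2f}")
```

A large positive value suggests that users of design A took longer than users of design B, but with samples this small it says little on its own, and nothing about why the difference arose; that is where think-aloud and cooperative evaluation come in.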

These may be supplemented by interviews, focus groups and questionnaires. These are all ways of determining subjective reactions after the event (or before, for comparison), through individual or group discussion or through written answers. The discussions will produce qualitative data about what users thought of the system, how they might change it and so on; the questionnaires can also do this but might in addition provide statistically analysable data (if a large sample is being used).

Laboratory studies have their place in HCI. They are relatively easy to do (certainly compared to workplace studies), and often produce highly useful results. A classic example is the design of the Xerox Star user interface, which was based on a series of often highly quantitative lab studies (to determine which icons worked best and the like) - to evaluate the transfer from a large and then mostly unmapped design space to practical use (Bewley et al., 1983). Another good example is the design of the Olympic Messaging System (Gould et al., 1987), which employed a large amount of in-house (i.e. at IBM) evaluation as part of iterative design, often highly creatively.

Mention must be made here of the connection between HCI evaluation and iterative design based on several prototypes. This has been touched on above in cases such as the Olympic Messaging System. Its modern form in the commercial software industry is the issuing of "beta" versions of software to a wide audience, who may then comment on it so that it can be changed before the final version. While this is also concerned with matters other than usability testing (e.g. whether the system crashes all the time), usability is part of it. The recent beta-testing of Microsoft Windows 95 took place over about eighteen months and involved tens of thousands of people. This fulfils the requirements of Brooks' (1975) much-quoted phrase, "plan to throw one away; you will, anyhow". It is certainly better to test over several prototypes than to sell that throw-away version to the public and then solve the errors in an "update".

The third broad category of usability evaluation methods comprises those I have labelled 'contextual' above: that is, those which take greater account of the context of users and try to build that into their evaluations. Methods used here include interviews, questionnaires and focus groups again, but also direct observation and user feedback. The more explicitly ethnographic approaches fall into this category.

To briefly describe a couple of these - direct observation (sometimes using video-tape) is an important component of the elucidation of what people actually do and how that relates to the computer systems they use. It is sometimes (flippantly) called "hanging around", as it involves unobtrusively observing what is happening. Portions of what is occurring may be recorded on audio or video tape, for later analysis using theoretical approaches like conversation analysis (Woofitt, 1991), interaction analysis (Suchman and Trigg, 1991) or distributed cognition (Hutchins, 1995). Of course, ethnography has been used in CSCW evaluation (see e.g. Hughes et al., 1994), but it has a similarly useful place in usability evaluation.

Feedback can be more important in the formative evaluation of systems than might be immediately apparent. This is likely to be the case in beta-testing, where feedback is the point of the exercise, but it is also a way to conduct formative evaluation, in that user comments can be used in determining changes to be made in the new release. Sommerville (1992:284) suggests the incorporation of a "gripe" facility in software, to allow direct communication (somehow) with the developers.

Finally, an important part of contextual usability evaluation is the set of iterative methods found within participatory design. Although to an extent these are naturally concerned with design rather than evaluation, the fact that they tend to involve rapid prototyping necessarily brings in a good deal of formative evaluation. Indeed, a method such as the cooperative prototyping of Bodker and Gronbaek (1991), where prototypes are produced on the spot, will necessarily involve moment-to-moment participatory usability evaluation. Similarly, while PETRA (Ross et al., in press) is concerned with CSCW evaluation, the participatory redesign session we termed the Playschool is also a form of usability evaluation by users.

The advantage of the contextual methods is the extent to which they reflect real work done by real people, rather than either imposing abstract models on that work (as the formal methods do), taking a designer-knows-best stance (as the heuristic methods do) or cutting off the person from their work and environment (as the empirical methods do). In that sense, they are both pragmatic and empowering. However, it can't be denied that they are probably harder to do than the other methods, requiring negotiation of access to work settings and leading to enormous amounts of data to be analysed. They can often be less conclusive than an empirical study, in that they don't provide neat and easy (albeit frequently meaningless) numbers and results to be waved around.

Heterogeneity of methods is just as possible in usability evaluation as in any other sort of evaluation, of course. The Xerox Star evaluations used timed tests, but also asked users to specify which icons they preferred; conducted interviews; and video-taped usage to determine critical incidents. While all these fit into the empirical category above, and were all conducted in the laboratory, they are by no means all cognitive psychology methods or quantitative methods.

Finally, I have rather glossed over the purpose of HCI evaluation, suggesting that it is simply concerned with usability. While this is substantially the case, Vainio-Larsson and Orring (1990:327) comment that "while theoretically it may be possible to separate the concepts of functionality, usability and acceptability, this separation is more difficult in practice ... [this] may also reflect users' greater concern with social, organisational and functional aspects of the system than with details of interaction". That is, the same issues found in CSCW evaluation, of whether the system works and what users think of the whole system embedded in its socio-technical context, apply just as much to single-user evaluation. These have tended not to be considered much in HCI, but rather under the next banner, that of Information Systems.




Cooperative Systems Engineering Group | Computing Department | Lancaster University
Magnus Ramage 10 October 1995