Introduction

To evaluate the quality of human-system interaction, testers commonly need to measure usability, workload, training, and trust. As with all measurement, testers should measure these concepts as precisely as possible, using validated scales to minimize measurement error. In the sections that follow, we identify validated scales designed to measure each of these concepts and provide information about their use, including:

  • Name(s), including acronyms
  • What it measures
  • Reference(s)
  • Information for creating your own survey forms, including questions, anchors, and how to administer them
  • Scoring instructions. Where a scale has more than one valid scoring method, all of them are listed.
  • Pseudocode (not specific to any programming language) showing how to score each scale in programs such as Excel, SPSS, Stata, R, and Python.

If you have any questions, please contact the Test Science team at testscience2@ida.org for advice.

Overview

The table below provides an overview of the validated scales approved by DOT&E for use in operational test and evaluation.

Note: There are no scales that measure situational awareness in a valid and reliable way. Scales that measure perceived situational awareness do exist and are briefly discussed in the final section. Although potentially valuable, these measures are not valid for evaluating a requirement to increase operator situational awareness. If testers need to measure real (as opposed to perceived) situational awareness, they should look into a behavioral measure.

| Measures | Links | Acronym | Scale Name | Advantages | Disadvantages | Subscales | Num Qs |
|---|---|---|---|---|---|---|---|
| Usability | S P | SUS | System Usability Scale | Widely given | Long. More complicated scoring | Overall | 10 |
| Usability | S P | UMUX | Usability Metric for User Experience | Shorter than SUS. Based on ISO 9241 definition of usability. | Reverse-scored items can confuse people | Overall | 4 |
| Usability | S P | UMUX-LITE | Usability Metric for User Experience Lite | Short. Predicts SUS scores with high accuracy and correlates with NPS | Fewer outcome scores | Overall | 2 |
| Workload | S P I | NASA-TLX | NASA Task Load Index | Free app. Task agnostic | Long. Original scoring is complicated. | Overall | 6 |
| | | | | | | Weights* | 15 |
| Workload | S P | ARWES/CSS | AFFTC Revised Workload Estimate Scale | Short (1 Q) | Small pool of data for comparison | Overall | 1 |
| Training Effectiveness | S | OATS | Operational Assessment of Training Scale | Construct subscales | Currently undergoing validation | Relevance | 9 |
| | | | | | | Efficacy | 6 |
| Training Effectiveness | S | DSoT | Diagnostic Survey of Training | Helpful for improving training | Not validated. Only used as a supplement | Course | 8 |
| | | | | | | Instructor | 1 |
| Trust | S P | TOAST | Trust of Automated Systems Test | Subscales | Currently undergoing validation | Understanding | 4 |
| | | | | | | Performance | 5 |

Key: I = Instruction manual. NPS = Net promoter score. P = Paper. S = Scale. * = Weights only need to be filled out once for each task type.

Scale Details

Information for administering each scale is included below. This includes the title, citation information, individual items, scoring criteria, and any other details.

Usability

SUS
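
As an illustration of why SUS scoring is described as more complicated, the following is a minimal Python sketch of the commonly published SUS rule. It assumes ten items answered on a 1 to 5 agreement scale, with odd-numbered items positively worded and even-numbered items negatively worded. The function name `score_sus` is hypothetical, and the scoring instructions supplied with the scale itself take precedence over this sketch.

```python
def score_sus(responses):
    """Return the 0-100 SUS score for one respondent.

    responses: ten integers from 1 to 5, in item order (item 1 first).
    Assumes the standard SUS layout: odd items are positively worded,
    even items are negatively worded and therefore reverse-scored.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")

    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1   # odd item: contribution is response minus 1
        else:
            total += 5 - response   # even item: contribution is 5 minus response
    # Each item contributes 0-4, so the sum is 0-40; scale it to 0-100.
    return total * 2.5


# A respondent who answers 3 (neutral) on every item scores 50.
print(score_sus([3] * 10))
```

The result is a single overall score per respondent. It is on a 0 to 100 scale but is not a percentage.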

UMUX

UMUX-LITE
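
The overview table notes that UMUX-LITE predicts SUS scores. The sketch below shows one way this could be computed in Python. It assumes two positively worded items on a 7-point scale and uses the regression adjustment commonly reported in the UMUX-LITE literature (0.65 times the raw score, plus 22.9). The function names are hypothetical, and both the scale length and the coefficients should be checked against the scale's own scoring instructions before use.

```python
def umux_lite_raw(item1, item2):
    """Return the raw UMUX-LITE score on a 0-100 scale.

    Assumes both items are positively worded and answered on a
    7-point scale (1 = strongly disagree, 7 = strongly agree).
    """
    for response in (item1, item2):
        if response not in range(1, 8):
            raise ValueError("each response must be an integer from 1 to 7")
    # Each item contributes 0-6, so the sum is 0-12; scale it to 0-100.
    return (item1 + item2 - 2) * 100 / 12


def umux_lite_sus_estimate(item1, item2):
    """Return a SUS-comparable estimate from the two UMUX-LITE items.

    Uses the commonly reported regression adjustment (an assumption
    here, not taken from this repository): 0.65 * raw + 22.9.
    """
    return 0.65 * umux_lite_raw(item1, item2) + 22.9


print(umux_lite_raw(4, 4))            # raw score for two neutral answers
print(umux_lite_sus_estimate(6, 5))   # SUS-comparable estimate
```

Note that the adjusted estimate cannot reach the ends of the 0 to 100 range, which is a property of the regression adjustment rather than a bug in the sketch.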

Workload

NASA-TLX
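
To show how the six ratings and the 15 pairwise-comparison weights might combine, here is a minimal Python sketch of both the raw (unweighted) and the original weighted NASA-TLX scores. It assumes each subscale is rated from 0 to 100 and that each weight is the number of times a subscale was chosen across the 15 pairwise comparisons, so the weights sum to 15. The dictionary keys and function names are hypothetical labels chosen for this example.

```python
# Hypothetical short labels for the six NASA-TLX subscales.
SUBSCALES = (
    "mental", "physical", "temporal",
    "performance", "effort", "frustration",
)


def raw_tlx(ratings):
    """Return the raw TLX score: the unweighted mean of the six ratings."""
    return sum(ratings[name] for name in SUBSCALES) / len(SUBSCALES)


def weighted_tlx(ratings, weights):
    """Return the weighted TLX score.

    weights: how many of the 15 pairwise comparisons each subscale won
    (assumed to be integers from 0 to 5 that sum to 15).
    """
    if sum(weights[name] for name in SUBSCALES) != 15:
        raise ValueError("weights must sum to 15 (one per pairwise comparison)")
    weighted_sum = sum(ratings[name] * weights[name] for name in SUBSCALES)
    return weighted_sum / 15


ratings = {"mental": 80, "physical": 20, "temporal": 60,
           "performance": 40, "effort": 70, "frustration": 30}
weights = {"mental": 5, "physical": 0, "temporal": 3,
           "performance": 2, "effort": 4, "frustration": 1}

print(raw_tlx(ratings))                # unweighted mean of the six ratings
print(weighted_tlx(ratings, weights))  # weighted by pairwise-comparison wins
```

Because the weights depend on the task type rather than on a single trial, the same `weights` dictionary could be reused for every rating a respondent gives on that task type.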

ARWES/CSS

Training Effectiveness

OATS

DSoT

Trust

For information about the importance of trust in automation see Lee & See (2004):

Lee, J.D., & See, K.A. (2004). Trust in Automation: Designing for Appropriate Reliance. Human Factors, 46(1), 50–80. doi: 10.1518/hfes.46.1.50_30392

TOAST

Situational Awareness

As mentioned previously, we highly recommend measuring situational awareness (SA) using behavioral measures tied to mission-critical outcomes. Techniques to measure real SA typically do not involve scales, and so we do not include them in this repository. For an overview of these techniques (e.g., the Situation Awareness Global Assessment Technique, or SAGAT), including their benefits and limitations, please see this external repository:

https://ext.eurocontrol.int/ehp/?q=taxonomy/term/104

However, not all of these techniques are appropriate for all systems or tests, and details should be worked out at the program level.

In certain situations it may be important to measure perceived situational awareness. Perceived SA is a concept that can be measured with a scale. However, we do not include these measures here because in most cases perceived SA is not what testers want to measure, and efforts to validate commonly used perceived SA scales have often found that they measure other HSI concepts (e.g., workload).

Technical Note