Introduction

In order to evaluate the quality of human-system interaction, testers commonly need to measure usability, workload, training, and trust. As is the case for all measurement, testers should measure these concepts as precisely as possible, using validated scales to minimize measurement error. In the sections that follow, we identify validated scales designed to measure each of the concepts identified above and provide helpful information about their use, including:

Name(s), including acronyms
What it measures
Reference(s)
Information for creating your own survey forms including questions, anchors, and how to administer them
Instructions on scoring. If there are multiple, valid ways to score then they are listed.
Pseudocode (not specific to any computer language) to see how you would score scales in programs like Excel, SPSS, STATA, R, and Python.

If you have any questions, please contact the Test Science team at testscience2@ida.org for advice.

Overview

This section provides an overview of the validated scales approved by DOT&E for use in operational test and evaluation.

Note: There are no scales that measure situational awareness in a valid and reliable way. Scales exist which measure perceived situational awareness and are briefly discussed as a final section. But while potentially valuable, these measures are not valid for evaluating a requirement to increase operator situational awareness. If testers need to measure real (as opposed to perceived) situational awareness, they should look into a behavioral measure.

Measures	Links	Acronym	Scale Name	Advantages	Disadvantages	Subscales	Num Qs
Usability	S P	SUS	System Usability Scale	Widely given	Long. More complicated scoring	Overall	10
	S P	UMUX	Usability Metric for User Experience	Shorter than SUS. Based on ISO9241 definition of usability.	Reverse-scored items can confuse people	Overall	4
	S P	UMUX-LITE	Usability Metric for User Experience Lite	Short. Predicts SUS scores with high accuracy and correlates with NPS	Fewer outcome scores	Overall	2
Workload	S P I	NASA-TLX	NASA Task Load Index	Free app. Task agnostic	Long. Original scoring is complicated.	Overall	6
	S P I	NASA-TLX	NASA Task Load Index	Free app. Task agnostic	Long. Original scoring is complicated.	Weights*	15
	S P	ARWES/CSS	AFFTC Revised Workload Estimate Scale	Short (1 Q)	Small pool of data for comparison	Overall	1
Training Effectiveness	S	OATS	Operational Assessment of Training Scale	Construct subscales	Currently undergoing validation	Relevance	9
	S	OATS	Operational Assessment of Training Scale	Construct subscales	Currently undergoing validation	Efficacy	6
	S	DSoT	Diagnostic Survey of Training	Helpful for improving training	Not validated. Only used as a supplement	Course	8
	S	DSoT	Diagnostic Survey of Training	Helpful for improving training	Not validated. Only used as a supplement	Instructor	1
Trust	S P	TOAST	Trust of Automated Systems Test	Subscales	Currently undergoing validation	Understanding	4
Trust	S P	TOAST	Trust of Automated Systems Test	Subscales	Currently undergoing validation	Performance	5

Key: I = Instruction manual. NPS = Net promoter score. P = Paper. S = Scale. * = Weights only need to be filled out once for each task type.

Scale Details

Information for administering each scale is included below. This includes the title, citation information, individual items, scoring criteria, and any other details.

Usability

SUS

Information for Administrators

The SUS is the tried-and-true workhorse of the usability industry. It’s longer, but gets the job done with higher precision.

Full title: System Usability Scale
In-text citation: Brooke (1986)
Full citation: Brooke, J. (1986). SUS: a “quick and dirty” usability scale. In P. W. Jordan, B. Thomas, B. A. Weerdmeester, & A. L. McClelland (eds.). Usability Evaluation in Industry. London, England: Taylor and Francis.
Reading scores: Higher scores indicate more usability. The overall average is ~ 68.
- Note that scores are not percentages and should not be interpreted as such. When communicating with people unfamiliar with the SUS it can be useful to convert scores to percentiles.
Variations:
- A SUS with all positive items has also been validated. Items are available here and the full validation conducted by Kortum, Acemyan, & Oswald (2020) can be viewed here.

Information for Survey Forms

Title: SUS Scale
Scale anchors: 1 (Strongly Disagree), 5 (Strongly Agree)
Directions: Read each statement carefully and indicate the extent to which you agree or disagree using the scale provided.

Individual Items

Number	Item
1	I think that I would like to use this system frequently.
2	I found the system unnecessarily complex.
3	I thought the system was easy to use.
4	I think that I would need the support of a technical person to be able to use this system.
5	I found the various functions in this system were well integrated.
6	I thought there was too much inconsistency in this system.
7	I would imagine that most people would learn to use this system very quickly.
8	I found the system very cumbersome to use.
9	I felt very confident using the system.
10	I needed to learn a lot of things before I could get going with this system.

Scoring

Even-numbered items are reverse scored, and one is subtracted from each odd-numbered item, so that every item is on a 0 - 4 scale. The item scores are then summed and multiplied by 2.5 to give an overall score on a 0 - 100 scale.

In other words, odd-numbered items are scored as Response - 1 and even-numbered items are scored as 5 - Response, which we will refer to below as Score.

More formulaically, for scored items i, this can be expressed as:

$$Final Score = \left(\sum_{i=1}^{10} Score_{i}\right) \times{} 2.5 = ((SUS1 - 1) + (5 - SUS2) + (SUS3 - 1) + ... + (5 - SUS10)) \times{} 2.5$$

$$= (20 + (SUS1 + SUS3 + ... + SUS9) - (SUS2 + SUS4 + ... + SUS10)) \times{} 2.5$$

Pseudocode

// Assumes your items are numbered the same with the variable names SUS##
// where ## represents the item number. Individual items are first scored and
// have an 'r' appended to their name. Then a final score is calculated. 

// Create reverse-scored items
SUS01r = SUS01 - 1
SUS02r = 5 - SUS02
SUS03r = SUS03 - 1
SUS04r = 5 - SUS04
SUS05r = SUS05 - 1
SUS06r = 5 - SUS06
SUS07r = SUS07 - 1
SUS08r = 5 - SUS08
SUS09r = SUS09 - 1
SUS10r = 5 - SUS10

// Compute overall score
SUS_Overall = (SUS01r + SUS02r + SUS03r + SUS04r + SUS05r +
  SUS06r + SUS07r + SUS08r + SUS09r + SUS10r) * 2.5
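For readers scoring in Python, the pseudocode above can be sketched as a small function. This is a minimal sketch, not an official implementation: the function name `score_sus` and the choice to pass the ten responses as a list in item order are assumptions made for illustration.

```python
def score_sus(responses):
    """Return the 0-100 SUS score for one respondent.

    responses: ten integers from 1 to 5, in item order SUS1..SUS10
    (hypothetical input layout chosen for this sketch).
    """
    if len(responses) != 10:
        raise ValueError("The SUS needs exactly 10 responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("Responses must be integers from 1 to 5")

    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1   # odd items: Response - 1
        else:
            total += 5 - response   # even items are reverse scored: 5 - Response
    return total * 2.5              # rescale the 0-40 sum to 0-100


print(score_sus([4, 2, 4, 1, 5, 2, 4, 2, 4, 3]))  # prints 77.5
```

Validating the inputs first helps catch data-entry errors (for example, a 0 or a 6) before they silently distort the score.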

Reference Scores / “Grading”

Sauro & Lewis (2012) suggested a grading criterion that may be used.

SUS Score Range	Grade	Percentile Range
84.1 – 100.0	A+	96 – 100
80.8 – 84.0	A	90 – 95
78.9 – 80.7	A-	85 – 89
77.2 – 78.8	B+	80 – 84
74.1 – 77.1	B	70 – 79
72.6 – 74.0	B-	65 – 69
71.1 – 72.5	C+	60 – 64
65.0 – 71.0	C	41 – 59
62.7 – 64.9	C-	35 – 40
51.7 – 62.6	D	15 – 34
0.0 – 51.6	F	0 – 14
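The grading table can be applied programmatically with a simple lookup. The sketch below is illustrative only: the name `sus_grade` is hypothetical, and it assumes that each row's lower bound acts as a threshold, so a mean score that falls between two rows (e.g., 84.05) receives the lower of the two grades.

```python
# Lower bound of each grade band from the table above, highest first.
SUS_GRADE_BOUNDS = [
    (84.1, "A+"), (80.8, "A"), (78.9, "A-"),
    (77.2, "B+"), (74.1, "B"), (72.6, "B-"),
    (71.1, "C+"), (65.0, "C"), (62.7, "C-"),
    (51.7, "D"),
]


def sus_grade(score):
    """Map a 0-100 SUS score to a letter grade (Sauro & Lewis, 2012)."""
    if not 0 <= score <= 100:
        raise ValueError("SUS scores range from 0 to 100")
    for lower_bound, grade in SUS_GRADE_BOUNDS:
        if score >= lower_bound:
            return grade
    return "F"  # anything below the D band


print(sus_grade(68))    # prints C
print(sus_grade(77.5))  # prints B+
```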

Bangor, Kortum, & Miller (2008) developed a process for adding more intuitive verbal labels to scores. This is their equivalent breakdown.

SUS Score Range	Adjective
85.59 – 100.00	Best imaginable
72.76 – 85.58	Excellent
52.02 – 72.75	Good
39.18 – 52.01	OK
25.01 – 39.17	Poor
0.00 – 25.00	Worst imaginable

This figure from Bangor et al. (2008) shows a comparison.

UMUX

Information for Administrators

The UMUX is useful when you want more granularity in measuring usability than the UMUX-LITE.

Full title: Usability Metric for User Experience
In-text citation: Finstad (2010)
Full citation: Finstad, K. (2010). The usability metric for user experience. Interacting with Computers, 22, 323-327. doi: 10.1016/j.intcom.2010.04.004
Reading scores: Higher scores indicate more usability.
Variations: There is also a 5-point UMUX-LITE in use.

Information for Survey Forms

Title: UMUX Scale
Scale anchors: 1 (Strongly Disagree), 7 (Strongly Agree)
Directions: Read each statement carefully and indicate the extent to which you agree or disagree using the scale provided.

Individual Items

Number	Item
1	[This system’s] capabilities meet my requirements.
2	Using [this system] is a frustrating experience.
3	[This system] is easy to use.
4	I have to spend too much time correcting things with [this system].

Scoring

Items 2 and 4 are reverse coded and all items are converted onto a 0 - 6 scale for scoring. For comparison to the SUS's 0 - 100 scale, the UMUX score is then divided by the maximum and multiplied by 100.

In other words, items 1 and 3 are scored as Response - 1 and items 2 and 4 are scored as 7 - Response, which we will refer to below as Score.

In formulaic terms, for each item, i, in the four items:

$$Final Score = \frac{\displaystyle \frac{\sum_{i=1}^{4} Score_{i}}{4}}{6} \times{} 100 = \frac{\displaystyle \frac{(UMUX1 - 1) + (7 - UMUX2) + (UMUX3 - 1) + (7 - UMUX4)}{4}}{6} \times{} 100$$

$$= \frac{12 + UMUX1 + UMUX3 - UMUX2 - UMUX4}{24} \times{} 100 $$

Pseudocode

// Assumes your items are numbered the same with the variable names UMUX#
// where # represents the item number. Individual items are first scored and
// have an 'r' appended to their name. Then a final score is calculated. 

// Score each item (items 2 and 4 are reverse scored)
UMUX1r = UMUX1 - 1
UMUX2r = 7 - UMUX2
UMUX3r = UMUX3 - 1
UMUX4r = 7 - UMUX4

// Compute overall score
UMUX_Overall = (((UMUX1r + UMUX2r + UMUX3r + UMUX4r) / 4) / 6) * 100
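A Python version of the UMUX scoring might look like the sketch below. It follows the rule stated in the Scoring section that items 2 and 4 (the negatively worded items) are reverse coded; the function name `score_umux` and the list-based input are assumptions for illustration.

```python
def score_umux(responses):
    """Return the 0-100 UMUX score for one respondent.

    responses: four integers from 1 to 7, in item order UMUX1..UMUX4
    (hypothetical input layout chosen for this sketch).
    """
    if len(responses) != 4:
        raise ValueError("The UMUX needs exactly 4 responses")
    if any(r not in range(1, 8) for r in responses):
        raise ValueError("Responses must be integers from 1 to 7")

    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1   # items 1 and 3: Response - 1
        else:
            total += 7 - response   # items 2 and 4 are reverse coded
    # Each item is now 0-6, so the maximum total is 4 * 6 = 24.
    return total / 24 * 100


print(score_umux([6, 2, 5, 3]))  # prints 75.0
```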

UMUX-LITE

Information for Administrators

The UMUX-LITE is a good, quick survey for measuring usability.

Full title: Usability Metric for User Experience LITE
In-text citation: Lewis, Utesch, & Maher (2013)
Full citation: Lewis, J.R., Utesch, B.S., & Maher, D.E. (2013). UMUX-LITE: When there’s no time for the SUS. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Chicago, IL (pp. 2099 - 2102). doi: 10.1145/2470654.2481287
Reading scores: Higher scores indicate more usability.

Information for Survey Forms

Title: UMUX-LITE Scale
Scale anchors: 1 (Strongly Disagree), 7 (Strongly Agree)
Directions: Read each statement carefully and indicate the extent to which you agree or disagree using the scale provided.

Individual Items

Number	Item
1	[This system’s] capabilities meet my requirements.
2	[This system] is easy to use.

Scoring

The UMUX-LITE can be reported in any of the following ways:
- As the average of the two items. $$Final Score = \frac{UMUXLITE1 + UMUXLITE2}{2}$$
- For comparison to the SUS's 0 - 100 scale, using the equation below, which the authors call the UMUX-LITEr. $$Final Score = 0.65 \times{} ((UMUXLITE1 + UMUXLITE2 - 2) \times{} \frac{100}{12}) + 22.9$$ If using this format, note that the UMUX-LITE's two questions do not cover the full range of the SUS. The range of the UMUX-LITE for comparison to the SUS is [22.9, 87.9].
- As percentile ranks on each item, where item one measures usefulness and item two measures usability, as shown here.

Image via MeasuringU

Pseudocode

// Assumes your items are numbered the same with the variable names UMUXLITE#
// Where # represents the item number
UMUXLITE_Overall = 0.65 * ((UMUXLITE1 + UMUXLITE2 - 2) * (100 / 12)) + 22.9
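The two formula-based scoring options can be sketched in Python as follows. The function names are hypothetical; the constants 0.65 and 22.9 come directly from the equation above. Checking the endpoints reproduces the stated range of [22.9, 87.9].

```python
def umux_lite_mean(item1, item2):
    """Average of the two UMUX-LITE items (1-7 scale)."""
    return (item1 + item2) / 2


def umux_lite_sus_equivalent(item1, item2):
    """SUS-comparable score using the regression equation above."""
    for response in (item1, item2):
        if response not in range(1, 8):
            raise ValueError("Responses must be integers from 1 to 7")
    # (item1 + item2 - 2) ranges from 0 to 12; rescale to 0-100 first.
    rescaled = (item1 + item2 - 2) * (100 / 12)
    return 0.65 * rescaled + 22.9


# Lowest and highest possible responses give the ends of the range.
print(round(umux_lite_sus_equivalent(1, 1), 1))  # prints 22.9
print(round(umux_lite_sus_equivalent(7, 7), 1))  # prints 87.9
```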

Workload

NASA-TLX

Information for Administrators

The NASA-TLX has been performing well for decades, but the original scoring method is complicated so we highly recommend the raw TLX scoring method or using the app on NASA’s web site.

Full title: NASA Task Load Index
In-text citation:
- Original Chapter: Hart & Staveland (1988)
- Raw TLX scoring: Hart (2006)
Full citations
- Original Chapter: Hart, S. G. & Staveland, L. E. (1988) Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In P. A. Hancock and N. Meshkati (Eds.) Human Mental Workload. Amsterdam: North Holland Press.
- Raw TLX scoring: Hart, S. G. (2006, October). NASA-task load index (NASA-TLX); 20 years later. In Proceedings of the human factors and ergonomics society annual meeting (Vol. 50, No. 9, pp. 904-908). Sage CA: Los Angeles, CA: Sage Publications.
Reading scores: Higher scores indicate higher workload.
Variations:
- (Recommended) A raw TLX version without any weighting can be used to simplify administration and scoring
- There is an app available through NASA
- If task load precludes you from administering the scale during an operational test, it may be administered retrospectively; see the administration manual for details.
Raters should be given the rating scale definitions for the duration of the time they are filling out ratings or weights associated with the NASA-TLX.
Administrative Note: The NASA-TLX is administered as part of a process during each task type¹.
1. DEFINE THE TASK(S).
2. (Optional) CONDUCT A HIERARCHICAL TASK ANALYSIS (HTA) FOR THE TASK(S) UNDER ANALYSIS.
3. SELECT PARTICIPANTS based on the goals of the analysis.
4. BRIEF PARTICIPANTS by explaining the purpose of the study and the basics of the NASA-TLX method. A workshop on mental workload and a brief run-through of the NASA-TLX may be useful.
5. PERFORM TASK UNDER ANALYSIS. The participants should perform the tasks and fill out the NASA-TLX form either during the trial or immediately post-trial.
6. FOLLOW WEIGHTING PROCEDURE. Present the 15 pairwise comparisons to the participants, asking them to select from each pair the subscale that contributed the most to the workload of the task.
7. COMPLETE NASA-TLX RATING. Ask participants to give a rating for each subscale from 0 (low) to 20 (high).
8. TLX SCORE CALCULATION. The TLX software can calculate the overall workload score between 0 and 100.
- Weights only need to be completed once for each task type.

Example Scripts

These example scripts should be read before the first time a rater fills out the weights and ratings. Slight wording changes may be applied, e.g., change “experiment” to “test” or “evaluation”, or change, “You will evaluate the task by putting an ‘X’ on each of the six scales…” to the appropriate way information is being collected.

There are two sets of instructions included on NASA’s web site:

Information before filling out weights for the first time.
Information before filling out ratings for the first time.

Information for Survey Forms

Reference Table

Title: NASA-TLX Reference Sheet Definitions
Scale anchors: N/A for this reference sheet
Directions: N/A for this reference sheet
Note: You can give this form to raters for the duration of the task.

Factor	Endpoints	Description
MENTAL DEMAND	Low/High	How much mental and perceptual activity was required (e.g., thinking, deciding, calculating, remembering, looking, searching, etc.)? Was the task easy or demanding, simple or complex, exacting or forgiving?
PHYSICAL DEMAND	Low/High	How much physical activity was required (e.g., pushing, pulling, turning, controlling, activating, etc.)? Was the task easy or demanding, slow or brisk, slack or strenuous, restful or laborious?
TEMPORAL DEMAND	Low/High	How much time pressure did you feel due to the rate or pace at which the tasks or task elements occurred? Was the pace slow and leisurely or rapid and frantic?
PERFORMANCE	Good/Poor	How successful do you think you were in accomplishing the goals of the task set by the experimenter (or yourself)? How satisfied were you with your performance in accomplishing these goals?
EFFORT	Low/High	How hard did you have to work (mentally and physically) to accomplish your level of performance?
FRUSTRATION LEVEL	Low/High	How insecure, discouraged, irritated, stressed, and annoyed versus secure, gratified, content, relaxed, and complacent did you feel during the task?

Individual Items

Sources of Load (Weights)

Title: Sources of Workload
Scale anchors: 0 (Left factor is more important), 1 (Right factor is more important)
Directions: Read each statement carefully and indicate which item is more important in your evaluation of this task.

	Left Factor	Right Factor
1	Mental Demand	Physical Demand
2	Mental Demand	Temporal Demand
3	Mental Demand	Performance
4	Mental Demand	Effort
5	Mental Demand	Frustration Level
6	Physical Demand	Temporal Demand
7	Physical Demand	Performance
8	Physical Demand	Effort
9	Physical Demand	Frustration Level
10	Temporal Demand	Performance
11	Temporal Demand	Effort
12	Temporal Demand	Frustration Level
13	Performance	Effort
14	Performance	Frustration Level
15	Effort	Frustration Level

Rating Scale

Title: NASA Task Load Index
Scale anchors: 0 (Low), 20 (High)
Directions: Read each statement carefully and indicate your response to each question.

Number	Factor	Left Anchor	Right Anchor
1	Mental Demand	Low	High
2	Physical Demand	Low	High
3	Temporal Demand	Low	High
4	Performance	Good	Poor
5	Effort	Low	High
6	Frustration Level	Low	High

Scoring

There are two ways to score the NASA-TLX: The unweighted procedure (recommended) and the weighted procedure.

Unweighted procedure, also referred to as the raw TLX

In the unweighted procedure, simply sum all ratings. $$Final Score = \sum_{i = 1}^{6}{Response_i} = TLX1 + TLX2 + ... + TLX6$$

Weighted Procedure

The original NASA-TLX includes a process to get an overall workload score in which each workload subtype is weighted according to its importance to the task. For example, mental workload is not very relevant to lifting heavy objects, so you would want it to contribute less to your workload score. To calculate the weights, respondents are first given a pair of subtypes (e.g., Mental vs. Physical) and asked which of the two is more important for this task. They make this comparison for every possible pair of subtypes. Each weight is the number of times a subtype was chosen as more important than another. Each factor, i, appears in five comparisons, j, so score each comparison as 1 if factor i was chosen and 0 otherwise, then sum: $$Weight_i = \sum_{j=1}^{5}{Score_j}$$

This will result in six weights, one for each factor, which together sum to 15.

You will then calculate a final score by multiplying each rating, i, by its relevant weight, summing, and dividing by 15. Because the weights sum to 15, the result is a weighted average on the same scale as the ratings (0 - 20 for the rating form above; multiply by 5 to express it on a 0 - 100 scale). $$Final Score = \frac{\sum_{i=1}^{6}{Response_i \times{} Weight_i}}{15}$$

Pseudocode

// UNWEIGHTED PROCEDURE
// Assumes your rating items are numbered the same with the variable names TLX#
// where # represents the item number. 
TLX_Overall = (TLX1 + TLX2 + TLX3 + TLX4 + TLX5 + TLX6)


// WEIGHTED PROCEDURE
// Assumes your rating items are numbered the same with the variable names TLX#
// where # represents the item number and weight items are numbered
// as the table above with the variable names TLX_WEIGHT##. Weights are 
// first scored 0 when the left factor was selected and 1 when the right factor
// was selected. A final score is calculated.

// Create weighting variables
TLX_WEIGHT_Mental_Demand =     (1 - TLX_WEIGHT01) + (1 - TLX_WEIGHT02) + 
                               (1 - TLX_WEIGHT03) + (1 - TLX_WEIGHT04) + 
                               (1 - TLX_WEIGHT05)
TLX_WEIGHT_Physical_Demand =   TLX_WEIGHT01 + (1 - TLX_WEIGHT06) + 
                               (1 - TLX_WEIGHT07) + (1 - TLX_WEIGHT08) + 
                               (1 - TLX_WEIGHT09)
TLX_WEIGHT_Temporal_Demand =   TLX_WEIGHT02 + TLX_WEIGHT06 + 
                               (1 - TLX_WEIGHT10) + (1 - TLX_WEIGHT11) + 
                               (1 - TLX_WEIGHT12)
TLX_WEIGHT_Performance =       TLX_WEIGHT03 + TLX_WEIGHT07 + TLX_WEIGHT10 + 
                               (1 - TLX_WEIGHT13) + (1 - TLX_WEIGHT14)
TLX_WEIGHT_Effort =            TLX_WEIGHT04 + TLX_WEIGHT08 + TLX_WEIGHT11 + 
                               TLX_WEIGHT13 + (1 - TLX_WEIGHT15)
TLX_WEIGHT_Frustration_Level = TLX_WEIGHT05 + TLX_WEIGHT09 + TLX_WEIGHT12 + 
                               TLX_WEIGHT14 + TLX_WEIGHT15

// Compute overall score
TLX_Overall = ((TLX1 * TLX_WEIGHT_Mental_Demand) + 
               (TLX2 * TLX_WEIGHT_Physical_Demand) +
               (TLX3 * TLX_WEIGHT_Temporal_Demand) +
               (TLX4 * TLX_WEIGHT_Performance) +
               (TLX5 * TLX_WEIGHT_Effort) +
               (TLX6 * TLX_WEIGHT_Frustration_Level)) / 15
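The weighting bookkeeping is easy to get wrong by hand, so the sketch below derives the 15 comparisons from the factor list instead of typing them out. It is an illustration under stated assumptions: the function names are hypothetical, `choices` uses the 0 (left factor) / 1 (right factor) coding from the Sources of Load table, and `ratings` is a dictionary keyed by factor name. `itertools.combinations` yields the pairs in the same order as that table. The weighted result is on the same scale as the ratings supplied.

```python
from itertools import combinations

FACTORS = ["Mental Demand", "Physical Demand", "Temporal Demand",
           "Performance", "Effort", "Frustration Level"]

# 15 (left, right) pairs, in the same order as the Sources of Load table.
PAIRS = list(combinations(FACTORS, 2))


def tlx_weights(choices):
    """Count how often each factor was chosen across the 15 comparisons."""
    if len(choices) != len(PAIRS):
        raise ValueError("Expected 15 pairwise choices")
    weights = {factor: 0 for factor in FACTORS}
    for (left, right), choice in zip(PAIRS, choices):
        if choice not in (0, 1):
            raise ValueError("Choices must be 0 (left) or 1 (right)")
        weights[right if choice == 1 else left] += 1
    return weights


def tlx_raw(ratings):
    """Unweighted (raw TLX) score: the sum of the six ratings."""
    return sum(ratings[factor] for factor in FACTORS)


def tlx_weighted(ratings, choices):
    """Weighted score: sum of rating * weight, divided by 15."""
    weights = tlx_weights(choices)
    return sum(ratings[f] * weights[f] for f in FACTORS) / 15


ratings = {factor: 10 for factor in FACTORS}
print(tlx_raw(ratings))                  # prints 60
print(tlx_weighted(ratings, [0] * 15))   # prints 10.0
```

When every rating is identical, the weighted score equals that rating regardless of the weights, which makes a convenient sanity check.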

ARWES/CSS

Information for Administrators

The ARWES is great for quickly getting at workload with a single question.

Full title: Air Force Flight Test Center (AFFTC) Revised Workload Estimate Scale
In-text citation: Ames & George (1993)
Full citation: Ames, L.L., & George, E.J. (1993). Revision and verification of a seven-point workload estimate scale. Edwards AFB, CA: Air Force Flight Test Center.
Reading scores: Higher scores indicate higher workload.
Note: The first, unvalidated version of this scale, called the Crew Status Survey (CSS), is sometimes confused with this validated, revised version. If your items do not match the ones below, discard them and use only these.

Information for Survey Forms

Title: Crew Status Survey
Scale anchors: N/A
Directions: Read each statement carefully and indicate the one that is most representative of your workload.

Individual Items

Number	Item
1	Nothing to do; No system demands.
2	Light Activity; minimal demands.
3	Moderate activity; easily managed; considerable spare time.
4	Busy; Challenging but manageable; Adequate time available.
5	Very busy; Demanding to manage; Barely enough time.
6	Extremely busy; Very difficult; Non-essential tasks postponed.
7	Overloaded; System unmanageable; Essential tasks undone; Unsafe.

Scoring

Since the ARWES/CSS is a one-item scale, no scoring is necessary.

Pseudocode

Since the ARWES/CSS is a one-item scale, no pseudocode is necessary.

Training Effectiveness

OATS

Information for Administrators

The OATS helps you benchmark or find problems in training without having to use open-ended questions.

Full title: Operational Assessment of Training Scale
Status: The OATS is currently under joint validation by DOT&E, ATEC, and JITC. For this reason there are currently no citations.
In-text citation: N/A
Full citation: N/A
Reading scores: Higher scores indicate that training is more effective or relevant.
Note: This scale is still undergoing validation and will likely change in the future.

Information for Survey Forms

Title: Operational Assessment of Training Scale (OATS)
Scale anchors: 1 (Strongly Disagree), 7 (Strongly Agree)
Directions: Please indicate the extent to which you agree or disagree with the following statements about the training you just completed. Your responses will be used to improve training for {INSERT PROGRAM NAME} and to develop a tool that {INSERT ORGANIZATION NAME} can use when testing future systems. Your responses will be completely anonymous.

Individual Items

Number	Subscale	Item
1	E	I’d be (I’m) confident using the system during real operations without additional training.
2	R	Training accurately portrayed operations in the field.
3	R	I would not make changes to the course content.
4	E	The training prepared me to easily use the system to accomplish my mission.
5	R	I can see myself using what I learned in training during real operations.
6	E	Training prepared me to solve common problems.
7	R	The course’s level of difficulty was appropriate for someone in my position.
8	R*	The course covered topics I don’t think should have been covered.
9	R	All of the information covered was relevant to how I interact with the system.
10	E	The training improved my understanding of how to interact with the system.
11	E*	I’d (I) want additional training before using the system during real operations.
12	E	The training prepared me to properly interact with the system.
13	R*	The training had a lot of information that wasn’t relevant to me.
14	R*	Training did not cover important ways I interact with the system.
15	R	Training adequately covered all important ways I interact with the system.

Key: E = Efficacy. R = Relevance. $^{*}$ Denotes that the item is reverse-scored.

Scoring

There are two subscales in the OATS and a few reverse-coded items.

Items denoted “reverse-scored” above are scored as 8 - Response, which we will refer to below as Score.

More formulaically, scored items i in each subscale with s total items can be expressed as: $$Subscale Score = \frac{\sum_{i=1}^{s} Score_{i}}{s}$$ $$Relevance Score = \frac{OATS2 + OATS3 + ... + (8 - OATS14) + OATS15}{9}$$ $$Efficacy Score = \frac{OATS1 + OATS4 + ... + (8 - OATS11) + OATS12}{6}$$

Pseudocode

// Assumes your items are numbered the same with the variable names OATS##
// where ## represents the item number. Individual items are first scored and
// have an 'r' appended to their name. Then a final score is calculated. 

// Create reverse-scored variables
OATS01r_Efficacy  = OATS01
OATS02r_Relevance = OATS02
OATS03r_Relevance = OATS03
OATS04r_Efficacy  = OATS04
OATS05r_Relevance = OATS05
OATS06r_Efficacy  = OATS06
OATS07r_Relevance = OATS07
OATS08r_Relevance = 8 - OATS08
OATS09r_Relevance = OATS09
OATS10r_Efficacy  = OATS10
OATS11r_Efficacy  = 8 - OATS11
OATS12r_Efficacy  = OATS12
OATS13r_Relevance = 8 - OATS13
OATS14r_Relevance = 8 - OATS14
OATS15r_Relevance = OATS15

// Calculate overall scores
OATS_Relevance = (OATS02r_Relevance + OATS03r_Relevance + 
                  OATS05r_Relevance + OATS07r_Relevance + 
                  OATS08r_Relevance + OATS09r_Relevance + 
                  OATS13r_Relevance + OATS14r_Relevance + 
                  OATS15r_Relevance) / 9
OATS_Efficacy  = (OATS01r_Efficacy + OATS04r_Efficacy + 
                  OATS06r_Efficacy + OATS10r_Efficacy + 
                  OATS11r_Efficacy + OATS12r_Efficacy) / 6
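In Python, the subscale membership and reverse-scored items from the table above can be stored once and reused, which avoids writing out each item by hand. This is a minimal sketch; `score_oats` and the dictionary input (item number mapped to response) are assumptions for illustration, and the item assignments may change as the scale's validation continues.

```python
RELEVANCE_ITEMS = (2, 3, 5, 7, 8, 9, 13, 14, 15)
EFFICACY_ITEMS = (1, 4, 6, 10, 11, 12)
REVERSED_ITEMS = {8, 11, 13, 14}   # items marked with * in the table


def score_oats(responses):
    """Return the two OATS subscale means for one respondent.

    responses: dict mapping item number (1-15) to a response from 1 to 7
    (hypothetical input layout chosen for this sketch).
    """
    if set(responses) != set(range(1, 16)):
        raise ValueError("Expected responses for items 1 through 15")
    if any(r not in range(1, 8) for r in responses.values()):
        raise ValueError("Responses must be integers from 1 to 7")

    # Reverse-scored items become 8 - Response; all others are unchanged.
    scored = {item: (8 - r if item in REVERSED_ITEMS else r)
              for item, r in responses.items()}
    return {
        "relevance": sum(scored[i] for i in RELEVANCE_ITEMS) / len(RELEVANCE_ITEMS),
        "efficacy": sum(scored[i] for i in EFFICACY_ITEMS) / len(EFFICACY_ITEMS),
    }


print(score_oats({item: 4 for item in range(1, 16)}))
# prints {'relevance': 4.0, 'efficacy': 4.0}
```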

DSoT

Information for Administrators

The DSoT quickly focuses you on ways to improve a training. It should be used as a supplement to the OATS or other training diagnostics; the DSoT should not be used instead of validated training measures.

Full title: Diagnostic Survey of Training
In-text citation: N/A
Full citation: N/A
Reading scores:
- Higher scores indicate that people would like more of that aspect of training
- Scores of four indicate that people do not want you to change the amount of material
Note: This scale is not validated, but is a useful instrument as a supplement to the OATS.

Information for Survey Forms

Title: Diagnostic Survey of Training (DSoT)
Scale anchors: 1 (Significantly Decrease), 7 (Significantly Increase)
Directions: Choose the option that best describes what you think should happen to each of the aspects of training in the list below.

Individual Items

Number	Item
1	Amount of hands-on training
2	Amount of lecture
3	Detail of course training content
4	Pace of the course training
5	Amount of reference materials provided
6	Amount of time for questions
7	Reinforcement of course training content
8	Overall training length

Scale anchors: 1 (Strongly Disagree), 7 (Strongly Agree)

Individual Items

Number	Item
1	The instructor did a good job overall

Scoring

The DSoT does not necessarily need to be scored and can rather be used as a diagnostic instrument. If you see that you have a lot of low scores for pace, perhaps you should discuss ways to draw things out longer such as adding more hands-on materials.
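Although no scoring is required, it can help to summarize how far each item's average sits from the "no change" midpoint of 4 across respondents. The sketch below is one hypothetical way to do that; the function name `dsot_summary` and the input layout (one list of eight responses per respondent, in the item order above) are assumptions, not part of the instrument.

```python
from statistics import mean

DSOT_ITEMS = [
    "Amount of hands-on training",
    "Amount of lecture",
    "Detail of course training content",
    "Pace of the course training",
    "Amount of reference materials provided",
    "Amount of time for questions",
    "Reinforcement of course training content",
    "Overall training length",
]
NO_CHANGE = 4  # midpoint of the 1-7 scale


def dsot_summary(all_responses):
    """Mean deviation from 'no change' for each of the eight course items.

    Positive values suggest respondents want an increase,
    negative values suggest they want a decrease.
    """
    if any(len(r) != len(DSOT_ITEMS) for r in all_responses):
        raise ValueError("Each respondent needs 8 responses")
    return {
        name: mean(r[index] for r in all_responses) - NO_CHANGE
        for index, name in enumerate(DSOT_ITEMS)
    }


summary = dsot_summary([[4] * 8, [6, 4, 4, 2, 4, 4, 4, 4]])
print(summary["Amount of hands-on training"])   # prints 1
print(summary["Pace of the course training"])   # prints -1
```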

Pseudocode

There is no pseudocode necessary.

Trust

For information about the importance of trust in automation see Lee & See (2004):

Lee, J.D., & See, K.A. (2004). Trust in Automation: Designing for Appropriate Reliance. Human Factors, 46(1), 50–80. doi: 10.1518/hfes.46.1.50_30392

TOAST

Information for Administrators

The TOAST gives you a quick sense of whether people dislike the system to the point that they either don’t use it or override its automated responses.

Full title: Trust of Automated Systems Test
In-text citation: Wojton et al. (2020)
Full citation: Wojton, H.M., Porter, D., Lane, S.T., Bieber, C., & Madhavan, P. (2020). Initial validation of the trust of automated systems test (TOAST). Journal of Social Psychology, 160(6), 735-750. doi:10.1080/00224545.2020.1749020
Reading scores: The TOAST has two subscales that should not be combined and each has a separate interpretation.
- Higher scores on the understanding subscale indicate that people trust the system more because they understand it.
- Higher scores on the performance subscale indicate that the system helps them perform their job duties.

Information for Survey Forms

Title: TOAST Scale
Scale anchors: 1 (Strongly Disagree), 7 (Strongly Agree)
Directions: Read each statement carefully and indicate the extent to which you agree or disagree using the scale provided.

Individual Items

Number	Subscale	Item
1	U	I understand what the system should do.
2	P	The system helps me achieve my goals.
3	U	I understand the limitations of the system.
4	U	I understand the capabilities of the system.
5	P	The system performs consistently.
6	P	The system performs the way it should.
7	P	I feel comfortable relying on the information provided by the system.
8	U	I understand how the system executes tasks.
9	P	I am rarely surprised by how the system responds.

Key: U = Understanding subscale. P = Performance subscale

Scoring

Each subscale can be scored by calculating the mean of the subscale.

For each item, i, in a subscale with s items: $$Subscale Score = \displaystyle \frac{\sum_{i=1}^{s} TOAST_{i}}{s}$$ Each scale separately: $$Understanding = \displaystyle \frac{TOAST1 + TOAST3 + TOAST4 + TOAST8}{4}$$ $$Performance = \displaystyle \frac{TOAST2 + TOAST5 + TOAST6 + TOAST7 + TOAST9}{5}$$

Pseudocode

// Assumes your items are numbered the same with the variable names TOAST#
// Where # represents the item number
TOAST_Understanding = (TOAST1 + TOAST3 + TOAST4 + TOAST8) / 4
TOAST_Performance   = (TOAST2 + TOAST5 + TOAST6 + TOAST7 + TOAST9) / 5

Do not create a total, composite TOAST score as validation showed that it was not a reliable measure.
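A Python sketch of the two subscale means is shown below. In keeping with the note above, it returns the subscales separately and deliberately does not compute a composite. The function name `score_toast` and the list-based input are assumptions for illustration.

```python
UNDERSTANDING_ITEMS = (1, 3, 4, 8)
PERFORMANCE_ITEMS = (2, 5, 6, 7, 9)


def score_toast(responses):
    """Return the two TOAST subscale means for one respondent.

    responses: nine integers from 1 to 7, in item order TOAST1..TOAST9
    (hypothetical input layout chosen for this sketch).
    """
    if len(responses) != 9:
        raise ValueError("The TOAST needs exactly 9 responses")
    if any(r not in range(1, 8) for r in responses):
        raise ValueError("Responses must be integers from 1 to 7")

    def subscale_mean(items):
        # Item numbers are 1-based; list positions are 0-based.
        return sum(responses[i - 1] for i in items) / len(items)

    return {
        "understanding": subscale_mean(UNDERSTANDING_ITEMS),
        "performance": subscale_mean(PERFORMANCE_ITEMS),
    }


print(score_toast([7, 1, 7, 7, 1, 1, 1, 7, 1]))
# prints {'understanding': 7.0, 'performance': 1.0}
```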

Situational Awareness

As mentioned previously, we highly recommend measuring situational awareness (SA) using behavioral measures tied to mission-critical outcomes. Techniques to measure real SA typically do not involve scales, and so we do not include them in this repository. For an overview of these techniques (e.g., the Situation Awareness Global Assessment Technique, or SAGAT), their benefits, and their limitations, please see this external repository:

https://ext.eurocontrol.int/ehp/?q=taxonomy/term/104

However, not all of these techniques are appropriate for all systems or tests, and details should be worked out at the program level.

In certain situations it may be important to measure perceived situational awareness. Perceived SA is a concept that can be measured with a scale. However, we do not include these measures here as in most cases this is not what testers desire, and efforts to validate commonly-used perceived SA scales have often found they measure other HSI concepts (e.g., workload).

Technical Note

¹ via HealthIT.gov
