Multi-Method Approach: Evaluating Human-System Interactions

The quality of human-system interactions is a key determinant of mission success for military systems. Testers often evaluate human-system interactions with survey instruments alone. Multi-method approaches are more comprehensive than single-method approaches and yield richer datasets: they reduce the risk that testers will report erroneous effects and provide greater confidence in the test results. The following example shows how a multi-method approach achieved the test's goal in a more rigorous and defensible way.

Scenario & Test Goal

Ten attack helicopter pilots identified and responded to threats under four conditions: high versus low threat density, and presence versus absence of a threat detection technology. Testers recorded two primary measures of pilot workload: an empirically vetted survey (the NASA Task Load Index, or NASA-TLX) and operator performance (time to detect a threat). This case study shows how testers used those two quantitative methods to evaluate how pilot workload differed across combat conditions.

Method

The test consisted of 22 operationally realistic attack helicopter missions. The goal of these missions was to detect and destroy threats in the environment, with or without the aid of a new threat detection technology. Each mission involved a group of two helicopters, with two pilots in each helicopter. The test was a 2 (Technology: absent, present) × 2 (Threat Density: low, high) D-optimal design, controlling for the time of day that the mission took place (day or night), as prior testing has demonstrated that pilots may find some piloting tasks more difficult at night. The number of missions under each set of test conditions is provided in Table 1.

Table 1. Number of missions conducted under each set of test conditions

               Technology Absent                Technology Present
           Low Density   High Density       Low Density   High Density
Day             3             2                  6             3
Night           1             2                  2             3
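The balance of the design can be checked programmatically. The sketch below (using hypothetical column names and a simulated mission list, not the actual test dataset) tallies the 22 missions by condition and reproduces the counts in Table 1.

```python
import pandas as pd

# One row per mission; the counts per condition mirror Table 1.
missions = pd.DataFrame(
    [(tod, tech, dens)
     for (tod, tech, dens), n in {
         ("Day",   "Absent",  "Low"):  3, ("Day",   "Absent",  "High"): 2,
         ("Day",   "Present", "Low"):  6, ("Day",   "Present", "High"): 3,
         ("Night", "Absent",  "Low"):  1, ("Night", "Absent",  "High"): 2,
         ("Night", "Present", "Low"):  2, ("Night", "Present", "High"): 3,
     }.items() for _ in range(n)],
    columns=["time_of_day", "technology", "threat_density"],
)

# Cross-tabulate missions by time of day, technology, and threat density.
counts = pd.crosstab(missions["time_of_day"],
                     [missions["technology"], missions["threat_density"]])
print(len(missions))
print(counts)
```

A quick crosstab like this is a useful sanity check that the executed missions match the planned D-optimal design before analysis begins.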

During each mission, testers captured data on how quickly the team of pilots detected the first threat in the environment using the helicopter’s targeting software. The time began when the helicopters reached the combat area and ended when the team of pilots “locked on” to the threat. This resulted in a total of 22 observations, one for each mission. Directly following each mission, the pilots completed the NASA-TLX – a short, 6-item survey designed to assess their workload during the mission. The 6 items are combined to create a composite score, ranging from 0 to 100, with higher scores indicating greater workload.
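The composite scoring described above can be illustrated with a short sketch. The ratings below are hypothetical, and this uses the unweighted ("Raw TLX") convention of averaging the six subscales, which matches the 0-to-100 composite described in the text.

```python
import numpy as np

# Hypothetical post-mission ratings on the six NASA-TLX subscales (0-100 each):
# mental demand, physical demand, temporal demand, performance, effort, frustration.
ratings = {"mental": 40, "physical": 10, "temporal": 30,
           "performance": 15, "effort": 25, "frustration": 20}

# The unweighted composite is the mean of the six items, so it
# also ranges from 0 (no workload) to 100 (maximum workload).
composite = np.mean(list(ratings.values()))
print(composite)
```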

Results

The two measures of workload, the NASA-TLX and time to detect first threat, were evaluated separately and compared qualitatively because they were collected at the individual and group levels, respectively. A total of 74 surveys were completed by the 10 pilots; testers were unable to collect survey data on 14 occasions as a result of participant choice, participant confusion, or operational factors. Each pilot completed between 3 and 10 missions during the test, with the majority completing 7 or 8 missions. The number of surveys completed in each condition is provided in Table 2.

Table 2. Number of surveys collected under each set of test conditions

               Technology Absent                Technology Present
           Low Density   High Density       Low Density   High Density
Day            12             8                 20            12
Night           4             4                  4            10

The 6 items from the NASA-TLX demonstrated high levels of internal consistency and, consequently, were averaged to compute a single measure of workload for each pilot. Pilots reported a relatively low level of workload (M = 22.43, SD = 9.05) across test conditions. Table 3 shows that pilots reported the lowest levels of workload when the threat detection technology was absent and the threat density was high, and the highest levels of workload when the threat detection technology was absent and the threat density was low. Workload scores when the threat detection technology was present fell between these extremes under conditions of both high and low threat density.
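Internal consistency of a multi-item scale is commonly summarized with Cronbach's alpha. The sketch below shows one standard way to compute it; the respondent-by-item matrix is hypothetical, and the source does not state which reliability statistic the testers actually used.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) matrix of item scores."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()  # sum of per-item variances
    total_variance = scores.sum(axis=1).var(ddof=1)    # variance of the total score
    return k / (k - 1) * (1.0 - item_variances / total_variance)

# Hypothetical ratings: 5 respondents x 6 TLX items on a 0-100 scale.
scores = np.array([
    [20, 25, 18, 22, 24, 21],
    [40, 45, 38, 41, 44, 39],
    [10, 12,  9, 11, 14, 10],
    [30, 33, 29, 31, 35, 28],
    [50, 55, 48, 52, 54, 49],
], dtype=float)
print(round(cronbach_alpha(scores), 3))
```

A high alpha (conventionally above about 0.7 to 0.8) supports averaging the items into a single composite, as was done here.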

Table 3. Descriptive statistics for the NASA-TLX

                                            Mean      SD
Low Threat Density    Technology Absent    24.88   11.88
                      Technology Present   23.25    7.92
High Threat Density   Technology Absent    17.67    8.00
                      Technology Present   22.36    7.97

A mixed effects model was used to determine whether pilots' workload scores differed statistically across test conditions after controlling for the time of day that the mission took place. A mixed effects model was chosen to account for dependency in the data that occurred because the same pilots provided multiple ratings of their workload throughout the test. Although a repeated measures ANOVA is also designed to handle such dependencies, it cannot accommodate differences in the number of ratings provided by each pilot.

NASA-TLX scores were regressed simultaneously on the fixed effects (presence of the threat detection technology, threat density, and time of day), with pilot entered as a random effect. Together, the fixed and random effects accounted for 47.74 percent of the variance in pilots' workload scores, with the fixed effects accounting for nearly half of that value (marginal R2 = 0.21). The regression coefficients for the fixed effects are presented in Table 4.
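A mixed effects model of this form can be sketched with the statsmodels `mixedlm` API. The data below are simulated stand-ins (the column names, coding, and effect sizes are assumptions, not the actual test data); the point is the model structure: fixed effects for technology, density, their interaction, and time of day, plus a random intercept for pilot to absorb the repeated, unbalanced measurements.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulated stand-in for the survey data: 74 ratings from 10 pilots.
n = 74
df = pd.DataFrame({
    "pilot": rng.integers(0, 10, n).astype(str),  # repeated measures per pilot
    "tech": rng.integers(0, 2, n),                # 0 = absent, 1 = present
    "density": rng.integers(0, 2, n),             # 0 = high, 1 = low (assumed coding)
    "night": rng.integers(0, 2, n),               # 0 = day, 1 = night
})
df["tlx"] = 22 + 5 * df["density"] - 7 * df["night"] + rng.normal(0, 8, n)

# Random intercept for pilot handles the unequal number of ratings per pilot,
# which a repeated measures ANOVA cannot accommodate.
model = smf.mixedlm("tlx ~ tech * density + night", df, groups=df["pilot"])
result = model.fit()
print(result.summary())
```

The `tech * density` term expands to both main effects plus their interaction, matching the four fixed-effect rows reported in Table 4.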

Table 4. NASA-TLX model results

                                         Coefficient     SE    t-value
Time of Day                                -7.56***     1.84    -4.10
Threat Density                              7.50**      2.69     2.79
Technology Presence                         7.89**      2.74     2.88
Threat Density × Technology Presence       -9.56**      3.43    -2.79
***p < .001, **p < .01

Pilots' workload ratings differed by time of day, and were higher under conditions of low threat density than under high threat density. Pilots also rated their workload higher when the new threat detection technology was available in the cockpit. The reason for this finding becomes clearer when we consider the nature of the interaction between threat density and technology presence (see Figure 1).

In particular, pilots reported similar levels of workload under conditions of high and low threat density when the threat detection technology was present. When the threat detection technology was absent, however, pilots reported higher levels of workload when threat density was low than when it was high. These findings suggest that the threat detection technology is helping pilots manage their workload when threat density is low, but is actually contributing to the difficulty of detecting threats when threat density is high. The available qualitative data suggest that elements of the interface may be driving this effect. In particular, when the system detects a potential threat, an icon pops up that the pilot must investigate manually: the pilot hovers the cursor over the icon and selects it to read information about the threat. When threat density was high, icons cluttered the screen, making it more difficult for pilots to perform the detection task with the technology than by simply looking out the window.

The raw threat detection times were normalized (converted to z-scores) to protect sensitive information. This places the data on a scale where the mean detection time is 0 and the standard deviation of the distribution is 1. Negative values represent detection times that were quicker than the mean, whereas positive values represent detection times that were slower than the mean. Consistent with the NASA-TLX data presented above, pilots were slowest detecting a threat under low threat density when the threat detection technology was absent, and were quickest at detecting a threat under high threat density when the threat detection technology was absent. Detection times when the technology was present fell between these two extremes (Table 5).
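The z-score transformation is straightforward; a minimal sketch with hypothetical raw times (the real times are withheld as sensitive) follows.

```python
import numpy as np

# Hypothetical raw detection times; any units work since z-scores are unitless.
raw_times = np.array([41.0, 55.0, 38.0, 62.0, 47.0])

# Standardize: subtract the sample mean, divide by the sample standard deviation.
z = (raw_times - raw_times.mean()) / raw_times.std(ddof=1)
print(z.round(2))
```

Because the transformation is linear, relative comparisons between conditions (who was faster, and by how many standard deviations) are preserved while the sensitive raw scale is hidden.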

Table 5. Descriptive statistics for the threat detection task

                                            Mean      SD
Low Threat Density    Technology Absent     1.51    1.28
                      Technology Present   -0.35    0.70
High Threat Density   Technology Absent    -0.64    0.38
                      Technology Present   -0.11    0.31

A linear regression model was used to determine whether threat detection time differed statistically by condition after controlling for the time of day that the mission took place. A mixed effects model was originally considered to account for the fact that the same pilots completed missions throughout the test; however, the random effect of pilot did not significantly improve model fit (p > .90) and was therefore discarded in favor of a simpler, fixed-effects-only model.
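One common way to make that model comparison is a likelihood-ratio test between the fixed-effects-only model and the mixed model, with both fit by maximum likelihood. The sketch below uses simulated stand-in data (names and grouping are assumptions; the source does not say exactly which comparison the testers ran), and the usual caveat applies that this test is conservative because the variance component sits on the boundary of its parameter space.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(1)

# Simulated stand-in: 22 mission-level detection times (already z-scored).
df = pd.DataFrame({
    "crew": rng.integers(0, 5, 22).astype(str),  # hypothetical grouping factor
    "tech": rng.integers(0, 2, 22),
    "density": rng.integers(0, 2, 22),
    "night": rng.integers(0, 2, 22),
})
df["z_time"] = rng.normal(0, 1, 22)

# Fit both models by maximum likelihood so the log-likelihoods are comparable.
ols = smf.ols("z_time ~ tech * density + night", df).fit()
mixed = smf.mixedlm("z_time ~ tech * density + night", df,
                    groups=df["crew"]).fit(reml=False)

# Likelihood-ratio test for the random intercept (one extra parameter).
lr = 2 * (mixed.llf - ols.llf)
p = stats.chi2.sf(lr, df=1)
print(round(p, 3))
```

A large p-value here (as the testers found, p > .90) indicates the random effect adds nothing, justifying the simpler fixed-effects-only model.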

Threat detection times were regressed simultaneously on the three predictors: presence of the threat detection technology, threat density, and time of day. Together, these predictors accounted for 47.13 percent of the variance in threat detection time, similar to the value reported above for NASA-TLX scores. The threat density by technology presence interaction was the only significant predictor of time to detect a threat; none of the main effects remained significant after accounting for this interaction. The regression coefficients for the threat detection model are presented in Table 6.

Table 6. Threat Detection Task model results

                                         Coefficient     SE    t-value
Time of Day                                -0.20        0.33    -0.61
Threat Density                             -0.30        0.40    -0.74
Technology Presence                        -0.53        0.47    -1.13
Threat Density × Technology Presence        2.39**      0.65     3.70
***p < .001, **p < .01

Mirroring the NASA-TLX findings, time to detect a threat was similar under conditions of high and low threat density when the threat detection technology was present. When the threat detection technology was absent, however, pilots took longer to detect threats when threat density was low than when it was high (Figure 2). Again, these findings suggest that the threat detection technology is helping pilots manage their workload when threat density is low, but is actually contributing to the difficulty of detecting threats when threat density is high.

The fact that we were able to replicate the same pattern of results with both a survey and a human performance measure gives us confidence that these results reflect reality rather than chance or measurement error: the threat detection technology improves workload under some conditions but not others. Using multiple measures also provides a more rigorous, and therefore more defensible, test of these effects than reporting results from either measure of workload alone. Finally, it supplies multiple pieces of evidence that pilots may benefit from altering their tactics, techniques, and procedures when using the threat detection technology under conditions of high threat density.