Once the test is conducted, we must know what to do with the data and how to handle it so that we can extract as much information as possible without spending extra time or resources. The best practices below address analyzing and evaluating the data; they range from managing the data to interpreting the results in reports.
Best Practices: Analyze
Collect and store data in a systematic way
Consider using advanced analysis methods when appropriate
Adjust for covariates
Provide a summary of the analysis approach taken
- For most audiences, a summary of the analysis might be appropriate. Explain in a few words what statistical approach was taken and why. For example: “To assess the accuracy of the new turret, the authors conducted a logistic regression.” If needed, and depending on the audience, explain more technical details (e.g. model selection technique, assumptions, data transformation) in a footnote or appendix.
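To make this concrete, here is a minimal sketch of what such a logistic regression might look like; the data are simulated and the variable names (engagement range, moving-target indicator) are assumptions, not taken from an actual turret test. Including the moving-target indicator also illustrates the earlier practice of adjusting for covariates.

```python
# Hypothetical sketch: logistic regression of hit/miss data on range,
# adjusting for a moving-target covariate. All data below are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
range_km = rng.uniform(0.5, 4.0, n)          # engagement range (assumed factor)
moving = rng.integers(0, 2, n)               # covariate: moving (1) vs. stationary (0) target
p_hit = 1 / (1 + np.exp(-(3.0 - 1.2 * range_km - 0.8 * moving)))
hit = rng.binomial(1, p_hit)

shots = pd.DataFrame({"hit": hit, "range_km": range_km, "moving": moving})

# Probability of hit as a function of range, adjusted for the covariate
model = smf.logit("hit ~ range_km + moving", data=shots).fit()
print(model.summary())
```

The full model summary (coefficients, confidence intervals, fit statistics) would typically go in an appendix, with only the short narrative description in the body of the report.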
Best Practices: Analyze & Evaluate
Be consistent
Best Practices: Evaluate
Remind the audience about the DOE when summarizing the results
Report conditions that might have affected the results
“Aircraft configuration was tactically varied at the discretion of the mission commander. The mission commander chose to use the Reconnaissance configuration three times as often as the Strike configuration, so even though we saw large differences in time to prosecute target for these two configurations, an insufficient number of Strike missions were conducted to determine whether or not the observed differences are statistically significant. The upcoming FOT&E should dictate the aircraft’s configuration to guarantee a minimum number of missions using the Strike configuration, generate mission scenarios in which the Strike configuration is more likely to be chosen, or provide additional test time to ensure sufficient Strike missions are conducted to assess the mission impact of configuration with statistical significance.”
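If the recommendation is to buy additional test time, a power calculation is one way (not prescribed by the example above) to make the minimum number of Strike missions concrete. The sketch below assumes a two-sample t-test on time to prosecute target; the effect size, power, significance level, and 3:1 allocation are illustrative assumptions.

```python
# Hypothetical sketch: missions needed to detect a configuration effect on
# time to prosecute target. All inputs below are assumed, not from the example.
from statsmodels.stats.power import TTestIndPower

n_strike = TTestIndPower().solve_power(
    effect_size=0.8,        # assumed standardized difference between configurations
    alpha=0.05,             # significance level
    power=0.80,             # desired probability of detecting the difference
    ratio=3.0,              # Reconnaissance missions flown per Strike mission
    alternative="two-sided",
)
print(f"Strike missions needed: {n_strike:.0f}")
```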
Include figures and tables
“Figure 5 shows how the probability of detection changes as the distance between the weapon and the Q-53 counterfire radar increases when the system is in the 360-degree operating mode observing single-fire artillery engagements. The data also revealed that radar-weapon range and quadrant elevation (QE) had large effects on Q-53’s ability to detect incoming projectiles.”

Include raw data when appropriate
“This graph compares the model predictions (with confidence intervals) to the raw data across all 3 factors of interest (array type, noise profile, and submarine type). Predictions tend to match the data quite well across all conditions.”
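One way to build such a graph is sketched below; it is simplified to a single hypothetical continuous factor (a noise level) rather than the three factors in the example, and all data are simulated.

```python
# Hypothetical sketch: overlay raw observations on model predictions with
# 95% confidence intervals. Factor, response, and data are all simulated.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(7)
noise_db = np.sort(rng.uniform(40, 90, 60))                  # assumed factor
det_range = 12 - 0.08 * noise_db + rng.normal(0, 0.6, 60)    # assumed response (km)

fit = sm.OLS(det_range, sm.add_constant(noise_db)).fit()

grid = np.linspace(40, 90, 100)
pred = fit.get_prediction(sm.add_constant(grid)).summary_frame(alpha=0.05)

plt.scatter(noise_db, det_range, color="gray", label="raw data")
plt.plot(grid, pred["mean"], label="model prediction")
plt.fill_between(grid, pred["mean_ci_lower"], pred["mean_ci_upper"],
                 alpha=0.3, label="95% confidence interval")
plt.xlabel("Noise level (dB)")
plt.ylabel("Detection range (km)")
plt.legend()
plt.show()
```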

Use the correct terminology
- Binary responses with zero failures (or zero successes). In cases where there are no failures or no successes, a one-sided confidence interval should be reported. For example, “Fourteen of fourteen shots hit the target, so we are 80 percent confident that the probability of hit is at least 89.1 percent”.
- Reliability estimates with zero failures. In this case, constructing a two-sided confidence interval is not possible, and the phrase “we are x percent confident that the threshold was met” should be avoided. Instead, use a phrase such as “We observed 0 failures during 124 hours of testing, so we are 80 percent confident that the MTBF is at least 77 hours”. Both of these zero-failure bounds are worked in the short calculation below.
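The calculation below reproduces the numbers quoted above; it assumes a binomial model for the hit/miss data and an exponential model for the time between failures.

```python
# Lower one-sided confidence bounds when zero failures (or zero misses) are observed.
# Assumes a binomial model for hit/miss data and an exponential failure-time model.
from scipy import stats

conf = 0.80  # confidence level used in both examples

# Binary response: 14 hits in 14 shots.
# Clopper-Pearson lower bound with zero misses: (1 - conf) ** (1 / n)
n = 14
p_lower = (1 - conf) ** (1 / n)
print(f"{conf:.0%} lower bound on probability of hit: {p_lower:.3f}")   # ~0.891

# Reliability: 0 failures in 124 hours of testing.
# Lower bound on MTBF: 2 * T / chi2.ppf(conf, df=2)
T = 124.0
mtbf_lower = 2 * T / stats.chi2.ppf(conf, df=2)
print(f"{conf:.0%} lower bound on MTBF: {mtbf_lower:.1f} hours")        # ~77 hours
```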
Use other information to reinforce the results of a p-value
- When comparing results to a threshold, provide a detailed description of the supporting information. “The estimated National Imagery Interpretability Rating Scale (NIIRS) rating of images for the electro-optical (EO) sensor was 6.8 [95% CI: 6.2, 7.4], which is significantly worse than the key performance parameter (KPP) value of 8 (p-value = 0.0047),” instead of “The estimated NIIRS rating of images for the EO sensor was 6.8, which is significantly worse than the KPP value of 8 (p-value = 0.0047)”.
- In cases where performance is being compared, the following interpretations could be used: “Under the conditions specified in the KPP, missiles equipped with the new seeker will hit the target 78 percent of the time vice 64 percent of the time for the legacy seeker, which is a statistically significant difference (p-value = 0.036),” and “The upgraded wings provided the air vehicle with an average of 1.4 hours more endurance than the non-upgraded system [95% CI: 0.5, 2.3 hours].”
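As a hedged illustration of reporting the estimates and interval alongside the p-value, rather than the p-value alone, the sketch below compares two hit proportions using made-up counts, not the seeker data quoted above.

```python
# Hypothetical sketch: report estimates, a confidence interval, and the p-value
# when comparing two hit proportions. The counts below are made up.
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

hits = np.array([45, 36])     # new system, legacy system (assumed counts)
shots = np.array([60, 60])

p_new, p_old = hits / shots
diff = p_new - p_old

# Wald 95% confidence interval for the difference in hit probabilities
se = np.sqrt(p_new * (1 - p_new) / shots[0] + p_old * (1 - p_old) / shots[1])
z = stats.norm.ppf(0.975)
ci = (diff - z * se, diff + z * se)

# Two-sample z-test for equal proportions
stat, p_value = proportions_ztest(hits, shots)

print(f"Estimated hit probabilities: new = {p_new:.2f}, legacy = {p_old:.2f}")
print(f"Difference = {diff:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), p-value = {p_value:.3f}")
```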
Consider both practical and statistical significance
- If a result is statistically significant, but the difference is not important in practice, then the following can be used: “On average, the upgraded payload was able to detect tactical vehicle-sized targets at 4,153m while the legacy could detect the same targets at 3,978m. The difference in detection range was statistically significant, but none of the participants in the test thought the 175m delta would have a substantial effect on their ability to execute their mission.”
- In cases where the results were not statistically significant but they might affect operations, this can be used: “On average, the upgraded payload was able to detect tactical vehicle-sized targets at 4,153m while the legacy could detect the same targets at 3,078m. While this difference is not statistically significant, participants in the test thought a 1000m difference in detection range could improve their ability to execute their mission. To verify this improvement, the IOT&E should include a higher proportion of convoy overwatch missions.”
Above all, remember that you are working in a team, and communication among its experts is essential to making the most of the team’s effort. For example, the test might not go as planned; in that case, communication among operational testers, users, and analysis experts is key to interpreting the results properly and to understanding what information is lost or affected by changes in test execution.