Once the test is conducted, we must know what to do with the data and how to handle it so that we can extract as much information as possible without spending extra time or resources. The best practices below address analyzing and evaluating the data; they range from managing the data to interpreting the results in reports.
Best Practices: Analyze
Collect and store data in a systematic way
Consider using advanced analysis methods when appropriate
Adjust for covariates
Provide a summary of the analysis approach taken
- For most audiences, a summary of the analysis might be appropriate. Explain in a few words what statistical approach was taken and why. For example: “To assess the accuracy of the new turret, the authors conducted a logistic regression.” If needed, and depending on the audience, explain more technical details (e.g. model selection technique, assumptions, data transformation) in a footnote or appendix.
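To make this concrete, here is a minimal sketch of what such a logistic regression might look like; the data are simulated and the variable names (engagement range, moving-target indicator) are assumptions, not taken from an actual turret test. Including the moving-target indicator also illustrates the earlier practice of adjusting for covariates.

```python
# Hypothetical sketch: logistic regression of hit/miss data on range,
# adjusting for a moving-target covariate. All data below are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
range_km = rng.uniform(0.5, 4.0, n)          # engagement range (assumed factor)
moving = rng.integers(0, 2, n)               # covariate: moving (1) vs. stationary (0) target
p_hit = 1 / (1 + np.exp(-(3.0 - 1.2 * range_km - 0.8 * moving)))
hit = rng.binomial(1, p_hit)

shots = pd.DataFrame({"hit": hit, "range_km": range_km, "moving": moving})

# Probability of hit as a function of range, adjusted for the covariate
model = smf.logit("hit ~ range_km + moving", data=shots).fit()
print(model.summary())
```

The full model summary (coefficients, confidence intervals, fit statistics) would typically go in an appendix, with only the short narrative description in the body of the report.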
Best Practices: Analyze & Evaluate
Be consistent
Best Practices: Evaluate
Remind the audience about the DOE when summarizing the results
Report conditions that might have affected the results
“Aircraft configuration was tactically varied at the discretion of the mission commander. The mission commander chose to use the Reconnaissance configuration three times as often as the Strike configuration, so even though we saw large differences in time to prosecute target for these two configurations, an insufficient number of Strike missions were conducted to determine whether or not the observed differences are statistically significant. The upcoming FOT&E should dictate the aircraft’s configuration to guarantee a minimum number of missions using the Strike configuration, generate mission scenarios in which the Strike configuration is more likely to be chosen, or provide additional test time to ensure sufficient Strike missions are conducted to assess the mission impact of configuration with statistical significance.”
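If the recommendation is to buy additional test time, a power calculation is one way (not prescribed by the example above) to make the minimum number of Strike missions concrete. The sketch below assumes a two-sample t-test on time to prosecute target; the effect size, power, significance level, and 3:1 allocation are illustrative assumptions.

```python
# Hypothetical sketch: missions needed to detect a configuration effect on
# time to prosecute target. All inputs below are assumed, not from the example.
from statsmodels.stats.power import TTestIndPower

n_strike = TTestIndPower().solve_power(
    effect_size=0.8,        # assumed standardized difference between configurations
    alpha=0.05,             # significance level
    power=0.80,             # desired probability of detecting the difference
    ratio=3.0,              # Reconnaissance missions flown per Strike mission
    alternative="two-sided",
)
print(f"Strike missions needed: {n_strike:.0f}")
```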
Include figures and tables
“Figure 5 shows how the probability of detection changes as the distance between the weapon and the Q-53 counterfire radar increases when the system is in the 360-degree operating mode observing single-fire artillery engagements. The data also revealed that radar-weapon range and quadrant elevation (QE) had large effects on Q-53’s ability to detect incoming projectiles.”

Include raw data when appropriate
“This graph compares the model predictions (with confidence intervals) to the raw data across all 3 factors of interest (array type, noise profile, and submarine type). Predictions tend to match the data quite well across all conditions.”
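One way to build such a graph is sketched below; it is simplified to a single hypothetical continuous factor (a noise level) rather than the three factors in the example, and all data are simulated.

```python
# Hypothetical sketch: overlay raw observations on model predictions with
# 95% confidence intervals. Factor, response, and data are all simulated.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(7)
noise_db = np.sort(rng.uniform(40, 90, 60))                  # assumed factor
det_range = 12 - 0.08 * noise_db + rng.normal(0, 0.6, 60)    # assumed response (km)

fit = sm.OLS(det_range, sm.add_constant(noise_db)).fit()

grid = np.linspace(40, 90, 100)
pred = fit.get_prediction(sm.add_constant(grid)).summary_frame(alpha=0.05)

plt.scatter(noise_db, det_range, color="gray", label="raw data")
plt.plot(grid, pred["mean"], label="model prediction")
plt.fill_between(grid, pred["mean_ci_lower"], pred["mean_ci_upper"],
                 alpha=0.3, label="95% confidence interval")
plt.xlabel("Noise level (dB)")
plt.ylabel("Detection range (km)")
plt.legend()
plt.show()
```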

Use the correct terminology
- Binary responses with zero failures (or zero successes). In cases where there are no failures or no successes, a one-sided confidence interval should be reported. For example, “Fourteen of fourteen shots hit the target, so we are 80 percent confident that the probability of hit is at least 89.1 percent”.
- Reliability estimates with zero failures. In this case, constructing a two-sided confidence interval is not possible, and the phrase “we are x percent confident that the threshold was met” should be avoided. Instead, use a phrase such as “We observed 0 failures during 124 hours of testing, so we are 80 percent confident that the MTBF is at least 77 hours”. Both of these zero-failure bounds are worked in the short calculation below.
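The calculation below reproduces the numbers quoted above; it assumes a binomial model for the hit/miss data and an exponential model for the time between failures.

```python
# Lower one-sided confidence bounds when zero failures (or zero misses) are observed.
# Assumes a binomial model for hit/miss data and an exponential failure-time model.
from scipy import stats

conf = 0.80  # confidence level used in both examples

# Binary response: 14 hits in 14 shots.
# Clopper-Pearson lower bound with zero misses: (1 - conf) ** (1 / n)
n = 14
p_lower = (1 - conf) ** (1 / n)
print(f"{conf:.0%} lower bound on probability of hit: {p_lower:.3f}")   # ~0.891

# Reliability: 0 failures in 124 hours of testing.
# Lower bound on MTBF: 2 * T / chi2.ppf(conf, df=2)
T = 124.0
mtbf_lower = 2 * T / stats.chi2.ppf(conf, df=2)
print(f"{conf:.0%} lower bound on MTBF: {mtbf_lower:.1f} hours")        # ~77 hours
```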
Use other information to reinforce the results of a p-value
- When comparing results to a threshold, provide a detailed description of the supporting information. “The estimated National Imagery Interpretability Rating Scale (NIIRS) rating of images for the electro-optical (EO) sensor was 6.8 [95% CI: 6.2, 7.4], which is significantly worse than the key performance parameter (KPP) value of 8 (p-value = 0.0047),” instead of “The estimated NIIRS rating of images for the EO sensor was 6.8, which is significantly worse than the KPP value of 8 (p-value = 0.0047)”.
- In cases where performance is being compared, the following interpretations could be used: “Under the conditions specified in the KPP, missiles equipped with the new seeker will hit the target 78 percent of the time vice 64 percent of the time for the legacy seeker, which is a statistically significant difference (p-value = 0.036),” and “The upgraded wings provided the air vehicle with an average of 1.4 hours more endurance than the non-upgraded system [95% CI: 0.5, 2.3 hours].”
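As a hedged illustration of reporting the estimates and interval alongside the p-value, rather than the p-value alone, the sketch below compares two hit proportions using made-up counts, not the seeker data quoted above.

```python
# Hypothetical sketch: report estimates, a confidence interval, and the p-value
# when comparing two hit proportions. The counts below are made up.
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

hits = np.array([45, 36])     # new system, legacy system (assumed counts)
shots = np.array([60, 60])

p_new, p_old = hits / shots
diff = p_new - p_old

# Wald 95% confidence interval for the difference in hit probabilities
se = np.sqrt(p_new * (1 - p_new) / shots[0] + p_old * (1 - p_old) / shots[1])
z = stats.norm.ppf(0.975)
ci = (diff - z * se, diff + z * se)

# Two-sample z-test for equal proportions
stat, p_value = proportions_ztest(hits, shots)

print(f"Estimated hit probabilities: new = {p_new:.2f}, legacy = {p_old:.2f}")
print(f"Difference = {diff:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), p-value = {p_value:.3f}")
```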
Consider both practical and statistical significance
- If a result is statistically significant, but the difference is not important in practice, then the following can be used: “On average, the upgraded payload was able to detect tactical vehicle-sized targets at 4,153m while the legacy could detect the same targets at 3,978m. The difference in detection range was statistically significant, but none of the participants in the test thought the 175m delta would have a substantial effect on their ability to execute their mission.”
- In cases where the results were not statistically significant but they might affect operations, this can be used: “On average, the upgraded payload was able to detect tactical vehicle-sized targets at 4,153m while the legacy could detect the same targets at 3,078m. While this difference is not statistically significant, participants in the test thought a 1000m difference in detection range could improve their ability to execute their mission. To verify this improvement, the IOT&E should include a higher proportion of convoy overwatch missions.”
Above all, remember that you are working in a team, and communication among its experts is essential to making the most of the team’s effort. For example, the test might not go as planned; in that case, communication among operational testers, users, and analysis experts is key to interpreting the results properly and to understanding what information is lost or affected by changes in test execution.