Title | Abstract | Authors | Type | Tags

A Comparison of Ballistic Resistance Testing Techniques in the Department of Defense

This paper summarizes sensitivity test methods commonly employed in the Department of Defense. A comparison study shows that modern methods such as Neyer's method and Three-Phase Optimal Design are improvements over historical methods.

Thomas Johnson, Laura J. Freeman, Janice Hester, Jonathan Bell (Research Paper)

A First Step into the Bootstrap World

Bootstrapping is a powerful nonparametric tool for conducting statistical inference with many applications to data from operational testing. Bootstrapping is most useful when the population sampled from is unknown or complex or the sampling distribution of the desired statistic is difficult to derive. Careful use of bootstrapping can help address many challenges in analyzing operational test data.
Matthew Avery (Technical Briefing)
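
As a concrete illustration of the resampling idea summarized above, the following minimal R sketch computes a percentile bootstrap confidence interval for a median; the data, sample size, and statistic are notional and are not taken from the briefing.

    # Percentile bootstrap sketch (notional time-to-detect data, not from the briefing)
    set.seed(2718)
    detect_times <- rexp(30, rate = 1 / 45)     # notional detection times, in seconds

    boot_medians <- replicate(5000, {
      resample <- sample(detect_times, replace = TRUE)  # resample with replacement
      median(resample)                                  # recompute the statistic
    })

    # Approximate 95% percentile interval for the median time to detect
    quantile(boot_medians, probs = c(0.025, 0.975))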

A Multi-Method Approach to Evaluating Human-System Interactions during Operational Testing

The purpose of this paper was to identify the shortcomings of a single-method approach to evaluating human-system interactions during operational testing and offer an alternative, multi-method approach that is more defensible, yields richer insights into how operators interact with weapon systems, and provides practical implications for identifying when the quality of human-system interactions warrants correction through either operator training or redesign.
Dean Thomas, Heather Wojton, Chad Bieber, Daniel Porter (Research Paper)

A Review of Sequential Analysis

Sequential analysis concerns statistical evaluation in situations in which the number, pattern, or composition of the data is not determined at the start of the investigation, but instead depends upon the information acquired throughout the course of the investigation. Expanding the use of sequential analysis has the potential to substantially reduce test costs and test time (National Research Council, 1998). This paper summarizes the literature on sequential analysis and offers fundamental information for providing recommendations for its use in DoD test and evaluation.
Rebecca Medlin, John Dennis, Keyla Pagán-Rivera, Leonard Wilkins, Heather Wojton (Research Paper)
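
For readers new to the topic, the canonical sequential procedure is Wald's sequential probability ratio test (SPRT); its standard decision rule is sketched below for orientation only and is not a result drawn from the paper.

    % Wald's SPRT for H0: theta = theta_0 versus H1: theta = theta_1.
    % Sampling continues while the likelihood ratio stays between the boundaries.
    \[
      \Lambda_n = \prod_{i=1}^{n} \frac{f(x_i \mid \theta_1)}{f(x_i \mid \theta_0)},
      \qquad
      \frac{\beta}{1-\alpha} \;<\; \Lambda_n \;<\; \frac{1-\beta}{\alpha} .
    \]
    % Stop and accept H0 at the lower boundary, stop and reject H0 at the upper
    % boundary, and otherwise take another observation.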

A team-centric metric framework for testing and evaluation of human-machine teams

We propose and present a parallelized metric framework for evaluating human-machine teams that draws upon current knowledge of human-systems interfacing and integration but is rooted in team-centric concepts. Humans and machines working together as a team involves interactions that will only increase in complexity as machines become more intelligent, capable teammates. Assessing such teams will require explicit focus on not just the human-machine interfacing but the full spectrum of interactions between and among agents. As opposed to focusing on isolated qualities, capabilities, and performance contributions of individual team members, the proposed framework emphasizes the collective team as the fundamental unit of analysis and the interactions of the team as the key evaluation targets, with individual human and machine metrics still vital but secondary. With teammate interaction as the organizing diagnostic concept, the resulting framework arrives at a parallel assessment of the humans and machines, analyzing their individual capabilities less with respect to purely human or machine qualities and more through the prism of contributions to the team as a whole. This treatment reflects the increased machine capabilities and will allow for continued relevance as machines develop to exercise more authority and responsibility. This framework allows for identification of features specific to human-machine teaming that influence team performance and efficiency, and it provides a basis for operationalizing in specific scenarios. Potential applications of this research include test and evaluation of complex systems that rely on human-system interaction, including—though not limited to—autonomous vehicles, command and control systems, and pilot control systems.
Jay Wilkins, David A. Sparrow, Caitlan A. Fealing, Brian D. Vickers, Kristina A. Ferguson, Heather Wojton (Research Paper)

AI + Autonomy T&E in DoD

Test and evaluation (T&E) of AI-enabled systems (AIES) often emphasizes algorithm accuracy over robust, holistic system performance. While this narrow focus may be adequate for some applications of AI, for many complex uses, T&E paradigms removed from operational realism are insufficient. However, leveraging traditional operational testing (OT) methods to evaluate AIESs can fail to capture novel sources of risk. This brief establishes a common AI vocabulary and highlights OT challenges posed by AIESs by answering the following questions:
1. What is “Artificial Intelligence (AI)”? A brief “AI Primer” defines some common terms, highlights words that are used inconsistently, and discusses where definitions are insufficient for identifying systems that require additional T&E considerations.
2. How does AI impact T&E? AI isn’t new, but systems with AI pose new challenges and may require structural changes to how we T&E.
3. What makes DoD applications of AI unique? Many Silicon Valley applications of AI lack the task complexity and severe consequences of risk faced by DoD.
4. What is the warfighter’s role? T&E must assure warfighters have calibrated trust and an adequate understanding of system behavior.
5. What is the state of DoD AI T&E in IDA and OED?
Brian Vickers (Technical Briefing)

An Expository Paper on Optimal Design

There are many situations where the requirements of a standard experimental design do not fit the research requirements of the problem. Three such situations occur when the problem requires unusual resource restrictions, when there are constraints on the design region, and when a non-standard model is expected to be required to adequately explain the response.
Douglas C. Montgomery, Bradley A. Jones, Rachel T. Johnson (Research Paper)

An Uncertainty Analysis Case Study of Live Fire Modeling and Simulation

This paper emphasizes the use of fundamental statistical techniques – design of experiments, statistical modeling, and propagation of uncertainty – in the context of a combat scenario that depicts a ground vehicle being engaged by indirect artillery.
Mark Couch, Thomas Johnson, John Haman, Heather Wojton, Benjamin Turner, David Higdon (Other)

Artificial Intelligence & Autonomy Test & Evaluation Roadmap Goals

As the Department of Defense acquires new systems with artificial intelligence (AI) and autonomous (AI&A) capabilities, the test and evaluation (T&E) community will need to adapt to the challenges that these novel technologies present. The goals listed in this AI Roadmap address the broad range of tasks that the T&E community will need to achieve in order to properly test, evaluate, verify, and validate AI-enabled and autonomous systems. It includes issues that are unique to AI and autonomous systems, as well as legacy T&E shortcomings that will be compounded by newer technologies.
Brian Vickers, Daniel Porter, Rachel Haga, Heather Wojton (Technical Briefing)

Bayesian Reliability: Combining Information

One of the most powerful features of Bayesian analyses is the ability to combine multiple sources of information in a principled way to perform inference. This feature can be particularly valuable in assessing the reliability of systems where testing is limited. At their most basic, Bayesian methods for reliability develop informative prior distributions using expert judgment or similar systems. Appropriate models allow the incorporation of many other sources of information, including historical data, information from similar systems, and computer models. We introduce the Bayesian approach to reliability using several examples and point to open problems and areas for future work.
Alyson Wilson, Kassandra Fronczyk (Research Paper)
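
To make "combining information" concrete, here is a minimal R sketch of the simplest case, an expert-informed beta prior updated with pass/fail test data; all numbers are notional and do not come from the paper.

    # Beta-binomial reliability sketch (notional prior and test results)
    a_prior <- 8;  b_prior <- 2           # prior belief centered near 0.8 reliability
    successes <- 18; trials <- 20         # notional pass/fail test outcomes

    a_post <- a_prior + successes
    b_post <- b_prior + (trials - successes)

    post_mean <- a_post / (a_post + b_post)            # posterior mean reliability
    cred_int  <- qbeta(c(0.05, 0.95), a_post, b_post)  # 90% credible interval
    c(mean = post_mean, lower = cred_int[1], upper = cred_int[2])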

Censored Data Analysis Methods for Performance Data: A Tutorial

Binomial metrics like probability-to-detect or probability-to-hit typically do not provide the maximum information from testing. Continuous metrics such as time to detect provide more information, but do not account for non-detects. Censored data analysis allows us to account for both pieces of information simultaneously.
V. Bram Lillard (Technical Briefing)
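
A minimal sketch of the kind of analysis the briefing advocates, assuming right-censored time-to-detect data and R's survival package; the data below are simulated purely for illustration.

    # Censored time-to-detect sketch (simulated data, survival package)
    library(survival)

    set.seed(42)
    true_time <- rexp(40, rate = 1 / 60)          # notional true detection times (s)
    cutoff    <- 90                               # each trial ends at 90 s
    observed  <- pmin(true_time, cutoff)          # recorded time
    detected  <- as.numeric(true_time <= cutoff)  # 1 = detected, 0 = non-detect (censored)

    # A parametric survival model uses detections and non-detects together
    fit <- survreg(Surv(observed, detected) ~ 1, dist = "lognormal")
    summary(fit)
    exp(coef(fit))   # estimated median time to detect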

Challenges and new methods for designing reliability experiments

Engineers use reliability experiments to determine the factors that drive product reliability, build robust products, and predict reliability under use conditions. This article uses recent testing of a Howitzer to illustrate the challenges in designing reliability experiments for complex, repairable systems. We leverage lessons learned from current research and propose methods for designing an experiment for a complex, repairable system.
Laura Freeman, Thomas Johnson, Rebecca Medlin (Research Paper)

Characterizing Human-Machine Teaming Metrics for Test & Evaluation

This briefing defines human-machine teaming, describes new challenges in evaluating HMTs, and provides a framework for the categories of metrics that are important for the T&E of HMTs.

Heather Wojton, Brian Vickers, Kristina Carter, David Sparrow, Leonard Wilkins, Caitlan Fealing (Technical Briefing)

Choice of second-order response surface designs for logistic and Poisson regression models

This paper illustrates the construction of D-optimal second order designs for situations when the response is either binomial (pass/fail) or Poisson (count data).

Rachel T. Johnson, Douglas C. Montgomery (Research Paper)
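
For context, the generic D-optimality criterion for generalized linear models, stated here in its textbook form rather than as the paper's specific formulation, is:

    \[
      \xi^{*} = \arg\max_{\xi}\, \bigl| X(\xi)^{\top} W X(\xi) \bigr| ,
      \qquad
      W = \operatorname{diag}(w_1, \ldots, w_n),
    \]
    where $w_i = \pi_i (1 - \pi_i)$ for a binomial (logistic) response and
    $w_i = \lambda_i = \exp(x_i^{\top}\beta)$ for a Poisson response; because the
    weights depend on unknown parameters, such designs are only locally or
    Bayesian-optimal.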

Circular prediction regions for miss distance models under heteroskedasticity

Circular prediction regions are used in ballistic testing to express the uncertainty in shot accuracy. We compare two modeling approaches for estimating circular prediction regions for the miss distance of a ballistic projectile. The miss distance response variable is bivariate normal and has a mean and variance that can change with one or more experimental factors. The first approach fits a heteroskedastic linear model using restricted maximum likelihood, and uses the Kenward-Roger statistic to estimate circular prediction regions. The second approach fits an analogous Bayesian model with unrestricted likelihood modifications, and computes circular prediction regions by sampling from the posterior predictive distribution. The two approaches are applied to an example problem, and are compared using simulation.
Thomas H. Johnson, John T. Haman, Heather Wojton, Laura Freeman (Research Paper)

Comparing Computer Experiments for the Gaussian Process Model Using Integrated Prediction Variance

Space filling designs are a common choice of experimental design strategy for computer experiments. This paper compares space filling design types based on their theoretical prediction variance properties with respect to the Gaussian Process model. https://www.tandfonline.com/doi/abs/10.1080/08982112.2012.758284
Rachel T. Johnson, Douglas C. Montgomery, Bradley Jones, Chris Gotwalt (Research Paper)
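
For reference, the integrated prediction variance criterion that underlies such comparisons is, in its usual form (not necessarily the paper's exact notation):

    \[
      \mathrm{IPV}(\xi) = \frac{1}{\mathrm{Vol}(R)} \int_{R}
      \operatorname{Var}\!\left[\hat{y}(\mathbf{x}) \mid \xi\right] \, d\mathbf{x},
    \]
    where $R$ is the design region and $\operatorname{Var}[\hat{y}(\mathbf{x}) \mid \xi]$
    is the Gaussian process prediction variance at input $\mathbf{x}$ under design $\xi$.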

Comparing Normal and Binary D-Optimal Designs by Statistical Power

In many Department of Defense test and evaluation applications, binary response variables are unavoidable. Many have considered D-optimal design of experiments for generalized linear models. However, little consideration has been given to assessing how these new designs perform in terms of statistical power for a given hypothesis test. Monte Carlo simulations and exact power calculations suggest that normal D-optimal designs generally yield higher power than binary D-optimal designs, despite using logistic regression in the analysis after data have been collected. Results from using statistical power to compare designs contradict standard design of experiments comparisons, which employ D-efficiency ratios and fractional design space plots. Power calculations suggest that practitioners who are primarily interested in the resulting statistical power of a design should use normal D-optimal designs over binary D-optimal designs when logistic regression is to be used in the data analysis after data collection.
Addison D Adams (Technical Briefing)
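
A minimal sketch of the Monte Carlo power calculation described above, assuming a single two-level factor, a notional effect size on the logit scale, and logistic regression as the analysis model; none of these specifics come from the briefing.

    # Monte Carlo power sketch: logistic regression analysis of a two-level factor
    # (run size, effect size, and alpha are notional)
    set.seed(1)
    n_runs <- 60
    x      <- rep(c(-1, 1), each = n_runs / 2)   # two-level factor, balanced design
    b0 <- 0; b1 <- 0.7                           # assumed true logit-scale model
    alpha  <- 0.05

    one_trial <- function() {
      p   <- plogis(b0 + b1 * x)                 # true hit probabilities
      y   <- rbinom(n_runs, 1, p)                # simulated pass/fail responses
      fit <- glm(y ~ x, family = binomial)
      summary(fit)$coefficients["x", "Pr(>|z|)"] < alpha   # reject H0: b1 = 0?
    }

    mean(replicate(2000, one_trial()))           # estimated power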

Data Principles for Operational and Live-Fire Testing

Many DoD systems undergo operational testing, which is a field test involving realistic combat conditions. Data, analysis, and reporting are the fundamental outcomes of operational test, which support leadership decisions. The importance of data standardization and interoperability is widely recognized by leadership in the DoD; however, there are no generally recognized standards for the management and handling of data (format, pedigree, architecture, transferability, etc.) in the DoD. In this presentation, I will review a set of data principles that we believe the DoD should adopt to improve how it manages test data. I will explain the current state of data management, each of the data principles, why the DoD should adopt them, and some of the difficulties of improving data handling.
John Haman (Technical Briefing)

Demystifying the Black Box: A Test Strategy for Autonomy

The purpose of this briefing is to provide a high-level overview of how to frame the question of testing autonomous systems in a way that will enable development of successful test strategies. The brief outlines the challenges and broad-stroke reforms needed to get ready for the test challenges of the next century.
Heather Wojton, Daniel Porter (Technical Briefing)

Designed Experiments for the Defense Community

This paper presents the underlying tenets of design of experiments, as applied in the Department of Defense, focusing on factorial, fractional factorial, and response surface designs and analyses. The concepts of statistical modeling and sequential experimentation are also emphasized.
Rachel T. Johnson, Douglas C. Montgomery, James R. Simpson (Research Paper)

Designing Experiments for Model Validation

Advances in computational power have allowed both greater fidelity and more extensive use of such models. Numerous complex military systems have corresponding models that simulate their performance in the field. In response, the DoD needs defensible practices for validating these models. DOE and statistical analysis techniques are the foundational building blocks for validating the use of computer models and quantifying uncertainty in that validation. Recent developments in uncertainty quantification have the potential to benefit the DoD in using modeling and simulation to inform operational evaluations.
Heather Wojton, Kelly Avery, Laura Freeman, Thomas Johnson (Other)

Designing experiments for nonlinear models—an introduction

This paper illustrates the construction of Bayesian D-optimal designs for nonlinear models and compares the relative efficiency of standard designs to these designs for several models and prior distributions on the parameters.

Rachel T. Johnson, Douglas C. Montgomery (Research Paper)

Determining How Much Testing is Enough: An Exploration of Progress in the Department of Defense Test and Evaluation Community

This paper describes holistic progress in answering the question of “How much testing is enough?” It covers areas in which the T&E community has made progress, areas in which progress remains elusive, and issues that have emerged since 1994 that provide additional challenges. The selected case studies used to highlight progress are especially interesting examples, rather than a comprehensive look at all programs since 1994.
Rebecca Medlin, Matthew Avery, James Simpson, Heather Wojton (Research Paper)

Developing AI Trust: From Theory to Testing and the Myths in Between

This introductory work aims to provide members of the Test and Evaluation community with a clear understanding of trust and trustworthiness to support responsible and effective evaluation of AI systems. The paper provides a set of working definitions and works toward dispelling confusion and myths surrounding trust.
Yosef Razin, Kristen Alexander, John Haman (Research Paper)

Trust; AI

Development of Wald-Type and Score-Type Statistical Tests to Compare Live Test Data and Simulation Predictions

This work describes the development of a statistical test created in support of ongoing verification, validation, and accreditation (VV&A) efforts for modeling and simulation (M&S) environments. The test computes a Wald-type statistic comparing two generalized linear models estimated from live test data and analogous simulated data. The resulting statistic indicates whether the M&S outputs differ from the live data. After developing the test, we applied it to two logistic regression models estimated from live torpedo test data and simulated data from the Naval Undersea Warfare Center’s Environment Centric Weapons Analysis Facility (ECWAF). We developed this test to handle a specific problem with our data: one weapon variant was seen in the in-water test data, but the ECWAF data had two weapon variants. We overcame this deficiency by adjusting the Wald statistic via combining linear model coefficients with the intercept term when a factor is varied in one sample but not another. A similar approach could be applied with score-type tests, which we also describe.
Carrington A. Metts, Curtis Miller
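
For orientation, a generic Wald-type statistic for comparing the coefficient vectors of two independently estimated generalized linear models has the form below; the paper's statistic additionally applies the intercept adjustment described above for factors present in only one sample.

    \[
      W = \left(\hat{\boldsymbol{\beta}}_{\mathrm{live}} - \hat{\boldsymbol{\beta}}_{\mathrm{sim}}\right)^{\!\top}
          \left(\hat{\Sigma}_{\mathrm{live}} + \hat{\Sigma}_{\mathrm{sim}}\right)^{-1}
          \left(\hat{\boldsymbol{\beta}}_{\mathrm{live}} - \hat{\boldsymbol{\beta}}_{\mathrm{sim}}\right),
    \]
    which is referred to a $\chi^2$ distribution with degrees of freedom equal to the
    number of coefficients being compared.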

Examining Improved Experimental Designs for Wind Tunnel Testing Using Monte Carlo Sampling Methods

In this paper we compare data from a fairly large legacy wind tunnel test campaign to smaller, statistically-motivated experimental design strategies. The comparison, using Monte Carlo sampling methodology, suggests a tremendous opportunity to reduce wind tunnel test efforts without losing test information.
Raymond R. Hill, Derek A. Leggio, Shay R. Capehart, August G. Roesener (Research Paper)

Handbook on Statistical Design & Analysis Techniques for Modeling & Simulation Validation

This handbook focuses on methods for data-driven validation to supplement the vast existing literature for Verification, Validation, and Accreditation (VV&A) and the emerging references on uncertainty quantification (UQ). The goal of this handbook is to aid the test and evaluation (T&E) community in developing test strategies that support model validation (both external validation and parametric analysis) and statistical UQ.
Heather Wojton, Kelly Avery, Laura J. Freeman, Samuel Parry, Gregory Whittier, Thomas Johnson, Andrew Flack (Handbook)

handbook, statistics

Hybrid Designs: Space Filling and Optimal Experimental Designs for Use in Studying Computer Simulation Models

This tutorial provides an overview of experimental design for modeling and simulation. Pros and cons of each design methodology are discussed.

Rachel Johnson Silvestrini (Technical Briefing)

Implementing Fast Flexible Space-Filling Designs in R

Modeling and simulation (M&S) can be a useful tool when testers and evaluators need to augment the data collected during a test event. When planning M&S, testers use experimental design techniques to determine how much and which types of data to collect, and they can use space-filling designs to spread out test points across the operational space. Fast flexible space-filling designs (FFSFDs) are a type of space-filling design useful for M&S because they work well in design spaces with disallowed combinations and permit the inclusion of categorical factors. IDA analysts developed a function to create FFSFDs using the free statistical software R. To our knowledge, there are no R packages for creating an FFSFD that can accommodate a variety of user inputs, such as categorical factors. Moreover, users of IDA’s function can share their code to make their work reproducible.
Christopher Dimapasok (Technical Briefing)
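
IDA's function is not reproduced here, but the general fast flexible filling recipe (sample many random candidate points in the constrained region, then cluster them and take the cluster centers as design points) can be sketched in a few lines of R; the factor names, constraint, and design size below are invented for illustration.

    # Rough sketch of the fast flexible filling idea (not IDA's function)
    set.seed(123)
    cand <- data.frame(speed    = runif(20000, 0, 1),
                       altitude = runif(20000, 0, 1))
    cand <- cand[cand$speed + cand$altitude <= 1.5, ]  # notional disallowed region

    km     <- kmeans(cand, centers = 20, nstart = 25)  # cluster the candidates
    design <- as.data.frame(km$centers)                # 20-point space-filling design
    head(design)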

Improving Operational Test Efficiency: Sequential Methods in Operational Testing

The Department of Defense develops and acquires some of the world's most advanced, sophisticated, and expensive systems. As new technologies emerge and are incorporated into systems, Director, Operational Test and Evaluation faces the challenge of ensuring that these systems undergo adequate and efficient test and evaluation (T&E) prior to operational use. In this talk, I will provide a survey of two projects highlighting ways in which relatively well-known statistical methods can help the T&E community in taking steps toward increasing test efficiency. The first case study will demonstrate the value in applying a sequential test planning approach to operational effectiveness testing. The second case study will demonstrate the value in applying Bayesian assurance methods to planning operational reliability testing.
Keyla Pagan-Rivera (Technical Briefing)

Improving Reliability Estimates with Bayesian Statistics

This paper shows how Bayesian methods are ideal for the assessment of complex system reliability. Several examples illustrate the methodology.

Kassandra Fronczyk, Laura J. Freeman (Research Paper)

Improving Test Efficiency: A Bayesian Assurance Case Study

To improve test planning for evaluating system reliability, we propose the use of Bayesian methods to incorporate supplementary data and reduce testing duration. Furthermore, we recommend Bayesian methods be employed in the analysis phase to better quantify uncertainty. We find that using Bayesian methods for test planning allows us to scope smaller tests, and that using Bayesian methods in the analysis yields a more precise estimate of reliability – improving uncertainty quantification.
Rebecca M Medlin (Technical Briefing)

Informing the Warfighter—Why Statistical Methods Matter in Defense Testing

https://chance.amstat.org/2018/04/informing-the-warfighter/

Laura J. Freeman and Catherine Warner (Research Paper)

Initial Validation of the Trust of Automated Systems Test (TOAST)

Trust is a key determinant of whether people rely on automated systems in the military and the public. However, there is currently no standard for measuring trust in automated systems. In the present studies we propose a scale to measure trust in automated systems that is grounded in current research and theory on trust formation, which we refer to as the Trust in Automated Systems Test (TOAST). We evaluated both the reliability of the scale structure and criterion validity using independent, military-affiliated and civilian samples. In both studies we found that the TOAST exhibited a two-factor structure, measuring system understanding and performance (respectively), and that factor scores significantly predicted scores on theoretically related constructs demonstrating clear criterion validity. We discuss the implications of our findings for advancing the empirical literature and in improving interface design.
Heather Wojton, Daniel Porter, Stephanie Lane, Chad Bieber, Poornima Madhavan (Research Paper)

Introduction to ciTools

ciTools is an R package for working with model uncertainty. It gives users access to confidence and prediction intervals for the fitted values of (log-) linear models, generalized linear models, and (log-) linear mixed models. Additionally, ciTools provides functions to determine probabilities and quantiles of the conditional response distribution given each of these models. This briefing introduces the package and provides simple illustrations for using ciTools to perform inference and plot results.
John Haman, Matthew Avery, Laura Freeman (Technical Briefing)
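
A minimal usage sketch: add_ci() and add_pi() are functions exported by ciTools, while the model and data below are invented for illustration.

    # ciTools sketch: append interval estimates to a data frame (notional data)
    library(ciTools)

    set.seed(7)
    dat   <- data.frame(x = runif(50, 0, 10))
    dat$y <- 2 + 0.5 * dat$x + rnorm(50)

    fit <- lm(y ~ x, data = dat)

    head(add_ci(dat, fit, alpha = 0.05))   # confidence intervals for fitted values
    head(add_pi(dat, fit, alpha = 0.05))   # prediction intervals for new observations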

Introduction to Design of Experiments

This training provides details regarding the use of design of experiments, from choosing proper response variables, to identifying factors that could affect such responses, to determining the amount of data necessary to collect. The training also explains the benefits of using a DOE approach to testing and provides an overview of commonly used designs (e.g., factorial, optimal, and space-filling). The briefing illustrates the concepts discussed using several case studies.
Breeana Anderson, Rebecca Medlin, John T. Haman, Kelly M. Avery, Keyla Pagan-Rivera (Technical Briefing)

Introduction to Measuring Situational Awareness in Mission-Based Testing Scenarios

In FY23, OED’s Test Science group conducted research into situational awareness (SA) measurement for operational testing (OT). Following our presentation at the 2023 DATAWorks conference, a representative from the Army Evaluation Command (AEC) reached out to the Test Science group requesting we present a brown-bag style presentation on situational awareness to their evaluators. The attached briefing is a modified version of the DATAWorks briefing that Test Science intends to present to DOT&E, and then AEC.
Elizabeth Green, John Haman (Technical Briefing)

Managing T&E Data to encourage reuse

Reusing Test and Evaluation (T&E) datasets multiple times at different points throughout a program’s lifecycle is one way to realize their full value. Data management plays an important role in enabling this practice. Reuse of T&E datasets does not occur on a consistent basis or in a formalized way. To enable and encourage data reuse, we expand upon four guiding principles – Findability, Accessibility, Interoperability, and Reusability (FAIR) – that can increase the reuse of T&E datasets.
Andrew Flack (Research Paper)

Data Management

Metamodeling Techniques for Verification and Validation of Modeling and Simulation Data

Modeling and simulation (M&S) outputs help the Director, Operational Test and Evaluation (DOT&E), assess the effectiveness, survivability, lethality, and suitability of systems. To use M&S outputs, DOT&E needs models and simulators to be sufficiently verified and validated. The purpose of this paper is to improve the state of verification and validation by recommending and demonstrating a set of statistical techniques—metamodels, also called statistical emulators—to the M&S community. The paper expands on DOT&E’s existing guidance about metamodel usage by creating methodological recommendations the M&S community could apply to its activities. For a deterministic, discrete response variable, we recommend using a nearest neighbor or decision tree model. For a deterministic, continuous response variable, we recommend Gaussian process interpolation. For a stochastic response variable, we recommend a generalized additive model. We also present a set of techniques that testers can use to assess the adequacy of their metamodels. We conclude with a notional example (a paper plane simulation) that demonstrates the recommended techniques. Finally, we include supplemental software written in R that readers can use to reproduce the outputs from this paper.
John T. Haman, Curtis G. Miller (Research Paper)
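
As a small illustration of the last recommendation (a generalized additive model for a stochastic response), the sketch below fits a GAM metamodel with R's mgcv package; the input factors, response, and run counts are invented and are unrelated to the paper's notional example.

    # GAM metamodel sketch for a stochastic M&S response (mgcv package, notional data)
    library(mgcv)

    set.seed(99)
    n <- 400
    sim_runs <- data.frame(range = runif(n, 1, 50),      # notional input factors
                           depth = runif(n, 10, 300))
    sim_runs$miss <- 5 + 0.1 * sim_runs$range +
                     20 * sin(sim_runs$depth / 100) +
                     rnorm(n, sd = 3)                    # stochastic output

    meta <- gam(miss ~ s(range) + s(depth), data = sim_runs)
    summary(meta)

    # Predict, with standard errors, at an unobserved setting
    predict(meta, newdata = data.frame(range = 25, depth = 150), se.fit = TRUE)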

On scoping a test that addresses the wrong objective

Statistical literature refers to a type of error that is committed by giving the right answer to the wrong question. If a test design is adequately scoped to address an irrelevant objective, one could say that a Type III error occurs. In this paper, we focus on a specific Type III error that on some occasions test planners commit to reduce test size and resources.
Thomas Johnson, Rebecca Medlin, Laura Freeman, James Simpson (Research Paper)

Power Analysis Tutorial for Experimental Design Software

This guide provides both a general explanation of power analysis and specific guidance to successfully interface with two software packages, JMP and Design Expert (DX).

James Simpson, Thomas Johnson, Laura J. Freeman (Handbook)

Regularization for Continuously Observed Ordinal Response Variables with Piecewise-Constant Functional Predictors

This paper investigates regularization for continuously observed covariates that resemble step functions. Two approaches for regularizing these covariates are considered, including a thinning approach commonly used within the DoD to address autocorrelated time series data.
Matthew Avery, Mark Orndorff, Timothy Robinson, Laura J. Freeman (Research Paper)

Scientific Measurement of Situation Awareness in Operational Testing

Situation Awareness (SA) plays a key role in decision making and human performance; higher operator SA is associated with increased operator performance and decreased operator errors. While maintaining or improving “situational awareness” is a common requirement for systems under test, there is no single standardized method or metric for quantifying SA in operational testing (OT). This leads to varied and sometimes suboptimal treatments of SA measurement across programs and test events. This paper introduces Endsley’s three-level model of SA in dynamic decision making, a frequently used model of individual SA; reviews trade-offs in some existing measures of SA; and discusses a selection of potential ways in which SA measurement during OT may be improved.
Elizabeth A. Green, Miriam E. Armstrong, Janna Mantua (Research Paper)

Space-Filling Designs for Modeling & Simulation

This document presents arguments and methods for using space-filling designs (SFDs) to plan modeling and simulation (M&S) data collection.

Han Yi, Curtis Miller, Kelly Avery (Research Paper)

Space-filling experimental design and surrogate models for U.S. Department of Defense modeling and simulation evaluation

The U.S. Department of Defense uses modeling and simulation (M&S) for test and evaluation of systems acquired by the Services. The Director, Operational Test and Evaluation (DOT&E), who provides oversight of operational testing, needs "to have the same understanding of and confidence in the data obtained from M&S as... any other data," specifically requiring that design of experiments (DOE) methodologies be used when generating M&S output to explore the conditions in which the system will be employed, and that statistical surrogates be estimated to characterize M&S predictions. Current policy does not recommend specific design and analysis methods, and there remains a gap in the defense community's statistical practice for M&S that needs to be filled. This presentation recommends how DOT&E policy can be fully operationalized by the test community. Specifically, we advocate for the use of space-filling designs (SFDs) to collect M&S data and for statistical emulators: Gaussian processes (GPs) or generalized additive models (GAMs). GPs and GAMs allow for high fidelity yet understandable statistical fits, and SFDs ensure good exploration of the operational space.
Curtis G Miller (Technical Briefing)

Statistical Methods Development Work for M&S Validation

Modeling and simulation (M&S) environments feature frequently in test and evaluation (T&E) of Department of Defense (DoD) systems. Many M&S environments do not suffer many of the resourcing limitations associated with live test. We thus recommend testers apply higher resolution output generation and analysis techniques compared to those used for collecting live test data. Space-filling designs (SFDs) are experimental designs intended to fill the operational space for which M&S predictions are expected. These designs can be coupled with statistical metamodeling techniques that estimate a model that flexibly interpolates or predicts M&S outputs and their distributions at both observed settings and unobserved regions of the operational space. Analysts can study metamodel properties to decide if an M&S environment adequately represents the original systems. This paper summarizes a presentation given at the DATAWorks 2023 workshop.
Curtis Miller (Technical Briefing)
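
To make the metamodeling step described above concrete, here is a hedged sketch (not code from the paper) that fits a Gaussian-process emulator to notional M&S outputs collected on a space-filling design and predicts, with uncertainty, at an unobserved setting. The `fake_simulator` function is a made-up placeholder for a real M&S environment.

```python
# Sketch only: fit a Gaussian-process metamodel to notional M&S output.
# `fake_simulator` is a stand-in for a real M&S environment.
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fake_simulator(x):
    # Placeholder response surface standing in for an M&S run.
    return np.sin(3 * x[:, 0]) + 0.5 * x[:, 1] ** 2

X = qmc.LatinHypercube(d=2, seed=7).random(n=40)   # space-filling inputs
y = fake_simulator(X)                              # notional M&S outputs

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, y)

# Predict (with uncertainty) at an unobserved point in the operational space.
x_new = np.array([[0.25, 0.75]])
mean, sd = gp.predict(x_new, return_std=True)
print(f"predicted output {mean[0]:.3f} +/- {2 * sd[0]:.3f}")
```

Studying how well such a metamodel tracks held-out M&S runs (and, where available, live data) is one way an analyst might judge whether the environment adequately represents the system.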

Statistical Methods for Defense Testing

In the increasingly complex and data-limited world of military defense testing, statisticians play a valuable role in many applications. Before the DoD acquires any major new capability, that system must undergo realistic testing in its intended environment with military users. Oftentimes new or complex analysis techniques are needed to support the goal of characterizing or predicting system performance across the operational space. Statistical design and analysis techniques are essential for rigorous evaluation of these models.
Dean Thomas, Kelly Avery, Laura Freeman (Research Paper)

Statistical Models for Combining Information: Stryker Reliability Case Study

This paper describes the benefits of using parametric statistical models to combine information across multiple testing events. Both frequentist and Bayesian inference techniques are employed, and they are compared and contrasted to illustrate different statistical methods for combining information.
Rebecca Dickinson, Laura J. Freeman, Bruce Simpson, Alyson Wilson (Research Paper)
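
The case study's models are richer than this, but a toy sketch of the general idea of combining pass/fail reliability data across test events may help: it contrasts a simple pooled frequentist estimate with a Bayesian update that carries information from one event to the next. All counts and the prior below are invented for illustration and are not from the paper.

```python
# Toy illustration (invented counts): combining pass/fail data across test events.
from scipy.stats import beta

events = [(18, 2), (27, 3), (45, 5)]   # (successes, failures) per notional event

# Frequentist view: pool all events into a single binomial estimate.
succ = sum(s for s, _ in events)
fail = sum(f for _, f in events)
print(f"pooled estimate: {succ / (succ + fail):.3f}")

# Bayesian view: start from a weak Beta(1, 1) prior and update event by event,
# so each test event adds information to the posterior.
a, b = 1.0, 1.0
for s, f in events:
    a, b = a + s, b + f
lo, hi = beta.ppf([0.025, 0.975], a, b)
print(f"posterior mean: {a / (a + b):.3f}, 95% credible interval: ({lo:.3f}, {hi:.3f})")
```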

Test & Evaluation of AI-Enabled and Autonomous Systems: A Literature Review

This paper summarizes a subset of the literature regarding the challenges to and recommendations for the test, evaluation, verification, and validation (TEV&V) of autonomous military systems.

Heather Wojton, Daniel Porter, John Dennis (Research Paper)

Test Design Challenges in Defense Testing

All systems undergo operational testing before fielding or full-rate production. While contractor and developmental testing tends to be requirements-driven, operational testing focuses on mission success. The goal is to evaluate operational effectiveness and suitability in the context of a realistic environment with representative users. This brief will first provide an overview of operational testing and discuss example defense applications of, and key differences between, classical and space-filling designs. It will then present several challenges (and possible solutions) associated with implementing space-filling designs and associated analyses in the defense community.
Rebecca Medlin, Kelly Avery, Curtis Miller (Technical Briefing)

The Effect of Extremes in Small Sample Size on Simple Mixed Models: A Comparison of Level-1 and Level-2 Size

We present a simulation study that examines the impact of small sample sizes at both the observation and nesting levels of the model on fixed-effect bias, type I error, and the power of a simple mixed-model analysis. Despite the need for adjustments to control for type I error inflation, our findings indicate that smaller samples than previously recognized can be used for mixed models under certain conditions prevalent in applied research.
Kristina A. Carter, Heather M. Wojton, Stephanie T. Lane (Research Paper)
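
A stripped-down version of the kind of simulation described above might look like the following. The group counts, number of replications, and use of statsmodels are arbitrary choices for illustration; the study's actual design and software are not specified here, and the sketch only checks empirical type I error under a null fixed effect.

```python
# Sketch of a small-sample mixed-model simulation (arbitrary settings, not the study's).
import warnings
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

warnings.filterwarnings("ignore")   # small samples often trigger convergence warnings

rng = np.random.default_rng(0)
n_groups, n_per_group, n_sims, alpha = 5, 4, 200, 0.05   # deliberately small sample

rejections = 0
for _ in range(n_sims):
    group = np.repeat(np.arange(n_groups), n_per_group)
    x = rng.normal(size=n_groups * n_per_group)
    # Null model: no fixed effect of x, only a random group intercept plus noise.
    y = rng.normal(size=n_groups)[group] + rng.normal(size=n_groups * n_per_group)
    data = pd.DataFrame({"y": y, "x": x, "group": group})
    fit = smf.mixedlm("y ~ x", data, groups=data["group"]).fit(reml=True)
    rejections += fit.pvalues["x"] < alpha

print(f"empirical type I error: {rejections / n_sims:.3f} (nominal {alpha})")
```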

The Purpose of Mixed-effects Models in Test and Evaluation

Mixed-effects models are the standard technique for analyzing data that exhibit some grouping structure. In defense testing, these models are useful because they allow us to account for correlations between observations, a feature common in many operational tests. In this article, we describe the advantages of modeling data from a mixed-effects perspective and discuss an R package, ciTools, that equips the user with easy methods for presenting results from this type of model.
John Haman, Matthew Avery, Heather Wojton (Research Paper)

Mixed models
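
The ciTools package discussed in the entry above is an R package; purely as a stand-in, the sketch below shows an analogous workflow in Python with statsmodels: fit a mixed-effects model with a random group intercept and report confidence intervals for the fixed effects, which is one of the outputs ciTools makes convenient in R. The data are simulated and every setting is arbitrary.

```python
# Stand-in sketch (simulated data): fit a mixed-effects model and report
# confidence intervals for the fixed effects. The article's ciTools package is
# for R; this only illustrates a comparable workflow in Python.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_groups, n_per_group = 8, 10
group = np.repeat(np.arange(n_groups), n_per_group)
x = rng.uniform(size=group.size)
group_effect = rng.normal(scale=0.5, size=n_groups)[group]   # random intercepts
y = 1.0 + 2.0 * x + group_effect + rng.normal(scale=0.3, size=group.size)

data = pd.DataFrame({"y": y, "x": x, "group": group})
fit = smf.mixedlm("y ~ x", data, groups=data["group"]).fit()

print(fit.summary())
print(fit.conf_int().loc[["Intercept", "x"]])   # 95% CIs for the fixed effects
```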

Trustworthy Autonomy: A Roadmap to Assurance -- Part 1: System Effectiveness

In this document, we present part one of our two-part roadmap. We discuss the challenges and possible solutions to assessing system effectiveness.

Daniel Porter, Michael McAnally, Chad Bieber, Heather Wojton, Rebecca Medlin (Handbook)

Why are Statistical Engineers needed for Test & Evaluation?

This briefing, developed for a presentation at the 2021 Quality and Productivity Research Conference, includes two case studies that highlight why statistical engineers are necessary for successful T&E. These case studies center on the important theme of improving methods to integrate testing and data collection across the full system life cycle – a large, unstructured, real-world problem. Integrated testing supports efficient test execution, potentially reducing cost.
Rebecca Medlin, Keyla Pagán-Rivera, Monica Ahrens (Technical Briefing)