
A Comparison of Ballistic Resistance Testing Techniques in the Department of Defense

This paper summarizes sensitivity test methods commonly employed in the Department of Defense. A comparison study shows that modern methods such as Neyer's method and Three-Phase Optimal Design are improvements over historical methods. (An illustrative code sketch follows this entry.)

Thomas Johnson, Laura J. Freeman, Janice Hester, Jonathan Bell – Research Paper
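
For a sense of what the historical baselines look like, the sketch below simulates a classic staircase procedure of the kind modern methods such as Neyer's improve upon (the Bruceton up-and-down test); all parameter values are hypothetical, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def bruceton_up_down(n_shots=40, start=10.0, step=0.5, mu=9.8, sigma=0.7):
    # The true threshold is modeled as Normal(mu, sigma); a shot at stimulus
    # level x responds with probability Phi((x - mu) / sigma). The level
    # steps down after a response and up after a non-response.
    x = start
    levels, outcomes = [], []
    for _ in range(n_shots):
        respond = rng.random() < norm.cdf((x - mu) / sigma)
        levels.append(x)
        outcomes.append(respond)
        x = x - step if respond else x + step
    return np.array(levels), np.array(outcomes)

levels, outcomes = bruceton_up_down()
# Crude summary: the average stimulus level approximates the median threshold.
print(f"estimated threshold ~ {levels.mean():.2f}")
```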

A First Step into the Bootstrap World

Bootstrapping is a powerful nonparametric tool for conducting statistical inference with many applications to data from operational testing. Bootstrapping is most useful when the population sampled from is unknown or complex or the sampling distribution of the desired statistic is difficult to derive. Careful use of bootstrapping can help address many challenges in analyzing operational test data. (An illustrative code sketch follows this entry.)
Matthew Avery – Technical Briefing
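
To make the resampling idea concrete, here is a minimal percentile-bootstrap sketch; the data and the choice of the median as the statistic are notional.

```python
import numpy as np

rng = np.random.default_rng(42)

# Notional operational test data (e.g., detection times in seconds).
times = np.array([12.1, 9.4, 15.2, 7.8, 22.5, 11.0, 13.7, 8.9, 30.1, 10.4])

# Nonparametric bootstrap: resample with replacement, recompute the statistic.
n_boot = 10_000
boot_medians = np.array([
    np.median(rng.choice(times, size=times.size, replace=True))
    for _ in range(n_boot)
])

# 95% percentile interval for the median detection time.
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(f"median = {np.median(times):.1f}s, 95% bootstrap CI = ({lo:.1f}s, {hi:.1f}s)")
```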

A Multi-Method Approach to Evaluating Human-System Interactions during Operational Testing

The purpose of this paper was to identify the shortcomings of a single-method approach to evaluating human-system interactions during operational testing and offer an alternative, multi-method approach that is more defensible, yields richer insights into how operators interact with weapon systems, and provides practical implications for identifying when the quality of human-system interactions warrants correction through either operator training or redesign.
Dean Thomas, Heather Wojton, Chad Bieber, Daniel Porter – Research Paper

A Review of Sequential Analysis

Sequential analysis concerns statistical evaluation in situations in which the number, pattern, or composition of the data is not determined at the start of the investigation, but instead depends upon the information acquired throughout the course of the investigation. Expanding the use of sequential analysis has the potential to reduce both test cost and test time (National Research Council, 1998). This paper summarizes the literature on sequential analysis and offers foundational information to support recommendations for its use in DoD test and evaluation. (An illustrative code sketch follows this entry.)
Rebecca Medlin, John Dennis, Keyla Pagán-Rivera, Leonard Wilkins, Heather Wojton – Research Paper
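
One classical sequential procedure is Wald's sequential probability ratio test (SPRT). The sketch below is a minimal pass/fail version with notional hypotheses and data; the paper itself surveys a much broader literature.

```python
import numpy as np

def sprt_bernoulli(data, p0=0.80, p1=0.90, alpha=0.05, beta=0.05):
    # Wald's SPRT: accumulate the log-likelihood ratio and stop as soon as
    # it crosses a boundary, so the sample size is not fixed in advance.
    upper = np.log((1 - beta) / alpha)   # cross above -> decide for H1
    lower = np.log(beta / (1 - alpha))   # cross below -> decide for H0
    llr = 0.0
    for n, success in enumerate(data, start=1):
        llr += np.log(p1 / p0) if success else np.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "decide H1", n
        if llr <= lower:
            return "decide H0", n
    return "no decision yet", len(data)

rng = np.random.default_rng(7)
trials = rng.random(200) < 0.9        # notional pass/fail test outcomes
print(sprt_bernoulli(trials))         # often stops well before 200 trials
```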

An Expository Paper on Optimal Design

There are many situations where the requirements of a standard experimental design do not fit the research requirements of the problem. Three such situations occur when the problem requires unusual resource restrictions, when there are constraints on the design region, and when a non-standard model is expected to be required to adequately explain the response.
Douglas C. Montgomery, Bradley A. Jones, Rachel T. Johnson – Research Paper

An Uncertainty Analysis Case Study of Live Fire Modeling and Simulation

This paper emphasizes the use of fundamental statistical techniques – design of experiments, statistical modeling, and propagation of uncertainty – in the context of a combat scenario that depicts a ground vehicle being engaged by indirect artillery.
Mark Couch, Thomas Johnson, John Haman, Heather Wojton, Benjamin Turner, David Higdon – Other

Artificial Intelligence & Autonomy Test & Evaluation Roadmap Goals

As the Department of Defense acquires new systems with artificial intelligence (AI) and autonomous (AI&A) capabilities, the test and evaluation (T&E) community will need to adapt to the challenges that these novel technologies present. The goals listed in this AI Roadmap address the broad range of tasks that the T&E community will need to achieve in order to properly test, evaluate, verify, and validate AI-enabled and autonomous systems. The roadmap includes issues that are unique to AI and autonomous systems, as well as legacy T&E shortcomings that will be compounded by newer technologies.
Brian Vickers, Daniel Porter, Rachel Haga, Heather Wojton – Technical Briefing

Bayesian Reliability: Combining Information

One of the most powerful features of Bayesian analyses is the ability to combine multiple sources of information in a principled way to perform inference. This feature can be particularly valuable in assessing the reliability of systems where testing is limited. At their most basic, Bayesian methods for reliability develop informative prior distributions using expert judgment or similar systems. Appropriate models allow the incorporation of many other sources of information, including historical data, information from similar systems, and computer models. We introduce the Bayesian approach to reliability using several examples and point to open problems and areas for future work. (An illustrative code sketch follows this entry.)
Alyson Wilson, Kassandra Fronczyk – Research Paper
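
At its most basic, combining information can be shown with a conjugate beta-binomial model, where prior knowledge enters as pseudo-counts. A minimal sketch with hypothetical numbers:

```python
import numpy as np
from scipy import stats

# Informative prior from expert judgment or similar systems (hypothetical):
# roughly "19 successes in 20 trials" worth of prior information.
a_prior, b_prior = 19.0, 1.0

# New, limited test data: 8 successes in 9 trials (notional).
successes, trials = 8, 9

# Beta prior + binomial likelihood -> Beta posterior (conjugacy).
a_post = a_prior + successes
b_post = b_prior + (trials - successes)
posterior = stats.beta(a_post, b_post)

print(f"posterior mean reliability = {posterior.mean():.3f}")
lo, hi = posterior.ppf([0.05, 0.95])
print(f"90% credible interval = ({lo:.3f}, {hi:.3f})")
```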

Censored Data Analysis Methods for Performance Data: A Tutorial

Binomial metrics like probability-to-detect or probability-to-hit typically do not provide the maximum information from testing. Continuous metrics such as time-to-detect provide more information but do not account for non-detects. Censored data analysis allows us to account for both pieces of information simultaneously. (An illustrative code sketch follows this entry.)
V. Bram Lillard – Technical Briefing
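
A minimal sketch of the idea: treat non-detects as right-censored observations of a continuous time-to-detect metric. The exponential model and the data below are illustrative assumptions, not taken from the briefing.

```python
import numpy as np

# Notional time-to-detect data (minutes); 'observed' is False for non-detects,
# which are right-censored at the 5-minute end of the trial window.
time     = np.array([1.2, 3.4, 0.8, 5.0, 2.2, 5.0, 4.1, 5.0])
observed = np.array([True, True, True, False, True, False, True, False])

# Censored-data MLE under an exponential time-to-detect model:
# rate = (# detects) / (total exposure time). Non-detects still contribute
# their full exposure time, so no information is thrown away.
lam = observed.sum() / time.sum()

print(f"detection rate     = {lam:.3f} per minute")
print(f"median detect time = {np.log(2) / lam:.2f} minutes")
# A purely binomial analysis would reduce the data to 5 detects in 8 trials,
# discarding how quickly the detections happened.
print(f"P(detect within 5 min) = {1 - np.exp(-lam * 5):.3f}")
```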

Characterizing Human-Machine Teaming Metrics for Test & Evaluation

This briefing defines human-machine teaming (HMT), describes new challenges in evaluating HMTs, and provides a framework for the categories of metrics that are important for the T&E of HMTs.

Heather Wojton, Brian Vickers, Kristina Carter, David Sparrow, Leonard Wilkins, Caitlan Fealing – Technical Briefing

Choice of second-order response surface designs for logistic and Poisson regression models

This paper illustrates the construction of D-optimal second-order designs for situations when the response is either binomial (pass/fail) or Poisson (count data). (An illustrative code sketch follows this entry.)

Rachel T. Johnson, Douglas C. Montgomery – Research Paper
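
To illustrate the general construction (not the paper's specific designs), the sketch below evaluates a D-type criterion for a one-factor second-order logistic model; note that the information matrix depends on an assumed parameter guess.

```python
import numpy as np

def d_criterion_logistic(X, beta):
    # log-determinant of the Fisher information X'WX for a logistic model;
    # the weights w = p(1 - p) depend on the assumed parameter values.
    eta = X @ beta
    p = 1.0 / (1.0 + np.exp(-eta))
    M = X.T @ ((p * (1.0 - p))[:, None] * X)
    return np.linalg.slogdet(M)[1]

def model_matrix(x):
    # Second-order model in one factor: intercept, linear, quadratic.
    return np.column_stack([np.ones_like(x), x, x ** 2])

beta_guess = np.array([0.0, 1.5, -1.0])                 # hypothetical prior guess

even_grid    = model_matrix(np.linspace(-1, 1, 12))     # 12 evenly spaced runs
three_levels = model_matrix(np.repeat([-1.0, 0.0, 1.0], 4))  # 4 runs per level

print("even grid   :", d_criterion_logistic(even_grid, beta_guess))
print("three levels:", d_criterion_logistic(three_levels, beta_guess))
```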

Comparing Computer Experiments for the Gaussian Process Model Using Integrated Prediction Variance

Space filling designs are a common choice of experimental design strategy for computer experiments. This paper compares space filling design types based on their theoretical prediction variance properties with respect to the Gaussian Process model. (An illustrative code sketch follows this entry.)
https://www.tandfonline.com/doi/abs/10.1080/08982112.2012.758284
Rachel T. Johnson, Douglas C. Montgomery, Bradley Jones, Chris Gotwalt – Research Paper
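
A rough sketch of the comparison criterion: integrated prediction variance for a Gaussian process, estimated by Monte Carlo and compared across a regular grid and a random Latin hypercube. The kernel, sample sizes, and designs are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(3)

def gauss_kernel(A, B, theta=4.0):
    # Gaussian correlation kernel for a unit-variance GP.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-theta * d2)

def integrated_pred_variance(X, n_mc=4000):
    # Monte Carlo estimate of GP prediction variance averaged over the
    # unit square; lower values indicate better prediction coverage.
    K = gauss_kernel(X, X) + 1e-8 * np.eye(len(X))   # jitter for stability
    Kinv = np.linalg.inv(K)
    U = rng.random((n_mc, 2))                        # integration points
    k = gauss_kernel(U, X)
    var = 1.0 - np.einsum("ij,jk,ik->i", k, Kinv, k)
    return var.mean()

n = 16
grid = np.array([[i, j] for i in np.linspace(0, 1, 4) for j in np.linspace(0, 1, 4)])
lhs = np.column_stack([(rng.permutation(n) + rng.random(n)) / n for _ in range(2)])

print("4x4 grid IPV       :", integrated_pred_variance(grid))
print("Latin hypercube IPV:", integrated_pred_variance(lhs))
```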

Demystifying the Black Box: A Test Strategy for Autonomy

The purpose of this briefing is to provide a high-level overview of how to frame the question of testing autonomous systems in a way that will enable development of successful test strategies. The brief outlines the challenges and broad-stroke reforms needed to get ready for the test challenges of the next century.
Heather Wojton, Daniel Porter – Technical Briefing

Designed Experiments for the Defense Community

This paper presents the underlying tenets of design of experiments, as applied in the Department of Defense, focusing on factorial, fractional factorial, and response surface design and analyses. The concepts of statistical modeling and sequential experimentation are also emphasized. (An illustrative code sketch follows this entry.)
Rachel T. Johnson, Douglas C. Montgomery, James R. Simpson – Research Paper
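
As a minimal illustration of the factorial analysis the paper covers, the sketch below estimates main effects and two-factor interactions from a notional 2^3 experiment.

```python
import numpy as np
from itertools import product

# 2^3 full factorial in coded (-1, +1) units.
design = np.array(list(product([-1.0, 1.0], repeat=3)))
A, B, C = design.T

# Notional responses for the eight runs (hypothetical data).
y = np.array([31.0, 35.2, 42.1, 47.9, 30.5, 36.1, 43.0, 52.3])

# Model matrix with intercept, main effects, and two-factor interactions.
X = np.column_stack([np.ones(8), A, B, C, A * B, A * C, B * C])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, c in zip(["mean", "A", "B", "C", "AB", "AC", "BC"], coef):
    print(f"{name:>4}: {c:+.2f}")
```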

Designing Experiments for Model Validation

Advances in computational power have allowed both greater fidelity and more extensive use of computational models. Numerous complex military systems have corresponding models that simulate their performance in the field. In response, the DoD needs defensible practices for validating these models. Design of experiments (DOE) and statistical analysis techniques are the foundational building blocks for validating the use of computer models and quantifying uncertainty in that validation. Recent developments in uncertainty quantification have the potential to benefit the DoD in using modeling and simulation to inform operational evaluations.
Heather Wojton, Kelly Avery, Laura Freeman, Thomas Johnson – Other

Designing experiments for nonlinear models—an introduction

This paper illustrates the construction of Bayesian D-optimal designs for nonlinear models and compares the relative efficiency of standard designs to these designs for several models and prior distributions on the parameters. (An illustrative code sketch follows this entry.)

Rachel T. Johnson, Douglas C. Montgomery – Research Paper
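
A toy version of the Bayesian D-optimal idea for a one-parameter nonlinear model: average the log information over prior draws instead of evaluating it at a single parameter guess. The model, prior, and candidate designs are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(11)

def log_info(x, theta):
    # Fisher information for y = exp(-theta * x) + noise has one parameter,
    # so it reduces to the sum of squared sensitivities df/dtheta.
    s = -x * np.exp(-theta * x)
    return np.log(np.sum(s ** 2))

def bayesian_d_criterion(x, theta_draws):
    # Bayesian D-criterion: expected log information over the prior,
    # instead of the information at one assumed parameter value.
    return np.mean([log_info(x, t) for t in theta_draws])

prior_draws = rng.gamma(shape=4.0, scale=0.25, size=500)  # notional prior, mean 1

clustered = np.full(6, 1.0)            # all runs at x = 1 (optimal if theta = 1)
spread    = np.linspace(0.2, 3.0, 6)   # runs spread across the region

print("clustered design:", bayesian_d_criterion(clustered, prior_draws))
print("spread design   :", bayesian_d_criterion(spread, prior_draws))
```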

Determining How Much Testing is Enough: An Exploration of Progress in the Department of Defense Test and Evaluation Community

This paper describes holistic progress in answering the question of “How much testing is enough?” It covers areas in which the T&E community has made progress, areas in which progress remains elusive, and issues that have emerged since 1994 that provide additional challenges. The case studies selected to highlight progress are illustrative examples rather than a comprehensive review of all programs since 1994.
Rebecca Medlin, Matthew Avery, James Simpson, Heather Wojton – Research Paper

Examining Improved Experimental Designs for Wind Tunnel Testing Using Monte Carlo Sampling Methods

In this paper we compare data from a fairly large legacy wind tunnel test campaign to smaller, statistically-motivated experimental design strategies. The comparison, using Monte Carlo sampling methodology, suggests a tremendous opportunity to reduce wind tunnel test efforts without losing test information.
Raymond R. Hill, Derek A. Leggio, Shay R. Capehart, August G. Roesener – Research Paper

Handbook on Statistical Design & Analysis Techniques for Modeling & Simulation Validation

This handbook focuses on methods for data-driven validation to supplement the vast existing literature for Verification, Validation, and Accreditation (VV&A) and the emerging references on uncertainty quantification (UQ). The goal of this handbook is to aid the test and evaluation (T&E) community in developing test strategies that support model validation (both external validation and parametric analysis) and statistical UQ.
Heather Wojton, Kelly Avery, Laura J. Freeman, Samuel Parry, Gregory Whittier, Thomas Johnson, Andrew Flack – Handbook

Tags: handbook, statistics

Hybrid Designs: Space Filling and Optimal Experimental Designs for Use in Studying Computer Simulation Models

This tutorial provides an overview of experimental design for modeling and simulation. Pros and cons of each design methodology are discussed.

Rachel Johnson Silvestrini – Technical Briefing

Improving Reliability Estimates with Bayesian Statistics

This paper shows how Bayesian methods are ideal for assessing the reliability of complex systems. Several examples illustrate the methodology.

Kassandra Fronczyk, Laura J. Freeman – Research Paper

Initial Validation of the Trust of Automated Systems Test (TOAST)

Trust is a key determinant of whether people rely on automated systems in the military and the public. However, there is currently no standard for measuring trust in automated systems. In the present studies we propose a scale to measure trust in automated systems that is grounded in current research and theory on trust formation, which we refer to as the Trust of Automated Systems Test (TOAST). We evaluated both the reliability of the scale structure and criterion validity using independent, military-affiliated and civilian samples. In both studies we found that the TOAST exhibited a two-factor structure, measuring system understanding and performance, respectively, and that factor scores significantly predicted scores on theoretically related constructs, demonstrating clear criterion validity. We discuss the implications of our findings for advancing the empirical literature and for improving interface design.
Heather Wojton, Daniel Porter, Stephanie Lane, Chad Bieber, Poornima Madhavan – Research Paper

Metamodeling Techniques for Verification and Validation of Modeling and Simulation Data

Modeling and simulation (M&S) outputs help the Director, Operational Test and Evaluation (DOT&E), assess the effectiveness, survivability, lethality, and suitability of systems. To use M&S outputs, DOT&E needs models and simulators to be sufficiently verified and validated. The purpose of this paper is to improve the state of verification and validation by recommending and demonstrating a set of statistical techniques—metamodels, also called statistical emulators—to the M&S community. The paper expands on DOT&E’s existing guidance about metamodel usage by creating methodological recommendations the M&S community could apply to its activities. For a deterministic, discrete response variable, we recommend using a nearest neighbor or decision tree model. For a deterministic, continuous response variable, we recommend Gaussian process interpolation. For a stochastic response variable, we recommend a generalized additive model. We also present a set of techniques that testers can use to assess the adequacy of their metamodels. We conclude with a notional example (a paper plane simulation) that demonstrates the recommended techniques. Finally, we include supplemental software written in R that readers can use to reproduce the outputs from this paper. (An illustrative code sketch follows this entry.)
John T. Haman, Curtis G. Miller – Research Paper
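
A minimal sketch of one of the recommended options, a nearest-neighbor metamodel for a deterministic, discrete response, together with a simple holdout adequacy check; the data and geometry are notional (the paper's own supplemental software is in R).

```python
import numpy as np

rng = np.random.default_rng(5)

# Notional M&S runs: two inputs in [0, 1], deterministic hit/miss output.
X = rng.random((300, 2))
y = ((X[:, 0] - 0.4) ** 2 + (X[:, 1] - 0.6) ** 2 < 0.09).astype(int)

def nn_predict(X_train, y_train, X_new):
    # 1-nearest-neighbor metamodel: predict the output of the closest run.
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return y_train[d2.argmin(axis=1)]

# Adequacy check on a holdout set the metamodel never saw.
train, test = slice(0, 250), slice(250, 300)
pred = nn_predict(X[train], y[train], X[test])
print(f"holdout accuracy = {(pred == y[test]).mean():.2f}")
```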

Power Analysis Tutorial for Experimental Design Software

This guide provides both a general explanation of power analysis and specific guidance to successfully interface with two software packages, JMP and Design Expert (DX). (An illustrative code sketch follows this entry.)

James Simpson, Thomas Johnson, Laura J. Freeman – Handbook
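
Independent of any particular software package, the underlying calculation resembles this normal-approximation sketch for comparing two proportions; the rates and run sizes are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def power_two_proportions(p0, p1, n, alpha=0.05):
    # Normal-approximation power for a two-sided comparison of two
    # proportions with n runs per group (one common textbook form).
    se0 = np.sqrt(2 * p0 * (1 - p0) / n)                 # SE if both rates = p0
    se1 = np.sqrt(p0 * (1 - p0) / n + p1 * (1 - p1) / n) # SE under H1
    z = norm.ppf(1 - alpha / 2)
    return norm.cdf((abs(p1 - p0) - z * se0) / se1)

# Notional question: runs per group needed to detect a 0.70 -> 0.85 change.
for n in (25, 50, 100, 200):
    print(f"n = {n:>3} per group -> power = {power_two_proportions(0.70, 0.85, n):.2f}")
```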

Regularization for Continuously Observed Ordinal Response Variables with Piecewise-Constant Functional Predictors

This paper investigates regularization for continuously observed covariates that resemble step functions. Two approaches for regularizing these covariates are considered, including a thinning approach commonly used within the DoD to address autocorrelated time series data. (An illustrative code sketch follows this entry.)
Matthew Avery, Mark Orndorff, Timothy Robinson, Laura J. Freeman – Research Paper
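
A minimal sketch of the thinning approach mentioned above: keep every k-th observation of an autocorrelated series and watch the lag-1 autocorrelation shrink. The AR(1) process and parameters are notional.

```python
import numpy as np

rng = np.random.default_rng(9)

# Notional continuously observed covariate: a strongly autocorrelated AR(1).
n, phi = 2000, 0.97
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal(scale=0.25)

def lag1_autocorr(v):
    v = v - v.mean()
    return (v[:-1] * v[1:]).sum() / (v * v).sum()

# Thinning: keep every k-th observation to weaken the serial dependence.
for k in (1, 10, 50):
    print(f"keep every {k:>2}th point -> lag-1 autocorrelation = "
          f"{lag1_autocorr(x[::k]):+.2f}")
```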

Space-Filling Designs for Modeling & Simulation

This document presents arguments and methods for using space-filling designs (SFDs) to plan modeling and simulation (M&S) data collection. (An illustrative code sketch follows this entry.)

Han Yi, Curtis Miller, Kelly Avery – Research Paper
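
One common space-filling construction is the Latin hypercube, often refined with a maximin distance criterion. A minimal sketch with arbitrary design sizes:

```python
import numpy as np

rng = np.random.default_rng(2)

def latin_hypercube(n, d):
    # Random Latin hypercube: exactly one point in each of the n strata
    # of every dimension.
    perms = rng.permuted(np.tile(np.arange(n), (d, 1)), axis=1).T
    return (perms + rng.random((n, d))) / n

def min_pairwise_distance(X):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2[np.triu_indices(len(X), 1)].min())

# Maximin refinement: among many random LHS candidates, keep the one whose
# closest pair of points is farthest apart (better space coverage).
best = max((latin_hypercube(20, 3) for _ in range(200)), key=min_pairwise_distance)
print(f"20-run, 3-factor maximin LHS: min pairwise distance = "
      f"{min_pairwise_distance(best):.3f}")
```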

Statistical Methods for Defense Testing

In the increasingly complex and data-limited world of military defense testing, statisticians play a valuable role in many applications. Before the DoD acquires any major new capability, that system must undergo realistic testing in its intended environment with military users. Oftentimes new or complex analysis techniques are needed to support the goal of characterizing or predicting system performance across the operational space. Statistical design and analysis techniques are essential for rigorous evaluation of these models.
Dean Thomas, Kelly Avery, Laura Freeman – Research Paper

Statistical Models for Combining Information: Stryker Reliability Case Study

This paper describes the benefits of using parametric statistical models to combine information across multiple testing events. Both frequentist and Bayesian inference techniques are employed, and they are compared and contrasted to illustrate different statistical methods for combining information.
Rebecca Dickinson, Laura J. Freeman, Bruce Simpson, Alyson Wilson – Research Paper

Test & Evaluation of AI-Enabled and Autonomous Systems: A Literature Review

This paper summarizes a subset of the literature regarding the challenges to and recommendations for the test, evaluation, verification, and validation (TEV&V) of autonomous military systems.

Heather Wojton, Daniel Porter, John Dennis – Research Paper

Test Design Challenges in Defense Testing

All systems undergo operational testing before fielding or full-rate production. While contractor and developmental testing tends to be requirements-driven, operational testing focuses on mission success. The goal is to evaluate operational effectiveness and suitability in the context of a realistic environment with representative users. This brief will first provide an overview of operational testing and discuss example defense applications of, and key differences between, classical and space-filling designs. It will then present several challenges (and possible solutions) associated with implementing space-filling designs and associated analyses in the defense community.
Rebecca Medlin, Kelly Avery, Curtis Miller – Technical Briefing

Trustworthy Autonomy: A Roadmap to Assurance – Part 1: System Effectiveness

In this document, we present part one of our two-part roadmap. We discuss the challenges and possible solutions to assessing system effectiveness.

Daniel Porter, Michael McAnally, Chad Bieber, Heather Wojton, Rebecca Medlin – Handbook

Why are Statistical Engineers needed for Test & Evaluation?

This briefing, developed for a presentation at the 2021 Quality and Productivity Research Conference, includes two case studies that highlight why statistical engineers are necessary for successful T&E. These case studies center on the important theme of improving methods to integrate testing and data collection across the full system life cycle – a large, unstructured, real-world problem. Integrated testing supports efficient test execution, potentially reducing cost.
Rebecca Medlin, Keyla Pagán-Rivera, Monica Ahrens – Technical Briefing