### Assess Test Adequacy

In the process of choosing a design, you may find that there are multiple reasonable options, and you might ask how your design “stacks up” against another. To answer this question, you can compute several **statistical measures of merit** that indicate the adequacy of a given test. Generally, these help to answer the questions “How much testing is enough?” and “How much do I stand to learn?”

Assessing the adequacy of a test is complex, and there are multiple features to consider depending on the goal and characteristics of the test. There are no “one size fits all” solutions, but there are useful tools that, when used in combination, allow you to compare and demonstrate the efficiency and effectiveness of test designs. These techniques inform us of the risks of making an incorrect decision for a proposed test design. That is, they can tell us how costly our test might be (i.e., required sample size), how much knowledge we stand to gain (i.e., precision of results and likelihood of detecting an effect), and how confident we can be in our conclusions. The following statistical measures of merit provide information concerning test adequacy. You can follow the links to learn more about, or read examples of, each one.

**Statistical Model Supported**

The goal of analyzing test data is to produce a statistical model that summarizes the results in such a way that we can observe data patterns and draw conclusions. Models provide us with a snapshot of variations in the system’s behavior across the test’s multiple factors and levels (and ideally across the system’s **operational envelope**). It is important to consider what type of model you want to end up with and ensure that your design supports this model. Conversely, if you have generated a design, it is important to check that the data can be modeled in such a way that your questions will be answered.

The test’s design determines the complexity of the model that can be used to summarize its results. More flexible and complex models give more detailed and precise depictions of any patterns within the data, but require more resources. Matching the flexibility and complexity of the model to the test’s goals and questions is a key component of test adequacy. Complexity is often measured in terms of **Model Resolution**, which indicates the highest order of effects the model can estimate.
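One way to check whether a design supports a given model is to build the model matrix (one column per model term) and confirm that it has full column rank, meaning every term can be estimated separately. The following is a minimal sketch under stated assumptions: factors are coded as -1/+1, the candidate models are "main effects only" and "main effects plus two-factor interactions", and the function names `model_matrix`, `rank`, and `supports` are hypothetical helpers introduced here for illustration.

```python
from fractions import Fraction
from itertools import combinations, product


def model_matrix(design, two_factor_interactions):
    """One row per run: intercept, main effects, optional two-factor interactions."""
    rows = []
    for run in design:
        row = [1, *run]
        if two_factor_interactions:
            row += [a * b for a, b in combinations(run, 2)]
        rows.append(row)
    return rows


def rank(matrix):
    """Matrix rank by exact Gaussian elimination on fractions."""
    m = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue  # no usable pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def supports(design, two_factor_interactions):
    """True if every model term can be estimated separately (full column rank)."""
    X = model_matrix(design, two_factor_interactions)
    return rank(X) == len(X[0])


# Illustrative designs (assumed, not from the text): an 8-run full factorial in
# three factors, and a 4-run half fraction in which the third factor is set to A*B.
full = list(product([-1, 1], repeat=3))
half = [(a, b, a * b) for a, b in product([-1, 1], repeat=2)]

print(supports(full, True))   # full factorial, interaction model
print(supports(half, False))  # half fraction, main-effects model
print(supports(half, True))   # half fraction, interaction model
```

In this sketch the full factorial supports the interaction model, while the four-run half fraction supports only the main-effects model: seven model terms cannot be estimated from four runs.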

**Confidence**

A design’s confidence level tells us the likelihood that we will correctly conclude a factor has no effect on the response variable when, in reality, it has none (i.e., the true negative rate). It is the complement of the alpha level (confidence = 1 − alpha), where alpha is the risk of concluding that a factor has an effect when, in reality, it does not (Type I error; false positive).

**Power**

The power of a design tells us the likelihood that we will conclude a factor has an effect on the response variable when, in reality, it does. This is the true positive rate. Power is directly related to test precision and is critical to the determination of test adequacy. Power also informs us of the trade-off between test resources and level of certainty. That is, information comes at a cost, but there is a point of diminishing returns.

Power is affected by many factors, including confidence level, effect size, error (noise), and design elements such as the number and placement of test points. A detailed description as well as several tools are available to help you compute a test’s power.
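To make the relationship between power, confidence level, effect size, noise, and sample size concrete, here is a rough sketch for the simplest case: comparing the mean response at two levels of a single factor. It uses a normal (z-test) approximation with a known noise standard deviation, which is an assumption made for brevity. Dedicated design tools typically use exact t- or F-based calculations, so treat these numbers as illustrative only. The function name `approx_power` and its parameters are hypothetical.

```python
from statistics import NormalDist


def approx_power(effect, sigma, n_per_level, alpha=0.05):
    """Approximate two-sided power to detect a difference in means of size
    `effect` between two factor levels, with `n_per_level` runs at each level
    and noise standard deviation `sigma` (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)  # rejection threshold set by the confidence level
    shift = effect / (sigma * (2 / n_per_level) ** 0.5)  # effect in standard-error units
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Doubling the number of runs buys less and less additional power.
for n in (4, 8, 16, 32, 64):
    print(n, round(approx_power(effect=1.0, sigma=1.0, n_per_level=n), 3))
```

When the true effect is zero, this expression reduces to alpha, which ties power back to the confidence level discussed above. The loop also shows the point of diminishing returns: once power is close to 1, additional runs add very little.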

**Collinearity**

Collinearity describes how strongly individual factors are linearly related (e.g., as one factor increases, so does the other). Analyzing data from designs with collinear factors can be misleading and imprecise, as estimates from this data are highly variable. Quality designs minimize collinearity among factors.
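A simple first check for collinearity is the pairwise correlation between factor columns in the planned design. The sketch below computes the Pearson correlation by hand. The two example designs are made up for illustration, and `pearson` is a hypothetical helper name.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long factor columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Balanced two-factor factorial: the factor settings vary independently.
orthogonal_a = [-1, -1, 1, 1]
orthogonal_b = [-1, 1, -1, 1]

# A design in which both factors were raised together from run to run.
collinear_a = [-1.0, -0.5, 0.5, 1.0]
collinear_b = [-1.0, -0.4, 0.6, 1.0]

print(pearson(orthogonal_a, orthogonal_b))
print(pearson(collinear_a, collinear_b))
```

A correlation near 0 indicates that the two factor effects can be separated cleanly. A correlation near 1 (or -1) means the design cannot tell which of the two factors is responsible for a change in the response.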

**Variance Inflation Factor (VIF)**

VIF is a one-number summary describing the degree of collinearity one factor has with the other factors in the model. A factor’s VIF is the factor by which the variance of its estimated coefficient is inflated relative to a design in which the factors are not collinear (i.e., are orthogonal). A VIF of 1 is ideal, and a rule of thumb is to keep this value below 5. A design including factors with low VIFs has greater power and requires fewer resources to generate precise results.
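For a first-order (main effects) model, the VIF of factor j equals 1 / (1 − R_j²), where R_j² comes from regressing factor j on the other factors. Equivalently, the VIFs are the diagonal entries of the inverse of the factor correlation matrix. The sketch below uses that second form. It is a minimal illustration, not a replacement for design software: the helper names are hypothetical and the example designs are made up.

```python
from math import sqrt


def correlation_matrix(columns):
    """Pairwise Pearson correlations between factor columns."""
    centered = []
    for col in columns:
        mean = sum(col) / len(col)
        centered.append([v - mean for v in col])
    norms = [sqrt(sum(v * v for v in c)) for c in centered]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]


def invert(matrix):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(aug[i][c]))
        if abs(aug[p][c]) < 1e-12:
            raise ValueError("factors are perfectly collinear; VIF is undefined")
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for i in range(n):
            if i != c:
                f = aug[i][c]
                aug[i] = [a - f * b for a, b in zip(aug[i], aug[c])]
    return [row[n:] for row in aug]


def vifs(columns):
    """One VIF per factor: the diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


orthogonal = [[-1, -1, 1, 1], [-1, 1, -1, 1]]
correlated = [[-1, -1, 1, 1], [-1, -1, 1, 0]]  # second factor mostly tracks the first

print([round(v, 2) for v in vifs(orthogonal)])
print([round(v, 2) for v in vifs(correlated)])
```

The orthogonal design gives VIFs of 1. In the correlated design the squared correlation between the two factors is 9/11, so both VIFs come out at about 5.5, above the rule-of-thumb limit of 5.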

**Scaled Prediction Variance (SPV)**

SPV estimates how precise a model’s predictions will be. Before data is collected, SPV can indicate how much error would be involved in the resulting model’s estimates. SPV is calculated for specific locations across the design space (i.e., at the various factor level combinations) and commonly displayed graphically as shown below.
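Numerically, SPV at a point x is commonly defined as N · f(x)ᵀ (XᵀX)⁻¹ f(x), where X is the model matrix of the N-run design and f(x) is the point expanded into model terms. The sketch below evaluates this for a four-run, two-factor factorial with a first-order model. Both the design and the helper names (`invert`, `spv`) are assumptions chosen for illustration, separate from any figure in this section.

```python
def invert(matrix):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(aug[i][c]))
        if abs(aug[p][c]) < 1e-12:
            raise ValueError("design does not support this model (singular X'X)")
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for i in range(n):
            if i != c:
                f = aug[i][c]
                aug[i] = [a - f * b for a, b in zip(aug[i], aug[c])]
    return [row[n:] for row in aug]


def spv(X, f_x):
    """Scaled prediction variance N * f(x)' (X'X)^-1 f(x) at one location."""
    n_runs, n_terms = len(X), len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(n_terms)]
           for i in range(n_terms)]
    inv = invert(xtx)
    quad = sum(f_x[i] * inv[i][j] * f_x[j]
               for i in range(n_terms) for j in range(n_terms))
    return n_runs * quad


# Model matrix for a 2x2 factorial: columns are intercept, factor A, factor B.
X = [[1, a, b] for a in (-1, 1) for b in (-1, 1)]

print(spv(X, [1, 0, 0]))      # center of the design space
print(spv(X, [1, 1, 1]))      # a corner of the design space
print(spv(X, [1, 0.5, -0.5])) # an interior point
```

For this design XᵀX is 4 times the identity matrix, so SPV simplifies to 1 + a² + b²: predictions are most precise at the center and least precise at the corners. Note that SPV depends only on the design and the model, which is why it can be computed before any data are collected.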

**Fraction of Design Space (FDS)**

FDS summarizes the scaled prediction variance across the entire design space. An FDS graph shows the proportion of the design space with a scaled prediction variance less than or equal to a given value. For the previous example, the FDS graph for Designs A and B shows that nearly 80% of the Design A space has an SPV below 4.0, while roughly 55% of the Design B space has an SPV below 4.0. Because a larger fraction of its design space yields precise predictions, the chart makes clear that the evaluators should choose Design A.
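An FDS value can be approximated by sampling many random points in the design space, computing SPV at each, and reporting the fraction at or below a threshold. The sketch below is a hedged illustration, not a reproduction of Designs A and B: it assumes a four-run, two-factor factorial with a first-order model on the square [-1, 1] × [-1, 1], for which SPV works out to 1 + a² + b². The names `spv_factorial` and `fds` are hypothetical.

```python
import random


def spv_factorial(a, b):
    """SPV of a 2x2 factorial with a first-order model at coded point (a, b).
    Assumed closed form: X'X = 4*I, so N * f'(X'X)^-1 f = 1 + a^2 + b^2."""
    return 1 + a * a + b * b


def fds(spv_fn, threshold, n_samples=20000, seed=0):
    """Monte Carlo estimate of the fraction of the coded design space
    [-1, 1] x [-1, 1] whose SPV is at or below `threshold`."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    hits = sum(
        spv_fn(rng.uniform(-1, 1), rng.uniform(-1, 1)) <= threshold
        for _ in range(n_samples)
    )
    return hits / n_samples


# Trace out points on the FDS curve for this design.
for threshold in (1.0, 1.5, 2.0, 2.5, 3.0):
    print(threshold, round(fds(spv_factorial, threshold), 3))
```

To compare two candidate designs, you would run `fds` with each design's SPV function over the same thresholds. The design whose curve reaches a high fraction at a lower SPV gives more precise predictions over more of the design space.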