### Benefits of Statistical Modeling

Statistical models summarize the results of a test and present them in such a way that humans can more easily see and understand any patterns within the data. Without statistical modeling, evaluators are left, at best, with “eyeball” tests or, at worst, gut feelings about whether one system performed better than another. Rigorous statistical analysis is less subject to bias, as it involves objectively quantifying and summarizing the data. This results in defensible conclusions that can better inform decision making.

Statistical models give evaluators a vocabulary with which to talk about test results, including the magnitude of differences, strength and types of relationships, and the degree to which one can have confidence in results.

##### What Is a Model?

The goal of statistical modeling is to summarize a test’s results in such a way that evaluators can observe data patterns, draw conclusions, and ultimately answer the questions that prompted the test. Models provide a snapshot of variations in the system’s behavior across the test’s multiple factors and levels. For example, a simple model could summarize the behavior (i.e., response variable) of a tracking system at high velocity and compare that to a summary of the system’s behavior at low velocity, and then indicate whether the behavior differed significantly.

Statistical models are expressed as mathematical equations that specify how the response variable changes as a function of factor levels. These empirical models (in contrast to mechanistic, physics-based models) can then be used to make statements about changes in performance across the operational space (i.e., the test’s factors and levels), as well as to predict system performance.
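As an illustration only, a model for a test with two factors might be written as below. The symbols are assumptions introduced here, not taken from a specific test: \(y\) is the response variable, \(x_1\) and \(x_2\) are the coded levels of two factors (e.g., velocity and altitude), the \(\beta\) terms are coefficients estimated from the data, and \(\varepsilon\) is random error.

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \varepsilon \]

Here \(\beta_1\) and \(\beta_2\) describe how the response changes with each factor, and \(\beta_{12}\) captures an interaction, that is, whether the effect of one factor depends on the level of the other.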

Analysts follow a process of model selection to decide which factors to include and what assumptions to make in order to accurately represent the test data.

##### Model Selection

Model selection refers to choosing which terms should play a role in modeling the response variable. Each factor that is tested can be included as a term in the model, as can interactions and covariates (e.g., nuisance variables that were recorded for statistical control). The goal of model selection is to choose a sparse statistical model that adequately explains the data.

Whereas the response variables, factors/levels, and covariates will likely be determined during the test planning stage, it is possible that the factors included in the model will change from what was originally planned. For example, two recorded (uncontrolled) factors may end up being confounded with each other, in which case only one (or a combination of the two) should be included in the model, not both. Collinearity should also be assessed prior to modeling, as two collinear factors should not both be included.

It is important to consider the type of model your design and data support. For example, continuous data allow for more detailed models than dichotomous data, and center points allow for the modeling of curvature. More flexible and complex models can give more detailed depictions of patterns within the data, provided the design and data support them. The content of the model can also be influenced by how the design was executed (e.g., additional terms are added for a split-plot design). Additionally, there may be “holes” in the data (e.g., cancelled test points) such that not all planned analyses can be conducted. Factor-by-factor plots can help identify these problem areas.

The analyst must also choose the most appropriate **distribution** for the response variable. This choice should already have been considered notionally in the test planning stage, at least as far as determining whether the response is continuous or binary. Once the data have been collected, exploratory data analyses and visualizations such as Q-Q plots can be used to inform a decision on the specific distribution to be used in modeling, as well as to check that the data are appropriate for modeling and meet assumptions.

##### Selection Methods

There are three possible overarching methods with which to perform model selection. The first is forward selection, where nothing but an intercept is included in the initial model. The addition of each variable is then tested using a chosen criterion and the variable (if any) that improves the model the most is added to the model. This process is repeated until the addition of a term no longer significantly improves the model.

A second method is backward selection. In this case, all possible model terms are initially included in the model. The deletion of each variable is tested using a chosen criterion and the variable (if any) that improves the model the most by being deleted is removed. This process is repeated until no further improvement is possible.

Finally, stepwise selection is a combination of forward and backward selection. At each step, a variable may be added or removed based on what improves the model the most.
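The forward selection procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a recommended implementation: it uses AIC as the chosen criterion, assumes normally distributed errors, fits by ordinary least squares, and runs on made-up data. The names `fit_aic`, `forward_select`, `velocity`, and `altitude` are hypothetical.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_aic(columns, y):
    """Least-squares fit with an intercept; return AIC assuming normal errors."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[r] * row[c] for row in design) for c in range(k)] for r in range(k)]
    xty = [sum(row[r] * yi for row, yi in zip(design, y)) for r in range(k)]
    beta = solve_linear_system(xtx, xty)
    rss = sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
              for row, yi in zip(design, y))
    loglik = -0.5 * n * (math.log(2.0 * math.pi) + math.log(rss / n) + 1.0)
    return -2.0 * loglik + 2.0 * (k + 1)  # +1 counts the error variance

def forward_select(factors, y):
    """Start from the intercept; add the term that lowers AIC most, until none does."""
    selected, remaining = [], list(factors)
    best_aic = fit_aic([], y)
    while remaining:
        scores = {name: fit_aic([factors[s] for s in selected + [name]], y)
                  for name in remaining}
        best_name = min(scores, key=scores.get)
        if scores[best_name] >= best_aic:
            break  # no candidate improves the criterion
        selected.append(best_name)
        remaining.remove(best_name)
        best_aic = scores[best_name]
    return selected, best_aic

# Made-up data: a replicated two-level design in coded units (-1 / +1).
velocity = [-1, -1, 1, 1] * 3
altitude = [-1, 1, -1, 1] * 3
noise = [0.3, 0.2, -0.1, -0.4, -0.2, -0.1, 0.4, 0.3, 0.1, 0.2, -0.3, -0.4]
y = [10.0 + 4.0 * v + e for v, e in zip(velocity, noise)]

selected, final_aic = forward_select({"velocity": velocity, "altitude": altitude}, y)
print("selected terms:", selected)
```

Backward selection would follow the same pattern in reverse, starting from all terms and removing the one whose deletion lowers the criterion most. In practice an analyst would typically rely on a statistics package rather than hand-rolled code like this.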

##### Model Fit Criteria

A handful of criteria can be used to determine whether the addition (or deletion) of a term significantly improves a model. The first is the p-value: the probability of observing an effect at least as large as the one estimated for a particular factor (or interaction) if that factor truly had no effect. Thus, a smaller p-value indicates stronger evidence of a real effect due to that factor, and the term should probably be added to (or kept in) the model.

A second commonly used model selection criterion is the likelihood ratio test. Likelihood measures the “probability” of the observed data given a selected model; the higher the likelihood, the better the fit of the model to the data. The likelihood ratio test statistic *D* compares the fit of two nested models. Under the null hypothesis that the simpler model is adequate, *D* approximately follows a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters, so a test of significance can be performed. One should continue adding (or removing) terms until the difference between the current model and the previous one is not significant.
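As a rough sketch of the likelihood ratio test, the example below compares an intercept-only model with a model that adds one slope term, assuming normally distributed errors. The data and function names are made up for illustration. Because the two models differ by exactly one parameter, the chi-squared tail probability with one degree of freedom can be computed with the standard library's `math.erfc`; a different number of added terms would need a general chi-squared routine, and the chi-squared reference distribution is only an approximation for small samples like this one.

```python
import math

def gaussian_loglik(rss, n):
    """Maximized log-likelihood of a normal-error model with residual sum of squares rss."""
    return -0.5 * n * (math.log(2.0 * math.pi) + math.log(rss / n) + 1.0)

def rss_intercept_only(y):
    """Residual sum of squares when the model is just the mean of y."""
    mean_y = sum(y) / len(y)
    return sum((yi - mean_y) ** 2 for yi in y)

def rss_simple_regression(x, y):
    """Residual sum of squares for the least-squares line y = intercept + slope * x."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))

def likelihood_ratio_test_1df(loglik_reduced, loglik_full):
    """Return (D, p-value) for nested models that differ by one parameter."""
    d = 2.0 * (loglik_full - loglik_reduced)
    # For one degree of freedom, P(chi-squared > d) = erfc(sqrt(d / 2)).
    p_value = math.erfc(math.sqrt(max(d, 0.0) / 2.0))
    return d, p_value

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]
n = len(y)

loglik_reduced = gaussian_loglik(rss_intercept_only(y), n)
loglik_full = gaussian_loglik(rss_simple_regression(x, y), n)
d, p_value = likelihood_ratio_test_1df(loglik_reduced, loglik_full)
print(f"D = {d:.2f}, p-value = {p_value:.2e}")
```

A small p-value here would suggest that the slope term improves the fit by more than chance alone would explain, so the term would be kept.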

A final set of model selection criteria are those known as information criteria. These methods compare candidate subsets of factors based on a tradeoff between lack of fit (measured by model likelihood) and complexity (measured by the number of parameters included in the model); lower values indicate a better tradeoff. Two commonly used information criteria are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC).

\(\textrm{AIC} = -2 \ln(\textrm{likelihood}) + 2p\), where *p* is the number of parameters in the model.

\(\textrm{BIC} = -2 \ln(\textrm{likelihood}) + p \ln(n)\), where *p* is the number of parameters in the model and *n* is the number of observations in the dataset.
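The two formulas translate directly into code. The sketch below applies them to two hypothetical candidate models; the log-likelihood values, parameter counts, and model names are invented for illustration and do not come from any real test.

```python
import math

def aic(log_likelihood, p):
    """AIC = -2 ln(likelihood) + 2p."""
    return -2.0 * log_likelihood + 2.0 * p

def bic(log_likelihood, p, n):
    """BIC = -2 ln(likelihood) + p ln(n)."""
    return -2.0 * log_likelihood + p * math.log(n)

# Hypothetical log-likelihoods for two candidate models fit to n = 40 runs.
n = 40
candidates = {
    "velocity only": {"loglik": -61.0, "p": 3},
    "velocity + altitude": {"loglik": -59.5, "p": 4},
}

for name, m in candidates.items():
    print(f"{name}: AIC = {aic(m['loglik'], m['p']):.2f}, "
          f"BIC = {bic(m['loglik'], m['p'], n):.2f}")
```

With these invented numbers, AIC slightly favors the larger model (127 versus 128), while BIC slightly favors the smaller one (about 133.07 versus 133.76). BIC's penalty per parameter, ln(*n*), exceeds AIC's penalty of 2 whenever *n* is 8 or more, so BIC tends to select sparser models.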

##### Model Validation

Once a potential model has been selected, one should ensure that the model is valid. A graphical inspection of the residuals is useful for checking model assumptions, which typically include linearity, homoscedasticity (constant variance), independence, and normality. Suggested plots include residuals vs. predicted values, residuals over time, and a Q-Q plot of the residuals.

Cross-validation can also be used to confirm model adequacy. The idea is to fit the selected model on one portion of the data (the “training” set) and see how well it performs on another, unseen portion (the “testing” set). The root mean squared error (a measure of the difference between the model’s predictions and the observed data) will typically be higher for the testing set, but the difference between that and the error for the training set should not be substantial. Multiple types of cross-validation exist, some of which require more data than others.
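One common variant is k-fold cross-validation, sketched below for a simple straight-line model. This is an illustration under assumptions: the data are made up, the function names are hypothetical, and the folds are formed by holding out every k-th point rather than by random shuffling, which keeps the example deterministic.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(x, y, intercept, slope):
    """Root mean squared error of the fitted line on the points (x, y)."""
    squared = [(yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)]
    return math.sqrt(sum(squared) / len(squared))

def k_fold_rmse(x, y, k):
    """Average training and testing RMSE over k interleaved folds."""
    train_scores, test_scores = [], []
    for fold in range(k):
        held_out = set(range(fold, len(y), k))  # every k-th point is the testing set
        x_tr = [x[i] for i in range(len(y)) if i not in held_out]
        y_tr = [y[i] for i in range(len(y)) if i not in held_out]
        x_te = [x[i] for i in sorted(held_out)]
        y_te = [y[i] for i in sorted(held_out)]
        intercept, slope = fit_line(x_tr, y_tr)  # fit on training data only
        train_scores.append(rmse(x_tr, y_tr, intercept, slope))
        test_scores.append(rmse(x_te, y_te, intercept, slope))
    return sum(train_scores) / k, sum(test_scores) / k

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [2.3, 2.8, 4.1, 5.2, 5.7, 7.1, 8.2, 8.6, 10.3, 10.9, 11.8, 13.2]

train_rmse, test_rmse = k_fold_rmse(x, y, k=4)
print(f"training RMSE: {train_rmse:.3f}, testing RMSE: {test_rmse:.3f}")
```

If the testing error were much larger than the training error, that would be a sign the model is fitting noise in the training data rather than a pattern that generalizes.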

A final check of model validation is to compare the predictions output by the model to the raw data. Any disagreement between the data and the model doesn’t necessarily invalidate the model, but the analyst should ensure he or she understands and can explain why the differences exist.

##### Impact of Design on Modeling

In the best-case scenario, evaluators are able to stick to their design and tests are executed according to plan. When the design is followed, the statistical modeling of the data is straightforward because statistical thinking was involved in the planning stage. For example, suppose the evaluators experimentally controlled factors of no interest, such as different operators, and systematically varied the primary factor (e.g., time of day). The analyst can then directly compare system performance at night versus during the day.