Model Selection


Statistical models are expressed as mathematical equations that specify how the response variable changes as a function of factor levels. They summarize the results of a test and present them in such a way that humans can more easily see and understand any patterns within the data. By fitting statistical models, we gain the vocabulary with which to talk about test results, including the magnitude of differences, the strength and types of relationships, and the degree to which one can have confidence in the results. These empirical models (in contrast to mechanistic, physics-based models) can then be used to make statements about changes in performance across the operational space (i.e., the test’s factors and levels), as well as to predict system performance.

An important feature of a statistical model is that it aligns with, but is simpler than, the reality it describes. When characterizing the results of a test, our statistical model should contain the key factors that characterize the operational environment of the system under test, but the model will naturally be simpler than reality. For example, in an operational test of a combat system, we may introduce a test factor representing a mock threat; however, we cannot fully reproduce other aspects of the threat environment, such as operator stress in combat. Therefore, much like our test design, our model will align with, but be a simplification of, reality.

Rigorous statistical analysis is less subject to personal biases, as it involves objectively quantifying and summarizing the data. Without statistical modeling, we are left, at best, with “eyeball” tests or, at worst, gut feelings about whether one system performed better than another. With statistical modeling, on the other hand, we can observe data patterns, draw conclusions, and ultimately answer the questions that prompted the test. Models provide a snapshot of variations in the system’s behavior across the test’s multiple factors and levels. For example, a simple model could summarize the behavior (i.e., response variable) of a tracking system at high velocity, compare it to a summary of the system’s behavior at low velocity, and then indicate whether the behavior differed significantly. We follow a process of model selection to decide which factors to include and what assumptions to make in order to accurately represent the test data.


Model selection refers to choosing which terms should play a role in modeling the response variable. Each factor that is tested can be included as a term in the model, as can interactions and covariates (e.g., potential nuisance variables that were recorded for statistical control). The goal of model selection is to choose a sparse statistical model that adequately explains the data. A good model has three main characteristics: parsimony (the model is simple), goodness of fit (the model fits the data well), and generalizability (the model can describe or predict new data). Moreover, a good model includes just the necessary factors and covariates to 1) avoid being underfit (too simple), 2) avoid being overfit (unnecessarily complex), and 3) account for potential confounding.

It is important to consider the type of model your design and data support. For example, continuous data allow for more detailed models than dichotomous data, and center points allow for the modeling of curvature. More flexible and complex models result in more detailed and precise depictions of any patterns within the data. The content of the model can also be influenced by how the design was executed (e.g., additional terms are added for a split-plot design). Moreover, there may be “holes” in the data (e.g., cancelled test points) such that not all planned analyses can be conducted. Factor-by-factor plots can help with the identification of these problem areas.

We must also choose the most appropriate distribution for the response variable. This should have already been thought about notionally in the test planning stage, at least as far as determining whether the response is continuous or binary. Once the data have been collected, exploratory data analyses and visualizations such as Q-Q plots can be used to inform a decision on the specific distribution to be used in modeling, as well as to check that the data are appropriate for modeling and meet assumptions.

Selection Methods

There are several model-building strategies available for regression modeling. Traditional approaches include exhaustive search, forward selection, backward selection, and stepwise selection; however, increasing attention is being paid to regularization approaches (ridge regression, the LASSO, and the elastic net). These approaches comprise our selection methods.

In exhaustive search, all possible models are fitted, and an evaluation criterion (discussed in the next section) is used to determine which model best fits the data.

Using forward selection, the initial model contains only an intercept term. The addition of each candidate variable is then tested using a chosen criterion, and the variable (if any) that most improves the model is added. This process terminates when no remaining variable significantly improves the model.

Backward selection follows similar logic, but in reverse order. In backward selection, all variables of interest are included in the initial model. The variable with the worst value of the criterion is then dropped from the model. This process continues iteratively until all remaining predictors meet a pre-defined level of significance.

Finally, stepwise selection may be viewed as a combination of forward and backward selection. That is, stepwise selection may begin as a backward selection procedure, dropping variables that do not improve the model, but previously dropped variables may be reintroduced if they become significant according to the chosen criterion. The forward, backward, and stepwise methods are automated in some statistical software.
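The forward-selection loop can be sketched in a few lines. This is a minimal illustration on synthetic data, not a production routine: the `aic` and `forward_select` helpers are our own, and the AIC here is the ordinary-least-squares version up to an additive constant, n ln(RSS/n) + 2p.

```python
import numpy as np

def aic(y, X):
    """AIC for an OLS fit, up to an additive constant: n*ln(RSS/n) + 2p."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * p

def forward_select(y, candidates):
    """Greedy forward selection: start from the intercept-only model,
    add the candidate column that most lowers AIC, and stop when no
    remaining candidate improves the criterion."""
    n = len(y)
    chosen, remaining = [], dict(candidates)
    X = np.ones((n, 1))                      # intercept-only null model
    best = aic(y, X)
    while remaining:
        scores = {name: aic(y, np.column_stack([X, col]))
                  for name, col in remaining.items()}
        name = min(scores, key=scores.get)
        if scores[name] >= best:             # no candidate improves the fit
            break
        best = scores[name]
        X = np.column_stack([X, remaining.pop(name)])
        chosen.append(name)
    return chosen

# Synthetic example: x1 and x2 drive the response, x3 is pure noise
rng = np.random.default_rng(0)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)
print(forward_select(y, {"x1": x1, "x2": x2, "x3": x3}))
```

With a strong signal, the two true predictors are picked up early; a noise variable can still sneak in occasionally, which is exactly the overfitting risk that motivates validation later on.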

Alternative methods of variable selection exist which fall under the umbrella term of “regularization.” Broadly speaking, these regularization procedures operate by introducing penalty terms into our model, addressing what is commonly known as the “bias-variance tradeoff.” That is, as our model becomes increasingly complex (i.e., we add more independent variables), our parameter estimates have increased variance. In turn, this increased variance results in increased prediction error.

By adding penalties to our model, we are introducing some bias in our parameter estimates, but we are doing so to reduce the variance and ultimately decrease our prediction error. These regularization methods, including the LASSO, the ridge, and the elastic net (among others), serve to ameliorate concerns that our model may be overfitting our data.

The table below summarizes several approaches to variable selection within a regression context. Note that several of these methods can be extended to a grouped context, where variables can be grouped together in sets to reduce the number of total tests. This topic, known as grouped variable selection, will not be covered further here.

| Method | Description | Advantages | Disadvantages |
|---|---|---|---|
| Forward selection | Begins with the null (empty) model; iteratively adds variables until a criterion is met | Straightforward to implement | Because variables are added one at a time and never removed, variables may remain in the model that are no longer significant |
| Backward selection | Begins with the full model; iteratively removes variables until a criterion is met | Straightforward to implement | Because variables are removed one at a time and not reconsidered for addition, one may drop a variable that would be significant after dropping a different variable |
| Stepwise selection | Can begin with either the full or the empty model. If beginning with the full model, it drops variables iteratively until a criterion is met, then reintroduces dropped variables that are now significant | Straightforward to implement; searches both forward and backward | Can have trouble selecting predictors in the presence of collinearity |
| Exhaustive selection | Tests every possible subset of variables, including possible higher-order effects (i.e., interaction effects between factors) | Exhaustive; allows investigation of all possible combinations | Computationally expensive and quickly intractable with a larger number of factors; only useful with a small number of predictor variables |
| LASSO | Begins with the full model and shrinks coefficients via a tuning parameter. Coefficients can be shrunk all the way to zero, effectively performing variable selection (i.e., keeping only meaningful variables) | Performs variable selection; reduces variance in the parameter estimates and reduces prediction error | Biases the magnitude of the coefficients |
| Ridge | Begins with the full model and shrinks coefficients via a tuning parameter. Coefficients are shrunk toward zero, but never all the way to zero | Shrinks highly collinear variables toward each other, reducing variance in parameter estimates | Cannot perform variable selection; you may be left with many variables shrunk close to zero but not removed |
| Elastic net | Begins with the full model and shrinks coefficients via multiple tuning parameters. Can shrink coefficients to zero (like the LASSO) and shrink highly correlated variables toward each other (like the ridge) | Combines the benefits of both the LASSO and ridge penalties | If predictors are not correlated, the LASSO penalty alone may be more accurate |
Evaluation Criteria

In order to select a final model, we need some criterion on which to decide whether or not our model has “improved” during the model selection process. Various selection criteria exist to try to characterize the fit of the model to the data.

One model selection criterion is the significance of the factors and covariates based on the p-value. The p-values of individual variables (factors or covariates) may be used as an evaluation criterion with forward, backward, or stepwise selection procedures. Because these p-values are usually used during an exploratory phase, a slightly higher cutoff is frequently employed for all three: a p-value of 0.10 or 0.15, as opposed to the more conservative value of 0.05.

Another such criterion, the likelihood ratio test, makes use of the concept of likelihood. Likelihood measures the “probability” of the observed data given a selected model; the higher the likelihood, the better the goodness of fit of the model to the data. Specifically, the likelihood ratio test statistic (D) compares the fit of two nested models and follows a chi-squared distribution, with degrees of freedom equal to the difference in the number of parameters between the two models, so that a test of significance can be performed. One should continue adding or removing terms (depending on which selection method is chosen) until the difference between that model and the previous one is not significant. The D statistic is defined as follows:

\(\begin{eqnarray*} D & = & -2\,\ln \left(\frac{\textrm{likelihood for null model}}{\textrm{likelihood for alternative model}}\right)\\ \\ & = & -2\,[\ln(\textrm{likelihood for null model})\ -\ \ln (\textrm{likelihood for alternative model})]\\ \\ & \sim & \chi^2_{df_2-df_1} \end{eqnarray*}\)
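For example, suppose the null model has a log-likelihood of −110.4 and an alternative with two additional parameters has −104.9 (both values invented for illustration). The test is then a one-line computation:

```python
from scipy.stats import chi2

# Hypothetical log-likelihoods from two nested fits (values assumed
# for illustration); the alternative model has 2 extra parameters.
ll_null, ll_alt = -110.4, -104.9
df = 2
D = -2 * (ll_null - ll_alt)   # D = -2 ln(L_null / L_alt)
p_value = chi2.sf(D, df)      # upper-tail chi-squared probability
print(f"D = {D:.2f}, p = {p_value:.4f}")
```

Here D is about 11.0 with p ≈ 0.004, so the two extra parameters significantly improve the fit and would be retained.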


Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC) compare candidate subsets of factors based on a tradeoff between lack of fit (measured by the model likelihood) and complexity (measured by the number of parameters included in the model). The AICc is a correction to the AIC for small sample sizes. The smaller the value of an information criterion, the better the model. A common rule of thumb is that if the difference in an information criterion between two models is less than 2, there is no preference for one model over the other. That is, the AIC and BIC are relative metrics, with which we compare the relative fit of models. Denoting p as the number of parameters in the model and n as the number of observations in the dataset, we can define these information criteria as follows:

\(\textrm{AIC} = -2\,\ln(\textrm{likelihood}) + 2p\), where p is the number of parameters in the model.

\(\textrm{BIC} = -2\,\ln(\textrm{likelihood}) + p\,\ln(n)\), where p is the number of parameters in the model and n is the number of observations in the dataset.

We can use these criteria in combination with a selection method to select an optimal model to characterize our data.
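A quick numeric sketch of the two formulas, with log-likelihood values invented for illustration:

```python
import math

def aic(loglik, p):
    """Akaike's Information Criterion: -2 ln(likelihood) + 2p."""
    return -2 * loglik + 2 * p

def bic(loglik, p, n):
    """Bayesian Information Criterion: -2 ln(likelihood) + p ln(n)."""
    return -2 * loglik + p * math.log(n)

# Two hypothetical candidate models fit to n = 50 observations
n = 50
print(aic(-120.0, 3), bic(-120.0, 3, n))   # simpler model, 3 parameters
print(aic(-118.6, 5), bic(-118.6, 5, n))   # more complex model, 5 parameters
```

Here the AIC values differ by 1.2, which is less than 2, so AIC expresses no preference; BIC's heavier per-parameter penalty (ln 50 ≈ 3.9 versus 2) favors the simpler model.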


Model Validation

Once a potential model has been selected, one should ensure that the model is valid. A graphical inspection of the residuals is useful for checking model assumptions, which typically include linearity, homoscedasticity (constant variance), independence, and normality. Suggested plots include residuals vs. predicted values, and a Q-Q plot of the residuals. Inspecting the residuals is a simple but effective way to evaluate whether there are parts of our model that are not adequately characterizing our data.
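As a sketch of these checks on simulated data, the snippet below fits a simple linear model and examines its residuals. The Shapiro–Wilk test stands in here for the graphical normality check; with matplotlib one would also plot residuals against fitted values and use `scipy.stats.probplot` for the Q-Q plot.

```python
import numpy as np
from scipy import stats

# Simulated data from a model that genuinely is linear with normal errors
rng = np.random.default_rng(2)
n = 150
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 0.8 * x + rng.normal(scale=0.4, size=n)

# Fit by ordinary least squares and compute residuals
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Shapiro-Wilk test of residual normality (a numeric stand-in for a Q-Q plot)
w, p = stats.shapiro(resid)
print(f"Shapiro-Wilk p = {p:.3f}")
```

A large p-value here is consistent with the normality assumption; a small one, or visible structure in the residual plots, would prompt a revised model.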

A final check of model validation is to compare the predictions output by the model to the raw data. Any disagreement between the data and the model does not necessarily invalidate the model, but we should ensure we understand and can explain why the differences exist.

Another important concept in model validation is cross-validation. With the use of model selection procedures, we run the risk of overfitting a model to our specific sample, yielding a model that may not generalize well to other samples. To address this concern, we may perform model selection on a certain percentage of our sample (say, 80%) and evaluate the performance of our model on the remaining 20%. These samples are frequently referred to as training and testing samples, respectively. If the model we select based on 80% of the sample generalizes well to the remaining 20% of the sample, we are more confident in our model.
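The 80/20 split described above can be sketched as follows (synthetic data; the variable names and split sizes are our own choices):

```python
import numpy as np

# Synthetic regression data with two informative predictors
rng = np.random.default_rng(3)
n = 250
x = rng.normal(size=(n, 2))
y = x @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=n)

# Shuffle indices, then hold out 20% as a testing sample
idx = rng.permutation(n)
train, test = idx[:200], idx[200:]

# Fit on the training sample only
Xtr = np.column_stack([np.ones(len(train)), x[train]])
Xte = np.column_stack([np.ones(len(test)), x[test]])
beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)

# Compare in-sample and out-of-sample error
mse_train = np.mean((y[train] - Xtr @ beta) ** 2)
mse_test = np.mean((y[test] - Xte @ beta) ** 2)
print(f"train MSE = {mse_train:.3f}, test MSE = {mse_test:.3f}")
```

A test-sample error close to the training error suggests the model generalizes; a much larger test error is the signature of overfitting.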


Final Thoughts

Ultimately, model selection is a complex process that requires careful consideration of multiple components. Though many model selection procedures can be implemented in an automated way (via your software program of choice), it is important to keep subject-matter knowledge in mind when considering candidate models. That is, in order to fit the best model, it is important to keep in mind what we know about our system under test. By doing so, we are able to rule out some models, or include others, that an automated method might have missed. For example, sometimes the p-value for a factor will not achieve statistical significance, but prior knowledge tells us that the factor is important and should therefore be kept in the model. Additionally, given the many options for model selection techniques and model evaluation criteria, two researchers using the same data may well end up choosing different models. This does not necessarily mean that one model is better than the other, but that each chosen model is the best one, by some method and criterion, among the subset of models that researcher fitted.

“All models are wrong, but some are useful.” – George Box

