Ann Transl Med. 2016 Apr; 4(7): 136.
Variable selection with stepwise and best subset approaches
Received 2015 Dec 25; Accepted 2016 Jan 24.
Abstract
While purposeful selection is performed partly by software and partly by hand, the stepwise and best subset approaches are performed automatically by software. Two R functions, stepAIC() and bestglm(), are well designed for stepwise and best subset regression, respectively. The stepAIC() function begins with a full or null model, and the method for stepwise regression can be specified in the direction argument with character values "forward", "backward" and "both". The bestglm() function begins with a data frame containing explanatory variables and the response variable. The response variable should be in the last column. A variety of goodness-of-fit criteria can be specified in the IC argument. The Bayesian information criterion (BIC) usually results in a more parsimonious model than the Akaike information criterion (AIC).
Keywords: Logistic regression, interaction, R, best subset, stepwise, Bayesian information criterion
Introduction
The previous article introduced purposeful selection for the regression model, which allows incorporation of clinical experience and/or subject-matter knowledge into statistical science. There are several other methods for variable selection, namely, the stepwise and best subsets regression. In stepwise regression, the selection procedure is performed automatically by statistical packages. The criteria for variable selection include adjusted R-square, Akaike information criterion (AIC), Bayesian information criterion (BIC), Mallows's Cp, PRESS, and false discovery rate (1,2). The principal approaches of stepwise selection are forward selection, backward elimination, and a combination of the two (3). The procedure has advantages when there are numerous potential explanatory variables, but it is also criticized as a paradigmatic example of data dredging, in that "significant" variables may be obtained from noise variables (4,5). Clinical experience and expertise are not incorporated into the model building process.
While stepwise regression selects variables sequentially, the best subsets approach aims to find the best-fitting model among all possible subset models (2). If there are p covariates, the number of all subsets is 2^p. There are also a variety of statistical methods to compare the fit of subset models. In this article, I will introduce how to perform stepwise and best subset selection by using R.
Working example
The working example used in this tutorial is from the package MASS. You can take a look at what each variable represents.
The bwt data frame contains 9 columns and 189 rows. The variable low is an indicator variable, with "0" indicating birth weight >2.5 kg and "1" indicating the presence of low birth weight. Age is the mother's age in years. The variable lwt is the mother's weight in pounds. Race is the mother's race and smoke is smoking status during pregnancy. The number of previous premature labors is ptl. Other information includes history of hypertension (ht), presence of uterine irritability (ui), and the number of physician visits during the first trimester (ftv).
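A minimal sketch of how the bwt data frame can be derived from the raw birthwt data, following the example on the MASS help page (the recodings shown here are assumptions based on that documentation):

library(MASS)
# recode the raw birthwt data into the bwt working data frame
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd  <- factor(ptl > 0)               # any previous premature labors
  ftv  <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"           # collapse 2 or more visits into "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})
str(bwt)   # 189 observations of 9 variables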
Stepwise selection
We can begin with the full model. The full model can be denoted by using the symbol "." on the right-hand side of the formula.
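A minimal sketch of fitting the full logistic regression model (the object name fullmod is illustrative):

# all other columns of bwt enter as predictors of low
fullmod <- glm(low ~ ., family = binomial, data = bwt)
summary(fullmod)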
As you can see in the output, all variables except low are included in the logistic regression model. The variables lwt, race, ptd and ht are found to be statistically significant at the conventional level. With the full model at hand, we can begin our stepwise selection procedure.
All arguments in the stepAIC() function are set to their defaults. If you want to set the direction of stepwise regression (e.g., backward, forward, both), the direction argument should be assigned. The default is "both".
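A sketch of the default call (stepAIC() is provided by the MASS package; step.both is an illustrative name):

library(MASS)                  # provides stepAIC()
step.both <- stepAIC(fullmod)  # direction = "both" is the default
summary(step.both)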
Because the forward stepwise regression here begins with the full model, there are no additional variables that can be added: the final model is the full model. Forward selection can instead begin with the null model (intercept-only model).
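A sketch of forward selection starting from the null model (the names nullmod and step.forward are illustrative; the scope runs from the intercept-only model up to all main effects):

nullmod <- glm(low ~ 1, family = binomial, data = bwt)  # intercept-only model
step.forward <- stepAIC(nullmod, direction = "forward",
                        scope = list(lower = ~ 1,
                                     upper = ~ age + lwt + race + smoke + ptd + ht + ui + ftv))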
The backward elimination procedure eliminated the variables ftv and age, which is exactly the same result as the "both" procedure.
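The corresponding sketch for backward elimination:

step.backward <- stepAIC(fullmod, direction = "backward")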
Different criteria can be assigned to the stepAIC() function for stepwise selection. The default is AIC, which is obtained by setting the argument k to 2 (the default choice).
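For example, the MASS documentation notes that setting k to log(n), where n is the number of observations, gives BIC-based selection; a sketch:

step.bic <- stepAIC(fullmod, k = log(nrow(bwt)))  # k = 2 (default) gives AIC; k = log(n) gives BIC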
The stepAIC() function also allows specification of the range of variables to be included in the model by using the scope argument. The lower model is the model with the smallest number of variables and the upper model is the largest possible model. Both the upper and lower components of scope can be explicitly specified. If scope is a single formula, it specifies the upper component, and the lower model is empty. If scope is missing, the initial model is used as the upper model.
When I specify the smallest model to include the age variable, it will not be excluded by stepwise regression (otherwise, age would be excluded, as shown above). This feature can help investigators keep variables that are considered relevant on the basis of subject-matter knowledge, as sketched below. Next, we can take a more complicated model for stepwise selection.
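A sketch of forcing age to stay in the model via the lower component of scope:

# age appears in the lower formula, so it cannot be dropped
step.age <- stepAIC(fullmod, scope = list(upper = ~ ., lower = ~ age))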
Recall that the "^" symbol denotes interactions up to a specified degree. In our case, we specified two-way interactions among all possible combinations of variables. Elements within I() are interpreted arithmetically. The function scale(age) centers the variable age at its mean and scales it by its standard deviation. "~ .^2 + I(scale(age)^2) + I(scale(lwt)^2)" is the scope argument, and a single formula implies the upper component. The results show that the interactions of age with ftv and of smoke with ui remain in the final model. Other interactions and quadratic terms are removed.
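A sketch of the corresponding call, with the quoted formula passed as a single-formula scope (i.e., the upper component):

step.int <- stepAIC(fullmod,
                    scope = ~ .^2 + I(scale(age)^2) + I(scale(lwt)^2))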
Best subset regression
Best subset regression selects the best model from all possible subsets according to some goodness-of-fit criterion. This approach has been in use for linear regression for several decades, based on the branch-and-bound algorithm (6). Later on, Lawless and Singhal proposed an extension that can be used for non-normal error models (7). The application of best subsets to the logistic regression model was described by Hosmer and coworkers (8). An R package called "bestglm" contains functions for performing best subset selection. The bestglm() function employs the simple exhaustive search algorithm described by Morgan and Tatar (9).
Xy is a data frame containing the independent variables and the response variable. For a logistic regression model, when family is set to binomial, the last column is the response variable. The column order of Xy is important because a formula specifying the response and independent variables is not allowed with the bestglm() function. We can move the response variable low to the last column and assign a new name to the new data frame, as sketched below. The IC argument specifies the information criterion to use. Its values can be "AIC", "BIC", "BICg", "BICq", "LOOCV" and "CV" (10).
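A sketch of the rearrangement (Xy is the naming convention used by bestglm()); note that the factors race and ftv still need to be converted before bestglm() can be called, as explained next:

library(bestglm)
# move the response variable low to the last column
Xy <- bwt[, c("age", "lwt", "race", "smoke", "ptd", "ht", "ui", "ftv", "low")]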
Furthermore, factors with more than two levels should be converted to dummy variables. Otherwise, bestglm() returns an error message.
To create dummy variables for factors with more than two levels, we use the dummies package. The dummy() function takes a single variable and returns a matrix with the number of rows equal to that of the given variable and the number of columns equal to the number of levels of that variable. Because only n-1 dummy variables are needed to define a factor with n levels, I remove the base level by simple manipulation of vectors. Finally, a new data frame containing the dummy variables is created, with the response variable in the last column, as sketched below.
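A sketch of the conversion, assuming the dummies package is installed (it has since been archived on CRAN; base R's model.matrix() offers an alternative). Here race and ftv each have three levels, so the first (base-level) column of each dummy matrix is dropped:

library(dummies)
race.d <- dummy("race", data = bwt)[, -1]  # drop the base level "white"
ftv.d  <- dummy("ftv",  data = bwt)[, -1]  # drop the base level "0"
Xy <- data.frame(age = bwt$age, lwt = bwt$lwt, race.d,
                 smoke = bwt$smoke, ptd = bwt$ptd, ht = bwt$ht,
                 ui = bwt$ui, ftv.d, low = bwt$low)  # response in the last column
best.bic <- bestglm(Xy, family = binomial, IC = "BIC")
best.bic$BestModel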
Model selection by AIC keeps more variables in the model, as follows.
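The same sketch with IC set to "AIC" for comparison:

best.aic <- bestglm(Xy, family = binomial, IC = "AIC")
best.aic$BestModel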
Readers can try the other options available in the bestglm() function. Different options may result in different models.
Summary
This article introduces variable selection with the stepwise and best subset approaches. Two R functions, stepAIC() and bestglm(), are well designed for these purposes. The stepAIC() function begins with a full or null model, and the method for stepwise regression can be specified in the direction argument with character values "forward", "backward" and "both". The bestglm() function begins with a data frame containing explanatory variables and the response variable. The response variable should be in the last column. Factor variables with more than two levels should be converted before running bestglm(). The dummies package contains a good function to convert a factor variable to dummy variables. There are a variety of information criteria for selection of the best subset model.
Biography
Author's introduction: Zhongheng Zhang, MMed. Department of Critical Care Medicine, Jinhua Municipal Central Hospital, Jinhua Hospital of Zhejiang University. Dr. Zhongheng Zhang is a fellow physician of the Jinhua Municipal Central Hospital. He graduated from the School of Medicine, Zhejiang University in 2009, receiving a Master's degree. He has published more than 35 academic papers (science citation indexed) that have been cited more than 200 times. He has been appointed as a reviewer for 10 journals, including Journal of Cardiovascular Medicine, Hemodialysis International, Journal of Translational Medicine, Critical Care, International Journal of Clinical Practice, and Journal of Critical Care. His major research interests include hemodynamic monitoring in sepsis and septic shock, delirium, and outcome studies for critically ill patients. He is experienced in data management and statistical analysis using R and STATA, big data exploration, systematic review and meta-analysis.
Footnotes
Conflicts of Interest: The author has no conflicts of interest to declare.
References
1. Hocking RR. A Biometrics invited paper. The analysis and selection of variables in linear regression. Biometrics 1976;32:1-49. 10.2307/2529336
2. Hosmer DW Jr, Lemeshow S, Sturdivant RX. Applied Logistic Regression. Hoboken: John Wiley & Sons, Inc, 2013.
3. Harrell FE. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. New York: Springer-Verlag New York, 2013.
4. Freedman DA. A note on screening regression equations. The American Statistician 1983;37:152-5.
5. Flack VF, Chang PC. Frequency of selecting noise variables in subset regression analysis: a simulation study. The American Statistician 1987;41:84-6.
6. Furnival GM, Wilson RW. Regressions by leaps and bounds. Technometrics 1974;16:499-511. 10.1080/00401706.1974.10489231
7. Lawless JF, Singhal K. ISMOD: an all-subsets regression program for generalized linear models. II. Program guide and examples. Comput Methods Programs Biomed 1987;24:125-34. 10.1016/0169-2607(87)90023-X
8. Hosmer DW, Jovanovic B, Lemeshow S. Best subsets logistic regression. Biometrics 1989;45:1265-70. 10.2307/2531779
9. Morgan JA, Tatar JF. Calculation of the residual sum of squares for all possible regressions. Technometrics 1972;14:317-25. 10.1080/00401706.1972.10488918