Stats #03: Practice Exercises
These exercises refer to two data sets:
- BF.SAV, a study of breast feeding in pre-term infants, which you have already read about extensively.
- HOUSING.SAV, a study of housing prices in Albuquerque, NM. You can find a good overview and a text version of this data set on the web at http://lib.stat.cmu.edu/DASL/Stories/homeprice.html.
You will find these data sets in the SPSS Program folder, located in the Classroom Examples folder.
1. In the breast feeding data set (BF.SAV), examine the relationship between the total number of apnea and bradycardia incidents (TOTAL_AB) and the age of the infant at discharge from the hospital (DC_AGE). Although a linear regression model is not ideal for this type of data, the analysis still yields some interesting and useful ideas.
- Draw a scatterplot with DC_AGE on the X axis and TOTAL_AB on the Y axis.
- Compute a linear regression model using DC_AGE as the independent variable and TOTAL_AB as the dependent variable. Interpret the slope and intercept terms for this model.
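As a reminder of what the slope and intercept represent, here is a minimal Python sketch of the least-squares formulas for a single predictor. The numbers are made up for illustration and are not taken from BF.SAV, and the function name fit_line is hypothetical. Your statistical package reports the same two quantities.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up values for illustration only (NOT from BF.SAV)
x = [30, 32, 34, 36, 38]   # hypothetical independent variable
y = [12, 9, 8, 5, 1]       # hypothetical dependent variable

intercept, slope = fit_line(x, y)
print("slope:", slope)          # estimated change in y for a one-unit increase in x
print("intercept:", intercept)  # estimated average y when x equals 0
```

The slope is the estimated change in the dependent variable for a one-unit increase in the independent variable. The intercept is the estimated average of the dependent variable when the independent variable equals zero, which is an extrapolation whenever zero lies outside the range of the data.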
2. Using the same data set, examine the relationship between TOTAL_AB and the treatment group variable (FEED_TYP).
- Draw a boxplot of TOTAL_AB with FEED_TYP as the category axis.
- Compute a linear regression model using TOTAL_AB as the dependent variable and FEED_TYP as the independent variable. Interpret the slope and intercept terms for this model.
3. In the housing data set (HOUSING.SAV), examine the relationship between the square footage of a house (SQFT) and the sales price of the house (PRICE).
- Draw a scatterplot with SQFT on the X axis and PRICE on the Y axis. Interpret this plot.
- Compute a linear regression model using SQFT as the independent variable and PRICE as the dependent variable. Interpret the slope and intercept terms for this model. Is the regression model consistent with your graph?
4. In the same data set, examine whether a custom built house (CUST: 1=Yes, 0=No) influences the price of a home.
- Draw a boxplot with PRICE as the variable and CUST as the category axis.
- Compute the means for custom built and non-custom built houses (select ANALYZE | COMPARE MEANS | MEANS from the menu. Place PRICE in the dependent list and CUST in the independent list).
- Compute a regression model with PRICE as the dependent variable and CUST as the covariate. Interpret this model.
- Compute a regression model with PRICE as the dependent variable and CUST as a fixed factor (not a covariate). How does this model differ from the previous one?
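When the only predictor is a 0/1 indicator, the regression coefficients can be read directly from the group means. The sketch below illustrates this in Python with made-up prices (not taken from HOUSING.SAV); the helper fit_line is hypothetical. It also shows what happens when the indicator is reversed so that the other group becomes the reference category.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up data for illustration only (NOT from HOUSING.SAV)
cust = [0, 0, 0, 0, 1, 1, 1]                # 1 = custom built, 0 = not
price = [80, 90, 100, 110, 130, 150, 170]   # hypothetical prices

# Group means
mean_no = sum(p for c, p in zip(cust, price) if c == 0) / cust.count(0)
mean_yes = sum(p for c, p in zip(cust, price) if c == 1) / cust.count(1)

# Indicator coded 1 = custom built: the group coded 0 is the reference
intercept, slope = fit_line(cust, price)

# Reversed indicator: the custom built group becomes the reference
reversed_cust = [1 - c for c in cust]
intercept_rev, slope_rev = fit_line(reversed_cust, price)

print(mean_no, mean_yes)
print(intercept, slope)
print(intercept_rev, slope_rev)
```

In this sketch the intercept equals the mean of the group coded 0 and the slope equals the difference between the two group means. Reversing the coding changes the intercept and flips the sign of the slope, but the fitted group means stay the same. Check your output to see which category your software treats as the reference.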
5. You are concerned that custom built houses are more expensive, not because they are custom built, but only because they are bigger.
- Draw a boxplot that shows whether custom built houses are bigger than other houses.
- Compute the mean square footage for custom built and non-custom built houses.
- Fit a regression model with PRICE as the dependent variable, CUST as the fixed factor, and SQFT as the covariate. Contrast this model to the model that uses SQFT alone to predict PRICE and to the model that uses CUST alone to predict PRICE.
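The concern in this exercise can be illustrated with a small Python sketch that fits a model with two predictors by solving the normal equations. The data are made up and deliberately constructed so that price depends on size only while the custom built houses happen to be larger; they are not taken from HOUSING.SAV. The helper names solve and ols are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols(columns, y):
    """Least-squares coefficients (intercept first) for the given predictor columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up data for illustration only (NOT from HOUSING.SAV)
cust = [0, 0, 0, 0, 1, 1, 1, 1]
sqft = [1.0, 1.2, 1.5, 1.8, 1.6, 2.0, 2.4, 2.8]   # in thousands of square feet
price = [40 + 50 * s for s in sqft]               # depends on size only, by construction

b_cust_only = ols([cust], price)        # [intercept, CUST coefficient]
b_both = ols([cust, sqft], price)       # [intercept, CUST coefficient, SQFT coefficient]

print("CUST alone:", b_cust_only[1])
print("CUST adjusted for SQFT:", b_both[1])
print("SQFT adjusted for CUST:", b_both[2])
```

In this constructed example the CUST coefficient is large when CUST is the only predictor and essentially zero once SQFT is in the model. Real data will rarely be this clean, so compare the coefficients across your three models rather than expecting the same pattern.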
6. Infants with low birth weights and early gestational ages tend to have more problems with apnea and bradycardia. Since birth weight and gestational age are so closely related, you are not sure how to separately account for the predictive ability of each variable.
- Draw two scatterplots. For both scatterplots, place the outcome variable TOTAL_AB on the Y axis. On the first scatterplot, use birth weight (BW) on the X axis. On the second scatterplot, use gestational age (GEST_AGE) on the X axis.
- Fit a regression model using TOTAL_AB as the dependent variable and BW as the independent variable. Interpret the slope and intercept from this model.
- Fit a regression model using TOTAL_AB as the dependent variable and GEST_AGE as the independent variable. Interpret the slope and intercept from this model.
- Fit a regression model using both BW and GEST_AGE as independent variables. Interpret the intercept and the two slope terms from this model.
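Before interpreting the model with both predictors, it can help to quantify how closely the two predictors are related. Here is a minimal Python sketch of the Pearson correlation. The values and their units (grams and weeks) are assumptions made for illustration and are not taken from BF.SAV; the function name pearson_r is hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Made-up values for illustration only (NOT from BF.SAV)
bw = [800, 950, 1100, 1300, 1500, 1700]   # hypothetical birth weights, assumed grams
gest_age = [26, 27, 28, 30, 31, 33]       # hypothetical gestational ages, assumed weeks

r = pearson_r(bw, gest_age)
print("correlation between predictors:", r)
```

When two predictors are this strongly correlated, each slope in the combined model describes the effect of one variable with the other held constant, and there may be little information in the data to separate the two.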
7. Examine the assumptions of the regression model for the housing data, where you used SQFT and CUST to predict PRICE.
- Compute residuals and predicted values from this model.
- Draw a scatterplot with the residuals on the Y axis and the predicted values on the X axis.
- Draw a normal probability plot for the residuals.
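The quantities behind these plots can be sketched in a few lines of Python. To keep the example short it uses a single predictor rather than the two-predictor housing model, and the data are made up (not taken from HOUSING.SAV). The plotting positions (i + 0.5) / n are one common convention and may differ from the one your software uses; the helper fit_line is hypothetical.

```python
from statistics import NormalDist


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up data for illustration only (NOT from HOUSING.SAV)
x = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
y = [62, 70, 95, 108, 131, 140]

intercept, slope = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Residual plot: residuals (Y axis) against predicted values (X axis)
residual_plot = list(zip(predicted, residuals))

# Normal probability plot: sorted residuals against standard normal quantiles
n = len(residuals)
quantiles = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
normal_plot = list(zip(quantiles, sorted(residuals)))

for q, r in normal_plot:
    print(round(q, 3), round(r, 3))
```

A residual plot with no trend and roughly constant spread, and a normal probability plot that is close to a straight line, are consistent with the assumptions of the model.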
8. There are additional residual plots that you can use to check if additional variables should be included in your regression model.
- Draw a scatterplot with the residuals on the Y axis and the number of features of the house (FEAT) on the X axis. A pattern such as a trend would indicate that FEAT provides additional predictive power above and beyond SQFT and CUST.
- Draw boxplots for the residuals with the corner lot variable (COR: 1=Yes, 2=No) as the category axis. If one boxplot is a lot higher/lower than the other, this indicates that COR provides additional predictive power above and beyond SQFT and CUST.
9. One possible violation of the assumptions of the linear regression model occurs when the variation in the dependent variable is related to one of the fixed factors or to one of the covariates. To check for this, draw scatterplots and/or boxplots of the residuals versus the factors and covariates. If the variation in one part of the graph is much different from the variation in another part, you should investigate further. Generally, you need to look for a very large discrepancy: variation that is 2 or 3 times larger/smaller. Discrepancies of this size warrant further investigation and possible use of more complex regression models.
- Draw a boxplot of the residuals with CUST as the category axis. Compare the size of the two boxplots. Is there a twofold or threefold difference in variation?
- Draw a scatterplot of the residuals on the Y axis and SQFT on the X axis. Compare the variation for small square footage houses to the variation for large square footage houses. Do the larger houses tend to have more variation?
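A numerical companion to the boxplot comparison is the ratio of the spread of the residuals in the two groups. The Python sketch below uses the standard deviation as the measure of variation, which is an assumption about how to quantify "2 or 3 times larger". The residuals are made up and are not taken from any model fit to HOUSING.SAV.

```python
from statistics import stdev

# Made-up residuals for illustration only (NOT from HOUSING.SAV)
resid_cust0 = [-5.0, 3.0, -2.0, 4.0, 0.0]      # hypothetical residuals, CUST = 0
resid_cust1 = [-12.0, 9.0, -7.0, 14.0, -4.0]   # hypothetical residuals, CUST = 1

sd0 = stdev(resid_cust0)
sd1 = stdev(resid_cust1)
ratio = max(sd0, sd1) / min(sd0, sd1)   # always at least 1

print("SD for CUST = 0:", sd0)
print("SD for CUST = 1:", sd1)
print("ratio of larger to smaller SD:", ratio)
```

In this constructed example the ratio is about 3, which is the size of discrepancy the exercise asks you to look for. With small groups the ratio is itself quite variable, so treat it as a prompt for further investigation rather than a formal test.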
10. For the breast feeding data, fit a regression model using TOTAL_AB as the dependent variable and DC_AGE as the independent variable. As noted earlier, linear regression is not an ideal procedure here.
- Compute the residuals and predicted values from this model.
- Draw two scatterplots, relating the residuals to first BW and then GEST_AGE. For both plots, place the residuals on the Y axis. Compare these to the scatterplots for #6. Why do we see a relationship when we use TOTAL_AB, but this relationship disappears when we use the residuals?
- Draw a scatterplot relating the residuals to DC_AGE. Does a linear function of DC_AGE appear sufficient, or is there an additional non-linear relationship that appears in the residual plot?