Today we will briefly review univariate regression and then discuss how to summarize and visualize uncertainty in regression models using a variety of plotting methods. We will then touch on how to estimate regression coefficients using matrix algebra. Lastly, we will introduce the General Linear Model (GLM) and demonstrate how it can be used to understand all of the statistical tests we have learned so far (t-tests, ANOVA, correlations, regressions) within one (beautiful!) unifying framework.
We’re going to work with a dataset about the relationship between personality and health. Let’s pretend we’ve collected data from 60 people (30 men and 30 women) on their self-reported conscientiousness (using the Big Five Inventory) and their self-rated physical health.
We run regression models in R using the lm() function, which is part of the pre-loaded {stats} library. There are basically two ingredients we pass to the lm() function:

1. The formula: specify your regression formula in the general form y ~ x.

2. The data: the data frame that contains the variables in the formula. This is technically optional.
\[Y_i = b_0 + b_1X_i + e_i\]
\[sr\_health_i = b_0 + b_1consc_i + e_i\]
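To see where the estimates in the table below come from, here is a minimal sketch of fitting this model in R (it assumes the health data frame described above is already loaded):

model <- lm(sr_health ~ consc, data = health) # fit the regression model
summary(model) # view the coefficients, standard errors, t-values, and p-values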
| coefficient | b | SE | t | p |
|---|---|---|---|---|
| (Intercept) | 1.66 | 0.36 | 4.64 | < .001 |
| consc | 0.49 | 0.12 | 4.13 | < .001 |
What do the intercept and slope mean? What do the t-values tell us?
The intercept is the expected value for self-rated health when conscientiousness is 0. The slope is the magnitude of the relationship between conscientiousness and self-rated health: for every 1-unit increase in conscientiousness, we expect a 0.49-unit increase in self-rated health. t-values are from a one-sample t-test assessing whether the slope and intercept are significantly different from 0. The t-values represent the ratio of signal to noise (i.e. each b divided by its standard error).
Our b's (intercept and slope) are estimates, based on our sample, of the true population parameters (the \(\beta\)'s). Remember that whenever we calculate an estimate of something, we should also determine how precise that estimate is. This is where standard errors and confidence intervals come in.
Recall the formula for calculating confidence intervals:
\[CI_b = b \pm critical\_value * SE_b\]
We can get confidence intervals around our regression coefficients using stats::confint(). This function takes a fitted model object as its first argument. By default, it will give you 95% CIs.

confint(model)
## 2.5 % 97.5 %
## (Intercept) 0.9422515 2.3716950
## consc 0.2526050 0.7282069
What does this 95% CI for the slope of conscientiousness mean in plain English?
If we were to repeat this experiment over and over again, sampling from the same population and computing a 95% CI around the slope each time, then 95% of those intervals would contain the true population slope (i.e., in about 19 out of every 20 replications the interval would capture the parameter). More generally, if we carried out random sampling from the population a large number of times and each time calculated 95% CIs around our coefficients (intercept and slope), then 95% of those intervals can be expected to contain the population parameters. In other words, we have good reason to believe the true population parameters (\(\beta_0\) and \(\beta_1\)) fall within these intervals, because intervals constructed this way contain the true parameters 95% of the time.
In addition to uncertainty in the coefficients, we can quantify uncertainty in the model's fitted (predicted) values. The standard errors of the fitted values can be obtained from broom::augment() (they are labeled .se.fit).

model %>% # start with our model object
  augment() %>% # from broom package; gives us fitted values, residuals, etc.
  select(sr_health, .fitted, .se.fit) # select relevant variables
We can plot the regression line together with its 95% confidence band using geom_smooth(method = "lm"). A regression line, by definition, corresponds to the line that gives the mean value of Y corresponding to each possible value of X, i.e. \(E(Y|X)\).
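Here is a minimal sketch of such a plot (assuming {ggplot2} is loaded and using the health data as elsewhere in this lab):

library(ggplot2) # if not already loaded

ggplot(health, aes(x = consc, y = sr_health)) +
  geom_point() + # raw data points
  geom_smooth(method = "lm") + # regression line with 95% confidence band
  labs(x = "Conscientiousness", y = "Self-rated health")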
In addition, we can also predict an individual's score (\(Y_i\)) for any value of X. From our regression model, we have the following equation that mathematically represents the relationship between conscientiousness and self-rated health:

\[\hat{health}_{i} = 1.6569733 + 0.4904059 * consc_{i}\]

For example, if we know that someone's conscientiousness score is 3.5, we can easily predict their score for self-rated health according to our model:

\[\hat{health} = 1.6569733 + 0.4904059*3.5 = 3.373\]
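We can double-check this hand calculation with predict() by passing a one-row data frame for the new value (the data.frame(consc = 3.5) below is just constructed for this example):

predict(model, newdata = data.frame(consc = 3.5)) # should return approximately 3.37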
The predict() function gives us an easy way to get the predicted Y values for all of the X values in our dataset.

predict(model)
## 1 2 3 4 5 6 7 8
## 2.961300 4.119826 3.543509 2.507581 3.350326 3.621532 3.038562 3.570843
## 9 10 11 12 13 14 15 16
## 3.178485 3.147345 2.758911 3.403290 2.987058 3.015841 3.216697 3.135321
## 17 18 19 20 21 22 23 24
## 3.056917 3.006047 3.015321 3.297619 2.458993 2.840238 3.082434 2.292952
## 25 26 27 28 29 30 31 32
## 3.103269 3.133964 2.989617 3.354057 2.882657 3.952820 2.891139 3.307783
## 33 34 35 36 37 38 39 40
## 2.745710 3.555768 3.643154 2.474954 2.778601 2.520440 2.935472 2.524356
## 41 42 43 44 45 46 47 48
## 2.467469 3.442900 4.079152 2.368996 3.882772 2.256636 3.239823 2.609330
## 49 50 51 52 53 54 55 56
## 2.309207 3.229900 2.795072 2.476516 3.459384 3.363102 3.536529 1.938202
## 57 58 59 60
## 2.698847 3.737070 3.129544 2.779391
Note that these are the same fitted values we can get from broom::augment().

augment(model)$.fitted
## [1] 2.961300 4.119826 3.543509 2.507581 3.350326 3.621532 3.038562 3.570843
## [9] 3.178485 3.147345 2.758911 3.403290 2.987058 3.015841 3.216697 3.135321
## [17] 3.056917 3.006047 3.015321 3.297619 2.458993 2.840238 3.082434 2.292952
## [25] 3.103269 3.133964 2.989617 3.354057 2.882657 3.952820 2.891139 3.307783
## [33] 2.745710 3.555768 3.643154 2.474954 2.778601 2.520440 2.935472 2.524356
## [41] 2.467469 3.442900 4.079152 2.368996 3.882772 2.256636 3.239823 2.609330
## [49] 2.309207 3.229900 2.795072 2.476516 3.459384 3.363102 3.536529 1.938202
## [57] 2.698847 3.737070 3.129544 2.779391
We can also get a prediction interval for each observation by setting interval = "prediction" inside predict().

predicted <- predict(model, interval = "prediction")
predicted
## Warning in predict.lm(model, interval = "prediction"): predictions on current data refer to _future_ responses
## fit lwr upr
## 1 2.961300 1.17371888 4.748882
## 2 4.119826 2.25947787 5.980173
## 3 3.543509 1.74074780 5.346271
## 4 2.507581 0.70106755 4.314095
## 5 3.350326 1.55750837 5.143143
## 6 3.621532 1.81339348 5.429671
## 7 3.038562 1.25152397 4.825601
## 8 3.570843 1.76628606 5.375400
## 9 3.178485 1.39043087 4.966539
## 10 3.147345 1.35973942 4.934950
## 11 2.758911 0.96619235 4.551629
## 12 3.403290 1.60822745 5.198353
## 13 2.987058 1.19974524 4.774372
## 14 3.015841 1.22872479 4.802958
## 15 3.216697 1.42791809 5.005476
## 16 3.135321 1.34785491 4.922787
## 17 3.056917 1.26989230 4.843942
## 18 3.006047 1.21887622 4.793219
## 19 3.015321 1.22820165 4.802440
## 20 3.297619 1.50667342 5.088564
## 21 2.458993 0.64887783 4.269108
## 22 2.840238 1.05022878 4.630247
## 23 3.082434 1.29535377 4.869513
## 24 2.292952 0.46828552 4.117619
## 25 3.103269 1.31608138 4.890458
## 26 3.133964 1.34651249 4.921416
## 27 2.989617 1.20232583 4.776908
## 28 3.354057 1.56109375 5.147021
## 29 2.882657 1.09371696 4.671597
## 30 3.952820 2.11333811 5.792301
## 31 2.891139 1.10238491 4.679893
## 32 3.307783 1.51650432 5.099061
## 33 2.745710 0.95247006 4.538949
## 34 3.555768 1.75221339 5.359323
## 35 3.643154 1.83338823 5.452920
## 36 2.474954 0.66605493 4.283853
## 37 2.778601 0.98661799 4.570584
## 38 2.520440 0.71482928 4.326051
## 39 2.935472 1.14753408 4.723410
## 40 2.524356 0.71901566 4.329695
## 41 2.467469 0.65800380 4.276934
## 42 3.442900 1.64591980 5.239880
## 43 4.079152 2.22418998 5.934114
## 44 2.368996 0.55142160 4.186570
## 45 3.882772 2.05104678 5.714497
## 46 2.256636 0.42832923 4.084942
## 47 3.239823 1.45051297 5.029134
## 48 2.609330 0.80938210 4.409277
## 49 2.309207 0.48611661 4.132297
## 50 3.229900 1.44082634 5.018974
## 51 2.795072 1.00366438 4.586479
## 52 2.476516 0.66773471 4.285298
## 53 3.459384 1.66154564 5.257222
## 54 3.363102 1.56977624 5.156427
## 55 3.536529 1.73421026 5.338847
## 56 1.938202 0.07115747 3.805246
## 57 2.698847 0.90357393 4.494119
## 58 3.737070 1.91955030 5.554589
## 59 3.129544 1.34213807 4.916950
## 60 2.779391 0.98743583 4.571345
pred_plot_data <- cbind(health, predicted) # combine the raw data with the fitted values and prediction interval bounds
pred_plot_data
And finally, we'll plot a prediction band on top of the data by adding a geom_ribbon(). This prediction band gives us, for every value of X, the interval that is likely to contain the value of a single new individual observation. Unlike confidence intervals, prediction intervals capture the spread of individual observations rather than the spread of the mean.

Fill in the blanks in the code to create a plot of the prediction band for our model. Pay special attention to lwr and upr within pred_plot_data above. How will you incorporate these variables into the plot?
Why is the prediction band wider than the confidence band around our regression line?
Because there is greater variation in predicting an individual value than in predicting a collection of individual values (i.e., the mean), the prediction band is wider.
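For reference, here is one possible version of the completed plot (a sketch rather than the exact code that was blanked out; it assumes {ggplot2} is loaded and uses the lwr and upr columns from pred_plot_data):

ggplot(pred_plot_data, aes(x = consc, y = sr_health)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2) + # prediction band
  geom_smooth(method = "lm") + # regression line with 95% confidence band
  geom_point() + # raw data points
  labs(x = "Conscientiousness", y = "Self-rated health")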
To recap: we can plug any X value (i.e. conscientiousness score) into our regression equation and predict the corresponding Y value (i.e. self-rated health score). And when I say any X value, I mean that it doesn't have to be an actual observation from our sample. We can also predict out-of-sample!

Below we read in a new set of X values (conscientiousness scores). Let's pretend that we collected this data from a completely new sample of people.

# read in new data
consc_data_new <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-3/data/consc_new.csv")
# let's view it
consc_data_new
Now we can predict Y values for these out-of-sample X values based on the coefficients from our regression model. We just have to tell the predict() function what our newdata is.

Fill in the blanks in the code below to make predictions about self-rated health values based on brand-new, never-before-seen conscientiousness scores.
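One possible answer (a sketch; it assumes the predictor column in consc_data_new is named consc, matching the variable name used to fit the model):

predict(model, newdata = consc_data_new) # predicted sr_health for the new conscientiousness scores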
We can also visualize the model using helper functions from other packages. One option is sjPlot::plot_model():
plot_model(model = model, # name of model object
type = "pred", # show predicted values (i.e. regression line)
show.data = TRUE, # include data points on plot
           jitter = TRUE) # add small amount of random variation to prevent overlap
Another option is ggpubr::ggscatter():
ggscatter(data = health, # name of data.frame
x = "consc", # IV (must be quoted)
y = "sr_health", # DV (must be quoted)
add = "reg.line", # add regression line
xlab = "Conscientiousness", # x-axis label
ylab = "Self-rated health", # y-axis label
conf.int = TRUE, # show 95% confidence band around regression line
cor.coef = TRUE) # display correlation coefficient and p-value
\[\mathbf{y} = b_0 + b_1\mathbf{x} + e\]
Recall that y is a vector of values, which can be represented as an \(n\times1\) matrix, \(\mathbf{Y}\). Similarly, x can be represented as an \(n\times1\) matrix, \(\mathbf{X}\).

If we augment the matrix \(\mathbf{X}\) to be an \(n\times2\) matrix, in which the first column is filled with 1's, we can simplify our regression equation (one property of the residuals is that the average residual is 0, so we can remove e from the equation as well):
\[\mathbf{Y} = \mathbf{XB}\]
Let’s illustrate how this works by running our regression analysis (self-rated health ~ conscientiousness) using matrix algebra.
Create the X matrix:
x_mat <- health %>% # start with the original data frame
mutate(ones = rep(1, nrow(.))) %>% # create a column of 1's to represent the intercept
select(ones, consc) %>% # select only the column of 1's and X variable
as.matrix() %>% # coerce to a matrix
unname() # get rid of dimnames
print(x_mat)
## [,1] [,2]
## [1,] 1 2.6596884
## [2,] 1 5.0220685
## [3,] 1 3.8468867
## [4,] 1 1.7344974
## [5,] 1 3.4529606
## [6,] 1 4.0059852
## [7,] 1 2.8172358
## [8,] 1 3.9026234
## [9,] 1 3.1025556
## [10,] 1 3.0390568
## [11,] 1 2.2469903
## [12,] 1 3.5609621
## [13,] 1 2.7122125
## [14,] 1 2.7709048
## [15,] 1 3.1804746
## [16,] 1 3.0145390
## [17,] 1 2.8546636
## [18,] 1 2.7509337
## [19,] 1 2.7698433
## [20,] 1 3.3454849
## [21,] 1 1.6354199
## [22,] 1 2.4128273
## [23,] 1 2.9066946
## [24,] 1 1.2968417
## [25,] 1 2.9491817
## [26,] 1 3.0117720
## [27,] 1 2.7174300
## [28,] 1 3.4605698
## [29,] 1 2.4993243
## [30,] 1 4.6815225
## [31,] 1 2.5166209
## [32,] 1 3.3662098
## [33,] 1 2.2200715
## [34,] 1 3.8718843
## [35,] 1 4.0500753
## [36,] 1 1.6679664
## [37,] 1 2.2871418
## [38,] 1 1.7607182
## [39,] 1 2.6070213
## [40,] 1 1.7687026
## [41,] 1 1.6527038
## [42,] 1 3.6417315
## [43,] 1 4.9391295
## [44,] 1 1.4519042
## [45,] 1 4.5386858
## [46,] 1 1.2227879
## [47,] 1 3.2276325
## [48,] 1 1.9419757
## [49,] 1 1.3299870
## [50,] 1 3.2073976
## [51,] 1 2.3207269
## [52,] 1 1.6711524
## [53,] 1 3.6753437
## [54,] 1 3.4790127
## [55,] 1 3.8326522
## [56,] 1 0.5734602
## [57,] 1 2.1245120
## [58,] 1 4.2415805
## [59,] 1 3.0027591
## [60,] 1 2.2887515
Create the Y matrix:

y_mat <- health %>% # start with the original data frame
select(sr_health) %>% # select just the Y variable
as.matrix() %>% # coerce to a matrix
unname() # get rid of dimnames
print(y_mat)
## [,1]
## [1,] 1.909508
## [2,] 4.493539
## [3,] 3.403894
## [4,] 2.045359
## [5,] 4.210038
## [6,] 5.111872
## [7,] 3.089556
## [8,] 4.575580
## [9,] 3.902935
## [10,] 2.637214
## [11,] 3.600213
## [12,] 2.941156
## [13,] 2.630240
## [14,] 3.696171
## [15,] 5.232125
## [16,] 3.083821
## [17,] 2.415286
## [18,] 4.424573
## [19,] 3.652712
## [20,] 2.914556
## [21,] 2.274012
## [22,] 2.275091
## [23,] 2.502707
## [24,] 1.651588
## [25,] 1.633890
## [26,] 2.171920
## [27,] 3.128065
## [28,] 3.782407
## [29,] 2.963693
## [30,] 3.728316
## [31,] 1.550077
## [32,] 4.927635
## [33,] 3.764443
## [34,] 4.631044
## [35,] 3.363912
## [36,] 2.860350
## [37,] 1.787703
## [38,] 1.575609
## [39,] 2.998598
## [40,] 4.292776
## [41,] 3.639892
## [42,] 2.314446
## [43,] 3.126662
## [44,] 1.658845
## [45,] 2.337766
## [46,] 1.944531
## [47,] 3.615128
## [48,] 1.902425
## [49,] 2.946371
## [50,] 2.833848
## [51,] 3.990127
## [52,] 1.139102
## [53,] 3.384039
## [54,] 3.460063
## [55,] 3.267850
## [56,] 2.623867
## [57,] 1.982783
## [58,] 3.415394
## [59,] 4.124655
## [60,] 1.658569
Solve for the B matrix using the matrix algebra solution for the regression coefficients:

\[\mathbf{B} = (\mathbf{X'X})^{-1} \mathbf{X'Y}\]
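Here is a minimal sketch of carrying out this calculation in R with the matrices created above: solve() returns the matrix inverse, t() the transpose, and %*% performs matrix multiplication.

b_mat <- solve(t(x_mat) %*% x_mat) %*% t(x_mat) %*% y_mat # (X'X)^-1 X'Y
b_mat # first element is the intercept, second is the slope; should match lm()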
We just saw that regression works “under the hood” by solving a matrix algebra equation to get the intercept and slope of the regression line. This matrix algebra equation works for any type of data we have. This should clue us into the idea that there is some fundamental process going on behind the scenes of all of our models…
The general linear model (GLM) is a family of models that assume the relationship between your DV and IV(s) is linear and additive, and that the errors are normally distributed. In its simplest form, we can think of the general linear model as:
\[Data = Model + Error \]
This provides a unifying framework for all of the statistical tests we have learned (or at least touched on) so far: t-tests, correlations, ANOVA, and linear regression. All of these tests are really doing the same math at the end of the day.
To illustrate this, we'll turn back to an example from PSY611. The data are from Fox and Guyer's (1978) anonymity and cooperation study. The data are included in the {carData} package, and you can see information about the data set using ?Guyer. Twenty groups of four participants each played 30 trials of the prisoner's dilemma game. The number of cooperative choices (cooperation) made by each group was scored out of 120 (i.e., cooperative choices made by 4 participants over 30 trials). The groups either made their decisions publicly or privately (condition).
You can load in the data with the following code:
guyer <- carData::Guyer
head(guyer)
str(guyer)
## 'data.frame': 20 obs. of 3 variables:
## $ cooperation: num 49 64 37 52 68 54 61 79 64 29 ...
## $ condition : Factor w/ 2 levels "anonymous","public": 2 2 2 2 2 2 2 2 2 2 ...
## $ sex : Factor w/ 2 levels "female","male": 2 2 2 2 2 1 1 1 1 1 ...
Next, we'll recode the condition variable to 0's and 1's. We will talk more about dummy coding in a future lab.

guyer <- guyer %>%
mutate(condition = case_when(condition == "public" ~ 0,
condition == "anonymous" ~ 1))
guyer
First, run an independent samples t-test comparing cooperation scores in the public and anonymous conditions. Assume the homogeneity of variance assumption has been met.

What is the 95% CI around?

The difference in mean cooperation scores for the public vs. anonymous conditions.

Next, calculate the correlation between cooperation and condition.

What is the 95% CI around this time?

The correlation coefficient between cooperation and condition.

Next, run a one-way ANOVA comparing cooperation across the public and anonymous conditions.

What is the relationship between the F statistic from this ANOVA and the t statistic from our previous two tests?

t is the square root of F.

Finally, run a linear regression to see how cooperation varies as a function of condition.

What does the intercept represent? The slope?

The intercept is the mean cooperation score for the condition that was coded as 0 (i.e. public). The slope is the difference in means between the two conditions.
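For reference, here is a sketch of how all four analyses could be run on the recoded guyer data (the calls are illustrative, not necessarily the exact code used in lab; with condition coded 0/1, each one tests the same effect):

t.test(cooperation ~ condition, data = guyer, var.equal = TRUE) # independent samples t-test
cor.test(~ cooperation + condition, data = guyer) # correlation
summary(aov(cooperation ~ condition, data = guyer)) # one-way ANOVA
summary(lm(cooperation ~ condition, data = guyer)) # linear regression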
Continuing with the guyer dataset example, use matrix algebra to manually calculate the intercept and slope for the regression equation (cooperation ~ condition).
Hint: To find the inverse of a matrix, use the solve()
function. Refer back to the lab on matrix algebra from 611 for a refresher on other matrix operations if needed.
Check your work against the output of the lm() function in R.

Earlier, we used confint() to calculate the 95% CI for our regression coefficients. Your job is to start with the model output, stored as a list object, and extract the relevant pieces of information to calculate the 95% CI around the intercept and slope. (Hint: use qt() to draw from a t distribution.)

model <- lm(sr_health ~ consc, data = health)
model_summary <- summary(model)
model_summary
##
## Call:
## lm(formula = sr_health ~ consc, data = health)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5450 -0.6414 -0.1623 0.6817 2.0154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.6570 0.3571 4.641 0.0000203 ***
## consc 0.4904 0.1188 4.128 0.000119 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8854 on 58 degrees of freedom
## Multiple R-squared: 0.2271, Adjusted R-squared: 0.2138
## F-statistic: 17.04 on 1 and 58 DF, p-value: 0.0001186
Check your work using confint().