Today we’ll be going over the basics of univariate regression in R, including reporting regression results in Tables.
To review.
Today, we’ll just be focusing on univariate regression with a single continuous predictor, but over the weeks we will build up into much more complicated regression equations.
To quickly navigate to the desired section, click one of the following links:
library(tidyverse) # for plotting and data wrangling
library(rio) # for importing data
library(psych) # for covariance and correlation functions
library(apaTables) # for correlation tables
library(pwr) # for power calculation
library(broom) # for cleaning up output
library(stargazer)
library(sjPlot)
Even in simple univariate regression, we simultaneously estimate several values at once. Remember, one of our primary goals is estimate \(Y\) values with our regression model:
\[Y_i = b_0 + b_1X_i + e_i\]
Alternatively:
\[\hat{Y_i} = b_0 + b_1X_i\]
And sometimes we use greek letters for the model parameters when we’re referring to the (generally hypothetical) population parameters:
\[Y_i = \beta_0 + \beta_1X_i + \epsilon_i\] As Sara mentioned, it is common to use b for unstandardized slopes and \(\beta\) for standardized slopes within psychology.
Without further ado, let’s go ahead and estimate a regression equation in R.
Conducting regressions in R is actually pretty simple. We use the lm()
function which is part of the pre-loaded {stats} library. There are basically two ingredients we pass to the lm()
function
The formula: Specify your regression formula in the general form y ~ x
.
The data: the dataframe that contains the variables in the formula. This is technically optional.
Let’s take a look. First we’ll import the same data that we used last week:
health <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-1/data/consc_health.csv")
Next we run the model, using lm()
, setting the formula and data arguments. We want to see to what extent self-reported conscientiousness relates to self-reported health.
model <- lm(formula = sr_health ~ consc, # remember, outcome ~ predictors
data = health) # enter the dataframe
More typically, people omit the formula =
:
model <- lm(sr_health ~ consc, # remember, outcome ~ predictors
data = health)
Now our regression model objct, called model
above, is a list
object with various useful information. Let’s take a look with the str()
function:
str(model)
## List of 12
## $ coefficients : Named num [1:2] 1.66 0.49
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "consc"
## $ residuals : Named num [1:60] -1.052 0.374 -0.14 -0.462 0.86 ...
## ..- attr(*, "names")= chr [1:60] "1" "2" "3" "4" ...
## $ effects : Named num [1:60] -23.6511 -3.655 -0.0716 -0.2861 0.9479 ...
## ..- attr(*, "names")= chr [1:60] "(Intercept)" "consc" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:60] 2.96 4.12 3.54 2.51 3.35 ...
## ..- attr(*, "names")= chr [1:60] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:60, 1:2] -7.746 0.129 0.129 0.129 0.129 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:60] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "consc"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.13 1.29
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 58
## $ xlevels : Named list()
## $ call : language lm(formula = sr_health ~ consc, data = health)
## $ terms :Classes 'terms', 'formula' language sr_health ~ consc
## .. ..- attr(*, "variables")= language list(sr_health, consc)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "sr_health" "consc"
## .. .. .. ..$ : chr "consc"
## .. ..- attr(*, "term.labels")= chr "consc"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(sr_health, consc)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "sr_health" "consc"
## $ model :'data.frame': 60 obs. of 2 variables:
## ..$ sr_health: num [1:60] 1.91 4.49 3.4 2.05 4.21 ...
## ..$ consc : num [1:60] 2.66 5.02 3.85 1.73 3.45 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language sr_health ~ consc
## .. .. ..- attr(*, "variables")= language list(sr_health, consc)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "sr_health" "consc"
## .. .. .. .. ..$ : chr "consc"
## .. .. ..- attr(*, "term.labels")= chr "consc"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(sr_health, consc)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "sr_health" "consc"
## - attr(*, "class")= chr "lm"
And we can extract elements from this list like any other list in R, by using LIST$ELEMENT
or LIST[["ELEMENT"]]
. There are also specific functions for extracting those elements. Let’s start by getting the predicted values of Y.
We can get predicted values from model
by subsetting our list with $
:
model$fitted.values
## 1 2 3 4 5 6 7 8
## 2.961300 4.119826 3.543509 2.507581 3.350326 3.621532 3.038562 3.570843
## 9 10 11 12 13 14 15 16
## 3.178485 3.147345 2.758911 3.403290 2.987058 3.015841 3.216697 3.135321
## 17 18 19 20 21 22 23 24
## 3.056917 3.006047 3.015321 3.297619 2.458993 2.840238 3.082434 2.292952
## 25 26 27 28 29 30 31 32
## 3.103269 3.133964 2.989617 3.354057 2.882657 3.952820 2.891139 3.307783
## 33 34 35 36 37 38 39 40
## 2.745710 3.555768 3.643154 2.474954 2.778601 2.520440 2.935472 2.524356
## 41 42 43 44 45 46 47 48
## 2.467469 3.442900 4.079152 2.368996 3.882772 2.256636 3.239823 2.609330
## 49 50 51 52 53 54 55 56
## 2.309207 3.229900 2.795072 2.476516 3.459384 3.363102 3.536529 1.938202
## 57 58 59 60
## 2.698847 3.737070 3.129544 2.779391
We can also use double-brackets:
model[["fitted.values"]]
## 1 2 3 4 5 6 7 8
## 2.961300 4.119826 3.543509 2.507581 3.350326 3.621532 3.038562 3.570843
## 9 10 11 12 13 14 15 16
## 3.178485 3.147345 2.758911 3.403290 2.987058 3.015841 3.216697 3.135321
## 17 18 19 20 21 22 23 24
## 3.056917 3.006047 3.015321 3.297619 2.458993 2.840238 3.082434 2.292952
## 25 26 27 28 29 30 31 32
## 3.103269 3.133964 2.989617 3.354057 2.882657 3.952820 2.891139 3.307783
## 33 34 35 36 37 38 39 40
## 2.745710 3.555768 3.643154 2.474954 2.778601 2.520440 2.935472 2.524356
## 41 42 43 44 45 46 47 48
## 2.467469 3.442900 4.079152 2.368996 3.882772 2.256636 3.239823 2.609330
## 49 50 51 52 53 54 55 56
## 2.309207 3.229900 2.795072 2.476516 3.459384 3.363102 3.536529 1.938202
## 57 58 59 60
## 2.698847 3.737070 3.129544 2.779391
fitted.values()
functionOr the fitted.values
function:
fitted.values(model)
## 1 2 3 4 5 6 7 8
## 2.961300 4.119826 3.543509 2.507581 3.350326 3.621532 3.038562 3.570843
## 9 10 11 12 13 14 15 16
## 3.178485 3.147345 2.758911 3.403290 2.987058 3.015841 3.216697 3.135321
## 17 18 19 20 21 22 23 24
## 3.056917 3.006047 3.015321 3.297619 2.458993 2.840238 3.082434 2.292952
## 25 26 27 28 29 30 31 32
## 3.103269 3.133964 2.989617 3.354057 2.882657 3.952820 2.891139 3.307783
## 33 34 35 36 37 38 39 40
## 2.745710 3.555768 3.643154 2.474954 2.778601 2.520440 2.935472 2.524356
## 41 42 43 44 45 46 47 48
## 2.467469 3.442900 4.079152 2.368996 3.882772 2.256636 3.239823 2.609330
## 49 50 51 52 53 54 55 56
## 2.309207 3.229900 2.795072 2.476516 3.459384 3.363102 3.536529 1.938202
## 57 58 59 60
## 2.698847 3.737070 3.129544 2.779391
We can also extract our residuals or the deviation of each predicted score \(\hat{Y_i}\) from the observed scores \(Y_i\). That too can be done using list subetting:
model$residuals
## 1 2 3 4 5 6
## -1.05179179 0.37371307 -0.13961566 -0.46222179 0.85971248 1.49034005
## 7 8 9 10 11 12
## 0.05099371 1.00473746 0.72444971 -0.51013035 0.84130202 -0.46213405
## 13 14 15 16 17 18
## -0.35681873 0.68032922 2.01542826 -0.05149984 -0.64163120 1.41852518
## 19 20 21 22 23 24
## 0.63739066 -0.38306263 -0.18498050 -0.56514672 -0.57972677 -0.64136398
## 25 26 27 28 29 30
## -1.46937901 -0.96204444 0.13844794 0.42834928 0.08103582 -0.22450409
## 31 32 33 34 35 36
## -1.34106226 1.61985264 1.01873328 1.07527601 -0.27924189 0.38539640
## 37 38 39 40 41 42
## -0.99089855 -0.94483134 0.06312578 1.76842089 1.17242288 -1.12845379
## 43 44 45 46 47 48
## -0.95248954 -0.71015044 -1.54500530 -0.31210445 0.37530444 -0.70690430
## 49 50 51 52 53 54
## 0.63716375 -0.39605196 1.19505547 -1.33741467 -0.07534450 0.09696171
## 55 56 57 58 59 60
## -0.26867918 0.68566564 -0.71606326 -0.32167592 0.99511089 -1.12082173
residuals()
functionOr we could use the residuals()
function:
residuals(model)
## 1 2 3 4 5 6
## -1.05179179 0.37371307 -0.13961566 -0.46222179 0.85971248 1.49034005
## 7 8 9 10 11 12
## 0.05099371 1.00473746 0.72444971 -0.51013035 0.84130202 -0.46213405
## 13 14 15 16 17 18
## -0.35681873 0.68032922 2.01542826 -0.05149984 -0.64163120 1.41852518
## 19 20 21 22 23 24
## 0.63739066 -0.38306263 -0.18498050 -0.56514672 -0.57972677 -0.64136398
## 25 26 27 28 29 30
## -1.46937901 -0.96204444 0.13844794 0.42834928 0.08103582 -0.22450409
## 31 32 33 34 35 36
## -1.34106226 1.61985264 1.01873328 1.07527601 -0.27924189 0.38539640
## 37 38 39 40 41 42
## -0.99089855 -0.94483134 0.06312578 1.76842089 1.17242288 -1.12845379
## 43 44 45 46 47 48
## -0.95248954 -0.71015044 -1.54500530 -0.31210445 0.37530444 -0.70690430
## 49 50 51 52 53 54
## 0.63716375 -0.39605196 1.19505547 -1.33741467 -0.07534450 0.09696171
## 55 56 57 58 59 60
## -0.26867918 0.68566564 -0.71606326 -0.32167592 0.99511089 -1.12082173
And, we could reproduce those values ourselves by subtracting the predicted values we just obtained from the observed Y values from our data
health$sr_health - fitted.values(model)
## 1 2 3 4 5 6
## -1.05179179 0.37371307 -0.13961566 -0.46222179 0.85971248 1.49034005
## 7 8 9 10 11 12
## 0.05099371 1.00473746 0.72444971 -0.51013035 0.84130202 -0.46213405
## 13 14 15 16 17 18
## -0.35681873 0.68032922 2.01542826 -0.05149984 -0.64163120 1.41852518
## 19 20 21 22 23 24
## 0.63739066 -0.38306263 -0.18498050 -0.56514672 -0.57972677 -0.64136398
## 25 26 27 28 29 30
## -1.46937901 -0.96204444 0.13844794 0.42834928 0.08103582 -0.22450409
## 31 32 33 34 35 36
## -1.34106226 1.61985264 1.01873328 1.07527601 -0.27924189 0.38539640
## 37 38 39 40 41 42
## -0.99089855 -0.94483134 0.06312578 1.76842089 1.17242288 -1.12845379
## 43 44 45 46 47 48
## -0.95248954 -0.71015044 -1.54500530 -0.31210445 0.37530444 -0.70690430
## 49 50 51 52 53 54
## 0.63716375 -0.39605196 1.19505547 -1.33741467 -0.07534450 0.09696171
## 55 56 57 58 59 60
## -0.26867918 0.68566564 -0.71606326 -0.32167592 0.99511089 -1.12082173
Let’s make sure those are all the same:
round(health$sr_health - fitted.values(model), 8) == round(residuals(model), 8)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 49 50 51 52 53 54 55 56 57 58 59 60
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
With this information, we could calculate the sum of squared residuals:
\[\Sigma(Y_i - \hat{Y_i})^2\]
sum((health$sr_health - fitted.values(model))^2)
## [1] 45.46788
This of course is one of the building blocks for estimating the standard error of the estimate:
\[\sqrt{\frac{\Sigma(Y_i - \hat{Y_i})^2}{N-2}}\]
Let’s go ahead and calculate that too:
sqrt(sum((health$sr_health - fitted.values(model))^2)/(nrow(health)-2))
## [1] 0.8853976
We can also get the standard error more directly. However, it is not stored in the model list. We have to use the summary()
function, which gives us some additional useful information about our model. Let’s take a look at the structure of it:
str(summary(model))
## List of 11
## $ call : language lm(formula = sr_health ~ consc, data = health)
## $ terms :Classes 'terms', 'formula' language sr_health ~ consc
## .. ..- attr(*, "variables")= language list(sr_health, consc)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "sr_health" "consc"
## .. .. .. ..$ : chr "consc"
## .. ..- attr(*, "term.labels")= chr "consc"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(sr_health, consc)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "sr_health" "consc"
## $ residuals : Named num [1:60] -1.052 0.374 -0.14 -0.462 0.86 ...
## ..- attr(*, "names")= chr [1:60] "1" "2" "3" "4" ...
## $ coefficients : num [1:2, 1:4] 1.657 0.49 0.357 0.119 4.641 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "(Intercept)" "consc"
## .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
## $ aliased : Named logi [1:2] FALSE FALSE
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "consc"
## $ sigma : num 0.885
## $ df : int [1:3] 2 58 2
## $ r.squared : num 0.227
## $ adj.r.squared: num 0.214
## $ fstatistic : Named num [1:3] 17 1 58
## ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
## $ cov.unscaled : num [1:2, 1:2] 0.1626 -0.0513 -0.0513 0.018
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "(Intercept)" "consc"
## .. ..$ : chr [1:2] "(Intercept)" "consc"
## - attr(*, "class")= chr "summary.lm"
Remember, R calls the standard error of the estimate sigma
, so we can get it out of the new summary object using list subsetting:
summary(model)$sigma
## [1] 0.8853976
And we can see that it is equivalent to what we calculated above:
sqrt(sum((health$sr_health - fitted.values(model))^2)/(nrow(health)-2)) == summary(model)$sigma
## [1] TRUE
Question: What does a standard error of the estimate of 0.89 mean? Is that good?
Recall that another way we can evaluate our regression model is by extracting the coefficient of determination, \(R^2\). This represents the proportion of variance explained by the model. Like the standard error of the estimate, we need to use the summary function to get it:
summary(model)$r.squared
## [1] 0.2270871
Question: What does an \(R^2\) of 0.23 mean (in plain english)?
Recall that we also get estimates for the individual regression coefficients, \(b_0\) and \(b_1\) in the case of univariate regression. Like many of the other components we’ve covered, you can extract the coefficients with list subsetting:
model$coefficients
## (Intercept) consc
## 1.6569733 0.4904059
coefficients()
functionOr we could get the same information with the coefficients()
function:
coefficients(model)
## (Intercept) consc
## 1.6569733 0.4904059
Question: What does the intercept mean?
Question: What about the slope for conscientiousness?
You probably recall that these are called the unstandardized coefficients. We can also get the standardized coefficients, but that is a little trickier. Before I show you how to do that:
Question: How does a standardized coefficient differ from an unstandardized coefficient?
Standardized regression coefficients, often notated as \(\beta\), are just the regression coefficients after the variables have been standardized or Z-scored. To obtain them, we need to z-score our data with scale()
before we run the lm()
function. One really cool thing is that we can do it in the lm()
call:
std_model <- lm(scale(sr_health) ~ scale(consc), data = health)
coefficients(std_model) %>%
round(3)
## (Intercept) scale(consc)
## 0.000 0.477
Question: What does the standardized slope for conscientiousness mean?
Question: Why is the intercept zero?
So far, we’ve been covering how to estimate the various regression components in R and how to extract those components from our model object. However, within the NHST tradition, we also typically want significance tests. With regression, we have several significance tests simultaneously.
First, we have the omnibus test, which tests whether or not the regression model significantly outperforms the NULL model. This is typically done with an F ratio:
\[F = \frac{MS_{model}}{MS_{residual}}\]
That is also stored in the summary of our model:
summary(model)$fstatistic
## value numdf dendf
## 17.0408 1.0000 58.0000
And we can use these numbers to look up its p value:
pf(summary(model)$fstatistic[1],
summary(model)$fstatistic[2],
summary(model)$fstatistic[3],
lower.tail = FALSE)
## value
## 0.0001186082
Alternatively, we can pass our model to the anova()
function, which gives us an F table or ANOVA table for our regression model:
anova(model)
## Analysis of Variance Table
##
## Response: sr_health
## Df Sum Sq Mean Sq F value Pr(>F)
## consc 1 13.359 13.3588 17.041 0.0001186 ***
## Residuals 58 45.468 0.7839
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Question: Is the F significant? What does this mean?
In addition to the omnibus test, we get a significance test for each of our model’s coefficients.
Recall that for each coefficient, we get a t test from the formula:
\[\begin{align} t = \frac{b_1}{se_b}\\ \\ \\ \\ se_b = \frac{s_Y}{s_X}\sqrt{\frac{1 - r^2_{XY}}{n-2}} \end{align}\]
This t is distributed with \(df = n - 2\).
We can get these from the summary of our model object, by extracting the coefficients from the summary.
summary(model)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.6569733 0.3570543 4.640676 0.0000202958
## consc 0.4904059 0.1187984 4.128050 0.0001186082
Question: Is the test of the intercept significant? What does this mean?
Question: Is the test of the slope significant? What does this mean?
Also, recall that in the case of simple univariate regression, the omnibus F is equivalent to the t for the slope squared:
summary(model)$fstatistic[[1]]
summary(model)$coefficients[2,3]^2
## [1] 17.0408
## [1] 17.0408
Summary()
on its ownFinally, all of this information is displayed in when we just run summary()
:
summary(model)
##
## Call:
## lm(formula = sr_health ~ consc, data = health)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5450 -0.6414 -0.1623 0.6817 2.0154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.6570 0.3571 4.641 2.03e-05 ***
## consc 0.4904 0.1188 4.128 0.000119 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8854 on 58 degrees of freedom
## Multiple R-squared: 0.2271, Adjusted R-squared: 0.2138
## F-statistic: 17.04 on 1 and 58 DF, p-value: 0.0001186
You may have noticed at this point that working with lists has its challenges. Even just extracting the information we’ve extracted so far has some pretty messy code. There must be a better (tidier) way!
Thankfully, there is. This requires the {broom} library, which might be new. It is a package for tidying model results objects. It’s pretty easy to use - you just pass the model object to a function from {broom} called tidy
. There are some more advanced things you can do, but just tidy(model_obj)
(where model_obj is the name of the model) works for most purposes. And, one really nice thing about broom
is that it works with a lot of different types of models, so this will continue to work as we move to other techniques (e.g., multi-level model with lme4
).
Let’s see what happens when we tidy our model:
tidy(model)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.66 0.357 4.64 0.0000203
## 2 consc 0.490 0.119 4.13 0.000119
You can see it produces a dataframe containing the model coefficients and their significance tests.
{broom} also has a function called glance()
that provides some of the omnibus model information we might want:
glance(model)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.227 0.214 0.885 17.0 1.19e-4 2 -76.8 160. 166.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
augment
provides more information from our model:
augment(model)
## # A tibble: 60 x 9
## sr_health consc .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.91 2.66 2.96 0.116 -1.05 0.0173 0.882 0.0126 -1.20
## 2 4.49 5.02 4.12 0.283 0.374 0.102 0.892 0.0112 0.445
## 3 3.40 3.85 3.54 0.165 -0.140 0.0347 0.893 0.000462 -0.160
## 4 2.05 1.73 2.51 0.175 -0.462 0.0390 0.891 0.00575 -0.533
## 5 4.21 3.45 3.35 0.135 0.860 0.0233 0.886 0.0115 0.982
## 6 5.11 4.01 3.62 0.179 1.49 0.0408 0.870 0.0629 1.72
## 7 3.09 2.82 3.04 0.114 0.0510 0.0167 0.893 0.0000286 0.0581
## 8 4.58 3.90 3.57 0.170 1.00 0.0367 0.883 0.0255 1.16
## 9 3.90 3.10 3.18 0.118 0.724 0.0178 0.888 0.00619 0.826
## 10 2.64 3.04 3.15 0.117 -0.510 0.0173 0.891 0.00298 -0.581
## # … with 50 more rows
Using augment, we get the DV, IV, fitted values, residuals, and other diagnostic information we’ll learn about in future weeks.
So with broom, just remember:
1. We tidy
up coefficients.
2. We glance
at (omnibus) model statistics.
3. We augment
to find everything else.
The payoff for these functions comes when you want to do something with some of your regression results. As an example, you could use tidy()
to make it easier to make a plot of your regression coefficients:
tidy(model) %>%
ggplot(aes(x = term, y = estimate)) +
geom_point()+
geom_linerange(aes(ymin = estimate - std.error,
ymax = estimate + std.error))
The last thing we’ll cover today is how to report the results of your regression in Tables. We’ll cover three different methods, which each have their pro’s and con’s.
Our first option would be to make a table ‘by hand’ using a combination of tidy()
and the kable
function from {knitr}.
tidy(model) %>% # first run tidy on the model.
# Then pipe it to knitr's kable function
knitr::kable(digits = c(NA, 2, 2, 2, 3), # set digits; everything rounded to
# 2 except the labels (NA) and p's (3 digits)
caption = "Results of Regressing Self-Reported Health on Conscientiousness") # provide a table caption
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 1.66 | 0.36 | 4.64 | 0 |
consc | 0.49 | 0.12 | 4.13 | 0 |
We could clean things up a bit more by changing the names and re-formatting that pesky p value column:
tidy(model) %>% # we'll still run tidy on the model first
# but we'll pass it to rename (part of the tidyverse/dplyr)
# and rename some of the columns to be more similar to how
# we normally report these things.
# rename is pretty easy, it's a way to rename column names
# the general format is rename(new_name = old_name),
# where new_name is the new name you want the column to have
# and old_name is the old name that you're replacing.
rename(coefficient = term,
b = estimate,
SE = std.error,
t = statistic,
p = p.value) %>%
# Then we can mutate the p column to deal with the
# triple zeroes
mutate(p = ifelse(p > .001, round(p, 3), "< .001")) %>%
knitr::kable(digits = c(NA, 2, 2, 2, 3), # Then we'll do the same as above with kable
caption = "Results of Regressing Self-Reported Health on Conscientiousness")
coefficient | b | SE | t | p |
---|---|---|---|---|
(Intercept) | 1.66 | 0.36 | 4.64 | < .001 |
consc | 0.49 | 0.12 | 4.13 | < .001 |
This method is nice for two reasons:
1. You have a lot of control over how things look.
2. It’s pretty general-purpose, and you can easily adapt it to new things you learn how to do in R.
However, it has a downside in that it is hard to get this into picture perfect APA format (we didn’t get all the way there above) and so you may have to do some editing once you get it into word.
The {stargazer} library can be used to pretty easily get APA formatted tables from many kinds of results or model object. It’s super easy to use:
stargazer(model, # give it the model
type = "html", # tell it the format type.
#I'm using html since this is an html doc
out = "./tbl/reg_tbl_sg.html") # you can give it a path to save
Dependent variable: | |
sr_health | |
consc | 0.490*** |
(0.119) | |
Constant | 1.657*** |
(0.357) | |
Observations | 60 |
R2 | 0.227 |
Adjusted R2 | 0.214 |
Residual Std. Error | 0.885 (df = 58) |
F Statistic | 17.041*** (df = 1; 58) |
Note: | p<0.1; p<0.05; p<0.01 |
This is nice in that it is super easy, and puts it into APA format for you. It’s downsides are that it is less flexible and stargazer may not have a method for your results objct if you’re using more advanced or cutting edge tools.
sjPlot
A third option is to use tab_model()
from the {sjPlot} library. This one does not always work well within the rMarkdown document, but it is very easy to export these tables to word. we can do so by setting the file extension of our output to .doc
:
sjPlot::tab_model(model, file = "./tbl/reg_tbl_sjp.doc")
sr health | |||
---|---|---|---|
Predictors | Estimates | CI | p |
(Intercept) | 1.66 | 0.94 – 2.37 | <0.001 |
consc | 0.49 | 0.25 – 0.73 | <0.001 |
Observations | 60 | ||
R2 / R2 adjusted | 0.227 / 0.214 |
This has similar cons to stargazer, and additionally doesn’t always play nicely with rmarkdown. However, it’s super easy to use and even easier than stargazer for exporting tables to MS word.
For this minihack, you’ll demonstrate the equivalence of a correlation between two variables and the standardized coefficient from a univariate regression regressing one on the other. You’ll be working with a dataset called PSY612_Lab2_Data.csv, which is located in the lab-2/data subfolder. It can be downloaded from the following web address:
“https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-2/data/PSY612_Lab2_Data.csv”
It contains a number of variables, but we’ll focus on Extraversion and Happiness first.
Calculate the bi-variate correlation between Extraversion and Happiness. Then, conduct a univariate regression, regressing Happiness on Extraversion and obtain the standardized estimate for the slope. Finally, conduct a series of logical tests showing the equivalence of the estimate (correlation and standardized slope value), their test statistic, and the p value associated with the test statistic (Hint: You can round to 3 digits, which will probably be necessary).
# code here
For this minihack, you’ll calculate predicted scores ‘by hand’ using the regression equation, compare them to the predicted scores stored in the model, and finally use the predicted scores you calculate to estimate \(R^2\). We’ll work with the same dataset, but this time you’ll regress social support (SocSup) on Extraversion.
a.) Run the regression model. Extract the coefficients, calculate the predicted scores using the extracted coefficients (HINT: think about the regression equation).
# code here
b.) Demonstrate that the predicted scores are the same as the values from fitted.values
.
# code here
c.) Use those predicted scores to calculate \(R^2\). Demonstrate its equivalence to the result of summary(model)$r.squared
.
# code here
Create two tables using two different methods we covered today. The first table should correspond to the regression from Minihack 1 (regressing happiness on Exraversion) and the second should correspond to the regression from Minihack 2 (regressing social support on Extraversion).
# Code here
# code here