Today we’ll be going over the basics of regression with multiple continuous predictors in R, including reporting the results in tables.
# you may need to install:
library(ppcor) # for partial & semi-partial correlations
library(janitor) # for cleaning up df names
# you should already have:
library(rio) # for importing data
library(pwr) # for power calculation
library(broom) # for cleaning up output
library(sjPlot) # for tables
library(tidyverse) # for plotting and data wrangling
We’ll start by reviewing semi-partial & partial correlations. Recall that a zero-order correlation (our old friend) is the strength of the association between two variables, full stop. Semi-partial & partial correlations index the strength of the association between two variables while controlling for a third variable.
Given X1, X2, and Y, the semi-partial correlation between X1 and Y (controlling for X2) is the correlation between the part of X1 that is not related to X2 and Y:
\[\large r_{Y(X1.X2)} = \frac{r_{YX1}-r_{YX2}r_{X1X2} }{\sqrt{1-r_{X1X2}^2}}\]
Given X1, X2, and Y, the partial correlation between X1 and Y (controlling for X2) is the correlation between the part of X1 that is not related to X2 and the part of Y that is not related to X2:
\[\large r_{YX1.X2} = \frac{r_{YX1}-r_{YX2}r_{X1X2} }{\sqrt{1-r_{YX2}^2}\sqrt{1-r_{X1X2}^2}}\]
Let’s work on calculating these coefficients in R. To do so, we’ll work with the lab 2 dataset we worked with before. Recall that it has:
1. extraversion: self-reported extraversion
2. pos_express: self-reported positive expressivity
3. soc_sup: self-reported social support
4. optimism: self-reported optimism
5. exercise: self-reported exercise
6. happiness: self-reported happiness
df <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-2/data/PSY612_Lab2_Data.csv") %>% janitor::clean_names() # to get things into snake_case
We’ll use functions from the {ppcor} library.
To get the semi-partial correlation, we use the spcor.test() function. It requires 3 arguments: x, y, and z, which correspond to X1, Y, and X2 above, respectively. Put differently, x and y are what you want the correlation between, and z is the variable you want to control for. Let’s look at the semi-partial correlation between extraversion and happiness, controlling for social support.
ex_happy_socsup_spcor <- spcor.test(
x = df$extraversion, # X1 (var of interest 1)
y = df$happiness, # Y (var of interest 2)
z = df$soc_sup # X2 (control variable)
)
ex_happy_socsup_spcor
The resulting object is a dataframe, so we can extract values using $ notation. For instance, we can get the estimate of the correlation using $estimate:
ex_happy_socsup_spcor$estimate
## [1] 0.4049577
Question: What does this value 0.4 mean?
Or the p value by using $p.value
ex_happy_socsup_spcor$p.value
## [1] 3.978825e-07
Question: What does this p value tell us?
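To connect this back to the formula above, here is a minimal sketch (assuming the df loaded earlier) that computes the same semi-partial correlation by hand from the zero-order correlations:

r_yx1  <- cor(df$happiness, df$extraversion)  # r between Y and X1
r_yx2  <- cor(df$happiness, df$soc_sup)       # r between Y and X2
r_x1x2 <- cor(df$extraversion, df$soc_sup)    # r between X1 and X2

# semi-partial correlation: X2 is removed from X1 only
(r_yx1 - r_yx2 * r_x1x2) / sqrt(1 - r_x1x2^2)

This should match ex_happy_socsup_spcor$estimate (about .40).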
We can get the full semi-partial correlation matrix by passing a dataframe to the spcor() function, which has just two arguments: x, which is where we tell it the df, and method, where we could change the method from Pearson to another (we’ll leave it at its default here).
spcor_mat <- spcor(df)
spcor_mat
## $estimate
## extraversion pos_express soc_sup optimism exercise
## extraversion 1.0000000 0.28856188 0.10718319 0.091815651 0.007219683
## pos_express 0.3094357 1.00000000 0.15021866 -0.041798865 -0.019397059
## soc_sup 0.1200254 0.15686965 1.00000000 0.171121481 0.071135781
## optimism 0.1059002 -0.04495862 0.17625361 1.000000000 -0.008875602
## exercise 0.0079304 -0.01986924 0.06977799 -0.008452686 1.000000000
## happiness 0.1832998 0.11944467 0.06808492 0.178099997 0.341717610
## happiness
## extraversion 0.2020639
## pos_express 0.1411969
## soc_sup 0.0840474
## optimism 0.2264492
## exercise 0.4137815
## happiness 1.0000000
##
## $p.value
## extraversion pos_express soc_sup optimism exercise
## extraversion 0.000000000 0.0004739486 0.20261053 0.27543807 9.318017e-01
## pos_express 0.000169532 0.0000000000 0.07333123 0.62012618 8.181377e-01
## soc_sup 0.153330568 0.0613415780 0.00000000 0.04100756 3.985216e-01
## optimism 0.208098760 0.5939102900 0.03523023 0.00000000 9.162113e-01
## exercise 0.925106821 0.8137900380 0.40760820 0.92019026 0.000000e+00
## happiness 0.028429416 0.1553439149 0.41910805 0.03332725 2.954715e-05
## happiness
## extraversion 1.551435e-02
## pos_express 9.255180e-02
## soc_sup 3.182761e-01
## optimism 6.536222e-03
## exercise 2.794460e-07
## happiness 0.000000e+00
##
## $statistic
## extraversion pos_express soc_sup optimism exercise
## extraversion 0.00000000 3.5787165 1.2801042 1.0948752 0.08573123
## pos_express 3.86398902 0.0000000 1.8042206 -0.4967682 -0.23037066
## soc_sup 1.43560115 1.8860747 0.0000000 2.0623752 0.84683594
## optimism 1.26460630 -0.5343944 2.1261814 0.0000000 -0.10539609
## exercise 0.09417124 -0.2359808 0.8305923 -0.1003737 0.00000000
## happiness 2.21407693 1.4285540 0.8103440 2.1491805 4.31757845
## happiness
## extraversion 2.449911
## pos_express 1.693587
## soc_sup 1.001551
## optimism 2.760648
## exercise 5.397091
## happiness 0.000000
##
## $n
## [1] 147
##
## $gp
## [1] 4
##
## $method
## [1] "pearson"
The resulting object is a list. It contains three j × j matrices (where j is the number of variables in the df) called:
1. estimate = a matrix of semi-partial correlations.
2. p.value = a matrix of p values.
3. statistic = a matrix of t statistics.
It also contains 3 additional values:
1. n = sample size
2. gp = number of control variables (number of variables being ‘partialed out’)
3. method = type of correlation
Like any list, we can extract elements using $. So we could get the semi-partial correlation matrix using $estimate:
spcor_mat$estimate
## extraversion pos_express soc_sup optimism exercise
## extraversion 1.0000000 0.28856188 0.10718319 0.091815651 0.007219683
## pos_express 0.3094357 1.00000000 0.15021866 -0.041798865 -0.019397059
## soc_sup 0.1200254 0.15686965 1.00000000 0.171121481 0.071135781
## optimism 0.1059002 -0.04495862 0.17625361 1.000000000 -0.008875602
## exercise 0.0079304 -0.01986924 0.06977799 -0.008452686 1.000000000
## happiness 0.1832998 0.11944467 0.06808492 0.178099997 0.341717610
## happiness
## extraversion 0.2020639
## pos_express 0.1411969
## soc_sup 0.0840474
## optimism 0.2264492
## exercise 0.4137815
## happiness 1.0000000
Or get the p values with $p.value:
spcor_mat$p.value
## extraversion pos_express soc_sup optimism exercise
## extraversion 0.000000000 0.0004739486 0.20261053 0.27543807 9.318017e-01
## pos_express 0.000169532 0.0000000000 0.07333123 0.62012618 8.181377e-01
## soc_sup 0.153330568 0.0613415780 0.00000000 0.04100756 3.985216e-01
## optimism 0.208098760 0.5939102900 0.03523023 0.00000000 9.162113e-01
## exercise 0.925106821 0.8137900380 0.40760820 0.92019026 0.000000e+00
## happiness 0.028429416 0.1553439149 0.41910805 0.03332725 2.954715e-05
## happiness
## extraversion 1.551435e-02
## pos_express 9.255180e-02
## soc_sup 3.182761e-01
## optimism 6.536222e-03
## exercise 2.794460e-07
## happiness 0.000000e+00
To get the partial correlation, we use the pcor.test() function. It requires the same 3 arguments: x, y, and z, which correspond to X1, Y, and X2 above, respectively. Put differently, x and y are what you want the correlation between, and z is the variable you want to control for.
ex_happy_socsup_pcor <- pcor.test(
x = df$extraversion, # X1 (var of interest 1)
y = df$happiness, # Y (var of interest 2)
z = df$soc_sup # X2 (control variable)
)
ex_happy_socsup_pcor
Question: What does this value 0.45 mean?
Question: What is the p value telling us?
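As a check on the formula above, here is a minimal sketch (again assuming the df loaded earlier) that computes the same partial correlation by hand from the zero-order correlations:

r_yx1  <- cor(df$happiness, df$extraversion)  # r between Y and X1
r_yx2  <- cor(df$happiness, df$soc_sup)       # r between Y and X2
r_x1x2 <- cor(df$extraversion, df$soc_sup)    # r between X1 and X2

# partial correlation: X2 is removed from both X1 and Y
(r_yx1 - r_yx2 * r_x1x2) / (sqrt(1 - r_yx2^2) * sqrt(1 - r_x1x2^2))

This should match ex_happy_socsup_pcor$estimate (about .45).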
Just like with semi-partial correlations, we can get the full partial correlation matrix using the pcor() function.
pcor_mat <- pcor(df)
pcor_mat
## $estimate
## extraversion pos_express soc_sup optimism exercise
## extraversion 1.00000000 0.36259938 0.14303288 0.12286034 0.00973409
## pos_express 0.36259938 1.00000000 0.18560006 -0.05248458 -0.02438221
## soc_sup 0.14303288 0.18560006 1.00000000 0.20180292 0.08533989
## optimism 0.12286034 -0.05248458 0.20180292 1.00000000 -0.01037510
## exercise 0.00973409 -0.02438221 0.08533989 -0.01037510 1.00000000
## happiness 0.26286818 0.17480447 0.10068458 0.25590626 0.45285094
## happiness
## extraversion 0.2628682
## pos_express 0.1748045
## soc_sup 0.1006846
## optimism 0.2559063
## exercise 0.4528509
## happiness 1.0000000
##
## $p.value
## extraversion pos_express soc_sup optimism exercise
## extraversion 0.000000e+00 8.570356e-06 0.08834621 0.143783503 9.081408e-01
## pos_express 8.570356e-06 0.000000e+00 0.02646675 0.533583434 7.725410e-01
## soc_sup 8.834621e-02 2.646675e-02 0.00000000 0.015651048 3.108643e-01
## optimism 1.437835e-01 5.335834e-01 0.01565105 0.000000000 9.021209e-01
## exercise 9.081408e-01 7.725410e-01 0.31086433 0.902120932 0.000000e+00
## happiness 1.514813e-03 3.678732e-02 0.23150604 0.002036401 1.356952e-08
## happiness
## extraversion 1.514813e-03
## pos_express 3.678732e-02
## soc_sup 2.315060e-01
## optimism 2.036401e-03
## exercise 1.356952e-08
## happiness 0.000000e+00
##
## $statistic
## extraversion pos_express soc_sup optimism exercise happiness
## extraversion 0.0000000 4.6200457 1.716066 1.4700227 0.1155914 3.235162
## pos_express 4.6200457 0.0000000 2.242847 -0.6240800 -0.2896088 2.108147
## soc_sup 1.7160661 2.2428472 0.000000 2.4466131 1.0170654 1.201670
## optimism 1.4700227 -0.6240800 2.446613 0.0000000 -0.1232041 3.143388
## exercise 0.1155914 -0.2896088 1.017065 -0.1232041 0.0000000 6.031169
## happiness 3.2351617 2.1081469 1.201670 3.1433882 6.0311686 0.000000
##
## $n
## [1] 147
##
## $gp
## [1] 4
##
## $method
## [1] "pearson"
One nice thing about regression analysis in R is that it is pretty easy to build up from the simple to the more complex cases. We still use the lm() function and the functions we’ve used over the past few weeks (e.g., coefficients()) to estimate our regression model and extract the information we want. To add multiple predictors, you enter them in the formula argument, separated by a +.
Let’s run a regression regressing happiness on extraversion and social support:
mr_model <- lm(happiness ~ extraversion + soc_sup,
data = df)
And, as we covered a couple of weeks ago, we can get much of the information we want by running summary() on it:
summary(mr_model)
##
## Call:
## lm(formula = happiness ~ extraversion + soc_sup, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.335 -5.913 -0.010 5.785 32.582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.81383 4.28021 8.367 4.67e-14 ***
## extraversion 0.31769 0.05235 6.068 1.09e-08 ***
## soc_sup 0.18931 0.05693 3.325 0.00112 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.498 on 144 degrees of freedom
## Multiple R-squared: 0.3616, Adjusted R-squared: 0.3528
## F-statistic: 40.79 on 2 and 144 DF, p-value: 9.237e-15
Now we’ll go through extracting these pieces of information (separately) and interpreting them. Let’s start with coefficients.
We can see the coefficients in the summary() output above, but recall from lab 2 that we can also get the coefficients using the coefficients() function:
coefficients(mr_model)
## (Intercept) extraversion soc_sup
## 35.8138331 0.3176919 0.1893082
Question: What does the intercept mean (in plain English, using the estimate in your answer)?
Question: What does the extraversion term mean (in plain English, using the estimate in your answer)?
Question: What does the social support term mean (in plain English, using the estimate in your answer)?
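If it helps to see the regression equation in action, here is a small sketch that plugs some made-up predictor values (an extraversion score of 30 and a social support score of 40; both purely hypothetical) into the fitted equation:

b <- coefficients(mr_model)
# predicted happiness for a hypothetical person with extraversion = 30 and soc_sup = 40
b["(Intercept)"] + b["extraversion"] * 30 + b["soc_sup"] * 40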
We often want NHSTs for each of these parameters, which we can get by extracting the $coefficients table from the summary() object:
summary(mr_model)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.8138331 4.28021324 8.367301 4.669959e-14
## extraversion 0.3176919 0.05235365 6.068191 1.087597e-08
## soc_sup 0.1893082 0.05692777 3.325410 1.120165e-03
Or, we can use broom::tidy() to get it as a tibble:
tidy(mr_model)
Question: Is the intercept significant? What does that mean?
Question: Is the extraversion slope significant? What does that mean?
Question: Is the social support slope significant? What does that mean?
And, we could even get CIs around our estimates:
tidy(mr_model, conf.int = TRUE)
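If you prefer base R, confint() on the model should give the same 95% CIs:

confint(mr_model)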
Question: What does the CI for the Intercept mean?
Question: What does the CI for the slope for extraversion mean?
Question: What does the CI for the slope for social support mean?
Often we want standardized coefficients in addition to unstandardized coefficients. Recall from lecture that the formula for a standardized \(\beta\) looks like the following:
\[\large \beta_{1} = \frac{r_{YX1}-r_{YX2}r_{X1X2}}{1-r_{X1X2}^2}\]
This is the formula for \(\beta_1\), the standardized slope for extraversion in our running example.
Question: What do we do to get standardized coefficients in R?
That’s right, we z-score the variables first. Let’s go ahead and do that now:
std_mr_model <- lm(scale(happiness) ~ scale(extraversion) + scale(soc_sup),
data = df)
And we can see the coefficients table using any of the methods above. I’ll use tidy here:
tidy(std_mr_model)
Question: Why is the intercept zero?
Question: What does the standardized slope for extraversion mean?
Question: What does the standardized slope for social support mean?
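To tie this back to the formula above, here is a minimal sketch that computes the standardized slope for extraversion directly from the zero-order correlations (assuming the df loaded earlier):

r_yx1  <- cor(df$happiness, df$extraversion)
r_yx2  <- cor(df$happiness, df$soc_sup)
r_x1x2 <- cor(df$extraversion, df$soc_sup)

# standardized slope for extraversion (the beta_1 formula above)
(r_yx1 - r_yx2 * r_x1x2) / (1 - r_x1x2^2)

This should match the slope for scale(extraversion) in tidy(std_mr_model).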
Often we want to extract even more information from our model. We can use the same functions we covered in lab 2 to get predicted scores, residuals, a coefficients table, etc. Let’s start with predicted y values, or fitted values (\(\hat{y}\)), which we can get using the fitted.values() function:
fitted.values(mr_model)
## 1 2 3 4 5 6 7 8
## 70.41169 75.52954 70.45301 82.51877 85.56077 61.34661 57.77147 67.94415
## 9 10 11 12 13 14 15 16
## 74.40023 74.02161 74.01508 69.45861 63.03731 60.99624 74.57647 73.48847
## 17 18 19 20 21 22 23 24
## 61.34661 71.97401 83.65461 72.53540 70.24199 79.31360 71.54754 81.57223
## 25 26 27 28 29 30 31 32
## 65.32862 60.81347 74.02161 62.29969 69.10824 76.87431 69.49993 76.13014
## 33 34 35 36 37 38 39 40
## 63.42900 60.38700 71.54100 71.16239 77.05708 73.62993 63.61177 67.02586
## 41 42 43 44 45 46 47 48
## 67.98547 65.33515 68.72309 65.15238 74.58300 69.49993 81.18707 79.34838
## 49 50 51 52 53 54 55 56
## 67.80923 56.64869 69.46515 84.60769 75.35331 70.45301 68.72309 50.00330
## 57 58 59 60 61 62 63 64
## 65.50485 64.55831 74.58300 68.36408 60.42621 72.11546 62.32794 71.16239
## 65 66 67 68 69 70 71 72
## 72.88576 73.48847 76.31292 75.52954 62.46939 51.13261 62.85454 58.86600
## 73 74 75 76 77 78 79 80
## 78.39531 70.06786 71.96747 80.47770 58.68976 76.87431 66.66684 64.38208
## 81 82 83 84 85 86 87 88
## 64.76069 78.57154 65.72030 77.44223 74.75924 44.10861 74.23053 74.19132
## 89 90 91 92 93 94 95 96
## 54.35085 73.06200 72.67685 62.67830 69.11478 66.07931 62.65216 62.86108
## 97 98 99 100 101 102 103 104
## 74.40676 59.81908 68.93201 67.41754 65.89654 68.74270 65.12624 69.11478
## 105 106 107 108 109 110 111 112
## 73.62993 63.80762 79.34838 70.06786 63.60524 74.96815 67.76791 80.47770
## 113 114 115 116 117 118 119 120
## 59.25769 71.39955 68.93854 76.09747 69.32369 76.73285 63.22009 64.38208
## 121 122 123 124 125 126 127 128
## 71.96747 73.62339 75.17707 61.90147 72.67685 69.28237 77.05708 70.23545
## 129 130 131 132 133 134 135 136
## 73.27092 78.01015 79.90977 79.34838 81.56569 64.58445 64.41033 61.90147
## 137 138 139 140 141 142 143 144
## 69.28237 77.43569 72.50061 59.47524 69.89162 76.48915 66.46447 65.12624
## 145 146 147
## 71.39955 68.16170 72.49408
Question: Why are there 147 values?
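One way to check your answer is to compare the number of fitted values to the number of rows in the data:

length(fitted.values(mr_model))  # how many fitted values we have
nrow(df)                         # how many rows are in the data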
We can get the residuals using the residuals() function:
residuals(mr_model)
## 1 2 3 4 5 6
## -2.41168804 2.47045683 -0.45300719 5.48123413 4.43923358 1.65338924
## 7 8 9 10 11 12
## 2.22853016 -2.94414702 0.59977066 8.97838696 18.98492203 -9.45861222
## 13 14 15 16 17 18
## -13.03731397 -2.99624346 -1.57646735 16.51152734 -11.34661076 3.02599254
## 19 20 21 22 23 24
## -8.65461477 -14.53539684 -12.24198511 5.68640254 3.45246306 1.42777488
## 25 26 27 28 29 30
## 2.67138301 -2.81347038 0.97838696 2.70031342 0.89175508 6.12569456
## 31 32 33 34 35 36
## -4.49993136 8.86985702 9.57099959 -10.38699987 -6.54100187 13.83761443
## 37 38 39 40 41 42
## -4.05707852 14.37007340 9.38822651 -4.02585527 7.01453384 -30.33515206
## 43 44 45 46 47 48
## 4.27690645 -5.15237898 -4.58300242 -6.49993136 3.81292625 13.65161847
## 49 50 51 52 53 54
## 7.19077184 6.35130892 -6.46514729 -6.60769060 -2.35330516 9.54699281
## 55 56 57 58 59 60
## 9.27690645 14.99669939 -0.50485499 -9.55831424 0.41699758 9.63591754
## 61 62 63 64 65 66
## -7.42621030 5.88453861 5.67206442 1.83761443 5.11423586 -15.48847266
## 67 68 69 70 71 72
## 13.68708394 -0.52954317 -19.46938952 8.86738557 0.14545911 21.13400041
## 73 74 75 76 77 78
## -0.39530571 4.93214419 -1.96747239 7.52230464 6.31023841 -13.87430544
## 79 80 81 82 83 84
## -1.66684418 -1.38207624 -4.76069254 -15.57154372 -0.72030343 -7.44222989
## 85 86 87 88 89 90
## 15.24075957 0.89139234 3.76947359 -1.19131597 -11.35085299 -3.06200215
## 91 92 93 94 95 96
## 12.32314923 -2.67830288 0.88522001 -8.07931452 -4.65216260 -12.86107596
## 97 98 99 100 101 102
## -4.40676441 10.18092458 -3.93200691 32.58245829 -12.89654144 6.25730124
## 103 104 105 106 107 108
## -5.12623869 -4.11477999 -8.62992660 -13.80761671 -6.34838153 4.93214419
## 109 110 111 112 113 114
## -10.60523842 3.03184621 -14.76790901 -5.47769536 13.74231396 1.60045206
## 115 116 117 118 119 120
## -0.93854199 8.90253238 0.67630664 13.26714850 4.77991295 8.61792376
## 121 122 123 124 125 126
## -6.96747239 1.37660848 4.82293284 -6.90146507 -17.67685077 -1.28237421
## 127 128 129 130 131 132
## -7.05707852 4.76454996 7.72908449 -0.01015434 18.09022909 -1.34838153
## 133 134 135 136 137 138
## -6.56569005 -11.58445453 -1.41032524 6.09853493 8.71762579 -2.43569482
## 139 140 141 142 143 144
## 0.49938723 18.52475681 5.10838219 -16.48915407 -18.46446589 4.87376131
## 145 146 147
## -3.39954794 -0.16170417 -22.49407770
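A quick sanity check that can help with interpretation: assuming no rows were dropped for missingness, the fitted values and residuals should add back up to the observed happiness scores.

all.equal(fitted.values(mr_model) + residuals(mr_model),
          df$happiness,
          check.attributes = FALSE)  # ignore the names attached to the fitted values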
And we can get our \(R^2\) for the model by extracting it from the summary() of the model. Remember, it isn’t stored in the model object itself, only in the summary() of it.
summary(mr_model)$r.squared
## [1] 0.3616237
Which is equal to the squared correlation between the predicted and observed y values:
cor(mr_model$fitted.values, df$happiness)^2
## [1] 0.3616237
Question: What does this value of 0.3616237 mean (in plain English)?
We can also get the adjusted \(R^2\) value that Sara mentioned in lecture yesterday:
summary(mr_model)$adj.r.squared
## [1] 0.3527574
Question: Why is the adjusted \(R^2\) 0.35 lower than the \(R^2\) 0.36?
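To see where the penalty comes from, here is a minimal sketch of the adjustment computed by hand (with n taken from the data and k = 2 predictors):

r2 <- summary(mr_model)$r.squared
n  <- nrow(df)  # 147 observations
k  <- 2         # number of predictors
1 - (1 - r2) * (n - 1) / (n - k - 1)  # should match summary(mr_model)$adj.r.squared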
We can also get our model F by pulling it from the model’s summary()
summary(mr_model)$fstatistic
## value numdf dendf
## 40.78615 2.00000 144.00000
We can then look up the p value using the pf() function, providing the F value, numerator df, and denominator df (in that order). Don’t forget to set lower.tail = FALSE:
pf(q = summary(mr_model)$fstatistic[1],
df1 = summary(mr_model)$fstatistic[2],
df2 = summary(mr_model)$fstatistic[3],
lower.tail = FALSE)
## value
## 9.236865e-15
Question: Is the Model F significant? What does this mean?
The model F is equivalent to testing our model against an intercept-only model, and we can see that by doing exactly that. Let’s try it. First, we fit an intercept-only model:
int_model <- lm(happiness ~ 1,
data = df)
Then, we can use the anova() function to compare them:
anova(int_model, mr_model)
You can see in the output above that the F is exactly what we saw for our model, as is the p value.
broom
And we could of course use augment() and glance() from {broom} to get tibbles with this information. We can get fitted and residual values (and some other values we’ll cover in a few weeks) using augment():
augment(mr_model)
And we can get those model-level statistics using glance():
glance(mr_model)
Lastly, we want to make a nice table to report our results. We can use any of the methods we covered in Lab 2. This time, we’ll just use sjPlot::tab_model(), but stargazer::stargazer() and doing it ‘by hand’ using broom and knitr::kable() are also valid approaches.
For tab_model(), we can just pass it the model, but I’ll also give it a title using the title argument and then save it out using the file argument.
tab_model(mr_model,
title = "Regressing Happiness on Extraversion and Social Support",
file = "tbl/reg_tbl_extra_socsup_happy.doc")
Outcome: happiness

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 35.81 | 27.35 – 44.27 | <0.001 |
| extraversion | 0.32 | 0.21 – 0.42 | <0.001 |
| soc_sup | 0.19 | 0.08 – 0.30 | 0.001 |
| Observations | 147 | | |
| R² / R² adjusted | 0.362 / 0.353 | | |
Minihack 1: Get the partial & semi-partial correlation matrices with just social support, positive expressivity, and optimism (no other variables). (Hint: you may need to do some data wrangling.)
Minihack 2: Run a regression predicting happiness from social support, positive expressivity, and extraversion. Then make a new model that has these predictors plus exercise and optimism (5 total predictors).
What happens to the slopes for social support, positive expressivity, and extraversion after you add optimism and exercise to the model?
Next, formally compare the overall performance of the models (i.e., get a significance test comparing the models) and examine their adjusted \(R^2\) values. Which model is better and how much is it an improvement?
Minihack 3: Make a table that has both of the models from Minihack 2 in a single table.