Purpose

Today’s lab will focus on one-way ANOVA. However, rather than spend a lot of time on the traditional way of doing an ANOVA, we will instead focus on running ANOVA as a linear regression with a categorical predictor. In particular, we will discuss how to code categorical variables in R and how this affects interpretation of model coefficients.

To quickly navigate to the desired section, click one of the following links:

  1. Regression with categorical predictors
  2. Traditional ANOVA
  3. Minihacks

Be sure to have the following packages loaded:

library(tidyverse) # for plotting and data wrangling
library(rio) # for importing data
library(psych) # for descriptives

Regression with categorical predictors

Research scenario

  • We’ll be looking at a dataset comparing different kinds of treatment for depression. In this study, depressed patients (n = 5 per group) were randomly assigned to one of three groups:
  1. CBT group
  2. Psychotherapy group
  3. Control group (received no therapy)
  • After 10 weeks, participants’ depression scores were measured. Our dataset will have just 2 variables: group (CBT, Psychotherapy or Control) and depress (depression scores).

  • NOTE: In this example, lower scores = more depressed. (This is a little counter-intuitive.)

  • NOTE: 1 = CBT; 2 = Psychotherapy; 3 = Control

  • Import the data

depression <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-5/data/depression_therapy.csv")

Structure of the data

  • The first thing we should check is how the data are structured. We should have a factor called group and a numeric variable for depression scores. We’ll use the hopefully familiar functionstr() which stands for structure.

Code
str(depression) # or use tibble::glimpse(), which is a tidyverse function
Output
## 'data.frame':    15 obs. of  2 variables:
##  $ group  : int  1 1 1 1 1 2 2 2 2 2 ...
##  $ depress: int  4 7 6 7 5 7 10 9 10 12 ...


Recode group as a factor

  • Okay, it looks like group is being read as an integer. We should go ahead and change this now to help make sure we don’t make a mistake and use group as an integer later on. This is a bit unnecessary - we would find out when we tried to create dummy codes (since the technique we’ll use only works with factors), but it is a good habit to get into.

  • We’ll use mutate() and base R’s factor(), which can be used to turn variables into factors. At a minimum, you can just put factor() around a variable (e.g., factor(group)), but we will also provide labels for the factor. To be really clear that we want to work with the “factor” version of this variables, we’ll actually create a new variable and call it group_fct.

  • NOTE: You want to provide labels in the same order as the levels of the factor, so in this case we want them in the order CBT, Psychotherapy, and Control (see above).

depression <- depression %>% 
  mutate(group_fct = factor(group,
                            labels = c("CBT", "Psychotherapy", "Control"))) # order matters here! 


  • Look at the structure of the data again. Now it’s clear that group_fct is a factor variable with the correct levels.

Code
str(depression)
Output
## 'data.frame':    15 obs. of  3 variables:
##  $ group    : int  1 1 1 1 1 2 2 2 2 2 ...
##  $ depress  : int  4 7 6 7 5 7 10 9 10 12 ...
##  $ group_fct: Factor w/ 3 levels "CBT","Psychotherapy",..: 1 1 1 1 1 2 2 2 2 2 ...


Descriptives

  • Next, we’ll get descriptives for each group using two tidyverse functions.

  • The first is group_by() which sets some metadata saying what variable can be used to group observations (we’ll use the group variable for this). One thing to mention is that group_by() itself looks like it doesn’t change much which is OK; it basically sets the groups as meta data which other functions (like summarize()) can use.

  • The second is summarize() which can be used to get summary information about a dataset. When you combine them, it can give you summary information about each group. summarize() works sort of like mutate(); you put a name on the left-hand side of an = and the definition on the right. The main difference is that summarize collapses across rows; either collapsing across the rows within a group or across all rows if there is no grouping variable set (with group_by()). We’ll use it to get the mean (using base R’s mean()), standard deviation (using base R’s sd()), and sample size per group (using tidyverse’s n()).

  • For a refresher on these functions, see here

Code
depression %>% 
  group_by(group_fct) %>%  # telling it to treat group as the group variable
  summarize(mean = mean(depress, na.rm = TRUE),
            sd = sd(depress, na.rm = TRUE),
            n = n()) %>% # don't need to put anything in here; it's looking for n per group
  knitr::kable(digits = 2) # make it look pretty
Output
group_fct mean sd n
CBT 5.8 1.30 5
Psychotherapy 9.6 1.82 5
Control 3.2 1.79 5


  • What’s a faster way to do this?

Code
describeBy(depression$depress, depression$group_fct)
Output
## 
##  Descriptive statistics by group 
## group: CBT
##    vars n mean  sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 5  5.8 1.3      6     5.8 1.48   4   7     3 -0.26    -1.96 0.58
## ------------------------------------------------------------ 
## group: Psychotherapy
##    vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 5  9.6 1.82     10     9.6 1.48   7  12     5 -0.13    -1.55 0.81
## ------------------------------------------------------------ 
## group: Control
##    vars n mean   sd median trimmed  mad min max range  skew kurtosis  se
## X1    1 5  3.2 1.79      3     3.2 2.97   1   5     4 -0.03    -2.09 0.8


  • NOTE: If you want the output of psych::describeBy() as a matrix instead of a list, add mat = TRUE.

Code
describeBy(depression$depress, depression$group_fct, mat = TRUE)
Output


Dummy coding

  • Categorical predictors with more than two levels are broken up into several smaller variables. In doing so, we take variables that don’t have any inherent numerical value to them, e.g. group_fct , which takes the values CBT, Psychotherapy or Control, and ascribe meaningful numbers that allow for us to calculate meaningful statistics.

  • This actually happens under the hood in R when our independent variable is a factor. Let’s see what happens when we put group_fct as the IV in our model:

Code
model_default <- lm(depress ~ group_fct, data = depression) 
summary(model_default)
Output
## 
## Call:
## lm(formula = depress ~ group_fct, data = depression)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -2.6   -1.0    0.2    1.2    2.4 
## 
## Coefficients:
##                        Estimate Std. Error t value   Pr(>|t|)    
## (Intercept)              5.8000     0.7394   7.845 0.00000459 ***
## group_fctPsychotherapy   3.8000     1.0456   3.634    0.00342 ** 
## group_fctControl        -2.6000     1.0456  -2.487    0.02861 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.653 on 12 degrees of freedom
## Multiple R-squared:  0.7595, Adjusted R-squared:  0.7195 
## F-statistic: 18.95 on 2 and 12 DF,  p-value: 0.0001934


  • Notice that our coefficients are labeled (Intercept), group_fctPsychotherapy and group_fctControl. Why is this the case? Let’s remind ourselves of the levels of group_fct:

Code
levels(depression$group_fct)
Output
## [1] "CBT"           "Psychotherapy" "Control"


  • R is creating dummy variables for us under the hood. By default, R treats whatever the first level of the factor variable is as the reference group. In this case, CBT was the first level of group_fct (because it was initially coded as 1 in our raw data), so the model treated CBT as our reference group.

Question: In the above model output, what does (Intercept) mean?

Question: What does the group_fctPsychotherapy slope mean? What does the t value mean?

Question: What does the group_fctControl slope mean? What does the t value mean?

Question: What does the F value mean? And what does the R^2 mean?

  • However, let’s imagine that we have the following (more intuitive) research questions:
  1. Is CBT effective (relative to no therapy)?
  2. Is psychotherapy effective (relative to no therapy)?

Question: What do want our reference group to be to answer these research questions?

  • Now let’s make the appropriate dummy codes. Recall that we need k-1 dummy codes (where k is the number of groups). We have 3 groups, so we need 2 dummy codes.

  • Remember, our decision of how to set the dummy codes (which group to set as the reference group) should be guided by our research questions. Above, we said we’re interested in whether CBT is better than Control and whether psychotherapy is better than Control. This implies that we should make our reference group the Control group, and make a dummy code to represent the difference between CBT and control and a second dummy code for the difference between psychotherapy and the control group.

  • There are (as always) a ton of ways to do this in R. Let’s go over a few methods:

Method 1: Create dummy variables with mutate()

  • You’ve seen this before from this slide in class.
# create new dummy variables containing 1's and 0's
depression <- depression %>%
  mutate(CBT.v.Control = ifelse(group_fct == "CBT", 1, 0), 
         Psychotherapy.v.Control = ifelse(group_fct == "Psychotherapy", 1, 0)) 


  • View the data frame with these new dummy variables.

Code
depression
Output


  • Run the linear model using these new dummy variables as the IV’s.

Code
model_dummy1 <- lm(depress ~ CBT.v.Control + Psychotherapy.v.Control, data = depression) 
summary(model_dummy1)
Output
## 
## Call:
## lm(formula = depress ~ CBT.v.Control + Psychotherapy.v.Control, 
##     data = depression)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -2.6   -1.0    0.2    1.2    2.4 
## 
## Coefficients:
##                         Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)               3.2000     0.7394   4.328  0.000982 ***
## CBT.v.Control             2.6000     1.0456   2.487  0.028613 *  
## Psychotherapy.v.Control   6.4000     1.0456   6.121 0.0000517 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.653 on 12 degrees of freedom
## Multiple R-squared:  0.7595, Adjusted R-squared:  0.7195 
## F-statistic: 18.95 on 2 and 12 DF,  p-value: 0.0001934


Question: What does the intercept mean?

Question: What does the CBT.v.Control slope mean?

Question: What does the Psychotherapy.v.Control slope mean?

Question: How does the F value and R^2 we got from this model compare to our last model with different (underlying) dummy codes?

Method 2: Contrast Matrix

  • Now we’re going use another method to create our dummy codes by creating a contrast matrix, which is basically just a set of dummy codes (or effects codes, as we’ll discuss in a moment). When we pass a factor to lm() it uses a contrast matrix under the hood (which is what happened with the first model we ran. Although R creates the contrast matrix automatically with certain default assumptions, you can set it yourself if you have a specific set of codes in mind (which we do).

  • Use contrasts() to view the default contrast matrix associated with our grouping variable. This function expects the factor variable that we are trying to view the contrasts for (in this case, group_fct) as the first argument.

Code
contrasts(depression$group_fct)  
Output
##               Psychotherapy Control
## CBT                       0       0
## Psychotherapy             1       0
## Control                   0       1


  • Again, by default our contrasts are not how we want them because we want Control to be the reference group. To manually specify our dummy codes, we’ll use the function contr.treatment(), which will return a contrast matrix with specific characteristics, depending on what we tell it. The first argument, n is the number of levels of the factor variable (in this case, n = 3). The second argument, base refers to the level of the factor that should be considered the reference group – in this case, since Control is the third level of group_fct, we will put base = 3.

Code
contr.treatment(n = 3, base = 3)    
Output
##   1 2
## 1 1 0
## 2 0 1
## 3 0 0


  • Now we will overwrite the default contrast matrix for group_fct with the one we just manually specified.
contrasts(depression$group_fct) <- contr.treatment(n = 3, base = 3)    


  • Now look at the structure of the data again. What do you notice about group_fct?

Code
str(depression)
Output
## 'data.frame':    15 obs. of  5 variables:
##  $ group                  : int  1 1 1 1 1 2 2 2 2 2 ...
##  $ depress                : int  4 7 6 7 5 7 10 9 10 12 ...
##  $ group_fct              : Factor w/ 3 levels "CBT","Psychotherapy",..: 1 1 1 1 1 2 2 2 2 2 ...
##   ..- attr(*, "contrasts")= num [1:3, 1:2] 1 0 0 0 1 0
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr  "CBT" "Psychotherapy" "Control"
##   .. .. ..$ : chr  "1" "2"
##  $ CBT.v.Control          : num  1 1 1 1 1 0 0 0 0 0 ...
##  $ Psychotherapy.v.Control: num  0 0 0 0 0 1 1 1 1 1 ...


  • Let’s take a closer look at group_fct using attributes()

Code
attributes(depression$group_fct)
Output
## $levels
## [1] "CBT"           "Psychotherapy" "Control"      
## 
## $class
## [1] "factor"
## 
## $contrasts
##               1 2
## CBT           1 0
## Psychotherapy 0 1
## Control       0 0


  • Run the linear model with group_fct (now containing our desired dummy codes) as the IV

Code
model_dummy2 <- lm(depress ~ group_fct, data = depression) 
summary(model_dummy2)
Output
## 
## Call:
## lm(formula = depress ~ group_fct, data = depression)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -2.6   -1.0    0.2    1.2    2.4 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   3.2000     0.7394   4.328  0.000982 ***
## group_fct1    2.6000     1.0456   2.487  0.028613 *  
## group_fct2    6.4000     1.0456   6.121 0.0000517 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.653 on 12 degrees of freedom
## Multiple R-squared:  0.7595, Adjusted R-squared:  0.7195 
## F-statistic: 18.95 on 2 and 12 DF,  p-value: 0.0001934


Question: How does this compare to the output from the model that contained individual dummy variables (i.e. CBT.v.Control and Psychotherapy.v.Control) as the IV’s?

Method 3: Re-order levels of your factor

  • Let’s take another look at the levels of group_fct

Code
levels(depression$group_fct)
Output
## [1] "CBT"           "Psychotherapy" "Control"


  • We can manually change the order of the levels by using factor() and manually specifying the levels argument to contain the order we want. Again, since R treats the first level of the factor as the reference group, and we want Control to be the reference group in order to obtain our contrasts of interests when we run the linear model, we will create a new version of group_fct called group_fct_relevel and make Control the first level of the factor in this new variable.
depression <- depression %>% 
  mutate(group_fct_relevel = factor(group_fct, levels = c(
                                                  "Control",         # now level 1
                                                  "CBT",             # now level 2
                                                  "Psychotherapy"))) # now level 3


  • To double check that this worked, let’s look at the levels of group_fct_relevel

Code
levels(depression$group_fct_relevel)
Output
## [1] "Control"       "CBT"           "Psychotherapy"


  • Now when we run the linear model, the contrast matrix that will be automatically generated will correspond to the correct contrasts we want. To verify this, let’s re-run the linear model (yet again!) with the re-leveled group_fct_relevel variable as the IV.

Code
model_dummy3 <- lm(depress ~ group_fct_relevel, data = depression) 
summary(model_dummy3)
Output
## 
## Call:
## lm(formula = depress ~ group_fct_relevel, data = depression)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -2.6   -1.0    0.2    1.2    2.4 
## 
## Coefficients:
##                                Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)                      3.2000     0.7394   4.328  0.000982 ***
## group_fct_relevelCBT             2.6000     1.0456   2.487  0.028613 *  
## group_fct_relevelPsychotherapy   6.4000     1.0456   6.121 0.0000517 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.653 on 12 degrees of freedom
## Multiple R-squared:  0.7595, Adjusted R-squared:  0.7195 
## F-statistic: 18.95 on 2 and 12 DF,  p-value: 0.0001934


  • Voilà, it’s the exact same as our last two models!

Effects coding

  • Instead of dummy coding, we can instead use effects coding (also called “deviation coding”) to compare individual group means to the overall mean of the dependent variable (i.e. the grand mean). In R, we do this in basically the same way as dummy coding, but just change the values we use to set the codes.

  • NOTE: This will change how we interpret our coefficients!. When using effects coding, the intercept represents the grand mean. The coefficients of each of the effect-coded variables is equal to the difference between the mean of the group coded 1 and the grand mean.

  • With effects coding, the reference group will end up being coded with all -1’s.

  • Let’s imagine our research questions are as follows: Is CBT effective compared to the overall sample? Is Psychotherapy effective compared to the overall sample?

  • We’ll create our contrast matrix of effects codes in a very similar way to how we made our dummy codes. The only difference is that this time we will use contr.sum(), again telling it the number of levels of our grouping factor (n = 3)

Code
contr.sum(n = 3)
Output
##   [,1] [,2]
## 1    1    0
## 2    0    1
## 3   -1   -1


  • And as before with dummy codes, we overwrite the default contrast matrix with our effect codes.
contrasts(depression$group_fct) <- contr.sum(n = 3)


  • Run the linear model with effects coding

Code
model_ec <- lm(depress ~ group_fct, data = depression) 
summary(model_ec)
Output
## 
## Call:
## lm(formula = depress ~ group_fct, data = depression)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -2.6   -1.0    0.2    1.2    2.4 
## 
## Coefficients:
##             Estimate Std. Error t value      Pr(>|t|)    
## (Intercept)   6.2000     0.4269  14.524 0.00000000562 ***
## group_fct1   -0.4000     0.6037  -0.663       0.52012    
## group_fct2    3.4000     0.6037   5.632       0.00011 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.653 on 12 degrees of freedom
## Multiple R-squared:  0.7595, Adjusted R-squared:  0.7195 
## F-statistic: 18.95 on 2 and 12 DF,  p-value: 0.0001934


Question: What does the intercept mean? The slopes?

Question: How does the overall model fit compare to that from the model(s) that used dummy coding?

  • There are many other contrast coding schemes besides dummy and effects coding. You can read more about that here.

Factor vs. numeric/character

  • This whole time we’ve been working with our grouping variable as a factor group_fct. For a moment, let’s consider what would have happened if we forgot to make it a factor. Remember that in our depression data frame, group is an integer.

Code
model_numeric <- lm(depress ~ group, data = depression) 
summary(model_numeric)
Output
## 
## Call:
## lm(formula = depress ~ group, data = depression)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -3.9   -2.2   -0.5    1.8    5.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.8000     2.0712   4.249 0.000949 ***
## group        -1.3000     0.9588  -1.356 0.198214    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.032 on 13 degrees of freedom
## Multiple R-squared:  0.1239, Adjusted R-squared:  0.05651 
## F-statistic: 1.838 on 1 and 13 DF,  p-value: 0.1982


  • Compare this with the output from our model using group_fct:
## 
## Call:
## lm(formula = depress ~ group_fct_relevel, data = depression)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -2.6   -1.0    0.2    1.2    2.4 
## 
## Coefficients:
##                                Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)                      3.2000     0.7394   4.328  0.000982 ***
## group_fct_relevelCBT             2.6000     1.0456   2.487  0.028613 *  
## group_fct_relevelPsychotherapy   6.4000     1.0456   6.121 0.0000517 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.653 on 12 degrees of freedom
## Multiple R-squared:  0.7595, Adjusted R-squared:  0.7195 
## F-statistic: 18.95 on 2 and 12 DF,  p-value: 0.0001934

Question: What do you notice about the degrees of freedom in each of these models?


Traditional ANOVA

  • We have just reviewed in detail how to do an ANOVA within a linear regression framework. We will briefly review how to generate a traditional ANOVA table, though this will give us the same information we already got from our regression model, so we are not going to go over the old-school ANOVA method in great detail.

  • To get an ANOVA table, you can usestats::aov() and provide it with the same information you would pass to lm(), namely a formula (e.g. IV ~ DV) and the data. Again, make sure that the categorical IV is a factor.

Code
model_anova <- aov(depress ~ group_fct, data = depression)
summary(model_anova)
Output
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## group_fct    2  103.6   51.80   18.95 0.000193 ***
## Residuals   12   32.8    2.73                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


  • To calculate pairwise comparisons between group levels, use pairwise.t.test(). This function takes arguments x (the outcome variable), g (the grouping variable) an p.adjust.method (e.g. “bonferroni” or “holm”).

Code
pairwise.t.test(x = depression$depress, g = depression$group_fct, p.adjust.method = "none")
Output
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  depression$depress and depression$group_fct 
## 
##               CBT    Psychotherapy
## Psychotherapy 0.0034 -            
## Control       0.0286 0.000052     
## 
## P value adjustment method: none


  • Bonferroni correction

Code
pairwise.t.test(x = depression$depress, g = depression$group_fct, p.adjust.method = "bonferroni")
Output
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  depression$depress and depression$group_fct 
## 
##               CBT     Psychotherapy
## Psychotherapy 0.01027 -            
## Control       0.08584 0.00016      
## 
## P value adjustment method: bonferroni


  • Holm correction

Code
pairwise.t.test(x = depression$depress, g = depression$group_fct, p.adjust.method = "holm")
Output
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  depression$depress and depression$group_fct 
## 
##               CBT     Psychotherapy
## Psychotherapy 0.00685 -            
## Control       0.02861 0.00016      
## 
## P value adjustment method: holm


Question: What difference do you notice about the Bonferroni vs. Holm correction?


Minihacks

Research Problem:

You are interested in the effectiveness of a new drug in treating the symptoms of depression. You randomly assign patients with depression into one of 3 treatment groups with different levels of the drug (low, medium, and high), and also one control group that only received a placebo. You then measure their symptoms after 4 weeks of drug treatment. You are interested in whether the drug is effective, and what level of the drug is most effective.

  • Load in the raw data with the following code
depression_drug <- import("https://raw.githubusercontent.com/uopsych/psy612/master/labs/lab-5/data/depression_drug.csv")

Minihack 1

  • Recreate the following table.


  • Recreate the following boxplot.

  • What pattern do you notice in the data?

Minihack 2

  • Run an omnibus ANOVA test to determine whether there is a relationship between drug and depression. What does this tell you?

  • Using your preferred method of dummy coding, test whether there is a significant difference between depression scores for each drug group (low, medium, high) and the none group. Interpret all model coefficients.


Minihack 3

  • Conduct pairwise comparisons for depression scores across all levels of drug. Use a correction method to account for multiple comparisons.