class: center, middle, inverse, title-slide

# Causal models

---

## Correlation does not imply

--

causation

--

always?

---

## Karl Pearson (1857-1936)

- Francis Galton demonstrates that one phenomenon -- regression to the mean -- is not caused by external factors, but merely the result of natural chance variation.

- Pearson (his student) takes this to the extreme: causation can never be proven.

"Force as a cause of motion is exactly on the same footing as a tree-god as a cause of growth." .small[(Pearson, 1892, p. 119)]

---

## Sewall Wright (1889-1988)

.pull-left[
![](images/wright.jpeg)
]

.pull-right[
Geneticist; studied guinea pigs at the USDA

Developed path models as a way to identify causal forces and estimate their values

Rebuked by the mathematical community, including Ronald Fisher
]

---

## Judea Pearl

.pull-left[
- [Professor of computer science at UCLA](http://bayes.cs.ucla.edu/jp_home.html); studies AI

- Extensively studied the history of causation in statistics, the problem of inferring causality from data, and the development of path analysis

- Popularized the use of causal graph theory
]

.pull-right[
![](images/pearl.jpg)
]

---

## Article: Rohrer (2018)

**Thinking clearly about correlation and causation: Graphical causal models for observational data**

Psychologists are interested in inherently causal relationships.
* e.g., how does social class cause behavior?

According to Rohrer, how have psychologists attempted to study causal relationships while avoiding issues of ethicality and feasibility?

--

It is impossible to infer causation from correlation without background knowledge about the domain.
* Experimental studies require assumptions as well.

???

"Surrogate interventions" -- e.g., perceived social class with the subjective ladder
* these lack external validity

Avoiding causal language
* Doesn't stop lay persons or other researchers from inferring it; also, maybe that's based on problematic assumptions

Controlling for third variables
* How do we know what to control for?

---

## Directed Acyclic Graphs (DAGs)

* Visual representations of causal assumptions

* Share many features with structural equation models

There are many small differences between SEMs and DAGs, but this is the biggest:

* SEMs are quantitative models -- they are used to estimate various parameters (think, several multiple regressions strung together)

* DAG models are qualitative: you will not use a DAG model to estimate numeric values.

* The benefit of DAG models is that they (1) help you clarify the causal model at the heart of your research question and (2) identify which variables you should and should **not** include in a regression model, based on your causal assumptions.

---

## Directed Acyclic Graphs (DAGs)

### Basic structures

* Boxes or circles represent constructs in a causal model -- these are called **nodes**.

* Arrows represent relationships. A → B

* Relationships can follow any functional form -- linear, polynomial, sinusoidal, etc.

* DAGs only allow single-headed arrows; constructs cannot cause each other, nor can you cycle from a construct back to itself.

    * A ⇄ B is not allowed; A ← U → B is

    * Constructs cannot cause themselves, so no feedback loops.

---


```r
library(ggdag)

dag.obj = dagify(B ~ A)

ggdag(dag.obj)
```

![](13-dag_files/figure-html/unnamed-chunk-1-1.png)<!-- -->

---

## DAGs

**Paths** lead from one node to the next; they can include multiple nodes, and there may be multiple paths connecting two nodes.
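As a quick aside: because `dagify()` builds a `dagitty` object under the hood, the `dagitty` package can also enumerate these paths for you in text form. A minimal sketch, using a hypothetical education/income DAG like the ones on the next few slides:


```r
library(ggdag)
library(dagitty)

# hypothetical example: intelligence causes both education and income,
# and education also causes income
dag.obj = dagify(edu ~ iq,
                 income ~ edu + iq)

# list every path from education to income and whether each one is
# open (i.e., able to transmit an association)
dagitty::paths(dag.obj, from = "edu", to = "income")
```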
There are three kinds of paths:

* Chains

* Forks

* Inverted forks

---

## Paths: Chains

Chains have the structure A → B → C.

Chains can transmit associations from the node at the beginning to the node at the end, meaning that you will find a correlation between A and C. These associations represent *real* causal relationships.


```r
dag.obj = dagify(edu ~ iq,
                 income ~ edu,
                 labels = c("edu" = "Education",
                            "iq" = "Intelligence",
                            "income" = "Income"))

ggdag(dag.obj, text = FALSE, use_labels = "label", label_size = 25) + theme_bw()
```

---

![](13-dag_files/figure-html/unnamed-chunk-2-1.png)<!-- -->

---

## Paths: Forks

Forks have the structure A ← B → C.

Forks can transmit associations, but they are not causal. They are the structure most relevant to the phenomenon of confounding.


```r
dag.obj = dagify(edu ~ iq,
                 income ~ iq,
                 labels = c("edu" = "Education",
                            "iq" = "Intelligence",
                            "income" = "Income"))

ggdag(dag.obj, text = FALSE, use_labels = "label", label_size = 25) + theme_bw()
```

---

![](13-dag_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

## Paths: Inverted Forks

Inverted forks have the structure A → B ← C.

Inverted forks cannot transmit associations. In other words, if this is the true causal structure, there won't be a correlation between A and C.


```r
dag.obj = dagify(income ~ edu,
                 income ~ iq,
                 labels = c("edu" = "Education",
                            "iq" = "Intelligence",
                            "income" = "Income"))

ggdag(dag.obj, text = FALSE, use_labels = "label", label_size = 25) + theme_bw()
```

---

![](13-dag_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

## Complex paths

In reality, the causal pathways between the constructs in your model will be very complex.


```r
dag.obj = dagify(gpa ~ iq,
                 edu ~ gpa + iq,
                 income ~ edu + iq,
                 labels = c("edu" = "Education",
                            "iq" = "Intelligence",
                            "gpa" = "Grades",
                            "income" = "Income"))
```

```r
ggdag(dag.obj, text = FALSE, use_labels = "label", label_size = 25) + theme_bw()
```

---

![](13-dag_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

---

## Confounding

DAG models are most useful for psychological scientists thinking about which variables to include in a regression model. Recall that one of the uses of regression is to statistically control for variables when estimating the relationship between an X and Y in our research question.

Which variables should we be statistically controlling for?

--

We can see with DAG models that we should be controlling for constructs that create forks, or constructs that cause both our X and Y variables. These constructs are known as **confounds**, or constructs that represent common causes of our variables of interest.

Importantly, we don't need to control for variables that cause just X and not Y (think of the issue of suppression).

---


```r
dag.obj = dagify(gpa ~ iq,
                 edu ~ gpa + iq,
                 income ~ edu + iq,
                 labels = c("edu" = "Education",
                            "iq" = "Intelligence",
                            "gpa" = "Grades",
                            "income" = "Income"),
                 exposure = "edu",
                 outcome = "income")

ggdag_paths(dag.obj, text = FALSE, use_labels = "label") + theme_bw()
```

---

![](13-dag_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

What should I control for?

---


```r
ggdag_adjustment_set(dag.obj, text = FALSE, use_labels = "label") + theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-9-1.png)<!-- -->

---

Confounds open "back-doors" between variables: paths that act like causal pathways, but are not. By closing off the back doors, we isolate the true causal pathways, if there are any, and our regression models will estimate those.
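If you would rather read the answer than inspect a plot, `dagitty::adjustmentSets()` reports the same minimal adjustment set that `ggdag_adjustment_set()` draws. A minimal sketch, assuming the education/income `dag.obj` defined a few slides back is still in the workspace:


```r
library(dagitty)

# dag.obj is assumed to be the DAG from the earlier chunk:
# gpa ~ iq, edu ~ gpa + iq, income ~ edu + iq,
# with exposure = "edu" and outcome = "income"
adjustmentSets(dag.obj, exposure = "edu", outcome = "income")

# for this structure the minimal set should be { iq }: adjusting for
# intelligence closes both back-door paths,
# edu <- iq -> income and edu <- gpa <- iq -> income
```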
Based on the causal structure I hypothesized last slide, if I want to estimate how much education causes income, I should control for intelligence, but I don't need to control for grades.

`$$\large \hat{\text{Income}} = b_0 + b_1(\text{Education}) + b_2(\text{Intelligence})$$`

---

## More control is not always better

We can see with forks that controlling for confounds improves our ability to correctly estimate causal relationships. So controlling for things is good...

Except the other thing that DAGs should teach us is that controlling for the wrong things can dramatically hurt our ability to estimate true causal relationships and can create spurious correlations, or open up new associations that don't represent true causal pathways.

Let's return to the other two types of paths, chains and inverted forks.

---

## Control in chains

Chains represent mediation; the effect of construct A on C is *through* construct B.

Intelligence → Educational attainment → Income

What happens if we control for a mediating variable, like education?


```r
dag.obj = dagify(m ~ x,
                 y ~ m,
                 labels = c("x" = "Intelligence",
                            "m" = "Education",
                            "y" = "Income"),
                 exposure = "x",
                 outcome = "y")
```

---

.pull-left[

```r
ggdag_dseparated(dag.obj, text = FALSE, use_labels = "label") + theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-11-1.png)<!-- -->
]

.pull-right[

```r
ggdag_dseparated(dag.obj, controlling_for = "m",
                 text = FALSE, use_labels = "label") + theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-12-1.png)<!-- -->
]

---

Controlling for mediators removes the association of interest from the model -- you may be removing the variance you most want to study!

Inverted forks also teach us what not to control for: colliders. A **collider** for a pair of constructs is any third construct that is caused by both constructs in the pair. Controlling for (or conditioning on) these variables introduces bias into your estimates.

Intelligence → Income ← Educational attainment

What happens if we control for the collider?


```r
dag.obj = dagify(c ~ x,
                 c ~ y,
                 labels = c("x" = "Intelligence",
                            "c" = "Income",
                            "y" = "Education"),
                 exposure = "x",
                 outcome = "y")
```

---

.pull-left[

```r
ggdag_dseparated(dag.obj, text = FALSE, use_labels = "label") + theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-14-1.png)<!-- -->
]

.pull-right[

```r
ggdag_dseparated(dag.obj, controlling_for = "c",
                 text = FALSE, use_labels = "label") + theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-15-1.png)<!-- -->
]

---

## Collider bias

### Unexpected colliders

Missing data or restricted-range problems can introduce collider bias.


```r
dag.obj = dagify(c ~ x,
                 c ~ y,
                 labels = c("x" = "Respiratory Disease",
                            "c" = "Hospitalization",
                            "y" = "Locomotor Disease"),
                 exposure = "x",
                 outcome = "y")
```

---

.pull-left[
Sampling from a specific environment (or a specific subpopulation) can result in collider bias.

Be wary of:

* Missing data

* Subgroup analysis

* Any post-treatment variables

This extends not just to controlling for constructs, but also to estimating interactions with them. If your moderating variable is a collider, you've introduced collider bias.
]

.pull-right[

```r
ggdag_dseparated(dag.obj, controlling_for = "c",
                 text = FALSE, use_labels = "label") + theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-17-1.png)<!-- -->
]

.small[Sackett et al. (1979)]

---

## Example

A health researcher is interested in studying the relationship between dieting and weight loss.
She collects a sample of participants, measures their weight, and then asks whether they are on a diet. She returns to these participants two months later and assesses their weight again.

What should the researcher control for when assessing the relationship between dieting and weight loss?

---


```r
dag.obj = dagify(diet ~ wi,
                 wf ~ diet + wi,
                 loss ~ wi + wf,
                 labels = c("diet" = "Diet",
                            "wi" = "Initial Weight",
                            "wf" = "Final Weight",
                            "loss" = "Weight loss"),
                 exposure = "diet",
                 outcome = "loss")
```

---


```r
ggdag_adjustment_set(dag.obj, text = FALSE, use_labels = "label") + theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-19-1.png)<!-- -->

---

What should be controlled here?

![](13-dag_files/figure-html/unnamed-chunk-20-1.png)<!-- -->

---


```r
ggdag_adjustment_set(dag.obj) + theme_bw()
```

![](13-dag_files/figure-html/unnamed-chunk-21-1.png)<!-- -->

---

There is a **strong** assumption of DAG models, and that is that no relevant variables have been omitted from the graph and that the causal pathways have been correctly specified. This is a tall order, and potentially impossible to meet. Moreover, two different researchers may reasonably disagree about what the true model looks like.

It's your responsibility to decide for yourself what you think the model is, provide as much evidence as you can, and listen with an open mind to those who disagree.

![](images/xkcd.png)

<!-- --- -->

<!-- Correlation can imply causation, provided the appropriate temporal precedence is considered and the correct number and specification of third-variable causes has been included in the model. -->
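---

## Collider bias in simulation

To make the collider warning concrete, here is a minimal simulated sketch (hypothetical data, not from any real study): intelligence and education are generated independently, both cause income, and conditioning on income induces a spurious negative association between them.


```r
set.seed(123)
n = 10000

# two causes that are truly independent of each other
intelligence = rnorm(n)
education    = rnorm(n)

# income is a collider: it is caused by both
income = intelligence + education + rnorm(n)

# unconditional association: essentially zero
cor(intelligence, education)

# controlling for the collider in a regression induces a
# spurious negative coefficient for intelligence
coef(lm(education ~ intelligence + income))["intelligence"]
```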