Instructions

Please complete this assignment using the RMarkdown file provided. Once you download the RMarkdown file please (1) include your name in the preamble, (2) rename the file to include your last name (e.g., “weston-homework-1.Rmd”). When you turn in the assignment, include both the .Rmd and knitted .html files.

To receive full credit on this homework assignment, you must earn 30 points. You may notice that the total number of points available to earn on this assignment is 65 – this means you do not have to answer all of the questions. You may choose which questions to answer. You cannot earn more than 30 points, but it may be worth attempting many questions. Here are a couple things to keep in mind:

  1. Points are all-or-nothing, meaning you cannot receive partial credit if you correctly answer only some of the bullet points for a question. All must be answered correctly.

  2. After the homework has been graded, you may retry questions to earn full credit, but you may only retry the questions you attempted on the first round.

  3. The first time you complete this assignment, it must be turned in by 9am on the due date (March 6). Late assignments will receive 50% of the points earned. For example, if you correctly answer questions totaling 28 points, the assignment will receive 14 points. If you resubmit this assignment with corrected answers (a total of 30 points), the assignment will receive 15 points.

  4. You may discuss homework assignments with your classmates; however, it is important that you complete each assignment on your own and do not simply copy someone else’s code. If we believe one student has copied another’s work, both students will receive a 0 on the homework assignment and will not be allowed to resubmit the assignment for points.

Data: There are three datasets used in this homework assignment. Please refer below to descriptions of each.

-homework-happy: The dataset looks at happiness in college as a relationship with school success, friendship quality, SES, and an intervention group (1 = control, 2 = study skills training, 3 = social skills training). (Note that the variable names have spaces in them, which may make it tricky to work with. You might consider renaming the variables.)

-homework-world: Remember this? These data from the World Happiness Survey, which we used in PSY 611. As a reminder, for each country we have the average response to the happiness ladder, the log of the gross domestic product per capita, the proportion of participants who believed they had friends and family they could count on, average life expectancy at birth, the proportion of participants satisfied with their freedom, the average amount donated to charity (generosity, residual, adjusting for GDP), the average response to two corruption items, and the country development status (first world, second world, third world, or other).

-homework-health: In this study, students reporting to Health Services for anxiety complaints were asked to participate in a clinical treatment study. Those agreeing to participate were randomly assigned to one of the following conditions: waiting list control, meditation, medication, cognitive behavior therapy, or exercise. After three-weeks of treatment (or waiting in the case of the controls), participants completed an anxiety inventory (scores could range from 0 to 100), with higher numbers indicating greater anxiety.

2-point questions

Question 1

Using the data homework-happy, create a regression model predicting happiness from friendship quality, SES, and school success. Interpret each regression coefficient.

Question 2

Using the data homework-happy, calculate the semi-partial and partial correlations between happiness and friendship quality, controlling for SES and school success Consider health to be the outcome. Use a series of multiple regression analyses (i.e., do not use a function for semi-partial and partial correlations). What do you conclude from these analyses?

Question 3

Health researchers commonly acknowledge that cardiac arrest is caused by high levels of cholesterol. They also agree that cholesterol is caused by smoking and weight. Moreover, it is generally agreed that both the choice to smoke and choices that contribute to weight are caused by generally unhealthy life styles. (Note, these are not the only causal factors contributing to either smoking or weight, but we’re trying to keep the example manageable, for a 2-point problem). Build a DAG model that represents these known relationships. What are the different paths that transmit associations from smoking to cardiac arrest?

Question 4

  • In the question above, what variables, if any should a researcher control for if they want to estimate the causal association between smoking and cardiac arrest?

  • Should the researcher control for cholesterol? Why or why not?

Question 5

  • Using the dataset homework-happy, run a two-predictor regression model predicting happiness by friendship and school success and the interaction between the two.

  • Describe your hypothesis for the interaction using the study variables (i.e., don’t write a formula, but describe the hypothesis in words).

  • Describe in words exactly what the coefficients \(b_1\), \(b_2\), and \(b_3\) are telling us in this model.

5-point questions

Question 1

  • Using the homework-world dataset, conduct a separate analysis of variance on each of the measures Happiness, GDP, Life Expectancy, Freedom, Generosity and Corruption, with World as the grouping variable. In a single table, report the degrees of freedom, F-statistic, p-value, and effect size \((\eta^2)\) for each analysis (one row for each analysis).

  • If the F test is significant for an analysis, conduct follow-up pairwise comparisons using a Holm correction. Report these in a table (one table per analysis).

Question 2

Using the dataset homework-happy, run a two-predictor regression predicting happiness by friendship and SES.

  • Check each of the six assumptions discussed in class. List each assumption and state how you examined (or would examine) that assumption. Note that not every assumption can be directly examined. Include plots where applicable and interpret.

  • Examine possible outliers in the dataset by examining one measure of leverage, one measure of distance, and one measure of influence.

Question 3

Refer to the model built in Question 2.5.

  • Conduct 3 simple regression equations (using the model from above): the regression of Y on \(X_1\) at the mean of \(X_2\), at one standard deviation above the mean of \(X_2\), and at one standard deviation below the mean of \(X_2\). Your choice of moderating variable depends on your hypothesis. Write out and interpret each of these simple slope equations.

  • Graph the three simple slopes within a single plot.

  • Are any of these simple slopes significantly different from 0?

  • Now run the model again, but using standardized variables. Compute simple slopes. Describe the differences between the standardized and unstandardized models.

10-point questions

Question 1

This question uses the dataset homework-heath. Health service clinicians have the following research hypotheses:

  1. Any treatment is better than doing nothing.
  2. Medication is the most effective treatment overall.
  3. Freshmen suffer from more anxiety than upperclassmen, regardless of what is done for them to alleviate their anxiety.
  4. Upperclassmen, compared to freshmen, are better able to benefit from meditation. 5. Freshmen, by contrast, are likely to lapse into rumination during meditation, making their anxiety symptoms worse. Upperclassmen, compared to freshmen, are better able to benefit from exercise. Freshmen are less likely to see the stress-reduction benefits of exercise. Instead they see exercise as detracting from time that could be spend studying, producing even worse anxiety symptoms.
  • Conduct a factorial analysis of variance of the anxiety measure using class (freshmen, upperclassmen) and treatment condition (waiting list control, meditation, medication, cognitive behavior therapy, or exercise) as factors. Report a summary of the results (source of variance, df, SS, MS, F, and p) in a table.

  • Display in a graph the means and 95% confidence intervals for the 2 x 5 design.

  • Calculate \(\eta^2\) and partial \(\eta^2\) for the main effects and interaction. Report these in a table. Which sums of squares method did you use and why?

  • Conduct follow-up comparisons. Use some kind of adjustment to account for multiple comparisons. List the significant comparisons in a table.

  • Check the homogeneity of variance and normality assumptions. Are they met?

  • Given all that you learned from this analysis, comment on the support for each of the research hypotheses.

Question 2

Homoscedasticity is the assumption that the variance of the outcome (Y) is the same across all levels of the predictor (X). If you find that you have heteroscedasticiy in your dataset, you may choose to use a robust estimation procedure, like weighted least squares to analyze your data. In this question, you’ll apply a WLS approach to a regression model and compare the results to the OLS solution, with the goal of understanding how and why WLS deals with heteroscedasticiy.

  • Using the dataset homework-world, create a regression model predicting Support from Corruption. Interpret the slope coefficient.

  • Check the assumption of homogeneity of variance. What do you conclude?

A weighted least squares approach weights each observation or case in the dataset by some value, so some cases contribute more to the final regression model than others. In matrix algebra terms:

\[\hat{\beta} = (X'WX)^{-1}X'WY\] (instead of) \[\hat{\beta} = (X'X)^{-1}X'Y\] Estimating the weights is a bit of art. One common strategy is to extract the residuals from the OLS solution, take the square root of the absolute value of these residuals, and regress these new values onto X \((\text{i.e., }\sqrt{|e|} = b_0+b_1X)\). The predicted values of this regression model become the weights.

  • Use this method to calculate weights for each observation your dataset. Save the weights as an object called weight.obj.

  • Now run a WLS regression by using the same code as you would for the OLS, but add the argument weights = weight.obj. How do the regression coefficients compare?

  • Create a figure that shows (a) the raw data, (b) the best fit line according to the OLS solution, and (c) the best fit line according to the WLS solution. Looking at this plot, how would you interpret the WLS solution in terms of the cases that it is fitting to (or which cases is it ignoring)? How might this solve the issue of heteroscedasticity?

20-point questions

Question 1

In lecture (17-interactions), we simulated the power to detect an interaction when X and Z were drawn from a normal distribution. However, in this simulation, we assumed that X and Z were uncorrelated.

  • Repeat the simulations performed in class with the following changes:
  1. Include a third scenario in which X and Z are correlated with each other.
  2. Vary the sample size – specifically, run each experiment with 50 subjects, 100 subjects, 150 subject… up to 2000 subjects.
  3. Simulate each version of each experiment 1,000 times, not 100 times. (So you should run 1,000 simulations for the experimental study with 50 subjects, and 1,000 simulations for the experimental study with 100 subjects, etc).

Don’t forget to set a seed so your results are reproducible.

  • For each scenario, calculate (a) the average effect size of the interaction and (b) the proportion of studies that found a significant effect. Display these results in a table.

  • Create a figure that shows the relationship of power (i.e., the proportion of studies that found a significant effect) to sample size for each of the three scenarios. How many participants do you think you need to achieve 80% power for each of the three scenarios?