2B03 Assignment 5
Regression Modeling and Prediction (Chapters 10 & 11)
Matthew Musulin 400329990
Due Thursday December 2 2021

Instructions: You are to use R Markdown for generating your assignment output file. You begin with the R Markdown script downloaded from A2L, and need to pay attention to information provided via introductory material posted to A2L on working with R, R Markdown, and downloading data from ODESI. Having downloaded all necessary files, placed them in the same folder/directory, and added your answers to the R Markdown script, you then are to generate your output file using “Knit to PDF” and, when complete, upload both your R Markdown file and your PDF file to the appropriate folder on A2L.

1. Define the following terms in a sentence (or short paragraph) and state a formula if appropriate (this question is worth 5 marks).

i. Test of Significance
A formal procedure for comparing observed data with a hypothesis whose truth we want to assess. The hypothesis is usually a statement about population parameters. The result of the test is expressed as a probability (the p-value) that measures how well the data and the hypothesis agree.

ii. Coefficient of Determination
A measure of how well an estimated regression line fits the sample data: the proportion of the total variation in the response that is explained by the model. $R^2 = SSM/SST$, where SSM is the model sum of squares and SST is the total sum of squares.

iii. Multiple Regression Analysis
Regression analysis in which the response is modeled using several explanatory variables rather than one, for example $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon$.

iv. Individual Prediction Interval
An estimate of an interval in which a future observation will fall, with a certain probability, given what has already been observed. Prediction intervals are often used in regression analysis.

2. An economist is studying the relationship between unemployment and inflation, and has collected the following data. Inflation appears in columns, unemployment in rows (this question is worth 5 marks).
                          Inflation
Unemployment   Abated   Unchanged   Accelerated   Total
Lower               5           5            10      20
Unchanged           5          35            20      60
Higher             20           0             0      20
Total              30          40            30     100

The data in the table above summarize the relationship between unemployment and inflation based on 100 months of data. For instance, for 35 months, inflation and unemployment were unchanged, while for 10 months inflation had accelerated and unemployment was lower.
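As a rough sketch of how the table above feeds a test of independence, the R snippet below enters the observed counts with labels and computes the counts that would be expected if unemployment and inflation were independent (row total times column total, divided by the grand total). The object names `obs`, `expected`, `chi_sq` and `df` are illustrative choices, not part of the assignment.

```r
# Observed counts from the table: rows = unemployment, columns = inflation.
obs <- matrix(
  c( 5,  5, 10,
     5, 35, 20,
    20,  0,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    unemployment = c("Lower", "Unchanged", "Higher"),
    inflation    = c("Abated", "Unchanged", "Accelerated")
  )
)

# Expected counts under independence: (row total * column total) / grand total.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-squared statistic and its degrees of freedom.
chi_sq <- sum((obs - expected)^2 / expected)
df     <- (nrow(obs) - 1) * (ncol(obs) - 1)

print(round(expected, 2))
print(c(chi_sq = chi_sq, df = df))
```

Every expected count here is at least 5 (the smallest is 6), which is the usual rule of thumb for relying on the chi-squared approximation. The hand-computed statistic is about 65.278 on 4 degrees of freedom, the same value that `chisq.test()` reports for this table.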
Using the data in the table above, conduct an appropriate hypothesis test of independence between inflation and unemployment at the 5% level of significance.

row1 = c(5, 5, 10)
row2 = c(5, 35, 20)
row3 = c(20, 0, 0)
Matrix = matrix(c(row1, row2, row3), nrow = 3, byrow = TRUE)
chisq.test(Matrix, correct = TRUE)

##
##  Pearson's Chi-squared test
##
## data:  Matrix
## X-squared = 65.278, df = 4, p-value = 2.249e-13

The null hypothesis H0 is that unemployment and inflation are independent. Because the p-value, 2.249e-13, is less than 0.05, we reject H0. There is statistically significant evidence at the 5% level of a dependence between unemployment and inflation.

3. Consider the following dataset on the final grade received in a particular course (grade) and attendance (attend, number of times present when work was handed back during the semester out of a maximum of six times). Note that R has the ability to read datafiles directly from a URL, so here (unlike the odesi data that you manually retrieved) you do not have to manually download the data providing you are connected to the internet (this question is worth 5 marks).

course <- read.table("https://socialsciences.mcmaster.ca/racinej/files/attend.RData")
attach(course)

i. Run a regression of grade on attend using the R command lm(). What is the impact of a 1 unit increase of attend on the expected grade based on your model?

attach(course)

## The following objects are masked from course (pos = 3):
##
##     attend, grade

lm(attend~grade)

##
## Call:
## lm(formula = attend ~ grade)
##
## Coefficients:
## (Intercept)        grade
##    -0.53515      0.05971

The call above regresses attend on grade, so the slope 0.05971 estimates the change in expected attendance associated with a one-point higher grade (about 0.06 occasions). The question asks for the regression of grade on attend, that is lm(grade ~ attend), whose slope is the estimated change in expected grade for a 1 unit increase in attend. The positive slope shows that the two variables are positively associated: higher attendance goes with a higher expected grade.

ii. In class we distinguished between correlation and causation and cautioned against inferring causation from statistical correlation. Do your results suggest that an individual who increased their attendance by 1 unit would also experience an increase in their expected grade? Why or why not?
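Explain the roles of confounders in this context (e.g. along the lines of Sir R. A. Fisher’s concerns).

A correlation between two variables does not by itself mean that a change in one causes a change in the other; causation means that one event is the result of the occurrence of the other. The regression shows only that higher attendance is associated with a higher expected grade in this sample. It does not establish that a given individual who attended one more time would earn a higher grade, because the data are observational. Confounders, for example student motivation or prior ability, may influence both attendance and grades and so produce the association without attendance itself causing the higher grade.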
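For part i of question 3, the direction of the regression matters. The sketch below uses a small made-up data frame (the values in `demo` are invented for illustration and are not the course data) to show that `lm(grade ~ attend)` and `lm(attend ~ grade)` estimate different slopes.

```r
# Made-up illustration data (NOT the course data set used in the assignment).
demo <- data.frame(
  attend = 0:6,
  grade  = c(52, 58, 61, 70, 72, 80, 83)
)

# Regression of grade on attend: the response goes on the left of the tilde.
# The slope is the change in expected grade per extra unit of attendance.
fit_grade_on_attend <- lm(grade ~ attend, data = demo)
slope_grade <- unname(coef(fit_grade_on_attend)["attend"])

# Reverse regression: attend is now the response.
# The slope is the change in expected attendance per extra grade point.
fit_attend_on_grade <- lm(attend ~ grade, data = demo)
slope_attend <- unname(coef(fit_attend_on_grade)["grade"])

print(slope_grade)
print(slope_attend)
```

The two slopes share the sign of the sample correlation, but they answer different questions and are not reciprocals of each other unless the fit is perfect. Passing `data = demo` also avoids the masking messages that repeated `attach()` calls produce.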
4. This question requires you to download data obtained from Statistics Canada. If you are working on campus go to www.odesi.ca (off campus users must first sign into the McMaster library via libaccess at library.mcmaster.ca/libaccess, search for odesi via the library search facilities, then select odesi from these search results). Next, select the “Find data” field in odesi and search for “Labour Force Survey June, 2021”, then scroll down and select the Labour Force Survey, June 2021 [Canada]. Next click on the “Explore & Download” icon, then click on the download icon (i.e., the diskette icon, square, along the upper right of the browser pane), then click on “Select Data Format”, and scroll down and select “Comma Separated Value file” (csv) which, after a brief pause, will download the data to your hard drive (you may have to extract the file from a zip archive depending on which operating system you are using).

Finally, make sure that you place this csv file in the same directory as your R code file (this file ought to have the name LFS-71M0001-E-2021-June_F1.csv), and in RStudio select the menu item Session -> Set Working Directory -> To Source File Location. There will be another file with (almost) the same name but with the extension .pdf that is the pdf documentation that describes the variables in this data set. Note that it would be prudent to retain this file as we will use it in future assignments (this question is worth 10 marks).

Next, open RStudio and make sure this csv file and your R Markdown script are in the same directory (in RStudio open the Files tab, in the lower right pane by default, and refresh the file listing if necessary). Then read the file as follows:

lfp <- read.csv("LFS-71M0001-E-2021-June_F1.csv")

This data set contains some interesting variables on the labour force status of a random subset of Canadians. We will focus on the variable HRLYEARN (hourly earnings) described on page 22 of the pdf file LFS-71M0001-E-2021-June.pdf.
We will also consider other variables so that we can conduct multiple regression analysis.

i. Following assignment 1, consider hourly earnings and highest educational attainment for people in the survey and consider both high school graduates (EDUC==2) and those holding a bachelors degree (EDUC==5). To construct these subsets we can use the R command subset as follows (the ampersand, &, is the logical operator "and"; see ?subset for details on the subset command):

hs <- subset(lfp, FTPTMAIN==1 & EDUC==2 & HRLYEARN > 0)$HRLYEARN
ba <- subset(lfp, FTPTMAIN==1 & EDUC==5 & HRLYEARN > 0)$HRLYEARN

These commands simply tell R to take a subset of the data frame lfp for full-time workers having either a high school diploma or university bachelors degree for those reporting positive earnings, and then retain only the variable HRLYEARN and store these in the variables named hs (hourly earnings for high-school graduates) or ba (hourly earnings for university graduates). Conduct a t-test of the hypothesis that the average wage is equal for the two groups using the R function t.test() (see ?t.test for details).

t.test(hs,ba)

##
##  Welch Two Sample t-test
##
## data:  hs and ba
## t = -52.007, df = 13783, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.81520 -10.95693
## sample estimates:
## mean of x mean of y
##  24.87665  36.26271

The null hypothesis H0 is that mean hourly earnings are equal for the two groups. Because the p-value is less than 2.2e-16, far below 0.05, we reject H0. The sample mean is 24.88 for high school graduates and 36.26 for bachelors degree holders, a difference of about 11.39 per hour, with a 95 percent confidence interval for the difference (hs minus ba) of -11.82 to -10.96.
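To make the Welch test reported above less of a black box, the sketch below recomputes the t statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with `t.test()`. The vectors `hs_demo` and `ba_demo` hold made-up wages for illustration only; they are not drawn from the Labour Force Survey file.

```r
# Made-up hourly wages (NOT the LFS data); two small illustrative groups.
hs_demo <- c(21.5, 24.0, 26.3, 22.8, 27.1, 25.4)
ba_demo <- c(33.0, 38.2, 35.6, 37.9, 34.4, 39.1, 36.0)

# t.test() defaults to var.equal = FALSE, i.e. the Welch two-sample test.
res <- t.test(hs_demo, ba_demo)

# Per-group variance of the sample mean.
v_hs <- var(hs_demo) / length(hs_demo)
v_ba <- var(ba_demo) / length(ba_demo)

# Welch t statistic: difference in means over its standard error.
t_manual <- (mean(hs_demo) - mean(ba_demo)) / sqrt(v_hs + v_ba)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_manual <- (v_hs + v_ba)^2 /
  (v_hs^2 / (length(hs_demo) - 1) + v_ba^2 / (length(ba_demo) - 1))

print(c(t_manual = t_manual, t_from_test = unname(res$statistic)))
print(c(df_manual = df_manual, df_from_test = unname(res$parameter)))
```

The non-integer degrees of freedom are a feature of the Welch approximation, which does not assume the two groups have equal variances.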