STA220 Data Analysis Project Instruction

New Due Date: Tuesday June 11th

Submit in Tutorials at 6:10 p.m.

Purpose:

The objective of this project is to give you the opportunity to use some of the statistical techniques that you have learned in this course for exploring a real data set.

Submission Format of Data Analysis Report:

You are required to give your answers to related questions based on the analysis of this data in the template posted on our Quercus page. Please print your filled-in template and attach your R outputs to it.

You may work individually or in groups of no more than three students. Your group members can be from different tutorial sections in our class. If you are working in a group, please think of creating a team name and note that you will submit only one report (filled-in template with R output of the data analysis project) on the new due date in your tutorial (or your team member’s tutorial) on Tuesday June 11th at 6:10 p.m.

The ways in which your data analysis project will be assessed are described in the “Assessment of Data Analysis Project” section at the end of this document.

Context of Data:

The Organisation for Economic Co-operation and Development (OECD) gathers various information regarding OECD countries and their partners in order to promote policies that aim to improve the economic and social well-being of people around the world (http://www.oecd.org/about/).

This agency collects quantitative information on many domains and makes the collected data available for public use (e.g., by researchers) so that interested individuals can further investigate relationships among a set of variables. A domain named “Social Protection and Well-being” includes a yearly data collection, the “Better Life Index”, from OECD countries. This information can be retrieved from: http://stats.oecd.org

From the most recent “Better Life Index” (BLI) data, published in 2019 but collected in 2018, we will analyze a quantitative variable named “Social Network Support”. Information regarding this variable can be retrieved from: http://www.oecd.org/statistics/OECD-Better-Life-Index-definitions-2019.pdf#_ga=2.145820212.1027110605.1559482147-696144184.1473183978 (note that this definition document is posted on our Quercus page in the module Data Analysis Project).

This variable is a sub-component of the Social connections/Community component in BLI, which reflects the percentage of males and females aged 15 years and over in 36 OECD countries who perceive their social network as having relatives or friends that they can count on to help them in times of need and trouble. OECD indicates that they obtained and calculated this information based on a certain Poll.

Let us recap the variables of interest in our data analysis:

1. Percentage of people (15 years of age and older) having social network support

2. Sex of the respondents identified as Male, Female

I recommend that you read about this data here:

http://www.oecdbetterlifeindex.org/#/11111111111

Also, click on “Community” in the right-hand side menu to be directed to another web link:

http://www.oecdbetterlifeindex.org/topics/community/

Scroll down that page and you can click and read about each country’s social network support percentage.

R Activity:

1. Understanding and comparing the distributions of percentages of males and females who reported having Social Network Support in the 36 OECD countries.

2. Examining the relationship between the percentages of males and females who reported having Social Network Support in the 36 OECD countries. That is, we aim to predict the percentage of males’ social network support from the percentage of females’ social network support.

Overview of Steps:

1. Save the data file SocNet_BLI2019.txt on your computer (in your default R working directory).

2. Save the R script Support_Net_BLI2019.R.

3. Open RStudio. Go to File > Open File, search for the saved Support_Net_BLI2019.R script on your computer, and open it.

4. Run each line of code step by step. Please make the necessary changes to the code as specified in the R script. If you are working as a team, replace the specified variable names with your team name. Also, make sure to give appropriate titles to the required plots. The code does not supply titles, but it directs you to write them. Copy your R outputs and paste them into a document (e.g., a Word document) to analyze this data.

5. Work on the related questions below (Parts 1 to 3) in order to interpret the results of your data analysis.
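The loading step in the overview can be sketched as follows. This is only an illustration, not the course script: the column names (Country, Female, Male) and the three rows are invented placeholders, and the real layout of SocNet_BLI2019.txt may differ. To stay self-contained, the sketch writes a tiny stand-in file to a temporary location instead of reading the real one.

```r
# Hypothetical stand-in for SocNet_BLI2019.txt.
# Column names and values are invented for illustration only.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Country Female Male",
             "A 92 90",
             "B 88 89",
             "C 95 93"), tmp)

# header = TRUE takes the column names from the first line of the file.
socnet <- read.table(tmp, header = TRUE, stringsAsFactors = FALSE)

str(socnet)     # check the type of each variable
head(socnet)    # inspect the first rows

# With the real file saved in the R working directory, the call would
# presumably look like this (assuming a header line and whitespace-separated
# columns):
# socnet <- read.table("SocNet_BLI2019.txt", header = TRUE)
```

Running getwd() in the console shows the current working directory, which is where read.table() looks for a file given without a path.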

Part 1. Identify the Elements of Statistics and Method of Data Collection.

1. Who are the cases in this study?

2. Identify the population of interest in the context of this study.

3. Identify the sample in the context of this study.

4. Identify the population parameter(s) of interest in the context of this study.

5. What is/are the variable(s) of interest in this study? Identify their type and their scale of measurement.

6. Think about the purpose of this study. Why was this study conducted?

7. Where was this study conducted?

8. When was the study conducted?

9. How was the data for this study collected? Hint: Read the web page on OECD community, http://www.oecdbetterlifeindex.org/topics/community/, to find the answer to this question.

Part 2. Compare percentages of perceived social network support between males and females.

1. Suppose that the researchers are interested in investigating the relationship between percentages of perceived social network support and the sex of the respondents in the 36 OECD countries. Identify the response variable and the explanatory variable in the context of this study.

2. Use the side-by-side boxplots and the summary statistics to compare the distributions of the percentage of perceived social network support of males and females in the OECD countries. That is, compare the shapes, centres, and spreads of both distributions and identify any outliers.

3. Use the boxplot and summary statistics for the differences between females’ and males’ percentages of perceived social network support (in each country) to describe what is apparent in this plot that is not apparent in the other (the side-by-side boxplots of percentages of perceived social network support by sex). Describe the shape, centre, and spread of this distribution. Indicate which countries are suspected outliers (plotted individually on the boxplot) and what makes them unusual. That is, use the 1.5 × IQR rule to determine whether the outlying points are suspected outliers. Also, find the number of standard deviations by which each potential outlier lies from the overall mean of this distribution. Discuss why this graph (boxplot of differences) is more useful for learning about differences between males and females in the OECD countries.
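The 1.5 × IQR rule and the standard-deviation distance mentioned in question 3 can be sketched in R as below. The six countries and their percentages are invented for illustration and are not the OECD data; the variable names are likewise hypothetical and need not match those in Support_Net_BLI2019.R.

```r
# Invented percentages for six hypothetical countries (NOT the OECD data).
country <- paste0("Country", 1:6)
female  <- c(92, 88, 95, 90, 85, 96)
male    <- c(90, 89, 93, 89, 70, 94)

summary(female)                      # summary statistics for females
summary(male)                        # summary statistics for males

diff_fm <- female - male             # female minus male, per country

# 1.5 x IQR rule: the fences lie 1.5 IQRs beyond the quartiles.
q     <- unname(quantile(diff_fm, c(0.25, 0.75)))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
suspect <- diff_fm < lower | diff_fm > upper

# Distance of each difference from the mean, in standard deviations.
z <- (diff_fm - mean(diff_fm)) / sd(diff_fm)

print(data.frame(country, diff_fm, suspect, z = round(z, 2)))

# boxplot() flags points via boxplot.stats(), which uses hinges rather than
# quantile(); the two can differ slightly in small samples.
boxplot.stats(diff_fm)$out
```

With these made-up numbers, the second and fifth countries fall outside the fences. Because each difference is computed within a country, the boxplot of differences removes the country-to-country variation that dominates the side-by-side boxplots.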

Part 3. Predict percentage of males’ perceived social network support from percentage of females’ perceived social network support.

1. Use the scatterplot of percentage of males’ perceived social network support versus percentage of females’ perceived social network support to describe the relationship.

2. What is the estimated correlation coefficient? Interpret this value.

3. If we examined only those countries with a percentage of perceived social network support of over 90 for both sexes, what would happen to the correlation? Discuss why that would happen.

4. Fit a linear regression model relating percentage of males’ perceived social network support to percentage of females’ perceived social network support. That is, fit a straight line for predicting percentage of males’ perceived social network support from percentage of females’ perceived social network support. What is the equation of the regression line?

5. What does the regression line tell us in the context of this study?

6. What does the slope of the regression line mean in the context of this study?

7. Note that the slope of the line does not differ much from 1.00. What would a slope of 1.00 indicate about the nature of the relationship? If we fitted a model with the slope fixed at 1.00, what prediction equation would you expect to get? (Hint: Refer to the summary statistics for males and females. Find the mean percentage of perceived social network support for males and for females to answer this question.)

8. Can we interpret the value of the y-intercept in the regression equation at all? Justify your answer.

9. What is the standard deviation of residuals? Interpret this value in the context of this problem.

10. Use the plots of residuals to assess the overall adequacy of the linear regression model fit to this data. State the assumption(s) about the residuals that each of the constructed plots checks and determine whether the assumption(s) is/are met.

11. In which country or countries do the male respondents have a “somewhat unusually” lower percentage of perceived social network support relative to the female respondents, according to the regression model? Give the residual(s) to justify your argument.

12. Give and interpret the R² value in the context of this study.
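The quantities asked for in Part 3 can be obtained with base R functions, as in the sketch below. The data are invented for six hypothetical countries and the object names are placeholders; the course script may organise these steps differently.

```r
# Invented percentages for six hypothetical countries (NOT the OECD data).
female <- c(92, 88, 95, 90, 85, 96)
male   <- c(90, 89, 93, 89, 80, 94)

r   <- cor(female, male)            # correlation coefficient
fit <- lm(male ~ female)            # response ~ explanatory

coef(fit)                           # estimated intercept and slope
s_e <- sigma(fit)                   # residual standard error (n - 2 df)
r2  <- summary(fit)$r.squared       # R-squared
res <- resid(fit)                   # observed minus predicted male percentage

# The most negative residual marks the case where the male percentage is
# furthest below what the line predicts from the female percentage.
res[which.min(res)]

# Residual plots (drawn only in an interactive session such as RStudio).
if (interactive()) {
  plot(fitted(fit), res, main = "Residuals vs fitted (invented data)")
  abline(h = 0, lty = 2)
  qqnorm(res)
  qqline(res)
}
```

In simple linear regression R² equals the square of the correlation coefficient, so r2 and r^2 agree up to rounding.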

Assessment of Data Analysis Project

Last Name of Student

1. ________________________________________

2. ________________________________________

3. ________________________________________

Part 1, Question                                            Point(s)   Point(s) Received
1: Identify the cases in this study                         1
2: Identify population of interest in this study            1
3: Identify the sample in this study                        1
4: Identify population parameter(s) of interest             1
5: Identify variable(s) of interest in this study           2
6: Identify the purpose                                     1
7: Location (where) of this study                           1
8: Time (when) of this study                                1
9: Data collection (how)                                    1
Total                                                       10

Part 2, Question                                            Point(s)   Point(s) Received
1: Identify the response and explanatory variables          2
2: Interpretation of Side-by-side boxplots                  6
3: Interpretation of boxplot of differences                 12
Total                                                       20

Part 3, Question                                            Point(s)   Point(s) Received
1: Interpretation of scatterplot                            1
2: Interpretation of correlation coefficient, r             1
3: Interpretation of restricting the range                  1
4: Identifying the equation of the regression line          1
5: Interpretation of the regression equation                1
6: Interpretation of the estimated slope                    1
7: Realization of fixing slope at 1                         2
8: Interpretation of the estimated y-intercept              1
9: Interpretation of the standard deviation of residuals    1
10: Diagnostic check of residuals using plots               2
11: Detection of unusual residual values                    2
12: Interpretation of the value of R-squared                1
Total                                                       15

Inclusion of R outputs: necessary modifications made to variable names and titles of plots 5

Total Points 50

Marked by TA:

Comment (if any):
