
158.222-2019 Semester 1 Massey University

Page 1 of 11

ASSIGNMENTS 2 AND 3 – PREDICTING HAPPINESS AND DOING OTHER STUFF

Deadline: Hand in by midnight 4 May 2019 (end of week 8)

Evaluation: Assignment 2: 10% of your final course grade. Assignment 3: 10% of your final course grade.

Late Submission: Refer to the course guide.

Work: These assignments are to be done individually.

Purpose: Implement the entire data science/analytics workflow. Use regression techniques to solve real-world problems. Gain skills in extracting data from the web using APIs and web scraping. Build on the data wrangling, data visualization and introductory data analysis skills gained up to this point, as well as problem formulation and presentation of findings. Gain skills in kNN regression modelling or supervised learning, and unsupervised learning.

Learning outcomes 1 - 5 from the course outline.

Please note that all data manipulation must be written in Python code in the Jupyter Notebook environment. No marks will be awarded for any data wrangling that is completed in Excel.

Assignments 2 and 3 are related to each other and have the same due date. However, they are to be submitted separately. Create a separate notebook for each assignment.

These assignments will take longer than you think, so do not leave starting them until the last minute. You have the tools you need to start now.

As of the week 5 lecture, you will have been introduced to tools that will assist you in completing assignment 2. By week 7 (before semester break) you will be able to complete most of assignment 3, except for task 3 (unsupervised learning), which you can complete after the week 8 lecture.

Terminology:

Note that ‘feature’ and ‘variable’ refer to the same thing: an array of values that represent a data attribute. For the purposes of your work, this will usually be in the form of a Pandas series (column in a dataframe). These terms are used interchangeably in this specification.

The World Bank refers to country attributes as ‘indicators’.

‘Predictors’, ‘explanatory variables/features’ and ‘input variables/features’ are terms used interchangeably and refer to arrays of values that represent data attributes, that can be input into a model and are distinct from the target variable (the value you are trying to predict).

Before commencing, read through both assignment specifications for context.


****************
*** Plagiarism ***
****************

It is mandatory that any assessment items that you submit during your University study are your own work. Massey University takes a firm stance on academic misconduct, such as plagiarism and any form of cheating.

Plagiarism is the copying or paraphrasing of another person’s work, whether published or unpublished, without clearly acknowledging it. It includes copying the work of other students and reusing work previously submitted by yourself for another course. It also includes the copying of code from unacknowledged sources.

Academic integrity breaches impact all students: they disadvantage honest students and undermine the credibility of your qualification. Plagiarism and cheating in tests and exams will be penalised; it is likely to lead to loss of marks for that item of assessment and may lead to an automatic failing grade for the course and/or exclusion from re-enrolment at the University.

Please see the Academic Integrity Guide for Students on the University website for more information. The Guide steps you through the University Academic Integrity Policy and Procedures. For example, you will find definitions of academic integrity misconduct, such as plagiarism; how misconduct is determined and managed; and where to find resources and assistance to help develop the skills of academic writing, exam preparation and time management. These skills will help you approach university study with academic integrity.


ASSIGNMENT 2: DATA ACQUISITION AND REGRESSION TO PREDICT HAPPINESS

In Assignment 2 you will be integrating data from two sources:

- the World Happiness Index

and one of:

- the World Bank API, or
- a web-scraped source of your choosing

Your goal is to build regression models for predicting happiness, following a good process, including:

- careful selection of explanatory variables (features) through engaging your critical thinking in choosing data sources, exploratory data analysis and optional feature set expansion;
- good problem formulation;
- good model experimentation (including explanation of your experimentation approach); and
- thoughtful model interpretation.

TASK 1: DATA ACQUISITION AND INTEGRATION (25 MARKS)

a) Static Data: Import Table 2.1 of the World Happiness Report data (1 mark)

To begin building your feature set, download the “WHRData.xls” static dataset from Stream and read in the data. The dataset is from the 2019 World Happiness Report. Learn more about this report here:

http://worldhappiness.report/ed/2019/

Data definitions and other feature documentation can be found here:

https://s3.amazonaws.com/happiness-report/2019/WHR19_Ch2A_Appendix1.pdf

You should familiarise yourself with the data documentation before proceeding. As a bare minimum, you will need to identify which variable represents ‘Happiness’.

Note: if you are unable to meet the challenges laid out in Task 1 b) and c), you will still be able to continue with Tasks 2 and 3 using the static dataset.

b) Dynamic data (14 marks)

To expand your feature set with dynamic data, do ONE of either option 1 or option 2.

OPTION 1:

API Data: Identify, import and wrangle indicators from the World Bank API

The World Bank API is briefly introduced in Lecture 5. Your task is to identify and import 5 or more World Bank indicators (features) that you would like to have as options for inclusion in your models for predicting happiness.

Identify: To identify 5 or more appropriate indicators, you will need to explore the World Bank API documentation and figure out for yourself how to find which indicators are available and then how to identify and request them. Finding your own way through the documentation is a deliberate part of this challenge. Briefly explain your process and why you chose your indicators. These links will provide you with a start:

https://datahelpdesk.worldbank.org/knowledgebase/articles/889392-api-documentation
https://datahelpdesk.worldbank.org/knowledgebase/articles/898599-api-indicator-queries


Import and wrangle your chosen indicators so that they are in the right shape for integration with the WHR data. In Lecture 5, only one indicator is imported. To import many indicators in a tidy fashion (i.e. without repeating code) will involve the use of a loop and/or function, depending on your approach.

Note: By default, you may not be returned all the data you require. You may have to set arguments to obtain the full range (look out for the ‘per_page’ argument). Also note that you can specify a date range.
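To give a feel for the loop-and-function approach, here is a minimal sketch of a multi-indicator import against the documented World Bank v2 endpoint. The indicator codes, column names and helper functions below are illustrative choices only, not the ones you should submit:

```python
import requests
import pandas as pd

# Illustrative indicator codes -- substitute whichever indicators
# you identified from the World Bank API documentation.
INDICATORS = {
    "NY.GDP.PCAP.CD": "gdp_per_capita",
    "SP.DYN.LE00.IN": "life_expectancy",
}

def rows_to_frame(rows, code):
    """Flatten the World Bank JSON rows for one indicator into a tidy frame."""
    return pd.DataFrame([
        {"country": r["country"]["value"], "year": int(r["date"]), code: r["value"]}
        for r in rows
    ])

def fetch_indicator(code, date_range="2010:2018"):
    """Fetch one indicator for all countries. per_page is raised so the
    full result set comes back in one request rather than one page."""
    url = f"https://api.worldbank.org/v2/country/all/indicator/{code}"
    params = {"format": "json", "per_page": 20000, "date": date_range}
    _meta, rows = requests.get(url, params=params, timeout=30).json()
    return rows_to_frame(rows, code)

def combine(frames):
    """Outer-merge the per-indicator frames into one wide dataframe."""
    wide = frames[0]
    for frame in frames[1:]:
        wide = wide.merge(frame, on=["country", "year"], how="outer")
    return wide.rename(columns=INDICATORS)

# Example use (makes network calls):
# wb = combine([fetch_indicator(code) for code in INDICATORS])
```

Keying the merge on both country and year keeps each indicator as its own column while preserving one row per country-year.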

Task 1b) Option 1 marking:

- 6 marks for identifying indicators and explaining why you chose them. We expect curiosity and initiative in exploring the World Bank API and figuring out how to use it to identify appropriate indicators. 0/6 marks for simply importing a subset of the indicators that you have already been given codes for in Lecture 5.
- 8 marks for the import and wrangling of data – highest marks for elegant and tidy solutions.

OPTION 2:

Web-scraped data: Source, import, parse and wrangle web data to add to your feature set

Source: Go to the internet and find another data source with which to expand your feature set that:

o can be web-scraped (downloading a csv or excel file from a website is not web-scraping),
o you think may improve your predictive model, and
o can be meaningfully integrated with the WHR data and your World Bank data.

In case it is not obvious, you will be looking for data that can be linked on both country name and one or more years of the data you have already acquired.

Import, parse, wrangle: Scrape the data and wrangle it into the shape it needs to be in order to integrate it later.

Explain: Give a brief explanation of your source choice and wrangling process before your wrangling script.
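To show what the parse-and-wrangle step can look like, here is a minimal standard-library sketch that extracts an HTML table and tidies it for joining on country and year. In practice you would fetch the page yourself (e.g. with requests) and would more likely reach for BeautifulSoup or pandas.read_html; the HTML fragment, column names and values below are invented for illustration:

```python
from html.parser import HTMLParser
import pandas as pd

class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
        elif tag == "tr":
            self._row = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

# Invented HTML standing in for a scraped page.
html = """<table>
  <tr><th>Country</th><th>Year</th><th>Internet users (%)</th></tr>
  <tr><td>Norway</td><td>2018</td><td>96.5</td></tr>
  <tr><td>New Zealand</td><td>2018</td><td>90.8</td></tr>
</table>"""

parser = TableParser()
parser.feed(html)
header, *body = parser.rows
scraped = pd.DataFrame(body, columns=header)

# Wrangle: tidy labels and types so the frame can later be joined
# with the WHR/World Bank data on country name and year.
scraped = scraped.rename(columns={
    "Country": "country", "Year": "year", "Internet users (%)": "internet_pct"})
scraped["year"] = scraped["year"].astype(int)
scraped["internet_pct"] = scraped["internet_pct"].astype(float)
```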

Task 1b) Option 2 marking:

- 3 marks for finding an appropriate and good quality data source.
- 8 marks for effective and tidy import/parse/wrangle code. Also, highest marks for the biggest challenge taken on (some sources, such as Wikipedia tables, are easier to scrape than other sources).
- 3 marks for briefly explaining your source choice and wrangling process.

c) Integration: By whichever means appropriate, clean labels and integrate the two datasets from a) and b) into one dataframe (10 marks)

Inspect and clean labels for integration: To integrate your data without losing rows, you must ensure compatibility of labels in the features you are joining on. This may involve data cleaning/updating using old-fashioned gruntwork. For instance, one country can have different names in different datasets (e.g. Dem. People's Rep. of Korea vs North Korea). Do data checks pre- and post-integration to ensure you have not lost data. Outer joins combined with filtering may assist you in this process. Data loss due to a country being present in one dataset but genuinely not present in another is acceptable. Include a brief explanation of your process before your cleaning script.

Integrate your data into one dataframe.
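A compressed sketch of the check, clean, join cycle, using two invented toy frames in place of your real datasets (pandas' merge `indicator` option is one way to spot remaining mismatches):

```python
import pandas as pd

# Toy frames standing in for the outputs of Tasks 1a) and 1b).
whr = pd.DataFrame({"country": ["Norway", "North Korea"], "happiness": [7.6, 3.2]})
wb = pd.DataFrame({"country": ["Norway", "Dem. People's Rep. of Korea"], "gdp": [75500.0, 640.0]})

# Pre-integration check: which labels appear in one frame but not the other?
only_whr = set(whr["country"]) - set(wb["country"])
only_wb = set(wb["country"]) - set(whr["country"])
print(only_whr, only_wb)  # these need cleaning/renaming before the join

# Clean: map the mismatched names onto a common label.
wb["country"] = wb["country"].replace({"Dem. People's Rep. of Korea": "North Korea"})

# Integrate: with indicator=True, any 'left_only'/'right_only' rows
# flag data that would be lost in an inner join.
merged = whr.merge(wb, on="country", how="outer", indicator=True)
assert (merged["_merge"] == "both").all()
combined = merged.drop(columns="_merge")
```

On real data the join would also include the year column, and the cleaning step is usually a longer hand-built rename mapping.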

Task 1c) Marking:

- 6 marks for checking label compatibility for integration (via scripting) and, if required, cleaning/updating those labels
- 2 marks for briefly explaining your process
- 2 marks for the final integration. At this point the final integration should be a straightforward line (or few lines) of code.


TASK 2: DATA CLEANING AND EXPLORATORY DATA ANALYSIS (EDA) (24 MARKS)

a) EDA – data quality inspection (12 marks)

Explore: Explore your integrated dataset with a view to looking for data quality issues. This could involve looking at summary statistics, plots, inspection of nulls and duplicates – whatever you think is appropriate; there is no single correct way of doing this. Clean your data if and as required and save the cleaned dataset to csv for use in assignment 3.

Inspect target variable: Do a visual inspection of the distribution of your target variable (Happiness). Explain whether it needs transformation to conform to a normal distribution. Transform if required.

Explain: Include a brief explanation of your process before your quality inspection and cleaning script.
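One possible shape for the target-variable inspection, sketched on synthetic right-skewed data (in your notebook, `target` would be the Happiness column of your integrated dataframe, and the skew threshold here is an illustrative rule of thumb, not a required cut-off):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Synthetic right-skewed data standing in for the target variable.
rng = np.random.default_rng(0)
target = pd.Series(rng.lognormal(mean=1.5, sigma=0.5, size=300), name="happiness")

# Visual inspection: histogram of the raw target.
target.plot.hist(bins=30, title="Target distribution")
plt.savefig("target_hist.png")

# Skewness near 0 suggests no transformation is needed; a strong
# positive skew is often tamed by a log transform.
skew_before = target.skew()
if abs(skew_before) > 1:
    target = np.log(target)
skew_after = target.skew()
print("skew before:", skew_before, "after:", skew_after)
```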

Task 2a) marking:

- 10 marks for your code/outputs (explore/inspect): Did you produce code and outputs appropriate for inspecting and addressing (if and as required) data quality issues?
- 2 marks for briefly explaining your process and discussion of the distribution of the target variable.

b) EDA – the search for good predictors (12 marks)

Explore: Explore your data with the goal of finding features that could be good predictors of your target variable. This should include:

o Inspection of correlations between features
o Pairs plot/scatter matrix
o Any other visualisation that you deem appropriate

Explain: Include a brief explanation of your process before the script for exploring your predictors.

Discuss: Briefly discuss your findings, e.g. “I have chosen this subset of features as good candidates for model predictors because…” (warning: do not copy and paste this text into your report; we will deduct marks if you do). It is also OK to choose features for reasons other than them being the best predictors – perhaps you are curious as to whether a given feature would have any effect in a model.

Note: You are looking for features that are well correlated with the target variable. You are also looking out for features that are highly correlated with each other. Be aware that while models can have predictive power while including highly correlated predictors (multicollinearity), the effects of those correlated predictors will be masked by each other. Where there is multicollinearity, interpretation of specific feature coefficients is uncertain. Bear this in mind later when interpreting your models.

Note: You may find that your chosen predictors come from the same data source. That is OK.
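The correlation checks described above can be sketched as follows, on synthetic data with invented feature names (your notebook would use the integrated dataframe and its real columns):

```python
import numpy as np
import pandas as pd

# Synthetic data: 'life_exp' is deliberately collinear with 'gdp',
# and 'noise' is unrelated to the target.
rng = np.random.default_rng(1)
n = 200
gdp = rng.normal(size=n)
df = pd.DataFrame({
    "gdp": gdp,
    "life_exp": gdp * 0.9 + rng.normal(scale=0.3, size=n),
    "noise": rng.normal(size=n),
    "happiness": gdp * 0.8 + rng.normal(scale=0.5, size=n),
})

# Correlation of every feature with the target, strongest first.
corr = df.corr(numeric_only=True)
print(corr["happiness"].drop("happiness").sort_values(ascending=False))

# Feature-feature correlations with large magnitude warn of multicollinearity.
print(corr.abs() > 0.8)

# pd.plotting.scatter_matrix(df)  # pairs plot, when running in a notebook
```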

Task 2b) marking:

- 8 marks for your code/outputs (explore): Did you produce outputs appropriate for finding good predictors? Is your code elegant and concise?
- 4 marks for your words (explain and discuss): Did you explain your process and discuss your findings?

**BONUS QUESTION**

Up to 5 marks will be awarded for feature set expansion via the creation of derived predictor(s) that make a significant and novel contribution to your final model. How you do this is up to you. Being an extension task, we will give no further guidance, and a very high standard is set for achieving maximum marks. Ingenuity and initiative will be rewarded.


TASK 3: MODELLING (44 MARKS)

Build the best regression model you can, with Happiness as the target variable, within whichever bounds you set yourself in your problem formulation.

Formulate a problem: You know ‘Happiness’ is your target variable, but what else are you interested in with respect to this problem? Would you like to simply find the model with the most predictive power? Are you interested in understanding how particular features of interest to you affect Happiness? Or perhaps you are interested in finding the most parsimonious model possible, while still retaining predictive power? Another approach is to look at models for a particular group or groups. Perhaps you would like to filter your dataset to include only OECD countries? Or perhaps you would like to build different models for developed, developing and underdeveloped countries? (The World Bank API has this data.) Maybe you have some other ideas? Briefly explain how you will be approaching this regression problem. This will help you to focus your experimentation.

Experiment and explain: Explore regression models in a way that is appropriate to your problem formulation. Experiment with single-variable (optional) and multiple-variable (required) linear regression. Consider using a step-wise algorithm. Optionally, experiment with polynomials. Explain your approach to experimentation.

o Do not use joint plots as a substitute for regression modelling. Zero marks will be given to any model experimentation that relies on joint plots.
o Do use a module for modelling, and do not code up your regression model from scratch.
o Do consider ‘Year’ as a predictor to include in your model.
o Do display model statistics.

Note: If you are interested in the predictive power of your model, your best model is likely to include multiple explanatory variables, so don’t waste effort bulking out the assignment with single-variable models.

Note: When you have more than one predictor in your model, you will not be able to produce the regression plots from Lecture 4 because they are two-dimensional (target vs one predictor). That is OK. There are other ways to visualise; for instance, you could plot summary statistics (like RMSE or RSq) from your different models that you have collated into a dataframe.
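The collate-metrics-into-a-dataframe idea can be sketched like this, using a loop over feature subsets on synthetic data (the feature names and the exhaustive-subsets strategy are illustrative; scikit-learn stands in here for whatever modelling module you use):

```python
from itertools import combinations
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for the integrated happiness dataset.
rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=["gdp", "life_exp", "year"])
df["happiness"] = 0.7 * df["gdp"] + 0.3 * df["life_exp"] + rng.normal(scale=0.4, size=n)

def fit_and_score(features):
    """Fit a linear model on one feature subset; return R-squared and RMSE."""
    X, y = df[list(features)], df["happiness"]
    model = LinearRegression().fit(X, y)
    pred = model.predict(X)
    rmse = float(np.sqrt(np.mean((y - pred) ** 2)))
    return {"features": "+".join(features), "r2": model.score(X, y), "rmse": rmse}

# One row of metrics per candidate model; the loop/function keeps the report tidy.
results = pd.DataFrame([
    fit_and_score(combo)
    for size in range(1, 4)
    for combo in combinations(["gdp", "life_exp", "year"], size)
]).sort_values("r2", ascending=False)
print(results)
```

A single bar chart of `results` then compares all models at once, which is exactly the kind of visualisation suggested above.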

Write elegant code: Experimenting with many different models will involve repetition of code, so employ loops and functions for model creation and evaluation. Functions and loops = less code = easier-to-read reports and easier and more effective experimentation.

Evaluate/interpret: To compare models, you will need to interpret model outputs. For instance, the probability of the F statistic tells us whether there is a significant relationship between the response and predictors as expressed by the model. R-squared tells us about the strength of that relationship (and how good our model would be for prediction). Consider the coefficients for your predictors – are they significant and doing heavy lifting in the model, or are they surprisingly superfluous? Can the coefficients be interpreted, or is multicollinearity an issue? You may like to calculate RMSE and interpret that in context.

Present preferred/final model: settle on a preferred or final model for further inspection.

o Residuals: Produce a plot of residuals and fitted values and explain whether it is likely that this model fulfils the necessary assumptions of homoscedasticity (homoscedastic residuals should not fan out) and linearity (the residuals should scatter randomly around the fitted line and not follow a curved shape). You could find code for this online, or you could look up the code in the exercise hints for Lecture 4. For the purposes of this assignment you are not expected to analyse the residuals beyond a visual inspection. We would usually inspect residuals before interpreting any model output, not just the final model; that requirement is waived here to pare down the scope.

o Describe what the coefficients of the model mean, remembering to mention what units they are in as appropriate (e.g. sealevel = 0.58*temp_celsius: ‘for every degree Celsius increase in average global temperature, sea level rises by 58 centimetres’).

o Explain how reliable the model was. Was it a good fit and good for prediction? How did the residuals look; do you think they conformed with assumptions? Could you recommend this model to a client?

o Optional – Plot the confidence intervals and prediction bands for that model and describe what they tell you (there are no extra marks for this option).
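A residuals-vs-fitted plot of the kind asked for above can be sketched as follows, on synthetic data (substitute your own preferred model's X and y; scikit-learn is used here purely for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for your preferred model's inputs.
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 2))
y = X @ np.array([0.7, 0.3]) + rng.normal(scale=0.4, size=150)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# Healthy residuals scatter randomly around zero with constant spread:
# a fan shape suggests heteroscedasticity, a curve suggests non-linearity.
plt.scatter(fitted, residuals, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.savefig("residuals.png")
```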


Note: As we do not delve deeply into statistics in this course, and to keep the assignment scope manageable, we will not be holding your work in this assignment to a high statistical standard (for instance, looking for outliers, high leverage points, etc.). We would like you to demonstrate curiosity and your ability to use the tools provided, and show that you can select good predictive features, experiment and evaluate a model.

Task 3 Marking:

- 2 marks for problem formulation
- 8 marks for rich experimentation in modelling (maximum of half marks if no multiple-variable models included)
- 8 marks for producing appropriate outputs for model evaluation. Highest marks for creative approaches to this (e.g. producing visualisations that display model statistics from multiple models for ease of comparison).
- 8 marks for elegance of code (effective use of loops/functions)
- 8 marks for interpretation of outputs
- 10 marks for presentation of final/preferred model:
  o Residuals plot – 4 marks
  o Interpretation of residuals plot – 2 marks
  o Coefficient explanation – 1 mark
  o Discussion of model reliability – 3 marks

TASK 4: PRESENTATION - ‘REPORT-ERIZE’ YOUR WORK (7 MARKS)

Go back through what you have done and turn your Assignment 2 work into something that looks like a report that you could hand to a client (a technically savvy client, as you still need to include/provide your scripting for marking). Include a brief introduction that describes the modelling problem you formulated, and a brief description of the datasets that you use. Add a conclusion. Use formatted markdown boxes that include headings and subheadings. Do also include text/headings that break the ‘fourth wall’ to clearly delineate the different tasks of the assignment (e.g. ‘Task 1b’) and for the sake of fulfilling requirements for marking (e.g. descriptions of process). Any formatting that makes the task of marking easier would be most appreciated and ensures we do not overlook areas where marks should be specifically awarded.

Put your name and ID on your report. It seems obvious, but this is a common omission. As an incentive, there will be a 3-mark deduction for reports that are missing a name and ID.

Clear out any unnecessary code and outputs that clutter your work. Run your text through a spell checker extension. See the end of assignment 3 (‘Assignment 3 requirements’) for more tips on how to tidy up a report by hiding scripts.

HAND-IN:

Zip up all your notebooks, Python files and dataset(s) into a single file. Submit this file via Stream. Make sure that your Jupyter notebook has been run with all outputs visible. Download an HTML version of your notebook (with outputs showing) and include this in your zip file.



ASSIGNMENT 3: KNN REGRESSION, SUPERVISED AND UNSUPERVISED LEARNING

PROJECT OUTLINE

In this project you will be producing a Jupyter Notebook report. You will apply techniques taught so far to either build kNN regression models or supervised learning models. You will also build unsupervised learning models. You will use the happiness dataset you built in assignment 2, which you may optionally expand. If you do supervised learning, you may optionally choose a different dataset.

You do not need to repeat the analysis from assignment 2 – assignment 3 extends this work. You may nonetheless find that further data wrangling and analysis is required. In that case, such work will be considered when marking.

TASK 1 – IMPORT THE CSV YOU SAVED IN TASK 2A) OF ASSIGNMENT 2 (NO MARKS FOR THIS)

TASK 2 – BUILD KNN REGRESSION MODELS OR SUPERVISED LEARNING MODELS (60 MARKS)

Choose ONE Option:

OPTION 1 – KNN REGRESSION MODELS

Formulate: Using your assignment 2 dataset, creatively formulate a problem that enables you to perform kNN regression for prediction. It is acceptable if this problem is the same problem you explored in your regression analysis in assignment 2. Describe this problem in your introduction.

Model: Experiment with models for this prediction containing different subsets of features.

Marking expectations (what we are specifically looking for in your modelling):

o Models with multiple input features
o Scaling of all input features
o A train/test split for all models so that the models can be meaningfully evaluated (train with the training data, evaluate with the testing data). This is not explicitly demonstrated in the kNN regression lecture, but it is demonstrated in the supervised learning lecture. Some guidance on how to achieve this with kNN regression (if you need it) is included in the appendix.
o Experimentation with input feature subsets
o Experimentation with model parameters – different distance metrics and different values of k

Evaluation and interpretation: Generate, interpret and compare evaluation metrics for your various models. Ideally, this will involve some visualisations, such as plotting metrics for different models. Consider questions such as: which values of k are most robust for the size of your dataset and your problem domain?

Discussion: How reliable are your prediction models? Could you recommend any to a client? Would you expect this model to preserve its accuracy on data beyond the range it was built on?

Note: As with assignment 2, there are plots in the kNN regression lecture that cannot be reproduced for multivariable models. Do not let this prevent you from producing models with many features – they will give you the best results. There are other plots you could produce, e.g. plotting metrics across different models to compare them.
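The expected workflow for Option 1 (scale, split, then sweep k and the distance metric, collecting metrics for comparison) can be sketched on synthetic data as follows; the feature values, metric choices and k values here are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

# Synthetic data standing in for the happiness dataset.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3))
y = X @ np.array([0.6, 0.3, 0.1]) + rng.normal(scale=0.3, size=300)

# Split first, then fit the scaler on the training data only,
# so no information leaks from the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Sweep distance metrics and k; evaluate every model on the held-out data.
results = []
for metric in ["euclidean", "manhattan"]:
    for k in [1, 3, 5, 11, 21]:
        knn = KNeighborsRegressor(n_neighbors=k, metric=metric).fit(X_train_s, y_train)
        results.append({"metric": metric, "k": k,
                        "test_r2": r2_score(y_test, knn.predict(X_test_s))})
results = pd.DataFrame(results).sort_values("test_r2", ascending=False)
print(results)
```

Plotting `test_r2` against k, one line per metric, is one way to satisfy the visualisation expectation above.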

OPTION 2 – SUPERVISED LEARNING

Formulate: Using your assignment 2 dataset, or a dataset of your choosing, creatively formulate a classification problem for which you can build supervised learning models. Describe this problem in your introduction.

Explore features: Explore the ability of features to discriminate between your chosen or derived class labels. For instance, as in the lecture, you could plot histograms or box plots of different features by class label and see if the distributions are noticeably different. Consider exploring other types of plots.

Model: Create models using different subsets of input features for prediction.

Marking expectations (what we are specifically looking for in your modelling):

o Models with multiple input features
o Scaling of all input features
o A train/test split for all models so that the models can be meaningfully evaluated (train with the training data, evaluate with the testing data). There is guidance for this in Lecture 7.
o Experimentation with input feature subsets, feature selection and algorithms
o Evaluate and interpret – Generate, interpret and compare evaluation metrics for your models. This should involve visualisations such as plotting metrics across different models. Consider cross-validation.

158.222-2019 Semester 1 Massey University

Page 9 of 11

Note: Target/class labels: If you would like to use your assignment 2 dataset for Option 2, you will need categories to predict. There are many ways of doing this – you could see whether there are appropriate categorical features from the World Bank API or other data sources that you could integrate into the dataset. Alternatively, and more simply, you could derive labels from an existing feature. For instance, you could create ‘high’, ‘medium’ and ‘low’ happiness labels according to happiness score (or do something similar with any feature of your choosing that you would like to predict). If you use an entirely new dataset, we would expect some EDA.

Note: Input features: Feel free to derive new input features.

Note: You can use Python’s scikit-learn module for machine learning or try other algorithms. There are many other Python implementations of machine learning algorithms, such as neural networks (PyBrain), which are not implemented in scikit-learn; you may use these if you wish.
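Putting the pieces above together (derived labels, scaling, a train/test split and cross-validation), one possible minimal sketch on synthetic data is shown below; the feature values, the three-bin label derivation and the choice of logistic regression are all illustrative, not required:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data standing in for the happiness dataset.
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 3))
happiness = X @ np.array([0.7, 0.2, 0.1]) + rng.normal(scale=0.3, size=300)

# Derive 'low'/'medium'/'high' class labels from the numeric score.
labels = pd.cut(happiness, bins=3, labels=["low", "medium", "high"]).astype(str)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=42, stratify=labels)

# Pipeline: the scaler is fit on training data only, avoiding leakage.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
test_acc = accuracy_score(y_test, clf.predict(X_test))
print("test accuracy:", test_acc)

# Cross-validation gives a more stable estimate than a single split.
cv_acc = cross_val_score(clf, X_train, y_train, cv=5).mean()
print("cv accuracy:", cv_acc)
```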

TASK 3 – BUILD UNSUPERVISED LEARNING MODELS (30 MARKS)

Feature selection: Choose different subsets of input features from your assignment 2 dataset for clustering.

Scale all the input features that you will be using.

Perform cluster analyses with scikit-learn using the input feature sets. Multi-variable clustering models are expected. Do not create models using the ‘coding from scratch’ algorithms in the lecture. Do create cluster models using scikit-learn.

Visualise, evaluate, interpret and discuss your results (there is some guidance for visualising clusters in the appendix).
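A minimal sketch of the scale-then-cluster workflow with scikit-learn, on synthetic data containing two artificial blobs so that clusters exist to be found (KMeans and the silhouette score are illustrative choices, not requirements):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two synthetic blobs standing in for a chosen feature subset.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=0.0, size=(100, 3)),
               rng.normal(loc=4.0, size=(100, 3))])

# Scale the chosen features before clustering.
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)
cluster_labels = km.labels_

# Silhouette (closer to 1 = better separated clusters) is one metric
# for comparing feature subsets and values of n_clusters.
print("silhouette:", silhouette_score(X_scaled, cluster_labels))
```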

TASK 4: PRESENTATION - ‘REPORT-ERIZE’ YOUR WORK (10 MARKS)

Refer to Task 4, Assignment 2 for what to do here.

Assignment 3 Requirements:

The Python code in the submitted notebooks must be entirely self-contained, and all the experiments and graphs must be replicable. Do not use absolute paths; use relative paths if you need to. Consider hiding away some of your Python code by putting it into .py files that you can import and call. This will help the readability of your final notebook by not allowing the Python code to distract from your actual findings and discussions. Do not dump dataframe contents in the notebook – show only 5-10 lines at a time – as this severely affects readability.

You may install and use any additional Python packages you wish that will help you with this project. When submitting your project, include a README file that specifies what additional Python packages you have installed, in order to make your project repeatable on my computer should I need to install extra modules.

HAND-IN:

Zip up all your notebooks, Python files and dataset(s) into a single file. Submit this file via Stream. Make sure that your Jupyter notebook has been run with all outputs visible. Download an HTML version of your notebook (with outputs showing) and include this in your zip file.



Marking criteria – Marks will be awarded for different components of the project using the following rubric. In all cases expect higher marks for elegant code (use of loops/functions).

You will receive a total mark for each task, not marks per the numbers in parentheses. The numbers in parentheses are indicative weightings provided to help focus your effort on what your marker considers important.

Task 2 – kNN regression option (60 marks)

- Problem formulation (2)
- Modelling:
  o Implementation of a train/test split (8) (higher weighting here compared to the supervised learning option, as this is not implemented in the kNN regression lecture)
  o Scaling of input features (3)
  o Quality of experimentation with:
    - feature subsets containing multiple features (10)
    - model parameters (10)
  o Quality of evaluation and interpretation*:
    - Quality of evaluation process and metrics including visualisations (12)
    - Quality of interpretations (10)
    - Quality of final discussion (5)

*If you did not do a train/test split (train on the training set, evaluate with the testing set), you will not have quality evaluations, so expect low marks here if that is the case.

Task 2 – Supervised learning option (60 marks)

- Problem formulation and creation/integration of class labels (6)
- Initial input feature exploration via visualisations (6)
- Modelling:
  o Implementation of a train/test split (3)
  o Scaling of input features (3)
  o Quality of experimentation (20) (the more you can explore feature selection, different feature subsets containing multiple features, and different algorithms, the richer your experimentation will be)
  o Quality of evaluation and interpretation*:
    - Quality of evaluation process and metrics including visualisations (12)
    - Quality of interpretations (10)

*If you did not do a train/test split (train on the training set, evaluate with the testing set), you will not have quality evaluations, so expect low marks here if that is the case.

Task 3 – Unsupervised learning (30 marks)

- Feature subset selection and feature scaling (4)
- Quality of experimentation: creation of different cluster models with feature subsets containing multiple features. Note the requirement to use scikit-learn for clustering. (6)
- Visualisation, evaluation, interpretation and discussion of cluster models (20)

Task 4 – Presentation (10 marks)

- Report structure (4)
- Tidy code and outputs (4)
- Spelling (2)



APPENDIX

TRAIN TEST SPLIT WITH KNN REGRESSION

Do a train/test split of your data before doing any kNN modelling (you will get spurious metrics otherwise). To achieve this, you would need to do something like this:

from sklearn.model_selection import train_test_split

X = df_std  # the standardised explanatory variables
y = np.array(df['Target_feature'])

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

If you plan to borrow the ‘calculate_regression_goodness_of_fit’ function for your analysis, you would need to change this line:

y_mean = y.mean()

to this:

y_mean = ys.mean()

and then let your common sense guide you as to the other necessary changes to the lecture code. Think about where the training sets should be used and where the testing sets should be used (hint: the training sets should be used for the model fit; the testing sets should be used for prediction and goodness-of-fit calculation).

VISUALISING CLUSTERS

Look at 2 or 3 different features at a time in scatter plots with points coloured according to cluster, and see if you can discern which features were important in defining the clustering (there are other ways of doing this, but for the purposes of this assignment we are satisfied if you simply look at some visualisations). There is no guarantee that you will be able to see a clear difference, but have a go and show what you have done. Try to describe the effect that each feature has on the clustering, if it is discernible.

The examples below are artificial, but I provide them to give you the general idea. In the example on the left, the feature Y is important in defining the clusters, not X. In the example on the right, X is important in defining the clusters, not Y:

[Figure: two example scatter plots of clusters, not reproduced in this extract]

You will likely need to iterate through subsets of your features and produce such plots to get an idea of which features may have been important in defining your clusters (functions are your friend). In reality, it will be a combination of them. If one feature is really dominant, you should double-check that you have scaled your features.

Avoid univariate clustering.
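The iterate-through-feature-pairs idea can be sketched as follows, on synthetic data with invented feature names, where one feature deliberately carries no cluster structure:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic data: two blobs in 'gdp' and 'life_exp'; 'noise' is structureless.
rng = np.random.default_rng(7)
features = ["gdp", "life_exp", "noise"]
X = np.vstack([rng.normal(loc=0, size=(80, 3)), rng.normal(loc=3, size=(80, 3))])
X[:, 2] = rng.normal(size=160)  # overwrite the third column with pure noise

X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_scaled)

# One scatter plot per feature pair, coloured by cluster; separation along
# an axis suggests that feature mattered in defining the clusters.
pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)]
fig, axes = plt.subplots(1, len(pairs), figsize=(12, 4))
for ax, (i, j) in zip(axes, pairs):
    ax.scatter(X_scaled[:, i], X_scaled[:, j], c=labels, alpha=0.6)
    ax.set_xlabel(features[i])
    ax.set_ylabel(features[j])
fig.savefig("cluster_pairs.png")
```

Wrapping the plotting in a function keyed on a feature subset makes it easy to repeat for every clustering model you build.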

