Resit Assignment – CSC3060 “AIDA”

Release date: Friday 5th August

Deadline: 11:00pm Sunday 11th August 2019.

This version: 2019-07-04.

Introduction

This assignment re-assesses key practical and theoretical learning outcomes from the CSC3060

module.

With the exception of Section 1, this assignment must be completed in R. Section 1 can be

completed in Python (recommended), Java or R. Convenient and commonly used machine learning

packages are available for R and Python, such as “class”, “caret” and “randomForest” (in the case of

R). When you use a procedure that has an element of randomness (e.g. creating cross-validation

folds) please use the seed value 3060 (your code should give the same results each time it runs).

Please read carefully the information about the assessment criteria and marking process at the end of

this document.

Section 1 (5%): Creating a dataset

This section asks you to build a dataset of images composed of handwritten numbers, letters, and

punctuation marks. Each image is represented by a black & white matrix with size 20 rows by 20

columns. In the matrix, the number “1” represents black pixels and “0” represents white pixels. As

such, each image can be stored in a .CSV file containing the matrix (and no headers), as in these

examples:

Figure 1: Examples of handwritten images and their matrix representation (classes shown: a, b, 1 and 3).

The goal in Section 1 is to create a dataset containing 10 images of each of the digits {0, 1, 2, 3, 4}, 10
images of each of the letters {a, b, c, d, e} and 10 images of each of the punctuation symbols
{; , ? ! :} (semicolon, comma, question mark, exclamation mark and colon). Each image

should be obtained by processing a hand-written symbol (preferably with a touch screen, using the

lab computers, although it is fine if you create them using the computer mouse). The quality of the

drawing is not essential, as long as each symbol can easily be read by a human. The characters
should be drawn in proportion to each other (e.g. the comma symbol should be drawn at a
size which is proportional to the letter symbols), as in the example below:

Figure 2: Examples of handwritten images for the letter “d” and the comma symbol, drawn at

proportional sizes.

The largest symbols should fit reasonably well in the 20x20 box (i.e. do not draw a tiny character in

one corner of the 20x20 box; this will make your life easier when it comes to doing analyses!). You

can reuse any suitable digits or letters you may have already created for the first lab assignment.

You may use whatever means you prefer to obtain the images and .csv files. However, a suggestion is

to use the software GIMP (http://www.gimp.org). Using GIMP, you can create a new image with 20

by 20 points (pt), advanced options 1 pixel/pt, color space grayscale, fill with background colour. This

will give you a small white square, which you can magnify to e.g. 2000% in order to make it easier to

draw on. To draw on the image, you can select the pencil tool and adjust the brush size to (e.g.) 1

pixel. The standard file formats of GIMP are useful to save the images, but we need a more easily

readable format. One good option is to export as PGM, type ASCII. In this format, each image becomes

a text file with a header consisting of the following four lines:

P2

# CREATOR: ...

20 20

255

The third and fourth lines of the header specify the pixel array size and the maximum allowed pixel

value, respectively. (The images are greyscale, with 0 representing fully black and 255 representing

fully white). [1]

The remaining lines of the file specify the pixel values, with one value on each line; the total number

of pixel values should correspond to the specified array size (i.e. 20*20=400).

For our purposes, a number < 128 represents a black pixel, while a number >= 128 represents a white

one. Such a format can be easily converted into a matrix containing ones and zeros, as presented in

Figure 1 above. You shall save each image matrix as a comma-delimited csv file consisting of a 20x20
array of 1s and 0s, following the specification above. Use the filename STUDENTNR_LABEL_INDEX.csv,
where STUDENTNR is your student number (e.g. 123456), INDEX is a number from 01 to 10 (always
two digits, with zero-padding), indexing the set of 10 images you must create for each symbol, and
LABEL is a numeric code that uniquely identifies the type of symbol.

[1] For further information about this image format, see https://en.wikipedia.org/wiki/Netpbm_format
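The conversion described above (ASCII PGM to a 20x20 matrix of 1s and 0s, saved under the required filename) could be sketched in Python roughly as follows. This is only a minimal illustration, not a prescribed method: the function names `pgm_to_matrix`, `csv_filename` and `matrix_to_csv_text` are hypothetical, and the parser assumes a well-formed P2 file such as the one GIMP exports.

```python
import csv
import io


def pgm_to_matrix(pgm_text, threshold=128):
    """Parse ASCII (P2) PGM text into rows of 1 (black) and 0 (white)."""
    tokens = []
    for line in pgm_text.splitlines():
        line = line.split("#", 1)[0]  # drop comments such as "# CREATOR: ..."
        tokens.extend(line.split())
    if not tokens or tokens[0] != "P2":
        raise ValueError("not an ASCII PGM (P2) file")
    width, height = int(tokens[1]), int(tokens[2])
    values = [int(t) for t in tokens[4:]]  # tokens[3] is the maximum pixel value
    if len(values) != width * height:
        raise ValueError("pixel count does not match the header")
    # A value below the threshold counts as black (1), otherwise white (0).
    return [
        [1 if values[r * width + c] < threshold else 0 for c in range(width)]
        for r in range(height)
    ]


def csv_filename(student_nr, label, index):
    """Build STUDENTNR_LABEL_INDEX.csv with a zero-padded two-digit index."""
    return f"{student_nr}_{label}_{index:02d}.csv"


def matrix_to_csv_text(matrix):
    """Render the matrix as comma-delimited text with no header row."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(matrix)
    return buffer.getvalue()
```

Writing `matrix_to_csv_text(...)` to a file named by `csv_filename(123456, 25, 10)` would produce `123456_25_10.csv`, matching the naming example in the brief.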

We will use the following codes to label the 15 different types of images:

Symbol Label

For example, if your student number is 123456, then 123456_25_10.csv would be the 10th image

you created for the letter ‘e’. (As well as creating the csv files, you may also want to keep the PGM

files, in case you need to inspect the data later on).

As part of your submission, upload the csv files that you create in a directory called “images”, along
with any code you wrote to create the csv files, in a directory called “code_section1” (see submission
instructions at the end of this document).

It is important to upload the images in the correct csv format as these files will be used to verify your

calculations in the subsequent sections.

In your report, briefly explain in your own words how you created the images and obtained the

matrices from them.

Section 2 (15%): Feature engineering

Using each 20x20 matrix obtained from an image as described above, you must create an array of

characteristics that describe some features of the image. Each feature will be a number (i.e. each

feature is a numeric variable). There are 14 features in total. In the feature definitions that follow, a

pixel has 8 neighbours, which can be referred to as follows:

Features to be calculated:

Feature Index   Feature Short Name   Feature Description

label The true symbol in the image (represented by one of the 15 LABEL codes).

Note that the label is not a true feature, and should not be used as a

feature for statistical tests or during model training.

index The index of this image instance (a number from 01 to 10).

1 nr_pix The number of black pixels in the image.

2 height Number of rows containing at least one black pixel

3 width Number of columns containing at least one black pixel

4 tallness Ratio of height to width, i.e. feature 2 divided by feature 3

5 rows_with_5+ Number of rows with five or more black pixels

6 cols_with_5+ Number of columns with five or more black pixels

7 1neigh Number of black pixels with exactly 1 black neighbouring pixel

8 2neigh Number of black pixels with exactly 2 black neighbouring pixels

9 4+neigh Number of black pixels with 4 or more black neighbours

10 max_dist The maximum Euclidean distance between any 2 black pixels in the
image, measured in pixel units from the centre of each pixel. For example,
a centre pixel has a distance of 1.414 from its lower-right neighbour
(Euclidean distance: sqrt(1^2 + 1^2)) and the lower-right neighbour has
distance 2.828 from the top-left neighbour (sqrt(2^2 + 2^2)).

11 nr_regions Two black pixels A and B are connected if they are neighbours of each

other, or if a black pixel neighbour of A is connected to B (this definition

is actually symmetric); a connected region is a maximal set of black pixels

which are connected to each other; this feature has the number of

connected regions in the image.

12 nr_eyes In a written character, an “eye” is a region of whitespace that is

completely surrounded by lines of the character. For example, “A”

contains one eye, “B” contains two eyes, and “C” contains no eyes. A

region of whitespace is an eye if there is a ring of black pixels surrounding

it which are all connected (i.e. they form a chain of neighbours). This

feature is the number of eyes in the image.

13 [your label] Design any other feature you like, which you think may be useful for

distinguishing between symbols.

14 [your label] Design any other feature you like, which you think may be useful for

distinguishing between symbols. This should not be a simple modification

of feature 13.

Your task in this section is to write code to calculate each of the features above. In calculating pixel

neighbours, you can assume that the images are padded on each side with white pixels.
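To make the neighbour and connectivity definitions concrete, the following sketch shows one possible way to compute several of the features. It is written in Python purely to illustrate the logic (the brief requires R for this section), and all names are hypothetical. Two assumptions are made that the brief does not state: a "neighbour" in features 7 to 9 means a black neighbour, and an eye is counted as a white region that cannot reach the image border through up, down, left or right steps.

```python
# Offsets of the 8 neighbours of a pixel, and of its 4 edge-adjacent neighbours.
EIGHT_OFFSETS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
FOUR_OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def pad(matrix):
    """Surround the image with a one-pixel border of white (0) pixels."""
    blank = [0] * (len(matrix[0]) + 2)
    return [blank[:]] + [[0] + list(row) + [0] for row in matrix] + [blank[:]]


def black_neighbour_counts(matrix):
    """For every black pixel, count how many of its 8 neighbours are black."""
    p = pad(matrix)
    counts = []
    for r in range(1, len(p) - 1):
        for c in range(1, len(p[0]) - 1):
            if p[r][c] == 1:
                counts.append(sum(p[r + dr][c + dc] for dr, dc in EIGHT_OFFSETS))
    return counts


def count_regions(grid, value, offsets):
    """Count connected regions of cells equal to `value` using a flood fill."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or (r, c) in seen:
                continue
            regions += 1
            stack = [(r, c)]
            seen.add((r, c))
            while stack:
                cr, cc = stack.pop()
                for dr, dc in offsets:
                    nr, nc = cr + dr, cc + dc
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and grid[nr][nc] == value and (nr, nc) not in seen:
                        seen.add((nr, nc))
                        stack.append((nr, nc))
    return regions


def some_features(matrix):
    """Compute a subset of the features for one image matrix."""
    counts = black_neighbour_counts(matrix)
    return {
        "nr_pix": len(counts),
        "height": sum(1 for row in matrix if any(row)),
        "width": sum(1 for col in zip(*matrix) if any(col)),
        "1neigh": sum(1 for n in counts if n == 1),
        "2neigh": sum(1 for n in counts if n == 2),
        "4+neigh": sum(1 for n in counts if n >= 4),
        # Black pixels are connected through any of their 8 neighbours.
        "nr_regions": count_regions(matrix, 1, EIGHT_OFFSETS),
        # White regions in the padded image, minus the outer background.
        "nr_eyes": count_regions(pad(matrix), 0, FOUR_OFFSETS) - 1,
    }
```

Using 4-connectivity for white pixels is a deliberate choice here: it stops whitespace from "leaking" diagonally through a ring of black pixels that is only connected corner to corner.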

Save your calculated features in a file called STUDENTNR_features.csv, where STUDENTNR is

your student number. This file will consist of 150 rows, with each row listing the comma-separated

feature values for each of your 150 images (10 images for each of 15 categories).

For example, the features for your eighth “e” image may be as follows:

25,8,14,8,4,12,8,8,1,11,0,7,1,1,1,2

The 10 rows that correspond to the 10 instances of a particular character should be grouped together in
the features file, and the order of the 10 rows should correspond to the INDEX used in the image

filenames. In other words, the 150 rows of STUDENTNR_features.csv should be sorted first by

the label and secondly by the index.
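The required row order (by label, then by index) can be produced with a two-part sort key. The sketch below is a hypothetical Python helper that illustrates the ordering only. It assumes each row is a list whose first two entries are the label code and the index, followed by the feature values.

```python
import csv
import io


def features_csv_text(rows):
    """Sort rows by (label, index) and render them as comma-separated text."""
    ordered = sorted(rows, key=lambda row: (int(row[0]), int(row[1])))
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(ordered)
    return buffer.getvalue()
```

Converting the label and index with `int` avoids the string ordering problem where "10" would sort before "2".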

If you cannot calculate a particular feature, you may use a random integer between 0 and 10 for the

feature values instead, or you can use your own manual estimate of the feature’s value, provided you

clearly indicate that this is what you have done in both the report and the source code. (You will lose

marks for not calculating the feature, but you can use the random/estimated values in the analyses

that follow in the subsequent sections).

In your report, briefly describe and explain the code you have written to calculate the features above.

If you ran into difficulties, you should still explain your thought processes and your attempts to

calculate the features. In the case of features 13 and 14, you should explain your rationale for choosing

the features you did, as well as how they are calculated (i.e. you should give a justification for why you

think these features should be useful). These features should not be simple modifications of each

other, or of other features.

Finally, create a single feature file, called STUDENTNR_features.csv, containing in 150 rows the
individual features calculated for all 150 images.

Section 3: Statistical analyses of feature data (40%)

In this section, you will perform statistical analyses of the feature data, in order to explore which

features are important for distinguishing between different kinds of symbols.

You shall use descriptive statistics (mean, variance, etc.), null hypothesis testing, and confidence

intervals to perform your analysis of the data. You are encouraged to provide tables, figures, and/or

graphs in the report to support your discussions and findings. When performing tests, always

consider whether multiple test correction is needed.

It is your responsibility to define the appropriate assumptions to run the tests, and to choose an

appropriate test according to the data characteristics and the question that you are studying. You

are not restricted to the hypothesis tests that were discussed in the lectures. Remember to always justify

the approach that you choose to employ. You may assume a significance level of 0.05 for the

analyses when running hypothesis testing.

In particular, in the report you should address each of the following subtasks, using appropriate

statistical tests, tables, graphs, etc.

1. Construct suitable histograms for the nr_pix, height, and cols_with_5+ features, for each of

the following sets of items: (a) the 50 digits, (b) the 50 letters, (c) the 50 punctuation symbols

and (d) the full set of 150 items. Briefly describe the shape of the distributions and comment

on any interesting patterns across the datasets. Visually assess the skew and normality of

the distributions.

2. Present summary statistics (e.g. mean and standard deviation) about all the features, for (a)

the 50 digits, (b) the 50 letters, (c) the 50 punctuation symbols and (d) all 150 items. Briefly

discuss the summary statistics, and whether they already suggest which features may be

useful for discriminating digits and letters. For features you feel may be interesting for

discrimination between groups, consider suitable visualisations (e.g. histogram of feature

values for the three groups [2]). State what type of variable (continuous, categorical, etc.) each
variable is.

[2] https://stackoverflow.com/questions/36049729/r-ggplot2-get-histogram-of-difference-between-two-groups

3. Assume that the nr_pix variable is sampled from a population which is normally distributed.

Estimate the mean and variance of the distribution from the available data for the 150

items. Plot the theoretical normal distribution for nr_pix and compare to the corresponding

histogram.

4. Assuming that the nr_pix variable is from a normally distributed population as above, what

is the cut-off value for the nr_pix variable such that a randomly sampled image has a 5%

probability of having a nr_pix value that is above that cut-off value?

5. Certain statistical tests assume that data are normally distributed. For each of the feature

variables 1-10, identify variables with extreme skew and investigate transformations of the

variables. Which features do you choose to transform? Explain how you reach your decision

and how the transformation changes the distribution.

6. Investigate the relationship between the “height” variable and the “tallness” variable. Are

these variables linearly associated? Consider suitable visualisation. Describe a statistical test

to measure the degree of association between these two variables.

7. For every pair of features, calculate a measure of the degree of association between the

features. Present the results in a suitable graph or table, indicating which pairs of features

are significantly correlated.

8. Is the nr_pix feature useful to discriminate between the 3 different groups of letters,

numbers and punctuation symbols? (Note that here we are looking for differences between

3 different groups, so consider a statistical method that tests for statistically significant

differences between more than 2 groups). State clearly the statistical test used, the

assumptions of the test, and how the assumptions relate to your data. If you use ANOVA,

provide the full results table for the statistical test, and describe how each element of the

results table is calculated.

9. For every feature, use ANOVA or similar statistical test to test whether there is a difference

between the three groups for that feature. Present the results (i.e. F-scores) for all features

in a suitable graph or table and indicate for which features there is a significant difference

between groups.

10. Fit multiple regression models to predict nr_pix from subsets of the remaining variables.

Consider at least three regression models using different numbers of predictor variables and

compare the models in terms of their goodness-of-fit. Justify your choice of measure of

goodness-of-fit.
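As an illustration of subtask 4, the cut-off is the 95th percentile of the fitted normal distribution, i.e. the estimated mean plus roughly 1.645 estimated standard deviations. The Python sketch below shows the calculation only (the analysis itself must be done in R). The function name is hypothetical, and it assumes the sample standard deviation with an n - 1 denominator is the estimate you choose to use.

```python
from statistics import NormalDist, mean, stdev


def upper_cutoff(values, tail_probability=0.05):
    """Value exceeded with the given probability under a fitted normal model."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return NormalDist(mu, sigma).inv_cdf(1 - tail_probability)
```

If you prefer a different variance estimator, only the `sigma` line changes; the percentile step stays the same.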

For all questions above, you shall explain your reasoning, assumptions and steps of the procedure

(including the statistical analysis) when preparing the report. If you are generating p-values for

analysing the statistical significance of some features, make sure to explain how they were obtained.

It is your task to decide and justify what the most appropriate inference to be performed in each

case is, and to discuss the results you obtained.
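For subtask 8, it can help to see how the entries of a one-way ANOVA table are built from the raw values. The following Python sketch (illustrative only, with a hypothetical function name) computes the sums of squares, degrees of freedom, mean squares and F statistic for a list of groups. It does not compute the p-value, which requires the F distribution and is not available in the Python standard library.

```python
def one_way_anova(groups):
    """Return the elements of a one-way ANOVA table for a list of groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group variation: group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group variation: observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "F": ms_between / ms_within,
    }
```

With three groups of 50 images each, the degrees of freedom would be 2 between groups and 147 within groups.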

Section 4: Machine learning (40%)

In this section, you will use the features you developed and analysed above to solve classification

problems. Specifically, you will fit classifiers to your image data, in order to build and evaluate useful

models that can predict the class labels for unseen images.

1. Using the width feature variable only, fit a logistic regression model to discriminate between

the category of “digits” and the category of “punctuation symbols”. Present the results table
for the logistic regression, including the coefficient estimates, the z-scores and associated p-values.
Briefly interpret the results of the logistic regression.

2. Using the logistic regression model you calculated above, find (a) the digit with the greatest

probability of being incorrectly classified as a punctuation symbol, and (b) the punctuation

symbol with the greatest probability of being incorrectly classified as a digit.

3. Using any 4 features that you think should be useful (justifying your choices, e.g. on the basis

of results in section 3), use logistic regression to discriminate between the “letter” and
“punctuation symbol” categories. Use 5-fold cross-validation to evaluate the accuracy of your

fitted model. Briefly interpret your results.

4. In this question, we aim to build a classifier to discriminate between the three classes of

“digit”, “letter” and “punctuation symbol”. Perform k-nearest-neighbour classification with k

= {1,3,5,9,11,13,15,17,19,21,23} using 5-fold cross-validation, and using any 3 features you

think should be suitable (justifying your choices). (Consider suitable transformations of

features, such as feature scaling).

5. Build the best k-nearest-neighbour model you can that discriminates between the three

groups, evaluated using 5-fold cross-validation. Experiment systematically with different

values of k and different sets of features, explaining and justifying your choices. Your goal is

to try to come up with the best k-nearest-neighbour model you can for classifying the three

categories of symbols.

6. Give a brief conceptual overview of the random forest method for classification. Build a

random forest model that discriminates between the three groups, evaluated using 5-fold

cross-validation. Your goal is to try to come up with the best final model you can for

classifying the three categories of symbols. Briefly compare and interpret your results for

knn and random forest.
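The evaluation procedure used in questions 3 to 6 (seeded fold creation, then training on four folds and testing on the fifth) can be sketched as follows for k-nearest-neighbour classification. This is a minimal Python illustration of the idea, not a substitute for the R packages mentioned in the introduction. All function names are hypothetical, and a seed of 3060 in Python will not reproduce the folds that the same seed gives in R.

```python
import random
from collections import Counter


def make_folds(n_items, n_folds=5, seed=3060):
    """Shuffle item indices reproducibly and deal them into folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def knn_predict(train_x, train_y, query, k):
    """Predict by majority vote among the k nearest training points."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), i)
        for i, x in enumerate(train_x)
    )
    votes = Counter(train_y[i] for _, i in distances[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(xs, ys, k, n_folds=5, seed=3060):
    """Cross-validated accuracy: each item is predicted while held out."""
    correct = 0
    for fold in make_folds(len(xs), n_folds, seed):
        held_out = set(fold)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        correct += sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in fold)
    return correct / len(xs)
```

The sketch uses raw feature values. If features are scaled, the scaling parameters would normally be estimated from the training folds only, so that no information from the held-out fold influences the model.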

Assessment criteria and marking process

The most important criterion in marking is the quality and clarity of your report (approximately 65%

weighting). In your report, you should clearly demonstrate that you understand the methods used in

the assignment. Explain your reasoning, assumptions and steps of the procedures used. You should

explain and interpret your results. What are your results telling you? Are the results what you would

expect? If you ran into difficulties, explain what they were and the efforts you made to try to

overcome them.

Code has a weighting in marking of approximately 30% overall. Your code should be clear and

logically organised, and accurately calculate the values required, but code efficiency and code
sophistication are not important (this assignment does not require complex programming). If you use
freely licensed code, packages, or libraries (which is encouraged), these should be appropriately

referenced (e.g. by citing a URL in a comment). The code must be easy to use and the comments

must include information about the required steps to replicate the results that you have obtained

and are presenting in your report (transparency and replicability are essential in data analysis).

Attention to detail and following the assignment instructions accurately will also be considered in

marking (approximately 5% weighting). Each sub-task has a precise specification. Make sure you

carefully follow the instructions, and use the features specified for each task, the specified

procedures (number of cross-validation folds, seed value, etc). Make sure you upload your

deliverable files in the specified formats.

Instructions for submission

This exercise is to be completed individually. The dataset is generated by you, and it is not
expected to be significantly similar to the data of other students. You shall deliver any generated

source code with suitable documentation/comments. Plagiarism may be severely punished.

For your report, you should use the Word document template provided. You may use LaTeX to

prepare your report if you wish, but please follow the general layout and formatting of the Word

template.

The maximum word count for each section of the report is as follows:

Section 1: 200 words

Section 2: 800 words

Section 3: 3000 words

Section 4: 2000 words

These values should be regarded as maximums only; it should be possible to give appropriate

answers with fewer words. You may include as many figures and tables in your report as you feel is

suitable, and these do not contribute to the word count. You should explain how you have

performed the analysis, but do not explain details of code in your report - use the source code

comments for that. Properly cite any sources that you have used. Use 11-point font for the text

body.

You should upload a single zip file, containing a folder with the following directory structure:

STUDENTNR/                     # Top-level directory
    STUDENTNR_report.pdf       # The assignment report
    STUDENTNR_features.csv     # The features calculated in section 2
    code_section1/             # Directory of source code files for section 1
    code_section2/             # Directory of source code files for section 2
    code_section3/             # Directory of source code files for section 3
    code_section4/             # Directory of source code files for section 4
    images/                    # Directory of 150 image files created in section 1

Ensure that your name, student number and the module details (AIDA CSC3060) are in the header of

the submitted pdf. You must submit the assignment online, using the QOL webpage of the AIDA

module, by the specified date. It is your responsibility to ensure that the assignment is uploaded

correctly (and that the zip file is not corrupt, etc) and you should take steps to check and verify the

upload. A RAR file is not a ZIP file. By submitting this assignment you acknowledge that it is your own

work and that you are aware of university regulations regarding academic offences, including (but

not restricted to) plagiarism and collusion.

