
###### Date: 2019-04-29 09:28

Chapter 7

Assignment

Due: To Be Determined by Class Vote by the End of the First Week of Teaching

The penalty for late hand-in of coursework is 10% of the total marks per working day. No

credit will be given after more than 5 working days. A working day is a 24-hour period

starting from the original hand-in deadline, but skipping days on which the School is closed.

7.1 Individual Assignment (40%)

We will study the so-called South-African Heart-Disease (SAHD) data. Necessary background

is in [3, Section 4.4.2].

Briefly, observations on n = 462 individuals are available. For each individual, the data consist of: a binary indicator (value 0 or 1) of the presence of Coronary Heart Disease (1 = has the disease, 0 = does not), viewed as the response; and the following raw inputs: systolic blood pressure; tobacco use; LDL, also known as “bad cholesterol”; a (binary) indicator of family history of heart disease; (a measure of) obesity; (a measure of) alcohol use; and age. In data file SAHD.mat, the response variable is chd, and the inputs are sbp, tobacco, ldl, famhist, obesity, alcohol, and age, respectively.

We want to regress the response on inputs, where an input may be a raw input or a function

of it (e.g., a quadratic function). The main tool is a logit (logistic) regression of the form

$$Y_i \sim \mathrm{Bernoulli}(\mu_i) \text{ and independent, where } \operatorname{logit}(\mu_i) = \sum_{j=1}^{p} X_{j,i}\beta_j \tag{7.1}$$

where $\operatorname{logit}(\mu) = \log(\mu/(1-\mu))$, and $(Y_i, X_{1,i}, \ldots, X_{p,i})$ is the i-th observation of the response and the inputs. Another possibility is probit regression of the form

$$Y_i \sim \mathrm{Bernoulli}(\mu_i) \text{ and independent, where } \mu_i = \Phi\Bigl(\sum_{j=1}^{p} X_{j,i}\beta_j\Bigr) \tag{7.2}$$

where $\Phi(\cdot)$ is the Normal(0,1) cdf.
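The two mean functions can be compared numerically. The following is a minimal sketch in base MATLAB; the matrix `X` (one row per observation, one column per input) and the coefficients `beta` are made-up values for illustration, not estimates from the SAHD data.

```matlab
% Sketch of the mean functions in (7.1) and (7.2), base MATLAB only.
% X and beta are hypothetical illustrative values.
logit    = @(mu)  log(mu ./ (1 - mu));          % logit link
invlogit = @(eta) 1 ./ (1 + exp(-eta));         % inverse of the logit link
Phi      = @(eta) 0.5 * erfc(-eta ./ sqrt(2));  % Normal(0,1) cdf via erfc

X    = [1 0.5; 1 -1.0; 1 2.0];   % n = 3 observations, p = 2 inputs (first is a constant)
beta = [-0.5; 1.0];              % hypothetical coefficients

eta       = X * beta;            % linear predictor: sum over j of X(i,j) * beta(j)
mu_logit  = invlogit(eta);       % mu_i under the logit model (7.1)
mu_probit = Phi(eta);            % mu_i under the probit model (7.2)

disp([eta, mu_logit, mu_probit]);
```

Both links map a linear predictor of 0 to a probability of 0.5; they differ mainly in how quickly the probability approaches 0 or 1 as the linear predictor grows in magnitude.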

[3, Section 4.4.2] fit logit models of the form (7.1). In [3, Section 5.2.2, pages 146-148], a

more advanced model in equation (5.6) there captures the effect of a raw input Xj on the

response by a spline (piecewise-polynomial) function hj (Xj ); the (estimated) functions are

shown in Figure 5.4 for selected inputs. These findings suggest the following:

Variables sbp, tobacco, and obesity may appear in (7.1) in quadratic form ($b_1 x + b_2 x^2$, where $x$ is the raw input) or in cubic form ($b_1 x + b_2 x^2 + b_3 x^3$).

To capture the effect of age, which does not seem to resemble a low-order polynomial,

consider as inputs the binary indicators of categorized age, similar to what is done in [5,

Example 10]. For example, the sample deciles of age, namely 18, 28, 34, 40, 45, 49, 54,

58, 61, and 64, define the intervals (0, 18], (18, 28], (28, 34], ..., (61, 64], which give 10

categories of age.
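The two suggestions above amount to adding columns to the design matrix. A minimal sketch follows, assuming made-up values in place of the columns of SAHD.mat. The centring and scaling of the raw input, and dropping the first age category, are choices added here for numerical convenience and are not prescribed by the brief.

```matlab
% Made-up values standing in for columns of SAHD.mat (illustration only).
obesity = [22.1; 25.3; 31.0; 27.4; 19.8; 35.2];
age     = [17;   28;   40;   52;   61;   64];

% Cubic form of a raw input: columns x, x^2, x^3.
% Centring and scaling first is an added choice that keeps the powers
% well conditioned; a cubic in z spans the same models as a cubic in x
% once a constant is included.
z = (obesity - mean(obesity)) / std(obesity);
obesityCubic = [z, z.^2, z.^3];

% Age categories from the sample deciles quoted in the brief:
% (0,18], (18,28], ..., (61,64].
hi = [18 28 34 40 45 49 54 58 61 64];   % upper end of each interval
lo = [0, hi(1:end-1)];                  % lower end of each interval
n  = numel(age);
ageInd = zeros(n, numel(hi));
for k = 1:numel(hi)
    ageInd(:, k) = (age > lo(k)) & (age <= hi(k));
end

% With a constant column in the model, one category must be dropped
% (here the first) to avoid exact collinearity.
X = [ones(n, 1), obesityCubic, ageInd(:, 2:end)];
disp(size(X));
```

Each row of `ageInd` contains exactly one 1, so the coefficient of each retained indicator measures the effect of that age category relative to the dropped one.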


Standard tasks (model estimation; testing the significance of a certain coefficient; computing the AIC; etc.) may be done via sahd.m Part 1, which is similar to sco.m (lab 3, Section 6.2.2). (Part 2 supports the group assignment.) A very thorough exploration could involve transformed raw inputs beyond those seen there.

Those wishing to have MATLAB on a non-University computer should start at software.soton.ac.uk. Install at least MATLAB (the basic product) and the toolboxes on statistics and optimization.

7.1.1 Deliverables and Marking Scheme

Submit a typed report to the Faculty Office, with the usual cover about academic integrity.

The audience is a hypothetical manager or analyst familiar with our notes, the brief and the analyses in [3] cited above.

Discuss results, and analysis as necessary, on the following:

1. Develop a preferred model of the response. A better (more parsimonious) model is often found by removing an input $X_j$ if the hypothesis “$\beta_j = 0$” cannot be rejected at some level $\alpha$ (the corresponding t value satisfies $|t| < \Phi^{-1}(1 - \alpha/2)$, where $\Phi^{-1}(\cdot)$ is the Normal(0,1) inverse cdf), and then re-fitting the model. Use of a model-selection criterion, such as the Akaike Information Criterion (AIC), is recommended.

2. Use the preferred model to describe the effect of each of obesity and age on the likelihood of heart disease. For background on interpretation, see Section 4.5.4 and the second-from-last paragraph in [3, Section 4.4.2].

3. Brief explanation of how results were obtained can appear in an appendix. If significant

extensions are made to any codes provided to you during the course (sahd.m or other),

then: (a) submit the code on blackboard (individual assignment), and state in the report

the submitting user (e.g., “xy1g09”) (so the code can be matched to the report); and (b)

provide instructions on how the code can be used.
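The model-reduction step in item 1 can be illustrated on toy data. The sketch below is not sahd.m and does not use the SAHD data: it fits a logit model by Newton-Raphson in base MATLAB, computes t values and the AIC, and compares a full model against one with an input removed. The data, the variable names and the choice α = 0.05 are all assumptions made for illustration. The Statistics Toolbox function `glmfit` offers a ready-made alternative for the fitting step.

```matlab
% Grouped toy data (NOT the SAHD data): 5 levels of x, 8 observations each.
x = kron((-2:2)', ones(8, 1));          % informative input
w = repmat([1; -1], 20, 1);             % input with little relation to the response
k = [1 2 4 6 7];                        % number of responses equal to 1 at each level of x
y = zeros(40, 1);
for g = 1:5
    y((g-1)*8 + (1:k(g))) = 1;
end
Xfull = [ones(40, 1), x, w];

alpha = 0.05;                                     % assumed test level
zcrit = -sqrt(2) * erfcinv(2 * (1 - alpha/2));    % Phi^{-1}(1 - alpha/2)

models = {[1 2 3], [1 2]};              % column sets: full model, then model without w
aic  = zeros(1, numel(models));
tval = cell(1, numel(models));
for m = 1:numel(models)
    X = Xfull(:, models{m});
    b = zeros(size(X, 2), 1);
    for it = 1:50                       % Newton-Raphson for the logit likelihood
        mu   = 1 ./ (1 + exp(-X * b));
        H    = X' * diag(mu .* (1 - mu)) * X;     % Fisher information
        step = H \ (X' * (y - mu));
        b    = b + step;
        if max(abs(step)) < 1e-10, break; end
    end
    mu = 1 ./ (1 + exp(-X * b));                  % fitted probabilities at the final b
    H  = X' * diag(mu .* (1 - mu)) * X;
    se = sqrt(diag(inv(H)));                      % asymptotic standard errors
    tval{m} = b ./ se;
    loglik  = sum(y .* log(mu) + (1 - y) .* log(1 - mu));
    aic(m)  = -2 * loglik + 2 * numel(b);
end

fprintf('critical value %.3f\n', zcrit);
fprintf('full model t values: %s\n', mat2str(tval{1}', 3));
fprintf('AIC full %.2f, AIC without w %.2f\n', aic(1), aic(2));
```

In this toy example the t value of `w` falls below the critical value, so `w` is a candidate for removal, and the AIC of the re-fitted model confirms the choice.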

Avoid re-stating the brief or the codes provided. Length: up to 400 words, excluding tables,

figures, and their captions.

Marking Scheme:

40%. Plausibility of the proposed model. The plausibility will be judged against the

findings in [3] and against experiments the manager could attempt; thus your analysis

should be informed by the above. For example, while Table 4.3 of [3] suggests that

obesity and systolic blood pressure do not have a significant effect, the reverse is found in

[3, Figure 5.4]; the latter suggests that appropriate functions of these inputs should enter

the regression (e.g., as a quadratic function, or a categorization).

30%. Correctness of estimation of any model being proposed.

30%. Quality of presentation (clarity, coherence, ease of reading).

7.2 Group Assignment (60%)

We wish to develop a classification (prediction) method of the response (0/1 indicator of heart

disease) based on the inputs. The aim is to minimize the expected loss (per observation,

as in (5.7)) under the following cost structure: correct classification costs nothing; misclassification of class 0 (true is 0, prediction is 1) costs 1 unit; and misclassification of class 1 (true is 1, prediction is 0) costs 10 units.
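The cost structure translates directly into an average loss for any set of predictions. The sketch below uses made-up classes and fitted probabilities, and assumes a classifier that predicts class 1 when the fitted probability exceeds a threshold `t`; equations (5.7) and (6.1) are not reproduced here, so the exact form used in the notes may differ.

```matlab
% Costs from the brief: misclassifying class 0 costs 1, misclassifying class 1 costs 10.
cFP = 1;  cFN = 10;

% Made-up true classes and fitted probabilities (illustration only).
y  = [0; 0; 0; 1; 1; 0; 1; 0];
mu = [0.05; 0.20; 0.60; 0.08; 0.70; 0.02; 0.30; 0.15];

thresholds = [0.10 0.50];          % two candidate classification thresholds
avgLoss = zeros(size(thresholds));
for j = 1:numel(thresholds)
    yhat = mu > thresholds(j);     % predict class 1 when mu exceeds the threshold
    loss = cFP * (y == 0 & yhat) + cFN * (y == 1 & ~yhat);   % loss per observation
    avgLoss(j) = mean(loss);
    fprintf('threshold %.2f: average loss %.3f\n', thresholds(j), avgLoss(j));
end
```

Because a missed case of disease costs ten times as much as a false alarm, low thresholds tend to do better. If the probabilities were known exactly, predicting class 1 whenever 10μ > 1 − μ, that is μ > 1/11, would minimize the expected cost.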

Similar to the analysis seen in lab 4, we consider classifiers defined by a Bernoulli (GLM)

model of the response and a classification threshold t, as in (6.1); and the main aim is to

minimize the expected loss by choice of a classifier. The more extensive the set of models

and thresholds considered, the smaller the loss we can expect. Moreover, a “better” model

(with an appropriate threshold) is more likely to minimize the loss; thus, the model chosen

for the individual assignment seems (a priori) a strong contender.


Script sahd.m, Part 2, implements an analysis of the form above, similar to cvsco.m (lab 4).

Specifically, a set of classifiers is developed; the expected (out-of-sample) loss is estimated

via cross-validation (Section 5.3, equation (5.10)); and the classifier minimizing the estimated

loss is selected.
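The selection procedure just described can be sketched as follows. This is not sahd.m: it is a self-contained illustration in base MATLAB on toy data, with two hypothetical candidate models and a grid of thresholds, using leave-one-out cross-validation and the cost structure of the brief. The out-of-sample probabilities are computed once per model and reused for every threshold.

```matlab
% Toy data (NOT the SAHD data): 5 levels of x, 8 observations each.
x = kron((-2:2)', ones(8, 1));
k = [1 2 4 6 7];                         % number of responses equal to 1 at each level
y = zeros(40, 1);
for g = 1:5
    y((g-1)*8 + (1:k(g))) = 1;
end
n = numel(y);

models     = {[ones(n,1), x], [ones(n,1), x, x.^2]};   % two candidate design matrices
thresholds = (1:19) / 20;                              % 0.05, 0.10, ..., 0.95
cFP = 1;  cFN = 10;                                    % costs from the brief

cvLoss = zeros(numel(models), numel(thresholds));
for m = 1:numel(models)
    X = models{m};
    muOut = zeros(n, 1);                 % out-of-sample fitted probabilities
    for i = 1:n
        tr = true(n, 1);  tr(i) = false; % leave observation i out
        b = zeros(size(X, 2), 1);
        for it = 1:50                    % Newton-Raphson logit fit on the rest
            mu   = 1 ./ (1 + exp(-X(tr,:) * b));
            H    = X(tr,:)' * diag(mu .* (1 - mu)) * X(tr,:);
            step = H \ (X(tr,:)' * (y(tr) - mu));
            b    = b + step;
            if max(abs(step)) < 1e-8, break; end
        end
        muOut(i) = 1 / (1 + exp(-X(i,:) * b));
    end
    for j = 1:numel(thresholds)
        yhat = muOut > thresholds(j);
        cvLoss(m, j) = mean(cFP * (y == 0 & yhat) + cFN * (y == 1 & ~yhat));
    end
end

[bestLoss, idx]    = min(cvLoss(:));     % classifier with the smallest estimated loss
[bestModel, bestJ] = ind2sub(size(cvLoss), idx);
fprintf('selected model %d, threshold %.2f, CV loss %.3f\n', ...
        bestModel, thresholds(bestJ), bestLoss);
```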

A second (subsequent) task is to estimate the expected loss, L say, of the selected classifier.

The cross-validation (CV) loss estimate of the selected classifier (the minimum CV estimate across all classifiers considered) is an optimistic (biased low) estimate of L [3, Section 7.2, page 222]. To obtain an unbiased estimate,

the following is proposed: (a) on the first two thirds of the data, apply cross-validation for

classifier selection; and (b) on the last third of the data, the selected classifier is applied; the

average loss is an unbiased estimate of L. For this task, minor modification of sahd.m (or a

new code) may be necessary.
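A minimal sketch of the two-step proposal follows, again on toy data rather than the SAHD data. For brevity only the threshold is selected (a single hypothetical model is used). The sketch assumes that the coefficients of the selected classifier are estimated on the first two thirds before it is applied to the last third, which the brief does not state explicitly.

```matlab
% Toy data (NOT the SAHD data): rows of Y are replicates, columns are x = -2..2.
Y = [0 0 1 1 1; 0 1 0 1 1; 0 0 1 0 1; 1 0 0 1 1; 0 0 1 1 0; ...
     0 1 0 1 1; 0 0 1 1 1; 0 1 0 0 1; 0 0 1 1 1];
y = reshape(Y', [], 1);
x = repmat((-2:2)', 9, 1);
n = numel(y);
X = [ones(n, 1), x];

nSel = round(2 * n / 3);
sel  = 1:nSel;                 % first two thirds: classifier selection
tst  = nSel+1:n;               % last third: loss estimation only
thresholds = (1:19) / 20;
cFP = 1;  cFN = 10;            % costs from the brief

% (a) Leave-one-out CV on the selection part to choose the threshold.
muOut = zeros(nSel, 1);
for i = 1:nSel
    tr = sel(sel ~= i);
    b = zeros(2, 1);
    for it = 1:50              % Newton-Raphson logit fit
        mu   = 1 ./ (1 + exp(-X(tr,:) * b));
        step = (X(tr,:)' * diag(mu .* (1 - mu)) * X(tr,:)) \ (X(tr,:)' * (y(tr) - mu));
        b    = b + step;
        if max(abs(step)) < 1e-8, break; end
    end
    muOut(i) = 1 / (1 + exp(-X(i,:) * b));
end
cvLoss = zeros(size(thresholds));
for j = 1:numel(thresholds)
    yhat = muOut > thresholds(j);
    cvLoss(j) = mean(cFP * (y(sel) == 0 & yhat) + cFN * (y(sel) == 1 & ~yhat));
end
[cvMin, bestJ] = min(cvLoss);
tBest = thresholds(bestJ);

% (b) Fit on the selection part, then apply the selected classifier to the rest.
b = zeros(2, 1);
for it = 1:50
    mu   = 1 ./ (1 + exp(-X(sel,:) * b));
    step = (X(sel,:)' * diag(mu .* (1 - mu)) * X(sel,:)) \ (X(sel,:)' * (y(sel) - mu));
    b    = b + step;
    if max(abs(step)) < 1e-8, break; end
end
yhatTst  = (1 ./ (1 + exp(-X(tst,:) * b))) > tBest;
holdLoss = mean(cFP * (y(tst) == 0 & yhatTst) + cFN * (y(tst) == 1 & ~yhatTst));
fprintf('threshold %.2f: CV loss %.3f, held-out loss %.3f\n', tBest, cvMin, holdLoss);
```

The held-out observations play no part in choosing the threshold or the coefficients, which is what removes the selection bias from `holdLoss`. The price is a noisier estimate, since only a third of the data contributes to it.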

7.2.1 Deliverables and Marking Scheme

Submit a typed report to the Faculty Office, with the usual cover about academic integrity.

The audience is a hypothetical manager or analyst familiar with our notes, the brief and the analyses in [3] cited above.

Discuss results, and analysis as necessary, on the following:

1. State a preferred classifier, with the aim of minimizing the expected loss. In estimating

the loss, leave-one-out cross-validation (Section 5.3) seems preferable (the modest size of

the sample makes this practical).

2. Provide an unbiased estimate of the expected loss L of the selected classifier.

3. Brief explanation of how results were obtained can appear in an appendix. If significant

extensions are made to any codes provided to you during the course (sahd.m or other),

then: (a) submit the code on blackboard (group assignment), and state in the report the

submitting user (e.g., “xy1g09”) (so the code can be matched to the report); and (b)

provide instructions on how the code can be used.

Avoid re-stating the brief or the codes provided. Length: up to 500 words, excluding tables,

figures, and their captions.

Marking Scheme:

50%. Thoroughness of exploration of potential classifiers, and success in minimizing the

expected loss.

20%. Correctness of estimation of the expected loss L of the selected classifier.

30%. Quality of presentation: clarity, coherence, ease of reading.

Group Work

Work in a group of size up to 4. Each group makes a single submission of the deliverables.

Group members are responsible for functioning as a team. A common mark will be given

to the members of a group. In exceptional cases where there is strong evidence of lack of

contribution by a member, a correction may be made. A key requirement for making such a

correction is majority opinion (i.e., 2 or more against 1). To help resolve such cases, keeping

minutes of work meetings may be appropriate.
