CH 5440 Multivariate Data Analysis
Assignment 1
1. (a) Let $u_1, u_2, \ldots, u_N$ and $y_1, y_2, \ldots, y_N$ be a set of N measurements of two variables u and
y which are linearly related. We are interested in determining the linear regression parameter
a where y = au + b. Assume that the measurements of u and y contain errors, with standard
deviations $\sigma_\delta$ and $\sigma_\varepsilon$, respectively. (i) If the ratio of the error variances
$\rho = \sigma_\varepsilon^2 / \sigma_\delta^2$ is known,
derive the weighted TLS (WTLS) estimates of a and b in terms of $s_{uu}$, $s_{yy}$, $s_{uy}$, $\bar{u}$, $\bar{y}$, $\rho$. (ii) How
will the solution for a change if it is given that the constant b is 0?
Note: The WTLS regression problem when the error variances are known is the solution of
the following minimization problem:
$$\min_{a,\,b,\,\hat{u}_i} \sum_{i=1}^{N} \left[ \frac{(y_i - a\hat{u}_i - b)^2}{\sigma_\varepsilon^2} + \frac{(u_i - \hat{u}_i)^2}{\sigma_\delta^2} \right]$$
Multiply the objective function by $\sigma_\varepsilon^2$ and replace the
ratio of the error variances by $\rho$.
Differentiate the objective function with respect to the decision
variables and solve the resulting set of nonlinear algebraic equations to obtain the parameters
a and b.
(b) From the solution of WTLS, obtain the solutions of the regression parameters for IOLS
and OLS in the limits $\rho \to 0$ and $\rho \to \infty$, respectively. Also obtain the solution for the estimates of $u_i$ and
$y_i$ for each case (OLS, IOLS, WTLS) in terms of the regression parameters and measurements.
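For checking a hand derivation numerically, the following is a minimal Python sketch of the standard Deming-regression closed form, which solves the same objective when $\rho$ is the ratio of the y-error variance to the u-error variance. The function names `wtls_fit` and `wtls_estimates` are illustrative, and the formula is stated here as an assumption rather than derived; it requires a nonzero sample covariance $s_{uy}$.

```python
import math

def wtls_fit(u, y, rho):
    """Weighted TLS line y = a*u + b, with rho = var(y error) / var(u error)."""
    n = len(u)
    u_bar = sum(u) / n
    y_bar = sum(y) / n
    s_uu = sum((ui - u_bar) ** 2 for ui in u) / n
    s_yy = sum((yi - y_bar) ** 2 for yi in y) / n
    s_uy = sum((ui - u_bar) * (yi - y_bar) for ui, yi in zip(u, y)) / n
    d = s_yy - rho * s_uu
    # Root of the quadratic in a whose sign matches the sign of s_uy.
    a = (d + math.sqrt(d * d + 4.0 * rho * s_uy ** 2)) / (2.0 * s_uy)
    b = y_bar - a * u_bar
    return a, b

def wtls_estimates(u, y, a, b, rho):
    """Estimates (u_hat, y_hat) of the true values, lying on the fitted line."""
    pairs = []
    for ui, yi in zip(u, y):
        u_hat = ui + a * (yi - b - a * ui) / (a * a + rho)
        pairs.append((u_hat, a * u_hat + b))
    return pairs

# Illustrative data only (not from the assignment).
u_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1]
a_demo, b_demo = wtls_fit(u_demo, y_demo, rho=1.0)
print(a_demo, b_demo)
```

Calling `wtls_fit` with a very large or very small `rho` gives a quick numerical check of the OLS and IOLS limits.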
2. Carbon dioxide (CO2) is one of the major greenhouse gases implicated in the gradual
warming of the earth’s temperature. Measured concentrations of CO2 (in ppm) and
atmospheric temperature (spatially and temporally averaged over a year) available from
USEPA’s Climate Change Indicators website (www.epa.gov/climate-indicators) between
1984 and 2014 are given in Table 1. The temperatures are deviations in deg F from the average
temperature in the period 1901-2000. Climate models recommend that the global temperature
increase should be kept below 2 deg C (3.6 deg F) by cutting down on CO2 emissions. Using
OLS and TLS regression, estimate the maximum permissible level of CO2 in the
atmosphere that can meet this goal. Assuming that the level of CO2 increases linearly with time,
estimate from the given data how many years it will take for CO2 to reach the maximum
permissible level. Note that this is a simplified analysis because other greenhouse gases such as
methane, nitrous oxide, water vapour, etc. have not been considered. In order to improve your
model you are encouraged to use other reliable data sources you can find (cite the sources from
which you obtain additional data).
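The OLS part of this calculation can be sketched in Python as follows. This is only an outline under stated assumptions: it regresses temperature on CO2, uses every sixth row of Table 1 for brevity (the full table should be used in practice), and the helper name `ols_line` is illustrative. The TLS variant would replace `ols_line` in the first fit.

```python
def ols_line(x, y):
    """Ordinary least squares fit y = slope * x + intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar

# Subset of Table 1 (1984, 1990, ..., 2014), used here for brevity only.
years = [1984, 1990, 1996, 2002, 2008, 2014]
co2 = [344.58, 354.35, 362.59, 373.22, 385.59, 398.61]  # ppm
temp = [0.27, 0.774, 0.576, 1.08, 0.972, 1.332]         # deg F deviation

TEMP_LIMIT_F = 3.6  # 2 deg C limit quoted in the problem statement

# Step 1: temperature as a linear function of CO2, inverted at the limit.
a, b = ols_line(co2, temp)
co2_max = (TEMP_LIMIT_F - b) / a

# Step 2: CO2 as a linear function of time, inverted at co2_max.
c, d = ols_line(years, co2)
year_max = (co2_max - d) / c

print("max permissible CO2 (ppm):", co2_max)
print("years after 2014:", year_max - 2014)
```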
3. The level of phytic acid in urine samples was determined by a catalytic fluorimetric (CF)
method and the results were compared with those obtained using an established extraction
photometric (EP) technique. The results, in mg/L, are the means of triplicate measurements, as
shown in Table 2.
(a) Is the new method (CF) a good substitute for the established method (EP) for measuring
the level of phytic acid in urine? Justify your conclusion using linear regression between
the two methods for different modelling assumptions regarding the accuracy of the
respective measurement techniques.
(b) Estimate the level of phytic acid in urine if the EP measurement is 2.31 mg/L and the CF
measurement is 2.20 mg/L, for different modelling assumptions.
4. Image analysis is used to identify defects in infrastructure such as bridges and roads, or in
manufactured products such as glass sheets, rolled steel sheets, etc. One of the first steps in
image analysis is annotation of the defect using an annotation tool such as CVAT, where each
defect is marked using a polygon enclosing the defect. The corners of the polygon are pixels
which are indicated by the x and y coordinates of the pixel in the image. Table 3 gives the x
and y coordinates of the corner points of the polygons for three different defects found from a
drone image of a concrete pillar of a bridge. It is required to estimate the orientation of the
defect and check if it is aligned with the horizontal or vertical axis (indicating perhaps it is due
to corrosion of vertical or horizontal steel reinforcement bars buried within the concrete).
Identify which of the three defects could be due to corrosion of steel reinforcement bars.
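One possible way to estimate the orientation is to treat the polygon corners as a 2-D data set and take the direction of its first principal component, which is also the TLS line through the points. The Python sketch below uses the closed-form principal-axis angle of a 2 x 2 covariance matrix. The function names and the rectangle coordinates are hypothetical, since Table 3 is not reproduced here; note also that in image coordinates the y axis usually points downward, which flips the sign of a tilt but not whether the defect is axis-aligned.

```python
import math

def polygon_orientation_deg(points):
    """Angle in [0, 180) of the first principal axis of the corner points."""
    n = len(points)
    x_bar = sum(p[0] for p in points) / n
    y_bar = sum(p[1] for p in points) / n
    s_xx = sum((p[0] - x_bar) ** 2 for p in points) / n
    s_yy = sum((p[1] - y_bar) ** 2 for p in points) / n
    s_xy = sum((p[0] - x_bar) * (p[1] - y_bar) for p in points) / n
    # Principal-axis angle of the covariance matrix [[s_xx, s_xy], [s_xy, s_yy]].
    theta = 0.5 * math.atan2(2.0 * s_xy, s_xx - s_yy)
    return math.degrees(theta) % 180.0

def axis_misalignment_deg(angle_deg):
    """Deviation of an orientation from the nearest horizontal or vertical axis."""
    r = angle_deg % 90.0
    return min(r, 90.0 - r)

# Hypothetical corner points (pixels) of an elongated horizontal defect.
corners = [(0, 0), (10, 0), (10, 2), (0, 2)]
angle = polygon_orientation_deg(corners)
print(angle, axis_misalignment_deg(angle))
```

A small misalignment value (the tolerance is a modelling choice) would suggest a defect running along a horizontal or vertical bar.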
Table 1. Measured average atmospheric CO2 concentration and temperature
Year   CO2 (ppm)   Temp (°F)     Year   CO2 (ppm)   Temp (°F)
1984   344.58      0.27          1999   368.33      0.792
1985   346.04      0.234         2000   369.52      0.756
1986   347.39      0.414         2001   371.13      0.972
1987   349.16      0.666         2002   373.22      1.08
1988   351.56      0.666         2003   375.77      1.098
1989   353.07      0.522         2004   377.49      1.026
1990   354.35      0.774         2005   379.8       1.17
1991   355.57      0.72          2006   381.9       1.098
1992   356.38      0.45          2007   383.76      1.098
1993   357.07      0.504         2008   385.59      0.972
1994   358.82      0.612         2009   387.37      1.134
1995   360.8       0.81          2010   389.85      1.26
1996   362.59      0.576         2011   391.63      1.026
1997   363.71      0.918         2012   393.82      1.116
1998   366.65      1.134         2013   396.48      1.188
                                 2014   398.61      1.332
CH5440: MULTIVARIATE DATA ANALYSIS
ASSIGNMENT 2
1. The gases carbon dioxide (CO2), methane (CH4), nitrous oxide (N2O) and
ozone (O3) in the atmosphere are implicated in increasing global temperatures, and are
known as greenhouse gases. The concentrations of these gases in the atmosphere and the
corresponding global average temperatures obtained from the EPA website
(https://www.epa.gov/climate-indicators/weather-climate) between the years 1984 and
2014 are given in the Excel file ghg-concentrations_1984-2014.xlsx (units for the different
variables are also given in the Excel sheet).
(a) Develop a multilinear regression model between global temperature (deviations) and
concentrations of greenhouse gases using OLS. Is the global temperature positively
correlated with increase in the concentration of these gases?
(b) Estimate the error variance in temperature measurements and confidence intervals
(CIs) for all regression coefficients. Based on residual analysis, remove samples
suspected of being outliers (one at a time) until there are no outliers.
(c) Improve the regression model obtained in step (b) by dropping unimportant
(insignificant) variables (one at a time).
(d) The effect of different gases on the global temperature is expressed in terms of CO2
equivalents or global warming potential (GWP). Is it possible to make any inference
regarding GWP of the gases from the regression coefficients? Compare the GWP
obtained from regression coefficients to the values obtained over a 20-year time horizon:
CO2 (1), CH4 (86), N2O (289).
Notes: Water vapour, which is present in significant amounts in the atmosphere, is also a
greenhouse gas, but it remains almost constant and is relatively unaffected by human
activity. CFCs/HCFCs, which are also greenhouse gases, have however been monitored
only in recent years.
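The OLS step in parts (a) and (b) can be sketched in pure Python as below, by solving the normal equations and estimating the error variance from the residuals. This is an illustrative outline only: the names `solve_linear` and `ols_multilinear` are made up for this sketch, the demo data are synthetic (not the GHG data), and with real concentrations the normal equations can be badly conditioned, so centring or scaling the variables first is advisable.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        s = sum(M[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (M[k][n] - s) / M[k][k]
    return x

def ols_multilinear(X, y):
    """Return (beta, sigma2) for y = beta[0] + sum_j beta[j] * x_j + error."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve_linear(XtX, Xty)
    resid = [yi - sum(bj * rj for bj, rj in zip(beta, r))
             for r, yi in zip(rows, y)]
    sigma2 = sum(e * e for e in resid) / (len(y) - p)  # unbiased error variance
    return beta, sigma2

# Synthetic, noise-free demo: y = 1 + 2*x1 - 0.5*x2.
X_demo = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4)]
y_demo = [1 + 2 * x1 - 0.5 * x2 for x1, x2 in X_demo]
beta, sigma2 = ols_multilinear(X_demo, y_demo)
print(beta, sigma2)
```

Confidence intervals for the coefficients would additionally need the diagonal of the inverse of `XtX` and a t-distribution quantile, which are not shown here.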
2. Consider the problem of developing a correlation between saturated pressure (Psat) and
saturated temperature T (boiling point). For pure components, the Antoine equation
given below generally fits the data well:
$\ln P^{sat} = A - B/(T + C)$
For n-hexane, the values of the constants are A = 14.0568, B = 2825.42, and C = 230.44,
where Psat is given in kPa and T in deg C. Using this correlation, a data set consisting of
100 samples has been generated in the temperature range 10 to 70 deg C. Gaussian
measurement errors with standard deviations of 0.18 deg C and 2 kPa have been added to
the true temperatures and saturated pressures, respectively, to generate the
measurements (available in vpdata.mat).
(a) The Clausius-Clapeyron equation is a theoretically derived model between Psat and T
and is given by
$\ln P^{sat} = A' - B'/T$
Assuming that temperature measurements are noise-free and pressure measurements are
noisy, use linear regression to obtain estimates of parameters A' and B'.
(b) Assuming that temperature measurements are noise-free and pressure measurements
are noisy, use nonlinear regression to obtain estimates of parameters A, B and C.
(c) Assuming both pressure and temperature measurements are noisy, apply weighted
total least squares to obtain estimates of parameters A, B, and C. Use the inverse of the standard
deviation of errors as weights to set up the nonlinear optimization problem.
(d) For the models obtained in (a), (b), and (c) report the maximum error in predicting the
saturated pressures using the identified model for the sample data.
Use the MATLAB function lsqnonlin to estimate the nonlinear model parameters in (b) and (c).
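As a rough illustration of part (a), the Python sketch below evaluates the Antoine equation with the given n-hexane constants and fits the Clausius-Clapeyron form by OLS on ln Psat versus 1/T. Two assumptions are made: T in the Clausius-Clapeyron equation is taken as absolute temperature in kelvin, and noise-free Antoine values stand in for the contents of vpdata.mat. The nonlinear fits in (b) and (c) are not shown.

```python
import math

A, B, C = 14.0568, 2825.42, 230.44  # n-hexane constants from the problem

def antoine_psat(t_c):
    """Saturated pressure in kPa at temperature t_c in deg C."""
    return math.exp(A - B / (t_c + C))

def fit_clausius_clapeyron(t_c, p_kpa):
    """OLS fit of ln P = A' - B'/T with T in kelvin; returns (A', B')."""
    x = [1.0 / (t + 273.15) for t in t_c]
    z = [math.log(p) for p in p_kpa]  # requires strictly positive pressures
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    slope = (sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
             / sum((xi - x_bar) ** 2 for xi in x))
    return z_bar - slope * x_bar, -slope

# Noise-free stand-in data: 100 temperatures from 10 to 70 deg C.
temps = [10.0 + 60.0 * i / 99 for i in range(100)]
press = [antoine_psat(t) for t in temps]

a_cc, b_cc = fit_clausius_clapeyron(temps, press)
# Maximum absolute error in predicted pressure, as asked for in part (d).
max_err = max(abs(p - math.exp(a_cc - b_cc / (t + 273.15)))
              for t, p in zip(temps, press))
print(a_cc, b_cc, max_err)
```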
3. A zoologist obtained measurements of the mass (in grams), the snout-vent length
(SVL) and hind limb span (HLS) in mm of 25 lizards. The mean and covariance
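matrix of the data about the mean (variables ordered as mass, SVL, HLS) are given by
$$\bar{x} = \begin{bmatrix} 9 \\ 68 \\ 129 \end{bmatrix}, \qquad S = \begin{bmatrix} 7 & 21 & 34 \\ 21 & 64 & 102 \\ 34 & 102 & 186 \end{bmatrix}$$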
(a) The largest eigenvalue of the above covariance matrix is 250.4. Determine the
normalized eigenvector corresponding to this eigenvalue. Also determine the remaining
eigenvalues and corresponding mutually orthogonal eigenvectors.
(b) How many principal components should be retained, if at least 95% of the variance
in the data has to be captured?
(c) Assuming that there are two linear relationships among the three variables, determine
one possible set of these linear relations.
(d) Using the PCA model, determine the scores for a female lizard with the following
measurements: mass = 10.1 g, SVL = 73 mm and HLS = 135.5 mm.
(e) Using the PCA model, estimate the mass of a lizard whose measured SVL is 73 mm.
(f) Using the PCA model, estimate the mass of a lizard whose measured SVL is 73 mm
and measured HLS is 135.5 mm.
Note: The first and second problems can be solved using MATLAB, while the third
problem should be solved manually and can be verified using MATLAB. Submit
the MATLAB codes along with your solution.
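A manual solution to the third problem can be cross-checked numerically. The Python sketch below (an alternative to the MATLAB check, using power iteration rather than a library eigensolver) finds the dominant eigenpair of S, the fraction of variance it captures, and the first score of the lizard in part (d). The covariance matrix and mean are those given in the problem, with variables ordered as mass, SVL, HLS; the function names are illustrative.

```python
import math

S = [[7.0, 21.0, 34.0],
     [21.0, 64.0, 102.0],
     [34.0, 102.0, 186.0]]
x_mean = [9.0, 68.0, 129.0]      # mass (g), SVL (mm), HLS (mm)
x_lizard = [10.1, 73.0, 135.5]   # measurements from part (d)

def matvec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def dominant_eigenpair(M, iters=100):
    """Largest eigenvalue and unit eigenvector by power iteration."""
    v = [1.0] * len(M)
    for _ in range(iters):
        w = matvec(M, v)
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    lam = sum(vi * wi for vi, wi in zip(v, matvec(M, v)))  # Rayleigh quotient
    return lam, v

lam1, v1 = dominant_eigenpair(S)
explained = lam1 / sum(S[i][i] for i in range(3))  # trace = total variance
score1 = sum(vi * (xi - mi) for vi, xi, mi in zip(v1, x_lizard, x_mean))
print(lam1, v1, explained, score1)
```

The remaining eigenpairs can be obtained by deflating S with the first eigenpair and repeating the iteration, or from the characteristic polynomial by hand.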
CH 5440 Multivariate Data Analysis in Process Monitoring and Diagnosis
Assignment 3
1. Model identification using PCA
Consider the flow process shown in Fig. 1 consisting of five streams, the flow rates
of all of which are measured. A data set (flowdata3.mat) consisting of 1000
samples corresponding to different steady states have been obtained.
(a) Apply PCA to identify the linear constraint model relating the variables
(assuming that you know the number of linear relations that exist between the
variables). In order to verify whether your constraint model is good, choose F3
and F5 as independent variables and obtain the relationship between the
dependent and independent variables (regression form of the model) using your
estimated constraint model, and find the maximum absolute difference (maxdiff)
between the estimated regression model coefficients and the true regression model
coefficients. Report the eigenvalues and maxdiff value.
(b) Apply IPCA to estimate diagonal error variances and identify the linear steady
state model relating the flow variables (assuming that you know the number
of linear relations that exist between the variables). Report the estimated variances,
eigenvalues and maxdiff value.
(c) Apply IPCA assuming incorrectly that there are four constraints. Report the
eigenvalues obtained. Are you able to determine from the eigenvalues that the
number of constraints has been incorrectly guessed? Give reasons for your
answer.
(d) From the constraint model identified in (b) suggest a procedure (a measure)
by which you can determine a set of independent variables for the process.
Determine the best and worst possible choice of independent variable set for this
system based on your proposed measure and justify whether these inferences
(obtained from data) are consistent with the physical process.
2. Multivariate calibration model using PCA
Multivariate calibration of spectral measurements is a technique that is used in
chemometrics to develop a model relating spectral measurements (obtained using
instruments such as UV, FIR or NIR or MS spectrophotometers) to properties of
species (usually liquids or gases), such as concentration. The
application we consider is to obtain a model relating UV absorbance spectra to
compositions (concentrations) of mixtures. Such a model is useful in online
monitoring of chemical and biochemical reactions.
Twenty-six samples of different concentrations of a mixture of Co, Cr, and Ni ions
in dilute nitric acid were prepared in a laboratory and their spectra recorded over
the range 300-650 nm using a HP 8452 UV diode array spectrophotometer (data
in Inorfull.mat). (Water and ethanol are generally used as solvents since these do
not absorb in the UV range. Also, the nitrate ions do not absorb in the UV range.
So an aqueous solution of nitric acid is used to dissolve the metals in this
experiment.) Five replicates for each mixture were obtained. The measurements
were made at 2 nm intervals, giving rise to an absorbance matrix of size 130 x 176.
The concentrations of the 26 samples, a 26 x 3 matrix, are also given in
the data file. In order to predict the concentration of a mixture using absorbance
measurements, it is necessary to build a calibration model relating the concentrations
of mixtures to their absorbance spectra. According to the Beer-Lambert law, the
absorbance spectrum of a dilute mixture is a linear (weighted) combination of the
pure component spectra, with the weights corresponding to the concentrations of
the species in the mixture.
If absorbances are measured at only a minimum number of wavelengths, then OLS
can be used to build a calibration model. For example, if a mixture contains ns
non-reacting species, then absorbances at ns wavelengths need to be measured.
Typically, the wavelengths are chosen corresponding to the maximum absorbing
wavelengths of individual species. However, if we measure absorbances at nw >
ns wavelengths, then the absorbance matrix will not have full column rank. In this
case, Principal Component Regression (PCR) can be used to develop a multivariate
calibration model. In this method, PCA is first applied to the absorbance matrix to
obtain the scores corresponding to different mixtures. In the second step, a
regression model is used to relate the concentrations to the scores using OLS
(assuming concentrations are the dependent variables). In order to use this model
for predicting the concentrations of a mixture whose absorbance spectrum is given,
we first obtain the scores and then use the OLS regression model to predict the
concentrations. Note that the true rank of the absorbance matrix is equal to the
number of species in the mixture.
The quality of the linear calibration model is evaluated using leave-one-sample-out
cross-validation (LOOCV) and computing the root mean square error (RMSE)
in predicting the left-out sample concentrations. Pick the first replicate for each
mixture to obtain a data matrix of size 26 x 176 and use it for the following different
multivariate calibration modelling methods. For each method, report the LOOCV
RMSE results in the form of a table for the number of PCs chosen between 1 and 5.
Based on the RMSE values, indicate whether you are able to estimate the number
of species correctly.
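The LOOCV RMSE computation itself is independent of the calibration method. The Python sketch below shows the loop with a one-variable OLS line as a stand-in model, which is an assumption made only to keep the example short; in the assignment, the `fit` and `predict` callables (hypothetical names) would wrap PCR, scaled PCR, IPCR or MLPCR with a chosen number of PCs.

```python
import math

def loocv_rmse(x, y, fit, predict):
    """Leave-one-sample-out RMSE for a model defined by fit/predict callables."""
    sq_err = 0.0
    for i in range(len(x)):
        # Train on all samples except sample i, then predict sample i.
        model = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        sq_err += (predict(model, x[i]) - y[i]) ** 2
    return math.sqrt(sq_err / len(x))

def fit_line(x, y):
    """Stand-in model: OLS line, returned as (slope, intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return slope, y_bar - slope * x_bar

def predict_line(model, xi):
    slope, intercept = model
    return slope * xi + intercept

# Illustrative data only.
x_demo = [0.0, 1.0, 2.0, 3.0, 4.0]
y_demo = [1.0, 3.0, 5.0, 7.0, 10.0]
print(loocv_rmse(x_demo, y_demo, fit_line, predict_line))
```

For the three-species problem, one RMSE per species (or a combined value) would be tabulated against the number of PCs.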
(a) Develop a multivariate calibration model using PCR.
(b) The absorbances are very noisy near the ends of the instrument range. Estimate the
standard deviation of errors in absorbance measurements using the five replicates
for each wavelength and for each mixture. Assume that the error standard
deviations vary significantly with respect to wavelength but are almost the same for all
mixtures (verify this by plotting the estimated standard deviations with respect to wavelength
and mixtures). Therefore, obtain the average standard deviation of errors with
respect to each wavelength. Use these standard deviations to scale the
absorbance measurements for each wavelength before applying PCR to develop
the calibration model (known as scaled PCR).
(c) Use IPCA to estimate the error variances with respect to wavelength in step 1
of PCR and use it to develop the calibration model (known as IPCR).
(d) If the error variances vary with respect to both mixtures and wavelengths,
then Maximum Likelihood PCA (MLPCA) proposed by Wentzell et al. (1997) can
be used to reduce the rank of the absorbance matrix, followed by OLS to develop
the calibration model (also known as MLPCR). Write a MATLAB function to
implement MLPCA given a data matrix, the corresponding error standard deviation
matrix, and the number of factors (or PCs). The function should return the scores
matrix. Use this function and the standard deviation of errors for each wavelength
and mixture estimated directly from the replicate measurements to develop the
calibration model using MLPCR.
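The replicate-based scaling described in item (b) can be sketched in Python as follows. The function names are illustrative, the tiny two-wavelength arrays are made-up stand-ins for the 5 x 176 replicate blocks, and a simple mean over mixtures is used, as the item describes; pooling variances instead would be an alternative choice.

```python
import math

def replicate_std(replicates):
    """Sample standard deviation at each wavelength across replicate spectra."""
    n = len(replicates)
    stds = []
    for j in range(len(replicates[0])):
        col = [r[j] for r in replicates]
        m = sum(col) / n
        stds.append(math.sqrt(sum((c - m) ** 2 for c in col) / (n - 1)))
    return stds

def average_std(std_by_mixture):
    """Mean over mixtures of the per-wavelength standard deviations."""
    n_mix = len(std_by_mixture)
    return [sum(s[j] for s in std_by_mixture) / n_mix
            for j in range(len(std_by_mixture[0]))]

def scale_spectrum(spectrum, std):
    """Divide each absorbance by the error standard deviation at its wavelength."""
    return [a / s for a, s in zip(spectrum, std)]

# Made-up replicate spectra for two mixtures at two wavelengths.
mix1 = [[1.0, 10.0], [1.2, 14.0], [0.8, 12.0]]
mix2 = [[2.0, 20.0], [2.4, 24.0], [1.6, 28.0]]
std_w = average_std([replicate_std(mix1), replicate_std(mix2)])
print(scale_spectrum(mix1[0], std_w))
```

The scaled first-replicate spectra would then replace the raw absorbances in the PCA step of PCR.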



