Data-146


The libraries I used for the test corrections were as follows:

import pandas as pd 
import math as md
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression as LR
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
import seaborn as sns

from sklearn.datasets import fetch_california_housing

Assigned Variables

data = fetch_california_housing()

X = data.data
X_names = data.feature_names
X_df = pd.DataFrame(data=X, columns=X_names)
y = data.target
y_names = data.target_names
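
For context, a quick look at what was just loaded (a sketch I added, none of this was required for the problems):

print(X_df.shape)   # (20640, 8) -- observations by features
print(X_names)      # the eight feature names
print(y_names)      # ['MedHouseVal'] -- the target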

#15

To do this problem, all I did was use

np.corrcoef(X_df['MedInc'], y)

I did this for each of the variables I had to compare and found that MedInc had the strongest correlation, at 0.68807521.

During corrections I learned that, by importing StandardScaler, there was a better way to do this. That looks like the following.

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
Xs = ss.fit_transform(X)
Xs_df = pd.DataFrame(Xs, columns=X_names)
Xsy_df = Xs_df.copy()
Xsy_df['y'] = y
Xsy_df.corr()

The resulting data frame looked like the one shown below.

The correlation table above is essentially a heatmap without the colors. A heatmap can obviously show the strength of each correlation through color, but I found this table easier to read when deciding which variables have the strongest and weakest correlations.
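
For comparison, a colored version could be drawn with seaborn, which was already imported above. This is just a sketch I added during corrections, not something the problem asked for:

sns.heatmap(Xsy_df.corr(), annot=True, cmap='coolwarm')  # color-coded correlation matrix with values annotated
plt.show()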

#16

After standardizing the data, I ran the same check:

np.corrcoef(Xs_df['MedInc'], y)

The correlations stayed the same after standardizing, since standardizing only shifts and rescales each variable and the correlation coefficient is unaffected by that kind of linear transformation.
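
A quick sanity check I could have added (just a sketch using the variables defined above):

print(np.allclose(np.corrcoef(X_df['MedInc'], y),
                  np.corrcoef(Xs_df['MedInc'], y)))  # True: the correlation is unchanged by standardizing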

#17

To find the coefficient of determination I ran the following code.

np.round(np.corrcoef(X_df['MedInc'], y)[0][1]**2, 2)

I was able to get a value of 0.47.
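
Another way to see the same number, sketched here with the LR alias imported above, is to fit a one-variable regression on MedInc and look at its score, which reports the same R-squared (lin_medinc is just a name I made up for this sketch):

lin_medinc = LR()
lin_medinc.fit(X_df[['MedInc']], y)                        # simple regression on MedInc only
print(np.round(lin_medinc.score(X_df[['MedInc']], y), 2))  # about 0.47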

#18

When recreating DoKFold, after talking about it in class and going through the code more slowly, I realized one of my biggest problems was that I had been trying to do too much at once. I had made two different KFold functions, one for the regression scores and another for MSE. Making things more complicated than they need to be increases the chance of a simple typing mistake in your code.

The new upgraded DoKFold code is below

def DoKFold(model, X, y, k, standardize=False, random_state=146):
    import numpy as np
    from sklearn.model_selection import KFold
    kf = KFold(n_splits=k, shuffle=True, random_state=random_state)
    if standardize:
        from sklearn.preprocessing import StandardScaler as SS
        ss = SS()

    train_scores = []
    test_scores = []
    train_mse = []
    test_mse = []

    for idxTrain, idxTest in kf.split(X):
        Xtrain = X[idxTrain, :]
        Xtest = X[idxTest, :]
        ytrain = y[idxTrain]
        ytest = y[idxTest]

        if standardize:
            Xtrain = ss.fit_transform(Xtrain)
            Xtest = ss.transform(Xtest)

        model.fit(Xtrain, ytrain)

        # R^2 on the training and testing folds
        train_scores.append(model.score(Xtrain, ytrain))
        test_scores.append(model.score(Xtest, ytest))

        ytrain_pred = model.predict(Xtrain)
        ytest_pred = model.predict(Xtest)

        # mean squared error on the training and testing folds
        train_mse.append(np.mean((ytrain - ytrain_pred) ** 2))
        test_mse.append(np.mean((ytest - ytest_pred) ** 2))

    return train_scores, test_scores, train_mse, test_mse

As you can see above, this function is capable of finding the MSE as well as the training and testing scores.

Using this made things drastically easier. I then ran the linear regression as follows.

k = 20
train_scores, test_scores, train_mse, test_mse = DoKFold(LR(), X, y, k, True)
print(np.mean(train_scores), np.mean(test_scores))
print(np.mean(train_mse), np.mean(test_mse))

This gave a training score of 0.6063019182717753 and a testing score of 0.6019800920504694.

#19

The ridge regression was also much easier since I did not have to readjust for MSE or switch functions. The ridge regression was run as follows.

a_range = np.linspace(20, 30, 101)
k = 20

rid_tr =[]
rid_te =[]
rid_tr_mse = []
rid_te_mse = []
for a in a_range:
    rid_reg = Ridge(alpha=a)
    train_scores,test_scores, train_mse, test_mse = DoKFold(rid_reg,X,y,k,standardize=True)
    rid_tr.append(np.mean(train_scores))
    rid_te.append(np.mean(test_scores))
    rid_tr_mse.append(np.mean(train_mse))
    rid_te_mse.append(np.mean(test_mse))

idx = np.argmax(rid_te)
print(a_range[idx], rid_tr[idx], rid_te[idx], rid_tr_mse[idx], rid_te_mse[idx])

The optimal alpha level was 25.8, with a training score of 0.6062707743487987 and a testing score of 0.602011168741859.
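
A quick plot of the mean testing score against each alpha value makes it easier to see where the curve peaks around 25.8. This is a sketch I added using matplotlib, which was already imported:

plt.plot(a_range, rid_te, '-x', label='mean testing score')
plt.axvline(a_range[idx], color='r', linestyle='--', label='optimal alpha')
plt.xlabel('alpha')
plt.ylabel('mean testing score')
plt.legend()
plt.show()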

#20

The lasso regression was run almost exactly the same way as the ridge regression, except that the variables (and the alpha range) were different.

from sklearn.linear_model import Lasso

k = 20

las_a_range = np.linspace(0.001, 0.003, 101)

las_tr = []
las_te = []
las_tr_MSE = []
las_te_MSE = []

for a in las_a_range:
    las = Lasso(alpha=a)
    train_scores, test_scores, train_MSE, test_MSE = DoKFold(las, X, y, k, True)

    las_tr.append(np.mean(train_scores))
    las_te.append(np.mean(test_scores))
    las_tr_MSE.append(np.mean(train_MSE))
    las_te_MSE.append(np.mean(test_MSE))

idx = np.argmax(las_te)

print(las_a_range[idx], las_tr[idx], las_te[idx], las_tr_MSE[idx], las_te_MSE[idx])

The results gave an alpha value of 0.00186, a training score of 0.6061563795668891, and a testing score of 0.6021329052825213.

#21

I really did not know what I was doing for #21, and even less for #22. To start, I did not use the actual variable with the smallest coefficient; I assumed it came from the five variables given to me in #15. The variable with the lowest coefficient was actually AveOccup. I also tried something convoluted where I changed the code at the top when defining the variables, so I was extremely far off.

lin = LR(); rid = Ridge(alpha=25.8); las = Lasso(alpha=0.00186)
lin.fit(Xs, y); rid.fit(Xs, y); las.fit(Xs, y)
lin.coef_[5], rid.coef_[5], las.coef_[5]   # index 5 is AveOccup

I wish I had known it was that simple; it would have saved me unnecessary stress and a wrong answer. I also failed to standardize the data for this question, so my answer was nowhere near what I needed.
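
Looking back, a short sketch like the one below (my own addition, reusing the correlation table from #15) would have pointed me to the right column instead of hard-coding index 5:

corr_with_y = Xsy_df.corr()['y'].drop('y')    # each feature's correlation with the target
weakest = corr_with_y.abs().idxmin()          # should come out as 'AveOccup'
print(weakest, list(X_names).index(weakest))  # the name and its column position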

#22

This was very simple since most of the work was done above.

lin.coef_[0], rid.coef_[0], las.coef_[0]   # index 0 is MedInc
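
To see every coefficient at once instead of one index at a time, a small comparison table could be built with pandas (another sketch I added; coef_df is just a name I made up):

coef_df = pd.DataFrame({'linear': lin.coef_, 'ridge': rid.coef_, 'lasso': las.coef_},
                       index=X_names)
print(coef_df)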

#23

These questions were also much simpler when going over them again, once MSE was built into the custom DoKFold function.

idx = np.argmin(rid_te_mse)
print(a_range[idx], rid_tr[idx], rid_te[idx], rid_tr_mse[idx], rid_te_mse[idx])

The optimal alpha value was 26.1.

#24

idx = np.argmin(las_te_MSE)
print(las_a_range[idx], las_tr[idx], las_te[idx], las_tr_MSE[idx], las_te_MSE[idx])

The optimal alpha level was 0.00184.

Takeaways

The biggest takeaway I had from the midterm was that I sometimes made things more complicated than they needed to be. Because I did unnecessary extra work, I made simple mistakes that could be as basic as a typo or assigning a variable the wrong way. I need to take a step back and look at the big picture of what I am doing before diving in and typing code.