Getting Started with Kaggle: House Prices

My recent work has revolved around machine learning, and I have been going through tutorials on and off. Here I use Kaggle's House Prices: Advanced Regression Techniques competition as practice and summarize the rough workflow of a machine learning project.

Data Exploration

Many tutorials and experienced practitioners stress the same point: in machine learning, understanding the data and the business domain matters a great deal. So let's first look at what the data looks like and at the common methods and tools for data visualization. In Python, pandas is the Swiss Army knife for reading and manipulating data, while matplotlib and seaborn are two convenient visualization libraries.

from __future__ import print_function
import numpy as np
import pandas as pd
import seaborn as sns

Loading Data

Here the train data and test data are concatenated into a single DataFrame, so that later on the dummy-encoded columns of train and test stay aligned. You could also analyze the train data on its own; combining them here simply makes the preprocessing steps more convenient.

df_train = pd.read_csv('train.csv',index_col=0)
df_test = pd.read_csv('test.csv',index_col=0)
df_features = pd.concat([df_train, df_test], keys=['train', 'test'])

Basic Info

pandas has several commonly used methods for inspecting the data as a whole:

  • DataFrame.head() : show the first few rows of the DataFrame to get a rough feel for the data
  • DataFrame.describe() : basic statistics of the numeric columns, including count, mean, standard deviation, min/max and percentiles
  • DataFrame.info() : missing-value situation and data type of each column
  • DataFrame.get_dtype_counts() : number of columns of each dtype
  • DataFrame.shape: dimensions of the data, i.e. number of samples and number of features
  • DataFrame.size: total number of values, i.e. number of samples * number of features

For the house-price data in this example, these methods show that there are 80 features in total: 43 of them are categorical (and will need dummy encoding), 34 are integers, and the remaining 3 are floats. For example:
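A minimal sketch of that inspection, assuming the DataFrames loaded above (get_dtype_counts() has been deprecated in newer pandas, so dtypes.value_counts() is used here to count the dtypes):

df_train.head()                          # first few rows
df_train.describe()                      # count, mean, std, min/max and percentiles of numeric columns
df_train.info()                          # non-null count and dtype of every column
print(df_train.dtypes.value_counts())    # number of columns per dtype
print(df_train.shape, df_train.size)     # (samples, features) and samples * features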

Checking for NANs

Handling NaNs properly is essential in machine learning; if they are not dealt with, the data cannot be used for training at all. Let's first look at the missing rate of each feature in the train and test sets. From the table below, PoolQC, MiscFeature, Alley and Fence have extremely high missing rates, and FireplaceQu is around 50% as well, so dropping these features can be considered during preprocessing.

NANs = pd.concat([df_train.dtypes, df_train.isnull().sum(), df_train.isnull().sum()/df_train.shape[0],
                  df_test.isnull().sum(), df_test.isnull().sum()/df_test.shape[0]],
                 axis=1, keys=['dtype', 'Train', 'Train_percentage', 'Test', 'Test_percentage'])
NANs_sort = NANs.sort_values(['Train_percentage','Test_percentage'], axis=0, ascending=False)
# NANs_sort[(NANs_sort.Train>0) | (NANs_sort.Test>0)] #print the NAN count greater than zero
NANs_sort
dtype Train Train_percentage Test Test_percentage
PoolQC object 1453 0.995205 1456.0 0.997944
MiscFeature object 1406 0.963014 1408.0 0.965045
Alley object 1369 0.937671 1352.0 0.926662
Fence object 1179 0.807534 1169.0 0.801234
FireplaceQu object 690 0.472603 730.0 0.500343
LotFrontage float64 259 0.177397 227.0 0.155586
GarageCond object 81 0.055479 78.0 0.053461
GarageFinish object 81 0.055479 78.0 0.053461
GarageQual object 81 0.055479 78.0 0.053461
GarageYrBlt float64 81 0.055479 78.0 0.053461
GarageType object 81 0.055479 76.0 0.052090
BsmtExposure object 38 0.026027 44.0 0.030158
BsmtFinType2 object 38 0.026027 42.0 0.028787
BsmtCond object 37 0.025342 45.0 0.030843
BsmtQual object 37 0.025342 44.0 0.030158
BsmtFinType1 object 37 0.025342 42.0 0.028787
MasVnrType object 8 0.005479 16.0 0.010966
MasVnrArea float64 8 0.005479 15.0 0.010281
Electrical object 1 0.000685 0.0 0.000000
MSZoning object 0 0.000000 4.0 0.002742
BsmtFullBath int64 0 0.000000 2.0 0.001371
BsmtHalfBath int64 0 0.000000 2.0 0.001371
Functional object 0 0.000000 2.0 0.001371
Utilities object 0 0.000000 2.0 0.001371
BsmtFinSF1 int64 0 0.000000 1.0 0.000685
BsmtFinSF2 int64 0 0.000000 1.0 0.000685
BsmtUnfSF int64 0 0.000000 1.0 0.000685
Exterior1st object 0 0.000000 1.0 0.000685
Exterior2nd object 0 0.000000 1.0 0.000685
GarageArea int64 0 0.000000 1.0 0.000685
MSSubClass int64 0 0.000000 0.0 0.000000
MiscVal int64 0 0.000000 0.0 0.000000
SalePrice int64 0 0.000000 NaN NaN

Data Visualization

0. target visualization

Given a machine learning task, first look at what the target variable looks like. For a regression task you can inspect its histogram with seaborn.distplot; for a classification task you can use seaborn.countplot to see the number of samples in each class.

sns.distplot(df_train.SalePrice)

(figure: distribution of SalePrice)

In this house-price competition the evaluation metric is RMSE computed on the log of the target variable, so let's also look at the distribution after taking the log:

sns.distplot(np.log(df_train.SalePrice))

(figure: distribution of log(SalePrice))

A machine learning task usually involves several kinds of data, which mainly fall into two categories: numerical type and categorical type. The two types call for different visualization methods to reveal their distributions, and seaborn provides convenient tools for both.

1. numeric type visualization

For numerical features, we mostly care about the relationship between each feature and the target, the relationships among features, and their distributions. Scatter plots also make outliers easy to spot. The seaborn functions commonly used for numerical features include pairplot, jointplot and lmplot. If there are many features (as in this example), plotting them all in one pairplot is not recommended because it is too slow; plotting them in smaller groups works better. Below are scatter plots of a few features against the target variable.

numeric_columns = df_train.dtypes[df_train.dtypes != 'object'].index  # get the numeric columns for visualization
# numeric_columns = numeric_columns.drop(['SalePrice']) # drop the target column

sns.pairplot(df_train[['LotArea', 'KitchenAbvGr','MasVnrArea','SalePrice']].dropna()) # if there are only a few features, numeric_columns can be used directly
# Note: seaborn's built-in dropna does not work here, see: https://github.com/mwaskom/seaborn/issues/1087

(figure: pairplot of LotArea, KitchenAbvGr, MasVnrArea and SalePrice)
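The lmplot mentioned above draws the scatter plot together with a fitted regression line; a minimal sketch (GrLivArea is chosen here purely as an illustrative feature):

# scatter plot of GrLivArea against SalePrice with a fitted linear regression
sns.lmplot(x="GrLivArea", y="SalePrice", data=df_train)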

2. correlation between numeric features and the target

A heatmap of the (Pearson) correlation coefficients shows how strongly the numerical features are related to each other and to the target. It is less intuitive than the scatter plots above, but it gives a global view with concrete numbers; from there you can use the scatter plots (pairplot) and regression plots (lmplot) above to explore specific pairs of variables in detail.

When selecting features or in later processing, prefer features that are highly correlated with the target; if two features are highly correlated with each other, consider dimensionality reduction.

corrmat = df_train[numeric_columns].corr()
lm = sns.heatmap(corrmat,vmax=1, linewidths=0.01, square=True,annot=True,cmap='viridis',linecolor="white")
lm.figure.set_size_inches(25,25)

(figure: correlation heatmap of the numeric features)
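As a sketch of the feature-selection point above, the same correlation matrix can be ranked directly; head(10) is just an arbitrary cut for illustration:

# rank numeric features by absolute correlation with the target, reusing corrmat from above
corr_with_target = corrmat['SalePrice'].drop('SalePrice')
print(corr_with_target.abs().sort_values(ascending=False).head(10))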

3. categorical type visualization

Print the level names and counts of each categorical feature.

category_columns = df_train.dtypes[df_train.dtypes == 'object'].index
# df_train[category_columns].apply(pd.value_counts) # if there are only a few features, this one-liner also works
for idx, item in enumerate(category_columns):
    print('feature: {0} value counts:\n{1} \n'.format(item, df_train[item].value_counts()))
feature: MSZoning value counts:
RL         1151
RM          218
FV           65
RH           16
C (all)      10
Name: MSZoning, dtype: int64 

feature: Street value counts:
Pave    1454
Grvl       6
Name: Street, dtype: int64 

feature: Alley value counts:
Grvl    50
Pave    41
Name: Alley, dtype: int64 

feature: LotShape value counts:
Reg    925
IR1    484
IR2     41
IR3     10
Name: LotShape, dtype: int64 

feature: LandContour value counts:
Lvl    1311
Bnk      63
HLS      50
Low      36
Name: LandContour, dtype: int64 

feature: Utilities value counts:
AllPub    1459
NoSeWa       1
Name: Utilities, dtype: int64 

....

To visualize how each categorical feature relates to the distribution of the target, the features need to be plotted one at a time; the most common tools are seaborn.swarmplot and seaborn.boxplot.

category_columns = df_train.dtypes[df_train.dtypes == 'object'].index
sns.swarmplot(x="BsmtFinType1", y="SalePrice", data=df_train);

(figure: swarm plot of SalePrice by BsmtFinType1)

sns.boxplot(x="BsmtFinType1", y="SalePrice", data=df_train)

(figure: box plot of SalePrice by BsmtFinType1)

4. visualize relation between each feature and target

After gaining some overall insight into the data above, we can pick out the more important features and analyze their specific relationship with the target.

For example, the Pearson correlation between OverallQual and the sale price is as high as 0.79, so it is worth a closer look.

sns.jointplot(x="OverallQual", y="SalePrice", data=df_train, kind="reg");

(figure: joint regression plot of OverallQual vs SalePrice)


Data Preprocessing

Strictly speaking, data preprocessing belongs to feature engineering, but feature engineering covers far too much to treat here; this section only covers the preprocessing part, and I will summarize feature engineering as a whole once I have more experience.

This preprocessing step relies heavily on the preceding Data Exploration. The usual steps are:

  • Missing Data: handling missing values
  • Outlier: anomalies / outlier points
  • Encoding: encoding categorical features

Missing Data

Here a simple filling strategy is used: fill with the mean, mode or median. A better approach would take the specific meaning of each feature into account when filling its missing values.
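For reference, a minimal sketch of such a feature-specific fill (LotFrontage filled with the median LotFrontage of the same Neighborhood, a common heuristic for this dataset; it is not what is done below):

# fill LotFrontage NaNs with the median of the corresponding Neighborhood instead of a global statistic
df_features['LotFrontage'] = df_features.groupby('Neighborhood')['LotFrontage'] \
                                         .transform(lambda s: s.fillna(s.median()))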

1. drop the fields with too many NaNs

df_features.drop(['PoolQC','MiscFeature','Alley','Fence'],axis=1,inplace=True)

2. fill NAN(categorical type)

for item in df_features.dtypes[df_features.dtypes == 'object'].index:
    mode = df_features[item].mode()[0]
    df_features[item].fillna(mode, inplace=True)

3. fill NAN(numerical type)

for item in df_features.dtypes[df_features.dtypes == 'float64'].index:
    mean_value = df_features[item].mean()              # fill numeric NaNs with the column mean
    df_features[item].fillna(mean_value, inplace=True)

encoding categorical features

Categorical features cannot be fed to a model directly; they must first be converted to numbers. The two common encodings are LabelEncoder and OneHotEncoder, both implemented in sklearn; in pandas, one-hot encoding is done with get_dummies. Dummy coding (i.e. OneHotEncoder) is used most often in machine learning, but if a feature has hundreds or thousands of distinct values, dummy coding is clearly inappropriate; in that case LabelEncoder is a better fit, or, depending on the distribution of that feature, you can dummy-encode the few important values and group the remaining ones together.

# pandas approach
df_features_encode = pd.get_dummies(df_features)
# sklearn approach: the OneHotEncoder in this sklearn version expects numeric input,
# which is why a toy integer matrix is used here instead of df_train
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# enc.fit_transform(df_train)
OneHotEncoder(categorical_features='all', dtype=<type 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
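LabelEncoder is mentioned above but not shown; a minimal sketch of how it maps a single categorical column to integers (MSZoning is used only as an example, and fitting on the combined data keeps train and test consistent):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# map each distinct MSZoning value to an integer label
mszoning_encoded = le.fit_transform(df_features['MSZoning'].astype(str))
print(dict(zip(le.classes_, le.transform(le.classes_))))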

Training Model

from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn import metrics, linear_model
from sklearn.feature_selection import SelectKBest,f_regression
from sklearn import ensemble, svm, preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

prepare the training data

y_train = df_features_encode.loc['train',['SalePrice']]
y_train = np.log(y_train['SalePrice'])
X_train = df_features_encode.loc['train',:].copy()
X_train.drop('SalePrice', axis=1, inplace=True)

Feature Selection

If there are too many features, SelectKBest can also be used to pick out the ones most related to the target.

X_new = SelectKBest(f_regression, k=20).fit_transform(X_train, y_train)
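fit_transform returns a plain array and drops the column names; a small follow-up sketch using get_support() to see which columns were actually kept:

selector = SelectKBest(f_regression, k=20).fit(X_train, y_train)
selected_columns = X_train.columns[selector.get_support()]   # names of the 20 selected features
print(selected_columns.tolist())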

Tuning Parameters

Hyperparameter tuning is the part machine learning practitioners most enjoy talking about. There are many ways to do it: you can set parameter ranges by hand and tune with a for loop, but that is rather primitive. sklearn provides GridSearchCV and RandomizedSearchCV; here, however, we use another option, hyperopt, which is more efficient than the two sklearn methods and also provides a convenient output structure that makes it easy to visualize the tuning results.

space = {
    "max_features": hp.quniform("max_features", 0.05, 1, 0.1),
    'n_estimators': hp.choice('n_estimators', range(1, 500)),
    "min_samples_leaf": hp.choice("min_samples_leaf", range(1, 10))
    # "alpha": hp.choice("alpha", [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75])
}

def rmse(y_true, y_pred):
    return np.sqrt(metrics.mean_squared_error(y_true, y_pred))

def mape(y_true, y_pred):
    mask = y_true > 0
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask]))

loss_func = metrics.make_scorer(rmse, greater_is_better=False)  # define the scoring function (negated RMSE)


def train_model(params):
    clf = ensemble.RandomForestRegressor(**params)
    # clf = linear_model.Ridge(**params)
    score = np.mean(cross_val_score(clf, X=X_train, y=y_train, scoring=loss_func))
    return score

best_score = -100**10
count = 0
def objective(params):
    global best_score, count
    count += 1
    acc = train_model(params)
    if acc > best_score:
        best_score = acc
    print(str(count) + ': new best:', best_score, params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(objective, space=space, algo=tpe.suggest, max_evals=100, trials=trials)
print('\n best param: ', best)
1: new best: -0.141484028235 {'max_features': 0.30000000000000004, 'n_estimators': 366, 'min_samples_leaf': 3}
2: new best: -0.141484028235 {'max_features': 1.0, 'n_estimators': 463, 'min_samples_leaf': 2}
3: new best: -0.141484028235 {'max_features': 0.2, 'n_estimators': 446, 'min_samples_leaf': 3}
4: new best: -0.141484028235 {'max_features': 0.1, 'n_estimators': 359, 'min_samples_leaf': 2}
5: new best: -0.141484028235 {'max_features': 0.6000000000000001, 'n_estimators': 202, 'min_samples_leaf': 3}
6: new best: -0.141484028235 {'max_features': 0.30000000000000004, 'n_estimators': 211, 'min_samples_leaf': 4}
7: new best: -0.141484028235 {'max_features': 0.30000000000000004, 'n_estimators': 96, 'min_samples_leaf': 8}
8: new best: -0.141484028235 {'max_features': 0.30000000000000004, 'n_estimators': 410, 'min_samples_leaf': 6}
9: new best: -0.141484028235 {'max_features': 0.4, 'n_estimators': 372, 'min_samples_leaf': 8}
10: new best: -0.141433479217 {'max_features': 0.5, 'n_estimators': 354, 'min_samples_leaf': 2}
11: new best: -0.141433479217 {'max_features': 0.2, 'n_estimators': 162, 'min_samples_leaf': 9}
12: new best: -0.141433479217 {'max_features': 0.5, 'n_estimators': 484, 'min_samples_leaf': 7}
13: new best: -0.141433479217 {'max_features': 0.9, 'n_estimators': 160, 'min_samples_leaf': 6}
14: new best: -0.141433479217 {'max_features': 0.30000000000000004, 'n_estimators': 224, 'min_samples_leaf': 6}
15: new best: -0.141433479217 {'max_features': 0.1, 'n_estimators': 223, 'min_samples_leaf': 7}
16: new best: -0.141433479217 {'max_features': 0.30000000000000004, 'n_estimators': 251, 'min_samples_leaf': 2}
...
 best param:  {'max_features': 0.4, 'n_estimators': 366, 'min_samples_leaf': 0}
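Note that for hp.choice parameters, fmin returns the index into the list of choices rather than the value itself, so min_samples_leaf: 0 above means the 0th element of range(1, 10), i.e. 1. hyperopt's space_eval converts the result back into actual parameter values; a small sketch:

from hyperopt import space_eval

# translate the index-based dict returned by fmin into real parameter values
best_params = space_eval(space, best)
print(best_params)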

Parameters Visualization

import matplotlib.pyplot as plt
# plot each hyperparameter against the CV score
parameters = trials.trials[0]['misc']['vals'].keys()
f, axes = plt.subplots(nrows=len(parameters)//3+1, ncols=3, figsize=(15,10))  # integer division for the subplot grid size
cmap = plt.cm.jet
for i, val in enumerate(parameters):
    xs = np.array([t['misc']['vals'][val] for t in trials.trials]).ravel()
    ys = [-t['result']['loss'] for t in trials.trials]
    # print(sorted(zip(xs, ys)))
    # xs, ys = sorted(zip(xs, ys))

    ys = np.array(ys)
    axes[i//3, i%3].scatter(xs, ys, s=20, linewidth=0.01, alpha=0.5, c=cmap(float(i)/len(parameters)))
    axes[i//3, i%3].set_title(val)
    # axes[i//3, i%3].set_ylim([0.9,1.0])

(figure: CV score versus each hyperparameter)


Prediction

X_predict = df_features_encode.loc['test',:].copy()
X_predict.drop('SalePrice',axis = 1, inplace=True)
# hp.choice returns indices: the reported min_samples_leaf of 0 is the index into range(1,10), i.e. the value 1 (0 would be invalid for sklearn)
rf = ensemble.RandomForestRegressor(max_features=0.4, n_estimators=366, min_samples_leaf=1)
rf.fit(X_train, y_train)
y_pred = np.exp(rf.predict(X_predict))   # undo the log transform applied to the target
pd.Series(y_pred).to_csv('y_pred2.csv')
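Kaggle expects the submission file to have an Id column and a SalePrice column with a header row; a sketch that writes the predictions in that shape (submission.csv is an arbitrary file name):

# pair the test-set Ids with the predicted prices and write a proper submission file
submission = pd.DataFrame({'Id': df_test.index, 'SalePrice': y_pred})
submission.to_csv('submission.csv', index=False)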