The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
The dataset can be downloaded at the GitHub Link
Data Description
Bank client data
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
Related to the last contact of the current campaign
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
Other attributes
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
Social and Economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Import Libraries, Load Data¶
#write code below, you can make multiple cells
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import f_regression, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, StratifiedShuffleSplit
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.linear_model import SGDClassifier, LassoCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import SelectFromModel
from sklearn.calibration import CalibratedClassifierCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler, MaxAbsScaler
from sklearn.feature_selection import VarianceThreshold, SelectPercentile
from sklearn.kernel_approximation import Nystroem
from sklearn.neural_network import MLPClassifier
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline as make_imb_pipeline
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import EditedNearestNeighbours, CondensedNearestNeighbour
# from xgboost import XGBClassifier
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Loading the data¶
plt.rcParams["figure.dpi"] = 100
data_df = pd.read_csv("./data/data.csv")
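# 'duration' is dropped from both files because it is only known after a call ends and would
# leak the target (see the note on attribute 11 in the data description above).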
data_df.drop("duration", axis=1, inplace=True)
holdout_df = pd.read_csv("./data/holdout.csv")
holdout_df.drop("duration", axis=1, inplace=True)
subscribed = data_df.subscribed
data_df = data_df.drop("subscribed", axis=1)
subscribed = subscribed == "yes"
data_df.head()
data_df.shape
Step 1 - Exploration and Preparation¶
Exploration:¶
Descriptive Statistics for Continuous Variables¶
#write code below, you can make multiple cells
# Summary statistics of quantitative variables and categorical variables
data_df.describe()
We make the following observations from the descriptive statistics above:
- The variable 'prev_days' has many '999' (i.e., missing) values; see the sketch below.
- 'prev_contacts' is 0 for most of the rows.
- The rest of the continuous variables look clean and well populated.
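One way to make the 999 sentinel explicit, if desired, is a binary "never contacted" indicator; a minimal sketch (assuming the column is named 'prev_days' as referenced above; this step is not applied in the rest of the notebook):
# Sketch only: flag clients never contacted in a previous campaign (999 is a sentinel, not a real gap).
# Assumes the column is named 'prev_days'; adjust if the CSV uses 'pdays' instead.
never_contacted = (data_df['prev_days'] == 999).astype(int)
never_contacted.mean()  # share of clients with no previous contact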
Descriptive Statistics for Categorical Variables¶
data_df.describe(include=["object"])
- We have 10 categorical variables
categorical_vars = data_df.describe(include=["object"]).columns
continuous_vars = data_df.describe().columns
continuous_vars
categorical_vars
Visualizing Distribution of Continuous Variables¶
_ = data_df.hist(column=continuous_vars, figsize = (16,16))
Visualizing Distribution of Categorical Variables¶
# Count plots of categorical variables
fig, axes = plt.subplots(4, 3, figsize=(16, 16))
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.7, hspace=0.3)
for i, ax in enumerate(axes.ravel()):
if i > 9:
ax.set_visible(False)
continue
sns.countplot(y = categorical_vars[i], data=data_df, ax=ax)
Scatter Plot Matrix for checking Correlations (takes long to build on travis, commented)¶
#pd.tools.plotting.scatter_matrix(data_df[continuous_vars], c=subscribed, alpha=.2, figsize=(15, 15), cmap="viridis");
Preparation¶
- Creating dummy variables for categorical variables
- Scaling numerical variables
- Removing variables based on variance
Creating Dummy Variables¶
data_df.head()
# Creating dummy variables
data_dummies_df = pd.get_dummies(data_df, columns=categorical_vars, drop_first=True)
holdout_dummies_df = pd.get_dummies(holdout_df, columns=categorical_vars, drop_first=True)
#data_dummies_df.columns
output_df = pd.DataFrame(holdout_df["ID"])
holdout_dummies_df.drop("ID", axis=1, inplace=True)
#holdout_dummies_df.columns
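# The holdout data presumably has no 'yes' values for credit_default, so get_dummies does not
# create 'credit_default_yes' there; add it as an all-zero column and reorder to match the
# training columns (next two lines).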
holdout_dummies_df['credit_default_yes'] = 0
holdout_dummies_df = holdout_dummies_df[data_dummies_df.columns]
Creating Submission Output¶
Create the output.csv to upload on Kaggle
def create_submission_output(output_df, pipe):
output = pd.concat([output_df, pd.DataFrame(pipe.predict_proba(holdout_dummies_df)[:, 1])], axis=1)
output.rename(columns = {0:"subscribed"}, inplace=True)
output.set_index("ID").to_csv("output.csv")
return output.head()
Step 2 - ModelSet1¶
This step covers the following aspects of the machine learning models used:
- feature engineering
- validation
- feature selection
- final model selection
1. Logistic Regression CV with StandardScaler¶
(a) First we build a simple logistic regression model by standardizing all the features using the Standard scaler.¶
Standard Scaling just the continuous Variables
categorical_dummies = pd.get_dummies(data_df[categorical_vars], drop_first=True).columns
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import make_union
select_categorical = FunctionTransformer(lambda X: X[categorical_dummies],validate = False)
select_continuous = FunctionTransformer(lambda X: X[continuous_vars],validate = False)
fu = make_union(select_categorical, make_pipeline(select_continuous, StandardScaler()))
pipe = make_pipeline(fu, LogisticRegressionCV())
print(np.mean(cross_val_score(pipe, data_dummies_df, subscribed, cv=5, scoring="roc_auc")))
pipe.fit(data_dummies_df, subscribed)
create_submission_output(output_df, pipe)
Standard Scaling all the features
pipe = make_pipeline(StandardScaler(), LogisticRegressionCV())
print(np.mean(cross_val_score(pipe, data_dummies_df, subscribed, cv=5, scoring="roc_auc")))
pipe.fit(data_dummies_df, subscribed)
create_submission_output(output_df, pipe)
plt.figure(figsize=(5, 15))
coef = pd.Series(pipe.named_steps['logisticregressioncv'].coef_.ravel(), index=data_dummies_df.columns)
coef.sort_values().plot(kind="barh")
In one model we standardize only the continuous variables and leave the categorical dummies as they are; in the other we standardize all the variables (including the dummies). The second model gives a better cross-val score, with an average of 78.9%, so we use a StandardScaler on all the features in the later pipelines.
Also, looking at the weights assigned to the features, roughly 50% of the variables carry significant weight, so we use 50% as the feature selection threshold in a few of the later models (a rough check of this follows).
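As a rough check of that claim, one can count how many coefficients are non-negligible in the coef series plotted above; the 0.05 cutoff below is an arbitrary assumption:
# Fraction of dummy-encoded features whose absolute logistic weight exceeds an arbitrary 0.05 cutoff
# (the cutoff is an illustrative assumption, not part of the original analysis).
print((coef.abs() > 0.05).mean())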
(b) Logistic Regression with Kernel¶
Next we use kernelized logistic regression (via a Nystroem kernel approximation) to make our prediction. Again we use a StandardScaler before fitting the model.¶
approx = Nystroem(gamma=1./data_dummies_df.shape[1], n_components=300)
pipe = make_pipeline(StandardScaler(), approx, LogisticRegressionCV())
print(np.mean(cross_val_score(pipe, data_dummies_df, subscribed, cv=5, scoring="roc_auc")))
pipe.fit(data_dummies_df, subscribed)
create_submission_output(output_df, pipe)
Logistic Regression with the kernel approximation did not result in an improvement over LogisticRegressionCV with a StandardScaler.
(c) Logistic Regression with Feature Selection¶
- We now perform feature selection using f_classif before fitting a Logistic model on the data.
- Feature engineering is performed by adding interaction features to our feature set.
- Top 50 percentile of features are selected using f_classif.
pipe = make_pipeline(StandardScaler(), SelectPercentile(score_func=f_classif, percentile=50),
PolynomialFeatures(interaction_only=True),
VarianceThreshold(),
LogisticRegressionCV())
print(np.mean(cross_val_score(pipe, data_dummies_df, subscribed, cv=5, scoring="roc_auc")))
pipe.fit(data_dummies_df, subscribed)
create_submission_output(output_df, pipe)
# roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])  # requires a separate train/test split (X_test, y_test), which is not created in this notebook
We used SelectPercentile with various values in GridSearch, and the score increased as the percentile increased, which shows that even features with low logistic weights are still important for classification.
(d) SelectFromModel : Lasso with Logistic Regression¶
- We wanted to perform feature selection using SelectFromModel with LassoCV as the estimator.
- The threshold was set to the median, based on the important features observed with Logistic Regression in part (a).
select_lassocv = SelectFromModel(LassoCV(), threshold="median")
pipe_lassocv = make_pipeline(StandardScaler(), select_lassocv, LogisticRegressionCV())
print(np.mean(cross_val_score(pipe_lassocv, data_dummies_df, subscribed, cv=10, scoring="roc_auc")))
pipe_lassocv.fit(data_dummies_df, subscribed)
create_submission_output(output_df, pipe_lassocv)
Using SelectFromModel with Lasso, we get CV scores comparable to Logistic Regression with a StandardScaler, so we went on to explore other parameters using GridSearch.
(e) Logistic Regression Parameter Tuning¶
- Now we tune our hyper-parameters.
- Since LogisticRegressionCV already tunes 'C' for logistic regression, we tune the SelectPercentile percentile and the polynomial degree.
pipe = make_pipeline(SelectPercentile(score_func=f_classif, percentile=50),
PolynomialFeatures(interaction_only=True),
VarianceThreshold(),
LogisticRegressionCV())
param_grid = {'selectpercentile__percentile': [50],
'polynomialfeatures__degree': [2]}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
grid.fit(data_dummies_df, subscribed)
grid.best_score_
grid.best_params_
create_submission_output(output_df, grid)
We obtain a best cross-val roc_auc score of 78.89% after parameter tuning with features selected using f_classif.
Observations:
For logistic regression, LogisticRegressionCV with just standard scaling works best, and its CV score is comparable to the score obtained with LassoCV-based feature selection. Therefore we use plain LogisticRegressionCV in the ensemble methods later.
2. MLP Classifier¶
We next try a multi-layer perceptron (MLP), a non-linear neural network classifier.
This takes a long time to run on Travis, so the full grid below is replaced by the best value found offline.
mlp_pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes = 5))
param_grid = {'mlpclassifier__hidden_layer_sizes': [5, 10, 50, 100, 200]}
param_grid = {} # to optimize code running on travis, the best value is selected by doing the runs
grid = GridSearchCV(mlp_pipe, param_grid, cv=5, scoring="roc_auc")
grid.fit(data_dummies_df, subscribed)
grid.best_score_
grid.best_params_
create_submission_output(output_df, grid)
Observations:
We tried the MLPClassifier, but it did not perform as well as LogisticRegressionCV from ModelSet1.
3. LinearSVC Classifier with Calibration¶
We use a calibrated Linear SVM classifier.¶
scaled_svc = make_pipeline(StandardScaler(), LinearSVC())
clf = CalibratedClassifierCV(scaled_svc)
print(np.mean(cross_val_score(clf, data_dummies_df, subscribed, cv=5, scoring="roc_auc")))
clf.fit(data_dummies_df, subscribed)
create_submission_output(output_df, clf)
Observations:
Since LinearSVC does not provide predict_proba, we wrap it in CalibratedClassifierCV to obtain predicted probabilities.
The LinearSVC also does not beat the CV scores of the LogisticRegression models in ModelSet1.
Step 3 - ModelSet2¶
This step covers the following aspects of the machine learning models used:
- feature engineering
- validation
- feature selection
- final model selection
1. Decision Tree Classifier¶
pipe = make_pipeline(DecisionTreeClassifier(max_depth = 6))
#param_grid = {'decisiontreeclassifier__max_depth':range(1, 7)}
param_grid = {}
grid = GridSearchCV(pipe, param_grid=param_grid, scoring='roc_auc', cv=StratifiedShuffleSplit(100))
grid.fit(data_dummies_df, subscribed)
grid.best_score_, grid.best_params_
Observations:
Using the Decision Tree Classifier, we tried a range of parameter values in GridSearch, but it did not reach CV scores comparable to LogisticRegression, so we moved on to stronger models, i.e., RandomForest and GradientBoosting.
2. Random Forest Classifier¶
We tune our random forest classifier by varying the following hyper-parameters:
- N-estimators
- Max Depth
- Max Features
- Criterion
Tuning Level 1
rf = RandomForestClassifier(warm_start=True)
params = {'randomforestclassifier__n_estimators' : range(10,100,10),
'randomforestclassifier__max_depth' : range(3,15,2)}
pipe = make_pipeline(rf)
grid = GridSearchCV(pipe, param_grid = params, scoring='roc_auc', n_jobs=-1,iid=False, cv=5)
grid.fit(data_dummies_df, subscribed)
print(grid.best_score_)
Tuning Level 2
rf = RandomForestClassifier(warm_start=True, n_estimators=80, max_depth=9)
params = {'randomforestclassifier__max_features' : range(7, 53, 9),
'randomforestclassifier__criterion' : ['gini', 'entropy']}
pipe = make_pipeline(rf)
grid = GridSearchCV(pipe,param_grid = params, scoring='roc_auc', n_jobs=-1, iid=False, cv=5)
grid.fit(data_dummies_df, subscribed)
print(grid.best_score_)
Final Model based on learned Tuned Parameters
rf = RandomForestClassifier(warm_start=True, n_estimators=80,
max_depth=9,
max_features = 16,
criterion = 'entropy')
params = {}
pipe = make_pipeline(rf)
grid = GridSearchCV(pipe, param_grid = params, scoring='roc_auc', n_jobs=-1,iid=False, cv=5)
grid.fit(data_dummies_df, subscribed)
print(grid.best_score_)
Observations:
All the parameters were tuned for the Random Forest, but here we are showing just two levels of parameter tuning for brevity.
Finally we obtain a best cross-val score of 79.79%. As these cross-val scores are better than those from ModelSet1, random forests are good candidates for the ensembles built later.
3. Gradient Boosting Classifier¶
We tune our gradient boosting classifier by varying the following variables:
- N-estimators
- learning rate
- max depth
We obtain n_estimators = 80, learning_rate = 0.1, min_samples_split = 1000 and max_depth = 5; a sketch of the kind of grid searched is shown below, and the cell after it keeps only these best values.
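For reference, a grid of the kind presumably searched offline might look like the sketch below; the candidate values are illustrative assumptions, and the cell that follows keeps only the best values to shorten the Travis run.
# Illustrative (hypothetical) grid for the GradientBoostingClassifier; the value lists are assumptions.
gbc_pipe = make_pipeline(GradientBoostingClassifier(random_state=10))
gbc_param_grid = {'gradientboostingclassifier__n_estimators': [40, 80, 120, 200],
                  'gradientboostingclassifier__learning_rate': [0.05, 0.1, 0.2],
                  'gradientboostingclassifier__max_depth': [3, 5, 7],
                  'gradientboostingclassifier__min_samples_split': [500, 1000, 2000]}
gbc_grid = GridSearchCV(gbc_pipe, param_grid=gbc_param_grid, scoring='roc_auc', n_jobs=-1, cv=5)
# gbc_grid.fit(data_dummies_df, subscribed)  # commented out: the full search is slow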
param_test = {}
pipe = make_pipeline(GradientBoostingClassifier(learning_rate=0.1, max_depth=5, n_estimators=80,
random_state=10, min_samples_split = 1000))
grid = GridSearchCV(pipe, param_grid = param_test, scoring='roc_auc', n_jobs=-1, iid=True, cv=5)
grid.fit(data_dummies_df, subscribed)
# Sanity check (used on Travis): the tuned gradient boosting model should clear a CV roc_auc of 0.79.
assert grid.best_score_ > 0.79
Observations:
All the parameters were tuned for the Gradient Boosting Classifier, but since tuning it takes a long time, we only show the best tuned model here.
Finally we obtain a best cross-val score of 79.6%. As these cross-val scores are better than those from ModelSet1, gradient boosting is a good candidate for the ensembles built later.
4. XGBoost¶
We also used the XGBoost classifier. Since we cannot pip-install XGBoost on Travis, it fails to run there, so the XGBoost code is commented out; we have uploaded its output to the Kaggle competition.
n_estimators = [50]
max_depth = [5]
learning_rate = [0.1]
subsample = [1.0]
colsample_bytree = [0.7]
colsample_bylevel = [0.6]
gamma = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.1]
param_grid = dict(xgbclassifier__n_estimators=n_estimators,
xgbclassifier__max_depth=max_depth,
xgbclassifier__learning_rate=learning_rate,
xgbclassifier__subsample=subsample,
xgbclassifier__colsample_bytree=colsample_bytree,
xgbclassifier__colsample_bylevel=colsample_bylevel,
xgbclassifier__gamma=gamma
)
#kfold = StratifiedKFold(n_splits=5, shuffle=True)
pipe = make_pipeline(StandardScaler(), XGBClassifier())
grid = GridSearchCV(pipe, param_grid, scoring="roc_auc", n_jobs=-1, cv=10)
grid.fit(data_dummies_df, subscribed)
grid.best_params_
grid.best_score_
Observations:
All the parameters were tuned for the XGBoost classifier. A nice aspect of XGBoost is that it trains much faster than GradientBoosting, and its regularization helps in learning a better model.
Finally we obtain a best cross-val score of 79.8%. As these cross-val scores are better than those from ModelSet1, XGBoost is a good candidate for the ensembles built later.
Step 4 - Ensemble Methods¶
The aim here was to try poor man's stacking as well as other ensemble techniques.
1. Simple Ensemble: Voting Classifier, Poor-man's Stacking¶
- Logistic Regression CV
- Random Forest with best hyper-parameters, N-estimator = 80, max_depth = 9, max features = 16, criterion = entropy
- Gradient Boosting Classifier with best hyper-parameters, N-estimators = 200, max_depth = 5, min_samples_split = 1000
clf1 = make_pipeline(StandardScaler(),LogisticRegressionCV(random_state=10))
clf2 = RandomForestClassifier(warm_start=True,n_estimators=80,max_depth=9,max_features = 16, criterion = 'entropy', random_state=10)
clf3 = GradientBoostingClassifier(learning_rate=0.1,max_depth=5, n_estimators=200, random_state=10,min_samples_split = 1000)
voting = VotingClassifier([('logreg', clf1),('RandomForest', clf2),('GradientBoost',clf3)], voting='soft')
print(np.mean(cross_val_score(voting, data_dummies_df, subscribed,scoring = 'roc_auc', cv=5)))
## Fitting the VotingClassifier directly is poor man's stacking; using cross_val_predict for the second stage would be actual stacking (a sketch of that variant appears after the stacking cell below).
voting.fit(data_dummies_df,subscribed)
create_submission_output(output_df, voting)
After the voting classifier, we apply stacking: the per-class probabilities produced by the voting classifier's base estimators are fed to a LogisticRegression model that produces the final probability.
from sklearn.preprocessing import FunctionTransformer
reshaper = FunctionTransformer(lambda X_: np.rollaxis(X_, 1).reshape(-1, 6)[:, 1::2], validate=False)
stacking = make_pipeline(voting, reshaper,
LogisticRegressionCV())
print(np.mean(cross_val_score(stacking,data_dummies_df, subscribed, scoring = 'roc_auc',cv=5)))
stacking.fit(data_dummies_df, subscribed)
create_submission_output(output_df, stacking)
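As the comment above notes, "actual" stacking trains the second-stage model on out-of-fold probabilities; a minimal sketch of that variant, reusing clf1, clf2 and clf3 from above (not run here for time):
from sklearn.model_selection import cross_val_predict
# Out-of-fold probability features from each base classifier, then a LogisticRegressionCV on top.
oof_probs = np.column_stack([cross_val_predict(clf, data_dummies_df, subscribed, cv=5,
                                               method='predict_proba')[:, 1]
                             for clf in [clf1, clf2, clf3]])
print(np.mean(cross_val_score(LogisticRegressionCV(), oof_probs, subscribed, cv=5, scoring='roc_auc')))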
The second-stage stacking was also tried with RandomForestClassifier, GradientBoosting and LinearRegression as the meta-estimator, but since these take a long time to build, we only show results for LogisticRegression here. They also did not change the accuracy.
Observations:
We are able to achieve the best CV outputs using a Voting Classifier with all the best tuned models from ModelSet1 and ModelSet2.
2. Weighted Ensemble:¶
- We use a weighted ensemble by giving the most weight to the Random Forest classifier, followed by the Gradient Boosting classifier and the Logistic Regression classifier.
clf1 = make_pipeline(StandardScaler(),LogisticRegressionCV(random_state=10))
clf2 = make_pipeline(StandardScaler(), RandomForestClassifier(warm_start=True,n_estimators=80, max_depth=9, max_features = 16, criterion = 'entropy', random_state=10))
clf3 = make_pipeline(StandardScaler(), GradientBoostingClassifier(learning_rate=0.1,max_depth=5, n_estimators=80, random_state=10, min_samples_split = 1000))
voting = VotingClassifier([('logreg', clf1), ('RandomForest', clf2), ('GradientBoost',clf3)],weights = [2, 5, 3], voting='soft')
output = np.mean(cross_val_score(voting, data_dummies_df, subscribed, scoring = 'roc_auc', cv=5))
print(output)
voting.fit(data_dummies_df,subscribed)
create_submission_output(output_df, voting)
Observations:
We achieve among the best CV scores using a Voting Classifier with the best tuned models from ModelSet1 and ModelSet2; it performed well once we weighted the individual classifiers' outputs.
3. Ensemble with XGBoost¶
clf1 = make_pipeline(StandardScaler(), XGBClassifier(n_estimators=50, max_depth=5, learning_rate=0.1, subsample=1.0,
colsample_bylevel=0.6, colsample_bytree=0.7, gamma = 1.11))
clf2 = make_pipeline(StandardScaler(), RandomForestClassifier(warm_start=True,n_estimators=80,max_depth=9,max_features = 16, criterion = 'entropy'))
clf3 = make_pipeline(StandardScaler(), GradientBoostingClassifier(learning_rate=0.1,max_depth=5, n_estimators=200, random_state=10,min_samples_split = 1000))
voting = VotingClassifier([('xgb', clf1), ('RandomForest', clf2), ('GradientBoost',clf3)], voting='soft', weights = [1.5, 4, 3])
print(np.mean(cross_val_score(voting,data_dummies_df,subscribed,scoring = 'roc_auc',cv=10)))
voting.fit(data_dummies_df,subscribed)
create_submission_output(output_df, voting)
Observation:
The cross-val roc_auc score does not improve when the Logistic classifier is replaced with XGBoost, although we still get a similar roc_auc score close to 0.786.
4. Ensemble with XGBoost, inspired by Owen Zhang¶
clf1 = XGBClassifier(n_estimators=50, max_depth=5, learning_rate=0.1, subsample=1.0, colsample_bytree=0.7, colsample_bylevel=0.6, gamma = 1.11, seed=0)
clf2 = XGBClassifier(n_estimators=50, max_depth=5, learning_rate=0.1, subsample=1.0, colsample_bytree=0.7, colsample_bylevel=0.6, gamma = 1.11, seed=1)
clf3 = XGBClassifier(n_estimators=50, max_depth=5, learning_rate=0.1, subsample=1.0, colsample_bytree=0.7, colsample_bylevel=0.6, gamma = 1.11, seed=2)
voting = VotingClassifier([('xgb1', clf1),('xgb2', clf2),('xgb3',clf3)], voting='soft')
print(np.mean(cross_val_score(voting, data_dummies_df, subscribed, scoring = 'roc_auc', cv=5)))
voting.fit(data_dummies_df,subscribed)
create_submission_output(output_df, voting)
Observation
We tried an ensemble of 3 XGBoost classifiers with different random seeds, inspired by Owen Zhang's technique for climbing the Kaggle leaderboard. We did not observe any significant improvement in cross-val score over the simple voting classifier.
Step 5 - Resampling¶
In this section we try resampling, combined with the best parameters learned for LogisticRegression and the RandomForestClassifier, as these run faster.
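For context, the class balance that motivates resampling can be checked directly from the subscribed series loaded earlier:
# Class distribution of the target; the 'yes' class is the minority, which is why under-/over-sampling is tried.
print(subscribed.value_counts(normalize=True))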
1. Undersampling with Logistic Regression : RandomUnderSampler¶
undersample_pipe = make_imb_pipeline(RandomUnderSampler(), LogisticRegressionCV())
scores = cross_val_score(undersample_pipe, data_dummies_df, subscribed, cv=10, scoring='roc_auc')
print(np.mean(scores))
undersample_pipe.fit(data_dummies_df,subscribed)
create_submission_output(output_df, undersample_pipe)
2. Oversampling with Logistic Regression : RandomOverSampler¶
oversample_pipe = make_imb_pipeline(RandomOverSampler(), LogisticRegressionCV())
scores = cross_val_score(oversample_pipe, data_dummies_df, subscribed, cv=10, scoring='roc_auc')
print(np.mean(scores))
oversample_pipe.fit(data_dummies_df,subscribed)
create_submission_output(output_df, oversample_pipe)
Observations:
Undersampling and oversampling give similar roc_auc scores, close to 0.787, with the Logistic Regression classifier.
3. Undersampling with Random Forest Classifier : RandomUnderSampler¶
undersample_pipe_rf = make_imb_pipeline(RandomUnderSampler(),
RandomForestClassifier(warm_start=True,
n_estimators=80,
max_depth=9,
max_features = 16,
criterion = 'entropy'))
scores = cross_val_score(undersample_pipe_rf, data_dummies_df, subscribed, cv=10, scoring='roc_auc')
print(np.mean(scores))
undersample_pipe_rf.fit(data_dummies_df, subscribed)
create_submission_output(output_df, undersample_pipe_rf)
4. Oversampling with Random Forest Classifier : RandomOverSampler¶
oversample_pipe_rf = make_imb_pipeline(RandomOverSampler(), RandomForestClassifier(warm_start=True,
n_estimators=80,
max_depth=9,
max_features = 16,
criterion = 'entropy'))
scores = cross_val_score(oversample_pipe_rf, data_dummies_df, subscribed, cv=10, scoring='roc_auc')
print(np.mean(scores))
oversample_pipe_rf.fit(data_dummies_df, subscribed)
create_submission_output(output_df,oversample_pipe_rf)
The RandomOverSampler with our tuned Random Forest classifier gives a mean roc_auc of 0.797. Again, we do not observe any significant improvement from resampling the data.
5. Balancing Class Weights for Logistic Regression¶
scores = cross_val_score(LogisticRegression(class_weight='balanced'),
data_dummies_df, subscribed, cv=10, scoring='roc_auc')
print(scores.mean())
6. Balancing Class Weights for Random Forest Classifier¶
from sklearn.ensemble import RandomForestClassifier
scores = cross_val_score(RandomForestClassifier(n_estimators=80,
max_depth=9,
max_features = 16,
criterion = 'entropy',
class_weight='balanced'),
data_dummies_df, subscribed, cv=10, scoring='roc_auc')
print(np.mean(scores))
Resampled Ensembles¶
1. Trees: RF¶
In this model we create multiple undersampled random forest classifiers, each with a different random state for the undersampling, and then combine them in a voting classifier.
def make_resample_tree(random_state=0):
rf = RandomForestClassifier(warm_start=True,
n_estimators=80,
max_depth=9,
max_features = 16,
criterion = 'entropy')
tree = make_imb_pipeline(RandomUnderSampler(random_state=random_state, replacement=True),
rf)
return "tree_i".format(random_state), tree
classifiers = [make_resample_tree(i) for i in range(100)]
resampled_rf = VotingClassifier(classifiers, voting='soft')
scores = cross_val_score(resampled_rf, data_dummies_df, subscribed, cv=10, scoring='roc_auc')
print(np.mean(scores))
2. Ensembles using RandomUnderSampling¶
from sklearn.base import clone
def make_resampled_ensemble(estimator, n_estimators=100):
estimators = []
for i in range(n_estimators):
est = clone(estimator)
if hasattr(est, "random_state"):
est.random_state = i
pipe = make_imb_pipeline(RandomUnderSampler(random_state=i, replacement=True),
est)
estimators.append(("est_i".format(i), pipe))
return VotingClassifier(estimators, voting="soft")
For RandomForestClassifier (takes long time to run, so commented for travis)
rf = RandomForestClassifier(warm_start=True,
n_estimators=80,
max_depth=9,
max_features = 16,
criterion = 'entropy')
resampled_tree_test = make_resampled_ensemble(rf)
scores = cross_val_score(resampled_tree_test, data_dummies_df, subscribed, cv=10, scoring='roc_auc')
print(np.mean(scores))
For LogisticRegression (takes long time to run, so commented for travis)
resampled_lr = make_resampled_ensemble(LogisticRegressionCV())
scores = cross_val_score(resampled_lr, data_dummies_df, subscribed, cv=10, scoring='roc_auc')
print(np.mean(scores))
Undersampling using Edited Nearest Neighbours (takes a long time to run, so commented for Travis)¶
enn_pipe = make_imb_pipeline(EditedNearestNeighbours(n_neighbors=5),
LogisticRegressionCV())
scores = cross_val_score(enn_pipe, data_dummies_df, subscribed, cv=10, scoring='roc_auc')
print(np.mean(scores))
rf = RandomForestClassifier(warm_start=True,
n_estimators=80,
max_depth=9,
max_features = 16,
criterion = 'entropy')
enn_pipe_rf = make_imb_pipeline(EditedNearestNeighbours(n_neighbors=5),
rf)
scores = cross_val_score(enn_pipe_rf, data_dummies_df, subscribed, cv=10, scoring='roc_auc')
print(np.mean(scores))
Undersampling with both EditedNearestNeighbours and CondensedNearestNeighbour was tried in conjunction with both LogisticRegression and RandomForestClassifier. CondensedNearestNeighbour takes a long time to run and does not improve the CV score much, so it is not used here; a sketch of what was tried follows.
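For reference, the kind of CondensedNearestNeighbour pipeline that was tried looks like the sketch below (left unrun because of the runtime):
# Sketch only (slow to run): condensed nearest neighbour undersampling in front of logistic regression.
cnn_pipe = make_imb_pipeline(CondensedNearestNeighbour(), LogisticRegressionCV())
# scores = cross_val_score(cnn_pipe, data_dummies_df, subscribed, cv=10, scoring='roc_auc')
# print(np.mean(scores))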
Resampling using SMOTE¶
from imblearn.over_sampling import SMOTE
LogisticRegression
smote_pipe = make_imb_pipeline(SMOTE(), LogisticRegressionCV())
scores = cross_val_score(smote_pipe, data_dummies_df, subscribed, cv=10, scoring='roc_auc')
print(np.mean(scores))
RandomForestClassifier
rf = RandomForestClassifier(warm_start=True,n_estimators=80,max_depth=9,max_features = 16, criterion = 'entropy')
smote_pipe_rf = make_imb_pipeline(SMOTE(), rf)
scores = cross_val_score(smote_pipe_rf, data_dummies_df, subscribed, cv=10, scoring='roc_auc')
print(np.mean(scores))
param_grid = {'smote__k_neighbors': [3, 5, 7, 9, 11, 15, 31]}
search = GridSearchCV(smote_pipe_rf, param_grid, cv=10, scoring="roc_auc")
search.fit(data_dummies_df, subscribed)
search.best_params_
From best_params_, we obtain the best value of k_neighbors as 7, which is used below.
SMOTEENN and SMOTETomek were also tried; part of that code is commented out for Travis.
rf = RandomForestClassifier(warm_start=True,n_estimators=80,max_depth=9,max_features = 16, criterion = 'entropy')
smote_pipe_rf = make_imb_pipeline(SMOTE(k_neighbors=7), rf)
scores = cross_val_score(smote_pipe_rf, data_dummies_df, subscribed, cv=10, scoring='roc_auc')
print(np.mean(scores))
from imblearn.combine import SMOTEENN, SMOTETomek
rf = RandomForestClassifier(warm_start=True,n_estimators=80,max_depth=9,max_features = 16, criterion = 'entropy')
smoteenn_pipe_rf = make_imb_pipeline(SMOTEENN(smote=SMOTE(k_neighbors=7)), rf)
scores = cross_val_score(smoteenn_pipe_rf, data_dummies_df, subscribed, cv=10, scoring='roc_auc')
print(np.mean(scores))
The under- and over-sampling techniques are applied in conjunction with the Random Forest classifier (the best tree-based classifier)¶
The resampling techniques give CV values near 0.791-0.794, but they still do not perform as well as the voting classifiers or the classifiers built in ModelSet1 and ModelSet2. The main reason is likely the class skew: the chosen undersample may not represent all the variation that exists in the majority class, whereas the minority class always keeps all of its variation. Similar observations hold for the oversampling case.
We use the weighted ensemble method for the final results; its CV value is stored in output:
# Final sanity check (used on Travis): the weighted-ensemble CV roc_auc should clear 0.798.
assert output > 0.798