A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans and scheduling conflicts, and cancelling is often made easier by the option to do so free of charge or at a low cost. This flexibility is convenient for guests, but it is a less desirable and potentially revenue-diminishing situation for hotels, and the losses are particularly high for last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
This pattern of cancellations affects a hotel on several fronts, from lost revenue and higher distribution-channel costs to the difficulty of reselling rooms at short notice.
This increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which booking is likely to be canceled. INN Hotels Group has a chain of hotels in Portugal - they are facing problems with this high number of booking cancellations and have reached out to your firm for data-driven solutions. You, as a Data Scientist, have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds.
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below:
Data Dictionary
# Importing the basic libraries we will require for the project
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Importing the Machine Learning models we require from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
# Importing the other functions we may require from Scikit-Learn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OneHotEncoder
# To get diferent metric scores
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, precision_recall_curve, roc_curve, make_scorer
# Code to ignore warnings from function usage
import warnings
warnings.filterwarnings('ignore')
hotel = pd.read_csv("INNHotelsGroup.csv")
# Copying data to another variable to avoid any changes to original data
data = hotel.copy()
Let's view the first few rows and last few rows of the dataset in order to understand its structure a little better.
We will use the head() and tail() methods from Pandas to do this.
data.head()
| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00 | 0 | Not_Canceled |
1 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68 | 1 | Not_Canceled |
2 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00 | 0 | Canceled |
3 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00 | 0 | Canceled |
4 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled |
data.tail()
| | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
36270 | INN36271 | 3 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 4 | 85 | 2018 | 8 | 3 | Online | 0 | 0 | 0 | 167.80 | 1 | Not_Canceled |
36271 | INN36272 | 2 | 0 | 1 | 3 | Meal Plan 1 | 0 | Room_Type 1 | 228 | 2018 | 10 | 17 | Online | 0 | 0 | 0 | 90.95 | 2 | Canceled |
36272 | INN36273 | 2 | 0 | 2 | 6 | Meal Plan 1 | 0 | Room_Type 1 | 148 | 2018 | 7 | 1 | Online | 0 | 0 | 0 | 98.39 | 2 | Not_Canceled |
36273 | INN36274 | 2 | 0 | 0 | 3 | Not Selected | 0 | Room_Type 1 | 63 | 2018 | 4 | 21 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled |
36274 | INN36275 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 207 | 2018 | 12 | 30 | Offline | 0 | 0 | 0 | 161.67 | 0 | Not_Canceled |
data.shape
(36275, 19)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Booking_ID                            36275 non-null  object
 1   no_of_adults                          36275 non-null  int64
 2   no_of_children                        36275 non-null  int64
 3   no_of_weekend_nights                  36275 non-null  int64
 4   no_of_week_nights                     36275 non-null  int64
 5   type_of_meal_plan                     36275 non-null  object
 6   required_car_parking_space            36275 non-null  int64
 7   room_type_reserved                    36275 non-null  object
 8   lead_time                             36275 non-null  int64
 9   arrival_year                          36275 non-null  int64
 10  arrival_month                         36275 non-null  int64
 11  arrival_date                          36275 non-null  int64
 12  market_segment_type                   36275 non-null  object
 13  repeated_guest                        36275 non-null  int64
 14  no_of_previous_cancellations          36275 non-null  int64
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64
 18  booking_status                        36275 non-null  object
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
Booking_ID, type_of_meal_plan, room_type_reserved, market_segment_type, and booking_status are of object type, while the remaining columns are numeric.
There are no null values in the dataset.
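As an explicit double-check (a minimal sketch on the same data DataFrame), we can count the missing values per column:
# Counting missing values in each column (should be all zeros)
data.isnull().sum()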
# checking for duplicate values
data.duplicated().sum()
0
Let's drop the Booking_ID column first before we proceed forward, as a column with unique values will have almost no predictive power for the Machine Learning problem at hand.
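Before dropping it, we can quickly confirm that Booking_ID is indeed unique for every row (a small sanity check, nothing more):
# Checking that the number of unique booking IDs equals the number of rows
data["Booking_ID"].nunique() == data.shape[0]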
data = data.drop(["Booking_ID"], axis=1)
data.head()
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00 | 0 | Not_Canceled |
1 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68 | 1 | Not_Canceled |
2 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00 | 0 | Canceled |
3 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00 | 0 | Canceled |
4 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled |
Let's check the statistical summary of the data.
# Remove _________ and complete the code
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
no_of_adults | 36275.0 | 1.844962 | 0.518715 | 0.0 | 2.0 | 2.00 | 2.0 | 4.0 |
no_of_children | 36275.0 | 0.105279 | 0.402648 | 0.0 | 0.0 | 0.00 | 0.0 | 10.0 |
no_of_weekend_nights | 36275.0 | 0.810724 | 0.870644 | 0.0 | 0.0 | 1.00 | 2.0 | 7.0 |
no_of_week_nights | 36275.0 | 2.204300 | 1.410905 | 0.0 | 1.0 | 2.00 | 3.0 | 17.0 |
required_car_parking_space | 36275.0 | 0.030986 | 0.173281 | 0.0 | 0.0 | 0.00 | 0.0 | 1.0 |
lead_time | 36275.0 | 85.232557 | 85.930817 | 0.0 | 17.0 | 57.00 | 126.0 | 443.0 |
arrival_year | 36275.0 | 2017.820427 | 0.383836 | 2017.0 | 2018.0 | 2018.00 | 2018.0 | 2018.0 |
arrival_month | 36275.0 | 7.423653 | 3.069894 | 1.0 | 5.0 | 8.00 | 10.0 | 12.0 |
arrival_date | 36275.0 | 15.596995 | 8.740447 | 1.0 | 8.0 | 16.00 | 23.0 | 31.0 |
repeated_guest | 36275.0 | 0.025637 | 0.158053 | 0.0 | 0.0 | 0.00 | 0.0 | 1.0 |
no_of_previous_cancellations | 36275.0 | 0.023349 | 0.368331 | 0.0 | 0.0 | 0.00 | 0.0 | 13.0 |
no_of_previous_bookings_not_canceled | 36275.0 | 0.153411 | 1.754171 | 0.0 | 0.0 | 0.00 | 0.0 | 58.0 |
avg_price_per_room | 36275.0 | 103.423539 | 35.089424 | 0.0 | 80.3 | 99.45 | 120.0 | 540.0 |
no_of_special_requests | 36275.0 | 0.619655 | 0.786236 | 0.0 | 0.0 | 0.00 | 1.0 | 5.0 |
Write your answers here:
1. no_of_week_nights appears to have outliers: the 75th percentile is 3 while the maximum is 17.
2. lead_time may also have outliers on the upper side and needs further exploration.
3. no_of_previous_cancellations and no_of_previous_bookings_not_canceled are heavily skewed: most values are 0, with maximums of 13 and 58 respectively.
4. The mean and median of arrival_date are close (about 15.6 and 16).
5. The mean of avg_price_per_room (about 103.4) is greater than its median (99.45), suggesting a positively skewed distribution that warrants further exploration (see the quick check below).
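To put rough numbers on these observations (a quick, illustrative check on the same columns), we can look at the skew of the suspect variables and count how many lead_time values fall beyond the usual 1.5*IQR upper whisker:
# Skewness of the columns flagged above
print(data[["no_of_week_nights", "lead_time", "avg_price_per_room"]].skew())
# Counting lead_time values beyond the 1.5*IQR upper whisker
q1, q3 = data["lead_time"].quantile([0.25, 0.75])
print((data["lead_time"] > q3 + 1.5 * (q3 - q1)).sum(), "bookings have unusually long lead times")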
Let's explore these variables in some more depth by observing their distributions.
We will first define a hist_box() function that provides both a boxplot and a histogram in the same visual, with which we can perform univariate analysis on the columns of this dataset.
# Defining the hist_box() function
def hist_box(data, col):
    f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={'height_ratios': (0.15, 0.85)}, figsize=(12, 6))
    # Adding a boxplot on top and a histogram (with KDE) below, sharing the x-axis
    sns.boxplot(x=data[col], ax=ax_box, showmeans=True)
    sns.histplot(data[col], kde=True, ax=ax_hist)
    plt.show()
Lead Time
Let's plot the boxplot and histogram of lead_time using the hist_box() function defined above and note our insights.
hist_box(data,'lead_time')
Write your answers here:
1. The distribution of lead_time is positively skewed.
2. The boxplot shows that there are outliers on the upper side.
3. Lead time may be an important factor in cancellations; we can explore this with a quick bivariate look, as sketched below.
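As a first bivariate look (a quick sketch on the same data, before any encoding of the target), we can compare lead_time across canceled and non-canceled bookings:
# Comparing lead time for canceled vs. non-canceled bookings
plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x="booking_status", y="lead_time")
plt.show()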
Average Price per Room
Let's plot the boxplot and histogram of avg_price_per_room using the hist_box() function defined above and note our insights.
hist_box(data,'avg_price_per_room')
Write your answers here:
1. The boxplot shows that avg_price_per_room has outliers on both sides.
2. Most room prices cluster around 100.
3. There are some extreme outliers on the right; the quick count below checks how many there are.
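To get a feel for how many extreme prices there are (a quick count; the 500 cutoff is only an illustrative threshold, not something prescribed by the data):
# Counting rooms with an average price of 500 or more (illustrative cutoff)
(data["avg_price_per_room"] >= 500).sum()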
Interestingly some rooms have a price equal to 0. Let's check them.
data[data["avg_price_per_room"] == 0]
| | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
63 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 2 | 2017 | 9 | 10 | Complementary | 0 | 0 | 0 | 0.0 | 1 | Not_Canceled |
145 | 1 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 13 | 2018 | 6 | 1 | Complementary | 1 | 3 | 5 | 0.0 | 1 | Not_Canceled |
209 | 1 | 0 | 0 | 0 | Meal Plan 1 | 0 | Room_Type 1 | 4 | 2018 | 2 | 27 | Complementary | 0 | 0 | 0 | 0.0 | 1 | Not_Canceled |
266 | 1 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2017 | 8 | 12 | Complementary | 1 | 0 | 1 | 0.0 | 1 | Not_Canceled |
267 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 4 | 2017 | 8 | 23 | Complementary | 0 | 0 | 0 | 0.0 | 1 | Not_Canceled |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
35983 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 7 | 0 | 2018 | 6 | 7 | Complementary | 1 | 4 | 17 | 0.0 | 1 | Not_Canceled |
36080 | 1 | 0 | 1 | 1 | Meal Plan 1 | 0 | Room_Type 7 | 0 | 2018 | 3 | 21 | Complementary | 1 | 3 | 15 | 0.0 | 1 | Not_Canceled |
36114 | 1 | 0 | 0 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 3 | 2 | Online | 0 | 0 | 0 | 0.0 | 0 | Not_Canceled |
36217 | 2 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 2 | 3 | 2017 | 8 | 9 | Online | 0 | 0 | 0 | 0.0 | 2 | Not_Canceled |
36250 | 1 | 0 | 0 | 2 | Meal Plan 2 | 0 | Room_Type 1 | 6 | 2017 | 12 | 10 | Online | 0 | 0 | 0 | 0.0 | 0 | Not_Canceled |
545 rows × 18 columns
data.loc[data["avg_price_per_room"] == 0, "market_segment_type"].value_counts()
Complementary    354
Online           191
Name: market_segment_type, dtype: int64
# Calculating the 25th quantile
Q1 = data["avg_price_per_room"].quantile(0.25)
# Calculating the 75th quantile
Q3 = data["avg_price_per_room"].quantile(0.75)
# Calculating IQR
IQR = Q3 - Q1
# Calculating value of upper whisker
Upper_Whisker = Q3 + 1.5 * IQR
Upper_Whisker
179.55
# Capping the extreme prices (500 and above) at the upper whisker value
data.loc[data["avg_price_per_room"] >= 500, "avg_price_per_room"] = Upper_Whisker
Number of Children
sns.countplot(x=data['no_of_children'])
plt.show()
data['no_of_children'].value_counts(normalize=True)
0     0.925624
1     0.044604
2     0.029166
3     0.000524
9     0.000055
10    0.000028
Name: no_of_children, dtype: float64
# replacing 9, and 10 children with 3
data["no_of_children"] = data["no_of_children"].replace([9, 10], 3)
Arrival Month
sns.countplot(data["arrival_month"])
plt.show()
data['arrival_month'].value_counts(normalize=True)
10    0.146575
9     0.127112
8     0.105114
6     0.088298
12    0.083280
11    0.082150
7     0.080496
4     0.075424
5     0.071620
3     0.065003
2     0.046975
1     0.027953
Name: arrival_month, dtype: float64
Booking Status
sns.countplot(data["booking_status"])
plt.show()
data['booking_status'].value_counts(normalize=True)
Not_Canceled    0.672364
Canceled        0.327636
Name: booking_status, dtype: float64
Let's encode Canceled bookings to 1 and Not_Canceled as 0 for further analysis
data["booking_status"] = data["booking_status"].replace(
{ "Canceled":1,'Not_Canceled':0}
)
# Remove _________ and complete the code
cols_list = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(12, 7))
sns.heatmap(data[cols_list].corr(), annot=True)
plt.show()
Write your answers here:
1. There is a positive correlation of 0.54 between repeated_guest and no_of_previous_bookings_not_canceled, which is expected since only repeated guests can accumulate a booking history.
2. The 0.39 correlation between no_of_previous_cancellations and no_of_previous_bookings_not_canceled is worth a closer look (see the check below).
3. There is a moderate negative correlation of -0.34 between arrival_year and arrival_month, not strong enough to draw conclusions from.
4. The remaining variables are not highly correlated with one another.
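One way to dig into that correlation between the two history columns (a small sketch restricted to repeated guests, since only they can carry a booking history) is:
# Booking-history columns for repeated guests only
history_cols = ["no_of_previous_cancellations", "no_of_previous_bookings_not_canceled"]
repeat_guests = data[data["repeated_guest"] == 1]
print(repeat_guests[history_cols].describe())
print(repeat_guests[history_cols].corr())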
Hotel rates are dynamic and change according to demand and customer demographics. Let's see how prices vary across different market segments
plt.figure(figsize=(10, 6))
sns.boxplot(
data=data, x="market_segment_type", y="avg_price_per_room", palette="gist_rainbow"
)
plt.show()
We will define a stacked barplot() function to help analyse how the target variable varies across predictor categories.
# Defining the stacked_barplot() function
def stacked_barplot(data,predictor,target,figsize=(10,6)):
(pd.crosstab(data[predictor],data[target],normalize='index')*100).plot(kind='bar',figsize=figsize,stacked=True)
plt.legend(loc="lower right")
plt.ylabel('Percentage Cancellations %')
Market Segment Type
Let's plot Market Segment Type against the target variable Booking Status using the stacked_barplot() function defined above and note our insights.
stacked_barplot(data,'market_segment_type','booking_status')
Write your answers here:
1. The two market segments with the highest cancellation rates are Online and Aviation, followed by Offline and Corporate.
2. The Complementary segment has no cancellations, which makes sense given those bookings have an average room price of 0.
The quick groupby below puts numbers on these rates.
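The same comparison in numbers (a quick groupby on the encoded target, where 1 = canceled):
# Cancellation rate by market segment
data.groupby("market_segment_type")["booking_status"].mean().sort_values(ascending=False)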
Repeated Guest
Let's plot Repeated Guest against the target variable Booking Status using the stacked_barplot() function and note our insights. Repeat guests stay at the hotel often and are important to brand equity.
# Remove _________ and complete the code
stacked_barplot(data,'repeated_guest','booking_status')
Write your answers here:
1. Repeated guests clearly cancel far less often, possibly because of loyalty, familiarity with the service, or discounted rates.
2. Roughly a third of first-time guests' bookings end up canceled.
Let's analyze the customers who stayed at least one week night and one weekend night at the hotel.
stay_data = data[(data["no_of_week_nights"] > 0) & (data["no_of_weekend_nights"] > 0)].copy()
stay_data["total_days"] = (stay_data["no_of_week_nights"] + stay_data["no_of_weekend_nights"])
stacked_barplot(stay_data, "total_days", "booking_status",figsize=(15,6))
As hotel room prices are dynamic, let's see how the prices vary across the different months.
plt.figure(figsize=(10, 5))
sns.lineplot(y=data["avg_price_per_room"], x=data["arrival_month"], ci=None)
plt.show()
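Cancellations may also vary with the season; a quick, illustrative groupby on the encoded target shows the monthly pattern:
# Proportion of bookings canceled in each arrival month
data.groupby("arrival_month")["booking_status"].mean().round(3)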
Separating the independent variables (X) and the dependent variable (Y)
X = data.drop(["booking_status"], axis=1)
Y = data["booking_status"]
X = pd.get_dummies(X, drop_first=True) # Encoding the Categorical features
Splitting the data into a 70% train and 30% test set
Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance there could be several times more negative samples than positive samples. In such cases it is recommended to use the stratified sampling technique to ensure that relative class frequencies are approximately preserved in each train and validation fold.
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30,stratify=Y, random_state=1)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (25392, 27)
Shape of test set :  (10883, 27)
Percentage of classes in training set:
0    0.672377
1    0.327623
Name: booking_status, dtype: float64
Percentage of classes in test set:
0    0.672333
1    0.327667
Name: booking_status, dtype: float64
Both the cases are important as:
If we predict that a booking will not be canceled and the booking gets canceled then the hotel will lose resources and will have to bear additional costs of distribution channels.
If we predict that a booking will get canceled and the booking doesn't get canceled the hotel might not be able to provide satisfactory services to the customer by assuming that this booking will be canceled. This might damage brand equity.
Therefore, the metric to maximize here is the F1 Score: the greater the F1 score, the better the balance between minimizing False Negatives and False Positives. Also, let's create a function to calculate and print the classification report and confusion matrix so that we don't have to rewrite the same code repeatedly for each model.
# Creating metric function
def metrics_score(actual, predicted):
print(classification_report(actual, predicted))
cm = confusion_matrix(actual, predicted)
plt.figure(figsize=(8,5))
sns.heatmap(cm, annot=True, fmt='.2f', xticklabels=['Not Cancelled', 'Cancelled'], yticklabels=['Not Cancelled', 'Cancelled'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
We will be building 4 different models: Logistic Regression, Support Vector Machine, Decision Tree, and Random Forest.
# Remove _________ and complete the code
# Fitting logistic regression model
lg =LogisticRegression()
lg.fit(X_train,y_train)
LogisticRegression()
# Remove _________ and complete the code
# Checking the performance on the training data
y_pred_train = lg.predict(X_train)
metrics_score(y_train,y_pred_train)
              precision    recall  f1-score   support

           0       0.82      0.90      0.86     17073
           1       0.74      0.58      0.65      8319

    accuracy                           0.80     25392
   macro avg       0.78      0.74      0.75     25392
weighted avg       0.79      0.80      0.79     25392
Write your Answer here: On the training data, the logistic regression model achieves an F1 score of 0.65 for the canceled class (class 1), with precision of 0.74 and recall of 0.58.
Let's check the performance on the test set
# Remove _________ and complete the code
# Checking the performance on the test dataset
y_pred_test = lg.predict(X_test)
metrics_score(y_test,y_pred_test)
              precision    recall  f1-score   support

           0       0.81      0.90      0.85      7317
           1       0.74      0.57      0.65      3566

    accuracy                           0.79     10883
   macro avg       0.77      0.74      0.75     10883
weighted avg       0.79      0.79      0.79     10883
Write your Answer here: The F1 score on the test data is 0.65, essentially the same as on the training data, so the model generalizes well but its overall performance leaves room for improvement. At the default 0.5 threshold, precision (0.74) is noticeably higher than recall (0.57), so let's look for a threshold that gives a better precision-recall trade-off.
Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.
Let's use the Precision-Recall curve and see if we can find a better threshold.
# Remove _________ and complete the code
# Predict_proba gives the probability of each observation belonging to each class
y_scores_lg=lg.predict_proba(X_train)
precisions_lg, recalls_lg, thresholds_lg = precision_recall_curve(y_train,y_scores_lg[:,1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_lg, precisions_lg[:-1], 'b--', label='precision')
plt.plot(thresholds_lg, recalls_lg[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
Write your answers here: From the precision-recall curve, a threshold of approximately 0.42 appears optimal for maximizing the F1 score; the quick check below locates the F1-maximizing threshold numerically.
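To double-check the value read off the plot (a small sketch reusing the precision/recall/threshold arrays computed above), we can locate the threshold that maximizes the F1 score directly:
# F1 at each threshold (small epsilon avoids division by zero at degenerate points)
f1_lg = 2 * precisions_lg[:-1] * recalls_lg[:-1] / (precisions_lg[:-1] + recalls_lg[:-1] + 1e-12)
print("Threshold that maximizes F1:", round(thresholds_lg[np.argmax(f1_lg)], 3))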
# Setting the optimal threshold
optimal_threshold = 0.42
# Remove _________ and complete the code
# Creating confusion matrix
y_pred_train = lg.predict_proba(X_train)
metrics_score(y_train,y_pred_train[:,1]>optimal_threshold)
              precision    recall  f1-score   support

           0       0.85      0.85      0.85     17073
           1       0.69      0.68      0.69      8319

    accuracy                           0.80     25392
   macro avg       0.77      0.77      0.77     25392
weighted avg       0.79      0.80      0.80     25392
Write your answers here: With the 0.42 threshold, the training F1 score for the canceled class rises from 0.65 to 0.69: recall improves to 0.68 while precision falls from 0.74 to 0.69. Overall accuracy is unchanged at 0.80.
Let's check the performance on the test set
# Remove _________ and complete the code
y_pred_test = lg.predict_proba(X_test)
metrics_score(y_test,y_pred_test[:,1]>optimal_threshold)
              precision    recall  f1-score   support

           0       0.84      0.85      0.84      7317
           1       0.68      0.67      0.67      3566

    accuracy                           0.79     10883
   macro avg       0.76      0.76      0.76     10883
weighted avg       0.79      0.79      0.79     10883
Write your answers here: On the test data the F1 score for the canceled class improves from 0.65 to 0.67 with the tuned threshold. The gain is modest, so let's try other models to find a better predictor.
Support Vector Machines are sensitive to the scale of the features (and converge faster on scaled data), so let's standardize the data before fitting the SVM models.
# Scaling the data
sc=StandardScaler()
# Fit_transform on train data
X_train_scaled=sc.fit_transform(X_train)
X_train_scaled=pd.DataFrame(X_train_scaled, columns=X.columns)
# Transform on test data
X_test_scaled=sc.transform(X_test)
X_test_scaled=pd.DataFrame(X_test_scaled, columns=X.columns)
Let's build the models using two of the most widely used kernel functions: linear and RBF.
Note: Please use the scaled data for modeling Support Vector Machine
# Remove _________ and complete the code
svm = SVC(kernel='linear', probability=True)  # Linear kernel, i.e. a linear decision boundary
model = svm.fit(X= X_train_scaled, y = y_train)
# Remove _________ and complete the code
y_pred_train_svm = model.predict(X_train_scaled)
metrics_score(y_train,y_pred_train_svm)
              precision    recall  f1-score   support

           0       0.83      0.90      0.86     17073
           1       0.74      0.61      0.67      8319

    accuracy                           0.80     25392
   macro avg       0.79      0.76      0.77     25392
weighted avg       0.80      0.80      0.80     25392
Write your answers here: The SVM with a linear kernel has a training F1 score of 0.67 for the canceled class, slightly better than the logistic regression model (0.65). Precision and recall for canceled bookings are 0.74 and 0.61, respectively.
Checking model performance on test set
# Remove _________ and complete the code
y_pred_test_svm = model.predict(X_test_scaled)
metrics_score(y_test, y_pred_test_svm)
              precision    recall  f1-score   support

           0       0.82      0.90      0.86      7317
           1       0.74      0.61      0.67      3566

    accuracy                           0.80     10883
   macro avg       0.78      0.75      0.76     10883
weighted avg       0.80      0.80      0.80     10883
Write your answers here: The model generalizes well: the test F1 score for the canceled class is also 0.67. Let's check whether a different probability threshold improves it.
# Remove _________ and complete the code
y_scores_svm=model.predict_proba(X_train_scaled)
precisions_svm, recalls_svm, thresholds_svm = precision_recall_curve(y_train, y_scores_svm[:,1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_svm, precisions_svm[:-1], 'b--', label='precision')
plt.plot(thresholds_svm, recalls_svm[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
Write your answers here: From the precision-recall curve, a threshold of approximately 0.42 again looks like the best balance between precision and recall for maximizing the F1 score.
optimal_threshold_svm=0.42
# Remove _________ and complete the code
y_pred_train_svm=model.predict_proba(X_train_scaled)
metrics_score(y_train, y_pred_train_svm[:,1]>optimal_threshold_svm)
Write your answers here:_ At the optimal threshold of 0.42, the performance of the training model has slightly increased with an f1 score of 0.70.
# Remove _________ and complete the code
y_pred_test_svm = model.predict_proba(X_test_scaled)
metrics_score(y_test, y_pred_test_svm[:,1] > optimal_threshold_svm)
Write your answers here: With the tuned threshold the model also performs well on the test data, with an F1 score of about 0.70, in line with the training data; precision and recall are essentially unchanged. The linear-kernel SVM thus edges out the logistic regression model. Next, let's try a non-linear (RBF) kernel.
# Remove _________ and complete the code
svm_rbf=SVC(kernel="rbf",probability=True)
svm_rbf.fit(X_train_scaled,y_train)
SVC(probability=True)
# Remove _________ and complete the code
y_pred_train_svm_rbf = svm_rbf.predict(X_train_scaled)
metrics_score(y_train,y_pred_train_svm_rbf)
              precision    recall  f1-score   support

           0       0.86      0.92      0.89     17073
           1       0.81      0.69      0.74      8319

    accuracy                           0.85     25392
   macro avg       0.83      0.80      0.82     25392
weighted avg       0.84      0.85      0.84     25392
Write your answers here: The RBF kernel reaches a training F1 score of 0.74 for the canceled class, up from 0.67 with the linear kernel; precision rises from 0.74 to 0.81 and recall from 0.61 to 0.69. Let's check the test data.
# Remove _________ and complete the code
y_pred_test = svm_rbf.predict(X_test_scaled)
metrics_score(y_test,y_pred_test)
              precision    recall  f1-score   support

           0       0.85      0.92      0.88      7317
           1       0.80      0.66      0.72      3566

    accuracy                           0.84     10883
   macro avg       0.82      0.79      0.80     10883
weighted avg       0.83      0.84      0.83     10883
Write your answers here: The F1 score drops to 0.72 on the test data, which suggests some overfitting to the training data. Let's check whether a different probability threshold gives a better balance.
# Predict on train data
y_scores_svm_rbf=svm_rbf.predict_proba(X_train_scaled)
precisions_svm_rbf, recalls_svm_rbf, thresholds_svm_rbf = precision_recall_curve(y_train, y_scores_svm_rbf[:,1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_svm_rbf, precisions_svm_rbf[:-1], 'b--', label='precision')
plt.plot(thresholds_svm_rbf, recalls_svm_rbf[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
optimal_threshold_svm=0.38
# Remove _________ and complete the code
y_pred_train_svm_rbf = svm_rbf.predict_proba(X_train_scaled)
metrics_score(y_train,y_pred_train_svm_rbf[:,1]>optimal_threshold_svm)
              precision    recall  f1-score   support

           0       0.88      0.89      0.88     17073
           1       0.76      0.75      0.76      8319

    accuracy                           0.84     25392
   macro avg       0.82      0.82      0.82     25392
weighted avg       0.84      0.84      0.84     25392
Write your answers here: At the 0.38 threshold the training F1 score for the canceled class improves slightly to 0.76: recall rises to 0.75 while precision falls to 0.76.
# Remove _________ and complete the code
y_pred_test = svm_rbf.predict_proba(X_test_scaled)
metrics_score(y_test,y_pred_test[:,1]>optimal_threshold_svm)
              precision    recall  f1-score   support

           0       0.87      0.88      0.88      7317
           1       0.75      0.74      0.74      3566

    accuracy                           0.83     10883
   macro avg       0.81      0.81      0.81     10883
weighted avg       0.83      0.83      0.83     10883
Write your answers here: On the test data the F1 score is 0.74, about two points better than with the default threshold. The RBF-kernel SVM outperforms both the linear-kernel SVM and logistic regression, but there is still room for improvement.
# Remove _________ and complete the code
model_dt = DecisionTreeClassifier(random_state=1)
model_dt.fit(X_train,y_train)
DecisionTreeClassifier(random_state=1)
# Remove _________ and complete the code
# Checking performance on the training dataset
pred_train_dt = model_dt.predict(X_train)
metrics_score(y_train,pred_train_dt)
              precision    recall  f1-score   support

           0       0.99      1.00      1.00     17073
           1       1.00      0.99      0.99      8319

    accuracy                           0.99     25392
   macro avg       1.00      0.99      0.99     25392
weighted avg       0.99      0.99      0.99     25392
Write your answers here: The F1 score is 0.99, with precision of 1.00 and recall of 0.99 for the canceled class: an almost perfect fit to the training set. Let's check the test data to see whether the model is overfitting.
pred_test_dt = model_dt.predict(X_test)
metrics_score(y_test,pred_test_dt)
              precision    recall  f1-score   support

           0       0.90      0.90      0.90      7317
           1       0.79      0.79      0.79      3566

    accuracy                           0.87     10883
   macro avg       0.85      0.85      0.85     10883
weighted avg       0.87      0.87      0.87     10883
Write your answers here: On the test data the F1 score falls to 0.79, roughly 20 points below the training score, and precision and recall both drop to 0.79. The default decision tree is clearly overfitting, so let's tune its hyperparameters to control the tree's complexity.
Note: Please use the following hyperparameters provided for tuning the Decision Tree. In general, you can experiment with various hyperparameters to tune the decision tree, but for this project, we recommend sticking to the parameters provided.
# Remove _________ and complete the code
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(2, 7, 2),
"max_leaf_nodes": [50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
}
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, cv=5,scoring='f1')
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=50, min_samples_split=10, random_state=1)
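It can also be useful to inspect the search results directly (a quick look at the grid_obj fitted above):
# Best hyperparameter combination and its mean cross-validated F1 score
print(grid_obj.best_params_)
print(round(grid_obj.best_score_, 3))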
# Remove _________ and complete the code
# Checking performance on the training dataset
dt_tuned = estimator.predict(X_train)
metrics_score(y_train,dt_tuned)
              precision    recall  f1-score   support

           0       0.86      0.93      0.89     17073
           1       0.82      0.68      0.75      8319

    accuracy                           0.85     25392
   macro avg       0.84      0.81      0.82     25392
weighted avg       0.85      0.85      0.84     25392
Write your answers here: After tuning, the training F1 score for the canceled class comes down to 0.75, as the shallower tree no longer memorizes the training data; precision rises to 0.82 while recall falls to 0.68.
# Remove _________ and complete the code
# Checking performance on the test dataset
y_pred_tuned = estimator.predict(X_test)
metrics_score(y_test,y_pred_tuned)
              precision    recall  f1-score   support

           0       0.85      0.93      0.89      7317
           1       0.82      0.67      0.74      3566

    accuracy                           0.84     10883
   macro avg       0.84      0.80      0.81     10883
weighted avg       0.84      0.84      0.84     10883
Write your answers here: On the test data the F1 score is 0.74, only one point below the training score, so the tuned tree generalizes far better than the default tree, even though its overall performance has not improved much. Its test F1 score matches the RBF-kernel SVM, with somewhat higher precision. Let's try a more powerful ensemble model.
feature_names = list(X_train.columns)
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
estimator,max_depth=3,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Remove _________ and complete the code
# Importance of features in the tree building
importances = model_dt.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Write your answers here: The top three features by importance are lead_time, avg_price_per_room, and arrival_date. Nine of the features have zero importance; the quick check below lists them.
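To back up the claim about unused features (a quick check on the importances computed above):
# Features that were never used for a split in this tree
unused = [feature_names[i] for i in range(len(importances)) if importances[i] == 0]
print(len(unused), "features with zero importance:", unused)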
# Remove _________ and complete the code
rf_estimator = RandomForestClassifier( random_state = 1)
rf_estimator.fit(X_train, y_train)
RandomForestClassifier(random_state=1)
# Remove _________ and complete the code
y_pred_train_rf = rf_estimator.predict(X_train)
metrics_score(y_train,y_pred_train_rf)
              precision    recall  f1-score   support

           0       0.99      1.00      1.00     17073
           1       1.00      0.99      0.99      8319

    accuracy                           0.99     25392
   macro avg       0.99      0.99      0.99     25392
weighted avg       0.99      0.99      0.99     25392
Write your answers here: The model performs almost perfectly on the training set, with an F1 score of 0.99 for the canceled class; such a close fit is usually a sign of overfitting.
# Remove _________ and complete the code
y_pred_test_rf = rf_estimator.predict(X_test)
metrics_score(y_test,y_pred_test_rf)
              precision    recall  f1-score   support

           0       0.91      0.95      0.93      7317
           1       0.88      0.80      0.84      3566

    accuracy                           0.90     10883
   macro avg       0.90      0.88      0.88     10883
weighted avg       0.90      0.90      0.90     10883
Write your answers here: The F1 score drops to 0.84 on the test data, so the forest does overfit the training set somewhat. Even so, 0.84 is the highest test F1 score of all the models built, and precision (0.88) and recall (0.80) for the canceled class are also the best. The Random Forest is therefore the preferred model for predicting cancellations.
Let's check the feature importance of the Random Forest
# Remove _________ and complete the code
importances = rf_estimator.feature_importances_
columns = X.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize = (13, 13))
sns.barplot(x=importance_df.Importance, y=importance_df.index)
<AxesSubplot:xlabel='Importance'>
Write your answers here: The Random Forest confirms that lead_time is the most important feature for predicting booking cancellations, with avg_price_per_room second, as in the Decision Tree. Unlike the Decision Tree, however, the Random Forest ranks no_of_special_requests well ahead of arrival_date.
To conclude, the Random Forest predicts the target variable booking_status best, achieving an F1 score of 0.84 for the canceled class on the test data.
Write your answers here:
1. Lead time is strongly associated with booking_status: the longer the gap between booking and arrival, the more likely the booking is to be canceled. The hotel could make follow-up calls to guests ahead of their stay to confirm reservations and reduce last-minute cancellations.
2. The average price per room is another major driver of cancellations, which makes sense: a customer who finds a comparable room at a lower price is more likely to cancel. Management could revise room prices more dynamically to stay competitive in the market.
3. Online bookings show the highest cancellation rate, followed by the Aviation segment. The business could offer complimentary benefits such as breakfast or in-room snacks to encourage these guests to keep their reservations.
4. Seasonality also plays a role, since some months attract far more tourists. The hotel could promote itself more actively in those months and stay in touch with booked guests to prevent cancellations.