Project - Classification and Hypothesis Testing: Hotel Booking Cancellation Prediction

Marks: 40


Problem Statement

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans, scheduling conflicts, etc. Canceling is often made easier by the option to do so free of charge or at a low cost. This may be beneficial to hotel guests, but it is a less desirable and possibly revenue-diminishing factor for hotels to deal with. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

This pattern of cancellations of bookings impacts a hotel on various fronts:

  1. Loss of resources (revenue) when the hotel cannot resell the room.
  2. Additional costs of distribution channels by increasing commissions or paying for publicity to help sell these rooms.
  3. Lowering prices at the last minute so the hotel can resell a room, which reduces the profit margin.
  4. Human resources to make arrangements for the guests.

Objective

This increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which booking is likely to be canceled. INN Hotels Group has a chain of hotels in Portugal - they are facing problems with this high number of booking cancellations and have reached out to your firm for data-driven solutions. You, as a Data Scientist, have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below:

Data Dictionary

Importing the libraries required

Loading the dataset

Overview of the dataset

View the first and last 5 rows of the dataset

Let's view the first few rows and last few rows of the dataset in order to understand its structure a little better.

We will use the head() and tail() methods from Pandas to do this.
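
As a minimal sketch of this step, the snippet below calls head() and tail() on a small toy DataFrame. The values are made up, and the column names other than Booking_ID and booking_status are assumptions about the dataset.

```python
import pandas as pd

# Toy stand-in for the bookings data; values are made up for illustration.
data = pd.DataFrame({
    "Booking_ID": [f"INN{i:05d}" for i in range(1, 13)],
    "lead_time": [224, 5, 1, 211, 48, 346, 34, 83, 121, 44, 0, 35],
    "avg_price_per_room": [65.0, 106.68, 60.0, 100.0, 94.5, 115.0,
                           107.55, 105.61, 96.9, 133.44, 85.03, 140.4],
    "booking_status": ["Not_Canceled", "Not_Canceled", "Canceled", "Canceled",
                       "Canceled", "Canceled", "Not_Canceled", "Not_Canceled",
                       "Not_Canceled", "Canceled", "Not_Canceled", "Not_Canceled"],
})

first_rows = data.head()  # first 5 rows by default
last_rows = data.tail()   # last 5 rows by default
print(first_rows)
print(last_rows)
```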

Understand the shape of the dataset

Check the data types of the columns for the dataset

Dropping duplicate values

Dropping the unique values column

Let's drop the Booking_ID column first before we proceed forward, as a column with unique values will have almost no predictive power for the Machine Learning problem at hand.
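
One way to confirm that a column is a pure identifier before dropping it is to compare its number of distinct values with the number of rows. The sketch below does this on toy data; only the Booking_ID and booking_status names come from the text above.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
data = pd.DataFrame({
    "Booking_ID": ["INN00001", "INN00002", "INN00003", "INN00004"],
    "lead_time": [224, 5, 1, 211],
    "booking_status": ["Not_Canceled", "Not_Canceled", "Canceled", "Canceled"],
})

# A column whose values are all distinct acts like a row label, not a feature.
is_identifier = data["Booking_ID"].nunique() == len(data)
print(is_identifier)

data = data.drop(columns=["Booking_ID"])
print(data.columns.tolist())
```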

Question 1: Check the summary statistics of the dataset and write your observations (2 Marks)

Let's check the statistical summary of the data.
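
A possible way to produce the summary is describe(), transposed so that each variable is a row. The sketch below uses made-up values, and the column names are assumptions. It also adds a mean-minus-median column as a quick hint of skewness.

```python
import pandas as pd

# Toy numeric columns; values are made up for illustration.
data = pd.DataFrame({
    "no_of_week_nights": [1, 2, 2, 3, 3, 2, 1, 17],
    "lead_time": [0, 5, 34, 48, 83, 121, 224, 443],
    "avg_price_per_room": [0.0, 65.0, 94.5, 99.45, 105.61, 115.0, 133.44, 540.0],
})

summary = data.describe().T  # one row per variable
# A mean well above the median (the 50% column) hints at positive skew.
summary["mean_minus_median"] = summary["mean"] - summary["50%"]
print(summary[["mean", "50%", "75%", "max", "mean_minus_median"]])
```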

Write your answers here:

  1. The number of week nights has some outliers, as the 75th percentile is 3 while the maximum value is 17.
  2. The lead time may also have outliers on both sides, which needs to be explored further.
  3. The number of previous cancellations and the number of previous bookings not canceled are heavily skewed: most values are 0, while the maximum values are 13 and 58, respectively.
  4. The mean and median of the arrival date are close, at about 15.5 and 16.
  5. The mean of the average price per room is greater than its median of 99.5, so the distribution may be positively skewed and needs further exploration.

Exploratory Data Analysis

Question 2: Univariate Analysis

Let's explore these variables in some more depth by observing their distributions.

We will first define a hist_box() function that provides both a boxplot and a histogram in the same visual, with which we can perform univariate analysis on the columns of this dataset.
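
The provided hist_box() function is not reproduced in this text. A minimal sketch of what such a helper could look like, using only matplotlib, is shown below. The signature, styling, and toy data are assumptions, not the project's actual implementation.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def hist_box(data, col):
    """Draw a boxplot above a histogram of data[col], sharing the x-axis."""
    fig, (ax_box, ax_hist) = plt.subplots(
        2, 1, sharex=True, figsize=(10, 6),
        gridspec_kw={"height_ratios": (0.2, 0.8)},
    )
    values = data[col].dropna().to_numpy()
    ax_box.boxplot(values, vert=False, showmeans=True)
    ax_box.set_yticks([])
    ax_hist.hist(values, bins=30)
    ax_hist.axvline(np.mean(values), color="green", linestyle="--", label="mean")
    ax_hist.axvline(np.median(values), color="black", linestyle="-", label="median")
    ax_hist.set_xlabel(col)
    ax_hist.legend()
    return fig


# Toy right-skewed data standing in for lead time.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"lead_time": rng.exponential(scale=85, size=500).round()})
fig = hist_box(toy, "lead_time")
```

Drawing the mean and median as vertical lines makes skewness easy to spot: for a positively skewed variable the mean sits to the right of the median.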

Question 2.1: Plot the histogram and box plot for the variable Lead Time using the hist_box function provided and write your insights. (1 Mark)

Write your answers here:

  1. The hist_box plot shows a positively skewed distribution.
  2. The box plot shows that there are outliers in the data.
  3. Lead time might be an important factor for evaluating cancellations. We can explore this using bivariate analysis.

Question 2.2: Plot the histogram and box plot for the variable Average Price per Room using the hist_box function provided and write your insights. (1 Mark)

Write your answers here:

  1. The box plot shows that the average price per room has outliers on both sides.
  2. From the histogram, the average price per room is centered close to 100.
  3. There are some extreme outliers to the right of the box plot. We need to check how many of them are present.

Interestingly some rooms have a price equal to 0. Let's check them.
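
A simple way to inspect these bookings is to filter the rows where the price is 0 and count them by market segment. The sketch below uses toy rows; the column names are assumptions about the dataset.

```python
import pandas as pd

# Toy rows; values are made up for illustration.
data = pd.DataFrame({
    "avg_price_per_room": [0.0, 65.0, 0.0, 99.45, 0.0, 115.0],
    "market_segment_type": ["Complimentary", "Online", "Complimentary",
                            "Offline", "Online", "Corporate"],
})

free_rooms = data[data["avg_price_per_room"] == 0]
segment_counts = free_rooms["market_segment_type"].value_counts()
print(len(free_rooms))
print(segment_counts)
```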

Let's understand the distribution of the categorical variables

Number of Children

Arrival Month

Booking Status

Let's encode Canceled bookings to 1 and Not_Canceled as 0 for further analysis
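
A minimal sketch of this encoding with Series.map() on toy data is shown below. Once the target is 0/1, its mean is the cancellation rate.

```python
import pandas as pd

# Toy target column using the two labels named above.
data = pd.DataFrame({
    "booking_status": ["Not_Canceled", "Canceled", "Canceled",
                       "Not_Canceled", "Not_Canceled"],
})

data["booking_status"] = data["booking_status"].map({"Canceled": 1, "Not_Canceled": 0})
cancel_rate = data["booking_status"].mean()  # share of canceled bookings
print(data["booking_status"].tolist(), cancel_rate)
```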

Question 3: Bivariate Analysis

Question 3.1: Find and visualize the correlation matrix using a heatmap and write your observations from the plot. (2 Marks)
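
As a hedged sketch of this step, the snippet below computes a Pearson correlation matrix with DataFrame.corr() and draws it as an annotated heatmap using plain matplotlib. The toy data and column names are assumptions; a seaborn heatmap would work equally well.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data in which cancellations become more likely as lead time grows.
rng = np.random.default_rng(1)
n = 300
lead_time = rng.exponential(85, n)
toy = pd.DataFrame({
    "lead_time": lead_time,
    "avg_price_per_room": rng.normal(100, 30, n),
    "booking_status": (lead_time + rng.normal(0, 60, n) > 100).astype(int),
})

corr = toy.corr()  # pairwise Pearson correlation of the numeric columns

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
# Write each coefficient into its cell.
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
fig.colorbar(im, ax=ax)
fig.tight_layout()
```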

Write your answers here:

  1. From the heatmap, there is a positive correlation of 0.54 between repeated guest and the number of previous bookings not canceled, which suggests that repeated guests tend to have a history of completed bookings and may be less likely to cancel.
  2. There is an unexpected correlation of 0.39 between the number of previous cancellations and the number of previous bookings not canceled. This needs further investigation.
  3. There is a negative correlation of -0.34 between arrival year and arrival month, which is not strong enough to infer anything from.
  4. The other variables are not highly correlated.

Hotel rates are dynamic and change according to demand and customer demographics. Let's see how prices vary across different market segments

We will define a stacked_barplot() function to help analyse how the target variable varies across predictor categories.
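
The provided function is not reproduced in this text. One possible sketch of such a helper builds a row-normalized cross-tabulation with pd.crosstab() and plots it as stacked bars. The signature and the toy data are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd


def stacked_barplot(data, predictor, target):
    """Plot, per predictor category, the share of each target class as a stacked bar."""
    shares = pd.crosstab(data[predictor], data[target], normalize="index")
    # Sort so the category with the largest share of the last class comes first.
    shares = shares.sort_values(by=shares.columns[-1], ascending=False)
    ax = shares.plot(kind="bar", stacked=True, figsize=(8, 5))
    ax.set_ylabel("share of bookings")
    ax.legend(title=target, loc="upper left", bbox_to_anchor=(1, 1))
    return shares


# Toy data with an already encoded target (1 = canceled).
toy = pd.DataFrame({
    "market_segment_type": ["Online", "Online", "Online", "Online",
                            "Offline", "Offline", "Offline",
                            "Complimentary", "Complimentary"],
    "booking_status": [1, 1, 0, 0, 1, 0, 0, 0, 0],
})
shares = stacked_barplot(toy, "market_segment_type", "booking_status")
print(shares)
```

Normalizing by row (normalize="index") makes segments of very different sizes comparable, because each bar sums to 1.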

Question 3.2: Plot the stacked barplot for the variable Market Segment Type against the target variable Booking Status using the stacked_barplot function provided and write your insights. (1 Mark)

Write your answers here:

  1. The two market segments with the most cancellations are Online and Aviation, followed by Offline and Corporate.
  2. The Complimentary segment has no cancellations, which makes sense.

Question 3.3: Plot the stacked barplot for the variable Repeated Guest against the target variable Booking Status using the stacked_barplot function provided and write your insights. (1 Mark)

Repeating guests are the guests who stay in the hotel often and are important to brand equity.

Write your answers here:

  1. From the graph, it is clear that repeated guests tend to cancel less, possibly due to factors like hospitality, food service, discounted prices, etc.
  2. Non-repeated guests cancel approximately 30 percent of their bookings.

Let's analyze the customers who stayed for at least a day at the hotel.

As hotel room prices are dynamic, let's see how the prices vary across different months.
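
One way to look at this is to group by arrival month and take the mean price, as in the sketch below. The values are made up and the column names are assumptions.

```python
import pandas as pd

# Toy data; values are made up for illustration.
stayed = pd.DataFrame({
    "arrival_month": [1, 1, 5, 5, 9, 9, 12],
    "avg_price_per_room": [70.0, 80.0, 110.0, 120.0, 115.0, 125.0, 90.0],
})

monthly_price = stayed.groupby("arrival_month")["avg_price_per_room"].mean()
print(monthly_price)
```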

Data Preparation for Modeling

Separating the independent variables (X) and the dependent variable (Y)

Splitting the data into a 70% train and 30% test set

Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance there could be several times more negative samples than positive samples. In such cases it is recommended to use the stratified sampling technique to ensure that relative class frequencies are approximately preserved in each train and validation fold.
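
A minimal sketch of a stratified 70/30 split with scikit-learn's train_test_split is shown below. The toy data, the column names, and the random_state value are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with roughly one third positive (canceled) samples.
rng = np.random.default_rng(42)
n = 1000
toy = pd.DataFrame({
    "lead_time": rng.exponential(85, n),
    "avg_price_per_room": rng.normal(100, 30, n),
})
toy["booking_status"] = (rng.random(n) < 0.33).astype(int)

X = toy.drop(columns=["booking_status"])
Y = toy["booking_status"]

# stratify=Y keeps the class proportions nearly identical in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, stratify=Y, random_state=1
)
print(y_train.mean(), y_test.mean())
```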

Model Evaluation Criterion

The model can make wrong predictions in two ways:

  1. Predicting a customer will not cancel their booking but in reality, the customer will cancel their booking.
  2. Predicting a customer will cancel their booking but in reality, the customer will not cancel their booking.

Which case is more important?

Both the cases are important as:

How to reduce the losses?

Also, let's create a function to calculate and print the classification report and confusion matrix so that we don't have to rewrite the same code repeatedly for each model.
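
A sketch of what such a helper might look like is given below. The function name metrics_score, the class labels, and the toy data are assumptions, not the project's actual code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix


def metrics_score(actual, predicted):
    """Print the classification report and return the confusion matrix."""
    print(classification_report(actual, predicted,
                                target_names=["Not Canceled", "Canceled"]))
    cm = confusion_matrix(actual, predicted)  # rows: actual, columns: predicted
    print(cm)
    return cm


# Toy binary problem to exercise the helper.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
cm = metrics_score(y, model.predict(X))
```

In the returned matrix, cm[1, 0] counts the bookings predicted as not canceled that were actually canceled, and cm[0, 1] counts the opposite error.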

Building the model

We will be building 4 different models:

  1. Logistic Regression
  2. Support Vector Machine (SVM)
  3. Decision Tree
  4. Random Forest

Question 4: Logistic Regression (6 Marks)

Question 4.1: Build a Logistic Regression model (Use the sklearn library) (1 Mark)
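
As an illustration of this step, the sketch below fits sklearn's LogisticRegression on synthetic data from make_classification and reports the F1 score on both splits. The data, the max_iter value, and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the prepared booking features.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.67, 0.33], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1)

lg = LogisticRegression(max_iter=1000)
lg.fit(X_train, y_train)

# F1 for the positive class (canceled = 1) on both splits.
f1_train = f1_score(y_train, lg.predict(X_train))
f1_test = f1_score(y_test, lg.predict(X_test))
print(round(f1_train, 2), round(f1_test, 2))
```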

Question 4.2: Check the performance of the model on train and test data (2 Marks)

Write your answer here: Based on the model, the F1 score for predicting booking cancellations is 66% on the training data.

Let's check the performance on the test set

Write your answer here: The F1 score on the test data is 65%, almost the same as on the training data (66%). The model performs consistently on both train and test data, but the F1 score needs to improve. Precision and recall cannot both be maximized at the same time, so we look for an optimal threshold that balances the trade-off between them.

Question 4.3: Find the optimal threshold for the model using the Precision-Recall Curve. (1 Mark)

Precision-Recall curves summarize the trade-off between the true positive rate (recall) and the positive predictive value (precision) for a predictive model using different probability thresholds.

Let's use the Precision-Recall curve and see if we can find a better threshold.
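
A hedged sketch of how a threshold can be read off the curve numerically is shown below, using sklearn's precision_recall_curve on synthetic data. It computes two common candidates: the threshold that maximizes F1 and the threshold where precision and recall are closest. Which of the two the notebook intends is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Synthetic, imbalanced stand-in for the training data.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.67, 0.33], random_state=1)
lg = LogisticRegression(max_iter=1000).fit(X, y)
y_scores = lg.predict_proba(X)[:, 1]  # predicted probability of class 1

precisions, recalls, thresholds = precision_recall_curve(y, y_scores)

# precisions and recalls have one more entry than thresholds; drop the last point.
p, r = precisions[:-1], recalls[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)

best_f1_threshold = thresholds[np.argmax(f1)]
crossover_threshold = thresholds[np.argmin(np.abs(p - r))]
print(best_f1_threshold, crossover_threshold)
```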

Write your answers here: From the Precision-Recall graph, the optimal threshold is approximately 0.42, since we are looking to maximize the F1 score.

Question 4.4: Check the performance of the model on train and test data using the optimal threshold. (2 Marks)
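
Applying a custom threshold means comparing the predicted probabilities with it instead of calling predict(), which uses 0.5. The sketch below illustrates this on synthetic data, taking 0.42 as the threshold read off the curve in this notebook. The data and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score

# Synthetic, imbalanced stand-in for the training data.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.67, 0.33], random_state=1)
lg = LogisticRegression(max_iter=1000).fit(X, y)
proba = lg.predict_proba(X)[:, 1]

optimal_threshold = 0.42  # value taken from the Precision-Recall curve in the text
pred_default = (proba > 0.5).astype(int)
pred_tuned = (proba > optimal_threshold).astype(int)

for name, pred in [("default 0.50", pred_default), ("tuned 0.42", pred_tuned)]:
    print(name,
          round(precision_score(y, pred), 2),
          round(recall_score(y, pred), 2),
          round(f1_score(y, pred), 2))
```

Lowering the threshold can only add predicted cancellations, so recall never decreases, while precision usually falls.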

Write your answers here: With the optimal threshold, the model's training performance has increased by only 1%, which is not significant, while the precision has dropped by 5%.

Let's check the performance on the test set

Write your answers here: The model's performance on the test data also increased only slightly, by 1%. We need to check other models to find a better predictor.

Question 5: Support Vector Machines (11 Marks)

To accelerate SVM training, let's scale the data for support vector machines.
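
A minimal sketch of the scaling step is shown below. StandardScaler is used here as an assumption; the notebook may use a different scaler. The scaler is fitted on the training data only and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train/test matrices with features on very different scales.
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.exponential(85, 700), rng.normal(100, 30, 700)])
X_test = np.column_stack([rng.exponential(85, 300), rng.normal(100, 30, 300)])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
```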

Let's build the models using two of the widely used kernel functions:

  1. Linear Kernel
  2. RBF Kernel

Question 5.1: Build a Support Vector Machine model using a linear kernel (1 Mark)

Note: Please use the scaled data for modeling Support Vector Machine
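
As a sketch of this step on synthetic data, the snippet below fits an SVC with a linear kernel on scaled features. Setting probability=True is an assumption made here so that predict_proba() is available for the threshold analysis that follows.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the prepared booking features.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1)

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# probability=True enables predict_proba at the cost of slower training.
svm_linear = SVC(kernel="linear", probability=True, random_state=1)
svm_linear.fit(X_train_scaled, y_train)

proba_test = svm_linear.predict_proba(X_test_scaled)[:, 1]
f1_test = f1_score(y_test, svm_linear.predict(X_test_scaled))
print(round(f1_test, 2))
```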

Question 5.2: Check the performance of the model on train and test data (2 Marks)

Write your answers here: The SVM model with a linear kernel has an F1 score of 0.67, a slight improvement over the logistic regression model. The precision and recall for canceled bookings are 0.74 and 0.61.

Checking model performance on test set

Write your answers here: The model generalizes well to the test data, as the F1 score is the same, 0.67. Let's check for an optimal threshold.

Question 5.3: Find the optimal threshold for the model using the Precision-Recall Curve. (1 Mark)

Write your answers here: From the graph, the optimal threshold is approximately 0.42. Since we are optimizing the F1 score, we pick the threshold where the precision and recall curves meet.

Question 5.4: Check the performance of the model on train and test data using the optimal threshold. (2 Marks)

Write your answers here: At the optimal threshold of 0.42, the performance on the training data has increased slightly, to an F1 score of 0.70.

Write your answers here: The model with the optimal threshold also performs well on the test data. The F1 score is 0.70, the same as on the training data, and the precision and recall values are unchanged. The linear-kernel SVM performs better than the logistic regression model. Let's check a non-linear kernel next.

Question 5.5: Build a Support Vector Machines model using an RBF kernel (1 Mark)
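
To illustrate why a non-linear kernel can help, the sketch below compares a linear and an RBF kernel on a deliberately non-linear toy problem (two concentric circles from make_circles). The dataset is an assumption chosen for illustration and says nothing about the size of the gain on the hotel data.

```python
from sklearn.datasets import make_circles
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two concentric circles: not separable by any straight line.
X, y = make_circles(n_samples=600, factor=0.5, noise=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1)

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

scores = {}
for kernel in ("linear", "rbf"):
    model = SVC(kernel=kernel).fit(X_train_scaled, y_train)
    scores[kernel] = f1_score(y_test, model.predict(X_test_scaled))
print(scores)
```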

Question 5.6: Check the performance of the model on train and test data (2 Marks)

Write your answers here: The RBF kernel gives an F1 score of 0.74, which is about 5.8% higher than the linear kernel. The precision has increased by 14%, whereas the recall is unchanged between the two kernels. Let's check the test data.

Checking model performance on test set

Write your answers here: The F1 score has decreased by 2% on the test data, which may indicate some overfitting on the training data. Let's check whether an optimal threshold improves the test performance.

Question 5.7: Check the performance of the model on train and test data using the optimal threshold. (2 Marks)

Write your answers here: The F1 score is 0.76, a slight improvement from using the optimal threshold. The recall has increased, but the precision has dropped.

Write your answers here: On the test data the F1 score drops to 0.74, which is still about 2% higher than at the default threshold. The SVM model with the RBF kernel performs better than both the linear kernel and the logistic regression model, but the performance can still be improved.

Question 6: Decision Trees (7 Marks)

Question 6.1: Build a Decision Tree Model (1 Mark)

Question 6.2: Check the performance of the model on train and test data (2 Marks)

Write your answers here: The F1 score is 0.99, with a precision of 1 and a recall of 0.99. The model performs very well on the training set. Let's check the test data to see whether the model is overfitting.

Checking model performance on test set

Write your answers here: The model has a much lower F1 score on the test data than on the training data: it has dropped by about 20%. The precision and recall have also dropped to 0.79, so the model is clearly overfitting. We need to tune the hyperparameters to reduce overfitting.

Question 6.3: Perform hyperparameter tuning for the decision tree model using GridSearch CV (1 Mark)

Note: Please use the following hyperparameters provided for tuning the Decision Tree. In general, you can experiment with various hyperparameters to tune the decision tree, but for this project, we recommend sticking to the parameters provided.
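
The hyperparameter grid provided with the project is not reproduced in this text. The sketch below shows the general GridSearchCV pattern on synthetic data with a hypothetical grid; the parameter values, the F1 scoring choice, and cv=5 are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the prepared booking features.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           weights=[0.67, 0.33], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1)

# Hypothetical grid for illustration only, not the grid provided with the project.
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 20],
    "criterion": ["gini", "entropy"],
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid,
                    scoring="f1", cv=5)
grid.fit(X_train, y_train)

tuned_tree = grid.best_estimator_  # refit on the full training set by default
f1_train = f1_score(y_train, tuned_tree.predict(X_train))
f1_test = f1_score(y_test, tuned_tree.predict(X_test))
print(grid.best_params_, round(f1_train, 2), round(f1_test, 2))
```

Limiting depth or requiring more samples per leaf trades some training accuracy for a smaller gap between train and test scores.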

Question 6.4: Check the performance of the model on the train and test data using the tuned model (2 Marks)

Checking performance on the training set

Write your answers here: After tuning the hyperparameters, the training F1 score dropped to 75%. The precision increased to 0.82, but the recall decreased.

Write your answers here: On the test data the F1 score decreased slightly, by 1%, while the precision and recall do not vary significantly. The Decision Tree with default parameters overfits and is not able to generalize. The tuned model shows nearly identical train and test scores, so it overfits far less, but its test F1 score has not improved over the default tree. The F1 score of the tuned model on the test data is 0.74, the same as for the RBF kernel, but with a higher precision. We need a better model to predict cancellations.

Visualizing the Decision Tree

Question 6.5: What are some important features based on the tuned decision tree? (1 Mark)

Write your answers here: The top 3 important features are:

  1. Lead time
  2. Average price per room
  3. Arrival date

There are 9 features with no importance.


Question 7: Random Forest (4 Marks)

Question 7.1: Build a Random Forest Model (1 Mark)

Question 7.2: Check the performance of the model on the train and test data (2 Marks)

Write your answers here: The model performs very well on the training set, with an F1 score of 0.99 and only a few errors. Such a high score may indicate overfitting.

Write your answers here: The F1 score drops to 0.84 on the test data, so the model may be overfitting the training set. Even so, the Random Forest performs well: its F1 score of 0.84 is higher than that of the models above, and its precision and recall are also higher. Hence Random Forest can be used to predict booking cancellations.

Question 7.3: What are some important features based on the Random Forest? (1 Mark)

Let's check the feature importance of the Random Forest
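
A minimal sketch of this step on synthetic data is shown below: fit a RandomForestClassifier and rank its feature_importances_ in a pandas Series. The feature names are placeholders, and the number of trees is an assumption.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first 3 columns are the informative ones.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=1)
feature_names = [f"feature_{i}" for i in range(6)]  # placeholder names

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

These impurity-based importances sum to 1 and describe how much each feature contributed to the splits, not the direction of its effect on cancellations.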

Write your answers here: From the Random Forest, we can further conclude that lead time is the most important feature for predicting booking cancellations. The second most important feature is the average price per room, the same as in the Decision Tree. According to the Random Forest, the number of special requests plays a larger role than the arrival date in predicting whether a customer cancels a booking.

To conclude, Random Forest predicts the target variable, booking_status, better than the other models. The model was able to predict the cancellations with an F1 score of 0.84.

Question 8: Conclude ANY FOUR key takeaways for business recommendations (4 Marks)

Write your answers here:

  1. Lead time appears to be strongly related to booking_status, which means customers are more likely to cancel when there is more time before their stay. Hotels could make follow-up calls to guests prior to their stay to confirm reservations, which can help avoid last-minute cancellations.
  2. The average price per room is also a major factor in cancellations, as is evident from the data. This makes sense because a customer who finds a better room at a cheaper price is more likely to cancel their reservation. Management could update room prices on a daily basis to keep up with the competition in the market.
  3. Bookings made through the Online segment have the highest cancellation rate, followed by the Aviation segment. The business could have a backup plan to offer complimentary benefits, such as breakfast or self-serve snacks in the room, to help fill their rooms.
  4. Seasonality also plays an important role, because some places attract more tourists in particular months. The business could actively promote its hotels to attract customers and stay in touch with them in order to prevent cancellations.

Happy Learning!