
Hotel Management - Hotel Booking Cancellation Prediction

Akash Kandarkar

Updated: Apr 14, 2023

Intuition-based revenue management strategies such as dynamic pricing, overbooking, and strict cancellation policies can backfire, leading to a loss of sales, a deteriorated business reputation, and a fall in customer loyalty. Our aim was to use analytics to provide valuable insights into hotel cancellations and make recommendations for reducing cancellation rates.




Introduction

The hotel industry has changed over the years, and most bookings are now made through third-party companies. These Online Travel Agencies have turned the cancellation policy from a footnote at the bottom of the page into the main selling point of their marketing campaigns. As a result, customers have become accustomed to free cancellation policies. In 2019, D-Edge Hospitality Solutions reported that the global cancellation rate of hotel reservations had reached 40%. Our dataset contains hotel booking data in which numerous factors can lead to a booking being cancelled. A few of these factors are the lag between the date the booking was made and the date booked, the type of customer, whether he or she is a regular customer, and location. By analyzing past data that shows which bookings were cancelled, we can identify patterns that help the hotel industry better predict whether a booking will hold. Our project aims to provide valuable insights on hotel cancellations using analytics. To address the needs of the project, we adopted the Kanban project methodology, which we found useful because it enables agility and prevents overloading the development process.

Business Understanding

Booking cancellation is a challenge for the hospitality industry because it has a direct impact on revenue. When customers cancel a booking, the hotel's occupancy rate suffers. Revenue management strategies, such as dynamic pricing, overbooking, and strict cancellation policies, are employed to address booking cancellation and maximize occupancy rates. However, when they are based on intuition alone, these strategies can backfire, leading to loss of sales, a deteriorated business reputation, and a fall in customer loyalty.


Business Questions

  1. Do people who cancel their booking tend to make booking changes?

  2. Can we predict a pattern based on previous cancellations?

  3. What type of customers usually cancel the booking?

  4. Which type of deposit accounted for more cancellations?

  5. After what threshold number of days should a customer who cancels their booking pay a convenience fee?

  6. Which country has the maximum cancellation and what needs to be done there?

  7. Which market segment has the maximum cancellation and which market segment needs to be focused on?

Technical Details

There are 40060 rows and 20 columns in our dataset. Looking at the columns, we can see that ‘IsCanceled’ tells us whether a booking was cancelled or not, and this is the column we will be trying to predict.


1. LeadTime (Integer): Number of days that elapsed between the date the booking was entered and the arrival date.

2. StaysInWeekendNights (Integer): Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel.

3. StaysInWeekNights (Integer): Number of weeknights (Monday to Friday) the guest stayed or booked to stay at the hotel.

4. Adults (Integer): Number of adults.

5. Children (Integer): Number of children.

6. Babies (Integer): Number of babies.

7. Meal (Categorical): Type of meal booked. Categories are presented in standard hospitality meal packages.

  • Undefined/SC – no meal package.

  • BB – Bed & Breakfast.

  • HB – Half board (breakfast and one other meal – usually dinner)

  • FB – Full board (breakfast, lunch and dinner)

8. Country (Categorical): Country of origin. Categories are represented as ISO 3166 country codes (for example, PRT or GBR).

9. MarketSegment (Categorical): Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”.

10. IsRepeatedGuest (Categorical): Value indicating whether the booking was from a repeated guest (1) or not (0).

11. PreviousCancellations (Integer): Number of previous bookings that were cancelled by the customer prior to the current booking

12. PreviousBookingsNotCanceled (Integer): Number of previous bookings not cancelled by the customer prior to the current booking

13. ReservedRoomType (Categorical): Code of room type reserved. Code is presented instead of designation for anonymity reasons

14. AssignedRoomType (Categorical): Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g., overbooking) or by customer request. Code is presented instead of designation for anonymity reasons.

15. BookingChanges (Integer): Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation.

16. DepositType (Categorical): Indication of whether the customer made a deposit to guarantee the booking. This variable can assume three categories.

  • No Deposit – no deposit was made

  • Non-Refund – a deposit was made in the value of the total stay cost

  • Refundable – a deposit was made with a value under the total cost of stay

17. CustomerType (Categorical): Type of booking, assuming one of four categories.

  • Contract – when the booking has an allotment or other type of contract associated with it

  • Group – when the booking is associated with a group

  • Transient – when the booking is not part of a group or contract, and is not associated with another transient booking

  • Transient-party – when the booking is transient, but is associated with at least one other transient booking

18. RequiredCarParkingSpaces (Integer): Number of car parking spaces required by the customer.

19. TotalOfSpecialRequests (Integer): Number of special requests made by the customer (e.g., twin bed or high floor).


Goal of our Analysis

Using different models, such as Support Vector Machines, regression, and association rules, we aim to find the model that gives the best insight into what affects booking cancellation and to recommend how to reduce cancellation rates.


Data Acquisition

> glimpse(file)


Packages Used

The following packages were used:

  • tidyverse – collection of R packages for data science

  • caret – build machine learning models

  • ggplot2 – primarily used for data visualization

  • party – create decision trees

  • ggpubr – produce publication-quality visualizations

  • kernlab – kernel-based machine learning

  • arules – represent, manipulate, and analyze transaction data and patterns

  • arulesViz – visualize association rules

  • maps – used to make map outlines and points

  • ggmap – functions to visualize spatial data and models

  • mapproj – convert latitude/longitude to projected coordinates

  • rworldmap – maps global data


Data Cleaning

> Cancellations <- read_csv("https://intro-datascience.s3.us-east2.amazonaws.com/Resort01.csv")

> view(Cancellations)

> dim(Cancellations)

> str(Cancellations)

> summary(Cancellations)

> sapply(Cancellations, function(x) sum(is.null(x)))

> sapply(Cancellations, function(x) sum(is.na(x)))

> table(Cancellations$Country)



We checked the data for missing or null values and found a few dummy values in the country and meal columns. The model does not consider these values, and they do not affect accuracy.
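As an illustration of this kind of check, the sketch below counts both true missing values and placeholder strings per column. The toy data frame and the placeholder strings "NULL" and "Undefined" are assumptions made for the example, not values confirmed from Resort01.csv.

```r
# Hypothetical toy data standing in for the real bookings table.
toy <- data.frame(
  Country = c("PRT", "GBR", "NULL", "ESP", NA),
  Meal    = c("BB", "Undefined", "HB", "BB", "FB"),
  stringsAsFactors = FALSE
)

# Count true missing values (NA) in each column.
na_counts <- sapply(toy, function(x) sum(is.na(x)))

# Count placeholder strings that is.na() would not catch.
# The placeholder list is an assumption for this sketch.
placeholders <- c("NULL", "Undefined")
dummy_counts <- sapply(toy, function(x) sum(x %in% placeholders))

print(na_counts)
print(dummy_counts)
```

Checking for placeholder strings separately matters because a text value such as "NULL" is not the same thing as R's NA, so is.na() alone would report the column as complete.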

Variable Analysis

a) Number of cancellations vs. number of bookings

IsCanceled     Count     Percentage
0              28938     72.24%
1              11122     27.76%
Grand Total    40060     100%




From the table and bar plot, we see that there are 11,122 cancelled bookings and 28,938 bookings that were not cancelled. IsCanceled is the response variable for our model. Cancellations make up 27.76% of the dataset, and bookings that were kept make up 72.24%.
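The percentages follow directly from the two counts. A short sketch of the arithmetic, using only the figures reported here (the variable names are ours):

```r
# Counts reported in the table above.
counts <- c(not_cancelled = 28938, cancelled = 11122)

# Share of each class in percent, rounded to two decimals.
shares <- round(100 * counts / sum(counts), 2)
print(shares)

# Accuracy of always predicting the majority class.
# Any useful classifier has to beat this baseline.
baseline <- max(counts) / sum(counts)
print(round(baseline, 4))
```

The majority-class share is worth keeping in mind when reading model accuracy later, because a model that always predicts "not cancelled" would already be right about 72% of the time.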


b) Lead Time before checking in


Insight about how lead Time is distributed


From the graph of LeadTime, the number of days between booking the hotel and checking in, we can see that the distribution is skewed right. The median is 57 days (about 2 months), meaning that half of all stays are booked almost 2 months or more in advance. The interquartile range is 145 days (about 5 months), so there is considerable variability in how far in advance a client reserves a hotel. It is surprising to see that a vast number of bookings were made on the same day or within 10 days of check-in. The maximum of the distribution is 737 days (about 2 years).
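The summary statistics quoted here can be computed as in the sketch below. The lead_time values are made up to produce a small right-skewed sample; they are not the real data.

```r
# Hypothetical lead times in days (right-skewed on purpose).
lead_time <- c(0, 1, 3, 7, 14, 30, 57, 90, 150, 240, 400)

med <- median(lead_time)                  # middle value of the sorted sample
q   <- quantile(lead_time, c(0.25, 0.75)) # first and third quartiles
iqr <- IQR(lead_time)                     # third quartile minus first quartile

# In a right-skewed distribution the long tail pulls the mean above the median.
skewed_right <- mean(lead_time) > med

print(med)
print(q)
print(iqr)
print(skewed_right)
```

For skewed data like this, the median and interquartile range describe a typical booking better than the mean and standard deviation, which are dominated by the few very long lead times.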

Average LeadTime for cancellation:




From the above bar plot, we can see that when the clients are cancelling their bookings, their average Lead Time before the check in is 129 days (about 4 months) and the average Lead Time when the clients are not cancelling their booking is 79 days (about 2 and a half months).
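A comparison like this is a grouped mean. The sketch below shows one way to compute it in base R on a made-up data frame; the column names match the dataset, but the values are hypothetical.

```r
# Hypothetical bookings: lead time in days and cancellation flag.
toy <- data.frame(
  LeadTime   = c(10, 200, 45, 150, 5, 90),
  IsCanceled = c(0, 1, 0, 1, 0, 1)
)

# Average LeadTime within each value of IsCanceled.
avg_lead <- tapply(toy$LeadTime, toy$IsCanceled, mean)
print(avg_lead)
```

The result is a named vector with one entry per group, which can be passed straight to a bar plot.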

c) Market Segment

From the table and bar chart of market segment, we can see that most bookings in the dataset are made through Online Travel Agents, followed by Offline Travel Agents/Tour Operators, then clients who booked directly, and then groups. The highest cancellation rate belongs to group bookings, where 73.59% of the bookings are cancelled, followed by Online Travel Agents, where 54.42% of the bookings are cancelled.
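A per-segment cancellation rate is a row-wise proportion of a cross-tabulation. The sketch below illustrates the calculation on hypothetical rows; the segment labels are taken from the text, but the counts are invented.

```r
# Hypothetical bookings with their market segment and outcome.
toy <- data.frame(
  MarketSegment = c("Groups", "Groups", "Groups",
                    "Online TA", "Online TA",
                    "Direct", "Direct", "Direct"),
  IsCanceled    = c(1, 1, 0, 1, 0, 0, 0, 1)
)

# Cross-tabulate segment against outcome, then turn each row into proportions.
tab   <- table(toy$MarketSegment, toy$IsCanceled)
rates <- prop.table(tab, margin = 1)

# Column "1" holds the share of each segment's bookings that were cancelled.
cancel_rate <- round(100 * rates[, "1"], 2)
print(sort(cancel_rate, decreasing = TRUE))
```

Using margin = 1 matters: it divides by the number of bookings in each segment, so a small segment with many cancellations is not hidden behind a large one.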

d) Customer Type




From the above table and bar plot, we can see that most bookings were made by the transient customer type, which accounts for 94.85% of all bookings. Transient customers cancelled 45.28% of the bookings they made, which is almost half.

e) Countries

Which country and what factors are in that country?



The counts for the top countries are PRT (17,630), GBR (6,814), ESP (3,957), IRL (2,166), FRA (1,611), DEU (1,203), CN (710), NLD (514), and USA (479). Portugal (PRT) is the country with the most cancellations, as shown on the map.


f) Required Parking Space


From here, we can see that if the customer reserves a parking space at the hotel, the chances of that customer cancelling decrease. There is an 18.87% chance that the customer will not cancel their booking.

g) Special Requests

From here, we can see that when there are no special requests, the percentage of bookings cancelled is higher than the percentage of bookings not cancelled. As the number of special requests increases, the percentage of cancellations decreases. With only one special request, there is a 31.82% chance of the booking not being cancelled. Hence, if the hotel takes care of special requests made by the customer, the cancellation percentage should decrease.

h) Deposit Type



From the above plot, we can see that when a customer puts in a non-refundable deposit, there is still a chance of the customer cancelling their reservation. Customers do not mind cancelling their trip even if they have paid a non-refundable deposit.

Different Models

After analyzing the data and making sure that it had no null values, we produced a set of 6 features that we thought would be best for running the models. We ran the varImp function on our dataset to see which features matter most for accurately predicting booking cancellations. The higher the value, the more significant the feature. We found that Lead Time, Required Parking Space, Market Segment, and Country are among the most significant variables, and we used those features in our CART and SVM models.

Note: The results of the train() commands that computed the regression trees and SVMs are not included here because their rpart() and ksvm() counterparts performed better (for example, in accuracy).


a) Linear model

The dataset was modeled using linear regression with only the important variables:



The adjusted R-squared from this model is 0.2827. This means that the "LeadTime", "RequiredCarParkingSpaces", "MarketSegment", "PreviousCancellations", "Country", and "StaysInWeekNights" variables explain 28.27% of the variation in the "IsCanceled" variable (whether or not the client cancelled their booking). The p-value of the model is 2.2e-16, so it is highly unlikely that the relationship between these variables and cancellation occurs by chance.
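To show where these figures come from in R, here is a minimal sketch that fits a linear model to a 0/1 outcome and reads off the coefficients and adjusted R-squared. The data are simulated and only two predictors are used, so the numbers it prints are unrelated to the results reported in this post.

```r
set.seed(1)
n <- 200

# Simulated predictors (hypothetical values, not the hotel data).
toy <- data.frame(
  LeadTime                 = round(runif(n, 0, 300)),
  RequiredCarParkingSpaces = rbinom(n, 1, 0.2)
)

# Simulated outcome: longer lead time raises the chance of cancelling,
# a parking request lowers it. Probabilities are clipped to [0, 1].
p <- 0.2 + 0.001 * toy$LeadTime - 0.15 * toy$RequiredCarParkingSpaces
p <- pmin(pmax(p, 0), 1)
toy$IsCanceled <- rbinom(n, 1, p)

# Linear regression on the binary outcome.
fit    <- lm(IsCanceled ~ LeadTime + RequiredCarParkingSpaces, data = toy)
adj_r2 <- summary(fit)$adj.r.squared

print(coef(fit))
print(adj_r2)
```

Because the outcome is binary, each coefficient can be read as the change in the predicted cancellation score for a one-unit change in that variable, holding the others fixed. A logistic regression is a common alternative for this kind of outcome, though it is not what was used here.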

The "LeadTime", "RequiredCarParkingSpaces", "PreviousCancellations", and "StaysInWeekNights" variables are numeric, so each has a single coefficient, and each is statistically significant. The "MarketSegment" variable is statistically significant when its value is equal to either "Corporate", "Groups", or "Online TA". Similarly, the "Country" variable is statistically significant when its value is equal to either "ARE", "ARG", "AUS", "AUT", "BEL", "BGR", "BRA", "CHE", "CHN", "CMR", "CN", "CPV", "CZE", "DEU", "DNK", "DOM", "ESP", "EST", "FIN", "FRA", "GBR", "HUN", "IND", "IRL", "IRN", "ISL", "ITA", "JAM", "JPN", "LTU", "LVA", "MEX", "MYS", "NLD", "NOR", "NZL", "OMN", "POL", "ROU", "RUS", "SGP", "SRB", "SUR", "SVN", "SWE", "TWN", "UKR", "USA", or "VNM".


These statistical significances mean that, for every unit increase in the "LeadTime" variable (for every additional day), the score for a hotel booking cancellation increases by 0.001021611. For every unit increase in the "RequiredCarParkingSpaces" variable (for every additional parking space), the score for a hotel booking cancellation decreases by 0.273710333.

The remaining coefficients are listed below. A positive value means the score for a hotel booking cancellation increases, and a negative value means it decreases. For "MarketSegment" and "Country", the change applies when the variable is equal to the value shown. For "PreviousCancellations" and "StaysInWeekNights", it applies for every unit increase (every additional previous cancellation, or every additional week night a customer stayed in the hotel).

Variable                    Change in score
MarketSegment = Corporate   -0.055068980
MarketSegment = Groups      +0.140619578
MarketSegment = Online TA   +0.207684142
PreviousCancellations       +0.025679557
Country = ARE               +0.276874182
Country = ARG               -0.209516692
Country = AUS               -0.242363906
Country = AUT               -0.24405749
Country = BEL               -0.33745448
Country = BGR               -0.40725660
Country = BRA               -0.19908075
Country = CHE               -0.17756059
Country = CHN               -0.36236998
Country = CMR               -0.59487863
Country = CN                -0.34184074
Country = CPV               -0.36656955
Country = CZE               -0.36031245
Country = DEU               -0.35642467
Country = DNK               -0.36575231
Country = DOM               -0.55926935
Country = ESP               -0.18059302
Country = EST               -0.34546621
Country = FIN               -0.31936623
Country = FRA               -0.27557225
Country = GBR               -0.36779672
Country = HUN               -0.30747982
Country = IND               -0.30882464
Country = IRL               -0.33645439
Country = IRN               -0.49488468
Country = ISL               -0.42515268
Country = ITA               -0.26291873
Country = JAM               -0.40549994
Country = JPN               -0.36619693
Country = LTU               -0.41486293
Country = LVA               -0.32654568
Country = MEX               -0.41705237
Country = MYS               -0.50832586
Country = NLD               -0.31872093
Country = NOR               -0.31954483
Country = NZL               -0.40964462
Country = OMN               -0.44310324
Country = POL               -0.34839527
Country = ROU               -0.31035796
Country = RUS               -0.17419130
Country = SGP               -0.46411686
Country = SRB               -0.34582290
Country = SUR               -0.42893594
Country = SVN               -0.33152122
Country = SWE               -0.25015330
Country = TWN               -0.56768091
Country = UKR               -0.49165922
Country = USA               -0.23477465
Country = VNM               -0.57948500
StaysInWeekNights           +0.01330873
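As a worked example of how these coefficients combine, the sketch below adds up the change in score for one hypothetical comparison. Only coefficients quoted in this section are used. The intercept and the baseline categories are not given in the text, so the result is a difference between two bookings and not an absolute score; the scenario itself is invented.

```r
# Coefficients reported in this section.
coefs <- c(
  LeadTime                 = 0.001021611,
  RequiredCarParkingSpaces = -0.273710333,
  MarketSegment_OnlineTA   = 0.207684142,
  PreviousCancellations    = 0.025679557
)

# Hypothetical comparison against a baseline booking:
# 100 more days of lead time, booked through an online travel agent,
# one parking space requested, no extra previous cancellations.
delta <- c(
  LeadTime                 = 100,
  RequiredCarParkingSpaces = 1,
  MarketSegment_OnlineTA   = 1,
  PreviousCancellations    = 0
)

# The total change is the sum of coefficient times change, term by term.
change <- sum(coefs[names(delta)] * delta)
print(change)
```

In this scenario the increase from the longer lead time and the online travel agent segment is almost entirely offset by the single parking request, which shows how large the parking coefficient is relative to the others.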

a) CART Model: Decision Tree

Decision tree models work by repeatedly partitioning the data into multiple sub-spaces, so that the outcome in each final sub-space is as homogeneous as possible. This approach is technically called recursive partitioning.




The Classification And Regression Trees (CART) algorithm builds a decision tree using the Gini impurity index as its splitting criterion. CART produces a binary tree, built by repeatedly splitting a node into two child nodes.
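To make the splitting criterion concrete, the sketch below computes the Gini impurity of a node and the weighted impurity after one candidate split. The labels, the lead times, and the 100-day threshold are all made up; the helper functions gini and split_gini are ours and are not part of any package.

```r
# Gini impurity of a set of class labels: 1 minus the sum of squared class shares.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted Gini impurity of the two child nodes produced by a binary split.
split_gini <- function(labels, go_left) {
  n     <- length(labels)
  left  <- labels[go_left]
  right <- labels[!go_left]
  (length(left) / n) * gini(left) + (length(right) / n) * gini(right)
}

# Hypothetical node: 1 = cancelled, 0 = not cancelled.
labels    <- c(1, 1, 1, 0, 0, 0, 0, 1)
lead_time <- c(200, 150, 180, 10, 20, 5, 30, 40)

parent <- gini(labels)                         # 0.5 for a 50/50 node
child  <- split_gini(labels, lead_time < 100)  # impurity after the split

print(parent)
print(child)
```

A tree-growing procedure of this kind evaluates many candidate splits and keeps the one with the lowest weighted impurity, then repeats the process inside each child node.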


From the above decision tree data, we can see that LeadTime, RequiredCarParkingSpaces, MarketSegment, and Country are the top contributors to the model’s accuracy.



The model reaches an accuracy of 0.8313 against an NIR (No Information Rate) of 0.7224, with a p-value below the alpha level of 0.05, so we consider it a good model.


b) SVM


With a p-value of <2.2e-16 for Accuracy > NIR, the null hypothesis is rejected. The accuracy is therefore statistically significantly better than the No Information Rate.
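The comparison between accuracy and the No Information Rate can be reproduced with a one-sided binomial test. The sketch below uses base R only and a made-up confusion matrix; it is our understanding of how such a p-value is typically obtained, not output from the models in this post.

```r
# Hypothetical confusion matrix (rows = predicted, columns = actual).
cm <- matrix(c(850, 150,
               120, 380),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("0", "1"),
                             actual    = c("0", "1")))

n        <- sum(cm)
correct  <- sum(diag(cm))   # predictions that match the actual class
accuracy <- correct / n

# No Information Rate: share of the largest actual class.
nir <- max(colSums(cm)) / n

# One-sided exact binomial test of H0: accuracy <= NIR.
test <- binom.test(correct, n, p = nir, alternative = "greater")

print(accuracy)
print(nir)
print(test$p.value)
```

A small p-value here says that the model gets more predictions right than a rule that always guesses the majority class could plausibly achieve by chance.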


c) Association Rules



There’s a 5.43438% chance that a customer has “Groups” as their Market Segment, “PRT” as their Country, and cancelled their hotel booking (the support of the rule).


If a customer has “Groups” as their Market Segment and “PRT” as their Country, there’s a 75.45927% chance the customer will cancel their hotel booking (the confidence of the rule).


There’s a 7.798303% chance that a customer has “Online TA” as their Market Segment, “PRT” as their Country, and cancelled their hotel booking.


If a customer has “Online TA” as their Market Segment and “PRT” as their Country, there’s a 48.15785% chance the customer will cancel their hotel booking.
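The two kinds of percentage quoted for each rule are its support and its confidence. The sketch below computes both by hand in base R on a hypothetical ten-row table, without the arules package, and then backs out the share of all bookings that match the left-hand side of the first rule from the two reported figures.

```r
# Hypothetical bookings (values invented for illustration).
toy <- data.frame(
  MarketSegment = c("Groups", "Groups", "Groups", "Groups",
                    "Online TA", "Online TA",
                    "Direct", "Direct", "Direct", "Direct"),
  Country       = c("PRT", "PRT", "PRT", "GBR",
                    "PRT", "PRT",
                    "PRT", "ESP", "GBR", "PRT"),
  IsCanceled    = c(1, 1, 0, 0, 1, 0, 0, 0, 1, 0)
)

# Rule: {MarketSegment = Groups, Country = PRT} => {IsCanceled = 1}
lhs <- toy$MarketSegment == "Groups" & toy$Country == "PRT"
rhs <- toy$IsCanceled == 1

# Support: share of all bookings that match both sides of the rule.
support <- mean(lhs & rhs)

# Confidence: among bookings matching the left-hand side, the share that cancelled.
confidence <- sum(lhs & rhs) / sum(lhs)

print(support)
print(confidence)

# Since confidence = support / P(left-hand side), the reported figures imply
# the share of bookings that are Groups bookings from PRT.
lhs_share_reported <- 0.0543438 / 0.7545927
print(lhs_share_reported)
```

Reading the two numbers together is useful: the Groups and PRT rule has high confidence but applies to only about 7% of bookings, so it describes a small, high-risk segment.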


Conclusion

We started with all 20 variables and then reduced them to the 11 that we thought would be important. After running CART, we identified the 6 most important variables and used them in our completed model. The goal of our analysis was to answer two key questions:

  • Are Lead Time and Market Segment related to booking cancellation?

  • Can we predict with accuracy if a booking will be cancelled based on the attributes?

Based on our data analysis, these are the significant and potentially causal relationships that affect booking cancellation:


a) Lead Time

Based on our analysis, we learned that Lead Time is one of the most significant factors affecting cancellation. As lead time increases, customers are more likely to cancel. We observed this in the CART analysis.


We see that people book their reservations about 2 months in advance. If someone cancels their booking when fewer than 60 days (about 2 months) are left before check-in, the hotel can charge them a convenience fee.


b) Market Segment

Market Segment is a significant factor in the linear model, and the CART analysis gave us a similar result. We see that the highest percentages of cancellation come from groups and online travel agents. The hotel needs to rethink its terms with online travel agents and its penalty for group cancellations.


c) Deposit Type

We see that customers are not deterred by the monetary penalty when deciding to cancel. For hotels, making rooms non-refundable will not prevent cancellation.


d) Required Car Parking/Special Requests

We see that when customers reserve a parking space while booking their hotel reservation, the chances of them cancelling decrease. The same holds for special requests. Giving customers what they ask for lowers the cancellation rate.


e) Can we predict with accuracy if a booking will be cancelled based on its attributes

Through our analysis, we think the Support Vector Machine model works well to predict hotel cancellations. We tested two other models to compare prediction accuracy. The CART model gives an accuracy of 81.08%, whereas the Support Vector Machine gives the highest accuracy, 83.59%.


Recommendation

  • Determine a threshold number of days before check-in, after which customers pay a convenience fee if the booking is cancelled.

  • Customers who book 2 months or more in advance are more likely to cancel, in line with the lead-time findings above.

  • Within the market segment category, most cancellations come from online travel agents. The hotel can change the terms and conditions it has with online agents to avoid further cancellations from them.

  • The booking deposit should depend on how many days in advance the booking is made.

Limitation

  • Lack of customer behavior data

  • From our analysis, we discovered that variables relating to the customer’s behavior, such as special requests, are significant in relation to booking cancellation. Unfortunately, our dataset provides few variables that relate to customer behavior. For example, we do not know whether the special requests made by the customer were taken care of. However, we understand that such data is not readily available and might be difficult to monitor.

Future Studies

Include more customer behavior data

One recommendation that can improve our study is to source customer behavior data. Examples include data about customers’ browsing sessions, the type of special request, and whether that special request was fulfilled. With this additional data, we might understand the customer’s booking journey in greater detail and identify other good predictors of booking cancellation.

Cross reference results and findings from different hotels

To improve our model and increase its relevance to different hotels, one recommendation is to perform a similar analysis of data from other hotels and cross-reference the results. Doing this allows us to verify our insights.

