Intuition-based revenue management strategies such as dynamic pricing, overbooking, and strict cancellation policies can backfire, leading to a loss of sales, deteriorated business reputation, and fall in customer loyalty. Our aim was to use analytics to provide valuable insights into hotel cancellations and make recommendations for improving cancellation rates.
![](https://static.wixstatic.com/media/c33de6_5b85d3cbfb1544c89acf10702aa85d2e~mv2.jpg/v1/fill/w_980,h_653,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/c33de6_5b85d3cbfb1544c89acf10702aa85d2e~mv2.jpg)
Introduction
The hotel industry has changed over the years, and most bookings are now made through third-party companies. These online travel agencies have moved the cancellation policy from a footnote at the bottom of the page to the main selling point of their marketing campaigns. As a result, customers have become accustomed to free cancellation policies. In 2019, D-Edge Hospitality Solutions reported that the global cancellation rate for hotel reservations had reached 40%.
The dataset contains hotel booking data, and numerous factors in it can lead to a booking being cancelled. These factors include the lag between the date the booking was made and the date booked, the type of customer, whether the customer is a regular guest, and location. By analyzing past data on which bookings were cancelled and identifying patterns in it, we can help the hotel industry better predict whether a booking will be kept.
Our project aims to provide valuable insights into hotel cancellations using analytics. To address the needs of the project, we adopted the Kanban project methodology. We found it useful because it enables agility and prevents overloading the development process.
Business Understanding
Booking cancellation is a challenge for the hospitality industry because it has a direct impact on revenue. When customers cancel a booking, the hotel's occupancy rate suffers. Revenue management strategies, such as dynamic pricing, overbooking, and strict cancellation policies, are employed to address booking cancellations and maximize occupancy rates. However, when based on intuition alone, these strategies can backfire and lead to lost sales, a damaged business reputation, and a fall in customer loyalty.
Business Questions
Do people who cancel their booking tend to make booking changes?
Can we predict a pattern based on previous cancellations?
What type of customers usually cancel the booking?
Which type of deposit accounted for more cancellations?
Can we determine a threshold number of days after which a customer who cancels their booking needs to pay a convenience fee?
Which country has the most cancellations and what needs to be done there?
Which market segment has the most cancellations and which market segment needs to be focused on?
Technical Details
There are 40,060 rows and 20 columns in our dataset. Looking at the columns, we can see that ‘IsCanceled’ tells us whether a booking was cancelled or not, and this is the column we will be trying to predict.
1. LeadTime (Integer): Number of days that elapsed between the date the booking was entered into the PMS (property management system) and the arrival date.
2. StaysInWeekendNights (Integer): Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel.
3. StaysInWeekNights (Integer): Number of weeknights (Monday to Friday) the guest stayed or booked to stay at the hotel.
4. Adults (Integer): Number of adults.
5. Children (Integer): Number of children.
6. Babies (Integer): Number of babies.
7. Meal (Categorical): Type of meal booked. Categories are presented in standard hospitality meal packages:
Undefined/SC – no meal package.
BB – Bed & Breakfast.
HB – Half board (breakfast and one other meal – usually dinner)
FB – Full board (breakfast, lunch and dinner)
8. Country (Categorical): Country of origin. Categories are represented as ISO 3166 country codes.
9. MarketSegment (Categorical): Market segment designation. In categories, the term “TA” means “Travel Agents” and “TO” means “Tour Operators”.
10. IsRepeatedGuest (Categorical): Value indicating whether the booking was from a repeated guest (1) or not (0).
11. PreviousCancellations (Integer): Number of previous bookings that were cancelled by the customer prior to the current booking
12. PreviousBookingsNotCanceled (Integer): Number of previous bookings not cancelled by the customer prior to the current booking
13. ReservedRoomType (Categorical): Code of room type reserved. Code is presented instead of designation for anonymity reasons
14. AssignedRoomType (Categorical): Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g., overbooking) or by customer request. Code is presented instead of designation for anonymity reasons.
15. BookingChanges (Integer): Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation.
16. DepositType (Categorical): Indicates whether the customer made a deposit to guarantee the booking. This variable can assume three categories:
No Deposit – no deposit was made.
Non-Refund – a deposit was made in the value of the total stay cost.
Refundable – a deposit was made with a value under the total cost of stay.
17. CustomerType (Categorical): Type of booking, assuming one of four categories:
Contract – when the booking has an allotment or other type of contract associated with it.
Group – when the booking is associated with a group.
Transient – when the booking is not part of a group or contract, and is not associated with any other transient booking.
Transient-party – when the booking is transient, but is associated with at least one other transient booking.
18. RequiredCarParkingSpaces (Integer): Number of car parking spaces required by the customer.
19. TotalOfSpecialRequests (Integer): Number of special requests made by the customer (e.g. twin bed or high floor).
Goal of our Analysis
Using different models, such as Support Vector Machine, Regression, and Association Rules, we aim to find the model that gives the best insight into what affects booking cancellation, and to recommend how to improve cancellation rates.
Data Acquisition
> glimpse(file)
![](https://static.wixstatic.com/media/c33de6_5b5ab9d6443b4b9cb2665866be7a9fdf~mv2.png/v1/fill/w_980,h_459,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_5b5ab9d6443b4b9cb2665866be7a9fdf~mv2.png)
PACKAGES USED:
The following packages were used:
tidyverse – collection of R packages
caret – build machine learning models
ggplot2 – primarily used for data visualization
party – create decision trees
ggpubr – produce publication-quality visualizations
kernlab – kernel-based machine learning
arules – represent, manipulate, and analyze transaction data and patterns; mine association rules
arulesViz – visualize association rules
maps – draw map outlines and points
ggmap – functions to visualize spatial data and models
mapproj – convert latitude/longitude to projected coordinates
rworldmap – map global data
Data Cleaning
Cancellations=read_csv("https://intro-datascience.s3.us-east2.amazonaws.com/Resort01.csv")
>>view(Cancellations)
![](https://static.wixstatic.com/media/c33de6_ef15cb608e5e44918ace09ed0a09972b~mv2.png/v1/fill/w_962,h_529,al_c,q_90,enc_auto/c33de6_ef15cb608e5e44918ace09ed0a09972b~mv2.png)
![](https://static.wixstatic.com/media/c33de6_5faf94ed5a4245b1a644c63ab6cb488a~mv2.png/v1/fill/w_965,h_535,al_c,q_90,enc_auto/c33de6_5faf94ed5a4245b1a644c63ab6cb488a~mv2.png)
>>dim(Cancellations)
![](https://static.wixstatic.com/media/c33de6_1bb1c04d4d444d228dd1dc3c8cdebb55~mv2.png/v1/fill/w_980,h_96,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/c33de6_1bb1c04d4d444d228dd1dc3c8cdebb55~mv2.png)
>>str(Cancellations)
![](https://static.wixstatic.com/media/c33de6_f0b047bbef374afcb344d21d7c21c310~mv2.png/v1/fill/w_965,h_447,al_c,q_90,enc_auto/c33de6_f0b047bbef374afcb344d21d7c21c310~mv2.png)
>>summary(Cancellations)
![](https://static.wixstatic.com/media/c33de6_de473469a0e3446c851f0fc43af15309~mv2.png/v1/fill/w_960,h_468,al_c,q_90,enc_auto/c33de6_de473469a0e3446c851f0fc43af15309~mv2.png)
>>sapply(Cancellations,function(x) sum(is.null(x)))
![](https://static.wixstatic.com/media/c33de6_6ce61e4ab2ae4a839c12826bfc167450~mv2.png/v1/fill/w_966,h_244,al_c,q_85,enc_auto/c33de6_6ce61e4ab2ae4a839c12826bfc167450~mv2.png)
>>sapply(Cancellations,function(x) sum(is.na(x)))
![](https://static.wixstatic.com/media/c33de6_0269cd9b33f14b199d674612c0df5465~mv2.png/v1/fill/w_975,h_240,al_c,q_85,enc_auto/c33de6_0269cd9b33f14b199d674612c0df5465~mv2.png)
>>table(Cancellations$Country)
![](https://static.wixstatic.com/media/c33de6_dcd6ba273b0b4a1682ff1fb199063d73~mv2.png/v1/fill/w_968,h_318,al_c,q_85,enc_auto/c33de6_dcd6ba273b0b4a1682ff1fb199063d73~mv2.png)
We checked the data for missing and null values. There are a few dummy values in the Country and Meal columns; the model does not consider them, and they do not affect accuracy.
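The two checks above behave differently, which the following minimal sketch illustrates. The toy data frame and the placeholder strings "NULL" and "Undefined" are assumptions made for illustration; they are not confirmed from the screenshots.

```r
# Toy stand-in for the bookings table (hypothetical values)
toy <- data.frame(
  Country = c("PRT", "GBR", "NULL", "ESP", NA),
  Meal    = c("BB", "Undefined", "HB", "BB", "FB"),
  stringsAsFactors = FALSE
)

# is.null() tests the whole column object, so it is FALSE for every column
null_counts <- sapply(toy, function(x) sum(is.null(x)))

# is.na() works element by element, so it counts genuinely missing cells
na_counts <- sapply(toy, function(x) sum(is.na(x)))

# Placeholder strings are ordinary values and have to be counted explicitly
placeholders <- c("NULL", "Undefined")
placeholder_counts <- sapply(toy, function(x) sum(x %in% placeholders))

print(rbind(null_counts, na_counts, placeholder_counts))
```

Because dummy values are stored as ordinary strings, neither is.null() nor is.na() would flag them; a lookup such as the `%in%` check is one way to see how many there are.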
Variable analysis
a) Number of cancellations vs number of bookings

| IsCanceled | Count | Percentage |
| --- | --- | --- |
| 0 | 28938 | 72.24% |
| 1 | 11122 | 27.76% |
| Grand Total | 40060 | 100% |

![](https://static.wixstatic.com/media/c33de6_bae0014c96f144bba31961f8eefe0f28~mv2.png/v1/fill/w_980,h_580,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_bae0014c96f144bba31961f8eefe0f28~mv2.png)
From the table and bar plot, we see that there are 11,122 cancelled bookings and 28,938 bookings that were not cancelled. IsCanceled is the response variable for our model. Cancellations make up 27.76% of the dataset and kept bookings make up 72.24%.
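The percentages can be reproduced from the two counts. This is a small sketch using only the numbers in the table; the names `not_cancelled` and `cancelled` are labels chosen here for readability.

```r
# Counts reported in the table above
counts <- c(not_cancelled = 28938, cancelled = 11122)

# Share of each class, as percentages rounded to two decimals
shares <- round(100 * prop.table(counts), 2)
print(shares)

# On the full data frame, the same figures would come from something like:
# round(100 * prop.table(table(Cancellations$IsCanceled)), 2)
```

The class split matters later on: a model that always predicts "not cancelled" would already be right about 72% of the time.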
b) Lead Time before checking in
![](https://static.wixstatic.com/media/c33de6_bea9fef58c5f44afa6a8d148ffcbc143~mv2.png/v1/fill/w_980,h_316,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/c33de6_bea9fef58c5f44afa6a8d148ffcbc143~mv2.png)
Summary of how LeadTime is distributed
![](https://static.wixstatic.com/media/c33de6_f2e6b49f29504365b1e92b3364e7d551~mv2.png/v1/fill/w_705,h_132,al_c,q_85,enc_auto/c33de6_f2e6b49f29504365b1e92b3364e7d551~mv2.png)
From the graph of LeadTime, the number of days between booking the hotel and checking in, we can see that the distribution is skewed right. The median is 57 days (about 2 months), meaning that half of all stays are booked almost 2 months or more in advance. The interquartile range is 145 days (about 5 months), so there is considerable variability in how far in advance a client reserves a hotel. It is surprising to see that a large number of bookings were made on the same day or within 10 days of check-in. The maximum of the distribution is 737 days (about 2 years).
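The statistics quoted above can be computed with base R functions. The sketch below uses a small made-up vector, so its numbers will not match the dataset; only the function calls carry over.

```r
# Hypothetical lead times in days (not taken from the dataset)
lead_time <- c(0, 0, 3, 10, 25, 57, 90, 150, 200, 365, 737)

med <- median(lead_time)                  # middle value
q   <- quantile(lead_time, c(0.25, 0.75)) # first and third quartiles
iqr <- IQR(lead_time)                     # third quartile minus first quartile

# A mean well above the median is a simple sign of right skew
right_skewed <- mean(lead_time) > med

print(c(median = med, iqr = iqr, max = max(lead_time)))
print(right_skewed)
```

On the real data, `lead_time` would be replaced by `Cancellations$LeadTime`.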
Average LeadTime for cancellation:
![](https://static.wixstatic.com/media/c33de6_a01787a7843d442f849ce91cb5638c43~mv2.png/v1/fill/w_980,h_367,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/c33de6_a01787a7843d442f849ce91cb5638c43~mv2.png)
From the above bar plot, we can see that clients who cancel their bookings have an average Lead Time of 129 days (about 4 months) before check-in, while clients who do not cancel have an average Lead Time of 79 days (about 2 and a half months).
c) Market Segment
![](https://static.wixstatic.com/media/c33de6_a0e14d95945e4d01a6c6a26ebe4cf4ee~mv2.png/v1/fill/w_911,h_715,al_c,q_90,enc_auto/c33de6_a0e14d95945e4d01a6c6a26ebe4cf4ee~mv2.png)
From the table and bar chart of market segment, we can see that most bookings in the dataset are made through Online Travel Agents, followed by Offline Travel Agents/Tour Operators, then clients who booked directly, and then groups.
Similarly, we can see that the highest cancellation rate in the dataset belongs to group bookings, where 73.59% of the bookings are cancelled, followed by Online Travel Agents, where 54.42% of the bookings are cancelled.
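Booking volume and cancellation rate are two separate quantities per segment. A minimal sketch of how both could be computed follows; the toy data frame is made up, and only the column names MarketSegment and IsCanceled come from the dataset description.

```r
# Hypothetical bookings (values made up for illustration)
toy <- data.frame(
  MarketSegment = c("Groups", "Groups", "Groups", "Groups",
                    "Online TA", "Online TA", "Direct", "Direct"),
  IsCanceled    = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Number of bookings per segment
bookings <- table(toy$MarketSegment)

# Cancellation rate per segment: the mean of a 0/1 flag is the share of 1s
cancel_rate <- tapply(toy$IsCanceled, toy$MarketSegment, mean)

print(bookings)
print(round(100 * sort(cancel_rate, decreasing = TRUE), 2))
```

The same pattern applies to any other categorical column, such as CustomerType or DepositType.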
d) Customer Type
![](https://static.wixstatic.com/media/c33de6_6abcb19fe47943e79300c6b38c900f78~mv2.png/v1/fill/w_980,h_432,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_6abcb19fe47943e79300c6b38c900f78~mv2.png)
![](https://static.wixstatic.com/media/c33de6_4845abf744334e27859e056831bc3097~mv2.png/v1/fill/w_980,h_624,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_4845abf744334e27859e056831bc3097~mv2.png)
From the above table and bar plot, we can see that most bookings were made by the transient customer type, which accounts for 94.85% of all bookings. Transient customers cancelled 45.28% of the bookings they made, which is almost half.
e) Countries
Which country has the most cancellations, and what factors are at play in that country?
![](https://static.wixstatic.com/media/c33de6_2c6c28e11c3f49f2a28cf4d83b8b8cc7~mv2.png/v1/fill/w_720,h_719,al_c,q_90,enc_auto/c33de6_2c6c28e11c3f49f2a28cf4d83b8b8cc7~mv2.png)
![](https://static.wixstatic.com/media/c33de6_2e36c249ff944a4392bcc5f7cde8b73c~mv2.png/v1/fill/w_980,h_654,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_2e36c249ff944a4392bcc5f7cde8b73c~mv2.png)
As the statistics show, the top countries contributing to cancellations are PRT (17630), GBR (6814), ESP (3957), IRL (2166), FRA (1611), DEU (1203), CN (710), NLD (514), and USA (479). Portugal (PRT) is the country with the most cancellations, as shown on the map.
f) Required Parking Space
![](https://static.wixstatic.com/media/c33de6_574c6223acf14e5d95a9e330d7a91f21~mv2.png/v1/fill/w_980,h_450,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_574c6223acf14e5d95a9e330d7a91f21~mv2.png)
From here, we can see that if a customer reserves a parking space at the hotel, the chance of that customer cancelling decreases. There is an 18.87% chance that the customer will not cancel their booking.
g) Special Requests
![](https://static.wixstatic.com/media/c33de6_4c3a633ad5d942f8b0e3738b4291ed20~mv2.png/v1/fill/w_980,h_575,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_4c3a633ad5d942f8b0e3738b4291ed20~mv2.png)
From here, we can see that when there are no special requests, the percentage of bookings cancelled is higher than the percentage of bookings not cancelled. As the number of special requests increases, the percentage of cancellations decreases. With only one special request, there is a 31.82% chance of the booking not being cancelled. Hence, if the hotel takes care of special requests made by the customer, the cancellation percentage should decrease.
h) Deposit Type
![](https://static.wixstatic.com/media/c33de6_089576b6ae3648c1a125bfc39736b810~mv2.png/v1/fill/w_980,h_386,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_089576b6ae3648c1a125bfc39736b810~mv2.png)
![](https://static.wixstatic.com/media/c33de6_90b07202561f489ea08da1ab7e6a9812~mv2.png/v1/fill/w_980,h_545,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_90b07202561f489ea08da1ab7e6a9812~mv2.png)
From the above plot, we can see that even when a customer puts in a non-refundable deposit, there is still a chance of the customer cancelling their reservation. Customers do not mind cancelling their trip even if they pay a refundable deposit.
Different Models
After analyzing the data and making sure that we do not have any null values, we produced a set of 6 features that we thought would be the best for running the model. We ran the varImp function on our dataset to see which features are most important for accurately predicting booking cancellations. The higher the value, the more significant the feature. We found that Lead Time, Required Parking Space, Market Segment, and Country are among the most significant variables and used those features in our CART and SVM models.
Note: Results from the train() command for the Regression Trees and SVMs are not included because their rpart() and ksvm() counterparts performed better (for example, in accuracy).
a) Linear model
The dataset was modeled with linear regression, using only the important variables:
![](https://static.wixstatic.com/media/c33de6_6072d8660d9046f6b189e469cb7cedad~mv2.png/v1/fill/w_980,h_563,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_6072d8660d9046f6b189e469cb7cedad~mv2.png)
![](https://static.wixstatic.com/media/c33de6_64d582e9c32c4f26b1dca0a35dae629d~mv2.png/v1/fill/w_783,h_281,al_c,q_85,enc_auto/c33de6_64d582e9c32c4f26b1dca0a35dae629d~mv2.png)
The adjusted R-squared from this model is 0.2827, which means that the "LeadTime", "RequiredCarParkingSpaces", "MarketSegment", "PreviousCancellations", "Country", and "StaysInWeekNights" variables explain 28.27% of the variation in the "IsCanceled" variable (whether or not the client cancelled their booking). The p-value of the model is less than 2.2e-16, so it is highly unlikely that the relationship between these variables and cancellation arose by chance.
The "LeadTime", "RequiredCarParkingSpaces", "PreviousCancellations", and "StaysInWeekNights" are statistically significant without any conditions because they are metric variables. The "MarketSegment" variable is statistically significant when its value is equal to either "Corporate", "Groups", or "Online TA". Similarly, the "Country" variable is statistically significant when its value is equal to either "ARE", "ARG", "AUS", "AUT", "BEL", "BGR", "BRA", "CHE", "CHN", "CMR", "CN", "CPV", "CZE", "DEU", "DNK", "DOM", "ESP", "EST", "FIN", "FRA", "GBR", "HUN", "IND", "IRL", "IRN", "ISL", "ITA", "JAM", "JPN", "LTU", "LVA", "MEX", "MYS", "NLD", "NOR", "NZL", "OMN", "POL", "ROU", "RUS", "SGP", "SRB", "SUR", "SVN", "SWE", "TWN", "UKR", "USA", or "VNM".
These statistical significances mean that, for every unit increase in the "LeadTime" variable (for every additional day), the score for a hotel booking cancellation increases by 0.001021611. For every unit increase in the "RequiredCarParkingSpaces" variable (for every additional parking space), the score for a hotel booking cancellation decreases by 0.273710333.
For the "MarketSegment" variable, the score for a hotel booking cancellation decreases by 0.055068980 when the value is "Corporate", increases by 0.140619578 when the value is "Groups", and increases by 0.207684142 when the value is "Online TA". For every unit increase in the "PreviousCancellations" variable (for every additional previous cancellation), the score increases by 0.025679557. For every unit increase in the "StaysInWeekNights" variable (for every additional week night a customer stayed in the hotel), the score increases by 0.01330873.

For the "Country" variable, the change in the score for a hotel booking cancellation for each statistically significant value is:

| Country | Change in cancellation score |
| --- | --- |
| ARE | +0.276874182 |
| ARG | -0.209516692 |
| AUS | -0.242363906 |
| AUT | -0.24405749 |
| BEL | -0.33745448 |
| BGR | -0.40725660 |
| BRA | -0.19908075 |
| CHE | -0.17756059 |
| CHN | -0.36236998 |
| CMR | -0.59487863 |
| CN | -0.34184074 |
| CPV | -0.36656955 |
| CZE | -0.36031245 |
| DEU | -0.35642467 |
| DNK | -0.36575231 |
| DOM | -0.55926935 |
| ESP | -0.18059302 |
| EST | -0.34546621 |
| FIN | -0.31936623 |
| FRA | -0.27557225 |
| GBR | -0.36779672 |
| HUN | -0.30747982 |
| IND | -0.30882464 |
| IRL | -0.33645439 |
| IRN | -0.49488468 |
| ISL | -0.42515268 |
| ITA | -0.26291873 |
| JAM | -0.40549994 |
| JPN | -0.36619693 |
| LTU | -0.41486293 |
| LVA | -0.32654568 |
| MEX | -0.41705237 |
| MYS | -0.50832586 |
| NLD | -0.31872093 |
| NOR | -0.31954483 |
| NZL | -0.40964462 |
| OMN | -0.44310324 |
| POL | -0.34839527 |
| ROU | -0.31035796 |
| RUS | -0.17419130 |
| SGP | -0.46411686 |
| SRB | -0.34582290 |
| SUR | -0.42893594 |
| SVN | -0.33152122 |
| SWE | -0.25015330 |
| TWN | -0.56768091 |
| UKR | -0.49165922 |
| USA | -0.23477465 |
| VNM | -0.57948500 |

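To see how these coefficients combine, the sketch below adds up the contributions for one hypothetical booking. The coefficient values are the ones quoted in the text; the booking itself and the names `MarketSegment_OnlineTA` and `Country_GBR` are made up for illustration. The intercept is not reported in the text, so the result is a change relative to the model's baseline rather than a full prediction.

```r
# Coefficients quoted in the text (change in score relative to the baseline)
coefs <- c(
  LeadTime                 = 0.001021611,
  RequiredCarParkingSpaces = -0.273710333,
  MarketSegment_OnlineTA   = 0.207684142,
  Country_GBR              = -0.36779672
)

# Hypothetical booking: 100 days of lead time, no parking space,
# booked through an online travel agent, guest from GBR
booking <- c(
  LeadTime                 = 100,
  RequiredCarParkingSpaces = 0,
  MarketSegment_OnlineTA   = 1,
  Country_GBR              = 1
)

# Contribution of each term and their sum (intercept not included)
contributions <- coefs * booking[names(coefs)]
score_change  <- sum(contributions)

print(contributions)
print(score_change)
```

Categorical values enter as 0/1 indicators, so their coefficient is added once when the value applies, while numeric variables are multiplied by their coefficient.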
a) CART Model: Decision Tree
Decision tree models work by repeatedly partitioning the data into multiple sub-spaces, so that the outcome in each final sub-space is as homogeneous as possible. This approach is technically called recursive partitioning.
![](https://static.wixstatic.com/media/c33de6_e8767b0f1b274de184953287180c61ef~mv2.png/v1/fill/w_980,h_556,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_e8767b0f1b274de184953287180c61ef~mv2.png)
The Classification And Regression Trees (CART) algorithm builds a decision tree using the Gini impurity index as its splitting criterion. CART produces a binary tree, built by repeatedly splitting each node into two child nodes.
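As an illustration of the splitting criterion, here is a minimal sketch of the Gini impurity and of the weighted impurity of one binary split. The helper names `gini` and `split_gini`, the toy data, and the 60-day cut-off are assumptions for illustration; this is not the code rpart() runs internally.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

# Weighted impurity of a binary split defined by a logical vector
split_gini <- function(labels, go_left) {
  n     <- length(labels)
  left  <- labels[go_left]
  right <- labels[!go_left]
  (length(left) / n) * gini(left) + (length(right) / n) * gini(right)
}

# Hypothetical example: how well does a 60-day lead time cut-off separate cancellations?
is_canceled <- c(0, 0, 0, 1, 0, 1, 1, 1)
lead_time   <- c(5, 12, 30, 45, 70, 90, 120, 200)

parent <- gini(is_canceled)
child  <- split_gini(is_canceled, lead_time <= 60)

print(c(parent = parent, child = child, gain = parent - child))
```

A tree builder in this style would evaluate many candidate splits and keep the one with the lowest weighted impurity, then repeat the process inside each child node.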
![](https://static.wixstatic.com/media/c33de6_0c3a15ebcd7f488988883901be4ed36d~mv2.png/v1/fill/w_679,h_745,al_c,q_90,enc_auto/c33de6_0c3a15ebcd7f488988883901be4ed36d~mv2.png)
From the above decision tree output, we can see that LeadTime, RequiredCarParkingSpaces, MarketSegment, and Country are the top contributors to the model’s accuracy.
![](https://static.wixstatic.com/media/c33de6_b40060d6d4f74082ac21be7cf9fbc97d~mv2.png/v1/fill/w_980,h_814,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_b40060d6d4f74082ac21be7cf9fbc97d~mv2.png)
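The model reaches an accuracy of 0.8313 against a No Information Rate (NIR) of 0.7224, which is the accuracy obtained by always predicting the majority class. The p-value for accuracy exceeding the NIR is below the alpha level of 0.05, so the model performs significantly better than that baseline and we consider it a good model.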
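The relationship between accuracy, NIR, and the p-value can be sketched in base R. The confusion matrix below is made up for illustration, and the one-sided binomial test is written out by hand here as an approximation of what the confusion matrix output reports.

```r
# Hypothetical confusion matrix (rows = predicted, columns = actual)
cm <- matrix(c(650, 120,
                70, 160),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n      # share of correct predictions
nir      <- max(colSums(cm)) / n   # always predict the most common actual class

# One-sided exact binomial test of H0: accuracy <= NIR
test <- binom.test(sum(diag(cm)), n, p = nir, alternative = "greater")

print(c(accuracy = accuracy, nir = nir, p_value = test$p.value))
```

With an imbalanced response, raw accuracy can look high on its own, so comparing it against the NIR shows how much the model adds over a trivial guess.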
b) SVM
![](https://static.wixstatic.com/media/c33de6_5a38d45bba8047ac8d732bdbd9af6144~mv2.png/v1/fill/w_685,h_813,al_c,q_90,enc_auto/c33de6_5a38d45bba8047ac8d732bdbd9af6144~mv2.png)
With a p-value of <2.2e-16 for Accuracy > NIR, the null hypothesis is rejected. The accuracy is therefore statistically significantly better than the No Information Rate.
c) Association Rules
![](https://static.wixstatic.com/media/c33de6_d6e098fdbcad4af5ac250f458da96c01~mv2.png/v1/fill/w_980,h_527,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_d6e098fdbcad4af5ac250f458da96c01~mv2.png)
![](https://static.wixstatic.com/media/c33de6_8f1c2a57ca6343628f1267dfa9334976~mv2.png/v1/fill/w_980,h_441,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_8f1c2a57ca6343628f1267dfa9334976~mv2.png)
There’s a 5.43438% chance that a customer has “Groups” as their Market Segment, “PRT” as their Country, and cancelled their hotel booking (the support of the rule).
If a customer has “Groups” as their Market Segment and “PRT” as their Country, there’s a 75.45927% chance the customer will cancel their hotel booking (the confidence of the rule).
There’s a 7.798303% chance that a customer has “Online TA” as their Market Segment, “PRT” as their Country, and cancelled their hotel booking.
If a customer has “Online TA” as their Market Segment and “PRT” as their Country, there’s a 48.15785% chance the customer will cancel their hotel booking.
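The two kinds of percentages above correspond to the support and the confidence of a rule. A minimal base R sketch of both measures follows, using made-up bookings; it does not use the arules package and will not reproduce the figures above.

```r
# Hypothetical bookings (made up for illustration)
segment  <- c("Groups", "Groups", "Groups", "Groups", "Online TA",
              "Online TA", "Direct", "Direct", "Groups", "Online TA")
country  <- c("PRT", "PRT", "PRT", "PRT", "PRT",
              "GBR", "PRT", "ESP", "GBR", "PRT")
canceled <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

lhs <- segment == "Groups" & country == "PRT"  # rule antecedent
rhs <- canceled == 1                           # rule consequent

# Support: share of all bookings that match both sides of the rule
support <- mean(lhs & rhs)

# Confidence: share of bookings matching the antecedent that also cancelled
confidence <- sum(lhs & rhs) / sum(lhs)

print(c(support = support, confidence = confidence))
```

Support tells how common the combination is in the whole dataset, while confidence tells how reliable the rule is once the antecedent holds, which is why the two percentages for the same rule differ so much.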
Conclusion
We started with all 20 variables and then reduced them to the 11 variables we thought would be important. After running CART, we identified the 6 most important variables and used them in our final model. In our introduction, we stated that the goal of our analysis is to answer two key questions:
Are Lead Time and Market Segment related to booking cancellation?
Can we predict with accuracy if a booking will be cancelled based on its attributes?
Based on our data analysis, these are the significant and potentially causal relationships that affect booking cancellation:
a) Lead Time
Based on our analysis, we learned that Lead Time is one of the most significant factors affecting cancellation. As lead time increases, customers are more likely to cancel. We observed this in the CART analysis.
We see that people book their reservations about 2 months in advance. If someone cancels their booking when there are fewer than 60 days (about 2 months) left before check-in, the hotel can charge them a convenience fee.
b) Market Segment
The linear model showed that Market Segment is a significant factor, and the CART analysis gave a similar result. The highest percentages of cancellations come from groups and travel agents. The hotel needs to rethink its terms with online travel agents and its penalty for group cancellations.
c) Deposit Type
We see that customers are not deterred by the monetary penalty when deciding to cancel. For hotels, making rooms non-refundable will not prevent cancellation.
d) Required Car Parking/Special Requests
We see that when customers reserve a parking space as part of their hotel reservation, the chance of them cancelling decreases. The same holds for special requests. Giving customers what they want lowers the cancellation rate.
e) Can we predict with accuracy if a booking will be cancelled based on its attributes?
Through our analysis, we think the Support Vector Machine model works well for predicting hotel cancellations. We tested two other models to see how accurate our prediction models are. The CART model gives us an accuracy of 81.08%, whereas the Support Vector Machine gives us the highest accuracy of 83.59%.
Recommendation
Determine a threshold number of days after which customers pay a convenience fee if the booking is cancelled.
Customers who booked 2 months or more in advance have less chance of cancellation.
Within the market segment category, cancellations come mostly from online travel agents. The hotel can change the terms and conditions it has with online agents to avoid further cancellations from them.
We recommend that the booking deposit depend on how many days in advance the booking is made.
Limitation
Lack of customer behavior data
From our analysis, we discovered that variables relating to the customer’s behavior, such as special requests, are significant predictors of booking cancellation. Unfortunately, our dataset offers few variables that relate to customer behavior. For example, we do not know whether the special requests made by the customer were fulfilled. However, we understand that such data is not readily available and might be difficult to monitor.
Future Studies
Include more customer behavior data
One recommendation that can improve our study is to source customer behavior data. Examples include data about customers’ browsing sessions, the type of special request, and whether that special request was fulfilled. With this additional data, we might understand the customer’s booking journey in greater detail and identify other good predictors of booking cancellation.
Cross reference results and findings from different hotels
To improve our model and increase its relevance to different hotels, one recommendation is to perform a similar analysis of data from other hotels and cross-reference the results. Doing this allows us to verify our insights.