Formula 1 sits at the top of motorsport, with revenue of $2.14 billion, and it draws many sponsors and media outlets into the sport. Being on a winning team brings substantial financial rewards.
![](https://static.wixstatic.com/media/c33de6_9a12c25c624141eba114f2595e27d666~mv2.jpg/v1/fill/w_980,h_653,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/c33de6_9a12c25c624141eba114f2595e27d666~mv2.jpg)
Many brands are associated with Formula 1. To begin with, the teams are brands themselves. Many teams are also car manufacturers and use Formula 1 to market themselves. In addition, the teams and Formula 1 itself have many sponsors, who pay a lot of money for their brands to be associated with the sport.
Sports and games are among the world’s most significant forms of entertainment, and they are also a source of income for those involved.
Drivers also have their own preferred brands, which supply them with racing gear. All of this shows how much money flows into Formula 1, and why it matters to predict the winner of each race and, from there, who will win the championship.
How to predict F1 winners?
To take all this information into account, I followed a step-wise procedure with the goal of predicting the top 5 finishers of each race.
1. Collecting Data:
For my data mining I found two great sources: the Kaggle data repository and the official Formula 1 website; they essentially have the same data but I used both for greater accuracy and completeness.
DataFrame_1 : Drivers
The Drivers data frame contains each driver’s details, such as forename, surname and nationality, along with a unique Driver_ID.
![](https://static.wixstatic.com/media/c33de6_0dc405955ea44e5bb1e779fa42f20495~mv2.png/v1/fill/w_980,h_739,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_0dc405955ea44e5bb1e779fa42f20495~mv2.png)
DataFrame_2 : Results
The Results data frame contains race result details such as grid_position, final_position, laps, points and fastestlap, among others.
![](https://static.wixstatic.com/media/c33de6_93c67146460e41dc892a8bab82a48de9~mv2.png/v1/fill/w_980,h_369,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_93c67146460e41dc892a8bab82a48de9~mv2.png)
DataFrame_3: Status
The Status data frame describes how each driver’s race ended, for example whether the driver finished or was disqualified, and covers 138 race-related statuses.
![](https://static.wixstatic.com/media/c33de6_32def76223e24a7a8394e60c33d93748~mv2.png/v1/fill/w_732,h_976,al_c,q_90,enc_auto/c33de6_32def76223e24a7a8394e60c33d93748~mv2.png)
DataFrame_4: Races Info
The Race_Info data frame contains each race’s details, such as the year in which it was held, the circuit name, the unique circuit ID and the date.
![](https://static.wixstatic.com/media/c33de6_538c81af264c42228170f62501db64aa~mv2.png/v1/fill/w_980,h_535,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_538c81af264c42228170f62501db64aa~mv2.png)
DataFrame_5 : Constructors
The Constructors data frame contains each constructor’s (company’s) details, such as company ID, reference name, name and nationality.
![](https://static.wixstatic.com/media/c33de6_62273edb41a94ffbb3942851843826da~mv2.png/v1/fill/w_980,h_361,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/c33de6_62273edb41a94ffbb3942851843826da~mv2.png)
DataFrame_6 : Driver Standings
The Driver_Standing data frame contains each driver’s standings, including raceID, DriverID and the final position for each race.
![](https://static.wixstatic.com/media/c33de6_9bed2ee1ac664055bcbb6a2e6b0f5e14~mv2.png/v1/fill/w_980,h_697,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_9bed2ee1ac664055bcbb6a2e6b0f5e14~mv2.png)
DataFrame_7: Pitstops
The Pitstops data frame contains every driver’s pit stops for each race and the lap on which each stop was made.
![](https://static.wixstatic.com/media/c33de6_9bed2ee1ac664055bcbb6a2e6b0f5e14~mv2.png/v1/fill/w_980,h_697,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_9bed2ee1ac664055bcbb6a2e6b0f5e14~mv2.png)
2. Preparing Data:
Now that we have all the data frames, we must merge them to build one combined record for each driver in each race.
Merging data
In the same way, we merged all the data into one data frame. The next question is what to do about the redundant and unwanted data in the merged result.
![](https://static.wixstatic.com/media/c33de6_a6f35c5337ea486db71615d2b087875c~mv2.png/v1/fill/w_980,h_590,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_a6f35c5337ea486db71615d2b087875c~mv2.png)
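As a rough sketch of the merge step, assuming pandas and key columns named `raceId` and `driverId` (the column names and the toy rows below are illustrative assumptions, not the actual schema of the data set), the tables can be joined onto the results one at a time:

```python
import pandas as pd

# Toy stand-ins for the real tables; names and values are made up for illustration.
drivers = pd.DataFrame({
    "driverId": [1, 2],
    "surname": ["Driver A", "Driver B"],
})
races = pd.DataFrame({"raceId": [100, 101], "year": [2021, 2021]})
results = pd.DataFrame({
    "raceId": [100, 100, 101, 101],
    "driverId": [1, 2, 1, 2],
    "grid": [1, 2, 2, 1],
    "position": [1, 2, 2, 1],
})

# Start from results (one row per driver per race) and attach the lookup tables.
# how="left" keeps every result row even if a lookup row is missing.
merged = (
    results
    .merge(races, on="raceId", how="left")
    .merge(drivers, on="driverId", how="left")
)
print(merged)
```

Starting from the results table keeps the row count at one record per driver per race, which is the shape needed later for modeling.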
Removing unwanted columns
Removing these columns makes the data clearer and easier to work with for modeling. However, we have not yet checked whether there are any missing values.
![](https://static.wixstatic.com/media/c33de6_e34cc12e013d4ca1a3aee5b8398e291e~mv2.png/v1/fill/w_980,h_632,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_e34cc12e013d4ca1a3aee5b8398e291e~mv2.png)
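A minimal sketch of dropping unwanted columns with pandas could look like the following. The column names (`url_x`, `url_y`, `driverRef`) are hypothetical examples of leftovers from a merge, not necessarily the ones removed in this project.

```python
import pandas as pd

# Toy merged frame; the extra columns are hypothetical merge leftovers.
merged = pd.DataFrame({
    "raceId": [1, 1],
    "driverId": [10, 11],
    "grid": [1, 2],
    "position": [2, 1],
    "url_x": ["u1", "u2"],
    "url_y": ["v1", "v2"],
    "driverRef": ["a", "b"],
})

unwanted = ["url_x", "url_y", "driverRef"]

# errors="ignore" avoids a KeyError if a listed column is not present.
trimmed = merged.drop(columns=unwanted, errors="ignore")
print(trimmed.columns.tolist())
```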
Removing null values
Removing null values is important because they can bias the data and affect the trained model.
![](https://static.wixstatic.com/media/c33de6_cd58d2044d0d4a0d92f02f26d812316e~mv2.png/v1/fill/w_980,h_153,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/c33de6_cd58d2044d0d4a0d92f02f26d812316e~mv2.png)
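The null check and removal can be sketched as below. This assumes that missing values may appear as a placeholder string (here `\N`, an assumption about how the source CSV files encode them) rather than as real nulls, so the placeholder is converted first.

```python
import numpy as np
import pandas as pd

# Toy frame; the "\N" placeholder for missing values is an assumption.
df = pd.DataFrame({
    "driverId": [1, 2, 3],
    "grid": [1, 2, 3],
    "position": ["1", r"\N", "3"],
})

# Placeholder strings are not counted as nulls until they are converted.
df = df.replace(r"\N", np.nan)
print(df.isna().sum())

# Drop rows with any missing value, then restore a numeric dtype.
clean = df.dropna().reset_index(drop=True)
clean["position"] = clean["position"].astype(int)
```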
3. Data Exploration
Data Volume
In total we have 25,420 records after combining all the data into one data frame, except for the pit stop data, which is incomplete because it was not recorded correctly.
Most drivers who raced before 2010 are no longer racing in Formula One, so to keep the analysis meaningful and reliable we will use Formula One data from 2010 to 2022.
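Restricting the data to the 2010 to 2022 seasons can be sketched with a simple filter, assuming the merged frame has a `year` column (the column name and the toy rows are assumptions):

```python
import pandas as pd

# Toy frame with one row per season of interest.
df = pd.DataFrame({
    "year": [1950, 2009, 2010, 2015, 2022],
    "raceId": [1, 2, 3, 4, 5],
})

# between() is inclusive on both ends by default.
recent = df[df["year"].between(2010, 2022)].reset_index(drop=True)
print(recent)
```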
Race analysis
The graph below shows how many races were held each year from 1950 to 2022. The number of races per year has increased significantly, which means that competition among the drivers has increased.
![](https://static.wixstatic.com/media/c33de6_7771abe9bdc645beb2d072853dc7d4ef~mv2.png/v1/fill/w_980,h_328,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/c33de6_7771abe9bdc645beb2d072853dc7d4ef~mv2.png)
Location Analysis
The data shows that as the number of races per year has increased, the number of locations hosting Formula One races has increased too. Most races are held in European countries. Surprisingly, there are 2 tracks in the United States this year, bringing the number of races per year to 23, up from 7 in 1950.
![](https://static.wixstatic.com/media/c33de6_514e2e5f3e984c19a188d2ffdb835a45~mv2.png/v1/fill/w_980,h_613,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_514e2e5f3e984c19a188d2ffdb835a45~mv2.png)
Winners Analysis
More than 1,000 races have been held so far, and counting. In total, 854 drivers have driven in Formula One races to date. Drivers with more experience have an advantage in races, as well as in being selected by constructors (companies). That means the drivers with the most wins (points) are in the strongest position for good results in races, which is why we look at who has won the most races in Formula One history.
![](https://static.wixstatic.com/media/c33de6_b11e7df877ac4b7f9d9b6eb1627f727e~mv2.png/v1/fill/w_980,h_655,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_b11e7df877ac4b7f9d9b6eb1627f727e~mv2.png)
Total Pole Positions
We also look at how successful drivers have been at taking pole position in races.
![](https://static.wixstatic.com/media/c33de6_e4f6ebb26c0644128fd8e68a895836ab~mv2.png/v1/fill/w_784,h_988,al_c,q_90,enc_auto/c33de6_e4f6ebb26c0644128fd8e68a895836ab~mv2.png)
Heat-map for features
Finding the correlation between the variables is important for filtering out unimportant features before training our models.
![](https://static.wixstatic.com/media/c33de6_8f627c0cc63b4f3ba84d4009c8524647~mv2.png/v1/fill/w_980,h_721,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_8f627c0cc63b4f3ba84d4009c8524647~mv2.png)
As we can see, there is high correlation between the following features:
ResultID - Year
Grid - Position
Driverstandingsid - Year
Driverstandingsid - raceID
ResultID - raceID
Year - Driverstandingsid
Age - point - results
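One way to list strongly correlated pairs without reading them off a heat-map is sketched below. The data is synthetic and the 0.8 cut-off is an arbitrary assumption. The column names only mimic the ones discussed here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
race_id = np.arange(n)

# Synthetic data: resultId grows with raceId, position loosely follows grid.
df = pd.DataFrame({
    "raceId": race_id,
    "resultId": race_id * 20 + rng.integers(0, 20, n),
    "grid": rng.integers(1, 21, n),
})
df["position"] = np.clip(df["grid"] + rng.integers(-3, 4, n), 1, 20)

corr = df.corr()

# Keep only the upper triangle so each pair appears once (no self-pairs).
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# The 0.8 threshold is an illustrative choice, not a rule.
strong = pairs[pairs.abs() > 0.8].sort_values(ascending=False)
print(strong)
```

Pairs such as an ID and the year are correlated mainly because IDs are assigned in order over time, so a high value here does not by itself make a feature useful for prediction.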
Skewness in features
![](https://static.wixstatic.com/media/c33de6_ff680d3dab6740a795984522beae0425~mv2.png/v1/fill/w_980,h_293,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/c33de6_ff680d3dab6740a795984522beae0425~mv2.png)
The result ID distribution is consistent, since races are held regularly and each result is given a unique ID, whereas the race ID distribution has multiple starting points, which shows that the race ID data is left-skewed.
![](https://static.wixstatic.com/media/c33de6_c8a5a663d17d4877b67e559207158490~mv2.png/v1/fill/w_980,h_280,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/c33de6_c8a5a663d17d4877b67e559207158490~mv2.png)
The driver ID distribution has multiple starting points, and the constructor ID, which represents the company’s ID, also has multiple starting points in its distribution.
![](https://static.wixstatic.com/media/c33de6_a591318a7bd54208acd3228a1913e661~mv2.png/v1/fill/w_980,h_284,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/c33de6_a591318a7bd54208acd3228a1913e661~mv2.png)
The grid distribution is consistent, since each race assigns unique grid positions, whereas the position distribution has very low density at 1st position, which shows that not many drivers have held 1st position for long.
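Skewness can also be checked numerically rather than only by eye. The sketch below uses pandas’ `skew()` on toy values (made up for illustration): a negative value indicates a left-skewed column and a positive value a right-skewed one.

```python
import pandas as pd

# Toy values: "position" has its tail toward low values,
# "points" has many zeros and a few large values.
df = pd.DataFrame({
    "position": [1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
    "points": [25, 0, 0, 0, 0, 0, 0, 1, 2, 18],
})

skew = df.skew()
print(skew)
```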
4. Choosing a model
Identifying the type of problem
Before selecting and training models, I need to figure out what kind of machine learning problem I am looking at: classification or regression. I need to choose wisely. If it is a regression problem, my predictions will be numbers that are highly correlated with each other. If I choose the wrong framing, my models will predict the results incorrectly, which would be bad for the teams that use this data.
Looking at the data and the problem, if we use a classification approach we might end up with the same result for two drivers, for example two race winners, because more than one driver can be assigned to a single position depending on the predicted probabilities. Because the algorithm does not know that I need exactly one winner per race, I created a separate scoring function for classification that ranks each driver’s probability of winning the race. I sort the probabilities from highest to lowest and map the driver with the highest probability as the race winner.
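The ranking idea described above can be sketched as follows. This is not the exact scoring function used in the project. It assumes a frame with one row per driver per race and a `win_proba` column holding the classifier’s predicted probability (all names and numbers are hypothetical).

```python
import pandas as pd

# Hypothetical predicted win probabilities for three drivers in two races.
preds = pd.DataFrame({
    "raceId": [1, 1, 1, 2, 2, 2],
    "driver": ["A", "B", "C", "A", "B", "C"],
    "win_proba": [0.55, 0.60, 0.10, 0.30, 0.20, 0.25],
})

# A plain 0.5 threshold would give race 1 two winners (A and B)
# and race 2 no winner at all.
naive_winners = preds[preds["win_proba"] > 0.5]

# Ranking inside each race guarantees exactly one driver per rank.
# method="first" breaks ties by row order.
preds["predicted_rank"] = (
    preds.groupby("raceId")["win_proba"]
    .rank(method="first", ascending=False)
    .astype(int)
)

winners = preds[preds["predicted_rank"] == 1].set_index("raceId")["driver"]
print(winners)
```

The same `predicted_rank` column can be cut at 5 to get a predicted top 5 for each race.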
5. Training the model
Before we train the model, we need to split the data into training, validation and testing sets. Validation data is needed for model comparison and parameter tuning. The final data set has 5,096 records: the first 3,565 records (2010-2018) are used for training, records 3,565 to 4,078 (2019) for validation, and the remaining records, 4,078 to 5,096, for testing.
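A chronological split like this one can be written by season instead of by row index, which avoids depending on how the rows are sorted. The sketch below assumes a `year` column and uses toy rows.

```python
import pandas as pd

# Toy frame; in the real data each row is one driver in one race.
df = pd.DataFrame({
    "year": [2010, 2014, 2018, 2019, 2019, 2020, 2022],
    "grid": [1, 5, 3, 2, 8, 4, 1],
})

train = df[df["year"] <= 2018]   # 2010-2018
valid = df[df["year"] == 2019]   # 2019
test = df[df["year"] >= 2020]    # remaining seasons

print(len(train), len(valid), len(test))
```

Splitting by time rather than at random keeps future races out of the training data, so the validation and test scores better reflect how the model would behave on upcoming seasons.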
6. Evaluating the model
Looking at the problem, I decided to select the following models for my predictions:
Logistic Regression
K-Neighbors Classifier
Naïve Bayes
Random Forest Classifier
Decision tree
All models were run with their base (default) parameters to see whether we could achieve good accuracy without parameter tuning.
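A baseline comparison of this kind can be sketched with scikit-learn as below. The data is synthetic (`make_classification`), `GaussianNB` is an assumption about which Naïve Bayes variant was used, and `random_state` is set only for reproducibility. Everything else is left at the library defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for "won the race / did not win".
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, y_train = X[:400], y[:400]
X_valid, y_valid = X[400:], y[400:]

models = {
    "Logistic Regression": LogisticRegression(),
    "K-Neighbors Classifier": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Random Forest Classifier": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_valid)
    scores[name] = {
        "accuracy": accuracy_score(y_valid, pred),
        "precision": precision_score(y_valid, pred, zero_division=0),
    }

for name, s in scores.items():
    print(f"{name}: accuracy={s['accuracy']:.3f}, precision={s['precision']:.3f}")
```

The numbers printed here come from synthetic data and say nothing about the real results. The point is only the shape of the comparison loop.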
![](https://static.wixstatic.com/media/c33de6_4e9595e078f340e68ed2cc5ab35ecc91~mv2.png/v1/fill/w_980,h_593,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_4e9595e078f340e68ed2cc5ab35ecc91~mv2.png)
As we can see, the Random Forest and Decision Tree classifiers have higher accuracy than the other models. I expected the logistic regression model to perform better than the others, as the data we are predicting is numerical and sequential, but surprisingly it did not perform well.
To check the precision of the models, we compare their predictions with the actual results.
![](https://static.wixstatic.com/media/c33de6_82ba2f95a64e45b89e5a8c7ff5b6188a~mv2.png/v1/fill/w_980,h_135,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/c33de6_82ba2f95a64e45b89e5a8c7ff5b6188a~mv2.png)
Based on the precision scores, there is room for improvement in Random Forest, so we will try parameter tuning.
7. Parameter tuning
Defining parameter ranges to search for the best parameters for the data set using Grid Search.
![](https://static.wixstatic.com/media/c33de6_a4ea3533bb7c4cdfb1c4dcca95bf168f~mv2.png/v1/fill/w_980,h_306,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/c33de6_a4ea3533bb7c4cdfb1c4dcca95bf168f~mv2.png)
After running Grid Search, I found the most suitable parameters for the model with this data.
![](https://static.wixstatic.com/media/c33de6_61fdce681edf414086bea596d6f5fa70~mv2.png/v1/fill/w_912,h_398,al_c,q_90,enc_auto/c33de6_61fdce681edf414086bea596d6f5fa70~mv2.png)
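A grid search over Random Forest parameters can be sketched with scikit-learn’s `GridSearchCV`. The parameter ranges, the `precision` scoring choice, the 3-fold setting and the synthetic data below are all illustrative assumptions, not the grid shown in the screenshots.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the training set.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical search ranges; every combination is tried.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="precision",  # tune for precision rather than accuracy
    cv=3,
)
search.fit(X, y)

print(search.best_params_)

# With the default refit=True, this is already retrained on all of X, y.
best_model = search.best_estimator_
```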
Moving forward, let’s train the model again with these parameters and get the predictions.
8. Making Predictions
Combining the predicted data with the actual data shows that my prediction model worked well at predicting the winners based on their performances. In the image below we can see that Stroll was predicted 1st, but the actual race results show that he didn’t finish the race, resulting in position 0 (crashed). If we remove him from the top, we have correct predictions for Hamilton, Bottas, Verstappen, Leclerc and Sainz.
![](https://static.wixstatic.com/media/c33de6_ca18456af854414a9a9bfea2f8a6f684~mv2.png/v1/fill/w_980,h_647,al_c,q_90,usm_0.66_1.00_0.01,enc_auto/c33de6_ca18456af854414a9a9bfea2f8a6f684~mv2.png)
Results
With an accuracy of 60%, we predicted the top 5 race finishers using machine learning, with the best predictions coming from random forest. Although the decision tree had the highest accuracy, we decided to go with random forest because it has both high accuracy and high precision.
Discussion
Setting aside the accident involving the top predicted driver, the results show that we need to examine more data from the races, such as weather, safety car and pit stop data, to predict more accurately for the upcoming season. Including whether the safety car came out, how many times, and how many laps were run under it will help me make the prediction of the Formula One race winner more accurate. I’m looking forward to including all these features to advance this project and my skills.