So, how do you go about generating insights from a dataset?
A big part of being a data scientist at QL2 is understanding and supporting the business by providing actionable market insights. To generate insights, a thorough study of the data—data cleaning, feature engineering, data sampling, statistical analysis, etc.—is necessary.
Today, we will explore and analyze the used car dataset we have at QL2. The used car market has seen a dramatic increase in price volatility over recent years. This makes for a very interesting dataset to analyze and generate insights from.
Key Business Questions
Initially, we start off by coming up with a list of questions that would help our business gain a deeper understanding of the trends in the used car market over the years.
Some questions we will answer using the data are:
- How have used car prices changed over the years? Has the difference in prices between old and new cars changed?
- Does mileage affect the selling price of cars differently before and after the pandemic?
- How has the time taken to sell a car changed over the years?
QL2 collects data from websites such as Car Max, Autotrader, etc., that are used to understand and quantify some of the dramatic shifts that have taken place in the US used car markets.
The data that QL2 has is vast:
- 1Tb+ data collected daily since 2015 with 1B+ records
- 800+ Zip codes
- 50M+ used cars
- 50+ manufacturers
So, when you have such a huge amount of data, how do you go about analyzing it, and more importantly generating insights from it?
Understanding the data
The first step when analyzing the data is to understand what fields or features are needed to answer our business questions. Although we collect 20+ features per car, all of these are not necessary for our analysis. Below are the fields we have decided to use for our analysis.
Choose Which Fields to Analyze
- Site: Site from which the data was scraped (e.g., Car Max, Autotrader)
- Year: Year when the car was manufactured
- Make Car companies, such as Toyota, Honda, Hyundai, Jeep, etc.
- Car model: Specific car models such as a Toyota Camry or Honda Accord.
- Mileage: Number of miles when the car was listed on the website
- Price: Asking price for the car
- VIN: Unique VIN for each car listed on the website
- QTS: Timestamp for when the data was scraped
Select the Data Grain
To simplify the analysis, we are going to pick a sample of the data based on car models and time frames. This is a crucial step in the analysis because all the insights are going to be based on this sample.
- Car models: Since this dataset is huge, we are going to limit it to the best-selling sedans in the US—Toyota Camry, Honda Accord, Hyundai Sonata, and Volkswagen Jetta. All of these currently retail from $20k-40k in the US markets.
- Time frame: This analysis includes cars sold after 2017 because we want to observe the trends in the car market in recent years
Clean the Data
The data is scraped off websites daily. The process by itself is not perfect and sometimes results in data that has missing fields, is inaccurate, has structural errors, or contains duplicates.
The following steps were taken to clean the data:
- Remove duplicate VINs.
- Exclude cars where the VIN is missing.
- Exclude cars where the price was > $40k, since all the cars retail for less than that.
- Exclude cars where price or mileage was missing.
Now, we will create some new features for detecting patterns and making the analysis more interpretable.
- Age of the car: We classify the cars as “old” (>3 years) or “new” (<3 years).
- Miles per year: This feature shows the number of miles covered by the car each year since it has been manufactured.
- Time to sell: Based on the first and last time a VIN is seen in the dataset, we can calculate the amount of time it took for a car to sell.
Here is how our final data looks:
Each row represents one car, its manufacturing date, mileage, price, date when it was posted for sale, date when it was sold, age of the car, time is taken to sell (in months), the year it was sold, and whether it was old (>3 years) or new (< 3 years).
Now that we have defined the scope of the analysis, prepared the data, and selected the representative sample to analyze, we will wrap up Part 1 of our analysis.
In Part 2 of this blog series, we will use the information above to generate insights and answer our business questions. Our next blog will detail the exploratory analysis to visually understand what’s happening in the used car market. Additionally, we will also perform a regression analysis to quantify the trends we see in the exploratory analysis.
For more information about our used car data, check out our webinar on QL2 Car Rental and Used Car Price Indices to learn about trends in both markets, their similarities and differences, and insight into whether prices have peaked or continued to grow.
Written by: Akshay Munde, Data Scientist