Multiple Linear Regression

Car price predictor based on a 3-million-row dataset

This dataset came from a free online source and contains over 3 million rows before cleaning and preprocessing. It provides many key features for determining price, such as year, make, model, mileage, color, and location. There were originally over 33 columns to choose from, but after eliminating redundant and less important columns I was left with only 12.

A lot of time was spent on cleaning and preprocessing the data, as well as on the EDA. I had to remove all the null values and outliers. A portion of the rows were also missing the mileage, so I used mean imputation, filling each missing value with the mean mileage of all cars from that same year. I also had to convert all the categorical data into
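
The year-based mileage imputation described above could be sketched roughly as follows. This is only an illustration: the column names `year` and `mileage` and the toy values are assumptions, not taken from the actual dataset or project code.

```python
import numpy as np
import pandas as pd

# Toy data; the column names are assumed, not the real dataset's.
cars = pd.DataFrame({
    "year": [2015, 2015, 2015, 2018, 2018],
    "mileage": [60000.0, np.nan, 80000.0, 30000.0, np.nan],
})

# Mean mileage of all cars sharing the same year, aligned to each row.
# Missing values are skipped when the mean is computed.
year_mean = cars.groupby("year")["mileage"].transform("mean")

# Fill only the missing entries; observed mileage values stay unchanged.
cars["mileage"] = cars["mileage"].fillna(year_mean)
```

Grouping by year rather than using one global mean keeps the filled values closer to what is typical for cars of that age.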

To prepare the data for training I used one-hot encoding, which lets the model learn categorical features without imposing an artificial ordering on the categories. Due to the large computational cost of one-hot encoding, I reduced the size of my dataset so that my local machine could train the model.
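
A minimal sketch of this preparation step, assuming pandas is used for the encoding. The column names, the toy rows, and the 50% sampling fraction are all hypothetical; the real project may have sampled or encoded differently.

```python
import pandas as pd

# Toy data; column names and values are assumed for illustration.
cars = pd.DataFrame({
    "make": ["Ford", "Toyota", "Ford", "Honda", "Toyota", "Honda"],
    "color": ["red", "blue", "blue", "red", "black", "black"],
    "mileage": [60000, 45000, 80000, 30000, 52000, 71000],
})

# Downsample before encoding so the wide encoded frame stays small.
# frac=0.5 is an arbitrary choice; random_state makes it reproducible.
sample = cars.sample(frac=0.5, random_state=0)

# One-hot encode: each categorical column becomes one indicator column
# per category present in the sample; numeric columns pass through.
encoded = pd.get_dummies(sample, columns=["make", "color"])
```

Each category value becomes its own column, so columns with many distinct values (such as model) widen the table quickly, which is why reducing the row count first helps on a local machine.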

Tools used

  • Python
  • Keras/TensorFlow
  • Pandas
  • Matplotlib
  • Seaborn
  • scikit-learn
  • NumPy

See code on GitHub