More data processing
Originally when we loaded the data, we had 378 feature vectors. After one hot encoding the 8 categorical columns, our feature space blew up to 515 features. We only have around 4,000 training samples, which means that with that many feature columns per sample it becomes easy to overfit. We want to keep only the features that give us a lot of information, throw away those that give us very little, and thereby reduce the noise in our data.

A very good tool for this is PCA, which stands for Principal Component Analysis. Under the hood it uses SVD (singular value decomposition), a matrix factorization technique. It factorizes the data matrix into three matrices and reads the principal components, the orthogonal directions along which the data varies the most, off the singular vectors of one of those matrices. Why does it choose the directions with the highest variance? If you are trying to project the information carried by 300 dimensions down to just 100, you want to keep the directions that lose the least information in the projection. A low variance direction carries very little information to begin with, so dropping it costs almost nothing.
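Here's a minimal sketch of this with sklearn's PCA. The matrix below is a made-up stand-in for our data (50 columns that are really just mixtures of 5 underlying factors), not the housing set itself; passing a float to n_components tells sklearn to keep however many components are needed to explain that fraction of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for our training matrix: 200 samples whose 50 columns
# are linear mixes of only 5 underlying factors, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

# Keep enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("variance explained:", pca.explained_variance_ratio_.sum())
```

Because the toy data only has 5 real degrees of freedom, PCA compresses the 50 columns down to a handful of components while keeping almost all of the variance.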
In the first part of this post, I talked about creating a correlation matrix to see linear relationships between features and the target vector. I mentioned how this matrix may not capture nonlinear relationships hiding in the data. We can try to explore nonlinear relationships with sklearn's PolynomialFeatures(). This will take the feature columns in the data and return all combinations of features with degree less than or equal to the degree you specify when initializing the PolynomialFeatures transformer. For example, if your data had two features, [a, b], and you specified degree=2, then the polynomial features are [1, a, b, a^2, ab, b^2]. As you can see, this will expand our feature space again, just like the Dummifier transformer did.
Trying More Algorithms
We tried a global learning algorithm; why don't we give some local algorithms a shot? Let's try a really good out of the box algorithm: Random Forest. A random forest trains many decision trees on random subsets of the data and averages the predictions of all the decision trees. It is a really good out of the box algorithm because it provides a good bias/variance balance: each individual tree overfits its subset (low bias, high variance), and averaging many such decorrelated trees acts as regularization, cancelling out much of that variance. Trying out just the default parameters on our barely preprocessed data gives us an r2 score of 0.42! That's a better start than the linear regression model.
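Getting that first default-parameter baseline takes only a few lines. A sketch with a synthetic dataset standing in for ours (so the score here won't be the 0.42 we saw on the real data):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data.
X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default parameters only -- this is the "out of the box" baseline.
forest = RandomForestRegressor(random_state=0)
forest.fit(X_train, y_train)

print("r2:", r2_score(y_test, forest.predict(X_test)))
```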
Gradient Boosting is another very popular algorithm (especially in the Kaggle community). Gradient Boosting is a boosting tree method that sequentially adds predictors to an ensemble. Each new predictor tries to correct its predecessors' mistakes by fitting to the residual errors, the part of the target the ensemble so far has failed to predict.
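To make the residual-fitting idea concrete, here is a hand-rolled two-stage sketch on made-up data (in practice you'd just use sklearn's GradientBoostingRegressor, which does this many times over with a learning rate):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Stage 1: fit a shallow tree to the target.
tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residuals = y - tree1.predict(X)

# Stage 2: fit a second tree to the first tree's residual errors.
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

# The ensemble prediction is the sum of the stages.
y_pred = tree1.predict(X) + tree2.predict(X)
print("stage 1 MSE:", np.mean((y - tree1.predict(X)) ** 2))
print("two-stage MSE:", np.mean((y - y_pred) ** 2))
```

The second stage only ever sees what the first stage got wrong, which is why each added predictor chips away at the remaining error.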
To spice it up with another linear model, we can also try ElasticNet. ElasticNet is a combination of Ridge Regression and Lasso Regression. Both Ridge and Lasso provide regularization of the model, unlike plain old Linear Regression. Ridge regression penalizes the model by adding a regularization term to the cost function based on the L2 norm of the weight vector, which lightly shrinks all the weights of the model. Lasso has the same behavior but uses the L1 norm of the weight vector instead. The L1 norm regularizes a bit more aggressively by sending unimportant weights all the way to zero. This helps with feature selection!
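A quick sketch of that zeroing-out behavior on synthetic data, where only 5 of 30 features actually matter (alpha and l1_ratio values here are illustrative; l1_ratio mixes the two penalties, 1.0 being pure Lasso and 0.0 pure Ridge):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Only 5 of the 30 features are informative; the rest are noise.
X, y = make_regression(n_samples=300, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

net = ElasticNet(alpha=5.0, l1_ratio=0.5).fit(X, y)

print((net.coef_ == 0).sum(), "of 30 weights driven to exactly zero")
```

The L1 part of the penalty is what produces those exact zeros; pure Ridge would only shrink the noise weights toward zero without eliminating them.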
We've played around with a couple of these models using the default parameters. When a model or collection of models looks promising, we can squeeze out better performance by finding the right hyperparameters. But testing random numbers by hand is absolute drudgery, and here is where grid search comes to the rescue with sklearn's GridSearchCV(). A grid search tries every combination of the parameter values you'd like to test, and performs cross validation on each candidate model to give reliable scores.
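A sketch with our random forest (the parameter values here are just examples):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# 3 * 2 = 6 combinations, each scored with 5-fold CV -> 30 fits total.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, None],
}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=5, scoring="r2")
search.fit(X, y)

print(search.best_params_)
print("best CV r2:", search.best_score_)
```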
GridSearchCV is great but it comes with some drawbacks. If you have two hyperparameters you'd like to tune, A: [1, 2, 3] and B: [4, 5], you have 3*2=6 models to train with these different parameters. But on top of that you'll want to do five-fold cross validation on each of those models, turning this into 6*5=30 models to train. That can take a while, especially if you have a couple of algorithms to try.
Randomized Grid Search
Randomized search is another flavor of grid search, available in sklearn as RandomizedSearchCV(). Instead of fixed lists, you specify ranges (or distributions) of hyperparameters you'd like to explore, along with the number of iterations to run; each iteration samples a random combination from those distributions. This lets it explore values in between the discrete points you'd have to enumerate by hand with vanilla grid search. Again, this is another technique for squeezing out better performance after you've already gotten all you can out of preprocessing the data.
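The same random forest search as before, but with distributions and a fixed budget of 10 samples (the distributions here are illustrative):

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Distributions instead of fixed lists: each iteration samples one combination.
param_dist = {
    "n_estimators": randint(50, 300),          # any integer in [50, 300)
    "max_features": uniform(0.1, 0.9),         # any fraction in [0.1, 1.0]
}
search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                            param_dist, n_iter=10, cv=5, random_state=0)
search.fit(X, y)

print(search.best_params_)
```

Unlike the grid, the cost here is fixed by n_iter no matter how wide the ranges are: 10 sampled combinations times 5 folds is 50 fits.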
Bonus: Ray for hyperparameter search
Ray is a new tool that was recently released by the RISE Lab at Berkeley. Its goal is to make jobs that were once hard to parallelize parallelizable, using a shared-memory object store and hierarchical scheduling to make machine learning experiments faster. You can use Ray to search through a large hyperparameter space a lot faster.