Machine Learning Pipelines and Search

Constructing a simple pipeline and searching hyperparameter space on a small Kaggle dataset


Check out my collection of ipython notebooks here! The mercedes_tutorial.ipynb notebook will have a lot of the code and explanations that will be expanded upon in this post.

The Data

Kaggle recently launched a new contest called Mercedes-Benz Greener Manufacturing. The competition is looking for a model to predict the time it takes to pass testing using Mercedes-Benz car features. If you log into Kaggle and download their dataset you'll be able to take a look at the data.

What does it look like?

At first glance, we can see that it is a small training set. Loaded into a pandas dataframe, it has only 4209 rows and 378 columns. We also know from the competition description that this will be a regression task.

Using a Jupyter notebook is a great way to explore your data on the fly. We can get a quick view of the dataframe by calling .head() once we've stored it to a name. We can see that the column names aren't very descriptive of the information they hold. This dataset provides anonymous features, which means we won't be able to use much domain knowledge to engineer new features. For this dataset, our game plan will be to explore the data, apply some basic preprocessing and cleaning, and try different regression algorithms, or a combination of them, to come up with a good model. If we were looking at the Instacart competition, we would want to spend a lot of time extracting valuable insights from the data and cooking up some good features before throwing the kitchen sink of machine learning algorithms at it, since its feature columns are described (and they provide a lot of interesting measurements, but that's for another post).

Calling the .get_dtype_counts() method gives us a quick peek at what kind of values are stored in the columns of the dataframe. (This method was removed in pandas 1.0; on newer versions, .dtypes.value_counts() gives the same information.) Notice that 370 columns hold numerical values and 8 columns hold object values (categorical values). This is good to keep in mind for later since we'll want to transform those categorical columns into numerical ones. Why, you ask? On a basic level, machine learning algorithms perform numerical operations on the data they are fed. Letters or categories are not very useful for calculations, so they need to be translated into numerical values first.

Moreover, we want those numerical representations of the categorical data to be booleans. Boolean values are very important here, and to explain why I'll use an example. Say I have a column with three values: yellow, green, blue. I can map yellow to the number 1, green to 2, and blue to 3. However, this poses a problem for the machine learning algorithm because it might interpret blue (3) as greater than yellow (1) because 3 > 1, when the colors have no such order. So instead, we'll expand the color column into three columns: yellow_column, green_column, and blue_column, each holding a 1 if the row contains that color and a 0 otherwise.
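To make the color example concrete, here is a minimal sketch using pandas. The toy dataframe and its "color" column are made up for illustration and are not part of the Mercedes data; dtype=int is passed so the indicators show up as 0/1 rather than True/False.

```python
import pandas as pd

# Toy frame standing in for a single categorical column (illustrative only).
colors = pd.DataFrame({"color": ["yellow", "green", "blue", "green"]})

# How many columns there are of each dtype.
dtype_counts = colors.dtypes.value_counts()
print(dtype_counts)

# Expand the one column into one 0/1 indicator column per category.
one_hot = pd.get_dummies(colors["color"], dtype=int)
print(one_hot)
```

Each row ends up with exactly one 1, so no color is "bigger" than another.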

Knowing what our target vector looks like is also really helpful. When we test new algorithms we'll want to know if our predictions make any sense. Hitting .describe() on the target vector, in this case the 'y' column, shows us some quick stats about those values.

Another handy tool is a correlation matrix. A correlation matrix can give us more information about the relationship between many columns at a time. This is a correlation matrix for the top 10 feature vectors most correlated with the values in the target vector. We can see that the most positively correlated vector is X314, with a correlation coefficient of 0.61. This means that column X314 and column y have a positive linear relationship, so we can count on column X314 giving us some information on how values in the "y" column are determined. A negative correlation coefficient would tell us that there is an inverse linear relationship between a feature in the dataframe and the "y" target vector (e.g. an increase in one value means a decrease in the other).

Don't be fooled by a 0 correlation coefficient and cast off a feature vector as irrelevant. A 0 correlation coefficient only tells us there is no linear relationship; there might still be a non-linear relationship between the two vectors measured. Here's a visual from Wikipedia to show what these relationships might look like graphically:

Notice that the graphs with a coefficient of 0 have plenty of structure to them, just not linear structure. For some of these graphs, applying a power, a square root, or some other operation to the feature vector could expose a linear relationship between the transformed feature and the "y" target vector.
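Here is a small sketch of both ideas on synthetic data (the column names "linear", "noise", and "squared" are made up for this example). The target depends entirely on one feature, yet their correlation is essentially 0 until the feature is squared. On the real dataframe, the same .corr()["y"] pattern followed by .head(10) would give a top-10 list.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 201)  # symmetric around 0

df = pd.DataFrame({
    "linear": x,
    "noise": rng.normal(size=x.size),
    "y": x ** 2,  # fully determined by x, but not linearly
})
# Transform the feature so the relationship with y becomes linear.
df["squared"] = df["linear"] ** 2

# Correlation of every feature with the target, strongest (by magnitude) first.
corr_with_y = df.corr()["y"].drop("y").sort_values(key=abs, ascending=False)
print(corr_with_y)
```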

The Pipeline

Let's make a basic pipeline to get the data prepared to experiment with some algorithms. First we'll write a function that will read in the data. I wrote it so that if I wanted, I could pass in a list of columns I wanted it to drop before saving the dataframe to a variable.
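A reader function along those lines might look like the sketch below. The name read_data, the drop_cols parameter, and the tiny in-memory CSV (with an "ID" column and invented values) are assumptions for illustration; the notebook's actual function may differ.

```python
import io
import pandas as pd

def read_data(path_or_buffer, drop_cols=None):
    """Read a CSV into a dataframe, optionally dropping some columns."""
    df = pd.read_csv(path_or_buffer)
    if drop_cols:
        df = df.drop(columns=drop_cols)
    return df

# Tiny in-memory stand-in for the Kaggle file, with made-up values.
sample_csv = io.StringIO("ID,y,X0,X10\n1,100.5,a,0\n2,92.3,b,1\n3,110.0,a,1\n")
train = read_data(sample_csv, drop_cols=["ID"])
print(train)
```

With the real data, you would pass the path of the downloaded training file instead of the StringIO buffer.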

Notice that I made a helper function to get the right columns for both our train set and our test set (a holdout set for testing). When the dataframe is split, the train set and the smaller test set may end up with different categorical values, so their one hot encoded columns may differ; the helper keeps the columns the same. We want both our training and test (validation) sets to have the same features and be transformed the same way, because an algorithm cannot use columns it was not trained on to predict a new value. I store these in the variable "columns" for use later. I also called scikit-learn's handy train_test_split() on the training set so that we can make a validation set quickly.
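One way to keep the columns consistent, sketched on a made-up dataframe: split with scikit-learn's train_test_split (from sklearn.model_selection), record the encoded training columns, and reindex the validation set to match. The reindex approach is an assumption about how such a helper could work, not necessarily what the notebook does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: one categorical feature, one numeric feature, one target.
df = pd.DataFrame({
    "X0": ["a", "b", "c", "a", "b", "d", "a", "c"],
    "X10": [0, 1, 0, 1, 1, 0, 0, 1],
    "y": [90.1, 101.3, 88.7, 95.0, 110.2, 99.9, 92.4, 87.5],
})

X = df.drop(columns="y")
y = df["y"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Encode the training set and remember which columns it produced.
train_dummies = pd.get_dummies(X_train, dtype=int)
columns = train_dummies.columns

# Encode the validation set, then force it onto the training columns:
# categories unseen in training are dropped, missing ones are filled with 0.
val_dummies = pd.get_dummies(X_val, dtype=int).reindex(columns=columns, fill_value=0)
print(val_dummies)
```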

Now that we have the data available for us to play with, we can process it how we please. We'll start by making those one hot encodings we were talking about earlier on the 8 categorical columns in the data. I use pandas' get_dummies() function to generate these values. get_dummies() will expand the matrix and add a column for each category. Each row will be filled with a 1 if it had that particular category and a 0 otherwise. Let's look at an example:

Here's one of the 8 columns that contain categorical values. When we call pd.get_dummies(cat_col), pandas generates a column for each category and fills it with 0s or 1s accordingly. You can see that the column has exploded into 47 columns!

The Significance of Feature Scaling

Now that we have encoded all the categorical values into numerical representations, we should take a look at the scale of the data. Certain algorithms are sensitive to the scale of the data they are fed, and skipping this step can lead to some bad calculations. SVM, KNN, and regularized logistic regression are examples of algorithms that work best with normalized data. Take KNN, for example. This is a local algorithm that uses distance to determine boundaries between instances. If the features are on different scales, the features with the largest ranges dominate the distance between points, so KNN struggles to measure distance meaningfully unless the features are put on a common, unitless scale. PCA is another method that is sensitive to scale: it finds the principal components of a dataset by looking for the directions with the highest variance. If, say, one feature is a measurement in inches and another is a measurement in kilometers, PCA may pick out whichever feature has the larger raw numbers as having the most variance, which might not be the case once the data has been normalized. Since we'll experiment with some of these algorithms, we'll make sure to scale the data using scikit-learn's StandardScaler().
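As a quick illustration, here is StandardScaler applied to two made-up features on very different scales (think inches and kilometers). After scaling, each column has mean 0 and standard deviation 1, so neither dominates a distance or variance calculation just because of its units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
X = np.array([
    [60.0, 0.001],
    [72.0, 0.003],
    [66.0, 0.002],
    [78.0, 0.004],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # learn mean/std per column, then rescale

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

In practice the scaler should be fit on the training set only and then applied to the validation set with transform().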

We should check if we have any missing values. Machine learning algorithms don't like NaNs any more than they like categorical data. We can choose to remove columns that show too many missing values, remove rows that have missing values, or impute the missing values using some heuristic. Whenever we can, we prefer to keep as much data as possible without introducing too much noise, so we'd go with an imputer. Let's check whether we have any NaNs:

Looks like we got real lucky: no need to impute missing values! You are quite likely to run into missing values when working with more data in the wild. There are different techniques for handling them, ranging from simple ones like imputing the average or the median to more involved ones such as using a machine learning algorithm to learn the most likely value for the missing slots.
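For datasets that are less lucky, a minimal sketch of the check and of median imputation could look like this. The small dataframe is made up, and SimpleImputer from sklearn.impute is one of several reasonable choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frame with one missing value in column "a".
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Count missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Replace each NaN with the median of its column.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```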

Putting it together

We've gone over three potential transformations we would like to perform on the data: one hot encoding, normalizing, and imputing missing values. Remember, this is only a baseline of what you can do to ready the data. We'll probably need to perform more transformations to get a better r2 score as we play with algorithms. We'll make a pipeline that creates one hot encodings and normalizes the data for now and skip the imputer since there are no missing values here. Scikit-learn conveniently provides tools that allow us to create our own transformers so that they work nicely with its other functionality (like scikit-learn's pipeline methods). We do this by importing BaseEstimator and TransformerMixin from sklearn.base. Inheriting from TransformerMixin marks a class as a transformer: we write fit() and transform() ourselves and get fit_transform() for free. BaseEstimator gives us two extra methods, get_params() and set_params(), so that we can tune hyperparameters later.
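Here is a hedged sketch of what two such transformers could look like. The class names ColumnSelector and Dummifier, and the toy dataframe, are hypothetical and not taken from the notebook. Note that the mixin in sklearn.base is spelled TransformerMixin, and that it is listed before BaseEstimator in the class definition.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(TransformerMixin, BaseEstimator):
    """Keep only the requested columns (hypothetical helper)."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X if self.columns is None else X[self.columns]

class Dummifier(TransformerMixin, BaseEstimator):
    """One hot encode and remember the training columns (hypothetical helper)."""

    def fit(self, X, y=None):
        self.columns_ = pd.get_dummies(X, dtype=int).columns
        return self

    def transform(self, X):
        encoded = pd.get_dummies(X, dtype=int)
        return encoded.reindex(columns=self.columns_, fill_value=0)

df = pd.DataFrame({"X0": ["a", "b", "a"], "X10": [0, 1, 1], "ID": [1, 2, 3]})
selected = ColumnSelector(columns=["X0", "X10"]).fit_transform(df)
encoded = Dummifier().fit_transform(selected)
print(encoded)
```

fit_transform() comes from TransformerMixin, while get_params() and set_params() come from BaseEstimator.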

We can put these classes to work by integrating them into a pipeline. Scikit-learn's Pipeline lets us transform the data in one clean sweep. Pipeline requires every transformer step to have both a fit() and a transform() method in order to be able to fit_transform() the data (only the final step may be a plain estimator with just fit()). This is why we have both of these methods defined in the classes we built to select the data and dummify the data.
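A minimal sketch of such a pipeline follows, chaining a hypothetical Dummifier transformer (an illustrative stand-in for the notebook's class) with StandardScaler. The step names and toy dataframes are assumptions made for the example.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class Dummifier(TransformerMixin, BaseEstimator):
    """One hot encode and remember the training columns (hypothetical helper)."""

    def fit(self, X, y=None):
        self.columns_ = pd.get_dummies(X, dtype=int).columns
        return self

    def transform(self, X):
        encoded = pd.get_dummies(X, dtype=int)
        return encoded.reindex(columns=self.columns_, fill_value=0)

prep = Pipeline([
    ("dummify", Dummifier()),
    ("scale", StandardScaler()),
])

train = pd.DataFrame({"X0": ["a", "b", "c", "a"], "X10": [0, 1, 1, 0]})
val = pd.DataFrame({"X0": ["b", "z"], "X10": [1, 0]})  # "z" never seen in training

train_ready = prep.fit_transform(train)  # fit every step, then transform
val_ready = prep.transform(val)          # reuse what was learned on train
print(train_ready.shape, val_ready.shape)
```

Because the pipeline is fit only on the training set, the validation set gets the same columns and the same scaling.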

The Algorithms

We have preprocessed our data and now we can get down to playing with some algorithms. Let's start with a simple algorithm that has been around in statistics for a while now: Linear Regression. Linear Regression is a global algorithm, meaning that all the training instances dictate how the model will be determined (unlike KNN, which is a local algorithm). It attempts to fit a line (or, with several features, a plane) constructed from a weighted sum of the input features plus a bias term (the y-intercept) that best fits the training instances to predict a target. The line gets adjusted to the training instances by minimizing a cost function: here the mean squared error between the predictions and the true targets. A cost function measures how far off the model's predictions are; it is not the same thing as a penalty on the model's feature weights, which is what regularization adds. (The coefficient of determination, the r2 score, is what we'll use to evaluate the fit afterwards.) A popular way to find the minimum of this cost function is Gradient Descent, although for plain linear regression the minimum can also be computed directly with a least-squares solver.
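To see the cost function at work, here is a small sketch on synthetic data (the true weights 3, -2 and bias 5 are made up). It fits scikit-learn's LinearRegression and checks that nudging the fitted weights only makes the mean squared error worse. The mse helper is defined here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
# Synthetic target: weighted sum of the features plus a bias and a little noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)

def mse(weights, bias):
    """Mean squared error of the line defined by weights and bias."""
    return np.mean((X @ weights + bias - y) ** 2)

fitted_cost = mse(model.coef_, model.intercept_)
nudged_cost = mse(model.coef_ + 0.1, model.intercept_)  # move away from the fit

print(model.coef_, model.intercept_)
print(fitted_cost, nudged_cost)
```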

By calculating the r2 score of our predictions we can see that this model sucks... a lot! A score of 1 is perfect. A score of 0 is what a constant model that always predicts the expected value of y would get, which is not so good. A negative score means the model does worse than just predicting the expected value of y, and there is no lower bound on how much worse it can be.
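The three cases are easy to reproduce with scikit-learn's r2_score on a few made-up target values (these numbers are illustrative, not from the competition data):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([90.0, 100.0, 110.0, 120.0])

# Perfect predictions.
perfect = r2_score(y_true, y_true)

# Always predicting the mean of y.
mean_pred = r2_score(y_true, np.full_like(y_true, y_true.mean()))

# Predictions that trend the wrong way: worse than predicting the mean.
bad = r2_score(y_true, y_true[::-1])

print(perfect, mean_pred, bad)  # 1.0, 0.0, -3.0
```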

Let's not get too discouraged. We did very bare processing of the data before feeding it to the algorithm and there are plenty of techniques to try. There are also plenty of other algorithms, some that have many hyperparameters to play with, to train a better model. We will explore more algorithms and new processing techniques next.