Go get the library from my GitHub!
Cleaning, transforming, splitting, oh my!
Starting to learn machine learning is tough. There are many steps involved in making a quality model, and for a noob, trying to wrap your head around it all can be daunting (it was for me!). I built this little library to modularize the process and make assembling a script a snap.
These are the modules I have built so far:
The transformers.py module has a variety of transformers built with scikit-learn's TransformerMixin. This mixin lets us build our own transformers that can be fed into a scikit-learn Pipeline(). The dataprep.py module uses them to load data and transform it. You can also import them on their own to transform any data on the fly.
DataFrameImputer(): fills in missing values in a pandas DataFrame with the most frequent value for categorical columns and the mean or median for numerical columns.
FactorFeatures(): multiplies specified columns to create quadratic interaction features.
Dummifier(): creates one-hot encodings for the categorical columns of a pandas DataFrame.
DataFrameSelector(): takes a pandas DataFrame and transforms it into a NumPy array.
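To give a feel for how transformers like these are put together, here is a rough sketch built on scikit-learn's BaseEstimator and TransformerMixin. It is not the library's actual code: the class names SketchImputer and SketchInteractions, the strategy and pairs parameters, and the toy data are all made up for illustration. They only mimic the behavior described for DataFrameImputer() and FactorFeatures().

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SketchImputer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for DataFrameImputer(); not the library's code."""

    def __init__(self, strategy="median"):
        self.strategy = strategy  # "mean" or "median" for numeric columns

    def fit(self, X, y=None):
        # Learn one fill value per column from the data passed to fit().
        self.fill_ = {}
        for col in X.columns:
            if pd.api.types.is_numeric_dtype(X[col]):
                use_mean = self.strategy == "mean"
                self.fill_[col] = X[col].mean() if use_mean else X[col].median()
            else:
                # Most frequent value for categorical columns.
                self.fill_[col] = X[col].mode().iloc[0]
        return self

    def transform(self, X):
        # fillna() with a dict fills each column with its own value.
        return X.fillna(value=self.fill_)


class SketchInteractions(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for FactorFeatures(); `pairs` is an assumed interface."""

    def __init__(self, pairs):
        self.pairs = pairs  # list of (column_a, column_b) tuples to multiply

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        for a, b in self.pairs:
            X[f"{a}*{b}"] = X[a] * X[b]
        return X


# Toy data, invented for this example.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "rooms": [1.0, 2.0, 3.0],
    "city": ["NY", "NY", None],
})

imputed = SketchImputer(strategy="mean").fit_transform(df)
with_products = SketchInteractions(pairs=[("age", "rooms")]).fit_transform(imputed)
print(with_products)
```

Because both classes implement fit() and transform(), TransformerMixin supplies fit_transform() for free, which is what lets transformers like these slot into a Pipeline().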
The dataprep.py module contains classes to load and transform train and test data and create various pipelines.
Pipes(): creates various pipelines that use transformers from the transformers.py module to transform data.
DataPrep(): loads train and test data and transforms it using the selected Pipes() method. It returns a training set, a validation set, and a test set.
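The load, transform, and split flow described above can be sketched roughly like this. Again, this is only an illustration under my own assumptions, not how Pipes() or DataPrep() are actually written: the prep() function, the MedianFill and ToArray classes, the val_size and seed parameters, and the toy data are all hypothetical. The one design point worth noting is that the pipeline is fitted on the training split only, then reused on the validation and test data.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


class MedianFill(BaseEstimator, TransformerMixin):
    """Hypothetical imputer: fills numeric NaNs with medians learned in fit()."""

    def fit(self, X, y=None):
        self.medians_ = X.median(numeric_only=True)
        return self

    def transform(self, X):
        return X.fillna(self.medians_)


class ToArray(BaseEstimator, TransformerMixin):
    """Hypothetical selector: keeps the given columns, returns a NumPy array."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        self.columns_ = list(self.columns)  # fitted attribute marks the step as fitted
        return self

    def transform(self, X):
        return X[self.columns_].to_numpy()


def prep(train, test, target, val_size=0.25, seed=0):
    """Split train into train/validation, then transform all three sets."""
    features = [c for c in train.columns if c != target]
    pipe = Pipeline([("fill", MedianFill()), ("select", ToArray(features))])

    X_tr, X_va, y_tr, y_va = train_test_split(
        train[features], train[target], test_size=val_size, random_state=seed
    )
    X_tr = pipe.fit_transform(X_tr)       # fit on the training split only
    X_va = pipe.transform(X_va)           # reuse the fitted pipeline
    X_te = pipe.transform(test[features])
    return (X_tr, y_tr), (X_va, y_va), X_te


# Toy data, invented for this example.
train = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [0.5, 1.5, 2.5, 3.5, np.nan, 5.5, 6.5, 7.5],
    "y": [0, 1, 0, 1, 0, 1, 0, 1],
})
test = pd.DataFrame({"x1": [np.nan, 3.0], "x2": [1.0, np.nan]})

(X_train, y_train), (X_val, y_val), X_test = prep(train, test, target="y")
print(X_train.shape, X_val.shape, X_test.shape)
```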
The helpers.py module contains various helper functions used in the transformers.py and the dataprep.py modules. You can load these individually to perform small tasks.
This library is just beginning to grow. After some more studying I'll be filling it up with even more useful tools. Feel free to clone the repo and add your own functions and classes!