August 24, 2020
George is a bloke who likes writing (both for computers and humans) and overall enjoys problem-solving, learning about the universe and being a skeptic git. He’s mainly worked as a jack-of-all-trades ML/devops/backend dev guy in various startups dealing with challenges around processing and analyzing large volumes of data.
Let’s take a look at how to run a dataset through MindsDB end-to-end, including the minimal data pre-processing you might need (which MindsDB does not do for you) and the standard way to evaluate a machine learning model’s performance.
For the purposes of this article I will be using a “standard” dataset called “German Credit”. The purpose of this dataset is to predict whether someone’s credit class is either good or bad based on 20 attributes such as: installment rate, job, purpose and credit history.
The dataset can be downloaded from here.
First, let’s download it and create the following directory hierarchy for the project:
Next, we’ll install mindsdb and scipy: `pip install --user mindsdb scipy` (note: you need to use Python 3’s pip, which might be aliased as `pip3` on some operating systems).
Next, we’ll have to process the data. We’ll do this inside a file called `pre_processing.py`. There are a few steps we need to take:
First, we convert the ARFF file into a pandas DataFrame, because a DataFrame is easier to work with in Python than ARFF files.
Use scipy’s `loadarff` to load the data into a tuple. The first member holds the rows (as a numpy record array) and the second member is an object containing metadata such as the column names.
Next, iterate through the rows to do some cleanup where necessary. In this case `loadarff` loads string columns as numpy bytes objects rather than Python strings and leaves them wrapped in `'`, so we’ll have to decode them to strings and remove the surrounding `'` to get a better representation of the original data.
Finally, using the column names, we’ll turn our dataset into a pandas DataFrame.
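The cleanup step can be sketched with the standard library alone. This is a minimal illustration rather than the article’s actual `pre_processing.py`: the helper names `clean_value` and `clean_rows` and the sample rows are hypothetical, and the sample values are only assumed to resemble what `loadarff` returns for this dataset.

```python
def clean_value(value):
    """Decode bytes to str and drop surrounding single quotes.

    Non-string values (e.g. floats) are returned unchanged.
    """
    if isinstance(value, bytes):
        value = value.decode("utf-8")
    if isinstance(value, str):
        value = value.strip("'")
    return value


def clean_rows(rows):
    """Apply clean_value to every cell of every row."""
    return [[clean_value(cell) for cell in row] for row in rows]


# Hypothetical sample rows: bytes for string columns, floats for numeric ones.
sample_rows = [
    (b"'<0'", 6.0, b"'critical/other existing credit'", b"good"),
    (b"'0<=X<200'", 48.0, b"'existing paid'", b"bad"),
]

cleaned = clean_rows(sample_rows)
for row in cleaned:
    print(row)
```

The cleaned rows, together with the column names taken from the metadata object, can then be passed to `pandas.DataFrame(cleaned, columns=...)`.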
It’s good practice to split a dataset into two parts: one is used for training our machine learning model (in this case the one built by MindsDB) and the other is used to test its accuracy. We call these the “train” and “test” datasets.
A good train/test split could be something like 80/20, which I’ll be doing here. Generally, the more data we feed the model, the better its accuracy will be, but we want to be left with a significant amount of data to test our model. What “significant” means depends on the specific domain you work in, the problem and the size of your data.
Before doing that, we’ll shuffle the data around. This is not necessary for this particular example, but it’s a good general practice since otherwise the ordering of your data might result in an uneven split of certain features between the training and testing datasets.
*This is not the case with certain time-dependent datasets where it can be ideal to train on older data and test on newer data*
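As a rough sketch of the shuffle-then-split idea, here is a version that uses only Python’s standard library. The function name `shuffle_and_split`, the fixed seed and the stand-in rows are assumptions made for illustration; the original script works on a pandas DataFrame instead.

```python
import random


def shuffle_and_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows with a fixed seed, then cut it into (train, test)."""
    shuffled = list(rows)                  # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    cutoff = int(len(shuffled) * train_fraction)
    return shuffled[:cutoff], shuffled[cutoff:]


# Stand-in data: integers instead of real dataset rows.
rows = list(range(1000))
train, test = shuffle_and_split(rows)
print(len(train), len(test))
```

Fixing the seed is a design choice: it keeps the split identical between runs, which makes accuracy numbers comparable when you retrain.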
Now try running `pre_processing.py`. If all goes well, you should see the two new csv files in your `processed_data` directory.
Next, we’ll create a file called `train.py`, in which we’ll add the MindsDB training code:
That’s it. That’s all you need to train a MindsDB model. The only required arguments are `to_predict`, which indicates the name of the column to be predicted (or key, in case you are using a JSON file) and `from_data`, which indicates the location of the data. By default `from_data` can be a structured data file (xlsx, json, csv, tsv… etc) or a pandas DataFrame; however, MindsDB also supports advanced data sources that can get data from stores such as S3, MariaDB, MySQL or Postgres. More on that here.
To understand the optional arguments, it might be helpful to understand the “phases” through which MindsDB goes:
Let’s look at the optional arguments passed here since you might find yourself using them rather often:
Finally, we’ll add some evaluation code to `train.py`, which we use to figure out how well the model is actually performing:
First, tell MindsDB to predict for our testing dataset and extract the predicted values from the result:
Second, get a list of the “real” values and compare the two using a balanced accuracy score. This way the accuracy is not computed as the overall fraction of correct predictions, but rather as the average of the accuracy for predicting a good credit class and the accuracy for predicting a bad credit class, so the more frequent class cannot dominate the result.
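To make the metric concrete, here is a small pure-Python sketch. It uses the common definition of balanced accuracy as the mean of per-class recall (which is what scikit-learn’s `balanced_accuracy_score` computes). The `balanced_accuracy` helper and the toy labels are made up for illustration and are not MindsDB output.

```python
def balanced_accuracy(real, predicted):
    """Mean of per-class recall over the classes present in `real`."""
    recalls = []
    for cls in sorted(set(real)):
        total = sum(1 for r in real if r == cls)
        correct = sum(1 for r, p in zip(real, predicted) if r == cls and p == cls)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


# Toy labels: 4 "good" and 2 "bad" examples.
real = ["good", "good", "good", "good", "bad", "bad"]
predicted = ["good", "good", "good", "bad", "bad", "good"]

score = balanced_accuracy(real, predicted)
plain = sum(r == p for r, p in zip(real, predicted)) / len(real)
print(score, plain)
```

In this toy case the recall is 3/4 for “good” and 1/2 for “bad”, so the balanced score is 0.625, while the plain accuracy is 4/6. The gap shows how an imbalanced class distribution can flatter the plain number.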
If we want to better visualize the output, we can also print a confusion matrix:
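A confusion matrix simply counts every (real, predicted) pair. The sketch below builds one with the standard library; the `confusion_matrix` helper, the label order and the toy data are assumptions for illustration only.

```python
from collections import Counter


def confusion_matrix(real, predicted, labels):
    """Rows are real classes, columns are predicted classes, both in `labels` order."""
    counts = Counter(zip(real, predicted))
    return [[counts[(r, p)] for p in labels] for r in labels]


labels = ["good", "bad"]
real = ["good", "good", "good", "good", "bad", "bad"]
predicted = ["good", "good", "good", "bad", "bad", "good"]

matrix = confusion_matrix(real, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>5}", row)
```

The diagonal holds the correct predictions; the off-diagonal cells show which class gets mistaken for which.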
Now run `train.py` and see how it works for yourself. If something breaks, try retracing your steps through this article. If it still doesn’t work, feel free to report it on our GitHub project.