Friday, January 8, 2021

How to Classify Data with Machine Learning

 

Figure 1: Decision tree graph built with classification data linked below.


For new programmers, building a machine learning model can be an intimidating task.  It requires basic programming knowledge, access to a dataset to train a model on, and an understanding of how to implement the model itself.

Like many problems, machine learning becomes simpler when it is broken down into small pieces.  The data science website Kaggle does a good job of helping developers learn this way.  Kaggle offers free access to hundreds of open datasets, as well as hosted notebooks for exploring that data and making predictions from it.  As a result, developers spend less time installing dependencies or hunting for data, and more time accomplishing their goals.


To demonstrate some basic machine learning techniques, I created a Python notebook in Kaggle from the dataset "Mushroom Classification".  A Python notebook is like a Word document that also lets a developer run code in small chunks.  Notebooks are a great tool not only for splitting up large tasks, but also for conveying the results of a program to non-technical people.  As an example of classifying data, I created a notebook guide linked here on GitHub.  To run the notebook, either download the dependencies and dataset from Kaggle, or create a notebook in Kaggle from the first link.
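To give a feel for how a notebook splits work into small chunks, here is a minimal sketch written as two "cells".  It is not taken from the linked guide.  The inline sample rows are invented for illustration, and the column names and single-letter codes (for example `p` for poisonous and `e` for edible) are my assumption about how the Kaggle file is laid out, so check them against the real data.

```python
import csv
import io
from collections import Counter

# --- Cell 1: load the data ---
# Hypothetical sample standing in for the downloaded CSV file.
# Column names and letter codes are assumptions, not copied from Kaggle.
SAMPLE_CSV = """class,odor,cap-shape
p,f,x
e,n,x
e,a,b
p,f,f
e,n,f
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# --- Cell 2: inspect the data ---
# Count how many rows carry each label before building any model.
class_counts = Counter(row["class"] for row in rows)
print(class_counts)
```

In a real notebook, each cell can be re-run on its own, so the slow loading step only has to happen once while the inspection step is tweaked.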

Figure 2: A graph using SHAP, a strategy to analyze the thought process of an AI.

In the Mushroom Classification dataset, the authors collected information about thousands of mushrooms, such as size and shape, with the goal of using these features to identify the poisonous ones.  The guide linked on GitHub shows examples of how to write several machine learning models, as well as how to display results and minimize bias.
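As a rough idea of what one such model can look like, the sketch below trains a decision tree with scikit-learn.  It is not the code from the linked guide.  The six rows, the two feature names (odor and cap shape), and the labels are all made up for illustration, so the result says nothing about real mushrooms.

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy rows in the form [odor, cap_shape].
# Invented for illustration; not taken from the Kaggle dataset.
X_raw = [
    ["foul", "convex"],
    ["foul", "flat"],
    ["none", "convex"],
    ["none", "flat"],
    ["almond", "bell"],
    ["foul", "bell"],
]
y = ["poisonous", "poisonous", "edible", "edible", "edible", "poisonous"]

# Decision trees in scikit-learn need numeric input, so each text
# category is turned into its own 0/1 column (one-hot encoding).
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(X_raw)

# Fit the tree; random_state keeps the result repeatable.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Classify two mushrooms the model has not seen in this exact form.
new_mushrooms = [["foul", "convex"], ["none", "bell"]]
predictions = model.predict(encoder.transform(new_mushrooms))
print(list(predictions))
```

With a real dataset, some rows should be held back for testing (for example with `train_test_split`) so that the accuracy is measured on data the model did not train on.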

The linked guide also gives examples of how to explain the results of different AI methods using techniques such as SHAP.
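SHAP builds on Shapley values, which split a prediction into a contribution from each feature.  The sketch below computes exact Shapley values by brute force for a tiny hand-written scoring function.  It does not use the `shap` library and is not the code from the linked guide.  The scoring function, the feature names, and the baseline mushroom are all hypothetical.

```python
from itertools import combinations
from math import factorial


def toy_model(features):
    """Hypothetical risk score standing in for a trained classifier."""
    score = 0.0
    if features["odor"] == "foul":
        score += 0.6
    if features["spore_print"] == "green":
        score += 0.3
    if features["odor"] == "foul" and features["spore_print"] == "green":
        score += 0.1  # extra risk when both warning signs appear together
    return score


def shapley_values(model, instance, baseline):
    """Exact Shapley values; only practical for a handful of features."""
    names = list(instance)
    n = len(names)

    def value(subset):
        # Features in `subset` take the instance's value, the rest the baseline's.
        mixed = {k: (instance[k] if k in subset else baseline[k]) for k in names}
        return model(mixed)

    result = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(len(others) + 1):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {name})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        result[name] = total
    return result


instance = {"odor": "foul", "spore_print": "green", "cap_shape": "flat"}
baseline = {"odor": "none", "spore_print": "white", "cap_shape": "convex"}
contributions = shapley_values(toy_model, instance, baseline)
print(contributions)
```

The contributions add up to the difference between the score for this mushroom and the score for the baseline, which is what makes this kind of explanation easy to read.  The `shap` library approximates the same idea efficiently for real models.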