Do you have a project idea but you don’t know where to start? Or maybe you have a dataset and want to build a machine learning model, but you’re not sure how to approach it?
In this article, I’m going to walk through a conceptual framework that you can use to approach any machine learning project. It is inspired by the idea of a theoretical framework and closely resembles the variations of the machine learning life cycle that you may see online.
So why is a framework important?
A framework in machine learning is important for a number of reasons:
- It creates a standardized process to help guide one’s data analysis and modeling.
- It allows others to understand how a problem was approached and to revisit or fix older projects.
- It forces one to think more deeply about the problem they are trying to solve, including which variable will be measured, what the limitations are, and which problems might arise.
- It encourages one to be more thorough in their work, increasing the legitimacy of the findings and/or end result.
With these points in mind, let’s talk about the framework!
The Machine Learning Life Cycle
While there are many variations of the machine learning life cycle, all of them have four general buckets of steps: planning, data, modeling, and production.
The first bucket is planning. Before you start any machine learning project, there are a number of things that you need to plan, and the term ‘plan’ here encompasses several tasks. By completing this step, you’ll develop a better understanding of the problem that you’re trying to solve and can make a more informed decision on whether to proceed with the project or not.
Planning includes the following tasks:
- State the problem that you are trying to solve. This may seem like an easy step, but you’d be surprised at how often people try to come up with a solution to a problem that doesn’t exist or a problem that isn’t really a problem.
- Define the business objective that you are trying to achieve in order to solve the problem. The objective should be measurable. “Being the best company in the world” is not a measurable objective, but something like “Decrease the number of fraudulent transactions” is.
- Determine the target variable if applicable and potential feature variables that you may want to look at. For example, if the objective is to decrease the number of fraudulent transactions, you’ll most likely want labelled data of both fraudulent and non-fraudulent transactions. You may also require features like the time of the transaction, the account ID, and the user’s ID.
- Consider any limitations, contingencies, and risks. This includes, but is not limited to, resource limitations (lack of capital, employees, or time), infrastructure limitations (e.g. lack of computing power to train a complex neural network), and data limitations (unstructured data, lack of data points, uninterpretable data, etc.).
- Establish your success metrics. How will you know that you’ve been successful in achieving your objective? Is it a success if your machine learning model is 90% accurate? What about 85%? Is accuracy the most suitable metric for your business problem?
If you complete this step and are confident in the project, you can move on to the next step.
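To make the question about success metrics concrete, here is a minimal sketch in plain Python. The labels and predictions are made-up illustrative data (1 = fraudulent, 0 = legitimate), and the helper names `accuracy` and `precision_recall` are hypothetical, not part of any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class (here: fraud)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up data: 95 legitimate transactions and 5 fraudulent ones.
y_true = [0] * 95 + [1] * 5
# A useless "model" that labels every transaction as legitimate.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))          # 0.95
print(precision_recall(y_true, y_pred))  # (0.0, 0.0)
```

In this toy case the model is 95% accurate yet catches no fraud at all, which is why it can be worth deciding during planning whether accuracy, recall, or some other metric best reflects the business objective.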
The data step is focused on acquiring, exploring, and cleaning your data. More specifically, it includes the following tasks:
- Collect and consolidate the data that you specified in the planning phase. If you’re obtaining data from multiple sources, you’ll need to merge the data into a single table.
- Wrangle your data. This entails cleaning and converting your data to make it more suitable for EDA and modeling. Some things that you’ll want to check include missing values, duplicate data, and noise.
- Conduct exploratory data analysis (EDA). Also known as data exploration, this step is completed so that you can better understand your dataset.
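The three tasks above can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the two data sources, the field names, and the choice to fill missing amounts with the median (one of several reasonable options, not a recommendation from this article).

```python
from statistics import mean, median

# Hypothetical source 1: transaction records (note the duplicate and the missing amount).
transactions = [
    {"transaction_id": 1, "account_id": "A", "amount": 20.0},
    {"transaction_id": 2, "account_id": "B", "amount": None},
    {"transaction_id": 3, "account_id": "A", "amount": 950.0},
    {"transaction_id": 3, "account_id": "A", "amount": 950.0},
]
# Hypothetical source 2: fraud labels keyed by transaction ID.
labels = {1: 0, 2: 0, 3: 1}


def consolidate(transactions, labels):
    """Merge both sources into one table (a list of rows), dropping duplicate IDs."""
    seen, table = set(), []
    for row in transactions:
        tid = row["transaction_id"]
        if tid in seen or tid not in labels:
            continue  # skip duplicates and rows without a label
        seen.add(tid)
        table.append({**row, "is_fraud": labels[tid]})
    return table


def fill_missing_amounts(table):
    """Replace missing amounts with the median of the observed amounts."""
    observed = [r["amount"] for r in table if r["amount"] is not None]
    fallback = median(observed)
    return [{**r, "amount": fallback if r["amount"] is None else r["amount"]}
            for r in table]


table = fill_missing_amounts(consolidate(transactions, labels))

# A tiny bit of EDA: row count, class balance, and average amount per class.
print("rows:", len(table))
print("fraud rate:", mean(r["is_fraud"] for r in table))
for cls in (0, 1):
    print("mean amount for class", cls, "=",
          mean(r["amount"] for r in table if r["is_fraud"] == cls))
```

In a real project you would more likely reach for a dataframe library, but the order of operations stays the same: consolidate, clean, then explore.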
Once your data is ready to go, you can move on to building your model. There are three main steps to this:
- Select your model: The model that you choose ultimately depends on the problem that you are trying to solve. For example, regression and classification problems call for different modeling methods.
- Train your model: Once you’ve selected your model and split your dataset into training and testing sets, you can train your model on the training data.
- Evaluate your model: When you feel that your model is complete, you can evaluate it on the testing data using the success metrics that you established during planning.
The last step is to productionize your model. This step is not talked about as much in courses and online, but it is essential, especially for enterprises. Without it, you may not be able to get the full value out of the models that you build. There are two main things to consider in this step:
- Model Deployment: Deploying a machine learning model simply means integrating it into an existing production environment where it can take in an input and return an output.
- Model Monitoring: Model monitoring is an operational stage in the machine learning life cycle that comes after model deployment. It entails watching your ML models for things like errors, crashes, and latency and, most importantly, ensuring that your model maintains a predetermined level of performance.
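Both ideas can be sketched in plain Python. This is only an illustration under stated assumptions: `handle_request`, `THRESHOLD`, and `PerformanceMonitor` are hypothetical names, the “model” is a fixed amount threshold, and a real deployment would sit behind a web framework or a serving platform. It also assumes that the true labels eventually become available so that predictions can be checked.

```python
import json
from collections import deque

THRESHOLD = 500.0  # hypothetical parameter learned during training


def handle_request(body):
    """Deployment in miniature: take a JSON input, return a JSON prediction."""
    payload = json.loads(body)
    prediction = 1 if payload["amount"] >= THRESHOLD else 0
    return json.dumps({"is_fraud": prediction})


class PerformanceMonitor:
    """Track accuracy over the most recent predictions and flag drops."""

    def __init__(self, min_accuracy, window_size=100):
        self.min_accuracy = min_accuracy
        # Oldest outcomes fall out automatically once the window is full.
        self.outcomes = deque(maxlen=window_size)  # True = correct prediction

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None  # nothing observed yet
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        current = self.accuracy()
        return current is not None and current < self.min_accuracy


print(handle_request('{"amount": 720.0}'))  # {"is_fraud": 1}

monitor = PerformanceMonitor(min_accuracy=0.9, window_size=4)
for predicted, actual in [(1, 1), (0, 0), (0, 1), (0, 1)]:
    monitor.record(predicted, actual)
print(monitor.accuracy())         # 0.5
print(monitor.needs_attention())  # True
```

The minimum accuracy passed to the monitor plays the role of the predetermined performance level mentioned above; when the rolling figure drops below it, that is a signal to investigate or retrain.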
And that’s the general layout of the machine learning life cycle.