Machine learning was supposed to make things easy by computerizing human cognition. But it made life harder than ever for the data teams tasked with implementing it.
A rising number of enterprises implement machine learning (ML) to improve revenue and operations as they digitally transform their businesses. But ML introduces operational complexities and risks that need careful attention. Data teams must holistically manage the ML lifecycle to make their projects efficient and effective.
This blog kicks off a series that examines the ML lifecycle, which spans (1) data and feature engineering, (2) model development, and (3) ML operations (MLOps). This blog defines machine learning and then examines the data and feature engineering stage. Part 2 of the blog series will examine model development and MLOps. Subsequent blogs will examine the roles of key stakeholders—the business leader, data engineer, data scientist and developer—because these roles in particular must acquire new skills and collaborate to make ML work. The stakes run high because many projects today fail due to siloed roles and disjointed processes.
What is Machine Learning?
Let’s start at the beginning. Machine learning (ML) is a subset of artificial intelligence in which an algorithm discovers patterns in data. These patterns help people or applications predict, classify, or prescribe a future outcome. ML relies on a model, which is essentially an equation that defines the relationship between data inputs and outcomes. ML applies various techniques to create this model, including supervised learning, which studies known prior outcomes, and unsupervised learning, which finds patterns without knowing outcomes beforehand.
Once you create and train the ML model on historical data, you apply it to live production data. The model generates a score that helps people and applications create business value by predicting, classifying, or prescribing future outcomes—and taking action.
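To make the train-then-score flow concrete, here is a minimal sketch in Python. It assumes the simplest possible model, a straight line fitted by ordinary least squares to one input. The house sizes, prices, and function names are hypothetical and chosen only for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Train: ordinary least squares for one input. Returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def score(model, x):
    """Apply the trained model (the equation) to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical historical data: house size (sq ft) and sale price ($).
sizes = [1000, 1500, 2000, 2500]
prices = [200_000, 300_000, 400_000, 500_000]

model = fit_line(sizes, prices)   # train on historical data
predicted = score(model, 1800)    # score a new, unseen house
print(f"Predicted price: ${predicted:,.0f}")
```

Real projects would typically rely on an ML library and many more inputs, but the two phases are the same: fit on history, then score live data.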
While the AI pioneer Arthur Samuel coined the term “machine learning” in 1959, the technology really gained steam in the 1980s and 1990s as an alternative to manual predictive models.
Enterprises address many types of business problems with ML. For example:
Fraud detection: A commercial bank uses ML to classify whether a transaction has a high risk of fraud based on merchant identity, merchant location, and size vs. prior transactions. Transactions classified as high risk trigger an extra authentication request of the merchant.
Price predictions: A real-estate firm uses ML to predict the market price of houses based on zip code, recent transactions, and local school ratings. These predicted prices guide the asking and offer prices that agents recommend to their clients.
Health services: Nurses in a hospital use ML to classify the risk of major infections based on demographic data and patient vital signs. High risk classifications alert the nurses and prompt them to proactively treat patients.
Document processing: A law firm uses ML to classify documents and extract information, such as key concepts, relevant laws, and primary stakeholders, to expedite research efforts.
Preventive maintenance: A manufacturer uses ML to study data from physical sensors and factory service records to predict when robotic arms will break down. A predicted time to failure below a certain threshold triggers an alert for the plant manager to deploy a service technician.
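Several of these use cases share one pattern: a model output is compared with a threshold, and crossing it triggers an action. The sketch below illustrates that pattern for the maintenance example. It assumes the model outputs predicted hours until failure. The 72-hour cutoff, the arm IDs, and the function name are hypothetical.

```python
ALERT_THRESHOLD_HOURS = 72  # hypothetical cutoff chosen by the plant


def arms_needing_service(predictions, threshold=ALERT_THRESHOLD_HOURS):
    """Return IDs of arms whose predicted hours-to-failure fall below the threshold."""
    return sorted(arm for arm, hours in predictions.items() if hours < threshold)


# Hypothetical model outputs: predicted hours until each robotic arm fails.
predictions = {"arm-01": 410.0, "arm-02": 36.5, "arm-03": 71.9}

alerts = arms_needing_service(predictions)
for arm in alerts:
    print(f"Alert plant manager: schedule a technician for {arm}")
```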
To address use cases like these in large and complicated enterprise environments, you need to manage the ML lifecycle holistically. This entails three stages and nine individual steps.
Data and feature engineering: Ingest and transform your input data (i.e., your historical data), label your outcomes, then define and store your features.
Model development: Select the ML technique you will use (such as linear regression, classification, or “random forests” that make use of decision trees), train your models, then store and manage them.
ML operations (MLOps): Put those models to work! Implement them and operate them in production workflows. Monitor their performance and the accuracy of their output.
Most importantly, data teams must rinse and repeat. They must identify data drift (shifts in the data caused by changes in market conditions or other aspects of the environment), then pull their ML models out of production, re-train those models and re-implement them. Figure 1 illustrates the three stages of the ML lifecycle.
Figure 1. Stages and Steps in the Machine Learning Lifecycle
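One simple way to spot data drift is to compare a feature's values in production with the values the model was trained on. The sketch below is an assumption-laden illustration, not a standard method: it measures how far the production mean has moved, in units of the training standard deviation, and flags re-training above a hypothetical threshold of 2.0. The transaction amounts are made up.

```python
from statistics import mean, stdev


def drift_score(training_values, production_values):
    """Shift of the production mean, measured in training standard deviations."""
    shift = abs(mean(production_values) - mean(training_values))
    return shift / stdev(training_values)


def needs_retraining(training_values, production_values, threshold=2.0):
    """Flag the model for re-training when the shift exceeds the threshold."""
    return drift_score(training_values, production_values) > threshold


# Hypothetical transaction amounts seen during training and in production.
training_amounts = [20.0, 25.0, 30.0, 35.0, 40.0]
stable_amounts = [22.0, 31.0, 38.0]
shifted_amounts = [80.0, 95.0, 110.0]

print(needs_retraining(training_amounts, stable_amounts))
print(needs_retraining(training_amounts, shifted_amounts))
```

Production monitoring usually tracks many features and uses more robust statistical tests, but the loop is the same: measure, compare, and re-train when the gap grows too large.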
Data and Feature Engineering
Let’s define each step of data and feature engineering, and who performs it. To set the table for upcoming blogs, we’ll also describe, after each step, the key challenges that make ML projects an “all-hands-on-deck” endeavor. Busy business leaders, data engineers, data scientists and developers need to acquire new skills and help one another.
Ingest and transform input data. First, data scientists, data engineers, and ML engineers need to collect all the historical input data that’s potentially relevant to the business problem they need to solve. They design, configure and deploy data pipelines that ingest the input data into a repository such as a data lake. They merge, cleanse and format the data. The data scientist provides close oversight to ensure the resulting dataset fits analytics requirements.
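As a rough illustration of the merge, cleanse and format work, the sketch below joins two small in-memory datasets, drops rows that cannot be used, and normalizes the remaining values. The field names and records are hypothetical, and a real pipeline would read from source systems and write to a repository rather than use Python lists.

```python
def merge_and_clean(transactions, merchants):
    """Join transactions to merchants, drop unusable rows, normalize formats."""
    merchants_by_id = {m["merchant_id"]: m for m in merchants}
    rows = []
    for t in transactions:
        merchant = merchants_by_id.get(t["merchant_id"])
        # Cleanse: skip rows with no matching merchant or no amount.
        if merchant is None or t.get("amount") in (None, ""):
            continue
        rows.append({
            "merchant_id": t["merchant_id"],
            # Format: consistent capitalization and numeric types.
            "merchant_city": merchant["city"].strip().title(),
            "amount": round(float(t["amount"]), 2),
        })
    return rows


# Hypothetical raw inputs from two source systems.
transactions = [
    {"merchant_id": "m1", "amount": "19.990"},
    {"merchant_id": "m2", "amount": ""},       # missing amount
    {"merchant_id": "m9", "amount": "5.00"},   # unknown merchant
]
merchants = [
    {"merchant_id": "m1", "city": "  san jose "},
    {"merchant_id": "m2", "city": "Austin"},
]

clean_rows = merge_and_clean(transactions, merchants)
print(clean_rows)
```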
Data engineers need to manage high volumes, varieties and velocities of data across heterogeneous hybrid and multi-cloud environments. They collaborate with data scientists to transform data into a usable format and structure for ML. Data engineers also need to create pipelines that data scientists can access and manipulate.
Label the outcomes. Next, data engineers, ML engineers and data scientists collaborate with business owners to “label” various outcomes in their historical data sets. This means they add tags to data to easily identify historical outcomes, such as robotic arm failures, fraudulent transactions or the price of recent house sales. While data labeling is trivial in those examples, it can get tricky with unstructured data. You need to label historical images (for example, “dogs,” “cats,” etc.) to help the algorithm create an accurate ML model for image recognition. Similarly, you need to label customer emails and social media posts as “positive” or “negative” to create an accurate model for classifying customer sentiment. You can view a label as the variable you want to predict.
Data engineers and data scientists need to label outcomes accurately and at high scale. This requires a programmatic approach, automation, and assistance from business owners who best understand the domain.
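A programmatic approach to labeling can be as simple as a set of rules that tag the easy cases and hand the rest to people. The sketch below assumes keyword rules for customer sentiment. The word lists and messages are hypothetical, and in practice business owners would supply and review the rules.

```python
# Hypothetical keyword rules supplied by business owners.
POSITIVE_WORDS = {"thrilled", "great", "love"}
NEGATIVE_WORDS = {"unacceptable", "broken", "refund"}


def label_message(text):
    """Return 'positive', 'negative', or None when the rules cannot decide."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    positive_hits = len(words & POSITIVE_WORDS)
    negative_hits = len(words & NEGATIVE_WORDS)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return None  # ambiguous: route to a human reviewer


messages = [
    "I am thrilled with the new dashboard!",
    "This delay is unacceptable.",
    "Where is my order?",
]
labels = [label_message(m) for m in messages]
print(list(zip(messages, labels)))
```

Returning no label for ambiguous messages keeps the automated labels accurate and reserves human effort for the cases that need domain judgment.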
Note that labeling applies to supervised ML only. Unsupervised ML, by definition, studies input data without known outcomes, which means the data has no labels.
Engineer the features. Now the data scientist, data engineer and ML engineer employ feature engineering: they extract or derive “features” (the key attributes that really drive outcomes) from all that input data, then share them. Features become the filtered, clean inputs for an ML algorithm to study, so that it does not drown in data while creating the model. Feature engineering can dictate the success or failure of ML projects: without it, you have “garbage in, garbage out.” It is partly an art because it requires domain knowledge and judgment as well as statistical techniques. For example:
The data scientist finds from conversations with realtors that home buyers always cite recent sale prices when determining their own offer price. The recent home prices therefore become a feature.
The data scientist and data engineer use a program to count the number of times key words appear in the service records of (1) repeat customers and (2) former customers. The most frequent words or phrases (perhaps certain product names, or adjectives such as “thrilled” or “unacceptable”) become features.
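The keyword-counting example above can be sketched in a few lines of Python. The service records and keyword list here are hypothetical. The idea is that words whose counts differ sharply between repeat and former customers are candidates to become features.

```python
import re
from collections import Counter


def word_counts(records):
    """Count how often each lowercase word appears across the records."""
    counts = Counter()
    for record in records:
        counts.update(re.findall(r"[a-z']+", record.lower()))
    return counts


def candidate_features(repeat_records, former_records, keywords):
    """Map each keyword to (count among repeat customers, count among former customers)."""
    repeat = word_counts(repeat_records)
    former = word_counts(former_records)
    return {k: (repeat[k], former[k]) for k in keywords}


# Hypothetical service records for the two customer groups.
repeat_records = ["Customer thrilled with the repair", "Thrilled again, quick fix"]
former_records = ["Unacceptable delay", "Wait time unacceptable, customer left"]

features = candidate_features(repeat_records, former_records, ["thrilled", "unacceptable"])
print(features)
```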
Some enterprises now use feature stores to assist their feature engineering efforts. Feature stores are platforms for defining features, curating them and then serving them to various ML algorithms and models. They also can assist with data transformation as described in step 1 above.
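To show the define, curate and serve idea in miniature, here is a toy in-memory sketch. It is not the API of any real feature store product. The class, method names, and housing data are all hypothetical.

```python
class FeatureStore:
    """Toy in-memory feature store: define features once, serve them to many models."""

    def __init__(self):
        self._definitions = {}  # feature name -> function that computes it
        self._values = {}       # entity ID -> {feature name: value}

    def define(self, name, compute):
        """Register how a feature is derived from a raw record."""
        self._definitions[name] = compute

    def materialize(self, entity_id, raw_record):
        """Compute and store every defined feature for one entity."""
        self._values[entity_id] = {
            name: compute(raw_record) for name, compute in self._definitions.items()
        }

    def serve(self, entity_id, feature_names):
        """Return the requested features, in order, for a model to consume."""
        stored = self._values[entity_id]
        return [stored[name] for name in feature_names]


store = FeatureStore()
store.define("price_per_sqft", lambda r: r["price"] / r["sqft"])
store.define("school_rating", lambda r: r["school_rating"])
store.materialize("house-1", {"price": 300_000, "sqft": 1500, "school_rating": 8})

vector = store.serve("house-1", ["price_per_sqft", "school_rating"])
print(vector)
```

Because each feature is defined in one place, different models can request different subsets without re-deriving them from raw data.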
Data scientists need to consult business domain experts to make the right judgment calls about features. They also need to work closely with data engineers to create, manage and reuse the right features for numerous models in their organization.
Now that data teams have ingested and transformed their historical input data, labeled the historical outcomes and engineered their features, they are ready to start building the ML model. This model will define the relationship between features and labels, as shown in Figure 2.
Figure 2: ML Model, Features and Labels
The ML model is an equation that defines how “features,” or key attributes of your input data, relate to “labels,” the outcomes or variables you want to predict.
Data and feature engineering steers the success or failure of MLOps, which in turn steers the success of enterprise ML projects. Data teams that assemble all the right data, label their outcomes correctly, and devise the right features will make sure those machines do good rather than harm. They might just make things easier for the companies that implement them.
Now that we understand the data and feature engineering phase, we will examine ML model development and operationalization in Part 2 of our blog series.
Join Kevin Petrie, a thought leader and industry analyst at Eckerson Group, at our webinar on May 20th, 2021 at 9:00am PDT where he will speak about The ML Lifecycle and ModelOps: Building and Operationalizing ML Models. This is a thought leadership and best practices session. Do not miss the opportunity to learn more about this critical emerging field!