Automating Machine Learning’s Last Mile™ (MLLM™)
Apr 26, 2019
Managing the ML model development lifecycle at scale requires the orchestration of many activities (e.g. data preparation, feature engineering, model training, validation and testing, and then actually deploying, tuning, and iterating). We will call the latter three the “Last Mile™.”
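To make those stages concrete, here is a minimal, hypothetical sketch of the lifecycle as an ordered pipeline. The stage names, the shared context dict, and the run_pipeline helper are illustrative only and not tied to any particular product.

```python
# A minimal, hypothetical sketch of the lifecycle as an ordered list of stages.
# Stage and function names are illustrative; the last three make up the "Last Mile".

from typing import Any, Callable, Dict, List

Context = Dict[str, Any]          # shared state passed from stage to stage
Stage = Callable[[Context], Context]

def prepare_data(ctx: Context) -> Context:
    ctx["data"] = "cleaned rows"
    return ctx

def engineer_features(ctx: Context) -> Context:
    ctx["features"] = "feature matrix"
    return ctx

def train_validate_test(ctx: Context) -> Context:
    ctx["model"], ctx["metrics"] = "fitted model", {"auc": 0.91}
    return ctx

def deploy(ctx: Context) -> Context:        # Last Mile
    ctx["endpoint"] = "scoring service"
    return ctx

def tune(ctx: Context) -> Context:          # Last Mile
    ctx["tuned"] = True
    return ctx

def iterate(ctx: Context) -> Context:       # Last Mile
    ctx["next_version"] = ctx.get("next_version", 1) + 1
    return ctx

PIPELINE: List[Stage] = [
    prepare_data, engineer_features, train_validate_test, deploy, tune, iterate,
]

def run_pipeline(stages: List[Stage]) -> Context:
    """Run each stage in order, threading the shared context through."""
    ctx: Context = {}
    for stage in stages:
        ctx = stage(ctx)
    return ctx

if __name__ == "__main__":
    print(run_pipeline(PIPELINE))
```

Each model typically gets its own copy of a pipeline like this, which is exactly where the scaling problems described below begin.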
At a small scale, with only a few on-premises models, building and managing an individual pipeline for each model is bearable. The typical organization’s deployment and management processes, however, are limited to ad-hoc projects between teams. If lucky, some organizations can deploy or iterate one model per month; many manage only a few models per quarter. Regardless, we are likely to find a gridlocked Last Mile™.
Even when a model successfully reaches production, we find it takes longer to identify errors and to reach quality model outputs. ROI becomes even more elusive when multiple models and their versions must be continuously monitored and maintained. In short, things get very complicated very quickly.
Our customers tell us their Last Mile™ is full of hastily stitched, bespoke, and duplicative processes that are slow, expensive, and full of incidents and post-mortems. Furthermore, teams using multiple frameworks (e.g. TensorFlow, Python, R, SAS) require a new pipeline to test and deploy each model within each framework, so working with models written in two or more frameworks becomes much more complicated.
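One common workaround is to hide every framework behind a single scoring contract so that one test-and-deploy pipeline can serve all models. The sketch below is hypothetical; the ModelWrapper interface and the wrapper class names are illustrative assumptions, and the framework calls are stubbed out.

```python
# Hypothetical sketch: one scoring interface in front of models from different
# frameworks, so a single test-and-deploy pipeline can treat them uniformly.

from abc import ABC, abstractmethod
from typing import Any, Dict, List

class ModelWrapper(ABC):
    """Framework-agnostic contract every deployed model must satisfy."""

    @abstractmethod
    def predict(self, rows: List[Dict[str, Any]]) -> List[Any]:
        ...

class SklearnWrapper(ModelWrapper):
    """Would delegate to a fitted scikit-learn estimator (stubbed here)."""

    def __init__(self, estimator: Any) -> None:
        self.estimator = estimator

    def predict(self, rows: List[Dict[str, Any]]) -> List[Any]:
        return [0.5 for _ in rows]   # placeholder for self.estimator.predict(...)

class RModelWrapper(ModelWrapper):
    """Would call out to an R runtime or a saved R model (stubbed here)."""

    def predict(self, rows: List[Dict[str, Any]]) -> List[Any]:
        return [0.5 for _ in rows]

def deploy(model: ModelWrapper) -> None:
    """The single pipeline depends only on the ModelWrapper contract."""
    smoke_test = model.predict([{"feature": 1.0}])
    print(f"deployed {type(model).__name__}, smoke test -> {smoke_test}")

if __name__ == "__main__":
    for wrapped in (SklearnWrapper(estimator=None), RModelWrapper()):
        deploy(wrapped)
```

Even with a wrapper like this, someone still has to build and maintain it for every framework in use, which is part of the overhead described next.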
This is especially true when the processes run on premises, since they require the creation and maintenance of specialized tools. The value of Machine Learning is typically delivered post-deployment, and reaching that point can be fraught with problems. Organizations are therefore slowly finding out that rapidly increasing the throughput of model versions in production is table stakes for a successful Machine Learning process.
We find the number of production models increases linearly, whereas the complexity of the bespoke (on-premises) production environment increases exponentially. To complicate things further, nobody owns the entire process or its budget, and nobody has a truly holistic view.
The Last Mile™ interactions between teams can become frustrating and can bring the entire business to a halt. Enterprises become risk-averse when making changes to the models and spiral into a vicious cycle of longer and longer system latency.
With this level of engineering overhead, it’s cost-prohibitive to consider assigning a team for six months to re-design and automate a better Last Mile™. Invariably, the forest gets lost for the trees and production throughput suffers. In this situation, the application complexity and the size of the code base, along with attempts to constrain project cost or timelines, will likely prevent the enterprise from ever launching a successful new Last Mile™ project.
The few who have identified and quantified the benefit of standardizing and normalizing these production processes have begun working with point-solution vendors to unclog and speed up their Last Mile™ without upsetting other processes. These vendors provide a centralized space to deploy, collaborate, and test as needed on a timely basis (minutes or hours, not days).
Another benefit of point solutions is that they address the fact that no one team has a large enough budget to absorb the re-design expense (which could become a CapEx item). Furthermore, if something goes wrong, point solutions are easier to rip out and replace than an expanded relationship with a vendor that has a proprietary framework (e.g. TIBCO, SAS).
For example, an employee of Customer A states:
“Our internal data science team grew to 30 people in the last few years. So as we scale, we began to realize that automation {of our Last Mile™} is required in our production machine learning infrastructure to support the business needs.”
In Customer A’s case, four full-time engineers were hired to manage their bespoke Last Mile™. As data science teams double in size, we have seen the need for engineers triple. Even with these hires, throughput capacity was insufficient and the headaches were everywhere. By acquiring the point-solution platform, Customer A not only saved four FTEs, but also gained the ability to deploy hundreds of models through a cookie-cutter process and to reach positive-ROI models much sooner.