Best Monitoring Practices for Data Science Models

So you’ve got your data science model up and running. You’ve done everything right to get here, and you’re excited about it. The model was trained on the best and most representative data you could find, with a robust set of features, and it scored impressively on your test datasets. But even with the best training and deployment practices, data science models can easily drift away from optimal performance and start causing losses for the company.

Most of these problems arise from circumstances that nobody anticipated. How would your model react when someone enters their gender in a form as ‘undisclosed’? How does it handle unseen data or a highly skewed set of data? Even the best models need constant supervision and occasional tweaking to keep delivering acceptable results.

In this article, we’ll walk you through best practices for monitoring your data science models. We’ll help you think through ways to gather the necessary information and convert it into metrics worth tracking. Some of these metrics matter legally as well as practically, because there can be legal thresholds for certain performance metrics, especially when it comes to imbalance in demographic data.

Understanding metrics to track:

  • You should first know the features, or input data, that the data science model processes, such as ‘gender’, ‘age’, ‘medical records’ and ‘income’. Knowing them helps you identify the most important features and metrics to monitor.
    • Here, you’d want to monitor the overall distributions of the features in the incoming data. This gives you a deeper sense of the data being submitted, and it rings an early alarm bell when data arrives from a subgroup of the population that your model isn’t prepared to handle.
  • From the information processed by the model, we need to define thresholds and the exact metric that we want to monitor. Here are some examples:
    • Percentage difference in loan acceptance between men and women.
    • Percentage difference in loan rejection between Black and white applicants.
  • Another important family of metrics for a probabilistic model is real-time performance metrics, such as the confidence (predicted probability) of each prediction. We want to make sure that the model is making confident predictions. If the model is making predictions with only 50-60% probability, there is a good chance that it is struggling with the incoming data. Such situations call for tweaking the model and retraining it on bigger and more representative datasets.
  • Beyond performance metrics, you’d also want to monitor descriptive information about the model and its components to make sure that there is nothing happening that is out of the ordinary. This includes dataset size, the latency of the model, the processing power, the number of retries and the iterative improvement of the model.
  • Some metrics are weak or less important on their own: they say little about the quality of the model and may distract the data scientist from the real problems at hand. Instead, one must try to consolidate such metrics into a score that gives insight into the overall quality of a model component. For example, one can calculate a single drift score from the individual drift metrics and, separately, track drift for the few features that are extremely important. This avoids tracking 20 drift metrics at a time and keeps the focus on the drift metrics that actually matter.
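
To make a few of the metrics above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than a recommended implementation: the function names (rate_gap, low_confidence_share, psi, drift_score) are hypothetical, the gap is reported in percentage points, the 0.6 confidence cut-off simply mirrors the 50-60% range mentioned above, and the population stability index (PSI) is one common choice of per-feature drift metric that this article does not prescribe.

```python
import math
from collections import Counter


def rate_gap(outcomes, groups, group_a, group_b):
    """Gap in positive-outcome rate (percentage points) between two groups.

    `outcomes` holds 1 (e.g. loan accepted) or 0 for each row, and
    `groups` holds the demographic label of the same row.
    """
    def rate(group):
        selected = [o for o, g in zip(outcomes, groups) if g == group]
        if not selected:
            raise ValueError(f"no observations for group {group!r}")
        return 100.0 * sum(selected) / len(selected)

    return rate(group_a) - rate(group_b)


def low_confidence_share(positive_probs, cutoff=0.6):
    """Share of binary predictions whose confidence is at or below `cutoff`.

    Confidence is taken as max(p, 1 - p), where p is the predicted
    probability of the positive class. The cutoff is an assumption.
    """
    if not positive_probs:
        raise ValueError("no predictions supplied")
    low = sum(1 for p in positive_probs if max(p, 1.0 - p) <= cutoff)
    return low / len(positive_probs)


def psi(expected, actual, eps=1e-6):
    """Population stability index for one categorical feature.

    `expected` is a reference sample (e.g. training data) and `actual`
    is recent incoming data. Categories missing from one side get a tiny
    share `eps` so that the logarithm stays defined.
    """
    expected_counts, actual_counts = Counter(expected), Counter(actual)
    total = 0.0
    for category in set(expected_counts) | set(actual_counts):
        e = max(expected_counts[category] / len(expected), eps)
        a = max(actual_counts[category] / len(actual), eps)
        total += (a - e) * math.log(a / e)
    return total


def drift_score(psi_by_feature, weights=None):
    """Weighted average of per-feature PSI values (equal weights by default)."""
    if weights is None:
        weights = {name: 1.0 for name in psi_by_feature}
    total_weight = sum(weights[name] for name in psi_by_feature)
    weighted = sum(psi_by_feature[name] * weights[name] for name in psi_by_feature)
    return weighted / total_weight
```

In this sketch, a value such as ‘undisclosed’ that never appeared in the reference data pushes the PSI for that feature up sharply, which is the kind of early alarm bell described above. The less important features can be folded into drift_score, while the PSI of the few critical features is tracked separately.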

Analysing the metrics:

  • Firstly, it is important to define a ‘gradient’ of severity, so that you know whether a metric is at the stage of ‘not to worry about’, ‘maybe a problem’, ‘a problem that could eventually fix itself’, ‘a problem that needs to be solved’, or ‘stop the model right now’. Such a gradient lets you plan your monitoring process far better than a single threshold that defines the state of urgency.
  • You should also track the real-time metrics against the model’s own historical averages and against industry averages. Metrics by themselves don’t mean anything unless they are benchmarked against standards. One must answer questions like ‘How much confidence in the prediction is enough for a loan application to be accepted?’ before defining thresholds and metric trackers.
  • It is important to note that not every metric that looks off signals a fault in the model, because correlation isn’t causation. In simple terms, if you end up rejecting a lot of people from a certain race, that doesn’t necessarily mean that the model is racist. The outcome needs to be checked against other factors as well to establish whether race is merely correlated with the rejections or is actually driving them. Thus, these ‘drifts’ of the model from optimal performance must be very carefully monitored and analysed against the rest of the data.
  • Every metric must also be tracked over time to make sure that an observation reflects a real problem and not a one-time outlier that caused a temporary deviation in the model. It is also worth noting that a temporary deviation may or may not be tolerable, depending on the application. A small error in a loan application predictor may not be a deal breaker, but it might mean life or death for a model that predicts whether a person has cancer (although in high-stakes cases like these, multiple layers of checks are usually in place so that a single wrong prediction is not acted on alone).
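
The severity gradient and the tracking over time described above could be sketched roughly as follows. Everything specific here is an assumption made for illustration: the band boundaries, the names SEVERITY_BANDS, severity and PersistentAlert, and the rule that an alert fires only when the limit is breached in at least 3 of the last 5 observations. Real thresholds would come from your own benchmarks.

```python
from collections import deque

# Hypothetical upper bounds for one metric (e.g. an acceptance-rate gap
# expressed as a fraction), ordered from mildest to most severe.
SEVERITY_BANDS = [
    (0.02, "not to worry about"),
    (0.05, "maybe a problem"),
    (0.10, "a problem that could eventually fix itself"),
    (0.20, "a problem that needs to be solved"),
]
WORST = "stop the model right now"


def severity(value, bands=SEVERITY_BANDS, worst=WORST):
    """Map a metric value onto the first band whose upper bound exceeds it."""
    for upper_bound, label in bands:
        if value < upper_bound:
            return label
    return worst


class PersistentAlert:
    """Fire only when a breach persists, so one-off outliers are ignored."""

    def __init__(self, limit, window=5, min_breaches=3):
        self.limit = limit
        self.min_breaches = min_breaches
        # Keeps only the most recent `window` breach flags.
        self.recent = deque(maxlen=window)

    def update(self, value):
        """Record one observation and report whether the alert should fire."""
        self.recent.append(value > self.limit)
        return sum(self.recent) >= self.min_breaches


if __name__ == "__main__":
    alert = PersistentAlert(limit=0.10)
    for observed in [0.01, 0.30, 0.02, 0.12, 0.15, 0.18]:
        print(observed, severity(observed), alert.update(observed))
```

For an application where even a temporary deviation is unacceptable, min_breaches could be set to 1, so that the very first breach fires the alert.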

Maintaining the integrity of the metrics:

  • The metrics themselves must be correct and must not be tampered with. The infrastructure and the model must be set up so that the metrics are not artificially skewed by external factors. These can be things like APIs interacting with the model, or latency that distorts the time series data and makes short-lived errors look like problems that lasted longer than they actually did.
  • Model performance can also be affected by factors such as security and the quality of the infrastructure. Such factors may not affect the model directly, but their effects can slowly creep in. For example, a host instance with high latency can lead to very slow transfers, or even data loss in extreme cases, which can distort the model’s metrics. An instance with too many open ports can be a security threat and can compromise the integrity of the metrics.
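
One way latency can distort a time series is when metrics are bucketed by the time a record arrives at the monitoring system rather than the time the prediction was actually made. The sketch below is a hypothetical illustration of that effect. The record fields event_time, arrival_time and is_error, and the function name errors_per_minute, are assumptions made for the example and do not come from any particular tool.

```python
from collections import Counter
from datetime import datetime, timedelta


def errors_per_minute(records, time_field):
    """Count errors per one-minute bucket, keyed by the chosen timestamp field."""
    counts = Counter()
    for record in records:
        if record["is_error"]:
            # Truncate the timestamp to the start of its minute.
            bucket = record[time_field].replace(second=0, microsecond=0)
            counts[bucket] += 1
    return dict(counts)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0, 0)
    # Made-up data: three errors within 30 seconds, logged with growing delays.
    records = [
        {"event_time": start + timedelta(seconds=10),
         "arrival_time": start + timedelta(seconds=15), "is_error": True},
        {"event_time": start + timedelta(seconds=20),
         "arrival_time": start + timedelta(seconds=90), "is_error": True},
        {"event_time": start + timedelta(seconds=30),
         "arrival_time": start + timedelta(seconds=160), "is_error": True},
    ]
    print(errors_per_minute(records, "event_time"))
    print(errors_per_minute(records, "arrival_time"))
```

With this made-up data, bucketing by event_time places all three errors in a single minute, while bucketing by arrival_time spreads them over three consecutive minutes, so a brief burst looks like a longer incident. Recording both timestamps makes it possible to tell the two situations apart.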

From deployment to governance to security, Datatron has it all. Excited about Datatron? Or not convinced yet? Check out our website and schedule a walkthrough with us right away!

This Whitepaper was written by Sohit Miglani. Connect with me here: 

  1. Follow me on Twitter here.
  2. Connect with me on LinkedIn here. (Also send a note when you connect so that I know you read this article)
  3. Follow me on Medium here.
