Data Science in the new era
Dec 08 2017
Traditionally, Data Science has been regarded as a domain exclusive to those who understand mathematics and statistics. But this is changing in the new era.
For the purposes of this post, I will use the example of a large display publisher like Oath
(Yahoo + AOL). Typically, the publisher needs to decide which ad to display in a particular
placement in order to maximize conversion, revenue, or margin. This is done on the fly and involves
signals from the publisher, the network, the advertiser, and the ad.
The Data Science process primarily consists of three stages: discovery, deployment, and evaluation.
In the discovery phase, a data scientist seeks to understand whether the piles of data
produced by the enterprise contain any signal, or is tasked with validating hypotheses from a
business user or data analyst.
This is where the data scientist needs to mine through tons of ore and
find the essential (gold) data in it. At this stage, the data scientist interacts with other data
science team members to share findings, algorithms, models, etc. Though less common, they may also
share data sets that are already usable by a model, along with hard-won knowledge of "who has which
data" across the company and outside. All of this data is typically historical.
The data scientist then builds the initial set of features for a model they think
will bring value to the company. The most important activities for a data scientist are data
discovery, data cleansing, data normalization, and algorithm discovery. Typically, this is
done with a static, historical data set. The data wrangling phase of data discovery, data cleansing,
and data normalization can be tedious, cyclic, and repetitive. Automation tools in
this phase make the data team immensely more productive, and their work more enjoyable.
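To make the wrangling step concrete, here is a minimal sketch of data cleansing and normalization in pandas. The data frame, column names, and thresholds are all hypothetical, chosen only to mirror the ad-placement example above:

```python
import pandas as pd

# Hypothetical raw ad-impression log; column names are illustrative only.
raw = pd.DataFrame({
    "placement_id": ["P1", "P1", "P2", None, "P2"],
    "bid_cents":    [120, 120, None, 90, 3_000_000],  # last value is an outlier
    "clicked":      [1, 1, 0, 0, 1],
})

# Data cleansing: drop duplicate rows and rows missing key fields,
# then clip implausible bid values to an assumed upper bound.
clean = (
    raw.drop_duplicates()
       .dropna(subset=["placement_id", "bid_cents"])
       .assign(bid_cents=lambda df: df["bid_cents"].clip(upper=10_000))
)

# Data normalization: scale the bid feature to [0, 1] so it is
# comparable with other features during model training.
lo, hi = clean["bid_cents"].min(), clean["bid_cents"].max()
clean["bid_norm"] = (clean["bid_cents"] - lo) / (hi - lo)
```

In practice each of these steps is revisited many times as the data scientist learns what the data actually contains, which is exactly why the cycle feels tedious without tooling.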
Notebooks built on the Jupyter kernel are the most commonly used tools during the discovery phase. At the
end of this process, a data scientist will have many notebooks containing the code used to build several
versions of the model.
Typically, the models built at this stage are offline, far from a production environment where
they could cause material damage. They are built in a vacuum and have no monetary value.
Once a data scientist builds a model, it needs to be deployed to production
to test it on live production traffic. This is where the model has $$$ impact on the business.
For instance, the model built for ad placement during discovery will be deployed in
production to understand whether or not it impacts the click-through rate.
This stage involves heavy cross-team activity. The teams involved are Data
Science, Data Engineering (or Engineering), and DevOps.
Here are the roles they play:
- Data Science: architect the features and algorithm; train and build the machine learning model on the data lake.
- Data Engineering: deploy the model handed over by Data Science to production; make sure the model gets data from its sources; serve the model.
- DevOps: make sure the model scales; build a rollback policy; add additional data sources.
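The hand-off between these teams can be sketched in a few lines. Everything below, the model class, the feature names, and the artifact path, is hypothetical; a real deployment would use the teams' own model format and serving stack:

```python
import os
import pickle
import tempfile
from dataclasses import dataclass

@dataclass
class DummyCtrModel:
    """Stand-in for a model trained by Data Science in a notebook."""
    weight: float = 0.1

    def predict(self, features: dict) -> float:
        # Toy scoring logic for illustration only.
        return min(1.0, self.weight * features.get("bid_norm", 0.0) + 0.05)

def deploy(model, path: str) -> None:
    """Data Science hands over a serialized model artifact."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_and_score(path: str, request: dict) -> float:
    """Data Engineering loads the artifact and serves predictions;
    DevOps scales this entry point and owns the rollback policy."""
    with open(path, "rb") as f:
        model = pickle.load(f)
    return model.predict(request)

# Usage: hand off the artifact, then score one incoming request.
artifact = os.path.join(tempfile.mkdtemp(), "ctr_model.pkl")
deploy(DummyCtrModel(), artifact)
ctr_estimate = load_and_score(artifact, {"bid_norm": 0.5})
```

The point of the sketch is the boundary: Data Science owns everything inside `predict`, while Data Engineering and DevOps own everything around it.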
In this stage, the biggest challenges are:
- A lot of duplicated work
- Hiring data science talent is hard
- Standalone pipelines are built by Engineering team(s)
- The model must be iterated on multiple times to optimize it
- No standard framework for sharing data sets, features, models, etc.
- Scaling and deployment of models
- It's all a black box: no standard way to detect the root cause of badly behaving models, such as bad data
The last stage is evaluating the model. This is where you decide whether to roll the
model out to all live production traffic or tweak it to make it better. Often, a model is
associated with a performance KPI metric that judges its quality.
Typically, as in A/B testing, a model is put on a small percentage of traffic, the
performance KPI metric is checked, and a decision is made to either increase the traffic or stop it.
Some examples of KPI performance metrics are user conversion rate and click-through rate.
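A minimal sketch of this A/B decision, using a two-proportion z-test on click-through rate (CTR) and made-up traffic numbers; the function name, the counts, and the ramp-up rule are all assumptions for illustration:

```python
import math

def ctr_ab_test(clicks_a, views_a, clicks_b, views_b, z_crit=1.96):
    """Compare treatment (new model, small slice) against control (current model)."""
    ctr_a = clicks_a / views_a
    ctr_b = clicks_b / views_b
    # Pooled CTR under the null hypothesis of no difference.
    p = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(p * (1 - p) * (1 / views_a + 1 / views_b))
    z = (ctr_b - ctr_a) / se
    # Ramp up traffic only if the lift is statistically significant.
    decision = "increase traffic" if z > z_crit else "hold or stop"
    return ctr_a, ctr_b, z, decision

# Illustrative numbers: control at 2.0% CTR on 100k views,
# treatment at 2.6% CTR on a 10k-view slice.
result = ctr_ab_test(clicks_a=2000, views_a=100_000, clicks_b=260, views_b=10_000)
```

Real evaluation pipelines layer much more on top (guardrail metrics, sequential testing, novelty effects), but the core decision loop looks like this.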
Evaluating the model is often considered a "post-mortem" task and is done manually
by the data scientist.
In this stage, data scientists and data engineers collaborate closely. It does not
end here, though. The cycle repeats, often with duplicated steps. Models can impact almost any aspect
of the business, but only if they are in production. Advances in compute got us to Big Data; successful
models in production bring us to the new era. Much like how communication patterns in an
organization reflect its org chart, successful model deployments in production are a
leading indicator of organizational health and productivity. A successful deployment sets in motion
the virtuous cycle of model, deploy, and lift in revenue.
If you want to supercharge your data teams and/or are interested in understanding more about the Data Science process, please email