How to Make Life Easy for a Newly Hired Data Scientist
Dec 29 2017
When a new data scientist is hired into your team, it takes some time for that person to become
productive. The data science onboarding process takes longer than other onboarding processes in the
enterprise. In this post, I am going to describe the life of a newly hired data scientist. The use
case is that the data scientist is given a project where he needs to build an online learning model.
He needs to understand the problem, write experiments for it and deploy a model to production. Also,
to make the model useful, he needs to boot up a service which gives live predictions using this
model. To make things simple, I am going to narrate the story of Alex, a recently hired data scientist.
The new hire’s first week
Being new to the team, new hires are usually assigned simple tasks, so that they can focus on
understanding the interface and infrastructure as their first step. Since data science is a
relatively time-consuming and involved field when it comes to understanding the problems to solve
and the required domain knowledge, walking through a simple problem gives the new hire enough time
to understand the team's framework and how to progress. As a next step, data scientists can focus
on domain-related problems.
Given that data science is a fairly involved field, quite a bit of time is required to understand
the problem at hand and try out different experiments. For a problem at enterprise scale, e.g., a
recommendation system for users visiting the website or price prediction for an item shown to a
user, it takes time to get to a model that makes accurate predictions about which items to
recommend or which prices to quote, since the data scientist needs to run through several
experiments. It is usually the case that more than a couple of experiments are needed to arrive at
a model that performs near expectations.
The data scientist needs to write several scripts, test whether the model is working, make changes
based on the results of the previous experiment, and repeat. This loop continues until a
satisfactory model is obtained. Most importantly, data scientists are time constrained and can only
do so much in the time allotted. It becomes important to deliver results quickly and accurately.
This restrains their productivity: they could have tried more alternatives if they had more time at
their disposal.
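That iterate-and-evaluate loop can be sketched in a few lines of Python. This is only an
illustration: the one-parameter model and the ridge-style shrinkage hyperparameter below are made
up for the sketch, and a real project would use a proper training framework rather than hand-rolled
fitting:

```python
import random

random.seed(42)

# Toy data: y = 3x + noise, split into train and validation sets
data = [(x, 3 * x + random.uniform(-0.5, 0.5)) for x in range(100)]
train, valid = data[:80], data[80:]

def fit(train_data, ridge):
    # One-parameter "model": slope estimated with a ridge-style shrinkage term
    num = sum(x * y for x, y in train_data)
    den = sum(x * x for x, _ in train_data) + ridge
    return num / den

def mse(slope, dataset):
    return sum((y - slope * x) ** 2 for x, y in dataset) / len(dataset)

# The experiment loop: try a setting, look at the result, adjust, repeat
best = None
for ridge in (0.0, 1.0, 10.0, 100.0):
    slope = fit(train, ridge)
    err = mse(slope, valid)
    if best is None or err < best[0]:
        best = (err, ridge, slope)

print("best (mse, ridge, slope):", best)
```

Each pass through the loop is one "experiment"; in practice each iteration can take hours or days,
which is exactly why the time pressure described above bites.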
After carrying out the experiments, data scientists typically want to deploy the model to
production and turn it into a service, which is a pretty exciting task. Data scientists typically
need help from data engineers for data, from DevOps for model deployment, and from the software
engineering team for any task related to booting up a service.
Frustrations of a new hire
Alex comes into work and starts on the first project his manager gave him. Typically, he wants to
grab data from some database or data lake, write some code to operate on it, and understand the
context of the problem. The next step after understanding the problem is to perform experiments on
it, which involves writing code. Also, there is rarely one single experiment to run in production;
he may want to try several experiments at the same time, split traffic, perform A/B testing, and so
on. Finally, after the model is deployed to production, he wants to hit the model to get live
predictions for the user. But today, a typical data scientist like Alex faces the following
problems:
Setup discovery: Typically, every newly hired data scientist needs to learn the infrastructure
on which he is going to perform experiments and deploy model code. This is common
regardless of which company he joins.
Access to data: Data access for every new hire adds to the cost of the company. Every new hire asks
about credentials, how to access data, where it lies, and what the data means. This adds a lot of
time to understanding things that are already in place. There is a lack of a framework which
surfaces this information and makes it more discoverable.
Repeated code: Since there is no central code repository system for data scientists, as there is in
software development engineering, each new data scientist tends to write his own scripts. It is
also pretty difficult to check with the team whether a piece of code already exists. A central set
of software modules for achieving a bigger project is lacking. Every data scientist has his own
tools for his own objective, which makes the pipeline very spaghetti-like and hard to understand
and scale. This also adds a lot of cost to maintaining scripts in different places.
Hard to deploy models: Data scientists can perform experiments and write good machine learning
models. They spend enough time on training, testing, cross-validation, etc. Once they see that
results are close to expectations, they typically want to push the model to production and try it
out on live traffic to see whether it performs for real. The challenging part is that data
scientists are not infrastructure engineers. They need help from DevOps, and there is continuous
back-and-forth communication between the two teams for every model deployment. This is pretty hard.
Hard to deploy multiple models: Deploying one model to production is fine; DevOps can help out
there. But if the request is to deploy multiple models to production, it is a near-impossible task
today. It is not achievable, manually, to flip multiple models back and forth within minutes for
live traffic.
Publishing endpoint: Typically for online models, a data scientist wants to expose the model as a
service, so that live traffic can be routed to it and the model's results, final prices for
example, can be returned to the end customer. Writing prediction logic rests completely with the
data scientist today, but to make it real in production, he needs to seek help from the software
dev teams.
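As a concrete illustration of the traffic-splitting piece of the multiple-models problem, here is a
minimal sketch of sticky model assignment. The model names and weights are made up for the example;
the point is that hashing a stable user id gives every user a consistent bucket, which is what
makes A/B comparisons between live models meaningful:

```python
import hashlib

# Hypothetical traffic splitter: the model names and weights below are
# made up. Hashing a stable user id keeps each user's assignment sticky.
MODELS = {"control": 0.8, "candidate": 0.2}  # weights must sum to 1.0

def assign_model(user_id: str) -> str:
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    cumulative = 0.0
    for name, weight in MODELS.items():
        cumulative += weight
        if bucket < cumulative:
            return name
    return name  # guard against floating-point rounding at 1.0

# The same user always lands on the same model, so A/B results stay clean
print(assign_model("alex"), assign_model("alex"))
```

Shifting traffic between models then amounts to editing the weight table, which is the kind of
minutes-not-weeks flip the paragraph above says is out of reach today.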
There is currently a lack of a proper framework which acknowledges the above problems and offers a
production-ready solution, where data scientists, along with doing their own tasks, have complete
control over model deployment, over multiple models at the same time, and over the prediction logic
service.
Let’s look at an example of how such a platform, if it existed, would make a difference in a data
scientist’s life. We will again narrate the story from the perspective of Alex, coming in as a new
hire and working on an online learning model as his first project.
How to make life easy for a newly hired data scientist
Imagine a centralized user experience where data scientists have complete control over their models
and are able to control how they operate on it. They don’t have to worry about dependency on any
sister teams for deploying their models or for accessing data etc.
Let’s say that when Alex joins the team, he is introduced to a user experience portal where he can
operate independently on his first project. The time taken is only to understand the existing
project: what kind of data it includes, what code is part of it, and so on. This gives him a huge
head start in understanding the current state of the project. Also, when he needs to perform
experiments, he already has a place to start from and can progress ahead. This is because all
scripts by Alex’s teammates are pushed to a central repo and are reflected on a centralized
dashboard or workspace.
The data scientist also does not need to worry about where the data is, because the platform
provides a complete abstraction over it. This eliminates the blockers of figuring out where the
data lives and how to access it. Deploying is made so easy that the data scientist now just writes
prediction logic, drops it into the project, and that’s all; the framework then makes it available
as a service. Let’s look at how this would make a difference.
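The interface imagined above might feel something like the sketch below. None of these names
(`load_dataset`, `publish`, `REGISTRY`) come from a real product; they are stand-ins for what such
a framework could expose, with the dataset reduced to canned rows for illustration:

```python
# Hypothetical sketch of the platform interface imagined above. None of
# these names (load_dataset, publish, REGISTRY) come from a real product.
REGISTRY = {}

def load_dataset(name):
    # The imagined platform would resolve credentials and storage location
    # for the caller; here it just returns canned rows for illustration.
    return [{"item": "book", "price": 12.0}, {"item": "pen", "price": 1.5}]

def publish(route):
    # Decorator that registers prediction logic under a service route; the
    # platform would then expose it as an HTTP endpoint automatically.
    def wrap(fn):
        REGISTRY[route] = fn
        return fn
    return wrap

@publish("/predict-price")
def predict(item):
    prices = {row["item"]: row["price"] for row in load_dataset("catalog")}
    return prices.get(item, 0.0)

# The framework would route live traffic to the registered function
print(REGISTRY["/predict-price"]("book"))
```

The data scientist's entire contribution is the `predict` function; where the data lives and how
the route is served are the platform's problem.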
Centralized user experience for data scientists
Making code, data, credentials, models, prediction logic, etc. centrally available in a way that
addresses all of these data science development hassles is tough and not at all trivial. This is
because a data scientist’s life differs from company to company: there are different frameworks and
different niche details that need to be looked at. There is no generalized, standard process for
performing a task in the data science lifecycle of a project.
Think of a system which is able to overcome the above mentioned obstacles and offer much more than
that via a very good user experience. Think of a web interface where you can do the following:
Setup discovery: This still stays where it is and is inevitable. But the big difference is that
rather than untangling the spaghetti system we had, we are now trying to gain a one-shot big
picture of the complete framework. It is much easier to miss a simple point in a spaghetti system
than in one where everything is centralized in a single place.
Centralized code repository per project: There is no more code repetition, because Alex can find
existing code in the project itself. He can easily tell which code is live in the pipeline and
where each piece lies; everything is neatly organized per model. This saves time the data scientist
would otherwise spend writing and testing scripts. Almost half the work is already there.
Centralized data access across projects: An admin configures credentials for a data source once, as
a one-time process. This saves every new hire from hunting down how to get access to data, and it
makes the data generically available as a dataset. Data scientists also do not need to worry about
writing logic for that particular storage technology; it becomes an abstracted layer in code. The
framework could also provide details about the data itself, saving data scientists further time
when they want to understand it.
Model training and deployment: If the framework is also able to provide deployment capabilities to
data scientists, that would be outstanding. Data scientists can do whatever they want from their
desks and don’t need to coordinate with DevOps for their projects. They are completely independent
in doing their job.
Prediction logic and publishing results: Today, the data scientist’s role in the complete project
is to write prediction logic and hand it over to the software dev team; he does not get enough
admin capabilities to own the complete service himself. Think of a platform which gives the data
scientist complete control over starting and stopping the service. Then there is tremendous cost
cutting in terms of communication time between the data scientist and the software dev team.
Monitoring dashboards: Think of a platform which also provides complete visibility into incoming
traffic, service latency, autoscaling, number of incoming requests per second, etc. If this comes
bundled with the software out of the box, then it is all that is needed to see whether the service
is healthy, predicting correctly, and so on.
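To make the start/stop control from the list above concrete, here is a toy end-to-end run using
only the Python standard library. The pricing formula is a made-up stand-in for real prediction
logic; the point is that one person boots the endpoint, serves a request, and shuts it down, with
no ticket to another team:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

def predict(payload):
    # Stand-in prediction logic: price = base charge + per-unit charge
    return {"price": 5.0 + 0.5 * payload.get("quantity", 0)}

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(predict(payload)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

# Start: the data scientist boots the service himself (port 0 = any free port)
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One live request against the running endpoint
req = Request(f"http://127.0.0.1:{server.server_port}",
              data=json.dumps({"quantity": 4}).encode(),
              headers={"Content-Type": "application/json"})
resp = json.loads(urlopen(req).read())

# Stop: shutting the service down is equally self-serve
server.shutdown()
print(resp)
```

A production platform would wrap this in autoscaling and the monitoring dashboards described above,
but the ownership model is the same: the whole lifecycle sits with the data scientist.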
This essentially means that the data scientist has the project completely in his control, with all
data and code residing in the project he is working on. He does not need to worry about hassles;
his only focus is moving the project ahead. Since he usually wants to perform new live experiments,
he can also do that without any problems. Now he is also able to deploy the model to production and
take care of the service once it is booted up. Pair programming and sharing domain knowledge via a
common framework would make a group of data scientists even more productive.
Now, Alex enjoys solving problems at a quick pace with his mentor Bob and is able to focus on the
problems at hand. The hurdles of getting work done through several other teams, which added time to
the complete process, are completely eliminated. He and his data science teammates focus only on
the problems to solve and nothing else, since they now own the complete pipeline end to end.
To learn more about the product, you may contact