A Good Data Scientist is a Good Engineer First

So you’ve earned a good degree in data science and you know your statistics and mathematics. Yet finding a job as a new data science graduate is still very difficult. Most data science positions require several years of work experience or an advanced degree in the field. So why are data science jobs so hard to secure? And what makes them so valuable that they demand years of experience, an advanced degree, or both?

The answer is simple: it is very easy to do data science the wrong way. Data science is exciting and powerful, but companies care about doing it right, because a carelessly built model can end up making ethically and morally wrong decisions. More importantly, building a good data science model depends not only on a sound knowledge of statistics but also on software engineering principles, algorithm design, data structures, cloud engineering and deployment best practices.

Why does a data scientist need so many skills? Because the success of a data science model depends on its speed, its ability to pre-process and analyse data, its space and time efficiency, and how easily it plugs into the cloud infrastructure used to deploy, run and monitor it. Let’s take a look at these requirements in detail:

  • Use of the best data structures and algorithms: If your algorithm has to keep its data in sorted order and you rely on a plain list that you re-sort after every change, the repeated sorting will slow it down tremendously. Knowing about heaps (a data structure that keeps its smallest or largest element cheaply accessible) and how to use them can greatly improve the performance of such a machine learning algorithm. More generally, pairing the right data structures with efficient processing algorithms improves both the performance and the reliability of machine learning algorithms.

The right data structures also keep space and time complexity down. Extra memory and compute power come at a cost, and on cloud servers an algorithm that keeps re-sorting lists can quickly become too expensive to run.
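As a rough illustration of the point about heaps, here is a minimal sketch in Python that keeps the k largest values from a sequence in two ways: by re-sorting a plain list after every insert, and by using the standard library's `heapq` module. The function names `top_k_resort` and `top_k_heap` are made up for this example, and the task itself (tracking top scores) is only an assumed stand-in for whatever your algorithm needs to keep ordered.

```python
import heapq


def top_k_resort(values, k):
    """Baseline: keep a plain list and re-sort it after every insert."""
    best = []
    for v in values:
        best.append(v)
        best.sort(reverse=True)  # sorting work repeated on every insert
        del best[k:]             # drop everything beyond the k largest
    return best


def top_k_heap(values, k):
    """Keep a min-heap of at most k items; the smallest kept value sits at heap[0]."""
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # Replace the smallest kept value with the new, larger one.
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)


scores = [5, 1, 9, 3, 7, 2]
print(top_k_resort(scores, 3))
print(top_k_heap(scores, 3))
```

Both functions return the same answer. The difference is that the heap version only does a small amount of reordering per new value instead of sorting the list again each time, which matters more and more as the data grows.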

  • Build algorithms with libraries and systems that are universally supported: You may have built the most successful neural network to date, but you built it in Perl rather than Python or Go. With cloud software improving every day, many services have stopped supporting Perl or some of its libraries. This leads to an engineering crisis: your model cannot be deployed as it is and has to be rewritten in a different language, where the packages and implementations might differ.
  • Use of best coding practices: You also need to know the best practices of a coder in order to build comprehensive software that is easy to maintain and understand. A Python package made up of scattered standalone functions, with no object-oriented structure, is very difficult to maintain because it is not a consolidated piece of software. By building your package on sound object-oriented principles and using decorators to wrap shared behaviour around your functions, you make it more compact and easier to maintain and to fix when errors arise.
  • Integration of reliable engineering infrastructure to support monitoring: Your model must be exposed through stable, scalable REST APIs from which monitoring systems can easily and securely pull the data they need. Furthermore, all this data must be handled efficiently and transferred securely. If your model’s information is scattered rather than held in one class, or your chosen language has poor support for REST APIs, pulling data from your model becomes a problem.
  • The model must easily support an incoming stream of live data and return results to APIs: Beyond being trained in a Jupyter notebook, your model must be able to handle an incoming stream of data, make predictions, check them for validity, readjust itself for any bias, save the new versions, and send the output data to the APIs collecting the results.
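To make the coding-practices point above a little more concrete, here is a small sketch of a model kept in a single class, with a decorator adding shared behaviour (here, recording how long each call takes). Everything in it is hypothetical: `timed` and `MeanBaselineModel` are names invented for this example, and the "model" simply predicts the mean of its training targets. It also hints at why keeping a model's information in one class helps monitoring, since the timings live on the same object as the model state.

```python
import functools
import time


def timed(func):
    """Decorator: record the duration of each call on the owning object."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        result = func(self, *args, **kwargs)
        elapsed = time.perf_counter() - start
        self.timings.setdefault(func.__name__, []).append(elapsed)
        return result
    return wrapper


class MeanBaselineModel:
    """Toy model that predicts the mean of the training targets."""

    def __init__(self):
        self.mean_ = None
        self.timings = {}  # method name -> list of call durations in seconds

    @timed
    def fit(self, targets):
        self.mean_ = sum(targets) / len(targets)
        return self

    @timed
    def predict(self, n_rows):
        if self.mean_ is None:
            raise RuntimeError("call fit() before predict()")
        return [self.mean_] * n_rows


model = MeanBaselineModel().fit([1.0, 2.0, 3.0])
print(model.predict(2))
print(sorted(model.timings))
```

The timing logic is written once and reused by every method that carries the decorator, instead of being copied into each function.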

In data science, model prediction is a very small part of the entire workflow. Hence, one must build software that supports the entire life cycle of a dynamic machine learning model, deployed on a cloud suite and serving multiple APIs to collect, process and return data.
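The life cycle described above (receive a record, validate it, predict, readjust, save a new version, return a result) can be sketched in a few lines of Python. This is only an illustration under heavy assumptions: `RunningMeanModel`, `is_valid` and `serve` are hypothetical names, the "model" is just a running mean, the incremental update stands in for real retraining or bias correction, and the dictionaries yielded by `serve` stand in for payloads that a real service would send to an API.

```python
from dataclasses import dataclass, field


@dataclass
class RunningMeanModel:
    """Hypothetical online model: predicts the mean of the values seen so far."""
    count: int = 0
    mean: float = 0.0
    version: int = 0
    history: list = field(default_factory=list)  # (version, mean) snapshots

    def predict(self):
        return self.mean

    def update(self, value):
        # Incremental mean update, then record a new model version.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        self.version += 1
        self.history.append((self.version, self.mean))


def is_valid(record):
    """Accept only records (dicts) that carry a numeric 'value' field."""
    value = record.get("value")
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def serve(stream, model):
    """Yield one result payload per incoming record."""
    for record in stream:
        if not is_valid(record):
            yield {"status": "rejected", "model_version": model.version}
            continue
        version_used = model.version  # version that makes this prediction
        prediction = model.predict()
        model.update(record["value"])  # readjust after seeing the true value
        yield {"status": "ok", "prediction": prediction,
               "model_version": version_used}


model = RunningMeanModel()
incoming = [{"value": 2.0}, {"value": "bad"}, {"value": 4.0}]
for payload in serve(incoming, model):
    print(payload)
```

Even in this toy version, the prediction itself is a single line; validation, versioning and result handling make up the rest.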

From deployment to governance to security, Datatron has it all. Excited about Datatron, or not convinced yet? Check out our website and schedule a walkthrough with us!

This blogpost was written by Sohit Miglani.
