Moving from “data-centric” to “model-centric”
Every two days now we create as much information as we did from the dawn of civilization up until 2003, according to Schmidt. That’s something like five exabytes of data.
The amount of data collected today is enormous, and it keeps growing as businesses try to learn more about their users. It is now entirely possible to build a complete picture of a user’s behavior: the websites they log in to, the Starbucks they visit, their route to the office and back home, the restaurants they go to on weekends, and so on. With this much information available, today’s enterprises try to capitalize on it by working out what will appeal to users in a particular region at a particular time of year.
Today’s Data-Centric world
Being inundated with so much user data led to the rise of big data about a decade ago. This fuelled the development of systems like Hadoop and HDFS, built entirely around computation over large amounts of data, on the scale of terabytes and above. But computation on Hadoop was still slow, and organizations felt the need to respond to users’ needs quickly, in near real time. Spark and Flink were born out of that need, enabling extremely fast, parallel, in-memory computation. Serving users in near real time is still an open problem that the industry is actively trying to solve with a variety of methods.
Let’s take an example to show how projects are carried out in the industry today. Say Chris is the manager of the Data Engineering team at Phil’s Coffee. He recently found that the Phil’s Coffee shop at Market St. and 8th St. in San Francisco has a high concentration of customers between 6:00 AM and 9:00 AM. The same is true of another Phil’s Coffee shop at Mission and 2nd St. There is no other coffee shop between the two.
Looking at this pattern, he submitted an initial proposal to his management to open a new coffee shop somewhere in the middle. He argued, with data, that this would distribute the load more evenly across the two existing shops. The shops would be able to serve more customers, and shorter wait times would attract still more customers and, in turn, increase the company’s revenue.
Management liked the idea and asked him a few simple questions:
- Where exactly should the new coffee shop open?
- Will it deliver a positive return on investment (ROI)?
- If so, how large will the return be?
- How long will it take to start seeing a return?
Chris was not worried by these questions, because he knew he could answer them with data and give his management approximate predictions. He went back to his desk and translated the questions into measurable quantities: the concentration of customers commuting to offices in each area, the cost of infrastructure at various locations over the years, the number of customers likely to visit the new cafe, and the number of customers the existing cafes would gain in the future.
This data allowed Chris to make informed decisions about which path to take and how aggressively to proceed. It also helped management make decisions against clear criteria and with goals in mind. Management then based future goals and decisions on the outcome of this project.
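To make management’s ROI questions concrete, here is a minimal sketch of the arithmetic involved. The `ShopProposal` class, its fields, and every dollar figure are hypothetical and chosen only for illustration; they are not taken from any real Phil’s Coffee data.

```python
from dataclasses import dataclass


@dataclass
class ShopProposal:
    """Hypothetical inputs for a new-shop proposal (illustrative only)."""
    upfront_cost: float            # build-out and equipment, in dollars
    monthly_revenue: float         # expected monthly sales, in dollars
    monthly_operating_cost: float  # rent, staff, supplies, in dollars


def monthly_profit(p: ShopProposal) -> float:
    """Expected profit per month once the shop is open."""
    return p.monthly_revenue - p.monthly_operating_cost


def payback_months(p: ShopProposal) -> float:
    """Months until cumulative profit covers the upfront cost."""
    profit = monthly_profit(p)
    if profit <= 0:
        return float("inf")  # the shop never pays for itself
    return p.upfront_cost / profit


def roi(p: ShopProposal, months: int) -> float:
    """Simple ROI over a horizon: (total profit - upfront cost) / upfront cost."""
    return (monthly_profit(p) * months - p.upfront_cost) / p.upfront_cost


if __name__ == "__main__":
    # Assumed figures: $300k to open, $60k monthly sales, $45k monthly costs.
    proposal = ShopProposal(300_000, 60_000, 45_000)
    print(f"Payback period: {payback_months(proposal):.0f} months")
    print(f"ROI over 3 years: {roi(proposal, 36):.0%}")
```

With these assumed inputs the shop pays for itself in 20 months and returns 80% over three years. In practice the revenue and cost figures would themselves be estimates derived from the customer and infrastructure data Chris collected.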
Data-Centric is the Old World
Today, big enterprises hire many data engineers and managers like Chris to study data aggregated from various sources, looking for flaws in their systems and for places where improvements can be made. Large SQL scripts run overnight and produce a large data dump in the morning. A data engineer goes through this dump and tells their manager what improvements might be made.
The big drawback of basing the study on data alone is that the engineer can only look at what happened in the past day, week, or month. The study is entirely manual, and the engineer has to use judgment to identify where the most revenue can be made in the coming days, or where the biggest losses occurred in the past and how to reduce them.
Even though the data engineer may have scripts in place to tackle problems like this daily, manual intervention is still needed to understand what the data is saying at any point in time. In this workflow, no automated system is intelligent enough to say whether a result is good or bad.
Data-Centric, Not Enough Today
The biggest advantage of being data-centric is that data keeps your decisions informed. Big enterprises can use data to form an idea and implement it. They can then use data from the implemented idea to check the outcome, comparing predictions with actual results to see how close they came to the expected output.
As with Chris and his management team, basing your decisions on data is good and keeps you informed about the performance of your product. But much more can be done with this data today. There is so much of it that it is possible to estimate what will happen next from the history of what has already happened. We can have machines do these calculations for us, using what are simply called “machine learning models”.
Suppose Chris feeds this data into a machine learning model that predicts the number of customers arriving at each store at a particular time. With this information, the number of staff needed at a shop can be determined a priori. This avoids hassle, avoids losing customers, and instead attracts even more customers to the store. It also helps Phil’s Coffee determine staff capacity quantitatively at any point in time and optimize it.
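One common way to express “how close” a prediction came to the actual result is an error metric. The sketch below computes two standard ones, mean absolute error and mean absolute percentage error. The customer counts are invented for illustration, and the choice of these particular metrics is an assumption, not something the original workflow prescribes.

```python
def mean_absolute_error(predicted, actual):
    """Average size of the miss, in the same units as the data."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def mean_absolute_percentage_error(predicted, actual):
    """Average miss as a fraction of the actual value (actuals must be non-zero)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("percentage error is undefined when an actual value is 0")
    return sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    # Hypothetical morning customer counts for three days.
    predicted = [90, 220, 50]
    actual = [100, 200, 50]
    print(f"MAE:  {mean_absolute_error(predicted, actual):.1f} customers")
    print(f"MAPE: {mean_absolute_percentage_error(predicted, actual):.1%}")
```

A falling error over successive projects is one quantitative sign that the predictions behind each proposal are getting more reliable.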
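As a rough illustration of the staffing idea, the sketch below uses a deliberately simple historical-average baseline as a stand-in for a trained machine learning model: it predicts customers per store and hour from past counts, then converts the prediction into a staff count. The store names, the history, and the assumption that one staff member serves 25 customers per hour are all hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean

# Hypothetical history: (store, hour_of_day, customers_served)
HISTORY = [
    ("market_8th", 7, 90), ("market_8th", 7, 110),
    ("market_8th", 8, 140), ("market_8th", 8, 160),
    ("mission_2nd", 7, 60), ("mission_2nd", 7, 80),
]


def fit_hourly_baseline(history):
    """Return the average customer count for each (store, hour) pair.

    This is a baseline forecaster, not a learned model; a real system
    would likely replace it with a model trained on richer features.
    """
    buckets = defaultdict(list)
    for store, hour, customers in history:
        buckets[(store, hour)].append(customers)
    return {key: mean(values) for key, values in buckets.items()}


def staff_needed(expected_customers, customers_per_staff_hour=25):
    """Staff required for one hour, rounding up and never below one person."""
    return max(1, math.ceil(expected_customers / customers_per_staff_hour))


if __name__ == "__main__":
    forecast = fit_hourly_baseline(HISTORY)
    for (store, hour), customers in sorted(forecast.items()):
        print(f"{store} at {hour}:00 -> ~{customers:.0f} customers, "
              f"{staff_needed(customers)} staff")
```

Swapping the baseline for a real model only changes `fit_hourly_baseline`; the step that turns a prediction into a staffing decision stays the same.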
Model-Centric is the New World
Being able to predict, and to act on those predictions, is the new world we are heading into in the near future. There is now enough data for a machine to learn a user’s behavior and predict simple things for them. Machine learning can also work with the concentration of users in a particular region for a particular product and produce informed estimates, such as the number of users coming into a store between 8:00 and 9:00 AM or the price of stadium tickets near the start of a match.
Imagine big enterprises where data engineers work far more effectively because of predictions. In their daily routine, they look at the data alongside informed, reliable predictions from machine learning models that are already in place. This would help them keep the live streaming infrastructure running and use appropriate models to keep revenue growing.
Producing more than one model a month in a large organization is still seen as a challenging task. Operating multiple models at the same time without worrying about production is still a far-off vision. What if it became easy to produce any model we want and make it production-ready within a couple of weeks? What if a data engineer could look at multiple models at once and experiment freely without fear of hurting production?
Being able to operate in model space the same way we operate in data space will be a huge boost to the industry. It will transform how we decide how to improve our products, which project will bring in the most revenue, and how to prevent misadventures. It will shift thinking from “this is what is currently happening” to “this is what is going to happen, so let’s define a project to address it or implement the new feature with the maximum benefit.”
If you are interested in making your enterprise model-centric, talk to us at firstname.lastname@example.org.
Thanks for the read!