A number of new firms have taken on the 'elephant in the room' (no, not Hadoop): the problem that the US is going to be up to 1 million data scientists short of needed supply over the next decade. Several of these firms were featured in a recent article called 'Want to ditch your data scientist?' (a title that, of course, assumes you have or could afford a data scientist to begin with…).
A friend and I were calling this concept 'data scientist in a box' a few years back; I suspect we weren't the first. Anyone in the business of providing large and medium-sized companies with Big Data solutions could already see that the missing link was the shrinking availability of people with math skills. We saw that the existing inventory of statisticians trained on SAS statistical software might be square pegs when it came to unleashing open-source tools on data at a scale large enough to break old-school client/server stats applications.
Some feel negative about data scientist in a box, but that attitude ignores the reality that there aren't enough data scientists to go around. I would call it Marie Antoinette syndrome: if data scientists were food, insisting that you shouldn't try a data-scientist-in-a-box approach amounts to saying 'Let them eat cake' to those without the funds, glitz, or company size to attract a data scientist. If you can afford a data scientist, great. If you can't, you will need and want these data-scientist-in-a-box solutions. These firms give business people the choice of doing some Big Data science rather than doing nothing, which is what happens today.
One organization, DataKind, recognizes that because 'data science skills are so in demand', it will provide volunteer data scientists to work pro bono on projects focused on social good. This is much like attorneys taking pro bono cases for clients who cannot afford fee-for-service work. In pro bono legal work, there is no hope the clients will ever earn a law degree or 'teach themselves to fish'; there is no chance they will ever afford the level of service that others enjoy. It is clear that unless they are offered highly subsidized or free legal services, they will receive no advocacy at all. We need to realize that organizations without the means to afford data scientists will end up with no data science. There are no other avenues (unless their data is aligned with the social good and they can qualify for the help that DataKind offers), and DataKind cannot scale to even a tiny fraction of the socially beneficial Big Data projects, much less the commercial ones.
Data science is critical. It is firmly on the path our community is traveling as we add insight to all the processes surrounding our lives. It seems likely that when there is a real need for Big Data science and not enough individual practitioners, someone will figure out how to package up the science to as large a degree as possible. Faced with the choice of packaging it or simply going without, something, once mature, is clearly better than nothing. Most in the US would agree that healthcare has a shortage of nurses; this is no longer controversial. No one calls it 'a shortage of good nurses'. The shortage is big enough that we have a shortage of any nurses. We already had a shortage of data scientists before Big Data, before R, before predictive analytics came along.
The data-scientist-in-a-box firms are already tackling, in a software user-interface substrate, the problem of explaining scientific approaches to driving insights out of data. Firms like ClearStory, BigML, and DataHero are new and doing this today. The '99 percenter' user community does not understand the science, but they believe in it: they have seen it work at Google, LinkedIn, Facebook, and Netflix, and they use it daily. With a hungry, bought-in user community, it is a no-brainer for these firms to tackle the problem. The firms that tackled it in the past were the business intelligence vendors, but they had a natural stopping point at the limits of SQL, a declarative language. In the world of NoSQL and Hadoop approaches to Big Data, we have broken away from those limits and into the world of universally taught coding languages. That means it is actually easier to embed data science in Big Data applications, and there is no natural 'fence' corralling the analysis to the 'easy science' that business people are more likely to understand.
So the floodgates of potential math to drive insights are open, at a time when data scale is increasing fast (or has already increased) and math skills are dropping in the user community. Will we as a business community keep applications 'dumbed down' to the lowest common denominator of the end user? It won't happen. That is not our tradition. That is not our history. That is not our way. Our way is to hide the complexity and give the end user the answer. (One might argue that a better way is to 'teach a man to fish' and give business users the math skills to understand deviations and outliers and regressions and even more complexity. But in a society spoiled by the speed of getting answers via Google, let's be real: our business analysts and managers are not going to sit through the process of learning it.)
The data-scientist-in-a-box firms are paving the way toward making data science accessible to the masses. So if you can afford a personal chef, good for you. If you can afford a data scientist, good for you. For everyone else, get ready to someday open a Google-like interface, run a search, and get a result based on science you don't understand. I bet users will trust the answer as directionally correct, and better than no answer at all.