Introduction
“It is just as {Joseph} said to Pharaoh: …seven years of great abundance are coming throughout the land of Egypt, but seven years of famine will follow them…let Pharaoh appoint commissioners over the land to take a fifth of the harvest of Egypt during the seven years of abundance…this food should be held in reserve for the country, to be used during the seven years of famine that will come upon Egypt, so that the country may not be ruined by the famine.” Genesis 41:28-36
Joseph, having lost his beautiful coat and been sold into slavery by his jealous brothers, had found favor with the leader of the most powerful country on earth. He was tasked with ensuring that Egypt would survive the prophesied famine. Consider the variables involved:
- What is the annual food consumption?
- What is the projected growth rate of that consumption?
- What contingency should we add?
- Will some grain be lost over time?
- Will the famine go beyond 7 years?
[Figure: Graphical Representation of Joseph’s Grain Storage Model]
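To make those questions concrete, here is a rough sketch of how the reserve might be sized. Every number is made up for illustration, and the function name, its parameters, and the simple spoilage assumption (a fixed fraction of stored grain is lost each year) are mine, not anything from Genesis.

```python
def required_reserve(first_year_consumption, growth_rate, spoilage_rate,
                     contingency, famine_years):
    """Grain needed in storage on the first day of the famine.

    Assumes each year's grain is withdrawn at the start of the year and
    that a fixed fraction of whatever remains in storage spoils before
    the start of the next year.
    """
    needed = 0.0
    # Work backwards from the final famine year to the first.
    for year in reversed(range(famine_years)):
        consumption = first_year_consumption * (1 + growth_rate) ** year
        # Grain held for later years must survive one more year of spoilage.
        needed = consumption + needed / (1 - spoilage_rate)
    return needed * (1 + contingency)


# Hypothetical inputs, in arbitrary units of grain.
for years in (7, 9):
    reserve = required_reserve(
        first_year_consumption=100, growth_rate=0.02,
        spoilage_rate=0.05, contingency=0.10, famine_years=years,
    )
    print(f"{years}-year famine: store about {reserve:,.0f} units")
```

Running it for both 7 and 9 years gives a feel for how much the "what if the famine runs long?" question alone moves the answer.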
Considering this was four thousand years before computers were invented, and likely before the concept of zero existed, this was no small task. Data Science as we define it today has existed for as long as we have had data to supply it, whether on clay tablets in 3000 BC or in object storage today.
What Is Data Science?
“There are three kinds of lies: lies, damned lies, and statistics.” -Mark Twain (sometimes attributed to Benjamin Disraeli)
I have asked over one hundred different people for the definition of data science at conferences, networking events, and job interviews, and have received over one hundred different answers. Most answers are equally correct, given that the field has no agreed-upon definition.
For what it’s worth, my hundred-and-first definition of data science is as follows:
Data Science is the combination of statistics, computer science, and empirical modeling to convert data into actionable insights.
It reads well, but what does that mean? Let’s break it down bit by bit:
Statistics: An ancient field of mathematics that involves interpreting data. Many of the heroes of this field had traditional day jobs, such as ministers (Bayes) and brewery workers (Gosset aka ‘Student’).
Computer Science: A much more modern field than statistics, rising with the invention and growth of computers in the mid-20th century. Its more notable heroes are considered worthy of being played by actors like Benedict Cumberbatch.
Empirical Modeling: Empirical modeling, when combined with statistics and computer science, is what makes Data Science Data Science. Originally these techniques were standard regressions combined with models that computers made feasible (think Random Forest). More recently they have spread into exciting and opaque techniques such as neural networks and what is generally referred to as AI (most of which is not, in fact, anything intelligent, but that is not the focus of this section).
Data: A very broad and general term for an alternative representation of a state, event, or other classification. This representation can take the form of writing, tables of rows and columns, or media such as images or videos.
Inferences and Information: These are the actionable insights, and the goal of all data science: to create something useful from the data and computation. Inferences are predictions created by a model. Some examples:
- Is this email spam or not spam?
- Is the animal in the picture a dog or a cat?
- Is this person credit worthy or not?
- Is it a hot dog or not a hot dog?
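As a toy illustration of an inference, here is a minimal sketch of the spam question using a tiny naive Bayes classifier. The four training messages are invented, the `train` and `classify` names are my own, and a real spam filter would need far more data and care than this.

```python
from collections import Counter
from math import log

# Hypothetical training data, invented for illustration.
TRAINING = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("lunch on monday", "not spam"),
]


def train(examples):
    """Count words per label and messages per label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-score."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_examples = sum(label_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Start from the prior: how common is this label overall?
        score = log(label_counts[label] / total_examples)
        for word in text.split():
            # Add-one (Laplace) smoothing so unseen words do not zero out a label.
            score += log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


word_counts, label_counts = train(TRAINING)
print(classify("claim your free money", word_counts, label_counts))
print(classify("agenda for lunch", word_counts, label_counts))
```

The model has never seen either test message; it predicts a label from the words it has seen before, which is all an inference really is.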
Information is data that has been curated, organized, or enriched until it is actionable on its own. Examples include:
- What is the average age of our taxpayer base?
- Is the trend of median home price rising or falling over time?
- What is the most common defect type on our primary manufacturing line?
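Questions like these usually need no model at all, just some organizing. A small sketch with invented records (the ages, sale prices, defect labels, and the `price_trend` helper are all hypothetical):

```python
from collections import Counter
from statistics import mean, median

# Hypothetical records, invented purely for illustration.
taxpayer_ages = [34, 51, 47, 29, 62, 47]
home_sales_by_year = {
    2021: [210_000, 250_000, 340_000],
    2022: [230_000, 265_000, 390_000],
    2023: [240_000, 280_000, 410_000],
}
defects = ["scratch", "misalignment", "scratch", "dent", "scratch"]


def price_trend(sales_by_year):
    """Compare the median price in the first and last year on record."""
    years = sorted(sales_by_year)
    change = median(sales_by_year[years[-1]]) - median(sales_by_year[years[0]])
    if change > 0:
        return "rising"
    if change < 0:
        return "falling"
    return "flat"


average_age = mean(taxpayer_ages)
trend = price_trend(home_sales_by_year)
most_common_defect = Counter(defects).most_common(1)[0][0]

print(f"Average taxpayer age: {average_age}")
print(f"Median home price is {trend}")
print(f"Most common defect: {most_common_defect}")
```

Comparing only the first and last year is a crude way to call a trend; it is here to show the idea, not to recommend the method.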
So what? How can Data Science impact an organization?
“You’re sitting on a winning lottery ticket and {you won’t} cash it in” – Chuckie, Good Will Hunting
Much like the popular (if mythical) claim that humans use only 10% of their brains, most enterprises likely capture only a single-digit percentage of the potential value of their data. Data science can be applied in the following ways:
| Application | Example |
| --- | --- |
| Automated Actions and Decisions | Your bank declining your transaction due to their fraud detection algorithm |
| Improved Decision Making and Support | Your mapping application giving you three routes to choose from |
| Strategic Understanding from Storytelling | Do your casino customers care about ‘resort fees’ on their hotel bills? Do cars built on Tuesdays have higher defect rates? |
| Experimentation | Do customers prefer email or mailed coupons? Should we use light mode or dark mode for our website? |
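For the experimentation row, one common way to read the results of an email-versus-mailed-coupon test is a two-proportion z-test. The sketch below uses only the Python standard library; the redemption counts are invented, and the function name is my own.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    Does not handle the degenerate case where nobody (or everybody)
    converts in both groups, which makes the standard error zero.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both groups are the same.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical results: coupons redeemed out of coupons sent.
z, p_value = two_proportion_z_test(
    successes_a=120, n_a=1000,  # email coupons
    successes_b=90, n_b=1000,   # mailed coupons
)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

A small p-value suggests the difference in redemption rates is unlikely to be chance alone. Whether it is big enough to matter to the business is a separate question.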
There have been decades of papers and studies on the competitive advantage of data science for enterprises, so many that I’d suggest you look things up yourself. HBR has several excellent articles on general use cases. In oil and gas, we found several industry-specific justifications that make the investment a high-NPV decision.
What is needed to succeed?
“Nothing can be made except by makers.” -Henry Ford
Most managers and executives will agree that converting company data into value is an important endeavor. Attempts at execution have produced a sliding scale of outcomes, too many of which are departments of highly paid individuals wandering around a company looking for problems for their solutions.
To succeed, data science needs the following prerequisites, in order of importance:
- Data
- Organizational Buy-In
- Organizational Strategy
- Proper Staffing
- Proper Tooling
To show this graphically:
[Figure: the prerequisites for data science success, in order of importance]
To break down each aspect:
Data: This is stating the obvious, but having data that can be converted into actions and insights is the most important prerequisite for a successful data science implementation in an enterprise. Generally this data will come from a variety of sources, such as ERP systems, operational applications, or field sensors. I debated placing organizational buy-in first, but organizational antibodies can be addressed more easily than data can be conjured.
Organizational Buy-In: “What do you think you know about my job that I don’t already know?” is something I’ve heard in several forms on different field visits. This is not an invalid challenge, and I’m grateful when people are comfortable enough to push back. In my experience, technicians and managers often optimize to local minima and maxima, and don’t see that there are step changes beyond their heuristic set points. Like John Henry in the folk legend, many steel-driving men (and women) would rather die of exhaustion than let automation make their lives easier.
Organizational Strategy: Critically important. How is data science to be implemented in our organization? This generally means that data science outputs need to be integrated into business processes. As with most change initiatives, top-down support is the most important driver of success. There almost certainly will not be a bottom-up groundswell of demand for data science in any enterprise.
Staffing and Tooling: Staffing and tooling are the details that will be handled by a strong data science leader. Staffing is the more important of the two: a small group of talented people who work well together will run circles around larger groups that lack leadership and skill. The importance of tooling is generally exaggerated, and tooling should be addressed after the other prerequisites are in place. ‘R vs Python’, ‘Databricks vs SageMaker’, and ‘TensorFlow vs PyTorch’ are debates similar to ‘Audi A8 vs Mercedes S-Class’: both will work fine. Under staffing and tooling I’d also include project management techniques. In my experience Kanban is the best format for data science teams, since it encourages continuous delivery and minimizes the distractions of meetings and ceremonies, which can be onerous.
Conclusion
Data science is an area of growing importance for enterprises. Companies that embrace data science and integrate it into their workflows will outperform laggards. Success with data science in an enterprise also takes conscious effort, and it will almost certainly require top-down support.
The author can be contacted by email at Matt@LamplightLab.com