Author's Note: I relied heavily on the expertise of my original data architect, Pete Baker, for this chapter. Pete taught me so much on the job prior to his retirement and also graciously reviewed this content. Thank you, Pete.
We choose to go to the Moon in this decade and do the other things, not because they are easy, but because they are hard.
John F. Kennedy
A company’s data landscape is both the foundation and a hard requirement for any data science initiative. Data strategy is incredibly hard to execute, and no one gets it right; some companies just get closer than others. This chapter will address the following subjects:
- What is data and what forms does it come in
- What is data strategy and what does right look like
- How should data systems be designed to enable business outcomes, especially data science outcomes
Please keep in mind the focus of this book is data science, but data strategy has major overlaps between data science and analytics/reporting. I originally cut my teeth in both analytics and reporting and I’ll touch on the more traditional cousins of data science in this chapter when it makes sense to do so.
Overview of Data
It is a capital mistake to theorize before one has data.
Sherlock Holmes, A Study In Scarlet
Given that data is the most foundational need for proper data science (Chapter 1), a proper data strategy is at the basic foundation of any successful data science practice. For data science to succeed in an enterprise, a proper data strategy needs to meet the following mission statement:
A successful data strategy provides data scientists with the data to both train and deploy models so that data science products can provide useful and timely inferences.
This data will come in different colors and flavors, but in general will fit the following paradigms:
| Type of Data | Short Description | Examples |
| --- | --- | --- |
| Structured | Data in rows and columns | Excel spreadsheets, relational database tables |
| Semi-Structured | Data that does not have rows and columns but instead has tags or keys, usually in a nested structure | JSON representations of user application interactions, XML of an email |
| Unstructured | Data, usually text, that does not have a strong organization | PDF documents |
| Time Series | Data describing points in time | IoT / SCADA data |
| Geospatial | Data describing points in space | Geotags on images, shapefiles |
| Audio / Image / Video | Digital representations of sound and visuals | Ring camera videos |
If you’re reading this, you probably know what all of these are; feel free to skip ahead to the next section (Designing Your Data Strategy). Seriously, this will be remedial for you. You’re still here? The following explains each type of data; we’ll refer to these types in this chapter and beyond.
Tabular Data
Tabular data is the simplest data structure and the most intuitive to end users. Humans have been using rows and columns since the time of clay tablets. It’s my belief that people fit things into tabular data at all costs due to the prevalence of Excel in the business world; people fall back on what they know. Tabular databases have advantages beyond being easy to understand: tabular data is very dependable. Database rules can be applied so that transactions are ACID (atomic, consistent, isolated, durable), which is attractive for the reliability of databases and the results that come from them.
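To make the ACID point concrete, below is a minimal sketch using Python’s built-in sqlite3 module; the database file, table, and columns are hypothetical, and a real enterprise would use a full relational database rather than SQLite.

import sqlite3

conn = sqlite3.connect("customers.db")  # hypothetical database file
conn.execute(
    "CREATE TABLE IF NOT EXISTS customers "
    "(first_name TEXT, last_name TEXT, address TEXT, phone TEXT)"
)

# Atomicity: both inserts commit together, or neither is written
try:
    with conn:  # commits on success, rolls back on any exception
        conn.execute("INSERT INTO customers VALUES (?, ?, ?, ?)",
                     ("John", "Smith", "123 Main St", "555-555-1234"))
        conn.execute("INSERT INTO customers VALUES (?, ?, ?, ?)",
                     ("James", "Brown", "999 Polk Ave", "555-555-9876"))
except sqlite3.Error:
    print("Transaction rolled back; no partial data written")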
An example of tabular data is below:
| First Name | Last Name | Address | Phone Number |
| --- | --- | --- | --- |
| John | Smith | 123 Main St | 555-555-1234 |
| James | Brown | 999 Polk Ave | 555-555-9876 |
| Susan | Jones | 12 Birch Pl | 555-555-5678 |
The advantages and disadvantages of tabular data are listed below:
| Pros | Cons |
| --- | --- |
| Intuitive for end users | Difficult to update table structure after provisioning it |
| Extendable (just add another row) | Not all data populations naturally conform to a tabular model |
| Generally ACID | |
Semi-Structured
Semi-structured data arose out of the need to store data whose structure does not fit into traditional columns and rows and may change in the future. Non-tabular data’s utility increased greatly with the rise of the internet. Any time you interact with a platform like Instagram, TikTok, or Twitter, almost everything on the back end is semi-structured. What makes semi-structured data structures attractive is that they can be designed to meet requirements that extend beyond standard columns and rows. Semi-structured databases can have their schemas updated easily; if you need to add another attribute, you can do so without having to change any of the previous records. Think of how limiting it would be to Instagram if they had to execute large database changes with every feature change.
Data in semi-structured formats is generally not ACID but instead BASE (basically available, soft state, eventually consistent). This paradigm trades away knowing the exact, atomic state of the data at any moment in exchange for lower compute and coordination costs. NoSQL is often applied in distributed compute environments; today’s internet platforms live in multiple regions and are usually optimized for their geographical areas. Generally, events happen in each region, and processes that run periodically ‘catch things up’ in the other regions. Records can be created, updated, or removed loosely, and a database query will return something that is mostly accurate. This is usually good enough: knowing that someone’s TikTok post will be available to you in the near future meets the requirements of most users.
| Pros | Cons |
| --- | --- |
| Easily applied to data that doesn’t fit into traditional columns and rows | Generally requires indexing for repeatable queries |
| Extremely easy to update data structure after provisioning without negatively impacting the overall data set | Lower query performance than like-for-like tabular databases, but usually not significantly worse |
| Easier to read and write data without the ACID requirements of traditional databases | Generally BASE (not a big problem for most applications) |
An example of semi-structured data is below:
[
  {
    "FirstName": "John",
    "LastName": "Smith",
    "Contact": {
      "Phone": "123-456-5555",
      "Address 1": "123 Main St",
      "City": "New York City",
      "State": "New York"
    }
  },
  {
    "FirstName": "Jane",
    "LastName": "Jones",
    "Contact": {
      "Phone": "555-123-4567",
      "Address 1": "123 Maple Ln",
      "City": "Seattle",
      "State": "Washington"
    }
  }
]
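A quick sketch of the schema flexibility described above, using Python’s standard json module; the file name and the new Email attribute are hypothetical. The first record gains a field without any change to the second:

import json

with open("customers.json") as f:  # hypothetical file holding the records above
    records = json.load(f)

# Add a new attribute to one record; existing records are untouched
records[0]["Contact"]["Email"] = "john.smith@example.com"  # hypothetical field

print(json.dumps(records, indent=2))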
Unstructured Data
Unstructured data is data that generally begins in a non-digital format and is digitized later. Most companies have millions of paper documents, contracts, leases, and invoices that could provide significant value if digitized and integrated into a system that enables insights and inferences. Unstructured data is usually the last frontier for data science in enterprises, mostly because it’s a significant amount of work to convert the data in place into data that can be consumed for training and inference.
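As a small taste of that conversion work, here is a minimal sketch that pulls raw text out of a PDF using the pypdf library; the file name is hypothetical, and a pure image scan would additionally require OCR before any text comes out.

from pypdf import PdfReader  # pip install pypdf

reader = PdfReader("signed_lease.pdf")  # hypothetical digitized document
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])  # raw text still needs heavy cleanup before training or inference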
Time Series
Data that is focused on points in time is generally called time series data. Although it can take many forms, in its most basic form time series data can be stored in the format of “Attribute | Timestamp | Value” (especially in Operational Technology (OT) systems). Outside of OT systems, time series data can contain other attributes that enrich the data for the same time slice. Time series databases are optimized for specific types of queries, generally pulling by attribute and aggregating over a period of time (‘pull the last 24 hours of this temperature sensor and give me an average’).
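A minimal sketch of that last-24-hours-average query, using pandas on hypothetical readings in the basic attribute/timestamp/value form:

import pandas as pd

readings = pd.DataFrame({
    "attribute": ["Temperature", "Pressure", "Temperature"],
    "timestamp": pd.to_datetime(
        ["2014-12-30 04:36:01", "2014-12-30 04:37:03", "2014-12-30 04:38:34"],
        utc=True,
    ),
    "value": [76, 342, 77],
})

# Average temperature over the trailing 24 hours
cutoff = readings["timestamp"].max() - pd.Timedelta(hours=24)
temps = readings[(readings["attribute"] == "Temperature")
                 & (readings["timestamp"] >= cutoff)]
print(temps["value"].mean())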
Something to be considered with time series data is the conversion required for time zones, daylight saving time, leap seconds, and more and more horrible details. The generally accepted pattern is to store all times as UTC (Coordinated Universal Time) and convert them to the required time zone at runtime. Quick aside: if you’re ever curious and want to waste an afternoon, spend some time looking up the history of Greenwich Mean Time and the associated observatory. It’s fascinating how an institution founded nearly three hundred and fifty years ago dictates both the universal conversion point for time and the 0 degree baseline of longitude.
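A minimal sketch of the store-as-UTC, convert-at-runtime pattern using Python’s standard library; the display time zone is an assumption:

from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

stored = datetime(2014, 12, 30, 4, 36, 1, tzinfo=timezone.utc)  # persisted in UTC
local = stored.astimezone(ZoneInfo("America/Chicago"))  # converted at read time
print(stored.isoformat(), "->", local.isoformat())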
An example of time series data is below:
| Attribute | Timestamp | Value |
| --- | --- | --- |
| Temperature | 12/30/2014 4:36:01 AM | 76 |
| Pressure | 12/30/2014 4:37:03 AM | 342 |
| Temperature | 12/30/2014 4:38:34 AM | 77 |
The advantages and disadvantages of time series data are listed below:
| Pros | Cons |
| --- | --- |
| Data sets are optimized around points in time | Time conversions can be problematic if the data architecture isn’t done properly |
| Highly performant given proper indexes | |
Geospatial
Similar to how time series data structures enable cataloging when something happened, geospatial data structures enable the where. Generally these data structures relate to points on the surface of the earth, and if needed add an attribute to show if something is above (think planes or Superman) or below (like drill bits or Dune sandworms). This data and practice are highly specialized. Unless you have a specific need to apply geospatial data, I wouldn’t recommend dedicating brain cells to this area of human expertise.
Geospatial data is usually stored in three major formats:
- Raster
- Vector
- Tabular with Geospatial Elements
Raster: Raster data is basically images with a geospatial anchor. Like other images, they can be resized based on the perspective and framing of the use case. If you look at satellite imagery in Google Maps, you’re almost certainly viewing data stored in raster format on the back end.
[Figure: example raster outputs]
Vector: Vector format stores points or groups of points as data objects. These data objects can form ‘shapes’, which is why the most common files you will run into are called shapefiles. Consider the square below:
[Figure: a unit square]
The square above would be stored in a format like [ [0,0], [0,1], [1,1], [1,0] ]. These shapes will have an anchor to a specific point in space: either they will be given a reference point to move relative to, or they will be stored in absolute coordinates.
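Here is a minimal sketch of working with such a shape using the shapely library, with the coordinates from the square above:

from shapely.geometry import Polygon  # pip install shapely

square = Polygon([(0, 0), (0, 1), (1, 1), (1, 0)])
print(square.area)  # 1.0

# Shapes support spatial predicates, e.g. containment of a smaller square
inner = Polygon([(0.25, 0.25), (0.25, 0.75), (0.75, 0.75), (0.75, 0.25)])
print(square.contains(inner))  # True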
Tabular with Geospatial Elements: This is my definition of this category; you probably won’t find this sourced online. Geospatial data is usually in specific standardized formats, such as WGS 84 or NAD 83. Older datasets could be in NAD 27, which is built on an older, less accurate model of the earth’s shape. Geospatial specialists have strong thoughts on what formats should be used and how it’s the end of the world if it’s not stored in a specific way. Maybe, but if you know the format you can almost always convert it as needed at runtime.
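That runtime conversion is usually a few lines; below is a sketch using the pyproj library (EPSG:4267 and EPSG:4326 are the standard identifiers for NAD 27 and WGS 84; the coordinates are hypothetical):

from pyproj import Transformer  # pip install pyproj

# NAD 27 (EPSG:4267) -> WGS 84 (EPSG:4326)
to_wgs84 = Transformer.from_crs("EPSG:4267", "EPSG:4326", always_xy=True)
lon, lat = to_wgs84.transform(-95.3698, 29.7604)  # hypothetical point near Houston
print(lon, lat)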
The advantages and disadvantages of geospatial data are listed below:
| Pros | Cons |
| --- | --- |
| Data sets are optimized around points in space | Idiosyncratic data models with an opinionated community |
| Ideal for maps | Older data sets are built on less accurate models of the earth |
A Brief Step Into Metadata
Metadata, or ‘data about data’, can be attached to any of the data types above. It adds context and information about data. This context could be who or what created the data, how the data relates to other data, the permissions governing who can read and write the record, and many other things. Think about pictures on your phone: if you click into a picture, it will give you information on the device that took the image, the geolocation of the image, the zoom applied, and many other things. That metadata rides along with the picture if it’s saved or shared. Good metadata can allow analysts and data scientists to understand data without having to ask a subject matter expert.
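You can inspect that ride-along metadata yourself; here is a minimal sketch reading EXIF tags from an image with the Pillow library (the file name is hypothetical):

from PIL import Image
from PIL.ExifTags import TAGS  # pip install Pillow

img = Image.open("vacation_photo.jpg")  # hypothetical image
for tag_id, value in img.getexif().items():
    print(TAGS.get(tag_id, tag_id), ":", value)  # e.g. Make, Model, DateTime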
Designing Your Data Strategy
“Good design is good business.”
Thomas Watson Jr, former IBM CEO
A company’s data systems need to be designed to maximize business value, which often means designing the system for reporting and data science. Given that companies are living, breathing, always-changing organisms, this will almost never start from a blank slate. Most company landscapes will be incredibly convoluted. This is a great opportunity! Things that are easy aren’t worth doing.
We recommend working backwards to design your data strategy, starting with the business outcomes you need to achieve:
[Figure: working backwards from business outcomes to the data that enables them]
You can imagine the number of questions that arise from the simple flow above. For the sake of exercise, consider a company that calculates and sells credit scores in the graphic below:
[Figure: the working-backwards flow applied to a credit score company]
And that’s just the beginning. When answering the above, enterprises must design their data architecture to be:
- Secure
- Performant for Inference in Production
- Accessible for Accurate Training
And should aim for these nice-to-haves:
- Native to your Data Science Tooling
- Accessible for Users
Secure
Security comes first and foremost. This is a basic right-to-operate requirement. Pragmatically, someone won’t get fired over a bad model. A lot of people will get fired if data gets out. For most companies their data contains insights into real people, and it’s unethical not to apply security to that data. Enterprises must ensure that data, especially personal data, is maintained according to the trust and confidence of the people it represents. The mechanics of this will vary. Some companies may want to go as far as to have dedicated red teams challenging current designs to improve them (a good example of this spirit in practice is Netflix’s effort to constantly reduce risk and increase performance through chaos engineering).
If you mess up data security you go straight to jail.
Performant for Inference in Production
The second most important thing: if you can’t produce insights when they are needed, nothing else matters. The nature of the company and industry will dictate the requirements. For example, credit scores do not need to be generated on demand; they can be created in a batch fashion weekly or daily to meet the business need. Data-first companies such as Meta or Pinterest need to have inferences available in timescales of thousandths of a second. These two defining examples will require two very different data architectures.
Accessible for Accurate Training
Models need to come from somewhere. Many firms make it difficult for data scientists to access the data they need to do their work, mostly due to the first and foremost requirement of security. Standard software development works through three (or four) different environments designed to test products before they reach production…data science is not standard software development. A firm’s data architecture has to enable data scientists to experiment and train models. There are a few patterns for this. We feel that having a controlled environment for data science is best; this allows data scientists to hammer their segmented source of data without fear of harming the greater ecosystem. It also allows for security to be a key design feature.
Native to your Data Science Tooling
Let me explain this one. If you’re working in a cloud environment, there’s a significant amount of data products and tooling available. Many of them are not the same blue or orange colors as your cloud provider. This is, unfortunately, a big deal. Cloud providers, as a feature and not a bug, make it difficult to link native data science tooling with these frenemy data technologies. This results in significant redundant data replication, bespoke integrations, and other suboptimal technology choices. Try to avoid this if you can.
Accessible for Users
Here’s a good litmus test for your company to see if you’re ‘data first’ or in the trenches: if you have a new hire, how long does it take to get them through the chutes and ladders of access requests, group affiliations, and network changes before they can actually do work? If your answer is 2-4 weeks, then you are in the trenches. Access control for data systems is nontrivial, and most companies don’t spend the time and effort to get it right so that it’s seamless and scales across users.
Human Incentives
Most of economics can be summarized in four words: People respond to incentives.
Steven Landsburg
I want to go over why data systems are the way they are-I believe it comes down to incentives. Everything comes down to incentives.
| Incentives of Data Group | Incentives of Data Science Group |
| --- | --- |
| Support the operational needs of the business so they don’t blame you for things not working | Deliver products to customers |
| Don’t go to jail | Receive accolades |
| Stop people yelling at you | Make LinkedIn posts |
Functionally, the data group has incentives to operate like a regulated utility company, and the data science group has incentives to operate like a startup. These are not congruent. The data group is charged with meeting the competing, and often conflicting, needs of the operational and analytical business groups. This is difficult to do, and the best data groups will design their data architecture so that it can simultaneously support domains of access and domains of control. There is a time horizon aspect to serving each group: if a data group ignores the operational business, the company will fail quickly; if it doesn’t meet the needs of the analytical business, the company will still fail, just much more slowly. An analogy would be going without air (not supporting the operational business) vs going without food (not supporting the analytical functions).
Aligning Incentives With Organizational Structure
There are a few ways to address the incentives problem with organizational structure. One way is to combine the data and data science groups in some way. This is a good option for small companies. If this isn’t an option then some kind of federated model where data engineering resources are embedded within the data science group will also work well. Enterprises also need to ensure the operational business is being served by the data group-in larger companies this is a much harder lift than taking care of the data science teams.
Aligning Incentives As An Enterprise
Aligning incentives is much more difficult than a hand wave at an org chart, but I’d argue it’s a much better option long term. To do this right, the incentives of the data group need to enter the data science domain and vice versa. This way the data scientists are also worried about going to jail and getting yelled at, which is a good thing. The data team can then share in the glory of great models and outcomes, and feel like they’re providing real value (and receive real accolades).
Designing Your Data Architecture
To achieve great things, two things are needed: a plan, and not quite enough time.
Leonard Bernstein
A correct data architecture is the lifeblood of a company that wants to extract value from its data. Think about your organization: where is the data architect in the pecking order? Do you even have one? A company’s data architect should be one of the most important people in a technology group, second only to the enterprise architect in terms of respect and influence. A good data architect will do the following:
- Make pretty drawings
- Design and execute a vision on how data will be converted into value
- Most importantly, work across teams (Sharks and Jets, silos) and influence the entire organization to understand the importance and impact of data and the corresponding strategy
So what makes a good data architecture? The first consideration is your stakeholders. The list will vary by company and industry but a good general list of stakeholders could be:
- Core operational business
- Finance and Accounting
- Business Analytics
- Data Science
Other considerations would include source systems and your necessary business outcomes. A baseline data architecture is shown below:
[Figure: baseline data architecture, from source systems on the left to ML and analytics consumers on the right]
If you failed the reading portion of your last eye test, the key points moving left to right are:
- Data from source systems land in a common area
- Data from source systems is transformed and standardized
- Data can then be split for either ML or Analytics workflows
What’s useful about that design is that it can satisfy both business intelligence and AI/ML needs simultaneously. Data that is typically tabular can be sent to business intelligence environments, and the more diverse data types needed by data science can be sent to a separate pipeline. Below is a separate diagram (because I drew it, and come hell or high water I’ll use it) showing a typical architecture for manufacturing or energy. It’s very similar to the baseline diagram above, but with a stronger focus on OT data and the requirements that come with it:
[Figure: typical data architecture for manufacturing or energy, with an OT focus]
Moving from left to right:
- If you’re a company that has sensors or field data, you’ll likely have a rich real-time stream of data from thousands of devices. What a blessing and a curse. This data will be noisy, messy, sometimes wrong, and will need to be mastered to a place or piece of equipment to add the context you need. This is all a significant amount of work
- The message bus in the middle of the diagram is critically important if you want to run inference in real time (see the sketch after this list). Message buses aren’t new, but they haven’t been common in companies focused on business systems, which results in teams that aren’t comfortable with them
- Having a proper data warehouse is critical. This enables mixing of different source systems for valuable outcomes. Companies have many analysts who work exclusively in their ERP systems; think about the insights that could be gained by mixing this data with our business and field systems
- Missing from this diagram, on purpose, is master data. You can place master data at different places in this diagram, and you’d have to. No matter what, you’ll need a program in place to build and maintain it. There are entire books written on this subject
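To make the message bus point concrete, here is a minimal sketch using the kafka-python client. The broker address, topic name, and message shape are all assumptions, and a production setup would add authentication, error handling, and schema management:

import json
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

BROKER = "localhost:9092"  # assumption; real clusters span multiple brokers

# Field gateway publishes sensor readings onto the bus
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("sensor-readings", {"sensor": "temp-01", "value": 76.2})
producer.flush()

# Inference service subscribes and scores each reading as it arrives
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers=BROKER,
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    reading = message.value
    # score = model.predict(reading)  # model loading omitted from this sketch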
Architectures will vary based on the type of company, the history of systems, and beliefs and proclivities of the architects. There’s no right architecture, but there certainly are wrong ones.
Conclusion
A journey of a thousand miles begins with a single step.
Lao Tzu
If there’s one thing I hope you gained from this chapter, it’s that none of the above happens by accident. It takes significant vision and diligent execution to get to an end state where data science can drive value for an enterprise in a consistent way. It’s also extremely rewarding to see a data landscape you built relied upon by hundreds of people for years to come. So what are you waiting for?