Author's Note: I relied heavily on the expertise of my original data architect, Pete Baker, for this chapter. Pete taught me so much on the job prior to his retirement and also graciously reviewed this content. Thank you, Pete.
We choose to go to the Moon in this decade and do the other things, not because they are easy, but because they are hard.
John F. Kennedy
A company’s data landscape is both the foundation and a hard requirement for any data science initiative. Data strategy is incredibly hard to execute, and no one gets it right; some companies just get closer than others. This chapter will address the following subjects:
- What is data and what forms does it come in
- What is data strategy and what does right look like
- How should data systems be designed to enable business outcomes, especially data science outcomes
Please keep in mind the focus of this book is data science, but data strategy has major overlaps between data science and analytics/reporting. I originally cut my teeth in both analytics and reporting and I’ll touch on the more traditional cousins of data science in this chapter when it makes sense to do so.
Overview of Data
It is a capital mistake to theorize before one has data.
Sherlock Holmes, A Study In Scarlet
Given that data is the most foundational need for proper data science (Chapter 1), a proper data strategy is at the basic foundation of any successful data science practice. For data science to succeed in an enterprise, a proper data strategy needs to meet the following mission statement:
A successful data strategy provides data scientists with the data to both train and deploy models so that data science products can provide useful and timely inferences.
This data will come in different colors and flavors, but in general will fit the following paradigms:
| Type of Data | Short Description | Examples |
| --- | --- | --- |
| Structured | Data in rows and columns | Excel spreadsheets, relational database tables |
| Semi-Structured | Data that does not have rows and columns but instead has tags or keys, usually in a nested structure | JSON representations of user application interactions, XML of an email |
| Unstructured | Data, usually text, that does not have a strong organization | PDF documents |
| Time Series | Data describing points in time | IoT / SCADA data |
| Geospatial | Data describing points in space | Geotags on images, shapefiles |
| Audio / Image / Video | Digital representations of sound and visuals | Ring camera videos |
If you’re reading this, you probably know what all of these are; feel free to skip ahead to the next section (Designing Your Data Strategy). Seriously, this will be remedial for you. You’re still here? The following explains each type of data; we’ll refer to these types in this chapter and beyond.
Tabular Data
Tabular data is the simplest data structure and the most intuitive to end users. Humans have been using rows and columns since the time of clay tablets. It’s my belief that people fit things into tabular data at all costs due to the prevalence of Excel in the business world; people fall back on what they know. Tabular databases have advantages beyond being easy to understand: tabular data is very dependable. Database rules can be applied so that transactions are ACID (atomic, consistent, isolated, durable), which is attractive for the reliability of databases and the results that come from them.
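To make the ACID point concrete, below is a minimal sketch using Python’s built-in sqlite3 module; the database file, table, and columns are hypothetical, and a real enterprise would use a full relational database rather than SQLite.

import sqlite3

conn = sqlite3.connect("customers.db")  # hypothetical database file
conn.execute(
    "CREATE TABLE IF NOT EXISTS customers "
    "(first_name TEXT, last_name TEXT, address TEXT, phone TEXT)"
)

# Atomicity: both inserts commit together, or neither is written
try:
    with conn:  # commits on success, rolls back on any exception
        conn.execute("INSERT INTO customers VALUES (?, ?, ?, ?)",
                     ("John", "Smith", "123 Main St", "555-555-1234"))
        conn.execute("INSERT INTO customers VALUES (?, ?, ?, ?)",
                     ("James", "Brown", "999 Polk Ave", "555-555-9876"))
except sqlite3.Error:
    print("Transaction rolled back; no partial data written")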
An example of tabular data is below:
| First Name | Last Name | Address | Phone Number |
| --- | --- | --- | --- |
| John | Smith | 123 Main St | 555-555-1234 |
| James | Brown | 999 Polk Ave | 555-555-9876 |
| Susan | Jones | 12 Birch Pl | 555-555-5678 |
The advantages and disadvantages of tabular data are listed below:
| Pros | Cons |
| --- | --- |
| Intuitive for end users | Difficult to update table structure after provisioning it |
| Extendable (just add another row) | Not all data populations naturally conform to a tabular model |
| Generally ACID | |
Semi-Structured
Semi-structured data arose out of the need to store data whose structure does not fit into traditional columns and rows and may change in the future. Non-tabular data’s utility increased greatly with the rise of the internet. Any time you interact with a platform like Instagram, TikTok, or Twitter, almost everything on the back end is semi-structured. What makes semi-structured data structures attractive is that they can be designed to meet requirements that extend beyond standard columns and rows. Semi-structured databases can have their schemas updated easily; if you need to add another attribute, you can do so without having to change any of the previous records. Think of how limiting it would be to Instagram if they had to execute large database changes with every feature change.
Data in semi-structured formats is generally not ACID but instead BASE (basically available, soft state, eventually consistent). This paradigm trades away knowing the exact, atomic state of the data at any moment in exchange for lower compute and coordination costs. NoSQL is often applied in distributed compute environments; today’s internet platforms live in multiple regions and are usually optimized for their geographical areas. Generally, events happen in each region, and processes that run periodically ‘catch things up’ in the other regions. Records can be created, updated, or removed loosely, and a database query will return something that is mostly accurate. This is usually good enough: knowing that someone’s TikTok post will be available to you in the near future meets the requirements of most users.
| Pros | Cons |
| --- | --- |
| Easily applied to data that doesn’t fit into traditional columns and rows | Generally requires indexing for repeatable queries |
| Extremely easy to update data structure after provisioning without negatively impacting the overall data set | Lower query performance than like-for-like tabular databases, but usually not significantly worse |
| Easier to read and write data without the ACID requirements of traditional databases | Generally BASE (not a big problem for most applications) |
An example of semi-structured data is below:
[
  {
    "FirstName": "John",
    "LastName": "Smith",
    "Contact": {
      "Phone": "123-456-5555",
      "Address 1": "123 Main St",
      "City": "New York City",
      "State": "New York"
    }
  },
  {
    "FirstName": "Jane",
    "LastName": "Jones",
    "Contact": {
      "Phone": "555-123-4567",
      "Address 1": "123 Maple Ln",
      "City": "Seattle",
      "State": "Washington"
    }
  }
]
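A quick sketch of the schema flexibility described above, using Python’s standard json module; the file name and the new Email attribute are hypothetical. The first record gains a field without any change to the second:

import json

with open("customers.json") as f:  # hypothetical file holding the records above
    records = json.load(f)

# Add a new attribute to one record; existing records are untouched
records[0]["Contact"]["Email"] = "john.smith@example.com"  # hypothetical field

print(json.dumps(records, indent=2))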
Unstructured Data
Unstructured data is data that generally begins in a non-digital format and is digitized later. Most companies have millions of paper documents, contracts, leases, and invoices that could provide significant value if digitized and integrated into a system that enables insights and inferences. Unstructured data is usually the last frontier for data science in enterprises, mostly because it’s a significant amount of work to convert the data in place into data that can be consumed for training and inference.
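As a small taste of that conversion work, here is a minimal sketch that pulls raw text out of a PDF using the pypdf library; the file name is hypothetical, and a pure image scan would additionally require OCR before any text comes out.

from pypdf import PdfReader  # pip install pypdf

reader = PdfReader("signed_lease.pdf")  # hypothetical digitized document
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])  # raw text still needs heavy cleanup before training or inference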
Time Series
Data that is focused on points in time is generally called time series data. Although it can take many forms, in its most basic form time series data can be stored in the format of “Attribute | Timestamp | Value” (especially in Operational Technology (OT) systems). Outside of OT systems, time series data can contain other attributes that enrich the data for the same time slice. Time series databases are optimized for specific types of queries, generally pulling by attribute and aggregating over a period of time (‘pull the last 24 hours of this temperature sensor and give me an average’).
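A minimal sketch of that last-24-hours-average query, using pandas on hypothetical readings in the basic attribute/timestamp/value form:

import pandas as pd

readings = pd.DataFrame({
    "attribute": ["Temperature", "Pressure", "Temperature"],
    "timestamp": pd.to_datetime(
        ["2014-12-30 04:36:01", "2014-12-30 04:37:03", "2014-12-30 04:38:34"],
        utc=True,
    ),
    "value": [76, 342, 77],
})

# Average temperature over the trailing 24 hours
cutoff = readings["timestamp"].max() - pd.Timedelta(hours=24)
temps = readings[(readings["attribute"] == "Temperature")
                 & (readings["timestamp"] >= cutoff)]
print(temps["value"].mean())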
Something to be considered with time series data is the conversion required for time zones, daylight saving time, leap seconds, and more and more horrible details. The generally accepted pattern is to store all times as UTC (Coordinated Universal Time) and convert them to the required time zone at runtime. Quick aside: if you’re ever curious and want to waste an afternoon, spend some time looking up the history of Greenwich Mean Time and the associated observatory. It’s fascinating how an institution founded nearly three hundred and fifty years ago dictates both the universal conversion point for time and the 0 degree baseline of longitude.
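A minimal sketch of the store-as-UTC, convert-at-runtime pattern using Python’s standard library; the display time zone is an assumption:

from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

stored = datetime(2014, 12, 30, 4, 36, 1, tzinfo=timezone.utc)  # persisted in UTC
local = stored.astimezone(ZoneInfo("America/Chicago"))  # converted at read time
print(stored.isoformat(), "->", local.isoformat())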
An example of time series data is below:
| Attribute | Timestamp | Value |
| --- | --- | --- |
| Temperature | 12/30/2014 4:36:01 AM | 76 |
| Pressure | 12/30/2014 4:37:03 AM | 342 |
| Temperature | 12/30/2014 4:38:34 AM | 77 |
The advantages and disadvantages of time series data are listed below:
| Pros | Cons |
| --- | --- |
| Data sets are optimized around points in time | Time conversions can be problematic if the data architecture isn’t done properly |
| Highly performant given proper indexes | |
Geospatial
Similar to how time series data structures enable cataloging when something happened, geospatial data structures enable the where. Generally these data structures relate to points on the surface of the earth, and if needed add an attribute to show if something is above (think planes or Superman) or below (like drill bits or Dune sandworms). This data and practice are highly specialized. Unless you have a specific need to apply geospatial data, I wouldn’t recommend dedicating brain cells to this area of human expertise.
Geospatial data is usually stored in three major formats:
- Raster
- Vector
- Tabular with Geospatial Elements
Raster: Raster data is basically images with a geospatial anchor. Like other images, they can be resized based on the perspective and framing of the use case. If you look at satellite imagery in Google Maps, you’re almost certainly viewing data stored in raster format on the back end.
[Figure: example raster outputs]
Vector: Vector format stores points or groups of points as data objects. These data objects can form ‘shapes’, which is why the most common files you will run into are called shapefiles. Consider the square below:
[Figure: a unit square]
The square above would be stored in a format like [ [0,0], [0,1], [1,1], [1,0] ]. These shapes will have an anchor to a specific point in space: either they will be given a reference point to move relative to, or they will be stored in absolute coordinates.
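Here is a minimal sketch of working with such a shape using the shapely library, with the coordinates from the square above:

from shapely.geometry import Polygon  # pip install shapely

square = Polygon([(0, 0), (0, 1), (1, 1), (1, 0)])
print(square.area)  # 1.0

# Shapes support spatial predicates, e.g. containment of a smaller square
inner = Polygon([(0.25, 0.25), (0.25, 0.75), (0.75, 0.75), (0.75, 0.25)])
print(square.contains(inner))  # True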
Tabular with Geospatial Elements: This is my definition of this category; you probably won’t find this sourced online. Geospatial data is usually in specific standardized formats, such as WGS 84 or NAD 83. Older datasets could be in NAD 27, which is built on an older, less accurate model of the earth’s shape. Geospatial specialists have strong thoughts on what formats should be used and how it’s the end of the world if it’s not stored in a specific way. Maybe, but if you know the format you can almost always convert it as needed at runtime.
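That runtime conversion is usually a few lines; below is a sketch using the pyproj library (EPSG:4267 and EPSG:4326 are the standard identifiers for NAD 27 and WGS 84; the coordinates are hypothetical):

from pyproj import Transformer  # pip install pyproj

# NAD 27 (EPSG:4267) -> WGS 84 (EPSG:4326)
to_wgs84 = Transformer.from_crs("EPSG:4267", "EPSG:4326", always_xy=True)
lon, lat = to_wgs84.transform(-95.3698, 29.7604)  # hypothetical point near Houston
print(lon, lat)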
The advantages and disadvantages of geospatial data are listed below:
| Pros | Cons |
| --- | --- |
| Data sets are optimized around points in space | Idiosyncratic data models with an opinionated community |
| Ideal for maps | Older data sets are built on less accurate models of the earth |
A Brief Step Into Metadata
Metadata, or ‘data about data’, can be attached to any of the data types above. It adds context and information about data. This context could be who or what created the data, how the data relates to other data, the permissions governing who can read and write the record, and many other things. Think about pictures on your phone: if you click into a picture, it will give you information on the device that took the image, the geolocation of the image, the zoom applied, and many other things. That metadata rides along with the picture if it’s saved or shared. Good metadata can allow analysts and data scientists to understand data without having to ask a subject matter expert.
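You can inspect that ride-along metadata yourself; here is a minimal sketch reading EXIF tags from an image with the Pillow library (the file name is hypothetical):

from PIL import Image
from PIL.ExifTags import TAGS  # pip install Pillow

img = Image.open("vacation_photo.jpg")  # hypothetical image
for tag_id, value in img.getexif().items():
    print(TAGS.get(tag_id, tag_id), ":", value)  # e.g. Make, Model, DateTime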
Designing Your Data Strategy
“Good design is good business.”
Thomas Watson Jr, former IBM CEO
A company’s data systems need to be designed to maximize business value, which often means designing the system for reporting and data science. Given that companies are living, breathing, always-changing organisms, this will almost never start from a blank slate. Most company landscapes will be incredibly convoluted. This is a great opportunity! Things that are easy aren’t worth doing.
We recommend working backwards to design your data strategy, starting with the business outcomes you need to achieve:
[Figure: working backwards from business outcomes to the data that enables them]
You can imagine the number of questions that arise from the simple flow above. For the sake of exercise, consider a company that calculates and sells credit scores in the graphic below:
[Figure: the working-backwards flow applied to a credit score company]
And that’s just the beginning. When answering the above, enterprises must design their data architecture to be:
- Secure
- Performant for Inference in Production
- Accessible for Accurate Training
And should aim for these nice-to-haves:
- Native to your Data Science Tooling
- Accessible for Users
Secure
Security comes first and foremost. This is a basic right-to-operate requirement. Pragmatically, someone won’t get fired over a bad model. A lot of people will get fired if data gets out. For most companies their data contains insights into real people, and it’s unethical not to apply security to that data. Enterprises must ensure that data, especially personal data, is maintained according to the trust and confidence of the people it represents. The mechanics of this will vary. Some companies may want to go as far as to have dedicated red teams challenging current designs to improve them (a good example of this spirit in practice is Netflix’s effort to constantly reduce risk and increase performance through chaos engineering).
If you mess up data security you go straight to jail.
Performant for Inference in Production
The second most important thing: if you can’t produce insights when they are needed, nothing else matters. The nature of the company and industry will dictate the requirements. For example, credit scores do not need to be generated on demand; they can be created in a batch fashion weekly or daily to meet the business need. Data-first companies such as Meta or Pinterest need to have inferences available in timescales of thousandths of a second. These two defining examples will require two very different data architectures.
Accessible for Accurate Training
Models need to come from somewhere. Many firms make it difficult for data scientists to access the data they need to do their work, mostly due to the first and foremost requirement of security. Standard software development works through three (or four) different environments designed to test products before they reach production…data science is not standard software development. A firm’s data architecture has to enable data scientists to experiment and train models. There are a few patterns for this. We feel that having a controlled environment for data science is best; this allows data scientists to hammer their segmented source of data without fear of harming the greater ecosystem. It also allows for security to be a key design feature.
Native to your Data Science Tooling
Let me explain this one. If you’re working in a cloud environment, there’s a significant amount of data products and tooling available. Many of them are not the same blue or orange colors as your cloud provider. This is, unfortunately, a big deal. Cloud providers, as a feature and not a bug, make it difficult to link native data science tooling with these frenemy data technologies. This results in significant redundant data replication, bespoke integrations, and other suboptimal technology choices. Try to avoid this if you can.
Accessible for Users
Here’s a good litmus test for your company to see if you’re ‘data first’ or in the trenches: if you have a new hire, how long does it take to get them through the chutes and ladders of access requests, group affiliations, and network changes before they can actually do work? If your answer is 2-4 weeks, then you are in the trenches. Access control for data systems is nontrivial, and most companies don’t spend the time and effort to get it right so that it’s seamless and scales across users.
Human Incentives
Most of economics can be summarized in four words: People respond to incentives.
Steven Landsburg
I want to go over why data systems are the way they are-I believe it comes down to incentives. Everything comes down to incentives.
| Incentives of Data Group | Incentives of Data Science Group |
| --- | --- |
| Support the operational needs of the business so they don’t blame you for things not working | Deliver products to customers |
| Don’t go to jail | Receive accolades |
| Stop people yelling at you | Make LinkedIn posts |
Functionally, the data group has incentives to operate like a regulated utility company, and the data science group has incentives to operate like a startup. These are not congruent. The data group is charged with meeting the competing, and often conflicting, needs of the operational and analytical business groups. This is difficult to do, and the best data groups will design their data architecture so that it can simultaneously support domains of access and domains of control. There is a time horizon aspect to serving each group: if a data group ignores the operational business, the company will fail quickly; if it doesn’t meet the needs of the analytical business, the company will still fail, just much more slowly. An analogy would be going without air (not supporting the operational business) vs going without food (not supporting the analytical functions).
Aligning Incentives With Organizational Structure
There are a few ways to address the incentives problem with organizational structure. One way is to combine the data and data science groups in some way. This is a good option for small companies. If this isn’t an option then some kind of federated model where data engineering resources are embedded within the data science group will also work well. Enterprises also need to ensure the operational business is being served by the data group-in larger companies this is a much harder lift than taking care of the data science teams.
Aligning Incentives As An Enterprise
Aligning incentives is much more difficult than a hand wave at an org chart, but I’d argue it’s a much better option long term. To do this right, the incentives of the data group need to enter the data science domain and vice versa. This way the data scientists are also worried about going to jail and getting yelled at, which is a good thing. The data team can then share in the glory of great models and outcomes, and feel like they’re providing real value (and receive real accolades).
Designing Your Data Architecture
To achieve great things, two things are needed: a plan, and not quite enough time.
Leonard Bernstein
A correct data architecture is the lifeblood of a company that wants to extract value from its data. Think about your organization: where is the data architect in the pecking order? Do you even have one? A company’s data architect should be one of the most important people in a technology group, second only to the enterprise architect in terms of respect and influence. A good data architect will do the following:
- Make pretty drawings
- Design and execute a vision on how data will be converted into value
- Most importantly, work across teams (Sharks and Jets, silos) and influence the entire organization to understand the importance and impact of data and the corresponding strategy
So what makes a good data architecture? The first consideration is your stakeholders. The list will vary by company and industry but a good general list of stakeholders could be:
- Core operational business
- Finance and Accounting
- Business Analytics
- Data Science
Other considerations would include source systems and your necessary business outcomes. A baseline data architecture is shown below:
[Figure: baseline data architecture, from source systems on the left to ML and analytics consumers on the right]
If you failed the reading portion of your last eye test, the key points moving left to right are:
- Data from source systems land in a common area
- Data from source systems is transformed and standardized
- Data can then be split for either ML or Analytics workflows
What’s useful about that design is that it can satisfy both business intelligence and AI/ML needs simultaneously. Data that is typically tabular can be sent to business intelligence environments, and the more diverse data types needed by data science can be sent to a separate pipeline. Below is a separate diagram (because I drew it, and come hell or high water I’ll use it) showing a typical architecture for manufacturing or energy. It’s very similar to the baseline diagram above, but with a stronger focus on OT data and the requirements that come with it:
[Figure: typical data architecture for manufacturing or energy, with an OT focus]
Moving from left to right:
- If you’re a company that has sensors or field data, you’ll likely have a rich real-time stream of data from thousands of devices. What a blessing and a curse. This data will be noisy, messy, sometimes wrong, and will need to be mastered to a place or piece of equipment to add the context you need. This is all a significant amount of work
- The message bus in the middle of the diagram is critically important if you want to run inference in real time (see the sketch after this list). Message buses aren’t new, but they haven’t been common in companies focused on business systems, which results in teams that aren’t comfortable with them
- Having a proper data warehouse is critical. This enables mixing of different source systems for valuable outcomes. Companies have many analysts who work exclusively in their ERP systems; think about the insights that could be gained by mixing this data with our business and field systems
- Missing from this diagram, on purpose, is master data. You can place master data at different places in this diagram, and you’d have to. No matter what, you’ll need a program in place to build and maintain it. There are entire books written on this subject
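To make the message bus point concrete, here is a minimal sketch using the kafka-python client. The broker address, topic name, and message shape are all assumptions, and a production setup would add authentication, error handling, and schema management:

import json
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

BROKER = "localhost:9092"  # assumption; real clusters span multiple brokers

# Field gateway publishes sensor readings onto the bus
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("sensor-readings", {"sensor": "temp-01", "value": 76.2})
producer.flush()

# Inference service subscribes and scores each reading as it arrives
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers=BROKER,
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    reading = message.value
    # score = model.predict(reading)  # model loading omitted from this sketch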
Architectures will vary based on the type of company, the history of systems, and beliefs and proclivities of the architects. There’s no right architecture, but there certainly are wrong ones.
Conclusion
A journey of a thousand miles begins with a single step.
Lao Tzu
If there’s one thing I hope you gained from this chapter, it’s that none of the above happens by accident. It takes significant vision and diligent execution to get to an end state where data science can drive value for an enterprise in a consistent way. It’s also extremely rewarding to see a data landscape you built relied upon by hundreds of people for years to come. So what are you waiting for?