Over the years, enterprise data strategies have been in a perpetual pendulum swing – from no strategy at all to way too much.
On one extreme, every analyst, data scientist or analytic application development team must find, access, translate and integrate all the data they need. This leads to some success with individual use cases, but burdens practitioners with excessive and redundant work while creating an unmanageable mess of data across the organization.
On the other extreme, data management professionals create a “foundation” of data for any and all uses, deploying one data domain at a time while attempting to identify every attribute and solve every data quality issue within each domain. This leads to projects that take too long, cost too much and deliver a fraction of the value that was expected.
Thus, the pendulum swings back and forth, back and forth.
The solution to this enduring dilemma is to find the middle way; that is, to carefully construct projects that focus directly on near-term value while contributing to an enterprise foundation at the same time.
An effective data strategy delivers (mostly) only the data needed for near-term use cases. To support these applications (preferably delivered by separate projects outside of the data team’s responsibility), the approach integrates only the data elements needed and solves only the data quality problems that affect the in-scope business objectives. And to simultaneously build an enterprise foundation, each small, focused data delivery also contributes its puzzle piece of data to fit into the larger enterprise data puzzle.
But how will the puzzle pieces fit together? And if we focus only on near-term use cases, what happens if new and unexpected business needs emerge? Won’t we have to go back and redesign everything?
The answer is not completeness, but extensibility. To accomplish this, there is a set of timeless practices waiting patiently to be rediscovered. Here are a few examples:
Structure and acquire data at the lowest level of detail.
Even if the near-term requirements call for data at a summarized level, you should structure the data at the lowest level possible – such as individual sales transactions instead of daily summaries, individual sensor readings instead of averages, individual customers instead of segments, and so on. In this way, when new requirements emerge, the data can be summarized at whatever level is needed, and additional occurrences and attributes can be added for new use cases without restructuring what is already there.
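To make this concrete, here is a small Python sketch using pandas; the table and column names are invented for the example, not taken from any particular system. The near-term use case only needs daily totals, but because the data is kept at the transaction level, a later weekly-per-store requirement becomes just another query.

    # A minimal sketch, assuming a pandas DataFrame of individual sales
    # transactions; the column names are invented for the example.
    import pandas as pd

    transactions = pd.DataFrame({
        "transaction_id": [1, 2, 3, 4],
        "transaction_ts": pd.to_datetime([
            "2024-03-01 09:15", "2024-03-01 14:02",
            "2024-03-02 10:30", "2024-03-08 16:45",
        ]),
        "store_id": ["S1", "S1", "S2", "S2"],
        "amount": [19.99, 5.50, 42.00, 13.25],
    })

    # Near-term use case: daily sales totals per store.
    daily = (
        transactions
        .groupby([transactions["transaction_ts"].dt.date, "store_id"])["amount"]
        .sum()
    )

    # Later, unexpected use case: weekly totals per store. Because the data
    # is kept at the transaction level, this is just another query rather
    # than a new acquisition or a redesign.
    weekly = (
        transactions
        .groupby([pd.Grouper(key="transaction_ts", freq="W"), "store_id"])["amount"]
        .sum()
    )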
Build right-time and adjustable data integration processes (a.k.a. data pipelines or workflows).
It is not feasible to build real-time or even near-real-time data integration as a default for every source. However, it is possible to build processes that meet known timeliness requirements while allowing increased frequency when needed without excessive rework. For example, change data capture processes for master data can be built so that each run efficiently processes changes since the last run. And where possible, new transactions and data changes can be published from source applications through messaging and batched into a shared data resource until more timely data is needed.
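As an illustration, here is a minimal Python sketch of an incremental, “since the last run” load driven by a high-water mark. The table, column and function names are assumptions for the example, and sqlite3 simply stands in for whatever source and staging databases are involved.

    # A minimal sketch of an incremental load driven by a high-water mark.
    # The table, column and function names are assumptions for the example,
    # and sqlite3 stands in for any source and staging database.
    import sqlite3

    def load_customer_changes(source: sqlite3.Connection,
                              staging: sqlite3.Connection,
                              watermarks: dict) -> None:
        """Copy only the rows changed since the previous run into staging."""
        last_run = watermarks.get("customer", "1970-01-01T00:00:00")

        rows = source.execute(
            "SELECT customer_id, name, email, last_modified "
            "FROM customer "
            "WHERE last_modified > ? "
            "ORDER BY last_modified",
            (last_run,),
        ).fetchall()

        staging.executemany(
            "INSERT OR REPLACE INTO stg_customer "
            "(customer_id, name, email, last_modified) VALUES (?, ?, ?, ?)",
            rows,
        )
        staging.commit()

        if rows:
            # Advance the high-water mark to the newest change processed, so
            # the next run, whether nightly, hourly or every minute, picks up
            # exactly where this one left off.
            watermarks["customer"] = rows[-1][3]

Scheduling this more frequently simply shrinks each batch; nothing in the logic has to be rebuilt when the timeliness requirement tightens.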
Acquire as much data as possible from new data sources, but only to staging (a.k.a. the data lake).
Landing raw data in a staging area (or data lake) doesn’t cost much in terms of time or storage. Therefore, if you need data from a specific source, it’s best to err on the side of taking more data rather than less while you’re there. However, only the data needed for identified use cases should be processed any further because transforming, integrating, quality checking and so on does require quite a bit of work. And doing this for too many data elements is what blows up the scope of a data delivery project. But because you’ve sourced a superset of the data to a staging area, it’s easier to fully integrate additional elements later, as needed, without having to traverse a path to the source again.
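Here is a rough Python sketch of that pattern, using pandas and local Parquet files as a stand-in for a staging area or data lake; the file paths and column names are made up for the example.

    # A rough sketch of "land everything, refine only what you need", using
    # pandas and local Parquet files as a stand-in for a staging area or
    # data lake. File paths and column names are invented for the example.
    import pandas as pd

    # 1. Land the full source extract untouched. Taking every column costs
    #    little here and saves a return trip to the source later.
    raw = pd.read_csv("exports/orders_full_extract.csv")
    raw.to_parquet("staging/orders/2024-03-01.parquet", index=False)

    # 2. Refine only the elements the current use cases need. Transformation,
    #    integration and quality checks are limited to these columns, which
    #    keeps the delivery project's scope small.
    needed = ["order_id", "customer_id", "order_date", "order_total"]
    curated = (
        pd.read_parquet("staging/orders/2024-03-01.parquet", columns=needed)
        .dropna(subset=["order_id", "customer_id"])
    )
    curated.to_parquet("warehouse/orders.parquet", index=False)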
Obtain data as close to the original source as possible.
Let’s say you’ve decided to build an enterprise data resource and, understandably, you want to show value as quickly as possible. Further, let’s say the data you need is available in one or more transactional data sources and in a data mart connected to those sources. If all the data you need for targeted use cases is available in the data mart, it can be very tempting to source data from there “temporarily.” But what happens when the next use case needs a few more attributes from the same source, or needs the data on a more timely basis? You will be limited to whatever the data mart has to offer. If, instead, you go to the original source for the data, you’ll be able to acquire whatever additional data you need in the future, and the timeliness will only be limited by the original system’s proximity to the business process it supports.
Create enterprise data models to outline the integrated vision.
Yes, conceptual data modeling is still a thing. Or at least it should be. Here you allow yourself to go beyond the near-term use cases to communicate the long-term scope of the enterprise data resource, depicted in about 20 to 30 entities, to be further detailed little by little within delivery projects. Care must be taken to keep these efforts within reasonable boundaries; time boxing and limiting the number of entities help with this. It doesn’t have to be perfect; it’s just a sketch to gain agreement on where all this is going.
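For a sense of the level of abstraction, here is a tiny, hypothetical fragment of such a sketch expressed in Python. A real conceptual model would cover roughly 20 to 30 entities, and every name here is invented for illustration.

    # A hypothetical fragment of a conceptual enterprise data model: just
    # entities and the relationships between them, no attributes or keys yet.
    conceptual_model = {
        "Customer": ["places Order", "holds Account"],
        "Order": ["contains Product", "is fulfilled from Location"],
        "Product": ["belongs to Product Category"],
        "Account": ["is governed by Agreement"],
        "Employee": ["manages Location", "serves Customer"],
    }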
Structure integrated data based on stable business entities.
Creating a logical data model as input into physical database design is also still a great practice, even if it’s been decimated in many organizations. This approach organizes data needs in a way that describes real-life business entities rather than source system structures or isolated target applications. With application-independent, business-oriented models in place, built in detail only for the in-scope use cases, the core data structures will be resilient to changes in source systems, will support a variety of targets, and will allow additional entities to be added much more easily as integration extends project by project.
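As a simple illustration, here is a Python sketch of the idea: stable, application-independent business entities, with each source system’s layout handled in a thin mapping layer. The entities, fields and source record shape are all assumptions for the example.

    # A minimal sketch of structuring integrated data around stable business
    # entities rather than source system layouts. The entities, fields and
    # the source record shape are assumptions for the example.
    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class Customer:              # a stable business entity
        customer_id: str
        name: str
        since: date

    @dataclass
    class SalesTransaction:      # another stable entity, tied to no one app
        transaction_id: str
        customer_id: str
        transaction_date: date
        amount: float

    def customer_from_crm_record(rec: dict) -> Customer:
        """Map one source system's layout onto the business entity.

        If the CRM is replaced, only this mapping changes; the Customer
        structure and everything built on it stay put.
        """
        return Customer(
            customer_id=rec["CUST_NO"],
            name=f'{rec["FIRST_NM"]} {rec["LAST_NM"]}'.strip(),
            since=date.fromisoformat(rec["CREATED_DT"]),
        )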
Conclusion
These are just a few admittedly over-simplified examples. So perhaps the most important practice of all is to set both extensibility and targeted use cases as explicit goals for each project.
Documenting and regularly reinforcing these dual goals encourages the entire team to come up with all kinds of ways to accomplish them together.
Just about every team I’ve ever been involved with has a mix of dreamers and pragmatists. The dreamers want to reach the long-term data vision. The pragmatists want to focus on immediate value for applications. For a data strategy to work, you must take both goals equally seriously in every project. But don’t worry: once the goals are out there, the dreamers and the pragmatists will let you know if the pendulum starts to swing too far one way or the other. Make sure to listen.