April 11, 2017
A bridge is only worth crossing if it holds firm while we cross it. When we’re crossing the chasm on a bridge of data, that bridge needs to have integrity. Last week, we discussed chilling the data: why it’s important to put operational data into long-term storage. Unfortunately, teams often pre-aggregate the data when chilling it, under the impression that storage is expensive or that visualizations will always be faster. But what about the questions we don’t know to ask yet? What happens when we don’t know what we don’t know?
A single drop of water is inconsequential in the middle of a torrential rainstorm, yet an aggregate of millions of drops is of serious consequence indeed. In the same way, when using cold data to address a question, each solitary datum is of negligible value; it’s the aggregate of large datasets that leads to actionable insights. However, because each datum is folded into the aggregate, any mutation of individual data points has an amplified effect on the aggregate data. Thus, even though each datum is of negligible value, the integrity of each datum is highly valuable.
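To illustrate that amplification, here is a minimal, hypothetical sketch (the values and the 0.5 offset are invented purely for illustration) showing how a tiny, consistent nudge applied to each datum becomes a large drift once millions of records are summed:

```python
# Hypothetical illustration: a small, consistent mutation applied to each datum
# is negligible individually but material once millions of records are aggregated.
import random

random.seed(42)

true_values = [random.gauss(100.0, 5.0) for _ in range(1_000_000)]  # "clean" data
biased_values = [v + 0.5 for v in true_values]                      # +0.5 nudge per datum

true_total = sum(true_values)
biased_total = sum(biased_values)

# Each datum is off by only 0.5, but the aggregate drifts by roughly 500,000.
print(f"Aggregate drift: {biased_total - true_total:,.0f}")
```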
The most obvious example of datum integrity destruction is scrubbing. When a human touches data before it enters the analytics system, bias is introduced beyond the aspects of the data that were targeted for scrubbing. Agency theory kicks in, and small adjustments are made to the data to make it appear favourable (in whatever way the individual touching the data perceives favour). The organization’s culture will also seep into the data through groupthink. Cognitive dissonance may cause an individual to adjust data when personal values are at odds with corporate values, and the list of possible biases goes on. When these biases are consistently introduced into the data, aggregation will amplify them and can result in decidedly poor decisions.
Datum integrity can also be destroyed when dealing with unstructured data like emails. This data needs to be tagged with metadata to yield valuable search and analytical results. When large swaths of unstructured data are arbitrarily dumped into a big data storage system without metadata attached, optimistic algorithms end up including data that is irrelevant (false positives) and pessimistic algorithms end up ignoring data that is actually relevant (false negatives). In this situation, the best outcome is that the untagged data goes unused and simply occupies space, while the worst outcome is that the untagged data wrongly skews algorithm and analysis results. When a company first begins collecting unstructured data, it is unreasonable to expect that all use cases will be known in advance. Nevertheless, putting some consideration into how data should be tagged will produce results far more trustworthy than arbitrarily dumped data.
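As a rough illustration (the field names and tags below are hypothetical, not a prescribed schema), tagging can be as simple as wrapping each raw email with a handful of metadata fields at ingestion time, so later analysis can deliberately include or exclude it:

```python
# A minimal sketch of attaching metadata to unstructured email content before
# it lands in long-term storage. All field names and values are hypothetical.
import json
from datetime import datetime, timezone

def tag_email(raw_email_text, sender, department, topic_tags):
    """Wrap raw email text with minimal metadata at ingestion time."""
    return {
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "source": "email",
        "sender": sender,
        "department": department,
        "tags": topic_tags,          # even coarse tags beat no tags at all
        "body": raw_email_text,      # the raw content is preserved untouched
    }

record = tag_email(
    "The pump on line 3 was down for two hours this morning...",
    sender="operator@example.com",
    department="operations",
    topic_tags=["maintenance", "downtime"],
)
print(json.dumps(record, indent=2))
```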
Arguably, pre-aggregation is the worst and most common culprit of datum integrity destruction. Using the argument that each datum is of negligible value on its own, some businesses store only pre-aggregated data to help solve challenges relevant to the present. However, once only the pre-aggregated data is stored, the data cannot be analyzed along different aggregations, and it becomes impossible to drill down to lower levels of granularity. Let us use an example based on public transit. In an attempt to determine stop utilization by passengers, the operating company has stored the following information for analytics. Notice that we are provided with an aggregate over the month: the number of times the bus stopped at a designated area to pick up passengers. Of interest, stop 2 appears to have been frequently skipped, suggesting low utilization.
Route Number | Stop Number | Month | # Times Stopped |
301 | 1 | Sep | 25 |
301 | 2 | Sep | 20 |
301 | 3 | Sep | 23 |
The challenge is that we don’t have a way to evaluate the data along other dimensions. Perhaps if the raw data were explored by route + stop + driver, a pattern would emerge showing that the regular driver was sick on the days stop 2 was skipped (implying a lack of training or awareness). Perhaps if the data were explored in conjunction with weather, a different pattern would emerge showing that ridership at stop 2 was low only on rainy days (possibly because stop 2 is not covered). The bottom line is that by pre-aggregating the data before storing it for analysis, this fictional transit operator has severely diminished the value of its data. In reality, this problem occurs because our present selves don’t know what problems our future selves will be trying to solve, so by storing a currently relevant perspective of the data instead of the raw data, we do ourselves a disservice.
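To make this concrete, here is a minimal sketch (the event-level records, driver labels and weather values are invented for illustration) of how keeping the raw stop events would let the operator roll the same data up by month, by driver, or by weather after the fact:

```python
# Raw, event-level stop records can be re-aggregated along any dimension later,
# unlike the pre-aggregated monthly counts in the table above.
import pandas as pd

stop_events = pd.DataFrame([
    # route, stop, date,        driver,    weather, stopped (1 = stopped, 0 = skipped)
    (301, 2, "2016-09-05", "regular", "clear", 1),
    (301, 2, "2016-09-06", "relief",  "clear", 0),
    (301, 2, "2016-09-07", "relief",  "rain",  0),
    (301, 2, "2016-09-08", "regular", "rain",  0),
    (301, 2, "2016-09-09", "regular", "clear", 1),
], columns=["route", "stop", "date", "driver", "weather", "stopped"])

# The pre-aggregated view: one number per route/stop, as in the table above.
monthly = stop_events.groupby(["route", "stop"])["stopped"].sum()

# Views that are only possible because the raw events were kept.
by_driver = stop_events.groupby("driver")["stopped"].mean()
by_weather = stop_events.groupby("weather")["stopped"].mean()

print(monthly, by_driver, by_weather, sep="\n\n")
```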
Make sure there is a bridge to cross when you get there! Treat data as a long-term resource and use it to build that bridge. Start with the data traditionally seen as valuable from ERPs and financial systems, but don’t overlook uncommon sources of data. Actively brainstorm external and unstructured data sets like email and Excel reports, and collect and chill all of the raw data generated at the lowest levels of operations. Capture data in its rawest possible form and avoid devaluing the integrity of each datum. Doing this will yield insight into resourcing, operations, market considerations and competitive advantages, especially once enough data has been collected to be statistically significant across prosperous seasons and equally difficult ones. Bear in mind that no answer will be automatic or self-evident, especially if the question itself is unclear, as famously satirized by the late Douglas Adams. However, when used as a tool in the arsenal of strategy, data can be a critical success factor to objectively guide, support and validate all corporate decisions.
If your team is seeking to better understand the data landscape, consider attending Big Data: A Peek Under the Hood in Calgary on May 4, 2017 and in Vancouver on May 11, 2017.