Sep 14th 2010
The data mining revolution
We’re generating terabytes of data every minute. Figuring out what it all means will be the next gold rush, says Daniel Poon.
One of my clients, a large US conglomerate, had just acquired and integrated several gardening-equipment businesses. Due diligence had been undertaken and due process followed, but it wasn’t until much later that the company discovered that, thanks to product misclassifications, it was now the proud owner of no fewer than four million plastic planting pots, stacked in warehouses across the country.
Another case: a telco giant, while analysing call data, discovered rampant misuse of an unlimited calling plan through calling cards. When it looked a bit closer, it found that the Mexican mafia was using the product to wholesale telephone capacity to retail customers at substantial discounts. Then, when the bills came due, the resellers disappeared.
These are just two examples of the major opportunity and challenge for large organisations today: how to sort and analyse the increasingly large volumes of data they collect in ways that reveal the company’s real situation.
At the moment, business sits at the bottom of the business intelligence curve. The technology for gathering and storing data has accelerated over the past decade. Examples of the tidal waves of data engulfing the business world are abundant: Walmart handles more than one million customer transactions each hour and its databases are estimated to contain 165 times the information contained in America’s Library of Congress. Facebook is home to 40 billion photos and counting. Decoding the human genome involves analysing three billion base pairs; the first time it was done, in 2003, it took 10 years. Today it takes a week.
We can now store more data than we ever imagined possible, and our analytical capabilities are way beyond where they were just five years ago. The problem for organisations is mobilising those capabilities into actionable intelligence. We have solved the scaling problem – how to store terabytes of data in easily accessible, classifiable ways – but solving the intelligence issue is an ongoing project with huge implications for the efficiency of business.
Multiple versions of the truth
Most organisations haven’t got to square one with an integrated data management and analysis strategy, or what is known as master data management (MDM). At a basic level, MDM seeks to ensure that an organisation does not use multiple (potentially inconsistent) versions of the same master data (eg, product definition, business unit hierarchy, cost-centre business rules and the like) in different parts of its operations.
A simple but common example of poor MDM is when the chief financial officer asks a simple question: ‘What is the year-to-date product revenue of our company?’ The corporate controller at headquarters reports $12.1 million. The vice-president of sales reports $14,001,234 and the regional controller reports $10,800,678.
How could there be three such different answers?
The answer lies in the fact that each finance group is measuring results in a different way. The corporate controller reported a rounded GAAP (generally accepted accounting principles) number prepared for the Securities and Exchange Commission (SEC). The vice-president of sales’ figure included intercompany revenue, which should have been eliminated. The regional controller took into account some of the deals that will close this quarter but for which product has yet to be shipped. Although a single version of the truth is assumed, it rarely exists. Reality depends on perspective and context:
- Varied sources: product revenue may be reported from the booking or order entry system as opposed to the billing or invoicing system.
- Varied business rules: business entities are rolled up differently. For example, headquarters often rolls up entities differently from the functional business.
- Varied assumptions: different groups tend to include or exclude certain business events, therefore generating different information.
- Varied time of reporting: information reported at different times can produce different results.
- Varied accuracy: not all information is measured at the same level of granularity.
As a result, while data quality at data-entry level may be relatively easy to control, interpretation and analysis of the information provided by different groups without any common guidelines leads to incorrect reporting and, eventually, incorrect decisions. MDM will be a hot topic for the coming decade.
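To see how the same records can yield three different answers, consider a minimal sketch in Python. Everything in it is hypothetical: the `Deal` fields, the four sample deals and the three rule functions are invented for illustration and do not reproduce the figures quoted above. Each function applies one group's assumptions to identical data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deal:
    amount: int          # deal value in dollars
    intercompany: bool   # sale to another entity in the same group
    shipped: bool        # product has shipped to the customer


# Hypothetical year-to-date deals; the figures are illustrative only.
DEALS = [
    Deal(500_000, intercompany=False, shipped=True),
    Deal(260_000, intercompany=False, shipped=True),
    Deal(120_000, intercompany=True, shipped=True),
    Deal(80_000, intercompany=False, shipped=False),
]


def corporate_view(deals):
    """External, shipped sales only, rounded to the nearest $100,000."""
    total = sum(d.amount for d in deals if d.shipped and not d.intercompany)
    return round(total, -5)


def sales_view(deals):
    """All shipped sales, with intercompany revenue left in."""
    return sum(d.amount for d in deals if d.shipped)


def regional_view(deals):
    """External sales, counting deals that have not yet shipped."""
    return sum(d.amount for d in deals if not d.intercompany)


if __name__ == "__main__":
    for view in (corporate_view, sales_view, regional_view):
        print(f"{view.__name__}: ${view(DEALS):,}")
```

None of the three functions contains an arithmetic error; they simply encode different rules. Agreeing on one rule set, and on who owns it, is the work that MDM is meant to do.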
The development of business processes that actually enable analysts to gain useful insight from the mountains of numbers sitting in our data warehouses remains in its infancy. The evolution of data mining will be one of the great productivity revolutions of the next decade.
It will first take the form of an expansion in the number of standard use cases from which companies will be able to draw. Back in the 1980s, if you wanted to analyse the gross margin, divisional profitability or attributable overhead of a particular product line, the project might take weeks or months. Today, it is done at the press of a button.
Other more complex queries, such as profitability per customer, may still take weeks or require a complex project of their own to answer. Very soon, however, even the most complex business analysis will be standardised, with the information scraped out of the data warehouse and put together by business intelligence bots that come standard with any of the major enterprise systems.
Already, web-based business dashboards, combined with social networking tools and strategic communication technologies, are spreading business numeracy and stimulating people who wouldn’t otherwise be thinking about business intelligence to develop queries and refine their business strategies using real data.
But there is still much to be done and huge productivity potential remains. The business intelligence revolution is here.
