Data quality is of paramount importance in the big data world. Its importance is recognized in the ‘small data’ world, but in the big data space people too often assume that the quantity of data will make up for reduced quality. That is a big risk, as the quality of data often determines success or failure.
Take, for example, the recent US elections. A number of companies used big data to try to predict the election outcome, yet all of them were wrong. A number of hedge funds and investors relied on such predictions to make investments, and found themselves on the wrong side of the market. The same is true in other industries. Many dashboards rely on datamarts whose quality is unknown or dubious, and executives make decisions based on those dashboards, only to realize later that their decisions didn’t produce the hoped-for effects on the hard accounting figures. Using the wrong data can be a costly exercise.
There is too much reliance on quantity, and it is often forgotten that quantity is not a substitute for quality. Especially in the world of unstructured data, quantity does not mean quality. As data mining techniques distill enormous amounts of unstructured data into a reduced set of structured data (or signal), it becomes easy to forget that behind that limited set of metrics lies a world of utmost complexity, where many things can go wrong, be read the wrong way, or be interpreted with the wrong model.
The old acronym “GIGO” (garbage in, garbage out) is even more relevant today, in a big data world, yet more easily forgotten. The sheer quantity of data available has given many people, including many practitioners, a false sense of safety, and has caused them to forget that the quantity of data has only changed the challenge of data quality; it has not diminished its importance. Quite the opposite.
Poor data quality can have multiple origins:
* it can be an inherent characteristic of the data itself, which is simply of poor quality: unstructured, messy, imprecise, volatile
* it can be the result of collection errors, caused by approximation in measurements, defects in the collection device, etc.
* it can be the result of data manipulation, transformation or mining processes, whenever data needs to be pre-processed
* it can be the result of storage processes, or of compromises made in how the data is stored
It is extremely important that everybody involved in the collection, storage and transformation of data is aware of the many ways data quality can be impacted at every stage of the data flow.
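As a rough illustration of what checking quality at a stage of the data flow might look like, here is a minimal Python sketch that counts three common problems in a batch of records: missing values, out-of-range measurements, and exact duplicates. The function name `quality_report`, the field names, and the temperature range are hypothetical choices made for this example only; real checks depend on your own data and pipeline.

```python
from collections import Counter


def quality_report(records, required_fields, valid_ranges):
    """Count common quality problems in a list of dict records.

    Hypothetical helper for illustration; not a complete quality framework.
    """
    report = Counter()
    seen = set()
    for rec in records:
        report["total"] += 1

        # Missing or empty required fields (e.g. a collection error).
        if any(rec.get(field) in (None, "") for field in required_fields):
            report["missing_field"] += 1

        # Numeric values outside a plausible range (e.g. a defective sensor).
        for field, (low, high) in valid_ranges.items():
            value = rec.get(field)
            if isinstance(value, (int, float)) and not low <= value <= high:
                report["out_of_range"] += 1
                break  # count each record at most once

        # Exact duplicates (e.g. introduced by a faulty load or merge step).
        key = tuple(sorted((k, repr(v)) for k, v in rec.items()))
        if key in seen:
            report["duplicate"] += 1
        seen.add(key)
    return dict(report)


# Made-up sample batch, assumed to come from a temperature sensor.
records = [
    {"id": 1, "temp_c": 21.5},
    {"id": 2, "temp_c": 999.0},  # implausible reading
    {"id": 3, "temp_c": None},   # missing measurement
    {"id": 1, "temp_c": 21.5},   # duplicate of the first record
]

report = quality_report(
    records,
    required_fields=["id", "temp_c"],
    valid_ranges={"temp_c": (-50.0, 60.0)},
)
print(report)
```

Running the same report after collection, after each transformation, and after storage is one simple way to see at which stage quality degrades.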
It is not the most glamorous part of data management, but data cleansing is the foundation of the whole data edifice. If it is done wrong, the output will simply be poor, or flat-out wrong. We invest thousands of man-days every year, and a number of algorithms, to cleanse our data. We see this as an investment, not a cost: an investment in our ability to generate value and improve visibility.
Do you do the same? Have you ever questioned how data quality is impacted (reduced) in your data pipeline?