It goes without saying that big data is now a common feature among businesses both big and small. However, its great potential remains largely elusive. The problem lies not necessarily with the tools but with the data itself: There is simply too much of it. While that may not seem like a bad thing, the issue is that the raw data a business receives arrives unfiltered. Even when it is fed into custom BI solutions, there is little for them to actually do because nothing distinguishes useful information from the noise. That's why data cleansing will become increasingly important in the years to come.

Separating the wheat from the chaff
Data cleansing is a process that filters out specific kinds of data, usually duplicates, incomplete records and improperly formatted sets. A company can go further by cutting out data points that aren't actually necessary for specific business processes. While what gets cut is left to the discretion of the firm, categories such as outdated information and unverifiable details are a good baseline to remove, according to Web Werks Data Centers.
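To make the idea concrete, here is a minimal sketch of such a cleansing pass, assuming a tabular dataset handled with Python and pandas. The column names ("email", "signup_date") and the five-year cutoff are hypothetical examples chosen for illustration, not part of any product discussed in this article:

```python
import pandas as pd


def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleansing pass over a hypothetical customer table.

    The column names and the five-year cutoff are illustrative assumptions.
    """
    cleaned = df.copy()

    # Remove exact duplicate records.
    cleaned = cleaned.drop_duplicates()

    # Drop rows with incomplete (missing) values in required fields.
    cleaned = cleaned.dropna(subset=["email", "signup_date"])

    # Filter out improperly formatted entries, e.g. malformed email addresses.
    valid_email = cleaned["email"].str.contains(
        r"^[^@\s]+@[^@\s]+\.[^@\s]+$", regex=True, na=False
    )
    cleaned = cleaned[valid_email].copy()

    # Remove outdated information, e.g. records older than five years.
    cleaned["signup_date"] = pd.to_datetime(cleaned["signup_date"], errors="coerce")
    cutoff = pd.Timestamp.now() - pd.DateOffset(years=5)
    cleaned = cleaned[cleaned["signup_date"] >= cutoff]

    return cleaned
```

A real cleansing pipeline would of course tailor the rules to the firm's own data and business processes; the point is simply that each category mentioned above maps to a straightforward filtering step.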

"Only 28 percent of companies find value in their data."

Of course, data cleansing takes a significant amount of time and resources to complete, which undermines some of the potential for getting the most insight from big data. ZDNet cited a Cisco survey which found that only 28 percent of companies think their data generates any strategic value. Part of the reason may lie in a related survey, also cited by ZDNet, in which at least a third of business intelligence professionals reported spending more than 50 percent of their time cleansing data before it can actually be used.

Moreover, because data scientists and related professionals must cleanse the data before they can make any assessments, much of their time is effectively spent on menial labor, keeping them from reaching their full productivity.

The future of big data relies heavily on solving this data overload problem. There are three possible paths to address it. One is for platforms such as Cognos BI to get better at cleansing data themselves. Another is to create a specialized role in which people clean the data for the data scientists, much as orderlies support nurses at hospitals; this is already happening to some degree. A third option is to let artificial intelligence platforms such as IBM's Watson perform some of the data cleansing themselves. None of these is a panacea, but all are steps in the right direction.