
The task of cleansing, shaping and transforming data for analytics or other business purposes is known as data preparation. Data preparation is an inseparable component of many systems and applications managed by IT. Data warehousing and business intelligence represent the more formal side of that work. But tech-savvy business users, such as data scientists, are routinely burdened with informal requests for ad hoc reporting and customized data preparation. Increasingly, users need data preparation delivered as a service, so they are handed clean data and can focus on the critical work of analysis.

One way to understand the ins and outs of data preparation is to look at its five steps: Identify, Confine, Refine, Document and Release. Let’s examine each in more detail.

  1. Identify

The Identify step is about finding the data best suited for a specific analytical purpose. Data scientists cite this as a frustrating and time-consuming exercise. A crucial requirement for data identification is a metadata repository – that is, a comprehensive, well-documented data catalog that is created and maintained over time. The repository provides a descriptive index pointing to the location of available data, cataloged together with data profiling statistics and other contents.

Identification is not just about finding data; it is also about making data easier to discover later, whenever a need arises. The data catalog should be updated continuously as the company encounters new data sources, even when no immediate data preparation is planned.
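As an illustration, a minimal catalog entry might pair a pointer to the data’s location with basic profiling statistics. The Python sketch below assumes pandas and a hypothetical CSV source; the entry fields are illustrative, not a standard schema.

```python
# A minimal sketch of a data catalog entry with basic profiling statistics.
# The file path, column handling and entry fields are assumptions.
import pandas as pd

def profile_source(name: str, path: str, description: str) -> dict:
    """Build a catalog entry describing where a data set lives and what it contains."""
    df = pd.read_csv(path)
    return {
        "name": name,
        "location": path,            # descriptive index: where the data lives
        "description": description,
        "row_count": len(df),
        "columns": {
            col: {
                "dtype": str(df[col].dtype),
                "null_count": int(df[col].isna().sum()),
                "distinct_values": int(df[col].nunique()),
            }
            for col in df.columns
        },
    }

# Register a new source as soon as it is encountered,
# even if no preparation is planned yet.
catalog = []
catalog.append(profile_source("sales", "data/sales.csv", "Point-of-sale transactions"))
```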

  2. Confine

The Confine step is about collecting the data selected during the Identify step. The term “confine” evokes the image of temporarily holding a copy of the data that feeds the rest of the data preparation process. Contrast this with a spreadsheet cell, which in most business settings confines data to a specific location permanently, both during and after preparation. A temporary staging area or workspace is required for the processing that happens in the Refine step that follows. When ongoing confinement of transitional or delivered data is required, it should use shared, managed storage. An evolving practice here is the use of in-memory or cloud-based storage for much faster, near-real-time shaping of the data before it’s sent on to other processes.
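As a minimal sketch of confinement, the snippet below copies an identified source into a temporary staging directory so the original is never touched. The paths are illustrative; in practice a shared, managed store would replace the local folder.

```python
# A sketch of the "confine" step: copy the identified source into a
# temporary staging area so the original data is never modified.
# The staging path and source file are illustrative assumptions.
import shutil
from pathlib import Path

STAGING = Path("staging")      # stand-in for a shared, managed workspace
STAGING.mkdir(exist_ok=True)

def confine(source: str) -> Path:
    """Place a working copy of the source in the staging area."""
    working_copy = STAGING / Path(source).name
    shutil.copy2(source, working_copy)
    return working_copy

staged = confine("data/sales.csv")  # downstream refine steps read from here
```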

  3. Refine

The Refine step is about cleansing and transforming the data collected during the Confine phase. In the process of refining data, the enterprise must determine how appropriate the data is for its intended purpose or use. This overlaps with the function of data quality, making data quality integral to data preparation. During this step, the data may also be transformed to make it easier to use during analysis. For example, individual sales transactions may be aggregated into daily, weekend, weekly and monthly subtotals.
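To make that example concrete, here is a small Python sketch using pandas to roll individual transactions up into daily, weekend, weekly and monthly subtotals. The file path and column names (txn_date, amount) are assumptions.

```python
import pandas as pd

# Load the staged transactions (column names are assumptions for the sketch).
sales = pd.read_csv("staging/sales.csv", parse_dates=["txn_date"]).set_index("txn_date")

daily = sales["amount"].resample("D").sum()     # daily subtotals
weekly = sales["amount"].resample("W").sum()    # weekly subtotals
monthly = sales["amount"].resample("ME").sum()  # monthly subtotals ("M" on pandas < 2.2)

# Weekend subtotals: keep only Saturday/Sunday rows, then roll up by week.
weekend = sales[sales.index.dayofweek >= 5]["amount"].resample("W").sum()
```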

In data warehousing and business intelligence, data quality rules are applied while integrating multiple data sources into a single data model optimized for querying and standard reporting. The point is: Don’t reinvent the wheel; reuse what’s already there. The more reusable your refining processes are, the less the business relies on IT for custom-built processes and ad hoc requests. Ideally, the enterprise should turn its data quality components into a library of functions and a repository of rules that can be reused to cleanse data, as sketched below.
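As an illustration of that idea, here is one way a reusable rule library might look in Python with pandas. The rule functions and column handling are assumptions for the sketch, not a prescribed set of cleansing rules.

```python
# A sketch of reusable cleansing rules kept as a small library of functions,
# so the same logic serves warehousing jobs and ad hoc requests alike.
import pandas as pd

def strip_whitespace(df: pd.DataFrame) -> pd.DataFrame:
    # Trim stray whitespace in text columns; leave other dtypes untouched.
    return df.apply(lambda s: s.str.strip() if s.dtype == "object" else s)

def drop_duplicate_rows(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop_duplicates()

RULES = [strip_whitespace, drop_duplicate_rows]   # shared rule repository

def refine(df: pd.DataFrame) -> pd.DataFrame:
    """Apply every registered cleansing rule in order."""
    for rule in RULES:
        df = rule(df)
    return df
```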

  4. Document

The Document step is about recording both business and technical metadata about identified, confined and refined data. This includes:

  • Nominal definitions.
  • Business terminology.
  • Source data lineage.
  • History of changes applied during refinement.
  • Relationships with other data.
  • Data usage recommendations.
  • Associated data governance policies.

All the metadata is shared via the metadata repository, or data catalog. Shared metadata enables data preparation that is faster and consistent when repeated, and it eases collaboration when multiple users handle different steps of the process.
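To illustrate, the metadata listed above could be captured in a simple record like the Python dataclass below. The field names mirror the list; the structure and values are assumptions for the sketch.

```python
# A minimal sketch of the business and technical metadata recorded
# during the Document step; fields mirror the list above.
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    name: str
    definition: str                       # nominal definition / business terminology
    source_lineage: list[str]             # where the data came from
    change_history: list[str] = field(default_factory=list)   # refinements applied
    related_datasets: list[str] = field(default_factory=list)
    usage_recommendations: str = ""
    governance_policies: list[str] = field(default_factory=list)

meta = DatasetMetadata(
    name="sales_monthly",
    definition="Monthly sales subtotals by store",
    source_lineage=["data/sales.csv"],
    change_history=["stripped whitespace", "dropped duplicates", "aggregated to month"],
)
```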

  5. Release

The Release step is about structuring refined data into the format needed by the consuming process or user. The delivered data set(s) should also be evaluated for persistent confinement and, if confined, the supporting metadata should be added to the data catalog.

These steps allow the data to be discovered by other users. Delivery must also comply with data governance policies, such as those minimizing the exposure of sensitive information.
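As a rough sketch of the Release step, the snippet below writes refined subtotals into the format a consumer asked for and, since the delivered set is kept, registers it in the catalog. File names and entry fields are illustrative and carry over from the earlier sketches.

```python
import pandas as pd
from pathlib import Path

# "sales_monthly" is the refined output from the Refine sketch; here it is
# reloaded from the staging area for illustration.
monthly = pd.read_csv("staging/sales_monthly.csv")

Path("delivered").mkdir(exist_ok=True)
monthly.to_csv("delivered/sales_monthly.csv", index=False)  # consumer's requested format

# Persistent confinement: the delivered set is kept, so describe it in the
# data catalog (as in the Identify sketch) so other users can discover it.
catalog_entry = {
    "name": "sales_monthly",
    "location": "delivered/sales_monthly.csv",
    "source_lineage": ["data/sales.csv"],
    "governance": ["no sensitive columns included"],
}
```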

Make data preparation a repeatable process

Data preparation needs to become a formalized enterprise “best practice.” Shared metadata, persistent managed storage, and reusable transformation and cleansing logic will make data preparation an efficient, consistent and repeatable process. In turn, it will become easier for users to find relevant data – and they’ll be armed with the knowledge they need to quickly put that data to use.

By providing self-service data preparation tools, IT enables business users to work with data on their own, freeing IT to focus on other tasks. In the process, the entire organization becomes more productive.

Formalizing data preparation in this way is essential. With shared metadata, continuously managed storage and reusable cleansing logic, it becomes a reliable, repeatable process. Users are empowered with clean data and can focus on the more important task of analyzing it for actionable insights, improving data readiness across the enterprise.