As media companies collect more and more data, from more and more data, the need for a thorough data cleansing regime becomes more important than ever. Out simply - no company wants to risk making bad decisions based on bad data - or licensing bad data to third parties.
Data that is ready for ingest into a data lake is typically provided in CSV, JSON, XML, or Parquet format, but before ingest, it needs to be cleansed to ensure the accuracy of any subsequent analysis undertaken on the data and the referential integrity of the datasets.
Because most data lakes are not built upon full-blown transactional databases which enforce foreign key constraints, it is particularly important that referential integrity is checked before loading data into the data warehouse.
In this blog post we look at the options for cleansing
Open source and AWS options
AWS does not currently provide any data cleansing tools and there are few open source options available.
The most common approach to build your own data cleansing tools is to develop python scripts to pre-process files, checking against the relevant for each incoming data source. These should as a minimum include:
- Formatting of fields
- Type checks
- Range checks
- Referential integrity
- Some ability to quarantine or fix data that is not well formed and create alerts.
Off the shelf cleansing products
There are a limited number of off-the-shelf data cleansing products that are specifically usable in the media sector.
There are extensive number of “cleanse and append” products in most countries that will cleanse customer datasets to remove old addresses (both physical and electronic) and for CRM data cleansing these are by far the most effective options but cannot be applied to other datasets.
There are also “master data management” products that are provided by enterprise resource planning vendors as part of an overall toolset to support a manufacturing, retail, and distribution operation.
The two commercial products we are aware of that are suitable for a media company are:
- Alteryx is a broad data preparation software.
- Our Pipeline API is a low-cost data cleansing API which can be used to simply set up rules-based filters to ensure the cleanliness of the data lake
We like Alteryx for its tight integration with Tableau and our Data Pipeline API for its API that makes it easy to build into your data workflow