What is Data Cleansing?
Data cleansing, or data scrubbing, is the process of detecting inaccurate data or records in a database and then correcting or removing them. It may also involve correcting or removing improperly formatted or duplicate records. The data targeted in this process is often referred to as “dirty data.” Data cleansing is essential for preserving data quality. Large organizations with extensive data sets typically use automated tools and algorithms to identify such records and correct common errors (such as missing zip codes in customer records).
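As a rough illustration of the steps described above, here is a minimal Python sketch that normalizes formatting, flags records with a missing or malformed zip code, and drops duplicates. The record layout, the field names (`id`, `name`, `zip`), the `cleanse` function, and the US-style zip pattern are all assumptions made for this example; production tools typically apply far richer rules.

```python
import re

# Hypothetical customer records; the field names are illustrative assumptions.
records = [
    {"id": 1, "name": "Ada Lovelace", "zip": "02139"},
    {"id": 2, "name": "ada  lovelace ", "zip": "02139"},  # duplicate with formatting noise
    {"id": 3, "name": "Alan Turing", "zip": ""},          # missing zip code
    {"id": 4, "name": "Grace Hopper", "zip": "1002"},     # improperly formatted zip code
]

# Assumed format: five digits, optionally followed by a four-digit extension.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def cleanse(rows):
    """Return (clean_rows, flagged_rows) for a list of record dicts."""
    seen = set()
    clean, flagged = [], []
    for row in rows:
        # Correct formatting: collapse whitespace and normalize capitalization.
        name = " ".join(row["name"].split()).title()
        zip_code = row["zip"].strip()

        # Detect missing or malformed zip codes and set them aside for review.
        if not ZIP_PATTERN.fullmatch(zip_code):
            flagged.append({**row, "reason": "missing or invalid zip"})
            continue

        # Remove duplicates: records that match after normalization.
        key = (name.lower(), zip_code)
        if key in seen:
            continue
        seen.add(key)
        clean.append({"id": row["id"], "name": name, "zip": zip_code})
    return clean, flagged


clean, flagged = cleanse(records)
print("clean:", clean)
print("flagged:", flagged)
```

In this sketch, flagged records are kept for later correction rather than deleted outright, which reflects the "correcting or removing" choice in the definition; whether to repair or discard a dirty record usually depends on how the data is used downstream.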
The strongest big data environments pair rigorous data cleansing tools with well-defined processes, so that data quality is maintained at scale and all types of users can remain confident in the data sets.