Deductive Pipeline API: Handling invalid data

If the data does not validate, a fallback option is applied. Its behavior is controlled by the following parameters:

  • fallback_mode - Specifies what should be done if the data does not comply with the rules. It can take the following values:
    • "remove_entry" - The default value. The record is removed and quarantined in a file sent to the Logger object passed to the API.
    • "use_default" - The invalid entry is replaced with whatever is specified in the "default_value" field.
    • "do_not_replace" - The record is left unchanged but is still logged to the Logger object.
  • default_value - The value that entries that do not validate are set to when fallback_mode is "use_default". Defaults to '' (an empty string).
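
The three fallback modes can be illustrated with a minimal Python sketch. This is not the API's implementation: the function name apply_fallback, the representation of records as dictionaries, and the is_valid callback are all assumptions made for illustration. It also assumes that every invalid record is logged, whichever mode is used, and that "remove_entry" drops the whole record.

```python
def apply_fallback(records, field, is_valid,
                   fallback_mode="remove_entry", default_value=""):
    """Hypothetical helper: return (kept, logged) after applying a fallback mode.

    `records` is a list of dicts, `field` is the key being validated and
    `is_valid` is a callable that returns True for acceptable values.
    """
    kept, logged = [], []
    for record in records:
        if is_valid(record[field]):
            kept.append(record)
            continue
        # Assumption: invalid records are always reported, whatever the mode.
        logged.append(record)
        if fallback_mode == "remove_entry":
            continue  # drop the record from the output
        if fallback_mode == "use_default":
            kept.append({**record, field: default_value})
        elif fallback_mode == "do_not_replace":
            kept.append(record)  # keep the record unchanged
        else:
            raise ValueError(f"unknown fallback_mode: {fallback_mode!r}")
    return kept, logged


rows = [{"age": "34"}, {"age": "abc"}]
kept, logged = apply_fallback(rows, "age", str.isdigit,
                              fallback_mode="use_default", default_value="0")
print(kept)    # the second row now carries the default value
print(logged)  # the original invalid row
```

In this sketch the logged list stands in for the Logger object, so the caller can inspect what was rejected in each mode.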

For String, Number, and Date fields, two additional methods are available for handling invalid data.

The best match option searches the other values in the field that meet the validation rule and replaces the invalid value with one that is sufficiently similar. It is controlled by two parameters:

  • attempt_closest_match - Specifies whether entries that do not validate should be replaced with the value of the closest matching record in the dataset. If no sufficiently close match is found, as specified by string_distance_threshold, the fallback_mode is still applied. Defaults to False.
  • string_distance_threshold - The distance threshold that a closest match must meet before it is applied. The measure is a variant of the Jaro-Winkler distance. Defaults to 0.7.
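
The closest match behavior can be sketched as follows. The API uses a variant of the Jaro-Winkler distance whose details are not given here, so this sketch swaps in the similarity ratio from Python's standard difflib module purely as a stand-in; scores from the two measures are not interchangeable. The function name closest_match is hypothetical.

```python
from difflib import SequenceMatcher


def closest_match(value, valid_values, threshold=0.7):
    """Hypothetical helper: return the most similar valid value, or None.

    Similarity is difflib's ratio (0.0 to 1.0), used here as a stand-in
    for the API's Jaro-Winkler variant.
    """
    best_value, best_score = None, 0.0
    for candidate in valid_values:
        score = SequenceMatcher(None, value, candidate).ratio()
        if score > best_score:
            best_value, best_score = candidate, score
    # No sufficiently close match: the caller falls back to fallback_mode.
    return best_value if best_score >= threshold else None


valid_cities = ["London", "Paris", "Berlin"]
print(closest_match("Londn", valid_cities))  # close enough to "London"
print(closest_match("zzzzz", valid_cities))  # None, so fallback_mode applies
```

A blank value scores 0.0 against every candidate, so the function returns None for it.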

Best matching is not possible where the invalid data is blank. In this case an additional option is available: the field can be replaced with the value from the record that is most similar in its other fields. This is known as a lookalike model and is most useful for filling in blank records.

It is enabled with the single parameter:

  • lookalike_match - Specifies whether entries that do not validate should be replaced with the value from the record that is most similar in its other fields. This implements a nearest neighbor algorithm based on the similarity of the other fields in the dataset. It is useful for filling in blank records and defaults to False.
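
A minimal sketch of the lookalike idea is shown below. The API's similarity measure is not described here, so this sketch assumes the simplest possible one: the number of other fields with exactly equal values. The function name lookalike_fill and the dictionary record format are hypothetical.

```python
def lookalike_fill(records, field, is_valid):
    """Hypothetical helper: fill invalid values of `field` from the nearest record.

    The nearest record is the valid one that shares the most equal values
    in the other fields (a crude nearest neighbor measure).
    """
    donors = [r for r in records if is_valid(r[field])]
    filled = []
    for record in records:
        if is_valid(record[field]) or not donors:
            filled.append(record)
            continue
        other_fields = [key for key in record if key != field]
        # Pick the donor with the most matching values in the other fields.
        nearest = max(
            donors,
            key=lambda d: sum(d.get(k) == record[k] for k in other_fields),
        )
        filled.append({**record, field: nearest[field]})
    return filled


rows = [
    {"city": "Reading", "country": "UK", "currency": "GBP"},
    {"city": "San Rafael", "country": "US", "currency": "USD"},
    {"city": "London", "country": "UK", "currency": ""},
]
# bool("") is False, so the blank currency is treated as invalid.
print(lookalike_fill(rows, "currency", bool)[2])
```

Here the third row shares its country with the first row, so its blank currency is filled from that row.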

Related documentation

  • Deductive Pipeline API on AWS - The Deductive Pipeline API is available through the AWS Marketplace
  • Deductive Pipeline Python Client - Deductive Tools includes a client for the Pipeline API
  • Deductive Pipeline API: Sample Data - Sample files to demonstrate usage of the Deductive Pipeline API
  • Deductive Pipeline API: Validating basic data types - Validating incoming datasets for basic string, number, and date type formatting and range checks using the Deductive Data Pipeline API
  • Deductive Pipeline API: Anonymizing data - The Deductive Pipeline API supports tokenization, hashing, and encryption of incoming datasets for anonymization and pseudonymization
  • Deductive Pipeline API: Referential Integrity - Using the Deductive Pipeline API to validate data against other known good datasets to ensure referential integrity
  • Deductive Pipeline API: Working with session data - The Deductive Pipeline API can check for gaps and overlaps in session data and automatically fix them
  • Deductive Pipeline API: Reporting and monitoring data quality - The Deductive Pipeline API logs data that does not meet the defined rules and quarantines bad data
  • Deductive Pipeline API: Full API reference - A field by field breakdown of the full functionality of the Deductive Data Pipeline API
