Your Data Lake is Polluted, Now What?


Polluted Data Lakes and Data Warehouses are becoming an increasingly significant problem resulting in AI Bias and Data & AI Poisoning security vulnerabilities. 

Counterfeit data, in both non-deceptive and deceptive forms, is entering the global data supply chain at an alarming rate.  Monitoring for established data quality standards and protecting against emerging AI Security threats are critical for ensuring the optimal use of your Data Lake.

Zectonal offers 4 ways to prevent your Data Lake from becoming polluted, resulting in more impactful AI-driven outcomes.

In a previous blog, Don’t Underestimate the Importance of Characterizing Your Data Supply Chain, we defined the 5 components of the data supply chain and compared them to an Industrial Age physical supply chain. We defined the 5 data supply chain components as:

Data -> Data Pipelines -> Data Lake -> Data Analytic Software -> AI-driven Insights

And compared them to the 5 Industrial Age components:

Raw Materials -> Fleet Transportation -> Factories -> Manufacturing Equipment -> Finished Goods

In this article, we describe how Zectonal’s software can benchmark the quality of data supply chains, and how you can maintain a more pristine, higher quality data lake as a result.  We focus first on monitoring the data from the point it originates, through its flow to the data lake via data pipelines.

A Good Place to Start – Monitor Before “E”

The optimal step in the data lifecycle for data observability monitoring is not always obvious. One of the most fundamental processes applied to data before it is ultimately ingested into a data lake is the Extract-Transform-Load (“ETL”) process.  ETL processes can be simple, but they become more complex when they are used to combine, or fuse, multiple data sets.

Monitoring your data for quality metrics or embedded threats before it is extracted and ultimately loaded into your data lake is imperative to maintaining a high-quality data repository.  If the data is combined with other data as part of the ETL process, quality defects and malicious payloads are carried deeper into the data manufacturing process.  The more complicated the ETL process, the more difficult it becomes to detect defects and enforce quality standards.

We have tested other types of tools that monitor data quality after ETL processes are executed, and in our opinion, it’s too late: your data lake is already contaminated.

Catch Missing Data Before It Pollutes Your Data Lake

It is not uncommon for large datasets to have a portion of their data empty or missing. In one extreme case, Google is said to have one of the largest training data sets for a natural language model using trillions of features.  One would have to assume a portion of the features were empty.

But when data, specifically training data for AI, starts to contain an unexpected number of missing values, unforeseen behavior could start to occur in the performance and outcomes of the AI model.  Establishing threshold benchmarks that alert you when missing values fall below a specific minimum or exceed a specific maximum is one way to ensure that too much missing data does not unintentionally end up in your data lake, where it would pollute the data used for training accurate AI models.
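As a rough illustration of such a threshold, the sketch below computes the share of missing cells in a batch of rows and compares it to a minimum and maximum. It is not Zectonal’s implementation; the function names, the list of missing-value tokens, and the threshold values are all assumptions chosen for the example.

```python
import csv
import io

# Tokens treated as "missing"; this list is an assumption and should match your data.
MISSING_TOKENS = {"", "NA", "N/A", "NULL", "null", "NaN"}


def missing_ratio(rows):
    """Return the fraction of cells that are empty or hold a missing-value token."""
    total = 0
    missing = 0
    for row in rows:
        for cell in row:
            total += 1
            if cell.strip() in MISSING_TOKENS:
                missing += 1
    return missing / total if total else 0.0


def check_missing(rows, min_ratio=0.0, max_ratio=0.05):
    """Compare the missing-value ratio against hypothetical min/max thresholds."""
    ratio = missing_ratio(rows)
    return {"ratio": ratio, "ok": min_ratio <= ratio <= max_ratio}


# Example: a tiny CSV batch inspected before it is loaded anywhere.
sample = "id,age,city\n1,34,Austin\n2,,Boston\n3,NA,\n"
rows = list(csv.reader(io.StringIO(sample)))[1:]  # skip the header row
result = check_missing(rows, max_ratio=0.25)
print(result)  # 3 of 9 cells are missing, so the batch fails a 25% maximum
```

A minimum is included as well as a maximum because a feed that normally carries some empty values and suddenly carries none can be just as suspicious as one that carries too many.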

Data & AI Poisoning Can Spread Rapidly and is Far Reaching

Our Zectonal research team recently identified an attack vector that could be used in an AI Poisoning attack.  This led us to develop a unique data payload security capability to protect against this type of general AI attack vector.  In a specific case, a payload and exploit were triggered by normal ETL operations associated with a specific open-source software tool found in many data lake environments. Allowing this type of payload to be loaded into a data lake would likely result in this malicious payload triggering an exploit multiple times over a long period of time by common-variety ETL processes.  Once you upload the payload into the data lake, the chances of discovery are significantly decreased due to the scale of a data lake versus checking newly arriving data.

Focusing data observability at a specific point in the lifecycle of data helps ensure both detection and prevention are not mutually exclusive.

It’s Cheap to Copy Data, Not So Much Oil

Everyone has heard the analogy Data is the New Oil, but we all know it’s more complicated than that!  A significant difference between the Industrial Age physical supply chains and the modern data supply chains is how easy and cheap it is to replicate data versus physical raw materials. 

Replicating physical raw material inputs like oil and minerals was not possible. We take for granted that once data is created, it is often easy to copy or slightly modify it before introducing it into a data supply chain.

Counterfeit Data

Substitute OEM for Original Data Creators in the supply chain research article below, and it accurately describes the same challenge data supply chains face with data consumers that produce AI products:

Supply chains (SCs) have become geographically dispersed and complex, raising increasing issues with regard to the traceability and visibility of the products and services they exchange (MacCarthy et al. 2016; Revilla and Saenz 2017; Cao, Bryceson, and Hine 2020). Manufacturers and consumers face a growing issue with the provenance or authenticity of products exchanged through global supply chains. Counterfeiters increasingly have access to the same quality of technology used by Original Equipment Manufacturers (OEMs) (Stevenson and Busby 2015).[1]

Using similar language from the article to describe the analogous counterfeit data, we can define two forms of counterfeit data:

Counterfeit fake data
  • Deceptive Counterfeits – this is data that is intentionally and maliciously copied or manipulated before entering the data supply chain.
  • Non-Deceptive Counterfeits – this is data that is intentionally copied or manipulated before entering the data supply chain, but for a specific, legitimate purpose.  A good example is synthetic data used for AI training.

The Rise of Synthetic Data

While it may seem unfair to label the emerging synthetic data industry as non-deceptive counterfeit data, since we can speak first-hand to its value for ML training, discerning real data from synthetic data in global data supply chains will become a significant challenge in the very near future.

The net result is that the global data supply chain will continue to be flooded with data that enters at various stages of aggregation and resale.

Detecting Counterfeits – Checksums Do Not Tell the Whole Story

Our data observability software differentiates between detecting data files and data content that are likely to be duplicated.

For more advanced data counterfeiters, we observe duplicate data “packaged” in different file formats to avoid detection.  A simple technique counterfeiters use to avoid duplicate detection is to apply different compression techniques to data files.  A CSV file without compression will have a different checksum than the same file compressed using GZIP.  GZIP has 9 different configurable levels of compression, and different levels can produce different checksums for the same source file.  BZ2 is another common compression technique, and there are many others.

More advanced data counterfeiters will slightly manipulate aspects of the data, usually the schema, as another way to avoid detection.  A different schema might show up in a data dictionary or schema repository and look different than its original copy.

Detecting duplicate data using techniques based on data content and not just using checksums is necessary to avoid duplicate data pollution.
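The difference between a file checksum and a content-based check can be sketched in a few lines. The example below is an illustration only, not a description of Zectonal’s software: it assumes the compression format can be recognized from the leading magic bytes, and the helper names `unpack`, `file_checksum`, and `content_fingerprint` are hypothetical.

```python
import bz2
import gzip
import hashlib

GZIP_MAGIC = b"\x1f\x8b"  # leading bytes of a gzip stream
BZ2_MAGIC = b"BZh"        # leading bytes of a bzip2 stream


def unpack(blob: bytes) -> bytes:
    """Undo gzip or bzip2 packaging, detected from magic bytes (a heuristic)."""
    if blob.startswith(GZIP_MAGIC):
        return gzip.decompress(blob)
    if blob.startswith(BZ2_MAGIC):
        return bz2.decompress(blob)
    return blob


def file_checksum(blob: bytes) -> str:
    """Checksum of the file exactly as stored; changes with the packaging."""
    return hashlib.sha256(blob).hexdigest()


def content_fingerprint(blob: bytes) -> str:
    """Checksum of the unpacked content with line endings normalized."""
    content = unpack(blob).replace(b"\r\n", b"\n")
    return hashlib.sha256(content).hexdigest()


# The same CSV content, packaged four different ways.
raw = b"id,value\n1,a\n2,b\n"
variants = [
    raw,
    gzip.compress(raw, compresslevel=1),
    gzip.compress(raw, compresslevel=9),
    bz2.compress(raw),
]

for blob in variants:
    print(file_checksum(blob)[:12], content_fingerprint(blob)[:12])
```

The first column differs between the uncompressed, GZIP, and BZ2 variants, while the second column is the same for every variant. This sketch does nothing about the schema manipulation described above; catching that would require comparing the data values themselves, for example after normalizing column names and order.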

Cost of Duplicate Data – Cloud is More Pervasive Than On-Premises

It goes without saying that duplicate data is expensive.  Data duplication detection is not new and has been a focus of the on-premises storage technology industry for several decades.  What makes this different is the lower cost, greater accessibility, and ubiquitous nature of cloud storage, resulting in a much larger-scale data duplication challenge than was ever faced by on-premises storage solutions.

Late Data – Don’t Assume You Are Getting Your Data on Time

Another nefarious and hard to detect data lake pollutant is late data.  We differentiate between data that does not arrive within an agreed-upon time window (your Data SLA) and data that shows up later than it should (Late Data). Both are problems but need to be monitored and handled differently.

An example scenario is that your data is seemingly showing up every day as anticipated and agreed upon with your data provider.  You expect 40-50 files’ worth of data per day from your data pipeline, and using our data observability software with appropriately configured min and max thresholds, everything looks normal.
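A daily volume check of this kind can be sketched as follows. This is a minimal illustration, not Zectonal’s implementation; the function name, the dates, and the 40-50 bounds used as defaults are assumptions taken from the scenario above.

```python
from collections import Counter
from datetime import date


def daily_count_alerts(arrival_dates, expected_days, min_files=40, max_files=50):
    """Return {day: file_count} for every expected day outside the thresholds."""
    counts = Counter(arrival_dates)  # days with no arrivals count as 0
    return {
        day: counts[day]
        for day in expected_days
        if not (min_files <= counts[day] <= max_files)
    }


# Hypothetical arrivals: a normal day, a thin day, and a day with nothing.
arrivals = [date(2024, 1, 1)] * 45 + [date(2024, 1, 2)] * 12
expected_days = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3)]

alerts = daily_count_alerts(arrivals, expected_days)
print(alerts)  # January 2 (12 files) and January 3 (0 files) are flagged
```

Passing the list of expected days explicitly matters: a day on which nothing arrives never appears in the arrival data, so it can only be flagged if the check knows that day was supposed to have files.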

Over time you start to discover timestamps that are relatively new showing up in data you assumed had arrived several weeks or months ago.  We characterize this as late data.

Because it’s usually not a common practice to go back into older sections in your data warehouse where you just assumed everything was normal, detecting late data is often tedious and difficult.
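One way to surface late data is to compare the day a file claims to belong to with the time it actually landed in storage. The sketch below assumes both values are available for each file (for example from a partition path and an object’s last-modified time); the `FileRecord` type, the `find_late_files` function, and the one-day grace period are hypothetical choices for illustration.

```python
from datetime import date, datetime, time, timedelta
from typing import NamedTuple


class FileRecord(NamedTuple):
    key: str               # object key or file path
    partition_day: date    # the day the data is supposed to belong to
    arrived_at: datetime   # when the file actually landed in storage


def find_late_files(records, grace=timedelta(days=1)):
    """Return (key, lateness) for files that landed after their deadline."""
    late = []
    for rec in records:
        # The partition day closes at the following midnight; anything that
        # lands after the grace period beyond that is treated as late.
        deadline = datetime.combine(rec.partition_day, time.min) + timedelta(days=1) + grace
        if rec.arrived_at > deadline:
            late.append((rec.key, rec.arrived_at - deadline))
    return late


# Hypothetical records: one file on time, one that shows up weeks later.
records = [
    FileRecord("2024-01-05/part-0001.csv", date(2024, 1, 5), datetime(2024, 1, 5, 23, 10)),
    FileRecord("2024-01-05/part-0002.csv", date(2024, 1, 5), datetime(2024, 2, 20, 8, 0)),
]

late = find_late_files(records)
for key, lateness in late:
    print(key, "late by", lateness)
```

Running a check like this across older partitions on a schedule, rather than only on the newest day, is what catches files that quietly appear in sections of the warehouse you assumed were complete.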

Why is Late Data a Pollutant?

Some pollutants are more transparent and harder to detect than others, and late data is one of those.  Let’s assume that your ML training routines need to re-train once a month due to a known “drift” in your data, where “drift” refers to subtle changes occurring over time according to a known pattern.

If your data is late enough that it does not get incorporated into these re-training routines, then it’s not contributing to a significant aspect of its value.  Not only is the late data of a lesser value, but any real-world activities the data may represent are not incorporated into your AI because it was not part of the training process.  This means that the inference and predictive capabilities of your model have diminished and do not model real-world scenarios.  This can lead to AI Bias and other problematic outcomes.  The more late data you receive, the less effective your AI becomes.

Standard ML Training Metrics Are Not Applicable

In this scenario, if you were just looking at standard ML training metrics related to over- and under-fitting your model, or accuracy/precision/recall of your model, you would have no indication there was anything different from the previous re-training activity.  The model performance characteristics could look good, yet you might never know your training data did not simulate the actual environment you wish to model and train towards.  Your metrics would not indicate that a statistically significant portion of your training data is missing.

What’s In Your Bucket?

Samuel L. Jackson is famous for his acting as well as his memorable quotes and taglines.  We like a variant of one of his more recent taglines that brings immediate brand awareness to a well-known financial institution.  What’s in your bucket?

Cloud storage buckets, including Amazon S3, Microsoft Azure Blob Storage, Google Cloud Storage, and several on-premises equivalents, are often the final stop on the data pipeline before data is ingested into a data lake.  In more and more enterprise storage architectures, the cloud bucket is an extension of the data lake itself.

Cloud storage is so ubiquitous, so accessible, and so inexpensive that it is not uncommon for these buckets to become the proverbial “junk drawer.” In our decade-plus of experience supporting Fortune 500 firms, financial institutions, and other data-intensive customers, we never cease to be amazed at what is in our customers’ buckets!

When we use our software to show customers what’s in their bucket, they are often amazed and admit they had no clue what was actually there.

Cloud “Junk Drawers” 

Here are a few of the reasons why we believe your cloud storage can become a junk drawer:

Non-Intuitive User Interfaces

In our opinion, cloud service providers still have a lot more work to do in order to develop an intuitive user interface (“UI”) console to quickly characterize what is in a data bucket. It is especially difficult to use their UI if you have a large number of files and have to paginate across sometimes dozens or hundreds of pages. 

Homegrown Scripts

Moderately to highly sophisticated customers usually write their own software scripts to obtain this information. We see extensive use of Python-based Cloud SDK scripts in this scenario.  As is often the case with homegrown solutions, they are rarely maintained, and knowledge is rarely transferred when employees leave.  This limits the value of homegrown scripts over time.
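For a sense of what such a homegrown script typically does, here is a minimal sketch that summarizes a bucket listing by file extension. It deliberately leaves out the cloud SDK call itself: it assumes you already have (key, size) pairs from whatever paginated listing API your provider offers, and the function name `summarize_bucket` and the sample keys are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def summarize_bucket(objects):
    """Group (key, size_in_bytes) pairs by file extension.

    Returns {extension: {"files": count, "bytes": total_size}}.
    """
    summary = defaultdict(lambda: {"files": 0, "bytes": 0})
    for key, size in objects:
        # Only the last suffix is used, so "db.sql.gz" is counted under ".gz".
        ext = PurePosixPath(key).suffix.lower() or "(none)"
        summary[ext]["files"] += 1
        summary[ext]["bytes"] += size
    return dict(summary)


# Hypothetical listing; in practice this would come from a cloud SDK.
listing = [
    ("raw/2024/a.csv", 1200),
    ("raw/2024/b.csv", 800),
    ("raw/2024/c.parquet", 5000),
    ("tmp/experiment/notes", 10),
    ("backup/db.SQL.GZ", 300),
]

summary = summarize_bucket(listing)
for ext, stats in sorted(summary.items()):
    print(ext, stats["files"], "files,", stats["bytes"], "bytes")
```

Even a summary this simple tends to reveal extensions nobody expected to find, which is usually the first sign of a junk drawer.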

Inexpensive Experimentation

Inexpensive experimentation, especially at a large scale, is a significant benefit of using the cloud.  Because scale is part of the value proposition for cloud, these experiments can quickly result in very large numbers of partitions of unused data residing in cloud object stores, kept indefinitely as potential lessons learned.

Software Analytic Tools and Their File Formats Change Over Time

Software analytic tools (i.e., the machinery of the data supply chain) change over time.  The Hadoop ecosystem created a number of Big Data file codecs that remain with us today.  Because it is relatively inexpensive to keep data as software migrations occur, these legacy files are commonly kept just in case they are needed in the future, and then never purged.

Logging and Backups

Cloud storage is optimal for backups and log file data of all varieties.  This is where the largest variety of unknown files originate.  Cloud object stores are integrated into almost every backup capability imaginable including desktops, laptops, servers of all varieties, software applications, and even other cloud services. Similarly, logging to an inexpensive cloud storage is an established best-practice under almost every scenario.

Zectonal monitors, detects, and alerts you when file types you have not explicitly whitelisted are detected in a data bucket.  This allows you to purge those files before they pollute your Data Lake.
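The idea of an allowlist check can be illustrated with a short sketch. This is not how Zectonal’s software is implemented; it is a simplified example that judges file type by extension only, and the function name `unexpected_files`, the sample keys, and the allowed extensions are assumptions.

```python
from pathlib import PurePosixPath


def unexpected_files(keys, allowed_extensions):
    """Return the keys whose extension is not on the allowlist."""
    allowed = {ext.lower() for ext in allowed_extensions}
    return [key for key in keys if PurePosixPath(key).suffix.lower() not in allowed]


# Hypothetical bucket contents and allowlist.
keys = [
    "ingest/a.csv",
    "ingest/b.parquet",
    "ingest/core.dump",
    "ingest/backup.tar",
    "ingest/README",
]

flagged = unexpected_files(keys, allowed_extensions={".csv", ".parquet"})
print(flagged)  # everything except the CSV and Parquet files
```

An extension is easy to fake, so a more robust check would also inspect the file content itself before deciding whether a file belongs in the bucket.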

Getting A Pristine and Clear Data Lake – Detection Without Prevention is Not Enough

Data Observability that works to simultaneously detect and prevent data pollution will lead to a more pristine and clearer data lake.  The point of detection is a critical aspect that is missing from other tools that try to make data more observable.  They may detect data quality issues, but for the most part it’s already too late: your data lake is already polluted and there’s no easy way to tell how far that pollution has spread into your analytics and AI.

Zectonal focuses on detection and prevention, as well as incorporating AI security considerations as part of our offering.

Learn more about Zectonal and request a free trial of our software.

Zectonal is developing unparalleled software to ensure secure and blazingly fast Data Observability and AI Security capabilities for your data. Feel more confident in your data.  Generate impactful insights.  Make better business decisions with less hassle, and at a faster pace.

Join The Conversation at Zectonal for additional Data Observability, Data Supply Chain, Data Lake, and AI Security topics in the future.

Know your data with Zectonal.

About the Author

Dave Hirko is the Founder and CEO of Zectonal.  Dave previously worked at Amazon Web Services (AWS) and Gartner, and was a Founder and PMC Member of the top-level Apache Metron Big Data Cybersecurity Platform, which was implemented by Fortune 500 institutions to find cyber anomalies using Big Data analytics.

Feel free to reach out to Dave at