Early Data Checks Save Time — and Money

Sahil Parekh
4 min read · Aug 19, 2021

Modern data applications involve lengthy processing cycles, and early data checks can save time, catch errors, and ultimately save resources.

The term Data Pipeline is a reference to industrial pipelines that move liquids or solids from one stage of a process to the next. These physical pipelines follow standard quality-check practices along their entire length, verifying that there are no leaks, pressure losses, sudden changes in composition or properties, and so on. Yet while such practices are standard regulation for industrial pipelines, it is rare to see comparable practices put in place to ensure data quality throughout a Data Pipeline. This is an area where Data Engineers can learn from industrial and mechanical engineers, avoiding lengthy postmortems and corrections when we discover, at the end of a processing cycle, that the data fed into the process was faulty. In this article, we will look at the stages of a common data pipeline and at how data checks in the early stages can reduce the time and resources spent while protecting the integrity of the overall outcome.

Ingestion Checks

Most data pipelines can be clearly divided into stages, typically ingestion (or raw), processing, and staging. Just as industrial pipelines check the properties and composition of the raw materials being ingested, data pipelines should have early data checks in the ingestion layer to ensure that basic requirements such as required files, file types, and the data type schema are verified at the start of processing. This can alert us to changes in data sources, for example a change in an API used to capture data or a broken streaming process, that could affect the product. These are basic checks: we are not looking at the content of the data but at its shape, to judge its fitness for the intended use. Beyond the checks mentioned above, we can also implement row counts, comparing, for example, the number of rows expected after a transformation has been made. This works well in applications with a fixed, known number of users to update, where the expected row count lets us detect missing or duplicated rows, a common problem when using complex table joins. A minimal sketch of such checks is shown below.
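As a rough illustration, here is a minimal sketch of ingestion-layer checks in Python with pandas, assuming CSV files landing in a directory. The file name `users.csv`, the expected schema, and the expected row count are hypothetical examples, not part of any specific pipeline.

```python
import pandas as pd
from pathlib import Path

# Illustrative expectations for the ingestion layer: the file name, schema,
# and row count below are hypothetical, not taken from a real pipeline.
REQUIRED_FILES = {"users.csv"}
EXPECTED_SCHEMA = {"user_id": "int64", "country": "object", "signup_date": "object"}
EXPECTED_ROW_COUNT = 10_000  # e.g. a fixed, known number of users to update


def check_ingestion(landing_dir: str) -> list[str]:
    """Run basic shape checks on newly landed files and return any issues found."""
    issues = []
    landing = Path(landing_dir)

    # 1. Required files are present
    present = {p.name for p in landing.glob("*.csv")}
    missing = REQUIRED_FILES - present
    if missing:
        issues.append(f"missing files: {sorted(missing)}")
        return issues  # nothing else to check without the files

    df = pd.read_csv(landing / "users.csv")

    # 2. Schema: required columns exist and carry the expected data types
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            issues.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            issues.append(f"column {col} is {df[col].dtype}, expected {dtype}")

    # 3. Row counts: catch missing or duplicated rows, e.g. from a bad join upstream
    if len(df) != EXPECTED_ROW_COUNT:
        issues.append(f"row count {len(df)} != expected {EXPECTED_ROW_COUNT}")
    if "user_id" in df.columns and df["user_id"].duplicated().any():
        issues.append("duplicate user_id values found")

    return issues


problems = check_ingestion("/data/landing")
if problems:
    raise ValueError(f"Ingestion checks failed: {problems}")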

Staging Checks

After the data has been processed and transformed, it is generally parked in a staging layer before being distributed to stakeholders through different databases or data marts. This stage can also be considered the business layer, as it is where we finally apply rules and logic based on the end use of the data before sending it to its final destination. Applying these checks requires good business understanding, and the rules and constraints should be clearly stored and documented so they can be updated and applied with ease. This second layer of checks can, for example, look at the numerical distribution of the data to spot sudden changes in key metrics and determine whether a change is legitimate or the result of faulty data being processed. These checks can also look at the data itself to keep up with regulations such as GDPR, ensuring no personal data is exposed, and so on. One example would be to plot the views of a certain product in a marketplace and check whether they follow expected trends, such as an increase in sales due to a discount or a holiday season. A sketch of such staging checks follows.
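In the same hedged spirit, the sketch below assumes a staging DataFrame of daily product views with columns `date`, `product_id`, and `views`; the forbidden-column list and the 50% change threshold are illustrative assumptions.

```python
import pandas as pd

# Columns that must not reach the data marts (GDPR / personal data). The column
# names and the 50% threshold are illustrative assumptions, not fixed rules.
FORBIDDEN_COLUMNS = {"email", "full_name", "ip_address"}


def check_staging(daily_views: pd.DataFrame, max_relative_change: float = 0.5) -> list[str]:
    """Business-layer checks on a staging table of daily product views."""
    issues = []

    # 1. Regulatory check: personal data should have been dropped or anonymised upstream
    exposed = FORBIDDEN_COLUMNS & set(daily_views.columns)
    if exposed:
        issues.append(f"personal data columns present in staging: {sorted(exposed)}")

    # 2. Distribution check: compare the latest day's total views against the
    #    trailing 7-day average. A big jump may be legitimate (a discount, a
    #    holiday season) or a sign of faulty data; either way, flag it for review.
    totals = daily_views.groupby("date")["views"].sum().sort_index()
    if len(totals) >= 8:
        baseline = totals.iloc[-8:-1].mean()
        latest = totals.iloc[-1]
        if baseline > 0:
            change = abs(latest - baseline) / baseline
            if change > max_relative_change:
                issues.append(
                    f"views on {totals.index[-1]} changed {change:.0%} vs the 7-day average"
                )

    return issues
```

In practice, the list of issues returned here would feed whatever alerting or orchestration layer the pipeline already uses, so the run can be halted or flagged before the data is distributed.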

Conclusion

Data checks early in the process are a must-have practice for every Data Engineer building a data pipeline, because they let us establish checkpoints where we validate data from the data engineering, business, or reporting perspective, each applying checks according to the intended use of the data. These checks should be automated and monitored as well as possible, so that when faulty data is found we can re-trigger the process or flag the bad data accordingly. So, the next time you drink a glass of water, think of all the quality checks put in place, from source to distribution, to ensure its quality, and how we should learn from them when building the next generation of data platforms.
