Efficient Monitoring for Reducing Data Downtime
Implementing monitoring strategies in data engineering pipelines to catch faulty data.
The modern data platform is becoming more complex, and data downtime, defined as the period of time during which data is missing or wrong, is one of the results of this trend. The rise of cloud providers has allowed us to detach processing operations from the traditional data warehouse and run them as Spark or Databricks jobs, consume data from multiple sources such as APIs, ingest data from streaming platforms like Kafka, and orchestrate a multitude of operations through Airflow.
To tackle this complexity, the current standard is a separation of concerns across environments and stages: we can have development and production environments, as well as ingestion, processing and staging layers for our data. Although this organizes resources better, it is becoming increasingly complicated to keep track of how all of these resources are used. It is at this point that an efficient implementation of monitoring can make the difference between chaos and neat organization.
Monitoring Data Operations
When discussing efficient monitoring, we have to separate the actual systems that we are monitoring from the tool used to aggregate the results in a meaningful way, one that allows us to clearly read all the necessary metrics.
Regarding the systems, one way to ensure the overall quality of the process is to establish data quality checks and controls. Data quality checks verify that the content meets expectations in terms of data types, schema, distribution of data, and so on. Data quality controls, by contrast, are systems designed to establish governance of the data, ensuring that all parties involved can access no more and no less than they are allowed to, and that sensitive data is protected in the right way. It is important to spend some quality time ensuring that the metrics and checks being captured are representative of the architecture and nature of your data. For example, we might have controls that ensure that no Personally Identifiable Information gets into a certain stage, and we can also have data quality tests that ensure that the row count is as expected.
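As a rough illustration of the difference between a check and a control, the sketch below validates one batch of records against an expected schema and row count, and separately flags sensitive columns that should not reach a given stage. The column names, the list of sensitive fields and the thresholds are hypothetical and would need to reflect your own data.

```python
from typing import Any

# Hypothetical expectations for one batch; adjust to your own data.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "country": str}
PII_COLUMNS = {"email", "phone", "ssn"}  # assumed list of sensitive fields


def check_schema(rows: list[dict[str, Any]]) -> list[str]:
    """Data check: report missing columns and unexpected value types."""
    problems = []
    for i, row in enumerate(rows):
        missing = set(EXPECTED_SCHEMA) - set(row)
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
        for col, expected_type in EXPECTED_SCHEMA.items():
            if col in row and not isinstance(row[col], expected_type):
                problems.append(
                    f"row {i}: {col} is {type(row[col]).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return problems


def check_row_count(rows: list[dict[str, Any]], minimum: int, maximum: int) -> list[str]:
    """Data check: report a batch whose size falls outside the expected range."""
    n = len(rows)
    if minimum <= n <= maximum:
        return []
    return [f"row count {n} outside expected range [{minimum}, {maximum}]"]


def check_no_pii(rows: list[dict[str, Any]]) -> list[str]:
    """Data control: report sensitive columns that reached this stage."""
    leaked = set()
    for row in rows:
        leaked |= PII_COLUMNS.intersection(row)
    return [f"PII columns present: {sorted(leaked)}"] if leaked else []
```

Each function returns a list of messages rather than raising, so the results of several checks can be collected and reported together for a single batch.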
Monitoring Data Systems
Efficient monitoring is critical here because, when the pipeline fails and we face data downtime, it is the tool that will guide our efforts to track down the problem, fix it, and get the pipeline up and running again. It is important to highlight that although most cloud providers have their own monitoring tools, these tools tend to point us to the most proximal error instead of the root error. This means that if we face data downtime because a DAG in Airflow could not be executed, the tool will not point us to the root error, which might be a change in the way one of our data providers sends us data through their API. Therefore, we must spend time developing the metrics that we will monitor, so that they point us to the root of the problem rather than its consequence.
Data Operation Metrics
To construct meaningful metrics to watch, we can start by asking ourselves the most common questions that come up when we tackle a failure in the pipeline: whether the data is arriving correctly at the ingestion layer, whether it has the expected schema, whether it is complete and matches the expected number of rows, and so on. A good method for coming up with meaningful metrics is to abstract the architecture of your application and design metrics that verify the correct outcome before and after each process. This is the most effective way to reduce data downtime.
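One way to picture a metric that is taken before and after a process is a small wrapper that records how many rows enter and leave a pipeline step. This is only a sketch: the names `StepMetrics`, `run_with_metrics` and `drop_negative_amounts` are made up for illustration, and row counts stand in for whatever measurements suit your pipeline.

```python
from dataclasses import dataclass
from typing import Any, Callable

Rows = list[dict[str, Any]]


@dataclass
class StepMetrics:
    """Row counts captured around a single pipeline step."""
    step: str
    rows_in: int
    rows_out: int

    @property
    def retained(self) -> float:
        """Fraction of input rows that survived the step (0.0 for empty input)."""
        return self.rows_out / self.rows_in if self.rows_in else 0.0


def run_with_metrics(step_name: str, step: Callable[[Rows], Rows], rows: Rows) -> tuple[Rows, StepMetrics]:
    """Run one step and measure its input and output sizes."""
    rows_in = len(rows)  # measured before the step runs
    result = step(rows)
    return result, StepMetrics(step_name, rows_in, len(result))


def drop_negative_amounts(rows: Rows) -> Rows:
    """Hypothetical transformation used only to demonstrate the wrapper."""
    return [r for r in rows if r["amount"] >= 0]


batch = [{"amount": 5.0}, {"amount": -1.0}, {"amount": 2.5}, {"amount": 7.0}]
cleaned, metrics = run_with_metrics("drop_negative_amounts", drop_negative_amounts, batch)
```

Here `metrics.retained` is 0.75, since three of the four rows pass the filter. A sudden drop in that ratio for a given step narrows the search to the data entering that step, rather than to whichever downstream job happened to fail first.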
Now that we have thought carefully about the metrics that will allow us to quickly find the root cause of any problem, we need to move on to the requirements for the data monitoring tool. To let us specify the types of metrics to be tracked, the tool must be granular enough to point to specific issues within the code. It should also be able to monitor these metrics as time series that are persisted into files we can then query, and it should be timely enough to provide alerts on time without disruption.
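To make these requirements concrete, here is a minimal sketch of metrics stored as a time series in a file that can be queried later, with a simple alert rule on top. The JSON Lines file layout, the function names and the 50% tolerance are all assumptions chosen for illustration; a real monitoring tool would offer far more than this.

```python
import json
import time
from pathlib import Path
from statistics import mean
from typing import Optional


def record_metric(path: Path, name: str, value: float, ts: Optional[float] = None) -> None:
    """Append one timestamped data point to a JSON Lines file."""
    point = {"ts": time.time() if ts is None else ts, "name": name, "value": value}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(point) + "\n")


def load_series(path: Path, name: str) -> list[dict]:
    """Query the file for one metric, ordered by timestamp."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        points = [json.loads(line) for line in f if line.strip()]
    return sorted((p for p in points if p["name"] == name), key=lambda p: p["ts"])


def should_alert(series: list[dict], tolerance: float = 0.5) -> bool:
    """Alert when the latest value deviates from the mean of the earlier
    values by more than `tolerance`, expressed as a fraction of that mean."""
    if len(series) < 2:
        return False  # not enough history to compare against
    *history, latest = series
    baseline = mean(p["value"] for p in history)
    if baseline == 0:
        return latest["value"] != 0
    return abs(latest["value"] - baseline) / abs(baseline) > tolerance
```

Because each data point carries the metric name, a single file can hold several series, and the alert rule can be evaluated right after every write so that a problem is flagged as soon as the offending batch is recorded.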
Reducing data downtime in complex data pipelines can be a daunting task, given the myriad operations and resources that are being orchestrated and need to work in harmony. Being able to monitor both the operations being run and the systems where these operations take place is not enough; it is also important to properly select variables that allow us to quickly point at the root cause of a certain error, in order to address it in the least amount of time. With these concepts in hand, data engineers will be able to create.