Transparency in Data Engineering

Sahil Parekh
Oct 15, 2021 · 3 min read

Accountability through data quality and open access is the key to trustworthy data

With the explosion in the volume and complexity of the data used every day for decision making, ensuring transparency in how data is used has become paramount. Companies and organizations rely on data more than ever to run their operations and make critical decisions. It is essential to be able to clearly understand the data and trace it back to its origin. Data engineering can improve data transparency through several practices that assure users that the information they are working with is trustworthy.

Transparency in Data

The term transparency in data can be applied in several ways. It can describe how companies handle data collected from the behavior of users on their platforms, and it can also refer to the ability of stakeholders in data products to track the data they use back to its source. Overall, it implies that the processes used to collect and transform data are accessible to all users in a simple way. Data protection regulations seek to provide accountability for the use of information collected from individuals, which has led companies to adopt internal practices that ensure they can comply with these requirements.

One of the most important characteristics of transparency is interpretability. The transformations applied when aggregating multiple data sets can introduce errors that later produce wrong outputs and, ultimately, wrong decisions. An interpretable process is essential for creating data checks that verify quality is in line with expectations.

Data governance is an important requirement for data transparency. Creating specific policies that govern access to data catalogs, which hold aggregate metadata about the data and the operations applied to it, is essential both to guarantee data quality and to ensure proper use and management. Implementing data retention policies, along with specific security policies, supports proper use and regulatory compliance.
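To make the retention idea concrete, here is a minimal sketch of how a retention policy could be enforced in code. It assumes each record carries a `created_at` timestamp; the function name, the field name, and the 365-day window are all illustrative choices, not part of any specific platform.

```python
from datetime import datetime, timedelta, timezone

# Illustrative retention window; real policies come from governance rules.
RETENTION_DAYS = 365

def apply_retention(records, now=None):
    """Split records into those to keep and those past the retention window.

    Records past the window would be deleted or archived, depending on
    the organization's policy.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=RETENTION_DAYS)
    keep = [r for r in records if r["created_at"] >= cutoff]
    expire = [r for r in records if r["created_at"] < cutoff]
    return keep, expire
```

Running such a check on a schedule, and logging what was expired and why, is itself a form of transparency: the policy's effect on the data becomes auditable.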

Engineering Transparency

Most modern data platforms have been designed to provide transparency into the operations being performed. Platforms like Databricks create transaction logs for Delta tables that can later be used for auditing, and that make it possible to recover from faulty transformations by reverting to previous states. This feature provides clarity about the operations applied to the data, ensures accountability, and offers a way to recover from operations that produced wrong data.
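The mechanism behind this is easy to illustrate. The following toy class is not Databricks or Delta Lake; it is a minimal sketch of the underlying idea: every write appends a new version to a log, so any past state can be audited and restored.

```python
import copy

class VersionedTable:
    """Toy illustration of a table backed by a transaction log:
    each commit records a snapshot and a log entry, so past states
    can be read for auditing or restored after a bad transformation
    (the idea behind Delta's history and time travel)."""

    def __init__(self):
        self._versions = []  # one snapshot of the rows per commit
        self.log = []        # audit trail: which operation made which version

    def commit(self, rows, operation):
        self._versions.append(copy.deepcopy(rows))
        self.log.append({"version": len(self._versions) - 1,
                         "operation": operation})

    def read(self, version=None):
        """Read the latest snapshot, or a past one by version number."""
        if version is None:
            version = len(self._versions) - 1
        return self._versions[version]

    def restore(self, version):
        """Recover from a faulty write by re-committing an earlier
        snapshot as the new latest version (the log keeps everything)."""
        self.commit(self.read(version), f"RESTORE to v{version}")
```

Note that `restore` does not erase history: the bad version stays in the log, which is exactly what makes the recovery itself auditable.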

Other tools, like AWS Glue Data Catalog, provide insight into existing data by creating catalogs that hold information and metrics about it. As mentioned before, these catalogs can be used to visualize the data and extract metrics from it, and they can serve as the basis for data checks that aim to guarantee quality. Such checks can be implemented at early stages, to provide transparency into the initial properties of the data and its schema, and at later stages, to ensure that the distribution of the data meets business standards.
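The two kinds of checks mentioned above can be sketched in a few lines of plain Python. The function names and thresholds are illustrative assumptions, not part of Glue or any specific framework: one check validates the schema early in the pipeline, the other validates a simple distribution property later on.

```python
def check_schema(rows, expected):
    """Early-stage check: every row has exactly the expected
    fields, each with the expected type."""
    return all(
        set(row) == set(expected)
        and all(isinstance(row[field], typ) for field, typ in expected.items())
        for row in rows
    )

def check_null_ratio(rows, field, max_ratio=0.1):
    """Later-stage check: the share of missing values in a field
    stays within an agreed business threshold."""
    if not rows:
        return True
    nulls = sum(1 for r in rows if r.get(field) is None)
    return nulls / len(rows) <= max_ratio
```

In practice these predicates would run against the metadata and samples held in the catalog, and a failure would block the pipeline or raise an alert rather than silently propagate bad data.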

In recent years, data platforms have come to rely on multiple tools to collect and process data, and processing has spread across several systems. These systems interact with one another and collectively produce individual pieces of information that are later used for analytics. This poses a challenge for data engineers seeking to implement data transparency practices, as monitoring becomes increasingly complex.

Conclusion

Data transparency, in the context of data engineering, means being able to dive into the specifics of the data being used: knowing its sources, the transformations applied to it, and the checks that ensured the information is trustworthy. Upholding these principles has become increasingly complex as modern data platforms require a multitude of systems in a constantly changing environment. The answer may lie in how this information, these checks, and these catalogs are governed, through policies and tools that make the act of monitoring them as transparent as the data itself.
