What Data Engineering has learned from DevOps

Sahil Parekh
4 min read · Feb 1, 2022

The need for version control and accountability applies just as much to the way we handle data.

DevOps emerged around 20 years ago, when business requirements began to outpace software development times. That pressure gave rise to a series of methods aimed at helping software development teams that were struggling to meet deadlines, alongside the adoption of Agile and other development practices.

Broadly speaking, DevOps can be considered a loose set of methodologies and practices that an organization puts in place to increase the throughput of its software development teams. Some of these practices can be traced back to Agile methodologies; others are simply structured ways to guide application development and design, code management, and so on.

DevOps Best Practices

DevOps has continued to progress to its current state, in which its two fundamental halves, development and operations, share established practices for continuous integration and deployment, automated health checks, notifications on resource outages, and so on. These capabilities make organizations more flexible: they can speed up the development and deployment of new applications that better serve their customers, or improve existing ones to adapt to new trends.
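To make the monitoring half concrete, here is a minimal sketch of an automated health check with a notification hook, in Python. The service endpoint and webhook URL are hypothetical placeholders, not any real setup.

```python
# Minimal automated health check with an alert hook (illustrative only).
import json
import urllib.request

SERVICE_URL = "https://example.internal/healthz"  # hypothetical endpoint
WEBHOOK_URL = "https://example.internal/alerts"   # hypothetical alert hook


def check_service(url: str, timeout: int = 5) -> bool:
    """Return True if the service answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers DNS failures, timeouts, and HTTP errors
        return False


def notify(message: str) -> None:
    """Post an alert payload to a webhook (e.g. a chat channel)."""
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": message}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)


if __name__ == "__main__":
    if not check_service(SERVICE_URL):
        notify(f"Health check failed for {SERVICE_URL}")
```

In a real DevOps setup a scheduler or monitoring platform runs checks like this continuously; the sketch only shows the shape of the idea.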

In recent years a similar trend has started to emerge in data engineering. The pipelines being designed and built have moved more and more from a single data warehouse, where data was ingested, processed, and distributed, to a complex set of resources working in concert to transform and deliver data to analysts, data scientists, and business teams. The skills now expected of a data engineer include architecting distributed data systems, implementing efficient data quality checks, joining and transforming data from a myriad of sources, and collaborating with data scientists, analysts, and business teams to develop meaningful datasets. A typical slice of that work looks like the sketch below.
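This is a minimal pandas sketch of a join-and-check step; the file names and columns are hypothetical, assumed only for illustration.

```python
# Join two hypothetical sources and run basic quality checks (pandas).
import pandas as pd

orders = pd.read_csv("orders.csv")        # hypothetical extract
customers = pd.read_csv("customers.csv")  # hypothetical extract

# Join the sources into an analyst-facing dataset
enriched = orders.merge(customers, on="customer_id", how="left")

# Simple data quality checks before the dataset is published
assert enriched["order_id"].is_unique, "duplicate order ids"
assert enriched["customer_id"].notna().all(), "orders without a customer"
assert (enriched["amount"] >= 0).all(), "negative order amounts"
```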

Introducing DataOps

In some cases, data pipelines and platforms began to require data engineering teams to have a deep understanding of the resources those pipelines use: compute services, databases, private networks, and so on. This is the point where the data engineering role started to overlap with the DevOps role, as data engineers took an increased interest in monitoring the resources in their pipelines, not only to ensure these are all up and running, but also to implement checks and controls on the volume and content of the data ingested. Over time, data engineering has started to think in terms of data products instead of pipeline scripts, and to adopt strategies for keeping up with business needs in development and deployment times, as well as for providing quality, reliability, and observability in the data being served. A volume check like the one sketched below is a common first control.
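Here is a minimal sketch of such a volume check; the counts and threshold are illustrative assumptions, and in practice the history would come from pipeline metadata rather than a hard-coded list.

```python
# Flag ingestion runs whose row count deviates from recent history.
import statistics

# Row counts from the last seven loads (illustrative; normally read
# from pipeline metadata or the warehouse itself)
history = [10_120, 9_980, 10_340, 10_050, 10_210, 9_890, 10_160]
today = 4_300  # hypothetical count from the latest ingestion

mean = statistics.mean(history)
stdev = statistics.stdev(history)

# Alert when today's volume deviates more than three standard deviations
if abs(today - mean) > 3 * stdev:
    print(f"Volume anomaly: got {today} rows, expected ~{mean:.0f}")
```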

These changes are starting to define a new role, the DataOps engineer, who is tasked with provisioning and maintaining data infrastructure rather than collaborating on and designing the data product itself. The trend is reinforced by several cloud providers embracing tools such as Terraform, which let teams deploy and configure resources as code, and by the rise of monitoring tools that follow the DevOps ethos but from a data quality and reliability perspective. This separation of roles becomes more natural as DataOps responsibilities diverge from ordinary data engineering: they focus much more on designing and supporting the resources other teams use to ingest, process, and deliver data, as well as on monitoring overall quality and ensuring reliability.
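To give a flavor of infrastructure as code, here is a minimal Pulumi sketch in Python (Terraform's HCL expresses the same idea declaratively); the bucket and its settings are illustrative assumptions, not a real deployment.

```python
# Declare a versioned S3 bucket for raw data as code (Pulumi).
import pulumi
import pulumi_aws as aws

# Landing bucket for raw ingested data; versioning lets bad loads be rolled back
raw_bucket = aws.s3.Bucket(
    "raw-data",
    versioning=aws.s3.BucketVersioningArgs(enabled=True),
)

# Expose the generated bucket name to other stacks and pipelines
pulumi.export("raw_bucket_name", raw_bucket.id)
```

Because the resource is declared in code, it can be peer reviewed, versioned, and recreated on demand, exactly the accountability this article opens with.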

Conclusion

The trends described above will continue over the next few years, with infrastructure as code spreading as cloud providers adopt tools such as Terraform, and with reliability, observability, monitoring, and data checks becoming standard practice for all data products. This will let organizations speed up the development of data products built on reliable information that can later be used across departments, fostering a data-driven way of thinking. DevOps methodologies will be applied progressively more to data engineering as data products come to rely on well-managed, peer-reviewed code, efficient resource monitoring, efficient data quality checks, and collaboration between teams to build reliable outcomes.
