Data Platform Design Thinking

Sahil Parekh
4 min read · Sep 20, 2021

Challenges and best practices in designing modern cloud data platforms

Data has become increasingly central to organizations in recent years. As the sources of information multiply, democratizing the insights obtained from data, so that more informed decisions can be made throughout an organization, has become a must for many companies. Data platforms are the principal way these changes are funneled into the organization, as they are where the data journey begins. Designing data platforms that can meet the requirements of the future is essential for all data engineers, and the architecture and selection of tools used to build them is a critical point to discuss. Companies implementing data-driven decision-making are pushing new standards in the design of data platforms.

Insights from changing flows of data

Data democratization and related practices are all about implementing data-driven decision-making in organizations. The benefits of these initiatives have long been discussed: they make users self-sufficient, remove biases that derive from a lack of information, and improve the overall efficiency of a company. Whether the data is used for business analytics or operational intelligence, these new systems require tighter control and governance of data, as well as self-service capabilities so users can find, use, and understand it.

Modern data platforms are required to serve more users than ever before, each with specific requirements derived from their own perspective on the data. This data was traditionally served through data marts and dashboards for analytics, fed by carefully designed pipelines that relied on restrictive schemas and data definitions to consume and process data in data warehouses. Over time, the number of sources grew to include IoT devices, CRM systems, and unstructured information captured from a myriad of origins. This pushed the incorporation of data lakes, which hold unstructured data so it can later be structured, transformed, and consumed.

Moreover, real-time KPIs for operational intelligence, data democratization, data governance policies, and data protection regulations have pushed data engineers to design robust data platforms that not only navigate this complex environment but also provide transparency on how the data is being processed.

Traits of Modern Data Platforms

For a long time, the only clear ethos in the design of data platforms was the incorporation of distinct stages for raw, staged, and processed data, which allowed pipelines to recover from outages without losing data. This approach has started to change, moving toward traits that make data platforms future-proof.
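To make that staged ethos concrete, here is a minimal sketch in Python; the zone names, file paths, and JSON-on-disk format are illustrative assumptions, not a prescribed layout:

```python
import json
from pathlib import Path

# Illustrative zone layout; names and locations are assumptions.
RAW = Path("lake/raw")
STAGING = Path("lake/staging")
PROCESSED = Path("lake/processed")

def ingest(batch_id: str, records: list) -> None:
    """Persist the batch untouched, so a failed run can be replayed from raw."""
    RAW.mkdir(parents=True, exist_ok=True)
    (RAW / f"{batch_id}.json").write_text(json.dumps(records))

def process(batch_id: str) -> None:
    """Read from raw, transform into staging, then promote to processed."""
    records = json.loads((RAW / f"{batch_id}.json").read_text())
    cleaned = [r for r in records if r.get("value") is not None]
    STAGING.mkdir(parents=True, exist_ok=True)
    PROCESSED.mkdir(parents=True, exist_ok=True)
    staged = STAGING / f"{batch_id}.json"
    staged.write_text(json.dumps(cleaned))
    # Promotion is a rename, so consumers never see a half-written file.
    staged.replace(PROCESSED / f"{batch_id}.json")

ingest("2021-09-20", [{"value": 42}, {"value": None}])
process("2021-09-20")
```

Because the raw batch is written before any transformation, a crash during processing can be recovered by simply re-running from the raw zone.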

The traits that characterize modern data platforms can be summarized as follows:

— Scalability: Resources used in data platforms need to scale with constant changes in traffic and usage without disruption; databases and processing systems alike must adapt to demand. Running containerized applications on orchestrators like Kubernetes, which scale them automatically when needed, and using platforms like Databricks or Snowflake, which offer on-demand compute, solves the problem of adapting to spikes in demand.

— Simple interfaces to consume data: Another trait data engineers should strive for is making data easy to consume from the user's perspective. Microservices running as containerized applications that expose data through APIs, or specifically tailored data marts with the required data protection in place, are hallmarks of great data platforms (see the API sketch after this list). This ensures users can quickly consume data and obtain insights from it.

— Transparency: Data catalogs that track all the required metadata, as well as the current state of data assets, are a great way to provide transparency on how data is used. Users can browse these catalogs to explore existing assets and confirm that the data is up to date and matches expectations (see the catalog sketch after this list).

— Data Checks: Data needs to be checked at each step of ingestion and processing to ensure everything matches expected values. These checks should cover both the data engineering perspective (row counts, expected schema and data types) and business rules (expected value ranges, distributions in line with previous loads, etc.), as in the validation sketch after this list.

— Idempotency: Processes need to produce the same result no matter how many times they are triggered. This way we can be sure that re-triggering a process that failed won't generate duplicate values (see the upsert sketch after this list).
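To illustrate a simple consumption interface, the sketch below exposes a read-only endpoint with FastAPI. The route, metric, and in-memory store are hypothetical stand-ins for a query against a governed data mart:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Stand-in for a curated data mart; in practice this would be a query
# against a governed store, not an in-memory dict.
SALES_BY_REGION = {"emea": 1_250_000, "apac": 980_000, "amer": 2_100_000}

@app.get("/metrics/sales/{region}")
def sales_for_region(region: str) -> dict:
    """Serve one well-defined slice of data to consumers."""
    if region not in SALES_BY_REGION:
        raise HTTPException(status_code=404, detail="unknown region")
    return {"region": region, "total_sales": SALES_BY_REGION[region]}
```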
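For transparency, a catalog entry can be as small as structured metadata recorded alongside each asset. This minimal shape is an assumption; real catalogs track far more (lineage, quality scores, access policies):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CatalogEntry:
    """Minimal metadata a consumer needs in order to trust a dataset."""
    name: str
    owner: str
    schema: dict  # column name -> type
    source: str
    last_updated: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

entry = CatalogEntry(
    name="processed.daily_sales",
    owner="data-engineering@example.com",
    schema={"region": "string", "total_sales": "double", "day": "date"},
    source="lake/processed/daily_sales",
)
```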
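For data checks, a pipeline step might assert both engineering expectations and business rules before promoting a batch. A sketch with pandas, where the column names and thresholds are illustrative assumptions:

```python
import pandas as pd

def validate_batch(df: pd.DataFrame) -> None:
    """Fail fast if the batch violates engineering or business expectations."""
    # Engineering checks: shape and schema.
    assert len(df) > 0, "batch is empty"
    expected = {"region": "object", "total_sales": "float64"}
    for col, dtype in expected.items():
        assert col in df.columns, f"missing column: {col}"
        assert str(df[col].dtype) == dtype, f"unexpected dtype for {col}"
    # Business checks: plausible value ranges (thresholds are assumptions).
    assert (df["total_sales"] >= 0).all(), "negative sales found"
    assert df["total_sales"].max() < 10_000_000, "sales outside expected range"

validate_batch(pd.DataFrame(
    {"region": ["emea", "apac"], "total_sales": [1_250_000.0, 980_000.0]}
))
```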
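Finally, a common way to achieve idempotency is to key writes on a natural identifier so that a re-run overwrites instead of appending. A minimal sketch using SQLite's upsert; the table and key are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_sales (day TEXT, region TEXT, total_sales REAL, "
    "PRIMARY KEY (day, region))"
)

def load_batch(rows: list) -> None:
    """Upsert on the natural key, so re-running a failed load
    cannot produce duplicate rows."""
    conn.executemany(
        "INSERT INTO daily_sales (day, region, total_sales) VALUES (?, ?, ?) "
        "ON CONFLICT (day, region) DO UPDATE SET total_sales = excluded.total_sales",
        rows,
    )
    conn.commit()

batch = [("2021-09-20", "emea", 1_250_000.0)]
load_batch(batch)
load_batch(batch)  # re-trigger: state is unchanged, no duplicates
assert conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0] == 1
```

Running the load twice leaves the table in exactly the same state, which is the guarantee the idempotency trait describes.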

Conclusion

Companies have moved from the classic monolithic data warehouse, with its tightly defined schemas and normalization, to living systems that adapt to an ever-growing volume of data whose structure constantly changes. To provide insights to different stakeholders, each with their own perspective on the data and different levels of ability to manipulate it, data engineers can incorporate traits that guarantee access and transparency as well as robustness. They should favor modular, on-demand technologies to handle the challenges described above.
