Today, the Data Engineering team has become the enabler that makes a company truly data-driven. Data teams are involved in all aspects of data analysis, which primarily are:
- Data collection: From disparate data sources
- Data cleaning/curation: Making it consumable for analysis
- Data consumers: Multiple teams across an organization, each dealing with a different subset of the data and each with its own requirements and use cases.
- Vendors/Applications: Where data is exposed as an API to end-users, who create their own dashboards and applications based on it.
The team creates data infrastructure to enable analytics on the data. Analysts/Data Scientists use this infrastructure to provide insights to the business. In some sense, the Data Engineering team has become a service provider to analysts/data scientists.
In this respect, Presto is one of the products that has gained a lot of popularity for its ad hoc interactive query capabilities (Presto vs Other Data Warehouse). It is used by almost all big internet companies. These companies are not only using it but also building tools around it to make life simpler for end-users. For these companies, a single Presto cluster is not enough. Several requirements force data engineering teams to create multiple clusters.
Use cases of multiple clusters
- Different workloads: We see primarily two kinds of workloads: Batch and Interactive. Even within Batch, there are ETL jobs, dashboard queries, periodic reports, etc. For ad hoc queries, the requirements might be specific to the user and the use case. A data scientist's data set and cluster capability requirements might be completely different from an analyst's.
- Isolation between teams: Many teams have their own budgeting constraints and work on different data sets. They like to have their own analytics setup and manage it according to their application lifecycle.
- High Availability: For production workloads, organizations keep warm clusters ready so that a failure does not cause downtime.
- Data Sources at different locations (On-premises/multiple clouds): As apps are used across geographies, the data is stored in different locations. Sometimes the apps run in different clouds and store data local to that cloud. To run an analysis on a set of data, organizations like to spin up the compute clusters in the same region/zone and cloud.
- Upgrades/Fixes/Cluster management: Data engineering teams have their own schedule of upgrades/fixes/maintenance. During this period they manage an extra set of clusters to reduce the downtime.
We have met organizations managing anywhere between 5 and 60 clusters across different data centers or even different clouds. Managing so many Presto clusters has its own challenges. We have seen some or all of the above requirements in practice and have studied how big data-driven organizations have created tools to meet some of them.
At least 4 different companies have talked about their problems (links below) in managing multiple Presto clusters. They describe in detail their own homegrown solutions, each of which is some kind of proxy/gateway that manages these clusters and provides a consistent, single pane of glass to analysts/data scientists.
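To make the proxy/gateway idea concrete, here is a minimal sketch of the routing decision such a layer might make. It is not taken from any of the companies' implementations; the `Cluster` fields, the `pick_cluster` function, and the example URLs are all hypothetical, chosen only to mirror the use cases listed above (workload type, data locality, and warm standbys).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Cluster:
    """Hypothetical description of one Presto cluster behind the gateway."""
    name: str
    url: str
    workload: str          # "batch" or "interactive"
    region: str
    healthy: bool = True   # assumed to be maintained by some health check


def pick_cluster(clusters: Sequence[Cluster], workload: str,
                 region: Optional[str] = None) -> Cluster:
    """Pick a healthy cluster for the workload, preferring the given region."""
    candidates = [c for c in clusters if c.healthy and c.workload == workload]
    if not candidates:
        raise LookupError(f"no healthy cluster for workload {workload!r}")
    # Prefer a cluster co-located with the data; otherwise use any match.
    local = [c for c in candidates if c.region == region]
    return (local or candidates)[0]


# Illustrative inventory; names and URLs are made up.
CLUSTERS = [
    Cluster("etl-us", "http://presto-etl-us.example:8080", "batch", "us-east"),
    Cluster("adhoc-us", "http://presto-adhoc-us.example:8080", "interactive", "us-east"),
    Cluster("adhoc-eu", "http://presto-adhoc-eu.example:8080", "interactive", "eu-west"),
    Cluster("adhoc-eu-warm", "http://presto-adhoc-eu-warm.example:8080", "interactive", "eu-west"),
]

if __name__ == "__main__":
    target = pick_cluster(CLUSTERS, "interactive", region="eu-west")
    print(f"route query to {target.name} at {target.url}")
```

Users would talk to a single endpoint, and a rule like this would decide which cluster actually runs the query. If the primary cluster in a region is marked unhealthy, the warm standby is picked without the user changing anything.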
We at Falarica have developed and open-sourced PrestoGateway, keeping in mind the generic use cases across this spectrum. We have also kept in mind the capabilities that any organization must have in order to use it in a production environment.
In the next blog, we talk about the architecture of PrestoGateway and how it can help its users by providing a single pane of glass for interactive queries. We will also talk about how our solution differs from the gateways built by larger companies.
Falarica’s Presto Gateway: