The serverless experience of a managed cloud query service like AWS Athena or Google BigQuery is awesome. End users do not have to worry about provisioning at all. They simply fire queries and quickly get back results.
Behind the scenes, Athena does all the magic of deciding things like where to run the query, what to provision for it, and so on. It just returns the result, and end users spend their time analyzing it.
It’s great, but there are still some gaping holes here:
- The cost of managed query services – $5/TB scanned proves to be really dear when you have multiple Data Scientists and Analysts using the system and they fire a lot of queries. The AWS bill can balloon in no time.
The figure above compares the Athena / BigQuery cost with the cost of the machines that would be required to deliver the same performance for the queries. We have assumed 100 GB of data scanned per query.
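To make the comparison concrete, here is a back-of-the-envelope sketch of how the two cost curves diverge. The workload size, node count, and hourly machine rate below are illustrative assumptions, not quoted prices; only the $5/TB scan price and the 100 GB/query figure come from the text above.

```python
# Back-of-the-envelope: pay-per-scan service cost vs. self-managed cluster cost.
# PRICE_PER_TB_SCANNED and GB_SCANNED_PER_QUERY come from the article;
# everything else (query volume, node count, hourly rate) is an assumption.

PRICE_PER_TB_SCANNED = 5.0   # $/TB scanned (Athena / BigQuery on-demand)
GB_SCANNED_PER_QUERY = 100   # average scan per query, as assumed in the figure

def managed_cost(queries_per_day: int, days: int = 30) -> float:
    """Monthly cost of a pay-per-scan query service."""
    tb_scanned = queries_per_day * days * GB_SCANNED_PER_QUERY / 1024
    return tb_scanned * PRICE_PER_TB_SCANNED

def machine_cost(nodes: int, hourly_rate: float = 0.68, days: int = 30) -> float:
    """Monthly cost of an always-on cluster (hypothetical per-node $/hour)."""
    return nodes * hourly_rate * 24 * days

# e.g. 10 analysts firing 50 queries each per day, vs. a 5-node cluster:
print(f"managed: ${managed_cost(500):,.0f}/month")
print(f"cluster: ${machine_cost(5):,.0f}/month")
```

With these assumed numbers the pay-per-scan bill comes out roughly three times the cluster bill, and the gap widens linearly with query volume while the cluster cost stays flat.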
- These solutions are still cloud-specific – What if the end user needs to run a query on data that lives in another cloud? Pull all the data over? We can easily agree that is not a solution. Copy the data first? Same problem. Where is the interactivity in either? Or do you switch to another platform, the query service of the second cloud, to query that data?
All of the above is a hassle and falls short of the desired experience. Moreover, today you need a multi-cloud solution not only for your operational data but also for analytics, if you want to stay sharp and agile based on what your data indicates. The recent AWS outage was yet another argument for a real multi-cloud strategy.
Our point is, why restrict this experience to a single cloud?
Why not have the same experience even while your data is scattered across locations: different clouds, on-prem, or a mix of all?
Why not have a single endpoint where end users fire queries without bothering about data location, cloud provider, or private data centre? Why, even in this setup, can't they be freed from worrying about provisioning the right cluster capacities, failure handling, and the rest?
Behind the scenes, why can't the system figure out the right location of the data and take your query there?
Behind the scenes, why can't the system figure out a strategy to use on-premise capacity judiciously and automatically burst the load to the cloud at peak demand? Why should there be any manual intervention or back-and-forth between DevOps/IT and end users?
The Falarica Query Platform, powered by Presto and Kubernetes, makes exactly these scenarios possible and gives you a close-to-managed experience that spans clouds and data centres.