Falarica's founding members have been part of the product ideation and core execution teams of several well-conceptualized and commercially successful data products. For the past 15 years or so we have been incessantly thinking about data and solving the challenges of data in distributed systems: rapidly increasing volumes, increasing concurrency and ever-decreasing latency needs, layered on top of the good old ACID requirements of data management and querying systems.
We have been at the centre of the action not only in product inception, design and development, but also in helping our customers build data solutions with our products and seeing them through to stable production deployments of critical systems.
Data Products we were part of
Notable products on which we have worked are Apache Geode (erstwhile GemFire), a distributed, memory-optimized data management system; SnappyData, a unified data platform that leveraged Apache Spark and Apache Geode to make an integrated platform capable of running all kinds of data processing workloads, viz. transactional, analytical, streaming and batch; and GemFireXD, a sibling of GemFire, which was essentially a distributed in-memory relational database supporting standard ANSI SQL.
Along the journey, we have been fortunate to work with the thought leaders of the space and a brilliant set of clients and colleagues. Interacting with them was deeply enriching. A big thank you to them.
Evolving Ecosystem around Data Processing
While we were engrossed in our mission, we were also all eyes and ears to the enormous changes happening in distributed computing and its shifting paradigms. Among the many immensely useful advancements, a few were game changers: the cloud becoming prevalent, Linux-container technologies like Docker and Kubernetes becoming popular, and the rise of disaggregated query processing engines such as Hive, Spark and Presto, where computation is separated from both metadata and data. The world had to stop, take notice and adapt to the new path shown by these landmark technologies, and we were no different.
New age Ad-Hoc Analytics – Our introduction to analytics space
Till 2013 or so, our area of concern was primarily transactional systems capable of fast, primary-key-based reads and writes of targeted data. While the system could run analytical queries too, it was not good enough to serve them in production environments. We started work on our own home-grown solution, whose design resembled parallel processing architectures. Around the same time, Apache Spark had started turning a lot of heads with its analytical query processing capabilities. We stopped, took a thorough look at Spark's internal design and realised that we could pull the Spark engine inside Apache Geode (GemFire/GemFireXD). We augmented and modified Spark internals to complement a transactional store with analytical capabilities and created a commercially successful product, SnappyData.
While Spark was making great strides in this regard, another system, Presto, came to the fore at around the same time. It was developed at Facebook and open sourced in 2013, and since then its adoption for handling ad-hoc analytic queries has gained massive momentum. The core designs of the two engines can be compared and contrasted in terms of execution planning, scheduling, resource sharing and workload isolation (stay tuned for a qualitative comparison soon), but essentially both engines are comparable and powerful. Presto, in our opinion, is better suited to an ad-hoc query service, owing to its better handling of concurrency, workload isolation, result streaming and security.
New Paradigms, New Challenges, New Opportunities
The main difference between this class of tools (Presto, Spark and the like) and existing solutions such as MPP databases and data warehouses was the separation of concerns between storage and processing. The newer tools are pure computational engines capable of connecting to any data source and running fast analytics on it. As data volumes kept increasing, this approach to data processing became the default choice. However, along with this welcome paradigm shift came a lot of complexity: it takes significant effort and large engineering teams not only to design such systems but also to keep them up and running.
Now add multiple clouds and multiple data centres, and the complexity climbs quite a few notches higher.
The three of us realized this and thought of bringing down this complexity. But first it was necessary to validate the challenges with real enterprises. We got in touch with multiple people from the industry to understand the challenges they were encountering with their ad-hoc analytics requirements and to vet our understanding. The problem definition not only resonated with them but was sharpened by those interactions, and we gained new insights: enterprise-wide resource sharing, workload isolation and cost were very real problems. This triggered the idea of coming together and building something useful. The result is Falarica Query Platform.
Ad-Hoc Analytics Challenges Today
Some of the main challenges that enterprises face in today's setup:
- Multiple Clusters – Enterprises are required to set up multiple clusters for various reasons: teams needing their own clusters, reducing operational failures, etc. Multiple clusters may solve a lot of problems, but they come with their own set of problems.
- Scattered Data – Data is present in various places and in various sources. Even cross-premises data needs to be correlated, which is a huge challenge.
- Under-utilized on-prem resources – Many organizations have a huge amount of on-prem resources, so a hybrid solution is almost a no-brainer.
- Managed query services on the clouds are not pocket friendly – The cost becomes huge because charges are tied to the amount of data scanned, which is ever-increasing and never predictable, so the budget can shoot up rapidly.
- Auto Scaling and Workload Isolation – Both the number and the types of queries run in ad-hoc analysis can vary greatly over the course of a day. The ability to scale automatically is a must, and sharing clusters should not let one team's workload affect another's.
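The cost point above is easy to see with some back-of-the-envelope arithmetic. The sketch below uses a purely illustrative per-terabyte price (not any specific cloud's rate): with scan-based pricing, the same daily query load gets more expensive simply because the scanned tables grow.

```python
# Illustrative only: a scan-priced managed query service charges per byte
# scanned, so the monthly bill tracks data growth, not query count.
PRICE_PER_TB = 5.0  # hypothetical $/TB scanned, for illustration


def monthly_cost(tb_scanned_per_query, queries_per_day, days=30):
    """Estimated monthly bill for a scan-priced query service."""
    return tb_scanned_per_query * queries_per_day * days * PRICE_PER_TB


# The same 200 daily queries triple in cost as the tables triple in size:
cost_now = monthly_cost(tb_scanned_per_query=0.5, queries_per_day=200)        # 15000.0
cost_after_growth = monthly_cost(tb_scanned_per_query=1.5, queries_per_day=200)  # 45000.0
```

Because the growth of scanned data is hard to predict, so is the bill — which is exactly what makes budgeting difficult.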
Falarica Query Platform – Simplified analytics at reduced cost
Falarica Query Platform, Falarica's analytics platform, is our answer to the problems above: essentially, steering workloads effectively. It is powered by Presto and Kubernetes. With Falarica Query Platform, we aim to provide a platform that greatly reduces all the above complexities for both admins and end users, with reduced cost and always-available systems. It can be used for on-prem, hybrid and multi-cloud deployments. It offers a single pane of glass for end users; a single management and configuration point for admins to manage clusters and catalogs; a console to define canned and custom routing and scheduling policies; and dynamic, tunable cloud-bursting policies, orchestrated from a single point to keep cloud usage in check. Some of our immediate areas of work are solving the challenges of scattered data, making the platform flexible enough to meet data security and compliance needs, and using machine learning to automatically choose the best cluster. We also intend to contribute back to Presto.
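To make the cloud-bursting idea concrete, here is a minimal sketch of the kind of tunable policy described above. All names and the threshold are hypothetical, not the platform's actual API: queries stay on-prem until utilization crosses a configurable threshold, after which they overflow to a cloud cluster.

```python
# Hypothetical sketch of a tunable cloud-bursting policy: prefer the
# on-prem cluster, burst to the cloud only when on-prem is saturated.
def pick_cluster(onprem_utilization: float, burst_threshold: float = 0.8) -> str:
    """Return the target cluster name for a new query.

    onprem_utilization: current on-prem load as a fraction in [0, 1].
    burst_threshold: tunable knob; above it, queries burst to the cloud.
    """
    if onprem_utilization < burst_threshold:
        return "onprem-presto"  # keep spend down by using owned hardware
    return "cloud-presto"       # burst: pay for elasticity only when needed

# At 50% load the query stays on-prem; at 90% it bursts to the cloud.
print(pick_cluster(0.5), pick_cluster(0.9))
```

Because the threshold is just a parameter, an admin can tighten or loosen bursting from a single control point — which is the essence of checking cloud usage centrally.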
Apart from Falarica Query Platform, Falarica currently has three more offerings:
- Falarica Presto-Platform – An enterprise-grade Presto platform on Kubernetes using Falarica's Presto Operator, for both on-prem and cloud. We help customers adopt and deploy the platform, and also provide support and consulting around it. This is a licensed offering.
- Presto Gateway – Presto Gateway is the precursor to our flagship offering, Falarica Query Platform, but it is fully open source. It is a working, well-tested gateway capable of routing queries to multiple Presto clusters behind it, based on policies that can be configured and even changed dynamically, and it is fully secured. It is the foundation on which Falarica Query Platform is built. We open sourced this part so the community can take advantage of it directly, or indirectly by building something cool on top of it. Falarica Query Platform itself embodies our own idea of a complete platform, which will evolve over time.
- Presto Operator – An open source Presto Operator for Kubernetes with horizontal pod autoscaling (HPA).
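The policy-based routing that Presto Gateway performs can be sketched in a few lines. This is an illustrative model, not the gateway's actual code or API: each rule pairs a predicate over query metadata with a target cluster, the first matching rule wins, and the rule list can be swapped at runtime — which is what makes the policies dynamically changeable.

```python
# Illustrative sketch of policy-based query routing across Presto clusters.
# All names (Rule, Router, cluster URLs) are hypothetical.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    matches: Callable[[Dict], bool]  # predicate over query metadata
    cluster: str                     # target Presto coordinator URL


class Router:
    def __init__(self, rules: List[Rule], default: str):
        self.rules = rules      # mutable, so policies can change at runtime
        self.default = default  # fallback cluster when no rule matches

    def route(self, query: Dict) -> str:
        for rule in self.rules:  # first matching rule wins
            if rule.matches(query):
                return rule.cluster
        return self.default


router = Router(
    rules=[
        # ETL team gets a dedicated cluster (workload isolation);
        # interactive queries go to a low-latency ad-hoc cluster.
        Rule(lambda q: q.get("team") == "etl", "http://etl-presto:8080"),
        Rule(lambda q: q.get("interactive"), "http://adhoc-presto:8080"),
    ],
    default="http://shared-presto:8080",
)
```

Routing on metadata rather than query text keeps the gateway cheap per query, and isolating teams onto separate clusters is one way the multi-cluster and workload-isolation concerns above get addressed.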