Understanding How to Make Spark Fly: The Things We Learned While Building a Spark-driven App

Spark is often a great choice for large-scale, complex data manipulation workflows, especially independent linear workflows. Unfortunately, self-service data preparation does not fit that usage pattern: analysts work collaboratively and experiment with many different options. The experimental nature of data preparation produces a series of intermediate transformation results that can be rolled back at any moment. Because Spark does not store intermediate results by default, a rollback forces Spark to recompute costly operations, which is resource intensive, expensive, and slow. Additionally, Spark does not natively support sharing RDDs between users.

Paxata built a columnar data preparation engine on Spark, geared toward interactive, collaborative transformations of massive data sets. To improve the performance of these costly rollback operations and to share RDDs between users, we enhanced Spark with a distributed cache of intermediate results. This caching strategy optimizes large-scale data processing specifically for sub-second interactivity, multi-user Spark collaboration, and data manipulation rollback.
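The idea of caching intermediate results can be illustrated with a small, single-process sketch in plain Python. This is not Paxata's implementation and does not use Spark; the `StepCache` class, the step names, and the sample rows are all hypothetical. It assumes that a step name uniquely identifies a transformation and that every pipeline reads the same source. Each prefix of the pipeline is cached, so rolling back the last step returns a stored result instead of recomputing from the source.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[dict]
Step = Tuple[str, Callable[[Rows], Rows]]


class StepCache:
    """Hypothetical cache of intermediate results, keyed by step-name prefix."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[str, ...], Rows] = {}
        self.computed = 0  # number of steps actually executed so far

    def evaluate(self, source: Rows, steps: List[Step]) -> Rows:
        names = tuple(name for name, _ in steps)
        start, rows = 0, source
        # Find the longest prefix of the pipeline that is already cached.
        for n in range(len(steps), 0, -1):
            if names[:n] in self._results:
                start, rows = n, self._results[names[:n]]
                break
        # Execute only the remaining steps, caching each intermediate result.
        for i in range(start, len(steps)):
            rows = steps[i][1](rows)
            self.computed += 1
            self._results[names[: i + 1]] = rows
        return rows


# Illustrative data and transformations (assumptions, not from the talk).
source = [{"city": " Oslo ", "sales": 10}, {"city": "Rome", "sales": -1}]
steps: List[Step] = [
    ("trim", lambda rs: [{**r, "city": r["city"].strip()} for r in rs]),
    ("drop_negative", lambda rs: [r for r in rs if r["sales"] >= 0]),
    ("upper", lambda rs: [{**r, "city": r["city"].upper()} for r in rs]),
]

cache = StepCache()
full = cache.evaluate(source, steps)             # executes all three steps
rolled_back = cache.evaluate(source, steps[:2])  # undoes "upper" from the cache
```

Because the cache is keyed by the pipeline prefix rather than by user, a second analyst who applies the same first steps would reuse the same stored results. A distributed version would also have to handle eviction, invalidation when the source changes, and data placement across the cluster, none of which this sketch attempts.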

In this talk we will share our experiences with the following:

Computations on a columnar data preparation engine with an enhanced Spark cluster in large-scale production systems
Sub-second response time and interactivity at billion-row scale
Multi-user Spark collaboration experience

From product roadmap to release, Lilia has been instrumental in bringing the first self-service data preparation platform to market. Lilia's passion for solving customer challenges and her love for data have been consistent themes in her career. Prior to joining Paxata, Lilia was a product manager at Socrata and a program manager at Microsoft.

Shachar Harussi is a Principal Distributed Data Engineer at Paxata, where he applies his experience in distributed computing, database and indexing algorithms, networks, microarchitecture, and performance tuning to tune and scale Paxata’s enterprise-grade self-service data preparation application. Shachar has a long track record of both industrial and academic work. He holds a PhD in CS, with research interests in database theory, compression, and automata theory. Prior to Paxata, Shachar was ScaleDB’s Chief Architect.