Which programming language to choose? Part 4: Spark
We continue our series of articles about the programming languages used at Kryptonite. In this part, Mykhailo Kuznetsov, a lead engineer in our development department, talks about Spark's features and applications, its pros and cons, and the nuances of learning the framework, and shares ideas for a pet project along with useful links.
The previous parts covered Rust, Scala, and JavaScript.
1. Features
To begin with, Spark is a framework written entirely in Scala, but it also provides high-level APIs for Python, Java, and other languages. Although Java and Scala are equal in speed and functionality, the framework is still much more convenient and concise to use from Scala. In other languages, functionality and processing speed may be reduced.
Spark was built to process data. There is the so-called ETL process – extract, transform, load – and Spark covers all three of these tasks. It takes data from a source, transforms it (cleans, enriches, joins it …), and delivers the result to a sink, for example a database. In other words, it performs data processing.
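To make that concrete, here is a minimal sketch of such a take-transform-save job in Scala; the path, column names, table name, and connection settings are invented purely for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SimpleEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("simple-etl")
      .getOrCreate()

    // Extract: read raw events from a source (hypothetical path)
    val events = spark.read.json("s3a://raw-bucket/events/")

    // Transform: clean and enrich
    val cleaned = events
      .filter(col("user_id").isNotNull)
      .withColumn("event_date", to_date(col("ts")))

    // Load: deliver the result to a sink, here a relational database over JDBC
    cleaned.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/analytics") // hypothetical connection
      .option("dbtable", "events_clean")
      .option("user", "etl")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .mode("append")
      .save()

    spark.stop()
  }
}
```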
What makes Spark special is that there is essentially no other framework like it. Flink, Apache NiFi, and other tools do not cover the whole range of tasks. My personal opinion is that Spark is head and shoulders above them all. It is unique, widespread, and effectively the industry standard for big data processing.
2. Where it is used
Spark is used wherever you need to process data in volumes too large to fit on a single server. Almost anything qualifies: data from various sensors (IoT), advertising data (including clickstreams), weather analytics, healthcare, banking, analysis of satellite signals … in short, any domain with large amounts of data.
Spark is for you if you like to dig into data. I love it 😊 For example, at Kryptonite I design dataflows and, based on them, develop Spark jobs, NiFi pipelines, and Airflow DAGs. I also write custom components for these tools and the YAML configs the pipelines need, research new data tools, and write PoCs.
3. Pros and cons
Pros:
1. A huge number of built-in sources to read data from.
2. A plugin system. By defining our own data source we can, for example, implement processing for a particular file format if nobody has done it before us.
3. A fairly simple API, plus a user-friendly and informative UI. You can click around and see where the bytes are going, what the current stage is, how much has already been processed, and so on.
4. Rich data transformation capabilities: aggregation, cleaning, applying machine learning models. You can compute whatever you want, and quite quickly (see the aggregation sketch after this list).
5. Batch processing. For example, if a lot of data accumulates during the day, Spark can automatically process it at night.
6. Spark Streaming. Stream processing lets a business receive data aggregates as close to real time as possible.
7. A large set of both built-in and custom connectors for various sources.
8. The volume of data you can handle depends only on the resources of the cluster, and horizontal scaling is always more convenient and cheaper than vertical scaling.
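To give a feel for point 4, here is a small sketch of an aggregation over a hypothetical clickstream dataset; the path and column names are invented for the example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("clickstream-agg").getOrCreate()
import spark.implicits._

// Hypothetical clickstream with columns: user_id, url, duration_ms, ts
val clicks = spark.read.parquet("s3a://ads/clickstream/")

// Aggregation: clicks per user per day and the average time spent on a page
val daily = clicks
  .withColumn("day", to_date($"ts"))
  .groupBy($"user_id", $"day")
  .agg(
    count(lit(1)).as("clicks"),
    avg($"duration_ms").as("avg_duration_ms")
  )

daily.show(10)
```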
Cons:
Since Spark is a product of the JVM ecosystem, we get all the disadvantages inherent to it.
1. If you specify too little memory in the instance parameters for processing a particular dataset, you will get an OutOfMemoryError (OOM). Memory consumption on the executors when processing large datasets is huge, so a lot of resources are required (RAM and fast SSDs).
2. When it comes to streaming, keep in mind that in Spark it is micro-batching and works differently than in Flink, where each event is processed as soon as it “arrives”. Here the logic is different: one event “arrives”, then a second, a third… They accumulate and are processed together as the next micro-batch when the trigger fires, rather than immediately one by one. Latency suffers as a result (see the streaming sketch after this list).
3. Latency also suffers because of garbage collection: nobody knows when the GC will decide to run. These are the so-called stop-the-world pauses, when all processing stops.
4. I’m not sure this is really a minus, but still: you will have to retrain your thinking for distributed processing. When we write Spark code, part of it runs on the driver and part is shipped to the executors. You need a good understanding of which parts go to the executors and how they will behave. This determines whether the job will run at all, how many resources it will consume, and how long it will take… But this is a skill that can be developed; it comes with experience.
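As an illustration of the micro-batch behavior from point 2, here is a minimal Structured Streaming sketch; it assumes the spark-sql-kafka connector is on the classpath, and the broker address and topic name are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("microbatch-demo").getOrCreate()

// Source: a Kafka topic (broker and topic names are placeholders)
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// Events are not handled one by one: everything that has arrived since the
// previous micro-batch is processed together when the trigger fires.
val query = stream
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds")) // a micro-batch every 10 seconds
  .start()

query.awaitTermination()
```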
4. Communities
Spark has a large community, both Russian- and English-speaking. You will definitely not be left without help.
Telegram has many channels dedicated to data processing, in particular there is a group specifically about Spark (the first in the list below):
Among English-language resources, I would recommend the “Databricks” YouTube channel. It is run by the people who are improving Spark, and it offers talks and videos with the kind of insider detail you won’t find anywhere else. If you want to look “under the hood”, this channel is just right.
5. Training
It is quite possible to master the basics of Spark on your own. Instead of an introductory course, I would ask management for a week or two to read articles on Medium (now accessible only via VPN) and Habr, watch video courses on YouTube, and, if there is enough time, read the O’Reilly books on Spark and work through their examples.
I believe most of the heavily advertised Spark courses will not teach you anything properly. The bulk of their training material focuses on data processing in Python, because everything is much simpler there. When it comes to Scala, good materials are a real problem. I know of literally a couple of solid specialized courses that let you get under the hood specifically with Scala, for example: https://newprolab.com/ and https://rockthejvm.com/.
The entry point to Spark is basic knowledge of Scala in its functional form: collections and the operations on them, such as map, flatMap, and foreach. Implicits may come in handy here and there. Many people are afraid of the language and its mix of paradigms, but the truly complex parts of Scala start with asynchrony, and that is rarely needed in Spark (it may come up, for example, in foreachBatch in streaming). So the “rocky” part you will have to explore is small and friendly.
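For a sense of what that functional entry level means, here is a tiny plain-Scala example using the same map / flatMap / foreach vocabulary you will later apply to Spark Datasets (the data is made up):

```scala
// Plain Scala collections: the same map / flatMap / foreach vocabulary
// carries over to Spark's Dataset API.
val lines = List("spark scala", "big data", "spark streaming")

val words   = lines.flatMap(_.split(" "))   // List(spark, scala, big, data, spark, streaming)
val lengths = words.map(w => (w, w.length)) // each word paired with its length
val counts  = words.groupBy(identity).map { case (w, ws) => w -> ws.size }

counts.foreach { case (word, n) => println(s"$word -> $n") }
```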
In addition to knowing Spark itself, it is important to understand conceptually how other tools work and what class of tasks they solve: for example, how Flink streaming differs from Spark Streaming, or why you need Kafka or the various types of NoSQL databases (Cassandra, Neo4j, etc.).
6. Pet projects
In data engineering, the most classic and basic option for a pet project is an ETL process: we take data from somewhere, process it in some way, and store it somewhere. This can be pushed to insane complexity. The data can come from simple sources (for example, a CSV file) or from Kafka in Avro or Protobuf format. The main thing is to work out the logic for yourself and end up with a complete “take-transform-save” process.
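As a hedged sketch of the harder variant, this is roughly what a batch read of Avro-encoded messages from Kafka might look like; it assumes the external spark-avro and spark-sql-kafka packages, and the schema, topic, and broker are invented:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.from_avro // external spark-avro module

val spark = SparkSession.builder().appName("kafka-avro-pet-project").getOrCreate()
import spark.implicits._

// Invented Avro schema for the message payload
val schema =
  """{"type":"record","name":"Click","fields":[
    |  {"name":"user_id","type":"string"},
    |  {"name":"url","type":"string"},
    |  {"name":"ts","type":"long"}
    |]}""".stripMargin

// Batch read from Kafka: the value column holds the Avro-encoded bytes
// (assumes plain Avro payloads, without a schema-registry wire-format header)
val raw = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "clicks")
  .option("startingOffsets", "earliest")
  .load()

val clicks = raw
  .select(from_avro($"value", schema).as("click"))
  .select("click.*")
```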
There are quite a few areas in data engineering, and you can always choose what you would like to do. You can mostly write Spark code, or you can mostly write SQL. Say, with Spark we take something from Kafka and put it into ClickHouse, and there we “pull” something back out with SQL. That is a pipeline which demonstrates your ability to work with three tools. On top of it you can add a scheduler like Airflow, or bolt on some kind of dashboard.
For example, I once dug deep under the hood of Spark and wrote my own data source, and also wrote native UDFs. These are projects developers are unlikely to encounter at work; I was simply interested in understanding the internals, and it matters to me to know how every cog in a tool works.
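Continuing the sketch above, the “put it into ClickHouse” step could look roughly like this over JDBC; the URL, table name, and driver class depend on your setup and are only assumptions here:

```scala
// Write the decoded rows into ClickHouse over JDBC.
// A ClickHouse JDBC driver must be on the classpath; URL, table, and
// driver class below are placeholders for illustration.
clicks.write
  .format("jdbc")
  .option("url", "jdbc:clickhouse://clickhouse-host:8123/default")
  .option("dbtable", "clicks_raw")
  .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
  .mode("append")
  .save()
```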
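Writing truly native (Catalyst-level) expressions is beyond a blog snippet, but for contrast, here is what an ordinary Scala UDF looks like; the masking logic and column name are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("udf-demo").getOrCreate()
import spark.implicits._

// An ordinary Scala UDF: a plain function wrapped for use on DataFrame columns.
// Native Catalyst expressions go deeper than this, but a UDF is the usual start.
val maskEmail = udf((email: String) =>
  if (email == null) null
  else email.replaceAll("(^.).*(@.*$)", "$1***$2")
)

val users = Seq("alice@example.com", "bob@example.com").toDF("email")
users.select(maskEmail($"email").as("masked")).show(false)
```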