The year 2020 has unexpectedly given many of us an opportunity to brush up our skills and study the subjects we are passionate about. Once the IT job market reopens after the COVID-19 disruption, demand for Big Data professionals is likely to be higher than ever. If you want to be part of the big data analytics world, the Apache Spark interview questions below give a sense of how rigorous the interviews can be.
The questions below can be your passport to become the next Big Data professional.
Listed below is our compilation of the top 50 Apache Spark interview questions that can be asked in a standard interview while shortlisting candidates for a Big Data professional job.
Apache Spark Interview Questions and Answers – Difficulty Level – 1:
1. What is the difference between Hadoop and Spark?
Spark is not built on top of Hadoop, although it can run on Hadoop clusters and read data from HDFS. The fairest comparison is between Hadoop’s MapReduce, its data processing and analysis engine, and Apache Spark.
| Criterion | Hadoop MapReduce | Apache Spark |
|---|---|---|
| Speed | Average speed | Up to 100x faster processing of data |
| Difficulty level | Complicated to learn | Fairly easy to learn |
| Processing of data | Batch processing | Real-time and batch processing |
| Recovery | Fault-tolerant | Recovery of lost partitions is allowed |
| Interactivity | No interactivity except in Pig and Hive | Interactive modes |
2. Explain Apache Spark.
Apache Spark is a free, open-source cluster computing framework capable of processing data in near real time. It provides features such as parallel processing and fault tolerance, and it is one of the most widely used engines for big data processing.
3. What benefits does Spark have over Map Reduce?
Spark has the following advantages over Hadoop’s MapReduce:
- Spark processes data in memory, which can make it up to 100 times faster than MapReduce, which persists intermediate data to storage between steps.
- Spark ships libraries for batch processing, streaming, machine learning and SQL queries on the same core engine, whereas MapReduce supports only batch processing.
- MapReduce depends on data being read from and written to disk, whereas Spark keeps working data in memory where possible.
- Spark handles iterative computation over the same dataset efficiently. MapReduce has no built-in support for iteration, so each pass runs as a separate job.
4. List the key features of Apache Spark.
Apache Spark has the following key features:
- Lazy Evaluation
- Real-time computing
- Integration with Hadoop
- Multi-format support
5. What programming languages does Spark support?
Spark provides APIs in the following programming languages:
- Scala
- Java
- Python
- R
6. Explain YARN.
YARN stands for Yet Another Resource Negotiator. It is Hadoop’s cluster resource manager, a distributed container manager comparable to Mesos. Just as Hadoop’s MapReduce runs on YARN for data processing, Spark can run its data processing on YARN.
7. Why should one learn MapReduce if Spark is better?
Spark and many other Big Data tools borrow from Hadoop’s MapReduce paradigm, so understanding it remains useful. MapReduce is still a relevant model for processing data as it grows very large. Tools such as Hive and Pig also convert their queries into MapReduce jobs.
8. Is it mandatory to install Spark on all nodes of the YARN cluster?
No. Spark runs on top of YARN, so it does not need to be installed on every node of the cluster. Installing Spark on one node, from which the jobs are submitted, is enough to get the job done.
9. What are the components of Spark?
The components of Spark are:
- Spark Core
- Spark SQL
- Spark Streaming
- MLlib
- GraphX
10. What is an RDD in Spark?
RDD stands for Resilient Distributed Dataset. RDDs are the fundamental data structure in Spark: immutable, fault-tolerant collections whose data is partitioned and distributed among the cluster nodes.
An RDD is typically created by parallelizing an existing collection or by referencing an external dataset, and new RDDs are derived from existing ones through transformations. Transformations on RDDs are evaluated lazily, which lets Spark avoid unnecessary work and contributes to its processing speed.
11. How to create RDDs in Spark?
In Spark, there are three ways of creating RDDs:
- Parallelizing an existing collection using the parallelize() method.
- Referencing an external dataset.
- Creating an RDD from an existing RDD.
12. What operations does RDD support in Spark?
RDDs support two kinds of operations:
- Transformations: Because RDDs are immutable, an existing RDD cannot be changed. A transformation instead creates a new RDD from an existing one.
- Actions: These return the end results of the RDD computations to the driver program.
13. What is a pair RDD?
A pair RDD is an RDD whose elements are key-value pairs. It supports key-based operations such as reduceByKey() and join().
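To make the key-value idea concrete, here is a minimal plain-Python sketch of what a key-based reduction does to a collection of pairs. It does not use Spark; `reduce_by_key` is a hypothetical helper written for illustration, and in a real pair RDD the same grouping would happen across partitions on the cluster.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Combine all values that share a key using func (illustrative helper, not a Spark API)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Fold the values of each key into a single result.
    return {key: reduce(func, values) for key, values in grouped.items()}


# Each element is a (key, value) pair, as in a pair RDD.
word_pairs = [("spark", 1), ("hadoop", 1), ("spark", 1)]
word_counts = reduce_by_key(word_pairs, lambda a, b: a + b)
```

The result maps each distinct key to the combined value for that key, which is the shape of output a word count over a pair RDD would produce.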
14. What is RDD Lineage?
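RDD lineage is the graph of an RDD and all of its parent RDDs, together with the transformations that connect them. Spark uses this graph to recompute lost partitions.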
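The lineage idea can be sketched in plain Python with a toy class that remembers its parent and the step that produced it. `MiniRDD` and its methods are hypothetical names for illustration only; Spark's real lineage graph is tracked internally and is considerably richer.

```python
class MiniRDD:
    """Toy stand-in for an RDD that records its parent and the step that produced it."""

    def __init__(self, source=None, parent=None, func=None, name="source"):
        self.source = source
        self.parent = parent
        self.func = func
        self.name = name

    def map(self, func):
        return MiniRDD(parent=self, func=lambda data: [func(x) for x in data], name="map")

    def filter(self, pred):
        return MiniRDD(parent=self, func=lambda data: [x for x in data if pred(x)], name="filter")

    def lineage(self):
        """Walk the parent chain from the source down to this dataset."""
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

    def compute(self):
        """Rebuild the data from scratch by replaying the lineage."""
        if self.parent is None:
            return list(self.source)
        return self.func(self.parent.compute())


numbers = MiniRDD(source=[1, 2, 3, 4])
result_rdd = numbers.map(lambda x: x * 2).filter(lambda x: x % 4 == 0)
```

Because every derived dataset knows how it was produced, its contents can always be rebuilt by calling `compute()` again, which is the property that lets lost data be recovered without replicating it.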
Spark Interview Questions and Answers – Difficulty Level – 2:
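15. Explain Executor Memory in a Spark application.
Each Spark application has its own executors running on the worker nodes. Executor memory is the amount of memory allocated to each executor process of that application. It is controlled by the spark.executor.memory property or the --executor-memory flag of spark-submit.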
16. Explain partitions in Spark.
In Spark, every RDD is divided into partitions. Partitions are logical chunks of data distributed among the nodes of a cluster, and they are the basic units of parallelism in Spark.
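As a rough illustration of partitions as units of parallelism, the plain-Python sketch below splits a list into chunks and processes each chunk independently before combining the partial results. The helper `split_into_partitions` and the use of a thread pool are assumptions made for the example; Spark distributes partitions across executors on different machines rather than threads in one process.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split data into num_partitions contiguous chunks of nearly equal size."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions take one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


data = list(range(10))
partitions = split_into_partitions(data, 3)

# Each partition is processed independently, then the partial results are combined.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum, partitions))
total = sum(partial_sums)
```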
17. Explain Transformations in Spark.
We know that RDDs are immutable. Therefore, a function applied to an RDD cannot modify it in place and instead produces another RDD. This is called a transformation. map() and filter() are two examples of transformations on a Spark RDD.
18. What are the Actions in Spark?
Actions trigger the execution of all the previous transformations and bring the final data back to the driver program. Functions such as reduce() and take() are examples of actions in Spark.
19. What is Spark Core and what are its functions?
Spark Core is the base engine of Spark. It is responsible for tasks such as job dispatching, scheduling and input-output operations.
20. How does streaming take place in Spark?
Spark receives a live stream of data and divides it into batches. These batches are processed by the Spark engine, and the final stream of results is also produced in batches. The fundamental stream unit in Spark is the DStream, or Discretized Stream.
21. What API is used for graph processing in Spark?
The Spark API for graphs and graph-parallel computation is GraphX. It includes a growing collection of algorithms and builders that simplify graph analytics tasks.
22. Explain PageRank in GraphX.
PageRank is an algorithm that outputs a probability distribution over pages. Each page receives a value between 0 and 1 that represents the likelihood that a user randomly clicking on links will land on that page.
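The following is a minimal plain-Python sketch of the textbook, normalized form of PageRank computed by repeated iteration. It is not the GraphX implementation; the function name, the damping factor of 0.85, the iteration count and the three-page graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a graph given as {page: [pages it links to]}.

    Every link target is assumed to also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a base share from the "random jump".
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank over all pages.
            targets = outgoing if outgoing else pages
            share = damping * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

In this small graph page C collects links from both A and B, so it ends up with the highest rank, while the ranks of all pages together sum to 1.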
23. How does Machine Learning take place in Spark?
Spark has a scalable machine learning library called MLlib, which can be used for tasks such as clustering, dimensionality reduction and regression.
24. What is Spark SQL? How does it work?
Spark SQL is a module that integrates relational processing with Spark’s functional programming API. It enables data querying via SQL or Hive Query Language (HQL). It has four libraries:
- Data Source API
- Data Frame API
- Interpreter and Optimizer
- SQL Service.
25. What are the functions of Spark SQL?
The Spark SQL performs the following:
- It loads data from various structured data sources such as an RDBMS.
- It queries data using SQL statements, both from within a Spark program and from external tools such as Tableau that connect through JDBC/ODBC connectors.
- It provides integration between SQL and regular Python/Scala code.
26. How is Spark used with Hadoop?
- Spark is compatible with Hadoop and can run on top of Hadoop’s HDFS.
- It can optionally work alongside MapReduce for data processing, with Spark used for live processing and MapReduce for batch processing of data.
- Spark can run on YARN clusters.
27. What is a Parquet file?
Parquet is a columnar file format that is supported by many data processing systems and is widely regarded as one of the best formats for big data analytics.
28. Explain Spark Driver.
The Spark Driver is the program that runs on the master node and is responsible for declaring transformations and actions on RDDs. It also delivers the RDD graph to the master, where the cluster manager runs.
29. Which file systems does Spark support?
The different file systems supported by Spark include:
- Hadoop Distributed File System (HDFS)
- Local file system
- Amazon S3
30. What is Spark Executor?
When a SparkContext is created, executors are acquired on the worker nodes of the cluster. Spark executors are responsible for running computations and storing data on the worker node. They are also responsible for sending the results back to the driver.
31. Explain the type of Cluster Managers in Spark.
Spark has three cluster manager types:
- Standalone
- Apache Mesos
- Hadoop YARN
32. Explain the worker node.
A worker node is a slave node in the cluster. The master node assigns the job, and the worker nodes that hold the data perform the tasks via executors.
33. What are the disadvantages of Spark?
- Spark uses more storage space than Hadoop.
- Work has to be distributed over multiple nodes rather than run on a single node, which adds operational overhead.
- Spark’s in-memory processing can prove expensive for big data processing, since memory costs more than disk.
- Spark consumes more resources, memory in particular, than Hadoop.
34. What are the advantages of Spark over Hadoop?
- Spark is known for the real-time processing of data that can be used for Stock market analysis, banking, telecommunications, etc.
- Spark’s stream processing enables live streaming data analysis which can help in fraud detection, system alerts, etc.
- Spark can be 10 to 100 times faster at processing data on some workloads, mainly thanks to in-memory processing combined with lazy evaluation and parallel execution.
35. Explain the Sparse Vector.
A sparse vector stores only its non-zero entries to save space, using two parallel arrays: one holds the indices and the other holds the values.
For example, the dense vector [1., 0., 0., 0., 0., 0., 3.]
is stored as the sparse vector (7, [0, 6], [1., 3.]).
Here 7 is the size,
[0, 6] are the indices and
[1., 3.] are the values.
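The conversion between the two representations can be sketched in a few lines of plain Python. The helpers `to_sparse` and `to_dense` are hypothetical names for illustration and are not part of MLlib; they only mirror the (size, indices, values) layout described above.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, value in enumerate(dense) if value != 0.0]
    values = [dense[i] for i in indices]
    return (len(dense), indices, values)


def to_dense(size, indices, values):
    """Rebuild the full vector, filling every position not listed with 0.0."""
    dense = [0.0] * size
    for i, value in zip(indices, values):
        dense[i] = value
    return dense


dense = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0]
sparse = to_sparse(dense)
```

For this vector only two of the seven positions need to be stored, and the saving grows with the proportion of zeros.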
Apache Spark Interview Questions and Answers – Difficulty Level – 3:
36. Can we use Spark to work on the Cassandra database?
It is possible to have Spark work on Cassandra using the Cassandra connector.
37. Can Apache Spark run on Apache Mesos?
Yes. Just as Spark can run on YARN, it can run on clusters managed by Apache Mesos. Spark also has a standalone mode that needs no external resource manager. For a multi-node setup it can use its standalone cluster manager, YARN or Mesos.
38. What are broadcast variables?
There are two types of shared variables in Spark: accumulators and broadcast variables. Broadcast variables are read-only variables cached on the executors for local reference, instead of being shipped from the driver with every task.
39. What is the need for broadcast variables in Spark?
Broadcast variables are read-only variables that are cached in memory and used by the executors. They speed up processing because each executor can look up its local copy instead of contacting the driver for the data.
40. What are accumulators in Spark? Explain.
Accumulators are the other type of shared variable in Spark. They are used to implement counters and sums. One can create named or unnamed accumulators. They are numeric by default, but users can define new types of accumulators.
41. How does caching work in Apache Spark?
Caching an RDD speeds up computations that access the same RDD multiple times. In Spark Streaming, DStreams likewise allow users to cache or persist the stream’s data in memory.
Two functions are available: cache() keeps the data in memory, while persist(level) stores it according to the specified storage level.
Calling persist() without a level is equivalent to cache(), that is, the data is cached in memory. persist(level) stores the data at the specified storage level, for example in memory, on disk, in serialized form, or in off-heap memory.
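The benefit of caching can be illustrated with a toy plain-Python class that counts how often its data is actually computed. `CachedDataset` is a hypothetical stand-in written for this example and only models the in-memory case; it does not reflect how Spark manages storage levels.

```python
class CachedDataset:
    """Toy dataset that recomputes on every collect() unless cache() was called."""

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn
        self._use_cache = False
        self._cached = None
        self.compute_count = 0  # how many times the data was really computed

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached  # served from memory, no recomputation
        self.compute_count += 1
        result = self._compute_fn()
        if self._use_cache:
            self._cached = result
        return result


def expensive_squares():
    return [x * x for x in range(5)]


uncached = CachedDataset(expensive_squares)
uncached.collect()
uncached.collect()  # computed a second time

cached = CachedDataset(expensive_squares).cache()
cached.collect()
cached.collect()  # served from the cached copy
```

The uncached dataset does its work on every access, while the cached one computes once and reuses the stored result, which is the trade of memory for time that caching makes.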
42. What data sources are available in Spark SQL?
JSON files, Parquet, Avro, Hive tables, etc. are the data sources available in Spark SQL.
43. What are the various persistence levels in Apache Spark?
Spark can persist intermediate data on its own. However, it is highly recommended to call the persist() method on an RDD that will be reused. The persistence (storage) levels in Spark are:
- MEMORY_ONLY
- MEMORY_ONLY_SER
- MEMORY_AND_DISK
- MEMORY_AND_DISK_SER
- DISK_ONLY
- OFF_HEAP
44. What is checkpointing in Spark?
Checkpointing is a Spark feature that saves an RDD to reliable storage such as HDFS so that, after a failure, the computation can restart from the previously computed state. The user decides when to checkpoint, and once an RDD is checkpointed it loses its transformation history (its lineage).
Fault tolerance is maintained because the checkpointed data itself holds the result of the previous transformations.
45. What is Akka? How does Spark use it?
Akka is a toolkit for building reactive, distributed, parallel and resilient concurrent applications in Scala and Java. Early versions of Apache Spark were built on top of Akka.
Those versions used Akka for messaging between the master and the worker nodes when scheduling and assigning tasks. Spark 2.0 removed the Akka dependency in favour of Spark’s own RPC layer.
46. Explain the Lazy Evaluation in Spark.
When an operation is requested on a dataset, Spark makes a note of it and remembers it, but does nothing until a result is actually needed.
There are two types of operations in Spark: transformations and actions.
No computation takes place when a programmer calls a transformation such as map() or filter(). Only an action such as collect() or count() finally triggers the computation.
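Lazy evaluation can be demonstrated with a small plain-Python sketch in which transformations are only recorded and an action replays them. `LazyPipeline` and `traced_double` are hypothetical names for illustration; Spark additionally plans and optimizes the recorded steps before running them.

```python
class LazyPipeline:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def map(self, func):
        # Transformation: only remember the step, do not run it.
        return LazyPipeline(self._data, self._steps + (("map", func),))

    def filter(self, pred):
        # Transformation: only remember the step, do not run it.
        return LazyPipeline(self._data, self._steps + (("filter", pred),))

    def collect(self):
        # Action: replay every recorded step and return the result.
        result = list(self._data)
        for kind, func in self._steps:
            if kind == "map":
                result = [func(x) for x in result]
            else:
                result = [x for x in result if func(x)]
        return result


calls = []


def traced_double(x):
    calls.append(x)  # record that the function really executed
    return x * 2


pipeline = LazyPipeline([1, 2, 3]).map(traced_double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # nothing has run yet
result = pipeline.collect()
calls_after_action = len(calls)   # the map function ran once per element
```

Building the pipeline costs nothing; the traced function only executes once `collect()` is called.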
47. Explain the difference between Spark SQL, HQL, and SQL.
Spark SQL is a component built on top of Spark Core that supports both SQL and HQL (Hive Query Language) queries without any change to their syntax.
48. Where do you use Spark streaming?
It is used when real-time data needs to be streamed into the Spark program. It can be streamed from various sources including Kafka, Flume, Amazon Kinesis, etc. The streamed data is divided into batches for processing.
Spark Streaming is used, for example, for real-time sentiment analysis of customers on social media platforms such as Twitter and Facebook.
Live stream processing is also essential for downtime alerting, fraud detection in financial organizations and stock market predictions.
49. Explain the Sliding Window operation in Spark.
DStreams are continuous sequences of data batches that stream into the Spark program for processing. The Spark Streaming library provides windowed computations: a window slides over the DStream, and the source RDDs that fall within that window are combined to produce the RDDs of the windowed DStream.
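A sliding window over batches can be sketched in plain Python as follows. The function `sliding_windows` is a hypothetical helper, and the window length and slide interval are expressed here as a number of batches rather than as durations; the sample batches are invented for the example.

```python
def sliding_windows(batches, window_length, slide_interval):
    """Combine the batches inside each window; the window moves by slide_interval batches."""
    windows = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        # Merge the batches that fall inside this window into one collection.
        combined = [item for batch in window for item in batch]
        windows.append(combined)
    return windows


# Five consecutive batches, as they might arrive from a stream.
batches = [[1], [2, 3], [4], [5, 6], [7]]
windows = sliding_windows(batches, window_length=3, slide_interval=2)
window_sums = [sum(window) for window in windows]
```

With a window of three batches that slides by two, consecutive windows overlap by one batch, so the batch `[4]` contributes to both windows.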
50. What is DStream in Spark?
DStream stands for Discretized Stream, the basic abstraction in Spark Streaming. It represents a continuous stream of data coming from a source such as HDFS, Kafka or Flume. Internally, a DStream is a continuous series of RDDs, one per batch.
Any operation performed on a DStream translates into operations on the underlying RDDs.