The following is a compilation of all the resources I’ve used to pass the Databricks 2.X Spark Certification, as well as some questions that came up.
Background on the Exam
The Databricks Spark 2.x with Scala certification exam is made up of 40 multiple-choice questions, with 3 hours to answer them in total. Registration grants you two attempts: if you fail the first, you have to wait 15 days before you can apply for the second. The focus of the exam is Spark DataFrames. The question distribution is outlined very well in the LinkedIn article listed in the resources below.
Resources
- Spark in Action
- Learning Spark: Outdated but has useful information regarding RDDs.
- Spark: The Definitive Guide (either this or Spark in Action).
- LinkedIn article with associated GitHub repository: https://www.linkedin.com/pulse/5-tips-cracking-databricks-apache-spark-certification-vivek-bombatkar/ and https://github.com/vivek-bombatkar/Databricks-Apache-Spark-2X-Certified-Developer. Note that the GitHub repo contains a sample exam; two of the provided answers are incorrect.
- Medium article https://link.medium.com/l5Sw4zn5WY
The following was sent to me by someone from Databricks' learning center: https://databricks-prod-cloudfront.cloud.databricks.com/public/793177bc53e528530b06c78a4fa0e086/0/6221173/100020/latest.html This sample exam shows the format of the questions.
Questions that came up
Which line will trigger evaluation of the physical plan? (i.e., which line is an action)
Action vs Transformation
Default storage level for RDDs vs DataFrames (using cache()): MEMORY_ONLY and MEMORY_AND_DISK respectively.
Of all the given blocks of code, which one has the least bottleneck?
Does using an accumulator prevent us from using the Catalyst optimizations?
What are the consequences of increasing the number of partitions?
For Structured Streaming, review fault tolerance for every sink. There were no questions on (legacy) Spark Streaming. One windowing question came up, but it was very simple: you need to know that the slide duration should be smaller than the window duration.
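To see why the slide duration matters relative to the window duration, here is a plain-Python sketch (no Spark needed) of how sliding windows like Spark's window(timeColumn, windowDuration, slideDuration) assign an event to overlapping windows. The helper name is ours, not a Spark API:

```python
def windows_containing(t, window, slide):
    """Return the [start, end) windows (starts are multiples of `slide`)
    that contain an event at time t, i.e. start <= t < start + window."""
    starts = []
    k = (t - window) // slide + 1  # smallest k with k*slide > t - window
    while k * slide <= t:
        starts.append((k * slide, k * slide + window))
        k += 1
    return starts

# An event at t=10 with a 10s window sliding every 5s falls into two windows:
print(windows_containing(10, 10, 5))   # [(5, 15), (10, 20)]
# With slide > window, some events fall into no window at all (gaps):
print(windows_containing(12, 10, 15))  # []
```

With slide == window you get non-overlapping (tumbling) windows; with slide < window each event lands in roughly window/slide overlapping windows.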
Apply BFS with GraphFrames.
crossJoin: know that it does a Cartesian product, so it can exhaust memory.
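The memory concern is easy to see without Spark: a Cartesian product yields |left| x |right| rows, so even two modest inputs multiply quickly. A plain-Python illustration:

```python
from itertools import product

left = list(range(1_000))    # 1,000 rows
right = list(range(1_000))   # 1,000 rows

# df1.crossJoin(df2) pairs every left row with every right row:
rows = list(product(left, right))
print(len(rows))  # 1000000 -- two 1,000-row inputs already yield a million rows
```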
Broadcast joins are done automatically as long as the dataset to be broadcast is smaller than 10 MB.
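The 10 MB figure is the default value of spark.sql.autoBroadcastJoinThreshold; as a spark-defaults.conf fragment (the value shown is the default):

```
# Maximum table size (in bytes) broadcast automatically for joins.
# Default is 10485760 (10 MB); set to -1 to disable auto-broadcasting.
spark.sql.autoBroadcastJoinThreshold  10485760
```

Independently of this threshold, a broadcast join can be forced with the broadcast() hint from org.apache.spark.sql.functions.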
What can we do if we want to handle a file format that is not supported by the DataFrame API?
UDF questions and how to call them: the parameters of the register method, how to call the registered UDF, and the function that will be invoked.
FooBarBaz question: printing something based on multiples of 3 or 5 or both. Four or five code blocks are given and you need to pick the one with the correct output.
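The exact strings in the exam's version may differ, but a reference implementation of this FizzBuzz-style exercise helps you spot the usual trap in the candidate blocks (e.g. an if/elif ordering that never emits the combined label). A sketch, assuming "Foo" for multiples of 3 and "Bar" for multiples of 5:

```python
def foo_bar_baz(n):
    """Labels for 1..n: 'Foo' for multiples of 3, 'Bar' for multiples of 5,
    'FooBar' for multiples of both, otherwise the number itself."""
    out = []
    for i in range(1, n + 1):
        label = ""
        if i % 3 == 0:
            label += "Foo"
        if i % 5 == 0:
            label += "Bar"
        out.append(label or str(i))
    return out

print(foo_bar_baz(15))
# ['1', '2', 'Foo', '4', 'Bar', 'Foo', '7', '8', 'Foo', 'Bar',
#  '11', 'Foo', '13', '14', 'FooBar']
```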