Databricks Certified Developer for Apache Spark Scala Exam Questions

Serkan SAKINMAZ
5 min read · Feb 20, 2022

Databricks Academy is one of the most popular exam sources for Apache Spark. In this section, I have listed some example questions to help you get a feel for the kind of questions you will see.

Question 1

The following DataFrames have "address_id" as a common field. Which code block should be executed to perform an inner join?

Answers

Explanation 1

Please see the definition of join.

Based on that, the following answer is correct

customerDF.join(addressDF, col("address_id") === col("address_id"),"inner")
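For context, here is a minimal, self-contained sketch with hypothetical customer and address data. Note that in practice, qualifying each column through its parent DataFrame avoids the ambiguous-reference error that two unqualified col("address_id") calls can raise when both sides share the column name:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data, for illustration only
val customerDF = Seq((1, "Alice"), (2, "Bob")).toDF("address_id", "name")
val addressDF = Seq((1, "Berlin"), (3, "Paris")).toDF("address_id", "city")

// Qualified columns make the join condition unambiguous
customerDF.join(addressDF, customerDF("address_id") === addressDF("address_id"), "inner").show()
// Only address_id = 1 matches, so the result contains a single row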

Wrong answers

customerDF.join(addressDF, "address_id" === "address_id","inner") is wrong because Spark cannot interpret the plain string "address_id" as a column; you need col(...) or the $ interpolator

customerDF.join(addressDF, col(address_id) === col(address_id),"inner") is wrong because the col function takes the column name as a String, so it must be wrapped in double quotes

customerDF.innerJoin(addressDF, col("address_id") === col("address_id")) is wrong because there is no innerJoin function on DataFrame

Question 2

Answers

Explanation 2

Correct answer

The question asks for distinct data based on "id" and "name". The following code finds the distinct records based on those two columns:

countryDF.dropDuplicates(List("id","name")).show()

An alternative would be countryDF.distinct().show(), but only if the DataFrame contains just the "id" and "name" columns, because distinct() considers all columns.

Wrong answers

The following code gives a compile error because distinct() does not take a column list:

countryDF.distinct(List("id","name")).show()

The following code finds unique data based only on "id":

countryDF.dropDuplicates(List("id")).show()

The following code works the same way, but finds unique data based only on "name":

countryDF.dropDuplicates(List("name")).show()
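To see the difference between the options above, here is a small sketch with a hypothetical countryDF that carries an extra column; it also shows why distinct() only works as an alternative when "id" and "name" are the only columns:

// Assumes an existing SparkSession `spark` with spark.implicits._ imported
// Hypothetical data: the first two rows duplicate (id, name) but not all columns
val countryDF = Seq(
  (1, "France", "EU"),
  (1, "France", "Europe"),
  (2, "Spain", "EU")
).toDF("id", "name", "region")

countryDF.dropDuplicates(List("id", "name")).show() // 2 rows: dedup on id and name
countryDF.distinct().show() // 3 rows: all columns are considered
countryDF.dropDuplicates(List("id")).show() // dedup on id only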

Question 3

The customer DataFrame has a "birthdate" field in "yyyy-MM-dd" format. We would like to find the number of customers whose birth year is between 1991 and 1993. Select the right code block from the following options.

Answers

Explanation 3

Right answer

val count = customerDF.where(year($"birthdate") >= 1991 && year($"birthdate") <= 1993).count()

println("Count : " + count)
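Here is a minimal runnable sketch with hypothetical birthdates; to_date turns the string into a date column so that year() can extract the year:

// Assumes an existing SparkSession `spark` with spark.implicits._ imported
import org.apache.spark.sql.functions.{col, to_date, year}

val customerDF = Seq("1990-05-01", "1992-03-15", "1993-11-30")
  .toDF("birthdate")
  .withColumn("birthdate", to_date(col("birthdate"), "yyyy-MM-dd"))

val count = customerDF.where(year($"birthdate") >= 1991 && year($"birthdate") <= 1993).count()
println("Count : " + count) // 2 (the 1992 and 1993 rows)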

Wrong answers

The following answer is wrong because comparing $"birthdate" directly to an integer does not extract the year; you have to use the year function

val count = customerDF.where($"birthdate" > 1991 && $"birthdate" < 1993).count()

println("Count : " + count)

The following answer is wrong because the year function takes a Column, not a String, so it does not compile

val count = customerDF.where(year("birthdate") > 1991 && year("birthdate") < 1993).count()

println("Count : " + count)

The following answer is wrong because the col function takes the column name as a String in double quotes; here birthdate is an undefined identifier, so the code does not compile

val count = customerDF.filter(year(col(birthdate)) > 1991 && year(col(birthdate)) < 1993).count()

println("Count : " + count)

Question 4

Which running modes does Apache Spark support? (Select three)

Answers

Explanation 4

Apache Spark supports the following modes:

- Local

- Standalone

- YARN

- Mesos

- Kubernetes

- etc.
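For illustration, the running mode is usually selected through the master URL when you build the SparkSession (or via spark-submit --master); the hostnames below are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("mode-sketch")
  .master("local[*]") // local mode: all cores in a single JVM
  // .master("spark://host:7077") // standalone cluster
  // .master("yarn") // Hadoop YARN
  // .master("mesos://host:5050") // Apache Mesos
  // .master("k8s://https://host:6443") // Kubernetes
  .getOrCreate()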

Wrong answers

Docker is a containerization platform; you can run a Spark cluster in Docker containers, but Docker is not a running mode by itself.

Spark can process data in memory, but in-memory processing is not a running mode.

Question 5

We have the following Spark code and output.

Answers

Explanation 5

Right answer

selectedDF.groupBy("id").max("cost").show(5)

https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrame.html#groupBy(org.apache.spark.sql.Column...)
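As a small sketch with a hypothetical selectedDF: groupBy returns a grouped dataset, and max("cost") computes the maximum per group:

// Assumes an existing SparkSession `spark` with spark.implicits._ imported
// Hypothetical data, for illustration only
val selectedDF = Seq((1, 10.0), (1, 25.0), (2, 5.0)).toDF("id", "cost")

selectedDF.groupBy("id").max("cost").show(5)
// Produces one row per id with a max(cost) column; row order may vary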

Wrong answers

The following answer is wrong because the ws_item_sk column no longer exists after the select:

selectedDF.groupBy("ws_item_sk").max("cost").show(5)

The following answer does not compile because of its unbalanced string literals:

selectedDF.selectExpr("groupBy(id)).max("cost").show(5)

The following answer is wrong; it fails with org.apache.spark.sql.AnalysisException: Undefined function: 'groupBy', because groupBy is not a SQL expression:

selectedDF.selectExpr("groupBy(id)").selectExpr("max(cost)").show(5)

Question 6

In Spark, which of the following are defined as wide transformations? (Select 3)

Answers

Explanation 6

Right answers

groupByKey() , join() , distinct()

Wide transformation: in a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD, so a shuffle is needed. Wide transformations result from operations such as groupByKey() and reduceByKey().

Ref: data-flair.training, O'Reilly

Wrong answers

The following are narrow transformations:

filter() , map() , union()

Narrow transformations transform data without any shuffle involved.
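A quick sketch makes the shuffle boundary visible: calling toDebugString on an RDD prints a ShuffledRDD stage only after a wide transformation such as groupByKey():

// Assumes an existing SparkSession named `spark`
val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

val narrow = rdd.map { case (k, v) => (k, v * 2) } // narrow: no shuffle
val wide = narrow.groupByKey() // wide: requires a shuffle

println(wide.toDebugString) // shows a ShuffledRDD above the MapPartitionsRDD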

Question 7

Which of the following persistence levels stores the RDD as deserialized Java objects in the JVM and, if the RDD does not fit in memory, stores the partitions that don't fit on disk?

Answers

Explanation 7

Reference : https://spark.apache.org/docs/2.1.0/programming-guide.html#rdd-persistence

Right answer

MEMORY_AND_DISK

Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don’t fit on disk, and read them from there when they’re needed.
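As a minimal sketch with a hypothetical RDD, the level is passed to persist via StorageLevel:

import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkSession named `spark`
val rdd = spark.sparkContext.parallelize(1 to 1000000)
rdd.persist(StorageLevel.MEMORY_AND_DISK) // deserialized in the JVM, spills to disk
println(rdd.count()) // the first action materializes and caches the partitions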

Wrong answers

MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level

MEMORY_ONLY_SER - Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

MEMORY_AND_DISK_SER - Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

Do you want more questions like this?

Udemy — Databricks Certified Developer for Apache Spark Scala Test

50 exam questions plus 10 bonus tests to prepare for the Databricks Certified Associate Developer for Apache Spark 3.0 — Scala exam

Go to the course with this link!
