Databricks Certified Developer for Apache Spark Scala Exam Questions
Databricks Academy is one of the most popular exam sources for Apache Spark. In this section, I have listed some example questions to give you a feel for the kinds of questions the exam asks.
Question 1
The following DataFrames have "address_id" as a common field. Which code block should be executed to perform an inner join?
Answers
Explanation 1
Please see the definition of join.
Based on that, the following answer is correct:
customerDF.join(addressDF, col("address_id") === col("address_id"), "inner")
Wrong answers
customerDF.join(addressDF, "address_id" === "address_id","inner")
is wrong because Spark cannot resolve "address_id" as a column without col(...) or the $ interpolator; in fact, a plain String has no === operator, so this does not even compile
customerDF.join(addressDF, col(address_id) === col(address_id),"inner")
is wrong because the col function takes the column name as a String in double quotes; a bare address_id is an undefined identifier and does not compile
customerDF.innerJoin(addressDF, col("address_id") === col("address_id"))
is wrong because DataFrame has no innerJoin method; the join type is passed as the third argument of join
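As a runnable sketch of the join itself (the sample data below is an assumption, since the question's schemas are not shown): when both DataFrames share the column name, col("address_id") === col("address_id") can raise an ambiguous-reference error at runtime, so qualifying each side through its DataFrame, or passing the key as a Seq, is the safer pattern.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data; the exam question's real schemas are not shown.
val customerDF = Seq((1, "Alice"), (2, "Bob")).toDF("address_id", "name")
val addressDF = Seq((1, "Berlin"), (3, "Paris")).toDF("address_id", "city")

// Qualifying each column through its DataFrame avoids the ambiguity.
customerDF.join(addressDF, customerDF("address_id") === addressDF("address_id"), "inner").show()

// Alternative: a Seq of column names joins on the key and keeps it only once in the output.
customerDF.join(addressDF, Seq("address_id"), "inner").show()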
Question 2
Answers
Explanation 2
Correct answer
The question asks for distinct data based on "id" and "name". The following code finds the distinct records over those two columns:
countryDF.dropDuplicates(List("id","name")).show()
An alternative would be countryDF.distinct().show(), but only if the DataFrame contains no columns other than "id" and "name", since distinct() compares all columns.
Wrong answers
The following code gives a compile error, because distinct() takes no arguments
countryDF.distinct(List("id","name")).show()
The following code deduplicates based on "id" only
countryDF.dropDuplicates(List("id")).show()
Similarly, the following code deduplicates based on "name" only
countryDF.dropDuplicates(List("name")).show()
Question 3
The customer DataFrame has a "birthdate" field in "yyyy-MM-dd" format. We would like to find the number of customers whose birth year is between 1991 and 1993 (inclusive). Select the right code block from the following options.
Answers
Explanation 3
Right answer
val count = customerDF.where(year($"birthdate") >= 1991 && year($"birthdate") <= 1993).count()
println("Count : " + count)
Wrong answers
The following answer is wrong because comparing $"birthdate" directly to an integer does not extract the year; you have to use the year function.
val count = customerDF.where($"birthdate" > 1991 && $"birthdate" < 1993).count()
println("Count : " + count)
The following answer is wrong because the year function takes a Column, not a String literal
val count = customerDF.where(year("birthdate") > 1991 && year("birthdate") < 1993).count()
println("Count : " + count)
The following answer is wrong because the col function takes the column name as a String in double quotes; a bare birthdate does not compile
val count = customerDF.filter(year(col(birthdate)) > 1991 && year(col(birthdate)) < 1993).count()
println("Count : " + count)
Question 4
Which running modes does Apache Spark support? (Select three)
Answers
Explanation 4
Apache Spark supports the following modes:
- Local
- Standalone
- YARN
- Mesos
- Kubernetes
- etc.
Wrong answers
Docker is a containerization platform; you can run a Spark cluster in Docker containers, but Docker is not a running mode by itself.
Spark can process data in memory, but in-memory processing is not a running mode.
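The running mode is chosen through the master URL when the application (or spark-submit) starts; a brief sketch, with host names and ports as placeholders:
import org.apache.spark.sql.SparkSession

// Local mode: run Spark in a single JVM using all available cores.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("mode-demo")
  .getOrCreate()

// Master URLs for the cluster modes (placeholders; pick one per application):
// "spark://master-host:7077"         -> Standalone
// "yarn"                             -> Hadoop YARN
// "k8s://https://k8s-apiserver:6443" -> Kubernetes
// "mesos://mesos-host:5050"          -> Mesos (deprecated in recent Spark releases)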
Question 5
We have the following Spark code and output.
Answers
Explanation 5
Right answer
selectedDF.groupBy("id").max("cost").show(5)
Wrong answers
The following answer is wrong because ws_item_sk no longer exists after the select step (it was aliased away)
selectedDF.groupBy("ws_item_sk").max("cost").show(5)
The following answer does not compile because of unbalanced string literals
selectedDF.selectExpr("groupBy(id)).max("cost").show(5)
The following answer is wrong; it fails with org.apache.spark.sql.AnalysisException: Undefined function: 'groupBy', because selectExpr accepts SQL expressions and groupBy is not a SQL function.
selectedDF.selectExpr("groupBy(id)").selectExpr("max(cost)").show(5)
Question 6
In Spark, which of the following are wide transformations? (Select 3)
Answers
Explanation 6
Right answers
- groupByKey()
- join()
- distinct()
Wide transformation: all the elements required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are the result of operations such as groupByKey() and reduceByKey().
Ref : data-flair.training, O'Reilly
Wrong answers
The following are narrow transformations:
- filter()
- map()
- union()
Narrow transformations transform data without any shuffle involved.
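One way to see the difference is to inspect the RDD lineage: a shuffle boundary appears only for the wide transformation. A minimal sketch:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dep-demo").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Narrow: map keeps a one-to-one partition dependency, so no shuffle stage appears.
println(rdd.map { case (k, v) => (k, v * 2) }.toDebugString)

// Wide: groupByKey repartitions by key, so the lineage shows a ShuffledRDD.
println(rdd.groupByKey().toDebugString)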
Question 7
Which of the following persistence levels stores the RDD as deserialized Java objects in the JVM and, if the RDD does not fit in memory, stores the partitions that don't fit on disk?
Answers
Explanation 7
Reference : https://spark.apache.org/docs/2.1.0/programming-guide.html#rdd-persistence
Right answer
MEMORY_AND_DISK
- Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
Wrong answers
MEMORY_ONLY
- Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level
MEMORY_ONLY_SER
- Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER
- Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
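In code, the level is passed to persist(); a short sketch:
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[*]").appName("persist-demo").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 1000000)

// Deserialized in memory; partitions that don't fit spill to disk.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

// For RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), the default level.
println(rdd.count())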
Do you want more questions like this?
Udemy — Databricks Certified Developer for Apache Spark Scala Test
50 exam questions with 10 bonus tests to prepare for the Databricks Certified Associate Developer for Apache Spark 3.0 — Scala exam
Go to the course with this link!