TL;DR: Why do writes to a Delta table from my Spark cluster fail unless my Jupyter Notebook container also has access to the data location, contrary to my expectation that Spark should handle writes independently of Jupyter's data access?
I've set up a Jupyter Notebook running PySpark, connected to a Spark cluster that is intended to perform the writes to a Delta table. However, I'm observing that the writes fail to complete if the Jupyter Notebook container doesn't have access to the data location. Repo for reproducibility.
Setup:
```yaml
version: '3'
services:
  spark:
    image: com/data_lake_spark:latest
    # Spark service configuration details...
  spark-worker-1:
    # Configuration details...
  spark-worker-2:
    # Configuration details...
  jupyter:
    image: com/data_lake_notebook:latest
    # Jupyter Notebook service configuration details...
```
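The key part for this question is which containers mount the data directory. A minimal sketch of what the shared mount might look like; the service names come from the compose file above, but the host path `./data` is an assumption based on the `file:/data/...` path in the error below:

```yaml
services:
  spark:
    volumes:
      - ./data:/data   # master container can reach the Delta table location
  spark-worker-1:
    volumes:
      - ./data:/data   # each worker needs the same path for file:/ writes
  spark-worker-2:
    volumes:
      - ./data:/data
  jupyter:
    volumes:
      - ./data:/data   # removing this mount is what triggers the error below
```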
Spark Session Configuration:
# Spark session setup...
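The actual session setup is elided above; for context, a minimal sketch of a typical Delta-enabled session, assuming the delta-spark pip package and a standalone master reachable at spark://spark:7077 (the master URL and app name are assumptions based on the compose service name):

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Hypothetical session setup; only the Delta configs are standard boilerplate.
builder = (
    SparkSession.builder
    .appName("delta_write_test")
    .master("spark://spark:7077")  # assumed standalone master from the compose file
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```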
Write Command:
```python
# Write initial test data to Delta table
owner_df.write.format("delta").mode("overwrite").save(delta_output_path)
```
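For completeness, a self-contained version of that write; `owner_df` here is hypothetical sample data, and `delta_output_path` is taken from the path in the error message below:

```python
# Hypothetical test data; only the output path comes from the error message.
delta_output_path = "/data/delta_table_of_dog_owners"

owner_df = spark.createDataFrame(
    [("Alice", "Rex"), ("Bob", "Fido")],
    ["owner", "dog"],
)

# Overwrite any existing table at the target location
owner_df.write.format("delta").mode("overwrite").save(delta_output_path)
```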
Removing Jupyter's access to the `/data` directory in the Docker Compose configuration results in a DeltaIOException when attempting to write to the Delta table. Restoring access to the `/data` directory allows successful writes.
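In client mode the notebook process hosts the Spark driver, so one way to narrow down which process actually lacks access is to check the path from both the notebook (driver) and the executors. A small diagnostic sketch, assuming the `spark` session from above:

```python
import os

# Does the notebook (driver) process see the data directory?
print("driver sees /data:", os.path.exists("/data"))

# Do the executor processes on the workers see it?
def check_path(_):
    import os
    return os.path.exists("/data")

print("executors see /data:",
      spark.sparkContext.parallelize(range(2), 2).map(check_path).collect())
```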
Error Message:
```
Py4JJavaError: An error occurred while calling o56.save.
: org.apache.spark.sql.delta.DeltaIOException: [DELTA_CANNOT_CREATE_LOG_PATH] Cannot create file:/data/delta_table_of_dog_owners/_delta_log
    at org.apache.spark.sql.delta.DeltaErrorsBase.cannotCreateLogPathException(DeltaErrors.scala:1534)
    at org.apache.spark.sql.delta.DeltaErrorsBase.cannotCreateLogPathException$(DeltaErrors.scala:1533)
    at org.apache.spark.sql.delta.DeltaErrors$.cannotCreateLogPathException(DeltaErrors.scala:3203)
    at org.apache.spark.sql.delta.DeltaLog.createDirIfNotExists$1(DeltaLog.scala:443)
```
I expected Spark to handle the writes independently of Jupyter's data access. Any insights or suggestions for resolving this issue would be appreciated.