Channel: Recent Questions - Stack Overflow

Clustered Spark fails to write _delta_log via a Notebook without granting the Notebook data access?


TL;DR: Why do my Spark cluster's writes to a Delta table fail unless the Jupyter Notebook container also has access to the data location? I expected Spark to handle writes independently of Jupyter's data access.

I've set up a PySpark Jupyter Notebook connected to a Spark cluster, where the Spark instance is intended to perform writes to a Delta table. However, I'm observing that the Spark instance fails to complete the writes if the Jupyter Notebook doesn't have access to the data location. Repo for reproducibility.

Setup:

version: '3'
services:
  spark:
    image: com/data_lake_spark:latest
    # Spark service configuration details...
  spark-worker-1:
    # Configuration details...
  spark-worker-2:
    # Configuration details...
  jupyter:
    image: com/data_lake_notebook:latest
    # Jupyter Notebook service configuration details...

Spark Session Configuration:

# Spark session setup...
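The session setup is elided in the post; for context, a minimal Delta-enabled session pointed at a standalone cluster typically looks like the sketch below. The master URL, app name, and Delta package version are assumptions, not taken from the repo.

```python
from pyspark.sql import SparkSession

# All values here are illustrative, not from the repo.
spark = (
    SparkSession.builder
    .appName("data_lake_notebook")
    .master("spark://spark:7077")  # compose service name of the Spark master
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.1.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```

Note that when a notebook builds the session this way, the driver runs in client mode inside the notebook's own container.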

Code triggering the write:

# Write initial test data to Delta table
owner_df.write.format("delta").mode("overwrite").save(delta_output_path)

Removing Jupyter's access to the /data directory in the Docker Compose configuration results in a DeltaIOException when attempting to write to the Delta table. However, providing access to the /data directory allows successful writes.
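Concretely, "providing access" here means giving the jupyter service the same /data mount the Spark services use in the Compose file; the host path below is an assumption, not taken from the repo:

```yaml
  jupyter:
    image: com/data_lake_notebook:latest
    volumes:
      - ./data:/data  # same bind mount the spark and spark-worker services use
```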

Error Message:

Py4JJavaError: An error occurred while calling o56.save.
: org.apache.spark.sql.delta.DeltaIOException: [DELTA_CANNOT_CREATE_LOG_PATH] Cannot create file:/data/delta_table_of_dog_owners/_delta_log
    at org.apache.spark.sql.delta.DeltaErrorsBase.cannotCreateLogPathException(DeltaErrors.scala:1534)
    at org.apache.spark.sql.delta.DeltaErrorsBase.cannotCreateLogPathException$(DeltaErrors.scala:1533)
    at org.apache.spark.sql.delta.DeltaErrors$.cannotCreateLogPathException(DeltaErrors.scala:3203)
    at org.apache.spark.sql.delta.DeltaLog.createDirIfNotExists$1(DeltaLog.scala:443)
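The stack trace points at DeltaLog.createDirIfNotExists, which runs on the driver, not on the executors; with a file:/ path this is a plain local mkdir in whichever container hosts the driver. A minimal sketch simulating that driver-side step (the helper function and paths are hypothetical, for illustration only):

```python
import os
import tempfile
from pathlib import Path

# Before any executor writes data files, the driver-side Delta code
# (DeltaLog.createDirIfNotExists in the stack trace above) creates the
# _delta_log directory at the table path. With a file:/ path that is a
# local filesystem call in the driver's container.
def create_delta_log_dir(table_path: str) -> Path:
    log_dir = Path(table_path) / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)  # fails if the driver can't see the path
    return log_dir

# Hypothetical path standing in for /data/delta_table_of_dog_owners:
with tempfile.TemporaryDirectory() as base:
    log = create_delta_log_dir(os.path.join(base, "delta_table_of_dog_owners"))
    print(log.name)  # prints: _delta_log
```

Since a PySpark notebook launches the driver in client mode inside the Jupyter container, that container is the one that must be able to create _delta_log.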

I expected Spark to handle writes independently of Jupyter's data access. Any insights or suggestions for resolving this issue would be appreciated.

