Quick Start
Delta’s documentation on how to enable it with Python is relatively straightforward: you install the delta-spark package with pip, add the Delta-related configuration, wrap the PySpark builder in a call to configure_spark_with_delta_pip, and then .getOrCreate() your session.
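Putting those steps together, a minimal session setup looks roughly like the following sketch, adapted from the Delta quick-start (the app name is just a placeholder):

```python
import pyspark
from delta import configure_spark_with_delta_pip

# Standard Delta configuration: register the SQL extension and the Delta catalog.
builder = (
    pyspark.sql.SparkSession.builder.appName("delta-quickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

# configure_spark_with_delta_pip injects the delta-core package, then the session is built.
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```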
By looking at its code, you’ll find that all it does is add a spark.jars.packages entry to your session’s configuration, which in turn puts the required Java artifact on your classpath.
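In other words, calling the helper is roughly equivalent to setting the package yourself. The sketch below assumes Scala 2.12 and Delta 1.0.0; adjust the Maven coordinates to your versions:

```python
import pyspark

# Roughly what configure_spark_with_delta_pip does under the hood:
# point spark.jars.packages at the delta-core Maven coordinates so Spark
# downloads the jar and puts it on the classpath. The version is an assumption.
spark = (
    pyspark.sql.SparkSession.builder.appName("delta-manual-packages")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)
```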
AWS Glue
This installation approach works on a typical setup; however, when I tried to use it for a script on AWS Glue, the package was not being placed on the classpath, which caused a ClassNotFoundException. To make it work, I had to download the desired delta-core JAR from the Maven repository, upload it to S3, and pass its path to the Glue job as a Dependent JARs path.
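If you configure the job programmatically rather than through the console, the same thing can be expressed with boto3. The job name, role, script location, and JAR path below are placeholders, and the Glue and Delta versions are assumptions you should match to your own setup:

```python
import boto3

glue = boto3.client("glue")

# Point the Glue job at the delta-core JAR uploaded to S3 via --extra-jars,
# which is what the "Dependent JARs path" field sets in the console.
glue.update_job(
    JobName="my-delta-job",  # hypothetical job name
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/MyGlueJobRole",  # placeholder role
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://my-bucket/scripts/my_delta_script.py",
        },
        "DefaultArguments": {
            "--extra-jars": "s3://my-bucket/jars/delta-core_2.12-1.0.0.jar",
        },
        "GlueVersion": "3.0",  # Glue 3.0 runs Spark 3.1, which fits the version constraint below
    },
)
```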
PySpark version constraints
At the time of this writing, the Delta package only works with PySpark < 3.2. If you try to run it with a newer version, it raises the following exception:
java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.SQLConfHelper
Overall, it’s good to make sure your Spark and PySpark versions match each other and that they are compatible with the Delta version you’re using.
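A cheap guard against this class of mismatch is to assert the PySpark version at the top of the job. The upper bound below reflects the < 3.2 constraint mentioned above and should be adjusted as Delta releases catch up:

```python
import pyspark

# Fail fast with a readable message instead of a ClassNotFoundException
# deep inside the JVM. The (3, 2) bound is the constraint from this post.
major, minor = (int(part) for part in pyspark.__version__.split(".")[:2])
if (major, minor) >= (3, 2):
    raise RuntimeError(
        f"PySpark {pyspark.__version__} is newer than what this Delta release supports (< 3.2)"
    )
```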