Delta’s documentation on enabling it with Python is relatively straightforward: you install the
delta-spark package with pip, add the
Delta-related configuration, wrap the PySpark builder in a call to
configure_spark_with_delta_pip, and then call
.getOrCreate() on your session builder.
If you look at its code, you’ll find that all it does is add a
spark.jars.packages entry to your session’s configuration, which in turn puts the required Java module on your classpath.
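The setup above can be sketched as follows. This is a minimal local-session example, assuming delta-spark and a compatible pyspark are already pip-installed; the app name is a placeholder.

```python
# Sketch of enabling Delta on a local PySpark session, assuming
# delta-spark and a compatible pyspark are pip-installed.
def build_delta_session(app_name="delta-demo"):
    from pyspark.sql import SparkSession
    from delta import configure_spark_with_delta_pip

    builder = (
        SparkSession.builder.appName(app_name)
        # The two configs Delta's docs ask for:
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    # configure_spark_with_delta_pip only appends the matching
    # io.delta:delta-core coordinate to spark.jars.packages.
    return configure_spark_with_delta_pip(builder).getOrCreate()
```

Calling `build_delta_session()` then gives you a session that can read and write Delta tables.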
This installation approach works on a typical setup; however, when I tried to use it for a script on AWS Glue, the package was not placed on the classpath, causing a ClassNotFoundException. To make it work, I had to download the desired delta-core JAR from the Maven repository, upload it to S3, and pass its path to the Glue job as a Dependent JARs path.
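The upload step of that workaround might look like the following sketch. The bucket, key, and JAR file name are hypothetical placeholders, not values from the original setup; the returned URI is what you would paste into the Glue job's Dependent JARs path field.

```python
# Hypothetical sketch of the Glue workaround; bucket/key/jar names
# are placeholders.
def upload_dependent_jar(local_jar, bucket, key):
    """Upload a Delta JAR to S3 and return the s3:// URI to set as
    the Glue job's 'Dependent JARs path'."""
    import boto3  # imported lazily so the sketch stays self-contained

    boto3.client("s3").upload_file(local_jar, bucket, key)
    return f"s3://{bucket}/{key}"
```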
PySpark version constraints
At the time of this writing, the Delta package works with
PySpark < 3.2. If you try to run it with a newer version, it’ll raise the following exception:
Overall, it’s good to make sure your Spark and PySpark versions match, and that both are compatible with your Delta version.
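A small pre-flight check along these lines can fail fast with a clear message instead of a cryptic runtime exception. This sketch assumes the "PySpark < 3.2" constraint that held at the time of writing; adjust the bound to whatever your Delta release supports.

```python
# Version guard sketch; the (3, 2) bound reflects the constraint
# at the time of writing and should track your Delta release.
def pyspark_compatible(version: str, max_exclusive=(3, 2)) -> bool:
    """Return True if the given PySpark version is below the bound."""
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) < max_exclusive
```

For example, `pyspark_compatible("3.1.2")` passes while `pyspark_compatible("3.2.0")` does not, so you could raise a descriptive error before building the session.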