Tip: Initialize PySpark session with Delta support

Quick Start

Delta’s documentation on how to enable it with Python is relatively straightforward: you install the delta-spark package with pip, add the Delta-related configuration, wrap the PySpark builder in a call to configure_spark_with_delta_pip, and then call getOrCreate to build your session.
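
Here is a minimal sketch of that flow, following the Delta docs; the app name is arbitrary and you still need pip install delta-spark first:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .appName("delta-quickstart")  # arbitrary app name
    # Delta-related configuration from the Delta documentation
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

# Wrapping the builder pulls in the Delta Maven package before the session starts
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```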

By looking at its code, you’ll find that all it does is add a spark.jars.packages entry to your session’s configuration, which in turn puts the required Java module on your classpath.
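
In other words, the wrapper is roughly equivalent to setting the package coordinates yourself. A sketch, assuming a Delta 1.0.x build for Scala 2.12 (adjust the coordinates to your own Spark/Delta versions):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-manual")  # arbitrary app name
    # What configure_spark_with_delta_pip adds under the hood
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.1")
    # The Delta-related configuration is still required
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)
```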

AWS Glue

This installation approach works on a typical setup; however, when I tried to use it in a script on AWS Glue, I realized the package was not being placed on the classpath, causing a ClassNotFound exception. To make it work, I had to download the desired delta-core JAR from the Maven repository, upload it to S3, and pass its path to the Glue job as a Dependent JARs path. A sketch of that wiring is shown below.
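
If you define the job programmatically, the console’s Dependent JARs path corresponds to the --extra-jars special parameter. A hedged sketch with boto3; the bucket, script, role, and Glue version here are placeholders, not values from my setup:

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="delta-example-job",  # hypothetical job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder IAM role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/delta_job.py",  # placeholder script path
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
    DefaultArguments={
        # "Dependent JARs path" in the console maps to --extra-jars in the API;
        # point it at the delta-core JAR you uploaded to S3
        "--extra-jars": "s3://my-bucket/jars/delta-core_2.12-1.0.1.jar",
    },
)
```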

PySpark version constraints

At the time of this writing, the Delta package works with PySpark < 3.2. If you try to run it with a newer version, it’ll raise the following exception:

java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.SQLConfHelper

Overall, it’s good to make sure your Spark and PySpark versions match, and that both are compatible with the Delta version you’re using.
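
If you want to fail fast instead of hitting the ClassNotFoundException at runtime, a small guard before building the session can help; this sketch assumes the constraint described above (PySpark < 3.2):

```python
import pyspark

# Compare only major.minor, e.g. "3.1.2" -> (3, 1)
major, minor = (int(part) for part in pyspark.__version__.split(".")[:2])

if (major, minor) >= (3, 2):
    raise RuntimeError(
        f"PySpark {pyspark.__version__} is newer than this Delta build supports; "
        "expect java.lang.ClassNotFoundException at runtime."
    )
```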