Tip: Initialize PySpark session with Delta support

Quick Start

Delta’s documentation on how to enable it with Python is relatively straightforward. You install the delta-spark package using pip, add the Delta-related configuration, wrap the PySpark builder in a call to configure_spark_with_delta_pip, and then .getOrCreate() your session.
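As a reference, here's a minimal sketch of that documented approach; the app name and the exact Delta configuration keys you need depend on your setup:

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# The Delta-related configuration mentioned above.
builder = (
    SparkSession.builder.appName("delta-quickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

# The helper wraps the builder, then the session is created as usual.
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```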

By looking at its code, you’ll find that all it does is add a spark.jars.packages entry to your session’s configuration, which in turn puts the required Java module on your classpath.
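In other words, you could achieve roughly the same thing by setting the package coordinates yourself; the artifact and version below (delta-core_2.12:1.0.0) are only illustrative, so pick the one matching your Scala/Spark versions:

```python
from pyspark.sql import SparkSession

# Roughly equivalent manual configuration, without the helper.
spark = (
    SparkSession.builder.appName("delta-manual")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)
```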

AWS Glue

This installation approach works in a typical setup; however, when I tried to use it in a script on AWS Glue, I found that the package was not being placed on the classpath, causing a ClassNotFoundException. To make it work, I had to download the desired delta-core jar from the Maven repository, upload it to S3, and pass its path to the Glue job as a Dependent jars path. A sketch of what the job script can then look like follows below.
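Since the jar is already on the classpath in that setup, you can skip configure_spark_with_delta_pip and only set the Delta session options. This is a sketch, assuming the Dependent jars path points at the delta-core jar you uploaded; the bucket and table path are placeholders:

```python
from pyspark.sql import SparkSession

# The delta-core jar is supplied through the "Dependent jars path" job setting,
# so spark.jars.packages is not needed here.
spark = (
    SparkSession.builder.appName("glue-delta-job")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# Example read of a Delta table on S3 (placeholder path).
df = spark.read.format("delta").load("s3://your-bucket/path/to/delta-table")
```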

PySpark version constraints

At the time of this writing, the Delta package works with PySpark < 3.2. If you try to run it with a newer version, it’ll raise the following exception:

java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.SQLConfHelper

Overall, it’s good to make sure your Spark and PySpark versions match each other and are compatible with your Delta version.
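One way to catch such a mismatch early is a small guard at the top of the script; the 3.2 cut-off below mirrors the constraint described above, and the packaging library is an assumed dependency:

```python
import pyspark
from packaging.version import Version

# Fail fast if the installed PySpark is newer than this Delta release supports.
if Version(pyspark.__version__) >= Version("3.2"):
    raise RuntimeError(
        f"PySpark {pyspark.__version__} is not supported by this Delta version; "
        "install pyspark<3.2 or upgrade delta-spark."
    )
```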
