Plumbum for shell scripting in Python

I was never a fan of scripting, and unless it was a real edge case, I would end up reshaping my needs around existing tools instead of writing a script. Why?

Simply put: as soon as I finish writing a Bash script, I quickly forget what the mess was about, and if I need to refactor it sometime later, I always end up with too many WTF/m.

[image]

Yes, I realize how cool it is, but in 2022 I prefer to spend my mental capacity on making something valuable rather than on deciphering my novice Bash skills 🙂

I learned about Babashka, the scripting tool for Clojure developers, a while back. I found it interesting to have the full power of a programming language within your shell scripts. I gave it a try yesterday and found a few reasons not to pick it up:

  1. Shell outs (using applications in your PATH) are not as smooth as in a regular Bash script.
  2. Things we like about shell scripts, like piping, are not as straightforward as they seem. There are threading macros, yet they are not easy to apply in every scenario (there are better, more Clojure-friendly alternatives).
  3. Clojure itself is a bizarre tool for system administration. Not a bad thing, but I am not much of a Clojure developer “yet.”
  4. The available libraries for system management are not as vast as the ones I know in Python.

Wait a minute, Python? Why not Python? At least it is the second-best citizen on any Unix system nowadays (unlike Babashka, which needs to be installed separately). I switched to Xonsh some years ago (oh man, it has been six years), and I have used Python’s subprocess, popen and sh from time to time, but I was never encouraged to switch to it altogether.

Prompted by these questions, I searched around and found Plumbum. I don’t want to showcase it by duplicating its excellent documentation; still, unlike the other alternatives, I will say that it tries to solve the right problem in a novel way. There are still some shortcomings, like being an external dependency, meaning the scripts won’t be as plug-and-play as we would like them to be, but I thought those kinds of issues could be resolved with some package management.
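To give a feel for the API, here is a minimal sketch of the kind of shell-style piping Plumbum allows (the commands and the pattern are only illustrative):

from plumbum import local
from plumbum.cmd import grep, wc

# Commands resolve from your PATH; [] binds arguments without running anything.
ls = local["ls"]

# Pipelines compose with |, just like in a shell.
count_py_files = ls["-la"] | grep[r"\.py$"] | wc["-l"]

# Nothing executes until the pipeline is called.
print(count_py_files())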

So I mixed it with Poetry, and my script project was born 😀. I expect to write more about my experience in this area, as I find it interesting. The power of scripting is helpful, but I wonder if we can improve it to catch up with the good practices I like to follow in software development. At least those who contribute will have a lower WTF/day.

Random Thought: Systems Thinking

The first time I learned about systems thinking was from this talk by Dr. Russell Ackoff:

It was a fascinating point of view: one can't improve the quality of a system as a whole just by improving the performance of a single component. Thinking about this, we can expand the idea to any area of our lives.

As a #SoftwareEngineer, I can relate it to any system I'm working on. The list can go on and on with other areas like education, society, economy, etc.

I was studying the #Covid19 #pandemic topic, and it made me realize (just as Ackoff said) how crucial a holistic view of a system is, and how much harm can be done when decisions are made in isolation.

System.com is a website designed around the same idea: a collaborative platform to visualize how everything in our world depends on everything else.

Originally tweeted by Shahin (@Shahinism) on March 19, 2022.

Tool Gut: Kafka Message Journey

Preface

In this article, I will review the journey a message takes through the Kafka ecosystem. If you need a refresher on the core concepts of Kafka, you can refer to this blog post. Also, for more information on any of the concepts covered in this article, I would suggest referring to either of the following resources:

The Path

In a Kafka cluster, a couple of components need to be initialized before the journey begins. Of course, this is handled by Kafka itself, but they are essential to the process, so let's review them:

Controller

This component is responsible for coordinating partition assignment and data replication. The controller is selected in a process called the Kafka controller election (more details on how the process works can be found in this blog post).

In short, on cluster start, all initialized brokers try to register themselves as the controller. Since, by design, only one controller can exist in the cluster, the first one to succeed becomes the controller, and the rest watch. When the controller goes down for whatever reason, ZooKeeper informs the watchers, and the re-election process is triggered (just like the first one).

Partitions

They are the primary concurrency mechanism in Kafka, enabling producers and consumers to scale horizontally (more info). The controller allocates these partitions across the available brokers based on the configuration defined for each topic with the --partitions argument.

Replicas

They are the primary fault-tolerance mechanism in Kafka. Replicas are created by the controller component based on the configuration defined for each topic using the --replication-factor argument.
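As an illustration, here is a rough sketch of how a topic with a given partition count and replication factor could be created from Python with the kafka-python admin client (the broker address and topic name are made up):

from kafka.admin import KafkaAdminClient, NewTopic

# Assumes a broker reachable at localhost:9092 and the kafka-python package installed.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Roughly the equivalent of:
#   kafka-topics.sh --create --topic orders --partitions 3 --replication-factor 2
admin.create_topics([NewTopic(name="orders", num_partitions=3, replication_factor=2)])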

Lead Replica

Shown as an orange diamond (marked with “L”) in the picture above, the lead replica is the primary replica for each partition, responsible for handling read and write requests. It is selected in a process called the Kafka lead replica election (this blog post has an excellent in-depth overview of the process).

For each partition, Kafka tracks its In-Sync Replicas (or ISR for short): the set of replicas that reflect the latest state. When the lead replica goes down, the next in-sync replica is elected, and if there is no in-sync replica to choose from, Kafka waits (accepting no writes) until one such replica comes back up. There is a configuration called unclean.leader.election.enable which, when enabled, allows Kafka to elect an out-of-sync replica in that situation so that processing can continue.
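These behaviors are tunable per topic. As a sketch (again with kafka-python and the same made-up "orders" topic), something like the following adjusts them after creation:

from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Allow an out-of-sync replica to become leader (trading durability for
# availability), and require two in-sync replicas for writes with acks=all.
admin.alter_configs([
    ConfigResource(
        ConfigResourceType.TOPIC,
        "orders",
        configs={
            "unclean.leader.election.enable": "true",
            "min.insync.replicas": "2",
        },
    )
])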

Journey

Now let’s review the message journey (you can use the image at the start of the article to help visualize the process; a small producer/consumer sketch follows the list):

  1. The producer publishes the message to the cluster.
  2. The target partition is selected: in a round-robin fashion if the message has no key, or based on a hash of the key otherwise.
  3. The message is appended to the end of the lead replica's log in the selected partition, and a unique ID (the offset) is assigned to it.
  4. The replication mechanism also copies the data to the other replicas (as defined by the replication factor).
  5. During the message's lifetime on the topic, any consumer active in a consumer group will consume the message if it hasn't already been consumed by another consumer in that group. Note: this only applies if the consumer uses a consumer group; otherwise, the consumer is responsible for tracking offsets itself.
  6. When the topic's configured retention period is exceeded, the message automatically gets deleted from the topic. Obvious fact: if no consumer has consumed the message by then, it's gone forever!
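Here is a minimal producer/consumer sketch of that journey with kafka-python (broker address, topic, and group names are made up):

from kafka import KafkaProducer, KafkaConsumer

# Step 1: publish a message; the partitioner picks the target partition (step 2).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", key=b"order-42", value=b'{"status": "created"}')
producer.flush()

# Step 5: a consumer inside a consumer group reads the message; its offset is
# committed on the group's behalf, so other members of the group skip it.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing",          # drop this to manage offsets yourself
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.partition, record.offset, record.value)
    break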

Using data to improve software engineering

Whenever I start working on a new codebase, I find it quite cumbersome to build up enough understanding of it to be productive, both in terms of confidence and speed of delivery. This problem gets especially challenging given factors like:

  1. Lack of documentation around the source code or architecture.
  2. Unknown external dependencies relying on the current project to function as a system.
  3. Different developers contributing to the project without unified coding style/conventions.

Over time, I’ve developed some instincts for understanding smaller software projects: I keep a list of questions related to their function and try to answer them one by one by interrogating the git history.

This method, of course, is not only helpful with code written by others; it also helps me understand what I was thinking a couple of weeks ago while developing a piece myself. However, the approach is much harder to apply when the project is larger.
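As a rough illustration of what I mean by interrogating the git history (my quick-and-dirty variant, not a full method), here is a sketch that counts how often each file has changed; the most-changed files are usually the first ones worth asking questions about:

import subprocess
from collections import Counter

# List every file touched by every commit (names only, no diffs).
log = subprocess.run(
    ["git", "log", "--name-only", "--pretty=format:"],
    capture_output=True, text=True, check=True,
).stdout

# Count change frequency per file as a crude hotspot indicator.
changes = Counter(line for line in log.splitlines() if line.strip())
for path, count in changes.most_common(10):
    print(f"{count:5d}  {path}")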

Searching around this topic, I stumbled upon the following video by Adam Tornhill, which describes a method that addresses this problem and has the potential to expand to other areas of software development, like organizational-level concerns.

He also has a book on the same subject called Your Code as a Crime Scene. I enjoyed his presentation and, yes, found the next interesting book I'm going to read 😉

AWS Vault with Yubikey

Preface

It’s been a while since I started using aws-vault to manage my AWS credentials for accessing different client accounts. It has been quite comfortable, regardless of the setup the clients choose for their users. However, there are three areas where I’m trying to improve my own development experience:

  1. Requirement of 2FA authentication
  2. SSO Login through a browser (to get access over shell)
  3. Session duration

Today I managed to resolve the first one (to some extent), and here is how:

2FA

Well, I understand how second-factor authentication contributes to keeping me secure. However, as a developer, it can be quite distracting to find your phone and dig through the authenticator app for the code you need to type before you can continue your work (yeah, call me lazy!).

To solve this issue, I was using the 2fa CLI application, so I could get the code from the comfort of my terminal. But:

  1. 2fa stores the secret key in a plain-text file at ~/.2fa, meaning anyone cat-ing it can regenerate my tokens (more of a sloppy practice than a real security risk, as 2FA is not the only layer of protection).
  2. I need to journey between terminal/tmux windows to get the code into my clipboard (told you already, I’m lazy!).

Yubikey

Thanks to DataChef, we’ve switched to using YubiKey devices for our second-factor authentication, and I was wondering if they could help me address these concerns.

Well, they can. All I need to do is provide the key with the TOTP secret (either re-sync it or extract it from one of the registered devices), and after that, whenever a code is needed, I just touch the key and it’s done:
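For the curious, the same idea can be expressed in Python with Plumbum (from the first section of this post); this is just a sketch, and it assumes the OATH account on the key is named after the AWS profile:

from plumbum import local, FG

def aws_shell(profile, oath_account=None):
    # Ask the YubiKey for the current TOTP code (this is where the touch happens).
    code = local["ykman"]("oath", "code", "--single", oath_account or profile).strip()
    # Hand the code to aws-vault as the MFA token and drop into a subshell.
    local["aws-vault"]["exec", "-t", code, profile, "--", "zsh"] & FG

aws_shell("client-prod")  # the profile name is illustrative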

In Action

asciicast

Gotchas

Using this so far, I’ve faced two small glitches that confused me at first:

  1. Once a token has been used, it’s not possible to reuse it for a different session on AWS. I never ran into this before (tbh, the use case is quite rare), because the context switch used to take me around 30 seconds anyway (the refresh window of the token)!
  2. The av aliases of the zsh-aws-vault plugin (link to my fork) need a bit of refactoring to understand the --prompt parameter. Otherwise, I need to provide AWS_VAULT_PL_MFA=yubikey as an environment variable; you can see why:
avsh () {
    case ${AWS_VAULT_PL_MFA} in
        # MFA token passed explicitly as the second argument.
        (inline) aws-vault exec -t $2 $1 -- zsh ;;
        # Pull the current TOTP code from the YubiKey; the OATH account name
        # defaults to the profile name unless a second argument is given.
        (yubikey) totp=${2:-$1}
            aws-vault exec -t $(ykman oath code --single $totp) $1 -- zsh ;;
        # No MFA handling configured: plain aws-vault exec.
        (*) aws-vault exec $1 -- zsh ;;
    esac
}

Tip: Initialize PySpark session with Delta support

Quick Start

Delta’s documentation on how to enable it with Python is relatively straightforward: you install the delta-spark package using pip, add the Delta-related configuration, wrap the PySpark builder with a call to configure_spark_with_delta_pip, and then .getOrCreate() your session.
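Roughly, that looks like the following (the pattern from Delta's quickstart; only the app name is arbitrary):

import pyspark
from delta import configure_spark_with_delta_pip

builder = (
    pyspark.sql.SparkSession.builder.appName("my-delta-app")
    # Register Delta's SQL extension and catalog implementation.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# Injects the matching delta-core artifact via spark.jars.packages.
spark = configure_spark_with_delta_pip(builder).getOrCreate()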

By looking at its code, you’ll find out that all it does is add a spark.jars.packages entry to your session’s configuration, which in turn puts the required Java module on your classpath.
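In other words, you could get the same result by setting the package coordinates yourself; a sketch (the delta-core version below is only an example and has to match your Spark/Scala build):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("my-delta-app")
    # What configure_spark_with_delta_pip would have injected for you.
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.1")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)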

AWS Glue

This installation approach works on a typical setup; however, when I tried to use it for a script on AWS Glue, I realized the package was not being placed on the classpath, causing a ClassNotFoundException. To make it work, I needed to download the desired delta-core jar file from the Maven repository, upload it to S3, and pass its path to the Glue job as a Dependent JARs path.
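If you define the job programmatically, the same setting goes through the --extra-jars argument, which is what the console's "Dependent JARs path" field maps to. A sketch with boto3 (bucket, role, and versions are placeholders):

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="delta-example",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
    DefaultArguments={
        # Same effect as the "Dependent JARs path" field in the console.
        "--extra-jars": "s3://my-bucket/jars/delta-core_2.12-1.0.1.jar",
    },
)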

PySpark version constraints

At the time of this writing, the Delta package works with PySpark < 3.2. If you try to run it with a newer version, it’ll raise the following exception:

java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.SQLConfHelper

Overall, it’s good to make sure your Spark and PySpark versions match each other and are compatible with the Delta version you use.