Core
The tidy_tools.core module is a functional replacement for the "basic" queries, namely:
- Selecting columns
- Filtering data
Motivation
The core module stemmed from a couple of ideas, the most pressing being:
- The tedious nature of applying a filter across multiple columns
- The intuitive column selectors module in Polars
For this section, I'll focus only on the first example. Consider the following PySpark code, which filters a DataFrame for null values.
from pyspark.sql import functions as F
result = spark_data.filter(
    (F.col('A').isNull() | F.col('A').rlike(r"^\s*$")) |
    (F.col('B').isNull() | F.col('B').rlike(r"^\s*$")) |
    (F.col('C').isNull() | F.col('C').rlike(r"^\s*$"))
)
Let's read through this query. There are three main aspects:
- our predicate checks for values that are NULL or entirely whitespace
- our predicate is applied to multiple columns
- our predicates are reduced into a single expression
The basis of all PySpark filter queries can be expressed the same way: we want to apply a predicate to one or more columns, where each predicate evaluates to a PySpark expression. However, we are not limited to the PySpark API to build that expression.
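As a rough sketch of that idea, before wrapping anything in reusable functions (the NULL-only predicate and the column names below are placeholders, not part of tidy_tools):

import functools
import operator
from pyspark.sql import functions as F

# one predicate expression per column of interest
predicates = [F.col(name).isNull() for name in ('A', 'B', 'C')]
# fold the per-column expressions into a single filter condition:
# operator.or_ for "any column matches", operator.and_ for "all columns match"
query = functools.reduce(operator.or_, predicates)
result = spark_data.filter(query)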
Let's revisit this example using a functional approach:
import functools
import operator
from pyspark.sql import (
    Column,
    DataFrame,
    functions as F
)

def is_null(column: str | Column) -> Column:
    """Return expression identifying null values."""
    if not isinstance(column, Column):
        column = F.col(column)
    return column.isNull() | column.rlike(r"^\s*$")

def filter_null(
    data: DataFrame,
    *columns: str | Column,
    strict: bool = False,
    invert: bool = False
) -> DataFrame:
    """Filter data for null values."""
    # define, apply predicate to multiple columns (by default, all)
    predicate = map(is_null, columns or data.columns)
    # reduce predicates into single expression
    comparison_op = operator.and_ if strict else operator.or_
    query = functools.reduce(comparison_op, predicate)
    # evaluate expression on data
    return data.filter(operator.inv(query) if invert else query)
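A side note: because is_null returns an ordinary Column expression, it also composes directly with native PySpark conditions. For instance, to keep rows where 'A' is populated but 'B' is blank:

# mix the reusable predicate with standard Column operators
result = spark_data.filter(~is_null('A') & is_null('B'))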
Writing filter_null seems like even more work than the original query. Luckily, only I have to write it. Let's see what you'll be writing instead and compare it to the equivalent in native PySpark.
# filtering one column for null values
pyspark_result = spark_data.filter(F.col('A').isNull() | F.col('A').rlike(r"^\s*$"))
tidy_result = filter_null(spark_data, 'A')
# filtering on multiple columns for *any* null values
pyspark_result = spark_data.filter(
    (F.col('A').isNull() | F.col('A').rlike(r"^\s*$")) |
    (F.col('B').isNull() | F.col('B').rlike(r"^\s*$")) |
    (F.col('C').isNull() | F.col('C').rlike(r"^\s*$"))
)
tidy_result = filter_null(spark_data, 'A', 'B', 'C')
# filtering on multiple columns for *all* null values
pyspark_result = spark_data.filter(
    (F.col('A').isNull() | F.col('A').rlike(r"^\s*$")) &
    (F.col('B').isNull() | F.col('B').rlike(r"^\s*$")) &
    (F.col('C').isNull() | F.col('C').rlike(r"^\s*$"))
)
tidy_result = filter_null(spark_data, 'A', 'B', 'C', strict=True)
# filtering on multiple columns for *no* null values
pyspark_result = spark_data.filter(
    ~ (
        (F.col('A').isNull() | F.col('A').rlike(r"^\s*$")) |
        (F.col('B').isNull() | F.col('B').rlike(r"^\s*$")) |
        (F.col('C').isNull() | F.col('C').rlike(r"^\s*$"))
    )
)
tidy_result = filter_null(spark_data, 'A', 'B', 'C', invert=True)
Each of these filtering calls reduces the tedious parts of a PySpark expression to intuitive, easy-to-control parameters. This is just one example of how Tidy Tools promotes declarative workflows, letting you focus on the 'what' rather than the 'how'.
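The same map-and-reduce recipe extends to any column-level predicate. Here is a minimal sketch of that generalization; filter_where and the lambda below are illustrative assumptions, not part of the tidy_tools API:

import functools
import operator
from collections.abc import Callable
from pyspark.sql import (
    Column,
    DataFrame,
    functions as F
)

def filter_where(
    data: DataFrame,
    predicate: Callable[[Column], Column],
    *columns: str | Column,
    strict: bool = False,
    invert: bool = False
) -> DataFrame:
    """Apply an arbitrary column predicate across columns and reduce to one filter."""
    cols = [col if isinstance(col, Column) else F.col(col) for col in (columns or data.columns)]
    # same recipe as filter_null: map the predicate, then reduce with and_/or_
    query = functools.reduce(operator.and_ if strict else operator.or_, map(predicate, cols))
    return data.filter(operator.inv(query) if invert else query)

# e.g. keep rows where any of the listed columns is negative
negative_rows = filter_where(spark_data, lambda column: column < 0, 'A', 'B')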