Core

The tidy_tools.core module is a functional replacement for the "basic" queries, namely:

  • Selecting columns
  • Filtering data

Motivation

The core module stemmed from a couple of ideas, the most pressing being:

  • The tedious nature of applying a filter across multiple columns
  • The intuitive column selectors module in Polars

For this section, I'll focus only on the first point. Consider the following PySpark code, which filters a DataFrame for rows containing null values in any of three columns.

from pyspark.sql import functions as F

# keep rows where any of A, B, C is NULL or entirely whitespace
result = spark_data.filter(
    (F.col('A').isNull() | F.col('A').rlike(r"^\s*$")) |
    (F.col('B').isNull() | F.col('B').rlike(r"^\s*$")) |
    (F.col('C').isNull() | F.col('C').rlike(r"^\s*$"))
)

Let's read through this query. There are three main aspects:

  • our predicate checks for values that are NULL or entirely whitespace
  • our predicate is applied to multiple columns
  • our predicates are reduced into a single expression

Nearly every PySpark filter boils down to the same recipe: apply a predicate that evaluates to a PySpark expression to one or more columns, then reduce the results into a single expression. Nothing about that recipe requires us to stay within the PySpark API, though.
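
To make the recipe concrete, here is the same query assembled mechanically with nothing beyond the standard library's functools and operator modules (the names list simply stands in for whichever columns we care about):

import functools
import operator

from pyspark.sql import functions as F

names = ['A', 'B', 'C']

# apply the same predicate to every column...
predicates = [F.col(name).isNull() | F.col(name).rlike(r"^\s*$") for name in names]

# ...then fold the expressions into one: pred_A | pred_B | pred_C
query = functools.reduce(operator.or_, predicates)

result = spark_data.filter(query)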

Let's package this recipe into a pair of reusable functions:

import functools
import operator

from pyspark.sql import (
    Column,
    DataFrame,
    functions as F
)

def is_null(column: str | Column) -> Column:
    """Return expression identifying null values."""
    if not isinstance(column, Column):
        column = F.col(column)
    return column.isNull() | column.rlike(r"^\s*$")

def filter_null(
    data: DataFrame,
    *columns: str | Column,
    strict: bool = False,
    invert: bool = False
) -> DataFrame:
    """Filter data for null values."""
    # apply the null predicate to the given columns (all columns by default)
    predicate = map(is_null, columns or data.columns)
    # reduce the predicates into a single expression:
    # strict=True requires *all* columns to match; otherwise *any* match suffices
    comparison_op = operator.and_ if strict else operator.or_
    query = functools.reduce(comparison_op, predicate)
    # evaluate the expression on the data, negating it first if invert=True
    return data.filter(operator.inv(query) if invert else query)
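
As an aside: because is_null returns an ordinary Column expression rather than a DataFrame, it also drops into the native API anywhere an expression is expected. A minimal sketch (the A_is_null column name is purely illustrative):

# flag null values instead of filtering them out
flagged = spark_data.withColumn('A_is_null', is_null('A'))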

Back to the comparison: defining is_null and filter_null looks like even more work than the original query. Luckily, only I have to write it, and only once. Let's see what you'll be writing instead, side by side with the equivalent native PySpark.

# filtering one column for null values
pyspark_result = spark_data.filter(F.col('A').isNull() | F.col('A').rlike(r"^\s*$"))
tidy_result = filter_null(spark_data, 'A')

# filtering on multiple columns for *any* null values
pyspark_result = spark_data.filter(
    (F.col('A').isNull() | F.col('A').rlike(r"^\s*$")) |
    (F.col('B').isNull() | F.col('B').rlike(r"^\s*$")) |
    (F.col('C').isNull() | F.col('C').rlike(r"^\s*$"))
)
tidy_result = filter_null(spark_data, 'A', 'B', 'C')

# filtering on multiple columns for *all* null values
pyspark_result = spark_data.filter(
    (F.col('A').isNull() | F.col('A').rlike(r"^\s*$")) &
    (F.col('B').isNull() | F.col('B').rlike(r"^\s*$")) &
    (F.col('C').isNull() | F.col('C').rlike(r"^\s*$"))
)
tidy_result = filter_null(spark_data, 'A', 'B', 'C', strict=True)

# filtering on multiple columns for *no* null values
pyspark_result = spark_data.filter(
    ~ (
        (F.col('A').isNull() | F.col('A').rlike(r"^\s*$")) |
        (F.col('B').isNull() | F.col('B').rlike(r"^\s*$")) |
        (F.col('C').isNull() | F.col('C').rlike(r"^\s*$"))
    )
)
tidy_result = filter_null(spark_data, 'A', 'B', 'C', invert=True)
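
If you'd like to try these yourself, here is a self-contained run against a toy DataFrame. The data and session setup are illustrative only; None values and whitespace-only strings should both count as null under our predicate.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark_data = spark.createDataFrame(
    [('x', None, 'y'), ('  ', 'b', 'c'), ('p', 'q', 'r')],
    schema=['A', 'B', 'C'],
)

filter_null(spark_data, 'A', 'B', 'C').show()               # first two rows
filter_null(spark_data, 'A', 'B', 'C', invert=True).show()  # only the last row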

All of the filtering helpers reduce these tedious elements of PySpark expressions to intuitive, easy-to-control parameters. This is just one example of how Tidy Tools promotes declarative workflows, letting you focus on the 'what' rather than the 'how'.