Filter
filter_nulls
filter_nulls(self: DataFrame, *columns: ColumnReference, strict: bool = False, invert: bool = False) -> DataFrame
Keep all observations that represent null across any/all column(s).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `self` | `DataFrame` | Object inheriting from PySpark DataFrame. | *required* |
| `*columns` | `ColumnReference` | Arbitrary number of column references. All columns must exist in the DataFrame. | `()` |
| `strict` | `bool` | Should the condition be true for all column(s)? | `False` |
| `invert` | `bool` | Should observations that meet the condition be kept (`False`) or removed (`True`)? | `False` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | Observations that are null across any/all column(s). |
Source code in src/tidy_tools/core/filter.py
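The interplay of `strict` and `invert` is easiest to see in a pure-Python sketch of the presumed semantics. The real function builds a lazy PySpark filter over a DataFrame; the rows-of-dicts model below is only an illustration of how the any/all and keep/remove switches combine:

```python
def filter_nulls(rows, *columns, strict=False, invert=False):
    """Sketch: keep rows that are null in any (or, with strict=True,
    all) of the named columns; invert=True removes those rows instead."""
    combine = all if strict else any
    met = lambda row: combine(row[c] is None for c in columns)
    # A row is kept when its match status differs from `invert`.
    return [row for row in rows if met(row) != invert]

rows = [
    {"a": None, "b": 1},     # null in one column
    {"a": None, "b": None},  # null in both columns
    {"a": 2, "b": 3},        # no nulls
]

filter_nulls(rows, "a", "b")               # first two rows (null anywhere)
filter_nulls(rows, "a", "b", strict=True)  # second row only (null everywhere)
filter_nulls(rows, "a", "b", invert=True)  # third row only (drop any-null rows)
```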
filter_substring
filter_substring(self: DataFrame, *columns: ColumnReference, substring: str, strict: bool = False, invert: bool = False) -> DataFrame
Keep all observations that contain the substring across any/all column(s).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `self` | `DataFrame` | Object inheriting from PySpark DataFrame. | *required* |
| `*columns` | `ColumnReference` | Arbitrary number of column references. All columns must exist in the DataFrame. | `()` |
| `substring` | `str` | String expression to check. | *required* |
| `strict` | `bool` | Should the condition be true for all column(s)? | `False` |
| `invert` | `bool` | Should observations that meet the condition be kept (`False`) or removed (`True`)? | `False` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | Observations that contain the substring across any/all column(s). |
Source code in src/tidy_tools/core/filter.py
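A minimal sketch of the presumed behavior, using plain Python containment in place of the PySpark column expression the real function builds:

```python
def filter_substring(rows, *columns, substring, strict=False, invert=False):
    # Sketch: substring containment per column, combined with any/all,
    # optionally inverted to remove matches instead of keeping them.
    combine = all if strict else any
    return [r for r in rows
            if combine(substring in str(r[c]) for c in columns) != invert]

people = [{"first": "Mara", "last": "Marsh"}, {"first": "Jon", "last": "Snow"}]
filter_substring(people, "first", "last", substring="Mar", strict=True)  # Mara Marsh
filter_substring(people, "first", "last", substring="o")                 # Jon Snow
```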
filter_regex
filter_regex(self: DataFrame, *columns: ColumnReference, pattern: str, strict: bool = False, invert: bool = False) -> DataFrame
Keep all observations that match the regular expression across any/all column(s).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `self` | `DataFrame` | Object inheriting from PySpark DataFrame. | *required* |
| `*columns` | `ColumnReference` | Arbitrary number of column references. All columns must exist in the DataFrame. | `()` |
| `pattern` | `str` | Regular expression. Must be compiled according to | *required* |
| `strict` | `bool` | Should the condition be true for all column(s)? | `False` |
| `invert` | `bool` | Should observations that meet the condition be kept (`False`) or removed (`True`)? | `False` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | Observations that match the regular expression across any/all column(s). |
Source code in src/tidy_tools/core/filter.py
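The same any/all combination applies here, sketched with Python's `re` module. Note this is an assumption for illustration: Spark's own regex matching (e.g. `rlike`) follows Java regex syntax, so dialect details may differ from `re` in the real function:

```python
import re

def filter_regex(rows, *columns, pattern, strict=False, invert=False):
    # Sketch: an unanchored search per column, combined with any/all.
    rx = re.compile(pattern)
    combine = all if strict else any
    return [r for r in rows
            if combine(rx.search(str(r[c])) is not None for c in columns) != invert]

codes = [{"id": "AB-12"}, {"id": "abc"}]
filter_regex(codes, "id", pattern=r"^[A-Z]{2}-\d+$")               # AB-12 only
filter_regex(codes, "id", pattern=r"^[A-Z]{2}-\d+$", invert=True)  # abc only
```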
filter_elements
filter_elements(self: DataFrame, *columns: ColumnReference, elements: Sequence, strict: bool = False, invert: bool = False) -> DataFrame
Keep all observations that exist within elements across any/all column(s).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `self` | `DataFrame` | Object inheriting from PySpark DataFrame. | *required* |
| `*columns` | `ColumnReference` | Arbitrary number of column references. All columns must exist in the DataFrame. | `()` |
| `elements` | `Sequence` | Collection of items expected to exist in any/all column(s). | *required* |
| `strict` | `bool` | Should the condition be true for all column(s)? | `False` |
| `invert` | `bool` | Should observations that meet the condition be kept (`False`) or removed (`True`)? | `False` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | Observations that exist within elements across any/all column(s). |
Source code in src/tidy_tools/core/filter.py
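A sketch of the presumed membership test, analogous to SQL's `IN` (PySpark's `Column.isin`); with `invert=True` it behaves like `NOT IN`:

```python
def filter_elements(rows, *columns, elements, strict=False, invert=False):
    # Sketch: per-column membership in the collection, combined with any/all.
    combine = all if strict else any
    return [r for r in rows
            if combine(r[c] in elements for c in columns) != invert]

orders = [{"status": "open"}, {"status": "closed"}, {"status": "void"}]
filter_elements(orders, "status", elements=("open", "closed"))               # first two
filter_elements(orders, "status", elements=("open", "closed"), invert=True)  # void only
```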
filter_range
filter_range(self: DataFrame, *columns: ColumnReference, boundaries: Sequence[Any], strict: bool = False, invert: bool = False) -> DataFrame
Keep all observations that exist within range across any/all column(s).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `self` | `DataFrame` | Object inheriting from PySpark DataFrame. | *required* |
| `*columns` | `ColumnReference` | Arbitrary number of column references. All columns must exist in the DataFrame. | `()` |
| `boundaries` | `Sequence[Any]` | Bounds of range. Must be of the same type and in ascending order. | *required* |
| `strict` | `bool` | Should the condition be true for all column(s)? | `False` |
| `invert` | `bool` | Should observations that meet the condition be kept (`False`) or removed (`True`)? | `False` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | Observations that exist within the range across any/all column(s). |

Raises:

| Type | Description |
|---|---|
| `AssertionError` | Raised if either condition on `boundaries` is not met: the bounds are not of the same type, or they are not in ascending order. |
Source code in src/tidy_tools/core/filter.py
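A sketch of the presumed semantics, assuming the bounds are inclusive and mirroring the documented `AssertionError` conditions (inclusivity is an assumption not stated in the signature):

```python
def filter_range(rows, *columns, boundaries, strict=False, invert=False):
    lower, upper = boundaries
    # Mirrors the documented AssertionError conditions on `boundaries`.
    assert type(lower) is type(upper), "boundaries must be of the same type"
    assert lower <= upper, "boundaries must be in ascending order"
    combine = all if strict else any
    return [r for r in rows
            if combine(lower <= r[c] <= upper for c in columns) != invert]

scores = [{"math": 55, "read": 70}, {"math": 90, "read": 95}]
filter_range(scores, "math", "read", boundaries=(60, 100), strict=True)  # second row only
```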
filter_custom
filter_custom(self: DataFrame, *columns: ColumnReference, predicate: Callable, strict: bool = False, invert: bool = False, **kwargs: dict) -> DataFrame
Keep all observations that satisfy the predicate across any/all column(s).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `self` | `DataFrame` | Object inheriting from PySpark DataFrame. | *required* |
| `*columns` | `ColumnReference` | Arbitrary number of column references. All columns must exist in the DataFrame. | `()` |
| `predicate` | `Callable` | Function returning a PySpark Column for the filtering expression. | *required* |
| `strict` | `bool` | Should the condition be true for all column(s)? | `False` |
| `invert` | `bool` | Should observations that meet the condition be kept (`False`) or removed (`True`)? | `False` |
| `**kwargs` | `dict` | Additional options to pass to `predicate`. | `{}` |

Returns:

| Type | Description |
|---|---|
| `DataFrame` | Observations that satisfy the predicate across any/all column(s). |