Apheris Statistics Reference

apheris_stats.simple_stats

corr(datasets, session, column_names, global_means=None, group_by=None, handle_outliers=PrivacyHandlingMethod.RAISE)

Computes the federated Pearson correlation matrix for a given set of columns.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): datasets that the computation shall be run on.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- column_names (Iterable[str], required): set of columns.
- global_means (Dict[Union[str, Tuple], Union[int, float, Number]], default None): means over all datasets for the given column names. If global_means is None, it is determined automatically in a separate pre-run.
- group_by (Union[Hashable, Iterable[Hashable]], default None): mapping, label, or list of labels used to group before aggregation.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical result as a pandas DataFrame with the correlation matrix of the specified columns.

Example
corr_matrix = simple_stats.corr(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_names=['age', 'length of covid infection'],
    global_means={'age': 50, 'length of covid infection': 10},
    session=session
)

cov(datasets, session, column_names, global_means=None, group_by=None, handle_outliers=PrivacyHandlingMethod.RAISE)

Computes the federated covariance matrix for a given set of columns.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): datasets that the computation shall be run on.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- column_names (Iterable[str], required): set of columns.
- global_means (Dict[Union[str, Tuple], Union[int, float, Number]], default None): means over all datasets for the given column names. If global_means is None, it is determined automatically in a separate pre-run.
- group_by (Union[Hashable, Iterable[Hashable]], default None): mapping, label, or list of labels used to group before aggregation.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical result as a pandas DataFrame with the covariance matrix of the specified columns.

Example
cov_matrix = simple_stats.cov(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_names=['age', 'length of covid infection'],
    global_means={'age': 50, 'length of covid infection': 10},
    session=session
)

count_column_value(datasets, session, column_name, value, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)

Returns how often value appears in a certain column of the datasets.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): list of FederatedDataFrames that define the preprocessing of the individual datasets.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- column_name (str, required): name of the column over which the function shall be calculated.
- value (required): the value to be counted.
- aggregation (bool, default True): defines whether the counts should be aggregated over all datasets or returned per dataset.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.ROUND: only valid for counts; rounds to the privacy bound or 0.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical result
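
Example (a minimal sketch; it reuses the datasets and session from the corr example above, and the counted value is illustrative):

counts = simple_stats.count_column_value(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='age',
    value=50,
    session=session
)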

count_group_by(datasets, session, column_name, handle_outliers=PrivacyHandlingMethod.RAISE)

Counts the occurrences of the categorical values in a table column.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): list of FederatedDataFrames that define the preprocessing of the individual datasets.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- column_name (str, required): name of the column for which the statistical query shall be computed.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.ROUND: only valid for counts; rounds to the privacy bound or 0.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical result. Its result contains a pandas DataFrame with the counts summed over the datasets.
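
Example (a minimal sketch; datasets and session as in the corr example above):

age_counts = simple_stats.count_group_by(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='age',
    session=session
)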

count_null(datasets, session, column_name, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)

Returns the number of occurrences of NA values (such as None or numpy.NaN) and the number of non-NA values in the datasets. NA values are counted based on pandas' isna() and notna() functions.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): list of FederatedDataFrames that define the preprocessing of the individual datasets.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- column_name (str, required): name of the column over which the NA values shall be counted.
- group_by (Union[Hashable, Iterable[Hashable]], default None): optional; mapping, label, or list of labels used to group before aggregation.
- aggregation (bool, default True): defines whether the counts should be aggregated over all datasets or returned per dataset.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.ROUND: only valid for counts; rounds to the privacy bound or 0.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical result
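
Example (a minimal sketch; datasets, session, and column name as in the corr example above):

na_counts = simple_stats.count_null(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='length of covid infection',
    session=session
)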

describe(datasets, session, handle_outliers=PrivacyHandlingMethod.RAISE)

Creates a description of the datasets.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): list of FederatedDataFrames that define the preprocessing of the individual datasets.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical description of the datasets
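
Example (a minimal sketch; datasets and session as in the corr example above):

description = simple_stats.describe(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    session=session
)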

histogram(datasets, session, column_name, bins, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)

Returns a histogram for the given datasets.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): list of FederatedDataFrames that define the preprocessing of the individual datasets.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- column_name (str, required): name of the column for which the histogram shall be generated.
- bins (int or sequence of scalars, required): if bins is an int, it defines the number of equal-width bins. If it is a sequence, its content defines the bin edges.
- group_by (Union[Hashable, Iterable[Hashable]], default None): mapping, label, or list of labels used to group before aggregation.
- aggregation (bool, default True): if True, the histogram is aggregated over all datasets; otherwise, one histogram is returned per dataset. Aggregation is only feasible if bins is an Iterable that defines the bin edges.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.ROUND: only valid for counts; rounds to the privacy bound or 0.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical result
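
Example (a minimal sketch; datasets and session as in the corr example above; explicit bin edges are used, since aggregation over datasets requires them, and the edges are illustrative):

hist = simple_stats.histogram(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='age',
    bins=[0, 20, 40, 60, 80, 100],
    session=session
)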

iqr_column(datasets, session, column_name, global_min_max, group_by=None, n_bins=100, handle_outliers=PrivacyHandlingMethod.RAISE)

Approximates the interquartile range (IQR) over multiple datasets. Internally, a histogram with a user-defined number of bins and user-defined upper and lower bounds is first created over all datasets. Based on this histogram, the IQR is approximated.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): list of FederatedDataFrames that define the preprocessing of the individual datasets.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- column_name (str, required): name of the column for which the histogram shall be generated.
- global_min_max (Iterable[float], required): a list that contains the global minimum and maximum values of the combined datasets. These need to be computed separately, for example using min_column/max_column combined with min_aggregation/max_aggregation.
- group_by (Union[Hashable, Iterable[Hashable]], default None): mapping, label, or list of labels used to group before aggregation.
- n_bins (int, default 100): number of bins for the internal histogram.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical result
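
Example (a minimal sketch; datasets and session as in the corr example above; the global bounds are assumed to be known here and would otherwise be computed separately, as described for global_min_max):

iqr = simple_stats.iqr_column(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='age',
    global_min_max=[0, 100],
    session=session
)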

kaplan_meier(datasets, session, duration_column_name, event_column_name, group_by=None, plot=False, stepsize=1, handle_outliers=PrivacyHandlingMethod.RAISE)

Creates a Kaplan-Meier survival statistic.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): list of FederatedDataFrames that define the preprocessing of the individual datasets.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- duration_column_name (str, required): duration column for the survival function.
- event_column_name (str, required): event column, indicating death.
- group_by (str, default None): grouping column.
- plot (bool, default False): if True, results are displayed using pd.DataFrame.plot().
- stepsize (int, default 1): histogram bin size.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical result
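
Example (a minimal sketch; datasets and session as in the corr example above; the duration and event column names are illustrative):

km = simple_stats.kaplan_meier(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    duration_column_name='length of covid infection',
    event_column_name='death',
    session=session
)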

max_column(datasets, session, column_name, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)

Returns the max over a specified column.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): list of FederatedDataFrames that define the preprocessing of the individual datasets.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- column_name (str, required): name of the column over which the max shall be calculated.
- group_by (Union[Hashable, Iterable[Hashable]], default None): optional; mapping, label, or list of labels used to group before aggregation.
- aggregation (bool, default True): defines whether the max should be aggregated over all datasets or returned per dataset.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical result
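
Example (a minimal sketch; datasets and session as in the corr example above):

max_age = simple_stats.max_column(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='age',
    session=session
)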

mean_column(datasets, session, column_name, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)

Returns the mean over a specified column.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): list of FederatedDataFrames that define the preprocessing of the individual datasets.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- column_name (str, required): name of the column over which the mean shall be calculated.
- group_by (Union[Hashable, Iterable[Hashable]], default None): optional; mapping, label, or list of labels used to group before aggregation.
- aggregation (bool, default True): defines whether the mean should be aggregated over all datasets or returned per dataset.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical result
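
Example (a minimal sketch; datasets and session as in the corr example above):

mean_age = simple_stats.mean_column(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='age',
    session=session
)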

median_with_confidence_intervals_column(datasets, session, column_name, global_min_max, group_by=None, n_bins=100, handle_outliers=PrivacyHandlingMethod.RAISE)

Approximates the median and the 95% confidence interval over multiple datasets. Internally, a histogram with a user-defined number of bins and user-defined upper and lower bounds is first created over all datasets. Based on this histogram, the median and the confidence interval are approximated.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): list of FederatedDataFrames that define the preprocessing of the individual datasets.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- column_name (str, required): name of the column for which the histogram shall be generated.
- global_min_max (Iterable[float], required): a list that contains the global minimum and maximum values of the combined datasets. These need to be computed separately, for example using min_column/max_column combined with min_aggregation/max_aggregation.
- group_by (Union[Hashable, Iterable[Hashable]], default None): mapping, label, or list of labels used to group before aggregation.
- n_bins (int, default 100): number of bins for the internal histogram.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical result. If no group_by argument is used, its result contains a numpy.ndarray with the approximate median and the lower and upper bounds of the 95% confidence interval. If a group_by argument is used, its result contains a tuple of three dicts (approximate median, lower and upper bounds of the 95% confidence interval).
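
Example (a minimal sketch; datasets and session as in the corr example above; the global bounds are assumed to be known):

median_ci = simple_stats.median_with_confidence_intervals_column(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='age',
    global_min_max=[0, 100],
    session=session
)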

median_with_quartiles(datasets, session, column_name, global_min_max, group_by=None, n_bins=100, handle_outliers=PrivacyHandlingMethod.RAISE)

Approximates the median and the 1st and 3rd quartiles over multiple datasets. Internally, a histogram with a user-defined number of bins and user-defined upper and lower bounds is first created over all datasets. Based on this histogram, the above-mentioned values are approximated.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): list of FederatedDataFrames that define the preprocessing of the individual datasets.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- column_name (str, required): name of the column for which the statistical query shall be computed.
- global_min_max (List[float], required): a list that contains the global minimum and maximum values of the combined datasets. These need to be computed separately, for example using min_column/max_column combined with min_aggregation/max_aggregation.
- group_by (Union[Hashable, Iterable[Hashable]], default None): mapping, label, or list of labels used to group before aggregation.
- n_bins (int, default 100): number of bins for the internal histogram.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical result. Its result contains a tuple with the 1st quartile, the median, and the 3rd quartile.
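
Example (a minimal sketch; datasets and session as in the corr example above; the global bounds are assumed to be known):

quartiles = simple_stats.median_with_quartiles(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='age',
    global_min_max=[0, 100],
    session=session
)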

min_column(datasets, session, column_name, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)

Returns the min over a specified column.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): list of FederatedDataFrames that define the preprocessing of the individual datasets.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- column_name (str, required): name of the column over which the min shall be calculated.
- group_by (Union[Hashable, Iterable[Hashable]], default None): optional; mapping, label, or list of labels used to group before aggregation.
- aggregation (bool, default True): defines whether the min should be aggregated over all datasets or returned per dataset.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical result
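
Example (a minimal sketch; datasets and session as in the corr example above):

min_age = simple_stats.min_column(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='age',
    session=session
)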

pca_transformation(datasets, session, column_names, n_components, handle_outliers=PrivacyHandlingMethod.RAISE.value)

Computes the principal component transformation matrix of a given list of datasets.

Parameters:

- datasets (required): datasets that the computation shall be run on.
- session (required): for remote runs, use a SimpleStatsSession that refers to a cluster.
- column_names (required): set of columns.
- n_components (required): number of components to keep.
- handle_outliers (default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- transformation matrix as a pandas DataFrame.
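
Example (a minimal sketch; datasets, session, and column names as in the corr example above; n_components is illustrative):

transformation = simple_stats.pca_transformation(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_names=['age', 'length of covid infection'],
    n_components=2,
    session=session
)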

shape(datasets, session, handle_outliers=PrivacyHandlingMethod.RAISE)

Returns the shape of the datasets.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): list of FederatedDataFrames that define the preprocessing of the individual datasets.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.ROUND: only valid for counts; rounds to the privacy bound or 0.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical result
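
Example (a minimal sketch; datasets and session as in the corr example above):

shapes = simple_stats.shape(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    session=session
)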

squared_errors_by_column(datasets, session, column_name, global_mean, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)

Returns the sum of the squared differences from global_mean over a specified column.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): list of FederatedDataFrames that define the preprocessing of the individual datasets.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- column_name (str, required): name of the column over which the operation shall be calculated.
- global_mean (float, required): the deviation of each element from this value is squared and then summed. The mean can be computed via apheris_stats.simple_stats.mean_column.
- group_by (Union[Hashable, Iterable[Hashable]], default None): mapping, label, or list of labels used to group before aggregation.
- aggregation (bool, default True): defines whether the operation should be aggregated over all datasets or returned per dataset.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical result
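
Example (a minimal sketch; datasets and session as in the corr example above; the global mean is illustrative and can be computed with mean_column):

sq_err = simple_stats.squared_errors_by_column(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='age',
    global_mean=50,
    session=session
)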

sum_column(datasets, session, column_name, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)

Returns the sum over a specified column.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): list of FederatedDataFrames that define the preprocessing of the individual datasets.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- column_name (str, required): name of the column over which the sum shall be calculated.
- group_by (Union[Hashable, Iterable[Hashable]], default None): optional; mapping, label, or list of labels used to group before aggregation.
- aggregation (bool, default True): defines whether the sum should be aggregated over all datasets or returned per dataset.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical result
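
Example (a minimal sketch; datasets and session as in the corr example above):

total_age = simple_stats.sum_column(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='age',
    session=session
)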

tableone(datasets, session, numerical_columns=None, numerical_nonnormal_columns=None, categorical_columns=None, group_by=None, n_bins=100, handle_outliers=PrivacyHandlingMethod.RAISE)

Creates an overview statistic.

Parameters:

- datasets (Union[Iterable[FederatedDataFrame], FederatedDataFrame], required): list of FederatedDataFrames that define the preprocessing of the individual datasets.
- session (Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession], required): for remote runs, use a SimpleStatsSession that refers to a cluster of Compute Clients and an Aggregator. To simulate a cluster locally, use a LocalDummySimpleStatsSession or LocalDebugSimpleStatsSession.
- numerical_columns (Iterable[str], default None): names of columns for which the mean and standard deviation shall be calculated.
- numerical_nonnormal_columns (Iterable[str], default None): names of columns for which the median and the 1st and 3rd quartiles shall be calculated. These values are approximated via a histogram.
- categorical_columns (Iterable[str], default None): names of categorical columns whose value counts shall be counted.
- group_by (Union[Hashable, Iterable[Hashable]], default None): mapping, label, or list of labels used to group before aggregation.
- n_bins (int, default 100): number of bins of the histogram that is used to approximate the median and the 1st and 3rd quartiles of columns in numerical_nonnormal_columns.
- handle_outliers (Union[PrivacyHandlingMethod, str], default PrivacyHandlingMethod.RAISE): specifies the handling method in case of bounded privacy violations. The implemented options are:
    - PrivacyHandlingMethod.FILTER: filters out all groups that violate the privacy bound.
    - PrivacyHandlingMethod.FILTER_DATASET: removes the entire dataset from the federated computation in case of privacy violations.
    - PrivacyHandlingMethod.RAISE: raises a PrivacyException if the privacy bound was violated.

Returns:

- statistical result. Its result contains a pandas DataFrame with the tableone statistics over the datasets.
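
Example (a minimal sketch; datasets, session, and column names as in the corr example above):

overview = simple_stats.tableone(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    numerical_columns=['age'],
    numerical_nonnormal_columns=['length of covid infection'],
    session=session
)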

apheris_stats.simple_stats.exceptions

ObjectNotFound

Bases: ApherisException

Raised when trying to access an object that does not exist.

InsufficientPermissions

Bases: Exception

Raised when an operation does not have sufficient permissions to be performed.

PrivacyException

Bases: Exception

Raised when a privacy mechanism required by the data provider(s) fails to be applied, is violated, or is incompatible with the user-chosen settings.

RestrictedPreprocessingViolation

Bases: PrivacyException

Raised when a prohibited command is requested to be executed due to restricted preprocessing.

apheris_stats.simple_stats.util

FederatedDataFrame

Object that simplifies preprocessing by providing a pandas-like interface to preprocess tabular data. The FederatedDataFrame contains preprocessing transformations that are to be applied to a remote dataset. Which dataset it operates on is specified in the constructor.

__init__(data_source, read_format=None, filename_in_zip=None)

Parameters:

- data_source (Union[str, RemoteData], required): remote id, RemoteData object, or path to a data file or graph JSON file.
- read_format (Union[str, InputFormat, None], default None): format of the data source.
- filename_in_zip (Union[str, None], default None): used for the ZIP format to identify which file out of the ZIP to take. The argument is optional, but must be specified for the ZIP format. If read_format is ZIP, the value of this argument is used to read one CSV.

Example:

  • via dataset id: assume your dataset id is 'data-cloudnode':
        df = FederatedDataFrame('data-cloudnode')
    
  • optional: for remote data containing multiple files, choose which file to read:
        df = FederatedDataFrame('data-cloudnode', filename_in_zip='patients.csv')
    

loc: '_LocIndexer' property

Use pandas .loc notation to access the data.
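
Example (a minimal sketch; it assumes dummy data with an age column, as in the method examples below):

df = FederatedDataFrame('data_cloudnode')
adults = df.loc[df["age"] >= 18]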

__setitem__(index, value)

Manipulates values of columns or rows of a FederatedDataFrame. This operation does not return a copy of the FederatedDataFrame object; instead, it is implemented in place. That means the computation graph within the FederatedDataFrame object is modified on the object level. This function is not available in a fully privacy-preserving mode.

Example:

Assume the dummy data for 'data_cloudnode' looks like this:

```
    patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   93      83
```

```
df = FederatedDataFrame('data_cloudnode')
df["new_column"] = df["weight"]
df.preprocess_on_dummy()
```

results in

```
   patient_id  age  weight  new_column
0           1   77      55          55
1           2   88      60          60
2           3   93      83          83
```

Parameters:

- index (Union[str, int], required): column index or name, or a boolean-valued FederatedDataFrame as an index mask.
- value (Union[ALL_TYPES], required): a constant value or a single-column FederatedDataFrame.

__getitem__(key)

Example

Assume the dummy data for 'data_cloudnode' looks like this:

    patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   93      83

df = FederatedDataFrame('data_cloudnode')
df = df["weight"]
df.preprocess_on_dummy()

results in

   weight
0    55
1    60
2    83

Parameters:

- key (required): column index or name, or a boolean-valued FederatedDataFrame as an index mask.

Returns:

- FederatedDataFrame: new instance of the current object with updated graph. If the key was a column identifier, the computation graph results in a single-column FederatedDataFrame. If the key was an index mask, the resulting computation graph will produce a filtered FederatedDataFrame.

add(left, right, result=None)

Privacy-preserving addition: to a column (left), add another column or a constant value (right) and store the result in result. Adding arbitrary iterables would allow singling-out attacks and is therefore disallowed.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

    patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   93      83

df = FederatedDataFrame('data_cloudnode')
df2 = df.add("weight", 100, "new_weight")
df2.preprocess_on_dummy()

returns

   patient_id  age  weight  new_weight
0           1   77      55         155
1           2   88      60         160
2           3   93      83         183

df.add("weight", "age", "new_weight")

returns

   patient_id  age  weight  new_weight
0           1   77      55         132
1           2   88      60         148
2           3   93      83         176

Parameters:

- left (ColumnIdentifier, required): a column identifier.
- right (required): a column identifier or constant value.
- result (Optional[ColumnIdentifier], default None): name for the new result column; can be set to None to overwrite the left column.

Returns:

- FederatedDataFrame: new instance of the current object with updated graph.

neg(column_to_negate, result_column=None)

Privacy-preserving negation: negate column column_to_negate and store the result in column result_column, or leave result_column as None and overwrite column_to_negate. Using this form of negation removes the need for setitem functionality which is not privacy-preserving.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

    patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   93      83

df = FederatedDataFrame('data_cloudnode')
df = df.neg("age", "neg_age")
df.preprocess_on_dummy()

returns

   patient_id  age  weight  neg_age
0           1   77      55      -77
1           2   88      60      -88
2           3   93      83      -93

Parameters:

- column_to_negate (ColumnIdentifier, required): column identifier.
- result_column (Optional[ColumnIdentifier], default None): optional name for the new column; if not specified, column_to_negate is overwritten.

Returns:

- FederatedDataFrame: new instance of the current object with updated graph.

sub(left, right, result)

Privacy-preserving subtraction: computes left - right and stores the result in the column result. Both left and right can be column names, or one of them a column name and the other a constant. Arbitrary subtraction with iterables would allow singling-out attacks and is therefore disallowed.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

    patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   93      83

df = FederatedDataFrame('data_cloudnode')
df2 = df.sub("weight", 100, "new_weight")
df2.preprocess_on_dummy()

returns

   patient_id  age  weight  new_weight
0           1   77      55         -45
1           2   88      60         -40
2           3   93      83         -17

df.sub("weight", "age", "new_weight")

returns

   patient_id  age  weight  new_weight
0           1   77      55         -22
1           2   88      60         -28
2           3   93      83         -10

Parameters:

- left (required): column identifier or constant.
- right (required): column identifier or constant.
- result (ColumnIdentifier, required): column name for the new result column.

Returns:

- FederatedDataFrame: new instance of the current object with updated graph.

mult(left, right, result=None)

Privacy-preserving multiplication: multiply a column (left) by another column or a constant value (right) and store the result in result. Multiplying arbitrary iterables would allow singling-out attacks and is therefore disallowed.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

    patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   93      83

df = FederatedDataFrame('data_cloudnode')
df2 = df.mult("weight", 2, "new_weight")
df2.preprocess_on_dummy()

returns

    patient_id  age  weight  new_weight
0           1   77      55         110
1           2   88      60         120
2           3   93      83         166

df.mult("weight", "patient_id", "new_weight")

returns

   patient_id  age  weight  new_weight
0           1   77      55          55
1           2   88      60         120
2           3   93      83         249

Parameters:

- left (ColumnIdentifier, required): a column identifier.
- right (required): a column identifier or constant value.
- result (Optional[ColumnIdentifier], default None): name for the new result column; can be set to None to overwrite the left column.

Returns:

- FederatedDataFrame: new instance of the current object with updated graph.

truediv(left, right, result)

Privacy-preserving division: divide a column or constant (left) by another column or constant (right) and store the result in result. Dividing by arbitrary iterables would allow for singling out attacks and is therefore disallowed.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

    patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   93      83

df = FederatedDataFrame('data_cloudnode')
df2 = df.truediv("weight", 2, "new_weight")
df2.preprocess_on_dummy()

returns

    patient_id  age  weight  new_weight
0           1   77      55        27.5
1           2   88      60        30.0
2           3   93      83        41.5

df.truediv("weight", "patient_id", "new_weight")

returns

   patient_id  age  weight  new_weight
0           1   77      55   55.000000
1           2   88      60   30.000000
2           3   93      83   27.666667

Parameters:

- left (ColumnIdentifier, required): a column identifier.
- right (required): a column identifier or constant value.
- result (ColumnIdentifier, required): name for the new result column.

Returns:

- FederatedDataFrame: new instance of the current object with updated graph.

invert(column_to_invert, result_column=None)

Privacy-preserving inversion (~ operator): invert column column_to_invert and store the result in column result_column, or leave result_column as None and overwrite column_to_invert. Using this form of inversion removes the need for setitem functionality, which is not privacy-preserving.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  weight  death
0           1   77    55.0   True
1           2   88    60.0  False
2           3   23     NaN   True

df = FederatedDataFrame('data_cloudnode')
df = df.invert("death", "survival")
df.preprocess_on_dummy()

returns

   patient_id  age  weight  death  survival
0           1   77    55.0   True     False
1           2   88    60.0  False      True
2           3   23     NaN   True     False

Parameters:

- column_to_invert (ColumnIdentifier, required): column identifier.
- result_column (Optional[ColumnIdentifier], default None): optional name for the new column; if not specified, column_to_invert is overwritten.

Returns:

- FederatedDataFrame: new instance of the current object with updated graph.

__lt__(other)

Compare a single-column FederatedDataFrame with a constant or another single-column FederatedDataFrame using the operator '<'.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   40      50

df = FederatedDataFrame('data_cloudnode')
df = df["age"] < df["weight"]
df.preprocess_on_dummy()

returns

0    False
1    False
2     True

Parameters:

- other (required): FederatedDataFrame or value to compare with.

Returns:

- FederatedDataFrame: single-column FederatedDataFrame with computation graph resulting in a boolean Series.

__gt__(other)

Compare a single-column FederatedDataFrame with a constant or another single-column FederatedDataFrame using the operator '>'.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   40      50

df = FederatedDataFrame('data_cloudnode')
df = df["age"] > df["weight"]
df.preprocess_on_dummy()

returns

0     True
1     True
2    False

Parameters:

- other (required): FederatedDataFrame or value to compare with.

Returns:

- FederatedDataFrame: single-column FederatedDataFrame with computation graph resulting in a boolean Series.

__eq__(other)

Compare a single-column FederatedDataFrame with a constant or another single-column FederatedDataFrame using the operator '=='.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   40      40

df = FederatedDataFrame('data_cloudnode')
df = df["age"] == df["weight"]
df.preprocess_on_dummy()

returns

0    False
1    False
2     True

Parameters:

- other (required): FederatedDataFrame or value to compare with.

Returns:

- FederatedDataFrame: single-column FederatedDataFrame with computation graph resulting in a boolean Series.

__le__(other)

Compare a single-column FederatedDataFrame with a constant or another single-column FederatedDataFrame using the operator '<='.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   40      40

df = FederatedDataFrame('data_cloudnode')
df = df["age"] <= df["weight"]
df.preprocess_on_dummy()

returns

0    False
1    False
2     True

Parameters:

- other (required): FederatedDataFrame or value to compare with.

Returns:

- FederatedDataFrame: single-column FederatedDataFrame with computation graph resulting in a boolean Series.

__ge__(other)

Compare a single-column FederatedDataFrame with a constant or another single-column FederatedDataFrame using the operator '>='.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   40      40

df = FederatedDataFrame('data_cloudnode')
df = df["age"] >= df["weight"]
df.preprocess_on_dummy()

returns

0    True
1    True
2    True

Parameters:

- other (required): FederatedDataFrame or value to compare with.

Returns:

- FederatedDataFrame: single-column FederatedDataFrame with computation graph resulting in a boolean Series.

__ne__(other)

Compare a single-column FederatedDataFrame with a constant or another single-column FederatedDataFrame using the operator '!='.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   40      40

df = FederatedDataFrame('data_cloudnode')
df = df["age"] != df["weight"]
df.preprocess_on_dummy()

returns

0     True
1     True
2    False

Parameters:

- other (required): FederatedDataFrame or value to compare with.

Returns:

- FederatedDataFrame: single-column FederatedDataFrame with computation graph resulting in a boolean Series.

to_datetime(on_column=None, result_column=None, errors='raise', dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit='ns', infer_datetime_format=False, origin='unix')

Convert the column on_column to datetime format. Further arguments can be passed to the underlying pandas to_datetime function as keyword arguments. Results in a table where the column is updated, removing the need for the unsafe setitem operation.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  start_date    end_date
0           1  "2015-08-01"  "2015-12-01"
1           2  "2017-11-11"  "2020-11-11"
2           3  "2020-01-01"         NaN

df = FederatedDataFrame('data_cloudnode')
df = df.to_datetime("start_date", "new_start_date")
df.preprocess_on_dummy()

returns

       patient_id  start_date    end_date new_start_date
0           1  "2015-08-01"  "2015-12-01"     2015-08-01
1           2  "2017-11-11"  "2020-11-11"     2017-11-11
2           3  "2020-01-01"          NaN      2020-01-01

Parameters:

- on_column (Optional[ColumnIdentifier], default None): column to convert.
- result_column (Optional[ColumnIdentifier], default None): optional column where the result should be stored; defaults to on_column if not specified.
- errors (str, default 'raise'): optional argument for how to handle errors during parsing. "raise": raise an exception upon errors (default); "coerce": set the value to NaT and continue; "ignore": return the input and continue.
- dayfirst (bool, default False): optional argument to specify the parse order; if True, parses with the day first, e.g. 01/02/03 is parsed to 1st February 2003.
- yearfirst (bool, default False): optional argument to specify the parse order; if True, parses the year first, e.g. 01/02/03 is parsed to 3rd February 2001.
- utc (bool, default None): optional argument to control the time zone; if False (default), assume the input is in UTC; if True, time zones are converted to UTC.
- format (str, default None): optional strftime argument to parse the time, e.g. "%d/%m/%Y".
- exact (bool, default True): optional argument to control how "format" is used; if True (default), an exact format match is required; if False, the format is allowed to match anywhere in the target string.
- unit (str, default 'ns'): optional argument to denote the unit, e.g. unit="ms" together with origin="unix" calculates the number of milliseconds to the unix epoch start.
- infer_datetime_format (bool, default False): optional argument to attempt to infer the format based on the first (non-NaN) element when set to True and no format is specified.
- origin (str, default 'unix'): optional argument to define the reference date; numeric values are parsed as the number of units (defined by the "unit" argument) since the reference date. "unix" (default) sets the origin to 1970-01-01; "julian" (with "unit" set to "D") sets the origin to the beginning of the Julian calendar (January 1st 4713 BC).

Returns:

- FederatedDataFrame: new instance of the current object with updated graph.

fillna(value, on_column=None, result_column=None)

Fill NaN values with a constant (int, float, string), similar to pandas' fillna. The following arguments from the pandas implementation are not supported: method, axis, inplace, limit, downcast.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id   age  weight
0           1  77.0    55.0
1           2   NaN    60.0
2           3  88.0     NaN

df = FederatedDataFrame('data_cloudnode')
df2 = df.fillna(7)
df2.preprocess_on_dummy()

returns

   patient_id   age  weight
0           1  77.0    55.0
1           2   7.0    60.0
2           3  88.0     7.0

df3 = df.fillna(7, on_column="weight")
df3.preprocess_on_dummy()

returns

   patient_id   age  weight
0           1  77.0    55.0
1           2   NaN    60.0
2           3  88.0     7.0

Parameters:

- value (Union[ALL_TYPES], required): value to use for filling up NaNs.
- on_column (Optional[ColumnIdentifier], default None): only operate on the specified column; defaults to None, i.e., operate on the entire table.
- result_column (Optional[ColumnIdentifier], default None): if on_column is specified, optionally store the result in a new column with this name; defaults to None, i.e., the column is overwritten.

Returns:

- FederatedDataFrame: new instance of the current object with updated graph.

dropna(axis=0, how='any', thresh=None, subset=None)

Drop NaN values from the table with arguments like pandas' dropna.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id   age  weight
0           1  77.0    55.0
1           2  88.0     NaN
2           3   NaN     NaN

df = FederatedDataFrame('data_cloudnode')
df2 = df.dropna()
df2.preprocess_on_dummy()

returns

   patient_id   age  weight
0           1  77.0    55.0

df3 = df.dropna(axis=0, subset=["age"])
df3.preprocess_on_dummy()

returns

   patient_id   age  weight
0           1  77.0    55.0
1           2  88.0     NaN

Parameters:

- axis (default 0): axis to apply this operation to.
- how (default 'any'): determines if a row or column is removed from the FederatedDataFrame when we have at least one NA or all NA. 'any': if any NA values are present, drop that row or column. 'all': if all values are NA, drop that row or column.
- thresh (Optional[int], default None): optional; require that many non-NA values.
- subset (Union[ColumnIdentifier, List[ColumnIdentifier], None], default None): optional; use only a subset of columns. Defaults to None, i.e., operate on the entire data frame. A subset of rows is not permitted for privacy reasons.

Returns:

- FederatedDataFrame: new instance of the current object with updated graph.

isna(on_column=None, result_column=None)

Checks whether entries are null for the given column or the entire FederatedDataFrame and sets boolean values accordingly in the result column.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id   age  weight
0           1  77.0    55.0
1           2  88.0     NaN
2           3   NaN     NaN

df = FederatedDataFrame('data_cloudnode')
df2 = df.isna()
df2.preprocess_on_dummy()

returns

   patient_id    age  weight
0       False  False   False
1       False  False    True
2       False   True    True

df3 = df.isna("age", "na_age")
df3.preprocess_on_dummy()

returns

   patient_id   age  weight  na_age
0           1  77.0    55.0   False
1           2  88.0     NaN   False
2           3   NaN     NaN    True

Parameters:

- on_column (Optional[ColumnIdentifier], default None): column name which is being checked.
- result_column (Optional[ColumnIdentifier], default None): optional result column. If specified, a new column is added to the FederatedDataFrame; otherwise, on_column is overwritten.

Returns:

- FederatedDataFrame: new instance of the current object with updated graph.

astype(dtype, on_column=None, result_column=None)

Convert the entire table to the given datatype, similarly to pandas' astype. The following arguments from the pandas implementation are not supported: copy, errors. Optional arguments not present in the pandas implementation: on_column and result_column specify a column to which the astype function should be applied.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  weight
0           1   77    55.4
1           2   88    60.0
2           3   99    65.5

df = FederatedDataFrame('data_cloudnode')
df2 = df.astype(str)
df2.preprocess_on_dummy()

returns

   patient_id   age  weight
0         "1"  "77"  "55.4"
1         "2"  "88"  "60.0"
2         "3"  "99"  "65.5"

df3 = df.astype(float, on_column="age")
df3.preprocess_on_dummy()

returns

   patient_id   age  weight
0           1  77.0    55.4
1           2  88.0    60.0
2           3  99.0    65.5

Parameters:

- dtype (Union[type, str], required): type to convert to.
- on_column (Optional[ColumnIdentifier], default None): optional column to convert; defaults to None, i.e., the entire FederatedDataFrame is converted.
- result_column (Optional[ColumnIdentifier], default None): optional result column if on_column is specified; defaults to None, i.e., on_column is overwritten.

Returns:

- FederatedDataFrame: new instance of the current object with updated graph.

merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None) 🔗

Merges two FederatedDataFrames. When the preprocessing privacy guard is enabled, merges are only possible as the first preprocessing step. See also pandas documentation.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

patients.csv
    id  age  death
0  423   34      1
1  561   55      0
2  917   98      1
insurance.csv
    id insurance
0  561        TK
1  917       AOK
2  123      None
patients = FederatedDataFrame('data_cloudnode',
    filename_in_zip='patients.csv')
insurance = FederatedDataFrame('data_cloudnode',
    filename_in_zip="insurance.csv")
merge1 = patients.merge(insurance, left_on="id", right_on="id", how="left")
merge1.preprocess_on_dummy()
returns
    id  age  death insurance
0  423   34      1       NaN
1  561   55      0        TK
2  917   98      1       AOK
merge2 = patients.merge(insurance, left_on="id", right_on="id", how="right")
merge2.preprocess_on_dummy()
returns
    id   age  death insurance
0  561  55.0    0.0        TK
1  917  98.0    1.0       AOK
2  123   NaN    NaN      None

merge3 = patients.merge(insurance, left_on="id", right_on="id", how="outer")
merge3.preprocess_on_dummy()
returns
    id   age  death insurance
0  423  34.0    1.0       NaN
1  561  55.0    0.0        TK
2  917  98.0    1.0       AOK
3  123   NaN    NaN      None

Parameters:

Name Type Description Default
right FederatedDataFrame

the other FederatedDataFrame to merge with

required
how Literal['left', 'right', 'outer', 'inner', 'cross']

type of merge ("left", "right", "outer", "inner", "cross")

'inner'
on Optional[ColumnIdentifier]

column or index to join on, that is available on both sides

None
left_on Optional[ColumnIdentifier]

column or index to join the left FederatedDataFrame

None
right_on Optional[ColumnIdentifier]

column or index to join the right FederatedDataFrame

None
left_index bool

use the index of the left FederatedDataFrame

False
right_index bool

use the index of the right FederatedDataFrame

False
sort bool

Sort the join keys in the resulting FederatedDataFrame

False
suffixes

A sequence of two strings. If columns overlap, these suffixes are appended to the column names; defaults to ("_x", "_y"), i.e., if you have the column "id" in both tables, the left table's id column is renamed to "id_x" and the right one to "id_y".

('_x', '_y')
copy bool

If False, avoid copy if possible.

True
indicator bool

If true, a column "_merge" will be added to the resulting FederatedDataFrame that indicates the origin of a row

False
validate Optional[str]

"one_to_one" / "one_to_many" / "many_to_one" / "many_to_many". If set, a check is performed that the merge is of the specified type.

None

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

Raises:

Type Description
PrivacyException

if the merge is insecure due to the operations performed before it

concat(other, join='outer', ignore_index=True, verify_integrity=False, sort=False) 🔗

Concatenate two FederatedDataFrames vertically. The following arguments from the pandas implementation are not supported: keys, levels, names, verify_integrity, copy. A usage sketch follows the parameter list.

Parameters:

Name Type Description Default
other

the other FederatedDataFrame to concatenate with

required
join

type of join to perform ('inner' or 'outer'), defaults to 'outer'

'outer'
ignore_index

whether to ignore the index, defaults to True

True
verify_integrity

whether to verify the integrity of the result, defaults to False

False
sort

whether to sort the result, defaults to False

False
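
Example

A hedged sketch in the style of the merge example below; 'patients.csv' appears there, while 'patients2.csv' is a hypothetical second file with matching columns:

df_a = FederatedDataFrame('data_cloudnode', filename_in_zip='patients.csv')
df_b = FederatedDataFrame('data_cloudnode', filename_in_zip='patients2.csv')  # hypothetical file
# stack both tables vertically; ignore_index=True re-labels rows 0..n-1
stacked = df_a.concat(df_b, join='outer', ignore_index=True)
stacked.preprocess_on_dummy()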

rename(columns) 🔗

Rename column(s) similarly to pandas' rename. The following arguments from the pandas implementation are not supported: mapper, index, axis, copy, inplace, level, errors

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  weight
0           1   77    55.4
1           2   88    60.0
2           3   99    65.5
df = FederatedDataFrame('data_cloudnode')
df = df.rename({"patient_id": "patient_id_new", "age": "age_new"})
df.preprocess_on_dummy()
returns
   patient_id_new  age_new  weight
0               1       77    55.4
1               2       88    60.0
2               3       99    65.5

Parameters:

Name Type Description Default
columns Dict[ColumnIdentifier, ColumnIdentifier]

dict containing the remapping of old names to new names

required

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph

drop_column(column) 🔗

Remove the given column from the table.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   93      83
df = FederatedDataFrame('data_cloudnode')
df = df.drop_column("weight")
df.preprocess_on_dummy()
returns
patient_id  age
0           1   77
1           2   88
2           3   93

Parameters:

Name Type Description Default
column Union[ColumnIdentifier, List[ColumnIdentifier]]

column name or list of column names to drop

required

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

sample(n=None, frac=None, replace=False, random_state=None, ignore_index=False) 🔗

Randomly sample rows of the data frame, similar to pandas' sample. Only one of n (number of samples) or frac (fraction of the data) can be specified. The following arguments from the pandas implementation are not supported: weights and axis. A usage sketch follows the parameter list.

Parameters:

Name Type Description Default
n Optional[int]

number of samples to take

None
frac Optional[float]

fraction of the data to sample, between 0 and 1

None
replace bool

whether to sample with replacement

False
random_state Optional[int]

seed for the random number generator

None
ignore_index bool

whether to ignore the index when sampling

False
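
Example

A minimal usage sketch (assuming the 'data_cloudnode' dummy data used in the surrounding examples; the rows drawn depend on the seed):

df = FederatedDataFrame('data_cloudnode')
# draw a reproducible 50% sample of the rows, without replacement
df_half = df.sample(frac=0.5, random_state=42)
df_half.preprocess_on_dummy()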

__add__(other) 🔗

Arithmetic operator, which adds a constant value or a single column FederatedDataFrame to a single column FederatedDataFrame. This operator is useful only in combination with setitem. In a privacy preserving mode use the add function instead.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

    patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   93      83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] + 100
df.preprocess_on_dummy()
returns
   patient_id  age  weight  new_weight
0           1   77      55         155
1           2   88      60         160
2           3   93      83         183

df["new_weight"] = df["weight"] + df["age"]
returns
   patient_id  age  weight  new_weight
0           1   77      55         132
1           2   88      60         148
2           3   93      83         176

Parameters:

Name Type Description Default
other Union[ALL_TYPES]

constant value or a single column FederatedDataFrame to add.

required

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

__radd__(other) 🔗

Arithmetic operator, which adds a constant value or a single column FederatedDataFrame to a single column FederatedDataFrame from right. This operator is useful only in combination with setitem. In a privacy preserving mode use the add function instead.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

    patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   93      83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = 100 + df["weight"]
df.preprocess_on_dummy()
returns
   patient_id  age  weight  new_weight
0           1   77      55         155
1           2   88      60         160
2           3   93      83         183

Parameters:

Name Type Description Default
other

constant value or a single column FederatedDataFrame to add.

required

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

__neg__() 🔗

Arithmetic operator, which negates values of a single column FederatedDataFrame. This operator is useful only in combination with setitem. In a privacy preserving mode use the neg function instead.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

    patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   93      83
df = FederatedDataFrame('data_cloudnode')
df["neg_age"] = - df["age"]
df.preprocess_on_dummy()
returns
    patient_id  age  weight  neg_age
0           1   77      55      -77
1           2   88      60      -88
2           3   93      83      -93

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

__invert__() 🔗

Logical operator, which inverts boolean values (the tilde operator ~ in pandas).

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  weight  death
0           1   77    55.0   True
1           2   88    60.0  False
2           3   23     NaN   True
df = FederatedDataFrame('data_cloudnode')
df["survival"] = ~df["death"]
df.preprocess_on_dummy()
returns
   patient_id  age  weight  death  survival
0           1   77    55.0   True     False
1           2   88    60.0  False      True
2           3   23     NaN   True     False

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

__sub__(other) 🔗

Arithmetic operator, which subtracts a constant value or a single column FederatedDataFrame from a single column FederatedDataFrame. This operator is useful only in combination with setitem. In a privacy preserving mode use the sub function instead.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

    patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   93      83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] - 100
df.preprocess_on_dummy()
returns
   patient_id  age  weight  new_weight
0           1   77      55         -45
1           2   88      60         -40
2           3   93      83         -17

df["new_weight"] = df["weight"] - df["age"]
returns
   patient_id  age  weight  new_weight
0           1   77      55         -22
1           2   88      60         -28
2           3   93      83         -10

Parameters:

Name Type Description Default
other

constant value or a single column FederatedDataFrame to subtract.

required

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

__rsub__(other) 🔗

Arithmetic operator, which subtracts a single column FederatedDataFrame from a constant value or a single column FederatedDataFrame. This operator is useful only in combination with setitem. In a privacy preserving mode use the sub function instead.

Parameters:

Name Type Description Default
other

constant value or a single column FederatedDataFrame from which to subtract.

required

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

    patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   93      83

df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = 100 - df["weight"]
df.preprocess_on_dummy()

returns

   patient_id  age  weight  new_weight
0           1   77      55         45
1           2   88      60         40
2           3   93      83         17

__truediv__(other) 🔗

Arithmetic operator, which divides FederatedDataFrame by a constant or another FederatedDataFrame.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

    patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   93      83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] / 2
df.preprocess_on_dummy()
returns
    patient_id  age  weight  new_weight
0           1   77      55        27.5
1           2   88      60        30.0
2           3   93      83        41.5

df["new_weight"] = df["weight"] / df["patient_id"]
returns
   patient_id  age  weight  new_weight
0           1   77      55   55.000000
1           2   88      60   30.000000
2           3   93      83   27.666667

Parameters:

Name Type Description Default
other Union[FederatedDataFrame, int, float, bool]

constant value or another FederatedDataFrame to divide by.

required

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

__mul__(other) 🔗

Arithmetic operator, which multiplies FederatedDataFrame by a constant or another FederatedDataFrame.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

    patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   93      83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] * 2
df.preprocess_on_dummy()
returns
    patient_id  age  weight  new_weight
0           1   77      55         110
1           2   88      60         120
2           3   93      83         166

df["new_weight"] = df["weight"] * df["patient_id"]
returns
   patient_id  age  weight  new_weight
0           1   77      55          55
1           2   88      60         120
2           3   93      83         249

Parameters:

Name Type Description Default
other Union[FederatedDataFrame, int, float, bool]

constant value or another FederatedDataFrame to multiply by.

required

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

__rmul__(other) 🔗

Arithmetic operator, which multiplies FederatedDataFrame by a constant or another FederatedDataFrame.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

    patient_id  age  weight
0           1   77      55
1           2   88      60
2           3   93      83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = 2 * df["weight"] * 2
df.preprocess_on_dummy()
returns
    patient_id  age  weight  new_weight
0           1   77      55         110
1           2   88      60         120
2           3   93      83         166

Parameters:

Name Type Description Default
other Union[FederatedDataFrame, int, float, bool]

constant value or another FederatedDataFrame to multiply by.

required

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

__and__(other) 🔗

Logical operator, which conjuncts values of a single column FederatedDataFrame with a constant or another single column FederatedDataFrame.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  death  infected
0           1   77      1         1
1           2   88      0         1
2           3   40      1         0
df = FederatedDataFrame('data_cloudnode')
df = df["death"] & df["infected"]
df.preprocess_on_dummy()
returns
0    1
1    0
2    0

Parameters:

Name Type Description Default
other

constant value or another FederatedDataFrame to logically conjunct

required

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

__or__(other) 🔗

Logical operator, which disjoins values of a single column FederatedDataFrame with a constant or another single column FederatedDataFrame (element-wise or).

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  death  infected
0           1   77      1         1
1           2   88      0         1
2           3   40      1         0
df = FederatedDataFrame('data_cloudnode')
df = df["death"] | df["infected"]
df.preprocess_on_dummy()
returns
0    1
1    1
2    1

Parameters:

Name Type Description Default
other Union[FederatedDataFrame, bool, int]

constant value or another FederatedDataFrame to logically conjunct

required

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

str_contains(pattern) 🔗

Checks if string values of a single column FederatedDataFrame contain the given pattern. Typical usage: federated_dataframe[column].str.contains(pattern)

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  weight   race
0           1   77      55  white
1           2   88      60  black
2           3   93      83  asian
df = FederatedDataFrame('data_cloudnode')
df = df["race"].str.contains("a")
df.preprocess_on_dummy()
returns
0    False
1     True
2     True

Parameters:

Name Type Description Default
pattern str

pattern string to check for

required

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

str_len() 🔗

Computes the string length for each entry. Typical usage: federated_dataframe[column].str.len()

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  weight   race
0           1   77      55      w
1           2   88      60     bl
2           3   93      83  asian
df = FederatedDataFrame('data_cloudnode')
df = df["race"].str.len()
df.preprocess_on_dummy()
returns
0    1
1    2
2    5

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

dt_datetime_like_properties(datetime_like_property) 🔗

Applies a datetime-like (.dt) property to a column of a FederatedDataFrame. Typical usage: federated_dataframe[column].dt.days

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  start_date    end_date
0           1  2015-08-01  2015-12-01
1           2  2017-11-11  2020-11-11
2           3  2020-01-01  2022-06-16
df = FederatedDataFrame('data_cloudnode')
df = df.to_datetime("start_date")
df = df.to_datetime("start_date")
df = df.sub("end_date", "start_date", "duration")
df = df["duration"] = df["duration"].dt.days - 5
df.preprocess_on_dummy()
returns
   patient_id start_date   end_date  duration
0           1 2015-08-01 2015-12-01       117
1           2 2017-11-11 2020-11-11      1091
2           3 2020-01-01 2022-06-16       892

Parameters:

Name Type Description Default
datetime_like_property

datetime-like (.dt) property to be accessed

required

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

sort_values(by, axis=0, ascending=True, kind='quicksort', na_position='last', ignore_index=False) 🔗

Sort values, similar to pandas' sort_values. The following argument from the pandas implementation is not supported: key, as it could be an arbitrary function.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  weight
0           1   77    55.0
1           2   88    60.0
2           3   93    83.0
3           4   18     NaN
df = FederatedDataFrame('data_cloudnode')
df = df.sort_values(by="weight", axis="index", ascending=False)
df.preprocess_on_dummy()
returns
   patient_id  age  weight
2           3   93    83.0
1           2   88    60.0
0           1   77    55.0
3           4   18     NaN

Parameters:

Name Type Description Default
by Union[ColumnIdentifier, List[ColumnIdentifier]]

column name or list of column names to sort by

required
axis

axis to be sorted: 0 or "index" means sort by index, so by contains column labels; 1 or "columns" means sort by column, so by contains index labels

0
ascending bool

defaults to ascending sorting, but can be set to False for descending sorting

True
kind str

defaults to the quicksort sorting algorithm; mergesort, heapsort and stable are available as well

'quicksort'
na_position

defaults to sorting NaNs to the end, set to "first" to put them in the beginning

'last'
ignore_index bool

defaults to false, otherwise, the resulting axis will be labelled 0, 1, ... length-1

False

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

isin(values) 🔗

Whether each element in the data is contained in values, similar to pandas' isin.

Example

Assume the dummy data for 'data_cloudnode' looks like this:

patients.csv:
   patient_id  age  weight
0           1   77    55.0
1           2   88    60.0
2           3   93    83.0
3           4   18     NaN
other.csv:
   patient_id  age  weight
0           1   77    55.0
1           2   88    60.0
2           7   33    93.0
3           8   66     NaN
df = FederatedDataFrame('data_cloudnode',
    filename_in_zip='patients.csv')
df = df.isin(values = {"age": [77], "weight": [55]})
df.preprocess_on_dummy()
returns
   patient_id    age  weight
0       False   True    True
1       False  False   False
2       False  False   False
3       False  False   False

df_other = FederatedDataFrame('data_cloudnode',
    filename_in_zip='other.csv')
df = df.isin(df_other)
df.preprocess_on_dummy()
returns
   patient_id    age  weight
0        True   True    True
1        True   True    True
2       False  False   False
3       False  False   False

Parameters:

Name Type Description Default
values

iterable, dict or FederatedDataFrame to check against. Returns True at each location if all the labels match:

  • if values is a Series, that's the index,
  • if values is a dict, the keys are expected to be column names,
  • if values is a FederatedDataFrame, both index and column labels must match.
required

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

groupby(by=None, axis=0, sort=True, group_keys=True, observed=False, dropna=True) 🔗

Group the data using a mapper. Notice that this operation must be followed by an aggregation (such as .last or .first) before further operations can be made. The arguments are similar to pandas' original groupby. The following arguments from pandas implementation are not supported: axis, level, as_index

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  weight procedures  start_date
0           1   77      55          a  2015-08-01
1           1   77      55          b  2015-10-01
2           2   88      60          a  2017-11-11
3           3   93      83          c  2020-01-01
4           3   93      83          b  2020-05-01
5           3   93      83          a  2021-01-04
df = FederatedDataFrame('data_cloudnode')
grouped_first = df.groupby(by='patient_id').first()
grouped_first.preprocess_on_dummy()
returns
            age  weight procedures start_date
patient_id
1            77      55          a 2015-08-01
2            88      60          a 2017-11-11
3            93      83          c 2020-01-01

grouped_last = df.groupby(by='patient_id').last()
grouped_last.preprocess_on_dummy()
returns
            age  weight procedures start_date
patient_id
1            77      55          b 2015-10-01
2            88      60          a 2017-11-11
3            93      83          a 2021-01-04

Parameters:

Name Type Description Default
by

dictionary, series, label, or list of labels to determine the groups. Grouping with a custom function is not allowed. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups. If a list or ndarray of length equal to the selected axis is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.

None
axis int

Split along rows (0 or "index") or columns (1 or "columns")

0
sort bool

Sort group keys.

True
group_keys bool

During aggregation, add group keys to index to identify groups.

True
observed bool

Only applies to categorical grouping, if true, only show observed values, otherwise, show all values.

False
dropna bool

if true and groups contain NaN values, they will be dropped together with the row/column, otherwise, treat NaN as key in groups.

True

Returns:

Type Description
_FederatedDataFrameGroupBy

_FederatedDataFrameGroupBy object to be used in combination with further aggregations.

Raises:

Type Description
PrivacyException

if by is a function

rolling(window, min_periods=None, center=False, on=None, axis=0, closed=None) 🔗

Rolling window operation, similar to pandas.DataFrame.rolling. The following pandas arguments are not supported: win_type, method, step. A usage sketch follows.
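
Example

A hedged sketch (assuming the numeric 'weight' column from the surrounding examples, and that the rolling result supports pandas-style aggregations such as .mean()):

df = FederatedDataFrame('data_cloudnode')
# mean over a sliding window of two consecutive rows
df["weight_rolling"] = df["weight"].rolling(window=2).mean()
df.preprocess_on_dummy()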

drop_duplicates(subset=None, keep='first', ignore_index=False) 🔗

Drop duplicates in a table or column, similar to pandas' drop_duplicates

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  weight
0           1   77      55
1           2   88      83
2           3   93      83
3           3   93      83
df = FederatedDataFrame('data_cloudnode')
df1 = df.drop_duplicates()
df1.preprocess_on_dummy()
returns
   patient_id  age  weight
0           1   77      55
1           2   88      83
2           3   93      83
df2 = df.drop_duplicates(subset=['weight'])
df2.preprocess_on_dummy()
returns
   patient_id  age  weight
0           1   77      55
1           2   88      83

Parameters:

Name Type Description Default
subset Union[ColumnIdentifier, List[ColumnIdentifier], None]

optional column label or sequence of column labels to consider when identifying duplicates, uses all columns by default

None
keep Union[Literal['first'], Literal['last'], Literal[False]]

string determining which duplicates to keep, can be "first" or "last" or set to False to keep no duplicates

'first'
ignore_index bool

if set to True, the resulting axis will be re-labeled, defaults to False

False

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

charlson_comorbidities(index_column, icd_columns, mapping=None) 🔗

Converts icd codes into comorbidities. If no comorbidity mapping is specified, the default mapping of the NCI is used. See function 'apheris.datatools.transformations.utils.formats.get_default_comorbidity_mapping' for the mapping or the original SAS file maintained by the NCI: https://healthcaredelivery.cancer.gov/seermedicare/considerations/NCI.comorbidity.macro.sas

Parameters:

Name Type Description Default
index_column str

column name of the index column (e.g. patient_id)

required
icd_columns List[str]

names of columns containing icd codes, contributing to comorbidity derivation

required
mapping Dict[str, List]

dictionary that maps comorbidity strings to list of icd codes

None

Returns:

Type Description
FederatedDataFrame

pandas.DataFrame with comorbidity columns according to the mapping used and an index from the given index column, containing comorbidity entries as boolean values.
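
Example

A hedged sketch (the column names 'icd_1' and 'icd_2' are illustrative):

df = FederatedDataFrame('data_cloudnode')
# one boolean column per comorbidity, indexed by patient_id
comorbidities = df.charlson_comorbidities(
    index_column="patient_id",
    icd_columns=["icd_1", "icd_2"],
)
comorbidities.preprocess_on_dummy()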

charlson_comorbidity_index(index_column, icd_columns, mapping=None) 🔗

Converts icd codes into Charlson Comorbidity Index score. If no comorbidity mapping is specified, the default mapping of the NCI is used. See function 'apheris.datatools.transformations.utils.formats.get_default_comorbidity_mapping' for the mapping or the original SAS file maintained by the NCI: https://healthcaredelivery.cancer.gov/seermedicare/considerations/NCI.comorbidity.macro.sas

Parameters:

Name Type Description Default
index_column str

column name of the index column (e.g. patient_id)

required
icd_columns Union[List[str], str]

names of columns containing icd codes, contributing to comorbidity derivation

required
mapping Dict[str, List]

dictionary that maps comorbidity strings to list of icd codes

None

Returns:

Type Description
FederatedDataFrame

pandas.DataFrame containing the comorbidity score per patient.
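
Example

Analogous to charlson_comorbidities above (again with illustrative column names), but aggregated into one score per patient:

df = FederatedDataFrame('data_cloudnode')
cci = df.charlson_comorbidity_index(
    index_column="patient_id",
    icd_columns=["icd_1", "icd_2"],
)
cci.preprocess_on_dummy()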

reset_index(drop=False) 🔗

Resets the index, e.g., after a groupby operation, similar to pandas reset_index. The following arguments from pandas implementation are not supported: level, inplace, col_level, col_fill, allow_duplicates, names

Example

Assume the dummy data for 'data_cloudnode' looks like this:

   patient_id  age  weight
0           1   77      55
1           2   88      83
2           3   93      60
3           4   18      72
df = FederatedDataFrame('data_cloudnode')
df1 = df.reset_index()
df1.preprocess_on_dummy()
returns
   index  Unnamed: 0  patient_id  age  weight
0      0           0           1   77      55
1      1           1           2   88      83
2      2           2           3   93      60
3      3           3           4   18      72

df2 = df.reset_index(drop=True)
df2.preprocess_on_dummy()
returns
   Unnamed: 0  patient_id  age  weight
0           0           1   77      55
1           1           2   88      83
2           2           3   93      60
3           3           4   18      72

Parameters:

Name Type Description Default
drop bool

If true, do not try to insert index into the data columns. This resets the index to the default integer index. Defaults to False.

False

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.

transform_columns(transformation) 🔗

Transform columns of a FederatedDataFrame using a pandas DataFrame as transformation matrix. The DataFrame's index must correspond to the columns of the original FederatedDataFrame. The transformation is applied row-wise, i.e. each row is mapped into the subspace of the original feature space that is defined by the transformation matrix. A usage sketch follows the parameter list.

Parameters:

Name Type Description Default
transformation DataFrame

DataFrame with the same index as the columns of the original FederatedDataFrame. The DataFrame must have the same number of rows as the original FederatedDataFrame has columns.

required

Returns:

Type Description
FederatedDataFrame

new instance of the current object with updated graph.
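
Example

A hedged sketch (assuming the three-column dummy data used throughout; the projection below is illustrative):

import pandas as pd

# each output column of the result is a linear combination of the
# original columns; the matrix index must match those columns
transformation = pd.DataFrame(
    {"sum": [0.0, 1.0, 1.0], "diff": [0.0, 1.0, -1.0]},
    index=["patient_id", "age", "weight"],
)
df = FederatedDataFrame('data_cloudnode')
projected = df.transform_columns(transformation)
projected.preprocess_on_dummy()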

display_graph() 🔗

Convert the underlying networkx DiGraph into pydot and output SVG.

Returns: SVG content

save_graph_as_image(filepath, image_format='svg') 🔗

Convert the underlying networkx DiGraph into pydot and save it as an image.

Parameters:

Name Type Description Default
filepath

path where to save the image on disk

required
image_format

image format; supported formats are taken from the pydot library

'svg'
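
Example

A minimal sketch of both graph methods (the file name is illustrative):

df = FederatedDataFrame('data_cloudnode')
df = df.drop_column("weight")
svg = df.display_graph()             # SVG content of the computation graph
df.save_graph_as_image("graph.svg")  # write the same graph to disk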

export() 🔗

Export FederatedDataFrame object as JSON which can be then imported when needed

Example
df = FederatedDataFrame('data_cloudnode')
df_json = df.export()
# store df_json and later:
df_imported = FederatedDataFrame(data_source=df_json)
# go on using df_imported as you would use df

Returns:

Type Description
str

JSON-like string containing graph and node uuid

preprocess_on_dummy() 🔗

Execute computations "recorded" inside the FederatedDataFrame object on the dummy data attached to the RemoteData object used during initialization.

If no dummy data is available, this method will fail. If you have data for testing stored on your local machine, please use preprocess_on_files instead.

Example
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] + 100

# executes the addition on the dummy data of 'data_cloudnode'
df.preprocess_on_dummy()

# the resulting dataframe is equivalent to:
df_raw = pandas.read_csv(
    apheris_auth.RemoteData('data_cloudnode').dummy_data_path
)
df_raw["new_weight"] = df_raw["weight"] + 100

Returns:

Type Description
DataFrame

resulting pandas.DataFrame after preprocessing has been applied to dummy data.

preprocess_on_files(filepaths) 🔗

Execute computations "recorded" inside the FederatedDataFrame object on local data.

Parameters:

Name Type Description Default
filepaths Dict[str, str]

dictionary to overwrite RemoteData used during FederatedDataFrame initialization with other data sources from your local machine. Keys are expected to be RemoteData ids, values are expected to be file paths.

required
Example
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] + 100
df.preprocess_on_files({'data_cloudnode':
                        'myDirectory/local/replacement_data.csv'})

# the resulting dataframe is equivalent to:
df_raw = pd.read_csv('myDirectory/local/replacement_data.csv')
df_raw["new_weight"] = df_raw["weight"] + 100

Note that in case the FederatedDataFrame merges multiple RemoteData objects and you don't specify all their ids in the filepaths, we use dummy data for all "missing" ids (if available, otherwise, an exception is raised).

Returns:

Type Description
DataFrame

resulting pandas.DataFrame after preprocessing has been applied to the given files.

LocalDebugDataset 🔗

__init__(dataset_id, gateway_id, dataset_fpath, permissions=None, policy=None) 🔗

Dataset class for LocalDebugSimpleStatsSessions.

Parameters:

Name Type Description Default
dataset_id str

Name of the dataset. Allowed characters: letters, numbers, "_", "-", "."

required
gateway_id str

Name of a hypothetical gateway that this dataset resides on. Datasets with the same gateway_id will be launched into the same client. Allowed characters: letters, numbers, "_", "-", "."

required
dataset_fpath str

Absolute filepath to data.

required
policy dict

Policy dict. If not provided, we use empty policies.

None
permissions dict

Permissions dict. If not provided, we allow all operations.

None

LocalDebugSimpleStatsSession 🔗

Bases: LocalSimpleStatsSession

For debugging Apheris Statistics computations locally on your machine. You can work with local files and custom policies and custom permissions. Inject the LocalDebugSimpleStatsSession into a simple-stats computation.

To use the PDB debugger, it is necessary to set max_threads=1.

__init__(datasets, workspace=None, max_threads=None) 🔗

Inits a LocalDebugSimpleStatsSession.

Parameters:

Name Type Description Default
datasets List[LocalDebugDataset]

A list of LocalDebugDataset that define the datasets.

required
workspace Union[str, Path]

path to use as workspace. If not provided, a temporary directory is used as workspace, and information is lost after a statistical query is finished.

None
max_threads Optional[int]

The maximum number of parallel threads to use for the Flare simulator. This should be between 1 and the number of gateways used by the session. Note that debugging may fail for max_threads > 1. Default=1.

None
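
Example

A hedged sketch of wiring a dataset into a debug session (file path and IDs are illustrative; the import path is an assumption and may differ between versions):

from apheris_stats.simple_stats.util import (  # import path assumed
    LocalDebugDataset,
    LocalDebugSimpleStatsSession,
)

dataset = LocalDebugDataset(
    dataset_id="debug-data",
    gateway_id="gateway-1",
    dataset_fpath="/absolute/path/to/data.csv",
)
# max_threads=1 so the PDB debugger can be used
session = LocalDebugSimpleStatsSession(datasets=[dataset], max_threads=1)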

LocalDummySimpleStatsSession 🔗

Bases: LocalSimpleStatsSession

__init__(dataset_ids=None, workspace=None, policies=None, permissions=None, max_threads=None) 🔗

Inits a LocalDummySimpleStatsSession. When you use the session, DummyData, policies and permissions are downloaded to your machine. Then a simulator runs on your local machine. You can step into the code with a debugger to investigate problems. Instead of using the original policies and permissions, you can use custom ones. This might be necessary if the DummyData datasets are too small to fulfill privacy constraints for your query. This comes with the downside that your simulation deviates from a "real" execution.

To use the PDB debugger, it is necessary to set max_threads=1.

Parameters:

Name Type Description Default
dataset_ids List[str]

List of dataset IDs. For each dataset ID, a client will be spun up that uses the dataset's DummyData as its dataset. We automatically apply the privacy policies and permissions of the specified datasets.

None
workspace Union[str, Path]

path to use as workspace. If not provided, a temporary directory is used as workspace, and information is lost after a statistical query is finished.

None
policies Optional[Dict[str, dict]]

Dictionary that defines an asset policy (value) per dataset ID (key) in dataset_ids. If a dataset ID is not given in the dictionary, we use the one of the original data. If None, we use the policies of the original data.

None
permissions Optional[Dict[str, dict]]

Dictionary that defines permissions (value) per dataset ID (key) in dataset_ids. If a dataset ID is not given in the dictionary, we use the one of the original data. If None, we use the permissions of the original data.

None
max_threads Optional[int]

The maximum number of parallel threads to use for the Flare simulator. This should be between 1 and the number of gateways used by the session. Note that debugging may fail for max_threads > 1. Default=1.

None
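
Example

A minimal sketch (the dataset ID is illustrative; the original policies and permissions of the dataset are applied by default):

session = LocalDummySimpleStatsSession(
    dataset_ids=["data_cloudnode"],  # DummyData is used for this ID
    max_threads=1,                   # required for PDB debugging
)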

provision(dataset_ids, client_n_cpu=0.5, client_memory=1000, server_n_cpu=0.5, server_memory=1000) 🔗

Create and activate a cluster of Compute Clients and a Compute Aggregator.

Parameters:

Name Type Description Default
dataset_ids List[str]

List of dataset IDs. For each dataset ID, a Compute Client will be spun up.

required
client_n_cpu float

number of vCPUs of Compute Clients

0.5
client_memory int

memory of Compute Clients [MByte]

1000
server_n_cpu float

number of vCPUs of Compute Aggregators

0.5
server_memory int

memory of Compute Aggregators [MByte]

1000

Returns: SimpleStatsSession - Use this session with simple statistics functions like apheris_stats.simple_stats.tableone. A usage sketch follows.
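
Example

A minimal sketch (the dataset ID is illustrative; resource arguments keep their defaults):

session = provision(dataset_ids=["data_cloudnode"])
# use `session` with simple statistics functions, e.g. apheris_stats.simple_stats.tableone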

PrivacyHandlingMethod 🔗

Bases: Enum

Defines the handling method when bounded privacy is violated.

Attributes:

Name Type Description
FILTER

Filters out all groups that violate the privacy bound

FILTER_DATASET

Removes the entire dataset from the federated computation in case of privacy violations

ROUND

Only valid for counts; rounds to the privacy bound or 0

RAISE

raises a PrivacyException if privacy bound was violated
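
Example

A hedged sketch of passing a handling method to a statistics call (reusing the corr function documented earlier; the datasets, the session, and the import paths are assumptions):

from apheris_stats import simple_stats
from apheris_stats.simple_stats.util import PrivacyHandlingMethod  # import path assumed

result = simple_stats.corr(
    datasets=[df_a, df_b],           # FederatedDataFrames, assumed to exist
    column_names=["age", "weight"],
    session=session,                 # any SimpleStatsSession variant
    handle_outliers=PrivacyHandlingMethod.FILTER,  # drop violating groups instead of raising
)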

ResultsNotFound 🔗

Bases: Exception

SimpleStatsSession 🔗

Bases: StatsSession

__init__(compute_spec_id) 🔗

Inits a SimpleStatsSession that connects to a running cluster of Compute Clients and an Aggregator. If you have no provisioned/activated cluster yet, use apheris_stats.simple_stats.util.provision.

Parameters:

Name Type Description Default
compute_spec_id UUID

Compute spec ID that corresponds to a running cluster of Compute Clients and an Aggregator. (If you have no provisioned/activated cluster yet, use apheris_stats.simple_stats.util.provision.)

required
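
Example

A minimal sketch (the compute spec ID is a placeholder for the ID of your running cluster):

from uuid import UUID

session = SimpleStatsSession(
    compute_spec_id=UUID("00000000-0000-0000-0000-000000000000")  # placeholder ID
)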

get_module_functions(module) 🔗

Return a list of functions in module.