Apheris Statistics Reference🔗
apheris_stats.simple_stats🔗
corr(datasets, session, column_names, global_means=None, group_by=None, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Computes the federated Pearson correlation matrix for a given set of columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | datasets that the computation shall be run on | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
column_names | Iterable[str] | set of columns | required |
global_means | Dict[Union[str, Tuple], Union[int, float, Number]] | means over all datasets for the given column names. If global_means is None, it will be automatically determined in a separate pre-run | None |
group_by | Union[Hashable, Iterable[Hashable]] | mapping, label, or list of labels, used to group before aggregation | None |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical result as a pandas DataFrame with the correlation matrix of the specified columns.
Example
corr_matrix = simple_stats.corr(
datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
column_names=['age', 'length of covid infection'],
global_means={'age': 50, 'length of covid infection': 10},
session=session
)
cov(datasets, session, column_names, global_means=None, group_by=None, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Computes the federated covariance matrix for a given set of columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | datasets that the computation shall be run on | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
column_names | Iterable[str] | set of columns | required |
global_means | Dict[Union[str, Tuple], Union[int, float, Number]] | means over all datasets for the given column names. If global_means is None, it will be automatically determined in a separate pre-run | None |
group_by | Union[Hashable, Iterable[Hashable]] | mapping, label, or list of labels, used to group before aggregation | None |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical result as a pandas DataFrame with the covariance matrix of the specified columns.
Example
cov_matrix = simple_stats.cov(
datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
column_names=['age', 'length of covid infection'],
global_means={'age': 50, 'length of covid infection': 10},
session=session
)
count_column_value(datasets, session, column_name, value, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns how often value appears in a certain column of the datasets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | list of FederatedDataFrames that define the preprocessing of individual datasets | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
column_name | str | name of the column over which the function shall be calculated | required |
value | | this value will be counted | required |
aggregation | bool | defines whether the counts should be aggregated over all datasets | True |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical result
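Example
A minimal call sketch in the style of the corr example above; the dataset and session objects are the placeholders from that example, and the counted value is purely illustrative:
age_counts = simple_stats.count_column_value(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='age',
    value=50,
    aggregation=True,
    session=session
)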
count_group_by(datasets, session, column_name, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Function that counts categorical values of a table column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | list of FederatedDataFrames that define the preprocessing of individual datasets | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
column_name | str | name of the column for which the statistical query shall be computed | required |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical result. Its result contains a pandas DataFrame with the counts summed over the datasets.
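Example
A minimal call sketch following the corr example above; the column name is an illustrative categorical column, not one defined elsewhere in this reference:
variant_counts = simple_stats.count_group_by(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='covid variant',
    session=session
)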
count_null(datasets, session, column_name, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns the number of occurrences of NA values (such as None or numpy.NaN) and the number of non-NA values in the datasets. NA values are counted based on pandas' isna() and notna() functions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | list of FederatedDataFrames that define the preprocessing of individual datasets | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
column_name | str | name of the column over which the NA values shall be counted | required |
group_by | Union[Hashable, Iterable[Hashable]] | (optional) mapping, label, or list of labels, used to group before aggregation | None |
aggregation | bool | defines whether the counts should be aggregated over all datasets | True |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical result
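Example
A minimal call sketch reusing the placeholders from the corr example above:
na_counts = simple_stats.count_null(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='length of covid infection',
    aggregation=True,
    session=session
)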
describe(datasets, session, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Create a description of a dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | list of FederatedDataFrames that define the preprocessing of individual datasets | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical description of datasets
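Example
A minimal call sketch reusing the placeholders from the corr example above:
description = simple_stats.describe(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    session=session
)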
histogram(datasets, session, column_name, bins, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns a histogram for the given datasets
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | list of FederatedDataFrames that define the preprocessing of individual datasets | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
column_name | str | name of the column for which the histogram shall be generated | required |
bins | int or sequence of scalars | if bins is an int, it defines the number of bins with equal width. If it is a sequence, its content defines the bin edges. | required |
group_by | Union[Hashable, Iterable[Hashable]] | mapping, label, or list of labels, used to group before aggregation | None |
aggregation | bool | if True, the histogram is aggregated over all datasets | True |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical result
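Example
A minimal call sketch reusing the placeholders from the corr example above; the bin edges are illustrative:
age_hist = simple_stats.histogram(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='age',
    bins=[0, 20, 40, 60, 80, 100],
    session=session
)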
iqr_column(datasets, session, column_name, global_min_max, group_by=None, n_bins=100, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Function to approximate the interquartile range (IQR) over multiple datasets. Internally, first a histogram with a user-defined number of bins and user-defined upper and lower bounds is created over all datasets. Based on this histogram the IQR is approximated.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | list of FederatedDataFrames that define the preprocessing of individual datasets | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
column_name | str | name of the column for which the histogram shall be generated | required |
global_min_max | Iterable[float] | a list that contains the global minimum and maximum values of the combined datasets. This needs to be computed separately, for example with the functions min_column and max_column | required |
group_by | Union[Hashable, Iterable[Hashable]] | mapping, label, or list of labels, used to group before aggregation | None |
n_bins | int | number of bins for the internal histogram | 100 |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical result
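Example
A minimal call sketch reusing the placeholders from the corr example above; the global minimum and maximum are illustrative and would normally be computed first, for example with min_column and max_column:
age_iqr = simple_stats.iqr_column(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='age',
    global_min_max=[0, 100],
    n_bins=100,
    session=session
)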
kaplan_meier(datasets, session, duration_column_name, event_column_name, group_by=None, plot=False, stepsize=1, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Create a Kaplan-Meier survival statistic.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | list of FederatedDataFrames that define the preprocessing of individual datasets | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
duration_column_name | str | duration column for the survival function | required |
event_column_name | str | event column, indicating death | required |
group_by | str | grouping column | None |
plot | bool | if True, results will be displayed using pd.DataFrame.plot() | False |
stepsize | int | histogram bin size | 1 |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical result
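Example
A minimal call sketch reusing the placeholders from the corr example above; the duration and event column names are illustrative:
km = simple_stats.kaplan_meier(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    duration_column_name='length of covid infection',
    event_column_name='death',
    plot=False,
    session=session
)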
max_column(datasets, session, column_name, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns the max over a specified column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | list of FederatedDataFrames that define the preprocessing of individual datasets | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
column_name | str | name of the column over which the max shall be calculated | required |
group_by | Union[Hashable, Iterable[Hashable]] | optional; mapping, label, or list of labels, used to group before aggregation | None |
aggregation | bool | defines whether the max should be aggregated over all datasets | True |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical result
mean_column(datasets, session, column_name, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns the mean over a specified column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | list of FederatedDataFrames that define the preprocessing of individual datasets | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
column_name | str | name of the column over which the mean shall be calculated | required |
group_by | Union[Hashable, Iterable[Hashable]] | optional; mapping, label, or list of labels, used to group before aggregation | None |
aggregation | bool | defines whether the mean should be aggregated over all datasets | True |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical result
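Example
A minimal call sketch reusing the placeholders from the corr example above; max_column, min_column, and sum_column follow the same call pattern:
mean_age = simple_stats.mean_column(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='age',
    session=session
)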
median_with_confidence_intervals_column(datasets, session, column_name, global_min_max, group_by=None, n_bins=100, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Function to approximate the median and the 95% confidence interval over multiple datasets. Internally, first a histogram with a user-defined number of bins and user-defined upper and lower bounds is created over all datasets. Based on this histogram the median and the confidence interval are approximated.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | list of FederatedDataFrames that define the preprocessing of individual datasets | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
column_name | str | name of the column for which the histogram shall be generated | required |
global_min_max | Iterable[float] | a list that contains the global minimum and maximum values of the combined datasets. This needs to be computed separately, for example with the functions min_column and max_column | required |
group_by | Union[Hashable, Iterable[Hashable]] | mapping, label, or list of labels, used to group before aggregation | None |
n_bins | int | number of bins for the internal histogram | 100 |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical result
median_with_quartiles(datasets, session, column_name, global_min_max, group_by=None, n_bins=100, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Function to approximate the median and the 1st and 3rd quartile over multiple datasets. Internally, first a histogram with a user-defined number of bins and user-defined upper and lower bounds is created over all datasets. Based on this histogram above-mentioned values are approximated.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | list of FederatedDataFrames that define the preprocessing of individual datasets | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
column_name | str | name of the column for which the statistical query shall be computed | required |
global_min_max | List[float] | a list that contains the global minimum and maximum values of the combined datasets. This needs to be computed separately, for example with the functions min_column and max_column | required |
group_by | Union[Hashable, Iterable[Hashable]] | mapping, label, or list of labels, used to group before aggregation | None |
n_bins | int | number of bins for the internal histogram | 100 |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical result; its result contains a tuple with the 1st quartile, the median, and the 3rd quartile.
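Example
A minimal call sketch reusing the placeholders from the corr example above; the global minimum and maximum are illustrative and would normally be computed first, for example with min_column and max_column:
age_quartiles = simple_stats.median_with_quartiles(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='age',
    global_min_max=[0, 100],
    session=session
)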
min_column(datasets, session, column_name, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns the min over a specified column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | list of FederatedDataFrames that define the preprocessing of individual datasets | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
column_name | str | name of the column over which the min shall be calculated | required |
group_by | Union[Hashable, Iterable[Hashable]] | optional; mapping, label, or list of labels, used to group before aggregation | None |
aggregation | bool | defines whether the min should be aggregated over all datasets | True |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical result
pca_transformation(datasets, session, column_names, n_components, handle_outliers=PrivacyHandlingMethod.RAISE.value)
🔗
Computes the principal components transformation matrix of a given list of datasets.
Args:
datasets: datasets that the computation shall be run on
session: For remote runs, use a SimpleStatsSession
that refers to a cluster
column_names: set of columns
n_components: number of components to keep
handle_outliers:
Parameter of enum type PrivacyHandlingMethod which specifies
the handling method in case of bounded privacy violations.
The implemented options are:
- `PrivacyHandlingMethod.FILTER`: filters out all groups that are violating
privacy bound.
- `PrivacyHandlingMethod.FILTER_DATASET`: removes the entire dataset
from the federated computation in case of privacy violations.
- `PrivacyHandlingMethod.RAISE`: raises a PrivacyException if privacy bound
was violated.
Default is `PrivacyHandlingMethod.RAISE`.
Returns: transformation matrix as pandas DataFrame.
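Example
A minimal call sketch reusing the placeholders from the corr example above; the number of components is illustrative:
pca_matrix = simple_stats.pca_transformation(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_names=['age', 'length of covid infection'],
    n_components=2,
    session=session
)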
shape(datasets, session, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns the shape of the datasets
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | list of FederatedDataFrames that define the preprocessing of individual datasets | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical result
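Example
A minimal call sketch reusing the placeholders from the corr example above:
shapes = simple_stats.shape(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    session=session
)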
squared_errors_by_column(datasets, session, column_name, global_mean, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns the sum over the squared differences from global_mean over a specified column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | list of FederatedDataFrames that define the preprocessing of individual datasets | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
column_name | str | name of the column over which the operation shall be calculated | required |
group_by | Union[Hashable, Iterable[Hashable]] | mapping, label, or list of labels, used to group before aggregation | None |
global_mean | float | the deviation of each element from this value is squared and then added up. The mean can be computed via apheris.simple_stats.mean_column. | required |
aggregation | bool | defines whether the operation should be aggregated over all datasets | True |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical result
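Example
A minimal call sketch reusing the placeholders from the corr example above; the global mean shown is an illustrative constant that would normally be computed first via mean_column:
squared_errors = simple_stats.squared_errors_by_column(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    column_name='age',
    global_mean=50,
    session=session
)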
sum_column(datasets, session, column_name, group_by=None, aggregation=True, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Returns the sum over a specified column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | list of FederatedDataFrames that define the preprocessing of individual datasets | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
column_name | str | name of the column over which the sum shall be calculated | required |
group_by | Union[Hashable, Iterable[Hashable]] | optional; mapping, label, or list of labels, used to group before aggregation | None |
aggregation | bool | defines whether the sum should be aggregated over all datasets | True |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical result
tableone(datasets, session, numerical_columns=None, numerical_nonnormal_columns=None, categorical_columns=None, group_by=None, n_bins=100, handle_outliers=PrivacyHandlingMethod.RAISE)
🔗
Create an overview statistic
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | Union[Iterable[FederatedDataFrame], FederatedDataFrame] | list of FederatedDataFrames that define the preprocessing of individual datasets | required |
session | Union[SimpleStatsSession, LocalDummySimpleStatsSession, LocalDebugSimpleStatsSession] | For remote runs, use a SimpleStatsSession that refers to a cluster | required |
numerical_columns | Iterable[str] | names of columns for which mean and standard deviation shall be calculated | None |
numerical_nonnormal_columns | Iterable[str] | names of columns for which the median, as well as 1st and 3rd quartile, shall be calculated. These values are approximated via a histogram. | None |
categorical_columns | Iterable[str] | names of categorical columns whose value counts shall be counted | None |
group_by | Union[Hashable, Iterable[Hashable]] | mapping, label, or list of labels, used to group before aggregation | None |
n_bins | int | number of bins of the histogram that is used to approximate the median and 1st and 3rd quartile of columns in numerical_nonnormal_columns | 100 |
handle_outliers | Union[PrivacyHandlingMethod, str] | parameter of enum type PrivacyHandlingMethod which specifies the handling method in case of bounded privacy violations: FILTER (drop groups that violate the privacy bound), FILTER_DATASET (drop the entire dataset from the computation), or RAISE (raise a PrivacyException) | RAISE |
Returns: statistical result; its result contains a pandas DataFrame with the tableone statistics over the datasets.
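Example
A minimal call sketch reusing the placeholders from the corr example above; which columns are treated as normal or non-normal is illustrative:
overview = simple_stats.tableone(
    datasets=[transformations_dataset_essex, transformations_dataset_norfolk],
    numerical_columns=['age'],
    numerical_nonnormal_columns=['length of covid infection'],
    session=session
)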
apheris_stats.simple_stats.exceptions🔗
ObjectNotFound
🔗
Bases: ApherisException
Raised when trying to access an object that does not exist.
InsufficientPermissions
🔗
Bases: Exception
Raised when an operation does not have sufficient permissions to be performed.
PrivacyException
🔗
Bases: Exception
Raised when a privacy mechanism required by the data provider(s) fails to be applied, is violated, or is incompatible with the user-chosen settings.
RestrictedPreprocessingViolation
🔗
Bases: PrivacyException
Raised when a prohibited command is requested to be executed due to restricted preprocessing.
apheris_stats.simple_stats.util🔗
FederatedDataFrame
🔗
Object that simplifies preprocessing by providing a pandas-like interface to preprocess tabular data. The FederatedDataFrame contains preprocessing transformations that are to be applied on a remote dataset. On which dataset it operates is specified in the constructor.
__init__(data_source, read_format=None, filename_in_zip=None)
🔗
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_source | Union[str, RemoteData] | remote id, RemoteData object, or path to a data file or graph JSON file | required |
read_format | Union[str, InputFormat, None] | format of the data source | None |
filename_in_zip | Union[str, None] | used for the ZIP format to identify which file to take out of the ZIP. The argument is optional, but must be specified for the ZIP format. If read_format is ZIP, the value of this argument is used to read one CSV. | None |
Example:
- via dataset id: assume your dataset id is 'data-cloudnode':
df = FederatedDataFrame('data-cloudnode')
- optional: for remote data containing multiple files, choose which file to read:
df = FederatedDataFrame('data-cloudnode', filename_in_zip='patients.csv')
loc: '_LocIndexer'
property
🔗
Use pandas .loc notation to access the data
__setitem__(index, value)
🔗
Manipulates values of columns or rows of a FederatedDataFrame. This operation does not return a copy of the FederatedDataFrame object, instead this operation is implemented inplace. That means, the computation graph within the FederatedDataFrame object is modified on the object level. This function is not available in a privacy fully preserving mode.
Example:
Assume the dummy data for 'data_cloudnode' looks like this:
```
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["new column"] = df["weight"]
df.preprocess_on_dummy()
```
results in
```
patient_id age weight new_column
0 1 77 55 55
1 2 88 60 60
2 3 93 83 83
```
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | Union[str, int] | column index or name, or a boolean-valued FederatedDataFrame as index mask | required |
value | Union[ALL_TYPES] | a constant value or a single column FederatedDataFrame | required |
__getitem__(key)
🔗
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df = df["weight"]
df.preprocess_on_dummy()
results in
weight
0 55
1 60
2 83
Args: key: column index or name or a boolean valued FederatedDataFrame as index mask.
Returns:
Type | Description |
---|---|
'FederatedDataFrame' | new instance of the current object with updated graph. If the key was a column identifier, the computation graph results in a single-column FederatedDataFrame. If the key was an index mask, the resulting computation graph will produce a filtered FederatedDataFrame. |
add(left, right, result=None)
🔗
Privacy-preserving addition: to a column (left) add another column or constant value (right) and store the result in result. Adding arbitrary iterables would allow for singling-out attacks and is therefore disallowed.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df.add("weight", 100, "new_weight")
df.preprocess_on_dummy()
returns
patient_id age weight new_weight
0 1 77 55 155
1 2 88 60 160
2 3 93 83 183
df.add("weight", "age", "new_weight")
returns
patient_id age weight new_weight
0 1 77 55 132
1 2 88 60 148
2 3 93 83 176
Parameters:
Name | Type | Description | Default |
---|---|---|---|
left | ColumnIdentifier | a column identifier | required |
right | | a column identifier or constant value | required |
result | Optional[ColumnIdentifier] | name for the new result column; can be set to None to overwrite the left column | None |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
neg(column_to_negate, result_column=None)
🔗
Privacy-preserving negation: negate column column_to_negate and store the result in column result_column, or leave result_column as None and overwrite column_to_negate. Using this form of negation removes the need for setitem functionality, which is not privacy-preserving.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df = df.neg("age", "neg_age")
df.preprocess_on_dummy()
returns
patient_id age weight neg_age
0 1 77 55 -77
1 2 88 60 -88
2 3 93 83 -93
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column_to_negate | ColumnIdentifier | column identifier | required |
result_column | Optional[ColumnIdentifier] | optional name for the new column; if not specified, column_to_negate is overwritten | None |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
sub(left, right, result)
🔗
Privacy-preserving subtraction: computes left - right and stores the result in the column result. Both left and right can be column names, or one of them a column name and the other a constant. Arbitrary subtraction with iterables would allow for singling-out attacks and is therefore disallowed.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df = df.sub("weight", 100, "new_weight")
df.preprocess_on_dummy()
returns
patient_id age weight new_weight
0 1 77 55 -45
1 2 88 60 -40
2 3 93 83 -17
df.sub("weight", "age", "new_weight")
returns
patient_id age weight new_weight
0 1 77 55 -22
1 2 88 60 -28
2 3 93 83 -10
Parameters:
Name | Type | Description | Default |
---|---|---|---|
left | | column identifier or constant | required |
right | | column identifier or constant | required |
result | ColumnIdentifier | column name for the new result column | required |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
mult(left, right, result=None)
🔗
Privacy-preserving multiplication: multiply a column (left) by another column or constant value (right) and store the result in result. Multiplying arbitrary iterables would allow for singling-out attacks and is therefore disallowed.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df.mult("weight", 2, "new_weight")
df.preprocess_on_dummy()
returns
patient_id age weight new_weight
0 1 77 55 110
1 2 88 60 120
2 3 93 83 166
df.mult("weight", "patient_id", "new_weight")
returns
patient_id age weight new_weight
0 1 77 55 55
1 2 88 60 120
2 3 93 83 249
Parameters:
Name | Type | Description | Default |
---|---|---|---|
left | ColumnIdentifier | a column identifier | required |
right | | a column identifier or constant value | required |
result | Optional[ColumnIdentifier] | name for the new result column; can be set to None to overwrite the left column | None |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
truediv(left, right, result)
🔗
Privacy-preserving division: divide a column or constant (left) by another column or constant (right) and store the result in result. Dividing by arbitrary iterables would allow for singling-out attacks and is therefore disallowed.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df.truediv("weight", 2, "new_weight")
df.preprocess_on_dummy()
returns
patient_id age weight new_weight
0 1 77 55 27.5
1 2 88 60 30.0
2 3 93 83 41.5
df.truediv("weight", "patient_id", "new_weight")
returns
patient_id age weight new_weight
0 1 77 55 55.000000
1 2 88 60 30.000000
2 3 93 83 27.666667
Parameters:
Name | Type | Description | Default |
---|---|---|---|
left | ColumnIdentifier | a column identifier | required |
right | | a column identifier or constant value | required |
result | ColumnIdentifier | name for the new result column | required |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
invert(column_to_invert, result_column=None)
🔗
Privacy-preserving inversion (~ operator): invert column column_to_invert and store the result in column result_column, or leave result_column as None and overwrite column_to_invert. Using this form of inversion removes the need for setitem functionality, which is not privacy-preserving.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight death
0 1 77 55.0 True
1 2 88 60.0 False
2 3 23 NaN True
df = FederatedDataFrame('data_cloudnode')
df = df.invert("death", "survival")
df.preprocess_on_dummy()
returns
patient_id age weight death survival
0 1 77 55.0 True False
1 2 88 60.0 False True
2 3 23 NaN True False
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column_to_invert | ColumnIdentifier | column identifier | required |
result_column | Optional[ColumnIdentifier] | optional name for the new column; if not specified, column_to_invert is overwritten | None |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
__lt__(other)
🔗
Compare a single-column FederatedDataFrame with a constant using the operator '<'
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 40 50
df = FederatedDataFrame('data_cloudnode')
df = df["age"] < df["weight"]
df.preprocess_on_dummy()
returns
```
0 False
1 False
2 True
```
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | | FederatedDataFrame or value to compare with | required |
Returns:
Type | Description |
---|---|
FederatedDataFrame | single column FederatedDataFrame with computation graph resulting in a boolean Series. |
__gt__(other)
🔗
Compare a single-column FederatedDataFrame with a constant using the operator '>'
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 40 50
df = FederatedDataFrame('data_cloudnode')
df = df["age"] > df["weight"]
df.preprocess_on_dummy()
returns
0 True
1 True
2 False
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | | FederatedDataFrame or value to compare with | required |
Returns:
Type | Description |
---|---|
FederatedDataFrame | single column FederatedDataFrame with computation graph resulting in a boolean Series. |
__eq__(other)
🔗
Compare a single-column FederatedDataFrame with a constant using the operator '=='
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 40 40
df = FederatedDataFrame('data_cloudnode')
df = df["age"] == df["weight"]
df.preprocess_on_dummy()
returns
0 False
1 False
2 True
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | | FederatedDataFrame or value to compare with | required |
Returns:
Type | Description |
---|---|
FederatedDataFrame | single column FederatedDataFrame with computation graph resulting in a boolean Series. |
__le__(other)
🔗
Compare a single-column FederatedDataFrame with a constant using the operator '<='
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 40 40
df = FederatedDataFrame('data_cloudnode')
df = df["age"] <= df["weight"]
df.preprocess_on_dummy()
returns
0 False
1 False
2 True
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | | FederatedDataFrame or value to compare with | required |
Returns:
Type | Description |
---|---|
FederatedDataFrame | single column FederatedDataFrame with computation graph resulting in a boolean Series. |
__ge__(other)
🔗
Compare a single-column FederatedDataFrame with a constant using the operator '>='
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 40 40
df = FederatedDataFrame('data_cloudnode')
df = df["age"] >= df["weight"]
df.preprocess_on_dummy()
returns
0 True
1 True
2 True
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | | FederatedDataFrame or value to compare with | required |
Returns:
Type | Description |
---|---|
FederatedDataFrame | single column FederatedDataFrame with computation graph resulting in a boolean Series. |
__ne__(other)
🔗
Compare a single-column FederatedDataFrame with a constant using the operator '!='
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 40 40
df = FederatedDataFrame('data_cloudnode')
df = df["age"] != df["weight"]
df.preprocess_on_dummy()
returns
0 True
1 True
2 False
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | | FederatedDataFrame or value to compare with | required |
Returns:
Type | Description |
---|---|
FederatedDataFrame | single column FederatedDataFrame with computation graph resulting in a boolean Series. |
to_datetime(on_column=None, result_column=None, errors='raise', dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit='ns', infer_datetime_format=False, origin='unix')
🔗
Convert the column on_column to datetime format. Further arguments can be passed to the respective underlying pandas to_datetime function with kwargs. Results in a table where the column is updated, with no need for the unsafe setitem operation.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id start_date end_date
0 1 "2015-08-01" "2015-12-01"
1 2 "2017-11-11" "2020-11-11"
2 3 "2020-01-01" NaN
df = FederatedDataFrame('data_cloudnode')
df = df.to_datetime("start_date", "new_start_date")
df.preprocess_on_dummy()
returns
patient_id start_date end_date new_start_date
0 1 "2015-08-01" "2015-12-01" 2015-08-01
1 2 "2017-11-11" "2020-11-11" 2017-11-11
2 3 "2020-01-01" NaN 2020-01-01
Parameters:
Name | Type | Description | Default |
---|---|---|---|
on_column | Optional[ColumnIdentifier] | column to convert | None |
result_column | Optional[ColumnIdentifier] | optional column where the result should be stored; defaults to on_column if not specified | None |
errors | str | optional argument how to handle errors during parsing: "raise" raises an exception upon errors (default), "coerce" sets the value to NaT and continues, "ignore" returns the input and continues | 'raise' |
dayfirst | bool | optional argument to specify the parse order; if True, parses with the day first, e.g. 01/02/03 is parsed to 1st February 2003; defaults to False | False |
yearfirst | bool | optional argument to specify the parse order; if True, parses the year first, e.g. 01/02/03 is parsed to 3rd February 2001; defaults to False | False |
utc | bool | optional argument to control the time zone; if False (default), assume the input is in UTC, if True, time zones are converted to UTC | None |
format | str | optional strftime argument to parse the time, e.g. "%d/%m/%Y"; defaults to None | None |
exact | bool | optional argument to control how "format" is used; if True (default), an exact format match is required, if False, the format is allowed to match anywhere in the target string | True |
unit | str | optional argument to denote the unit, defaults to "ns"; e.g. unit="ms" and origin="unix" calculates the number of milliseconds to the unix epoch start | 'ns' |
infer_datetime_format | bool | optional argument to attempt to infer the format based on the first (non-NaN) argument when set to True and no format is specified; defaults to False | False |
origin | str | optional argument to define the reference date; numeric values are parsed as the number of units defined by the "unit" argument since the reference date, e.g. "unix" (default) sets the origin to 1970-01-01, "julian" (with "unit" set to "D") sets the origin to the beginning of the Julian Calendar (January 1st 4713 BC) | 'unix' |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
fillna(value, on_column=None, result_column=None)
🔗
Fill NaN values with a constant (int, float, string), similar to pandas' fillna. The following arguments from the pandas implementation are not supported: method, axis, inplace, limit, downcast.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77.0 55.0
1 2 NaN 60.0
2 3 88.0 NaN
df = FederatedDataFrame('data_cloudnode')
df2 = df.fillna(7)
df2.preprocess_on_dummy()
returns
patient_id age weight
0 1 77.0 55.0
1 2 7.0 60.0
2 3 88.0 7.0
df3 = df.fillna(7, on_column="weight")
df3.preprocess_on_dummy()
returns
patient_id age weight
0 1 77.0 55.0
1 2 NaN 60.0
2 3 88.0 7.0
Parameters:
Name | Type | Description | Default |
---|---|---|---|
value | Union[ALL_TYPES] | value to use for filling up NaNs | required |
on_column | Optional[ColumnIdentifier] | only operate on the specified column; defaults to None, i.e., operate on the entire table | None |
result_column | Optional[ColumnIdentifier] | if on_column is specified, optionally store the result in a new column with this name; defaults to None, i.e., overwriting the column | None |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
dropna(axis=0, how='any', thresh=None, subset=None)
🔗
Drop NaN values from the table with arguments like for pandas' dropna.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77.0 55.0
1 2 88.0 NaN
2 3 NaN NaN
df = FederatedDataFrame('data_cloudnode')
df2 = df.dropna()
df2.preprocess_on_dummy()
returns
patient_id age weight
0 1 77.0 55.0
df3 = df.dropna(axis=0, subset=["age"])
df3.preprocess_on_dummy()
patient_id age weight
0 1 77.0 55.0
1 2 88.0 NaN
Parameters:
Name | Type | Description | Default |
---|---|---|---|
axis | | axis to apply this operation to; defaults to zero | 0 |
how | | determine if a row or column is removed from the FederatedDataFrame when we have at least one NA or all NA; defaults to "any". 'any': if any NA values are present, drop that row or column. 'all': if all values are NA, drop that row or column. | 'any' |
thresh | Optional[int] | optional; require that many non-NA values to drop; defaults to None | None |
subset | Union[ColumnIdentifier, List[ColumnIdentifier], None] | optional; use only a subset of columns; defaults to None, i.e., operate on the entire data frame. A subset of rows is not permitted for privacy reasons. | None |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
isna(on_column=None, result_column=None)
🔗
Checks if an entry is null for given columns or FederatedDataFrame and sets boolean value accordingly in the result column.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77.0 55.0
1 2 88.0 NaN
2 3 NaN NaN
df = FederatedDataFrame('data_cloudnode')
df2 = df.isna()
df2.preprocess_on_dummy()
patient_id age weight
0 False False False
1 False False False
2 False True True
df3 = df.isna("age", "na_age")
df3.preprocess_on_dummy()
patient_id age weight na_age
0 1 77.0 55.0 False
1 2 88.0 NaN False
2 3 NaN NaN True
Parameters:
Name | Type | Description | Default |
---|---|---|---|
on_column | Optional[ColumnIdentifier] | column name which is being checked | None |
result_column | Optional[ColumnIdentifier] | optional result column. If specified, a new column is added to the FederatedDataFrame, otherwise on_column is overwritten. | None |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
astype(dtype, on_column=None, result_column=None)
🔗
Convert the entire table to the given datatype, similarly to pandas' astype. The following arguments from the pandas implementation are not supported: copy, errors. Optional arguments not present in the pandas implementation: on_column and result_column give a column to which the astype function should be applied.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55.4
1 2 88 60.0
2 3 99 65.5
df = FederatedDataFrame('data_cloudnode')
df2 = df.astype(str)
df2.preprocess_on_dummy()
patient_id age weight
0 "1" "77" "55.4"
1 "2" "88" "60.0"
2 "3" "99" "65.5"
df3 = df.astype(float, on_column="age")
patient_id age weight
0 1 77.0 55.4
1 2 88.0 60.0
2 3 99.0 65.5
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dtype | Union[type, str] | type to convert to | required |
on_column | Optional[ColumnIdentifier] | optional column to convert; defaults to None, i.e., the entire FederatedDataFrame is converted | None |
result_column | Optional[ColumnIdentifier] | optional result column if on_column is specified; defaults to None, i.e., on_column is overwritten | None |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
🔗
Merges two FederatedDataFrames. When the preprocessing privacy guard is enabled, merges are only possible as the first preprocessing step. See also pandas documentation.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patients.csv
id age death
0 423 34 1
1 561 55 0
2 917 98 1
insurance.csv
id insurance
0 561 TK
1 917 AOK
2 123 None
patients = FederatedDataFrame('data_cloudnode',
filename_in_zip='patients.csv')
insurance = FederatedDataFrame('data_cloudnode',
filename_in_zip="insurance.csv")
merge1 = patients.merge(insurance, left_on="id", right_on="id", how="left")
merge1.preprocess_on_dummy()
returns
id age death insurance
0 423 34 1 NaN
1 561 55 0 TK
2 917 98 1 AOK
merge2 = patients.merge(insurance, left_on="id", right_on="id", how="right")
merge2.preprocess_on_dummy()
id age death insurance
0 561 55.0 0.0 TK
1 917 98.0 1.0 AOK
2 123 NaN NaN None
merge3 = patients.merge(insurance, left_on="id", right_on="id", how="outer")
merge3.preprocess_on_dummy()
id age death insurance
0 423 34.0 1.0 NaN
1 561 55.0 0.0 TK
2 917 98.0 1.0 AOK
3 123 NaN NaN None
Parameters:
Name | Type | Description | Default |
---|---|---|---|
right | FederatedDataFrame | the other FederatedDataFrame to merge with | required |
how | Literal['left', 'right', 'outer', 'inner', 'cross'] | type of merge ("left", "right", "outer", "inner", "cross") | 'inner' |
on | Optional[ColumnIdentifier] | column or index to join on that is available on both sides | None |
left_on | Optional[ColumnIdentifier] | column or index to join the left FederatedDataFrame on | None |
right_on | Optional[ColumnIdentifier] | column or index to join the right FederatedDataFrame on | None |
left_index | bool | use the index of the left FederatedDataFrame | False |
right_index | bool | use the index of the right FederatedDataFrame | False |
sort | bool | sort the join keys in the resulting FederatedDataFrame | False |
suffixes | | a sequence of two strings. If columns overlap, these suffixes are appended to the column names; defaults to ("_x", "_y"), i.e., if you have the column "id" in both tables, the left table's id column is renamed to "id_x" and the right to "id_y". | ('_x', '_y') |
copy | bool | if False, avoid copying if possible | True |
indicator | bool | if True, a column "_merge" is added to the resulting FederatedDataFrame that indicates the origin of a row | False |
validate | Optional[str] | "one_to_one"/"one_to_many"/"many_to_one"/"many_to_many". If set, a check is performed whether the specified type is met. | None |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
Raises:
Type | Description |
---|---|
PrivacyException | if merges are insecure due to the operations performed before |
concat(other, join='outer', ignore_index=True, verify_integrity=False, sort=False)
🔗
Concatenate two FederatedDataFrames vertically. The following arguments from the pandas implementation are not supported: keys, levels, names, verify_integrity, copy.
Args:
other: the other FederatedDataFrame to concatenate with
join: type of join to perform ('inner' or 'outer'), defaults to 'outer'
ignore_index: whether to ignore the index, defaults to True
verify_integrity: whether to verify the integrity of the result, defaults
to False
sort: whether to sort the result, defaults to False
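Example
A minimal sketch of a vertical concatenation; the two dataset ids below are placeholders for remote datasets with the same columns:
patients_a = FederatedDataFrame('data_cloudnode_a')
patients_b = FederatedDataFrame('data_cloudnode_b')
combined = patients_a.concat(patients_b, join='outer', ignore_index=True)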
rename(columns)
🔗
Rename column(s) similarly to pandas' rename. The following arguments from the pandas implementation are not supported: mapper, index, axis, copy, inplace, level, errors.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55.4
1 2 88 60.0
2 3 99 65.5
df = FederatedDataFrame('data_cloudnode')
df = df.rename({"patient_id": "patient_id_new", "age": "age_new"})
df.preprocess_on_dummy()
patient_id_new age_new weight
0 1 77 55.4
1 2 88 60.0
2 3 99 65.5
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | Dict[ColumnIdentifier, ColumnIdentifier] | dict containing the remapping of old names to new names | required |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
drop_column(column)
🔗
Remove the given column from the table.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df = df.drop_column("weight")
df.preprocess_on_dummy()
patient_id age
0 1 77
1 2 88
2 3 93
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | Union[ColumnIdentifier, List[ColumnIdentifier]] | column name or list of column names to drop | required |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
sample(n=None, frac=None, replace=False, random_state=None, ignore_index=False)
🔗
Sample the data frame based on a given mask and percentage. Only one of n (number of samples) or frac (fraction of the data) can be specified. The following arguments from the pandas implementation are not supported: weights and axis.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n | Optional[int] | number of samples to take | None |
frac | Optional[float] | fraction of the data to sample, between 0 and 1 | None |
replace | bool | whether to sample with replacement | False |
random_state | Optional[int] | seed for the random number generator | None |
ignore_index | bool | whether to ignore the index when sampling | False |
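Example
A minimal sketch reusing the dummy data convention of the examples above; the fraction and seed are purely illustrative:
df = FederatedDataFrame('data_cloudnode')
df = df.sample(frac=0.5, random_state=42)
df.preprocess_on_dummy()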
__add__(other)
🔗
Arithmetic operator, which adds a constant value or a single column
FederatedDataFrame to a single column FederatedDataFrame. This operator is
useful only in combination with setitem. In a privacy preserving mode use
the add
function instead.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] + 100
df.preprocess_on_dummy()
patient_id age weight new_weight
0 1 77 55 155
1 2 88 60 160
2 3 93 83 183
df["new_weight"] = df["weight"] + df["age"]
patient_id age weight new_weight
0 1 77 55 132
1 2 88 60 148
2 3 93 83 176
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | Union[ALL_TYPES] | constant value or a single column FederatedDataFrame to add | required |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
__radd__(other)
🔗
Arithmetic operator, which adds a constant value or a single column
FederatedDataFrame to a single column FederatedDataFrame from right. This operator
is useful only in combination with setitem. In a privacy preserving mode use
the add
function instead.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = 100 + df["weight"]
df.preprocess_on_dummy()
patient_id age weight new_weight
0 1 77 55 155
1 2 88 60 160
2 3 93 83 183
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | | constant value or a single column FederatedDataFrame to add | required |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
__neg__()
🔗
Logical operator, which negates values of a single column
FederatedDataFrame. This operator is
useful only in combination with setitem. In a privacy preserving mode use
the neg
function instead.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["neg_age"] = - df["age"]
df.preprocess_on_dummy()
patient_id age weight neg_age
0 1 77 55 -77
1 2 88 60 -88
2 3 93 83 -93
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
__invert__()
🔗
Logical operator, which inverts bool values (known as tilde in pandas, ~).
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight death
0 1 77 55.0 True
1 2 88 60.0 False
2 3 23 NaN True
df = FederatedDataFrame('data_cloudnode')
df["survival"] = ~df["death"]
df.preprocess_on_dummy()
patient_id age weight death survival
0 1 77 55.0 True False
1 2 88 60.0 False True
2 3 23 NaN True False
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
__sub__(other)
🔗
Arithmetic operator, which subtracts a constant value or a single column
FederatedDataFrame to a single column FederatedDataFrame. This operator is
useful only in combination with setitem. In a privacy preserving mode use
the sub
function instead.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] - 100
df.preprocess_on_dummy()
patient_id age weight new_weight
0 1 77 55 -45
1 2 88 60 -40
2 3 93 83 -17
df["new_weight"] = df["weight"] - df["age"]
patient_id age weight new_weight
0 1 77 55 -22
1 2 88 60 -28
2 3 93 83 -10
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | | constant value or a single column FederatedDataFrame to subtract | required |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
__rsub__(other)
🔗
Arithmetic operator, which subtracts a single column FederatedDataFrame from a
constant value or a single column FederatedDataFrame. This operator is
useful only in combination with setitem. In a privacy preserving mode use
the sub
function instead.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | | constant value or a single column FederatedDataFrame from which to subtract | required |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = 100 - df["weight"]
df.preprocess_on_dummy()
returns
patient_id age weight new_weight
0 1 77 55 45
1 2 88 60 40
2 3 93 83 17
__truediv__(other)
🔗
Arithmetic operator, which divides FederatedDataFrame by a constant or another FederatedDataFrame.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] / 2
df.preprocess_on_dummy()
patient_id age weight new_weight
0 1 77 55 27.5
1 2 88 60 30.0
2 3 93 83 41.5
df["new_weight"] = df["weight"] / df["patient_id"]
patient_id age weight new_weight
0 1 77 55 55.000000
1 2 88 60 30.000000
2 3 93 83 27.666667
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | Union[FederatedDataFrame, int, float, bool] | constant value or another FederatedDataFrame to divide by | required |
Returns:
Type | Description |
---|---|
FederatedDataFrame | new instance of the current object with updated graph. |
__mul__(other)
🔗
Arithmetic operator, which multiplies FederatedDataFrame by a constant or another FederatedDataFrame.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] * 2
df.preprocess_on_dummy()
patient_id age weight new_weight
0 1 77 55 110
1 2 88 60 120
2 3 93 83 166
df["new_weight"] = df["weight"] * df["patient_id"]
patient_id age weight new_weight
0 1 77 55 55
1 2 88 60 120
2 3 93 83 249
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other |
Union[FederatedDataFrame, int, float, bool]
|
constant value or another FederatedDataFrame to multiply by. |
required |
Returns:
Type | Description |
---|---|
FederatedDataFrame
|
new instance of the current object with updated graph. |
__rmul__(other)
🔗
Arithmetic operator, which multiplies FederatedDataFrame by a constant or another FederatedDataFrame.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 60
2 3 93 83
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = 2 * df["weight"] * 2
df.preprocess_on_dummy()
patient_id age weight new_weight
0 1 77 55 110
1 2 88 60 120
2 3 93 83 166
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other |
Union[FederatedDataFrame, int, float, bool]
|
constant value or another FederatedDataFrame to multiply by. |
required |
Returns: new instance of the current object with updated graph.
__and__(other)
🔗
Logical operator that computes the element-wise AND (conjunction) of a single-column FederatedDataFrame with a constant or another single-column FederatedDataFrame.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age death infected
0 1 77 1 1
1 2 88 0 1
2 3 40 1 0
df = FederatedDataFrame('data_cloudnode')
df = df["death"] & df["infected"]
df.preprocess_on_dummy()
0 1
1 0
2 0
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other |
constant value or another FederatedDataFrame to combine with logical AND |
required |
Returns:
Type | Description |
---|---|
FederatedDataFrame
|
new instance of the current object with updated graph. |
__or__(other)
🔗
Logical operator that computes the element-wise OR (disjunction) of a single-column FederatedDataFrame with a constant or another single-column FederatedDataFrame.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age death infected
0 1 77 1 1
1 2 88 0 1
2 3 40 1 0
df = FederatedDataFrame('data_cloudnode')
df = df["death"] | df["infected"]
df.preprocess_on_dummy()
0 1
1 1
2 1
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other |
Union[FederatedDataFrame, bool, int]
|
constant value or another FederatedDataFrame to combine with logical OR |
required |
Returns:
Type | Description |
---|---|
FederatedDataFrame
|
new instance of the current object with updated graph. |
str_contains(pattern)
🔗
Checks whether the string values of a single-column FederatedDataFrame contain a
pattern. Typical usage:
federated_dataframe[column].str.contains(pattern)
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight race
0 1 77 55 white
1 2 88 60 black
2 3 93 83 asian
df = FederatedDataFrame('data_cloudnode')
df = df["race"].str.contains("a")
df.preprocess_on_dummy()
0 False
1 True
2 True
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pattern |
str
|
pattern string to check for |
required |
Returns: new instance of the current object with updated graph.
str_len()
🔗
Computes the string length of each entry. Typical usage:
federated_dataframe[column].str.len()
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight race
0 1 77 55 w
1 2 88 60 bl
2 3 93 83 asian
df = FederatedDataFrame('data_cloudnode')
df = df["race"].str.len()
df.preprocess_on_dummy()
0 1
1 2
2 5
Returns:
Type | Description |
---|---|
FederatedDataFrame
|
new instance of the current object with updated graph. |
dt_datetime_like_properties(datetime_like_property)
🔗
Accesses a datetime-like (.dt) property on a column of a FederatedDataFrame.
Typical usage:
federated_dataframe[column].dt.days
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id start_date end_date
0 1 2015-08-01 2015-12-01
1 2 2017-11-11 2020-11-11
2 3 2020-01-01 2022-06-16
df = FederatedDataFrame('data_cloudnode')
df = df.to_datetime("start_date")
df = df.to_datetime("start_date")
df = df.sub("end_date", "start_date", "duration")
df = df["duration"] = df["duration"].dt.days - 5
df.preprocess_on_dummy()
patient_id start_date end_date duration
0 1 2015-08-01 2015-12-01 117
1 2 2017-11-11 2020-11-11 1091
2 3 2020-01-01 2022-06-16 892
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datetime_like_property |
datetime-like (.dt) property to be accessed |
required |
Returns: new instance of the current object with updated graph.
sort_values(by, axis=0, ascending=True, kind='quicksort', na_position='last', ignore_index=False)
🔗
Sort values, similar to pandas' sort_values.
The following argument from the pandas implementation is not supported:
key, because it could be an arbitrary function.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55.0
1 2 88 60.0
2 3 93 83.0
3 4 18 NaN
df = FederatedDataFrame('data_cloudnode')
df = df.sort_values(by="weight", axis="index", ascending=False)
df.preprocess_on_dummy()
patient_id age weight
2 3 93 83.0
1 2 88 60.0
0 1 77 55.0
3 4 18 NaN
Parameters:
Name | Type | Description | Default |
---|---|---|---|
by |
Union[ColumnIdentifier, List[ColumnIdentifier]]
|
column name or list of column names to sort by |
required |
axis |
axis to be sorted: 0 or "index" means sort by index, thus, by contains column labels 1 or "column" means sort by column, thus, by contains index labels |
0
|
|
ascending |
bool
|
defaults to ascending sorting, but can be set to False for descending sorting |
True
|
kind |
str
|
sorting algorithm to use; defaults to "quicksort" |
'quicksort'
|
na_position |
defaults to sorting NaNs to the end, set to "first" to put them in the beginning |
'last'
|
|
ignore_index |
bool
|
defaults to false, otherwise, the resulting axis will be labelled 0, 1, ... length-1 |
False
|
Returns:
Type | Description |
---|---|
FederatedDataFrame
|
new instance of the current object with updated graph. |
isin(values)
🔗
Whether each element in the data is contained in values, similar to pandas' isin.
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patients.csv:
patient_id age weight
0 1 77 55.0
1 2 88 60.0
2 3 93 83.0
3 4 18 NaN
other.csv:
patient_id age weight
0 1 77 55.0
1 2 88 60.0
2 7 33 93.0
3 8 66 NaN
df = FederatedDataFrame('data_cloudnode',
filename_in_zip='patients.csv')
df = df.isin(values = {"age": [77], "weight": [55]})
df.preprocess_on_dummy()
patient_id age weight
0 False True True
1 False False False
2 False False False
3 False False False
df_other = FederatedDataFrame('data_cloudnode',
filename_in_zip='other.csv')
df = df.isin(df_other)
df.preprocess_on_dummy()
patient_id age weight
0 True True True
1 True True True
2 False False False
3 False False False
Parameters:
Name | Type | Description | Default |
---|---|---|---|
values |
iterable, dict, or FederatedDataFrame to check against. Returns True at each location if all the labels match.
|
required |
Returns:
Type | Description |
---|---|
FederatedDataFrame
|
new instance of the current object with updated graph. |
groupby(by=None, axis=0, sort=True, group_keys=True, observed=False, dropna=True)
🔗
Group the data using a mapper. Note that this operation must be followed by
an aggregation (such as .last or .first) before further operations can be applied.
The arguments are similar to pandas' original groupby.
The following arguments from the pandas implementation are not supported:
axis, level, as_index
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight procedures start_date
0 1 77 55 a 2015-08-01
1 1 77 55 b 2015-10-01
2 2 88 60 a 2017-11-11
3 3 93 83 c 2020-01-01
4 3 93 83 b 2020-05-01
5 3 93 83 a 2021-01-04
df = FederatedDataFrame('data_cloudnode')
grouped_first = df.groupby(by='patient_id').first()
grouped_first.preprocess_on_dummy()
age weight procedures start_date
patient_id
1 77 55 a 2015-08-01
2 88 60 a 2017-11-11
3 93 83 c 2020-01-01
grouped_last = df.groupby(by='patient_id').last()
grouped_last.preprocess_on_dummy()
age weight procedures start_date
patient_id
1 77 55 b 2015-10-01
2 88 60 a 2017-11-11
3 93 83 a 2021-01-04
Parameters:
Name | Type | Description | Default |
---|---|---|---|
by |
dictionary, series, label, or list of labels to determine the groups. Grouping with a custom function is not allowed. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups. If a list or ndarray of length equal to the selected axis is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key. |
None
|
|
axis |
int
|
Split along rows (0 or "index") or columns (1 or "columns") |
0
|
sort |
bool
|
Sort group keys. |
True
|
group_keys |
bool
|
During aggregation, add group keys to index to identify groups. |
True
|
observed |
bool
|
Only applies to categorical grouping; if True, only show observed values, otherwise show all values. |
False
|
dropna |
bool
|
If True and the group keys contain NaN values, they are dropped together with the row/column; otherwise, NaN values are treated as keys in groups. |
True
|
Returns:
Type | Description |
---|---|
_FederatedDataFrameGroupBy
|
_FederatedGroupBy object to be used in combination with further aggregations. |
Raises:
Type | Description |
---|---|
PrivacyException
|
if the privacy bound is violated. |
rolling(window, min_periods=None, center=False, on=None, axis=0, closed=None)
🔗
Rolling window operation, similar to pandas.DataFrame.rolling.
The following pandas arguments are not supported: win_type, method, step.
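Example
A minimal sketch; the .mean() aggregation on the rolling result is assumed here to mirror pandas' Rolling behaviour:
df = FederatedDataFrame('data_cloudnode')
# rolling window of 2 rows over the dummy data, followed by an aggregation
rolled_mean = df.rolling(window=2, min_periods=1).mean()
rolled_mean.preprocess_on_dummy()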
drop_duplicates(subset=None, keep='first', ignore_index=False)
🔗
Drop duplicates in a table or column, similar to pandas' drop_duplicates
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 83
2 3 93 83
3 3 93 83
df = FederatedDataFrame('data_cloudnode')
df1 = df.drop_duplicates()
df1.preprocess_on_dummy()
patient_id age weight
0 1 77 55
1 2 88 83
2 3 93 83
df2 = df.drop_duplicates(subset=['weight'])
df2.preprocess_on_dummy()
patient_id age weight
0 1 77 55
1 2 88 83
Parameters:
Name | Type | Description | Default |
---|---|---|---|
subset |
Union[ColumnIdentifier, List[ColumnIdentifier], None]
|
optional column label or sequence of column labels to consider when identifying duplicates, uses all columns by default |
None
|
keep |
Union[Literal['first'], Literal['last'], Literal[False]]
|
string determining which duplicates to keep, can be "first" or "last" or set to False to keep no duplicates |
'first'
|
ignore_index |
bool
|
if set to True, the resulting axis will be re-labeled, defaults to False |
False
|
Returns:
Type | Description |
---|---|
FederatedDataFrame
|
new instance of the current object with updated graph. |
charlson_comorbidities(index_column, icd_columns, mapping=None)
🔗
Converts ICD codes into comorbidities. If no comorbidity mapping is specified, the default mapping of the NCI is used. See the function 'apheris.datatools.transformations.utils.formats.get_default_comorbidity_mapping' for the mapping, or the original SAS file maintained by the NCI: https://healthcaredelivery.cancer.gov/seermedicare/considerations/NCI.comorbidity.macro.sas
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index_column |
str
|
column name of the index column (e.g. patient_id) |
required |
icd_columns |
List[str]
|
names of columns containing icd codes, contributing to comorbidity derivation |
required |
mapping |
Dict[str, List]
|
dictionary that maps comorbidity strings to list of icd codes |
None
|
Returns:
Type | Description |
---|---|
FederatedDataFrame
|
pandas.DataFrame with comorbidity columns according to the mapping used, indexed by the given index column, with comorbidity entries as boolean values. |
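Example
A minimal sketch; the diagnosis column names are illustrative, and the default NCI mapping is used because no mapping is passed:
df = FederatedDataFrame('data_cloudnode')
# derive boolean comorbidity columns from ICD codes in the
# (hypothetical) columns 'icd_primary' and 'icd_secondary'
comorbidities = df.charlson_comorbidities(
    index_column='patient_id',
    icd_columns=['icd_primary', 'icd_secondary']
)
comorbidities.preprocess_on_dummy()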
charlson_comorbidity_index(index_column, icd_columns, mapping=None)
🔗
Converts ICD codes into a Charlson Comorbidity Index score. If no comorbidity mapping is specified, the default mapping of the NCI is used. See the function 'apheris.datatools.transformations.utils.formats.get_default_comorbidity_mapping' for the mapping, or the original SAS file maintained by the NCI: https://healthcaredelivery.cancer.gov/seermedicare/considerations/NCI.comorbidity.macro.sas
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index_column |
str
|
column name of the index column (e.g. patient_id) |
required |
icd_columns |
Union[List[str], str]
|
names of columns containing icd codes, contributing to comorbidity derivation |
required |
mapping |
Dict[str, List]
|
dictionary that maps comorbidity strings to list of icd codes |
None
|
Returns:
Type | Description |
---|---|
FederatedDataFrame
|
pandas.DataFrame containing the comorbidity score per patient. |
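Example
A minimal sketch; the same illustrative diagnosis columns as above, with the default NCI mapping:
df = FederatedDataFrame('data_cloudnode')
# compute one Charlson Comorbidity Index score per patient
cci = df.charlson_comorbidity_index(
    index_column='patient_id',
    icd_columns=['icd_primary', 'icd_secondary']
)
cci.preprocess_on_dummy()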
reset_index(drop=False)
🔗
Resets the index, e.g., after a groupby operation, similar to pandas' reset_index.
The following arguments from the pandas implementation are not supported:
level, inplace, col_level, col_fill, allow_duplicates, names
Example
Assume the dummy data for 'data_cloudnode' looks like this:
patient_id age weight
0 1 77 55
1 2 88 83
2 3 93 60
3 4 18 72
df = FederatedDataFrame('data_cloudnode')
df1 = df.reset_index()
df1.preprocess_on_dummy()
index Unnamed: 0 patient_id age weight
0 0 0 1 77 55
1 1 1 2 88 83
2 2 2 3 93 60
3 3 3 4 18 72
df2 = df.reset_index(drop=True)
df2.preprocess_on_dummy()
Unnamed: 0 patient_id age weight
0 0 1 77 55
1 1 2 88 83
2 2 3 93 60
3 3 4 18 72
Parameters:
Name | Type | Description | Default |
---|---|---|---|
drop |
bool
|
If true, do not try to insert index into the data columns. This resets the index to the default integer index. Defaults to False. |
False
|
Returns:
Type | Description |
---|---|
FederatedDataFrame
|
new instance of the current object with updated graph. |
transform_columns(transformation)
🔗
Transforms the columns of a FederatedDataFrame using a pandas DataFrame as a transformation matrix. The DataFrame's index must correspond to the columns of the original FederatedDataFrame. The transformation is applied row-wise, i.e. each row of the original data is mapped into the feature space defined by the columns of the transformation matrix.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
transformation |
DataFrame
|
DataFrame with the same index as the columns of the original FederatedDataFrame. The DataFrame must have the same number of rows as the original FederatedDataFrame has columns. |
required |
Returns:
Type | Description |
---|---|
FederatedDataFrame
|
new instance of the current object with updated graph. |
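Example
A minimal sketch, assuming the dummy data has the columns patient_id, age and weight used in the examples above; the 'score' column and its weights are made up for illustration:
import pandas as pd

df = FederatedDataFrame('data_cloudnode')
# transformation matrix: index = original columns, columns = new features,
# so that score = 0.0 * patient_id + 0.5 * age + 0.5 * weight
transformation = pd.DataFrame(
    {"score": [0.0, 0.5, 0.5]},
    index=["patient_id", "age", "weight"]
)
df_transformed = df.transform_columns(transformation)
df_transformed.preprocess_on_dummy()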
display_graph()
🔗
Converts the networkx DiGraph into pydot and outputs it as SVG.
Returns: SVG content
save_graph_as_image(filepath, image_format='svg')
🔗
Converts the networkx DiGraph into pydot and saves it as an image.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filepath |
path where to save the image on disk |
required |
image_format |
image format; supported formats are taken from the pydot library |
'svg'
|
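Example
A minimal sketch of display_graph and save_graph_as_image; the output file path is arbitrary:
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] + 100
svg_content = df.display_graph()  # SVG of the recorded computation graph
df.save_graph_as_image("preprocessing_graph.svg", image_format="svg")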
export()
🔗
Export FederatedDataFrame object as JSON which can be then imported when needed
Example
df = FederatedDataFrame('data_cloudnode')
df_json = df.export()
# store df_json and later:
df_imported = FederatedDataFrame(data_source=df_json)
# go on using df_imported as you would use df
Returns:
Type | Description |
---|---|
str
|
JSON-like string containing graph and node uuid |
preprocess_on_dummy()
🔗
Execute computations "recorded" inside the FederatedDataFrame object on the dummy data attached to the RemoteData object used during initialization.
If no dummy data is available, this method will fail. If you have data for
testing stored on your local machine, please use preprocess_on_files
instead.
Example
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] + 100
# executes the addition on the dummy data of 'data_cloudnode'
df.preprocess_on_dummy()
# the resulting dataframe is equivalent to:
df_raw = pandas.read_csv(
apheris_auth.RemoteData('data_cloudnode').dummy_data_path
)
df_raw["new_weight"] = df_raw["weight"] + 100
Returns:
Type | Description |
---|---|
DataFrame
|
resulting pandas.DataFrame after preprocessing has been applied to dummy data. |
preprocess_on_files(filepaths)
🔗
Execute computations "recorded" inside the FederatedDataFrame object on local data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filepaths |
Dict[str, str]
|
dictionary to overwrite the RemoteData used during FederatedDataFrame initialization with other data sources from your local machine. Keys are expected to be RemoteData IDs, values are expected to be file paths. |
required |
Example
df = FederatedDataFrame('data_cloudnode')
df["new_weight"] = df["weight"] + 100
df.preprocess_on_files({'data_cloudnode':
'myDirectory/local/replacement_data.csv'})
# the resulting dataframe is equivalent to:
df_raw = pd.read_csv('myDirectory/local/replacement_data.csv')
df_raw["new_weight"] = df_raw["weight"] + 100
Note that if the FederatedDataFrame merges multiple RemoteData objects and you don't specify all of their IDs in filepaths, dummy data is used for all "missing" IDs (if available; otherwise, an exception is raised).
Returns:
Type | Description |
---|---|
DataFrame
|
resulting pandas.DataFrame after preprocessing has been applied to the given files. |
LocalDebugDataset
🔗
__init__(dataset_id, gateway_id, dataset_fpath, permissions=None, policy=None)
🔗
Dataset class for LocalDebugSimpleStatsSessions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_id |
str
|
Name of the dataset. Allowed characters: letters, numbers, "_", "-", "." |
required |
gateway_id |
str
|
Name of a hypothetical gateway that this dataset resides on. Datasets with the same gateway_id will be launched into the same client. Allowed characters: letters, numbers, "_", "-", "." |
required |
dataset_fpath |
str
|
Absolute filepath to data. |
required |
policy |
dict
|
Policy dict. If not provided, we use empty policies. |
None
|
permissions |
dict
|
Permissions dict. If not provided, we allow all operations. |
None
|
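Example
A minimal sketch; the dataset ID, gateway ID and file path are placeholders for your own local setup:
dataset = LocalDebugDataset(
    dataset_id="patients_essex",
    gateway_id="gateway_1",
    dataset_fpath="/home/me/data/patients_essex.csv"
)
# policy and permissions are omitted: empty policies are used and all operations are allowed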
LocalDebugSimpleStatsSession
🔗
Bases: LocalSimpleStatsSession
For debugging Apheris Statistics computations locally on your machine. You can work
with local files, custom policies, and custom permissions. Inject the
LocalDebugSimpleStatsSession into a simple-stats computation.
To use the PDB debugger, it is necessary to set max_threads=1.
__init__(datasets, workspace=None, max_threads=None)
🔗
Inits a LocalDebugSimpleStatsSession.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets |
List[LocalDebugDataset]
|
A list of LocalDebugDataset objects. |
required |
workspace |
Union[str, Path]
|
path to use as workspace. If not provided, a temporary directory is used as workspace, and information is lost after a statistical query is finished. |
None
|
max_threads |
Optional[int]
|
The maximum number of parallel threads to use for the Flare simulator. This should be between 1 and the number of gateways used by the session. Note that debugging may fail for max_threads > 1. Default=1. |
None
|
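Example
A minimal sketch that reuses the LocalDebugDataset from the example above; max_threads=1 so that PDB debugging works:
session = LocalDebugSimpleStatsSession(
    datasets=[dataset],
    max_threads=1
)
# inject this session into a simple-stats computation via its session argument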
LocalDummySimpleStatsSession
🔗
Bases: LocalSimpleStatsSession
__init__(dataset_ids=None, workspace=None, policies=None, permissions=None, max_threads=None)
🔗
Inits a LocalDummySimpleStatsSession. When you use the session, DummyData,
policies and permissions are downloaded to your machine. Then a simulator runs on
your local machine. You can step into the code with a debugger to investigate
problems.
Instead of using the original policies and permissions, you can use custom ones.
This might be necessary if the DummyData datasets are too small to fulfil the
privacy constraints for your query. This comes with the downside that your
simulation deviates from a "real" execution.
To use the PDB debugger, it is necessary to set max_threads=1.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_ids |
List[str]
|
List of dataset IDs. For each dataset ID, a client is spun up that uses the dataset's DummyData as its data. The privacy policies and permissions of the specified datasets are applied automatically. |
None
|
workspace |
Union[str, Path]
|
path to use as workspace. If not provided, a temporary directory is used as workspace, and information is lost after a statistical query is finished. |
None
|
policies |
Optional[Dict[str, dict]]
|
Dictionary that defines an asset policy (value) per dataset ID (key) in dataset_ids. |
None
|
permissions |
Optional[Dict[str, dict]]
|
Dictionary that defines permissions (value) per dataset ID (key) in dataset_ids. |
None
|
max_threads |
Optional[int]
|
The maximum number of parallel threads to use for the Flare simulator. This should be between 1 and the number of gateways used by the session. Note that debugging may fail for max_threads > 1. Default=1. |
None
|
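Example
A minimal sketch; 'data_cloudnode' is the dataset ID used in the examples above, and custom policies/permissions are omitted so the original ones are used:
session = LocalDummySimpleStatsSession(
    dataset_ids=["data_cloudnode"],
    max_threads=1  # required if you want to step in with the PDB debugger
)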
provision(dataset_ids, client_n_cpu=0.5, client_memory=1000, server_n_cpu=0.5, server_memory=1000)
🔗
Create and activate a cluster of Compute Clients and a Compute Aggregator.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_ids |
List[str]
|
List of dataset IDs. For each dataset ID, a Compute Client will be spun up. |
required |
client_n_cpu |
float
|
number of vCPUs of Compute Clients |
0.5
|
client_memory |
int
|
memory of Compute Clients [MByte] |
1000
|
server_n_cpu |
float
|
number of vCPUs of Compute Aggregators |
0.5
|
server_memory |
int
|
memory of Compute Aggregators [MByte] |
1000
|
Returns:
SimpleStatsSession - use this session with simple statistics functions like
apheris_stats.simple_stats.tableone.
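Example
A minimal sketch; the dataset IDs are placeholders, and the import path follows the reference to apheris_stats.simple_stats.util.provision below:
from apheris_stats.simple_stats.util import provision

session = provision(
    dataset_ids=["dataset_essex", "dataset_norfolk"],
    client_n_cpu=0.5,
    client_memory=1000,
    server_n_cpu=0.5,
    server_memory=1000
)
# returns a SimpleStatsSession connected to the provisioned cluster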
PrivacyHandlingMethod
🔗
Bases: Enum
Defines the handling method when bounded privacy is violated.
Attributes:
Name | Type | Description |
---|---|---|
FILTER |
Filters out all groups that violate the privacy bound |
|
FILTER_DATASET |
Removes the entire dataset from the federated computation in case of privacy violations |
|
ROUND |
Only valid for counts; rounds to the privacy bound or 0 |
|
RAISE |
Raises a PrivacyException if the privacy bound is violated |
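Example
A minimal sketch of selecting a handling method; how PrivacyHandlingMethod is imported depends on your installation:
# filter out privacy-violating groups instead of raising an exception
handle_outliers = PrivacyHandlingMethod.FILTER
# pass this value to the handle_outliers parameter of simple-stats functions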
ResultsNotFound
🔗
Bases: Exception
SimpleStatsSession
🔗
Bases: StatsSession
__init__(compute_spec_id)
🔗
Inits a SimpleStatsSession that connects to a running cluster of Compute Clients
and an Aggregator. If you have no provisioned/activated cluster yet, then use
apheris_stats.simple_stats.util.provision
Parameters:
Name | Type | Description | Default |
---|---|---|---|
compute_spec_id |
UUID
|
Compute spec ID that corresponds to a running cluster of Compute Clients
and an Aggregator. (If you have no provisioned/activated cluster yet, then use
apheris_stats.simple_stats.util.provision.) |
required |
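Example
A minimal sketch; the compute spec ID below is a placeholder for the ID of your own running cluster:
from uuid import UUID

session = SimpleStatsSession(
    compute_spec_id=UUID("00000000-0000-0000-0000-000000000000")
)
# the session can now be passed to simple-stats computations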
get_module_functions(module)
🔗
Return a list of functions in module.
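Example
A minimal sketch; where get_module_functions itself is imported from depends on your installation:
from apheris_stats import simple_stats

# list the statistics functions available in the simple_stats module
print(get_module_functions(simple_stats))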