Machine Learning with Apheris 3.0 using the NVIDIA FLARE Simulator and Python CLI

This guide will walk you through how to run a model on the Apheris Platform using the CLI, including simulating your workflow on dummy data or data on your local file system.

The CLI provides both terminal and Python interfaces to interact with the Apheris 3.0 platform. It can be used to create, activate and deactivate compute specs, and to submit and monitor compute jobs.

In this tutorial, you will learn about the Apheris CLI and how it can be used to train a model for segmentation of Hippocampus MRI images into anterior and posterior regions using the nnU-Net model.

A quick introduction to nnU-Net

nnU-Net is an open-source model for biomedical image segmentation, based on the 2020 paper by Isensee et al. It is designed to configure itself for various datasets without manual intervention. By optimizing its entire pipeline for each new dataset, including preprocessing, network architecture, training, and post-processing, nnU-Net aims to address the complexities of creating deep learning models for medical image analysis.

Federating nnU-Net on Apheris

To bring nnU-Net into the Apheris environment, we use the pip-installable package for nnU-Net and add wrapping code to federate it using NVIDIA FLARE.

Fingerprinting, Planning and Pre-Processing

nnU-Net differs from many machine learning models in that it takes an Auto-ML approach to determine the model structure and any pre-processing required on the input data.

To do this, it uses a technique called fingerprinting: it calculates a set of statistics over the dataset, which serves as input to the planning phase, where it makes decisions on various aspects of the model structure. For more details, please see the nnU-Net paper.

To federate nnU-Net, Apheris nnU-Net must federate these fingerprint statistics, so that the resulting model takes into account the statistics of the overall combined dataset from all sites. The statistics themselves are sufficiently abstracted from the raw data that there are no privacy concerns with sharing them with the Orchestrator.

Therefore, we carry out fingerprinting on each Compute Gateway individually, then pass the fingerprints back to the Orchestrator for aggregation. The fingerprint statistics are then aggregated on the Orchestrator and returned to the client site where they are combined with the dataset descriptors that nnU-Net generates.
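As a rough illustration of how per-site statistics can be combined without sharing raw data, the sketch below pools a mean and standard deviation from several sites into global values. It is not the Apheris or nnU-Net implementation: the SiteStats type, its field names and the choice of statistics are assumptions made for this example, and the real fingerprint contains more than this.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class SiteStats:
    """Hypothetical per-site summary; not the real nnU-Net fingerprint."""
    n: int        # number of samples (e.g. voxels) seen at the site
    mean: float   # mean intensity at the site
    std: float    # population standard deviation at the site


def pool(sites):
    """Combine per-site summaries into one global summary."""
    total = sum(s.n for s in sites)
    mean = sum(s.n * s.mean for s in sites) / total
    # E[x^2] at each site equals std^2 + mean^2; average it across sites.
    second_moment = sum(s.n * (s.std ** 2 + s.mean ** 2) for s in sites) / total
    variance = max(second_moment - mean ** 2, 0.0)
    return SiteStats(total, mean, sqrt(variance))


pooled = pool([SiteStats(100, 10.0, 2.0), SiteStats(300, 14.0, 2.0)])
print(pooled)
```

Only the three numbers per site cross the network in this sketch, which is the property the paragraph above relies on.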

Once they receive the aggregated fingerprints from the Orchestrator, the Compute Gateways start the planning phase. This is deterministic with respect to the fingerprint input, so we expect all gateways to produce the same plan. A sanity check within the code ensures this is the case, since it is important that all Compute Gateways operate on the same model.
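One way such a consistency check could work is to compare a digest of each site's plan. The following is a minimal sketch under that assumption; the function names and the example plan dictionaries are hypothetical and do not reflect the actual check in the Apheris code.

```python
import hashlib
import json


def plan_digest(plan: dict) -> str:
    """Hash a plan in a canonical form so that key order does not matter."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plans_match(plans_by_site: dict) -> bool:
    """Return True when every site produced an identical plan."""
    digests = {plan_digest(plan) for plan in plans_by_site.values()}
    return len(digests) == 1


# Hypothetical plans: same content, different key order.
site_plans = {
    "gateway-1": {"patch_size": [40, 56], "batch_size": 366},
    "gateway-2": {"batch_size": 366, "patch_size": [40, 56]},
}
print(plans_match(site_plans))
```

Comparing digests rather than full plans keeps the exchanged message small and makes any mismatch easy to report.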

Finally, once we are confident that all plans are identical, we perform the preprocessing defined by the plan against the datasets in each Compute Gateway. The pre-processed data remains on the Compute Gateway for the duration of the Gateway's activation.

Since this functionality goes slightly beyond the standard training loop for machine learning, Apheris nnU-Net uses a number of custom FLARE components to enable it.

Training

Once the data is pre-processed, it is simple to train the model using nnU-Net's training functions. After one or more epochs, we load the weights, return them to the server, aggregate them and pass the aggregated model back to each Compute Gateway. nnU-Net is a parametric model, so we can apply FedAvg federation using the built-in NVIDIA FLARE ScatterAndGather component.
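To make the aggregation step concrete, here is a minimal pure-Python sketch of federated averaging. It assumes the textbook form of FedAvg, where each site's weights are averaged in proportion to its sample count; how the FLARE aggregator is configured in Apheris nnU-Net is not stated here, and the parameter name and values are made up for illustration.

```python
def fed_avg(updates):
    """Average weights across sites, weighted by sample count.

    updates: list of (num_samples, weights), where weights maps a
    parameter name to a flat list of floats.
    """
    total = sum(n for n, _ in updates)
    first_weights = updates[0][1]
    averaged = {}
    for name, values in first_weights.items():
        averaged[name] = [
            sum(n * w[name][i] for n, w in updates) / total
            for i in range(len(values))
        ]
    return averaged


# Hypothetical updates from two Compute Gateways.
site_a = (10, {"conv.weight": [1.0, 2.0]})
site_b = (30, {"conv.weight": [3.0, 6.0]})
global_weights = fed_avg([site_a, site_b])
print(global_weights)
```

The site with more samples pulls the average further towards its own weights, which is why the result sits closer to site_b.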

Once the requested number of rounds have completed, you can download the resulting checkpoint from the Orchestrator using the CLI. The checkpoint will be formatted for nnU-Net, with the appropriate metadata for inference.

Inference

Apheris nnU-Net also supports federated inference. In this case, the model weights are kept on each Compute Gateway alongside the data. For each dataset, we run prediction, compile the results into a zip archive, and return them to the server. The user can then download them with the NVIDIA FLARE workspace upon completion of their run. The checkpoint must be in a format compatible with nnU-Net, and must already reside in the Compute Gateway's Docker image. For more information on how to embed your checkpoint into your gateway image, please contact an Apheris representative.

Datasets

When training nnU-Net, your data needs to be organised in a specific format to allow it to be ingested correctly into the model. We assume that this data pre-formatting step has been undertaken before the dataset has been registered to Apheris. For more information on the specific format, please see the nnU-Net documentation.

Important

Please note that a federated computation across multiple datasets requires that each dataset resides in a different Gateway.

Parameters for Apheris nnU-Net

To run Apheris nnU-Net, you need to provide a set of parameters that tell the platform how to execute your model. In this section, we'll show you what those parameters are, and how you can use them to customise your federated workloads on Apheris.

When using the Apheris CLI, you provide parameters as a JSON payload. This payload should match the schema that is used in the model definition, to ensure your workload runs without error.

Training Payload

The payload is defined by this Pydantic model (no other fields are allowed):

class TrainingPayload(BaseModel):
    mode: Literal["training"]
    num_rounds: int
    dataset_id: int
    model_configuration: Literal["2d", "3d_lowres", "3d_fullres"]
    device: Literal["cpu", "cuda", "mps"] = "cuda"
    fold_id: Union[Literal["all"], int] = "all"
    trainer_name: str = "nnUNetTrainer"
    epochs_per_round: int = 1
    starting_fed_round: int = 0

num_rounds is the number of communication rounds for your training run.

dataset_id is the nnU-Net dataset ID, i.e. XYZ in DatasetXYZ_name_text. For example, if the dataset you wish to use is Dataset004_SiteA_Data, the dataset ID is 4.

model_configuration is the nnU-Net model configuration to train. Cascades are not currently supported.

device is the hardware device on which to train. Generally this should be cuda. On M* Apple hardware, you can use mps to accelerate training.

fold_id is the fold of the dataset to use. nnU-Net supports performing k-fold cross-validation during training. You can supply the fold number (0-4) to train only a specific fold of the dataset (see the nnU-Net documentation for more details). We recommend training with "all" when using federated training on Apheris as it is not currently possible to pass the splits file between submitted jobs.

trainer_name must be a valid trainer name from the nnU-Net code-base.

epochs_per_round is the number of times we pass through the dataset between federated aggregations.

starting_fed_round can be used if you wish to resume from an existing training run.
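The dataset_id rule described above can be sketched as a small helper that reads the numeric ID from an nnU-Net dataset directory name. The helper name is hypothetical, and the pattern assumes the DatasetXYZ_name layout described above.

```python
import re

# Matches names such as "Dataset004_SiteA_Data" and captures the digits.
_DATASET_DIR = re.compile(r"^Dataset(\d+)_.+$")


def dataset_id_from_name(dirname: str) -> int:
    """Return the integer dataset ID encoded in an nnU-Net directory name."""
    match = _DATASET_DIR.match(dirname)
    if match is None:
        raise ValueError(f"Not an nnU-Net dataset directory name: {dirname!r}")
    return int(match.group(1))


print(dataset_id_from_name("Dataset004_SiteA_Data"))
```

Converting with int() drops the leading zeros, which is why Dataset004 corresponds to a dataset_id of 4.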

Example:

{
    "mode": "training",
    "device": "mps",
    "num_rounds": 2,
    "model_configuration": "2d",
    "dataset_id": 4
}
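Before submitting a job, it can be handy to check a payload against the field names listed in the model above. The sketch below is a lightweight pre-flight check using only the standard library; it is not a substitute for the Pydantic validation performed by the platform, it does not check value types, and the helper name is hypothetical.

```python
import json

# Field names copied from the TrainingPayload model shown above.
ALLOWED_FIELDS = {
    "mode", "num_rounds", "dataset_id", "model_configuration", "device",
    "fold_id", "trainer_name", "epochs_per_round", "starting_fed_round",
}
# Fields without a default value in the model.
REQUIRED_FIELDS = {"mode", "num_rounds", "dataset_id", "model_configuration"}


def payload_problems(payload: dict) -> list:
    """Return a list of human-readable problems; empty if none were found."""
    keys = set(payload)
    problems = [f"missing required field: {n}" for n in sorted(REQUIRED_FIELDS - keys)]
    problems += [f"unknown field: {n}" for n in sorted(keys - ALLOWED_FIELDS)]
    return problems


example = {
    "mode": "training",
    "device": "mps",
    "num_rounds": 2,
    "model_configuration": "2d",
    "dataset_id": 4,
}
print(payload_problems(example))

# JSON string form of the payload.
payload_arg = json.dumps(example)
```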

Inference Payload

The payload is defined by this Pydantic model (no other fields are allowed):

class InferencePayload(BaseModel):
    mode: Literal["inference"]
    checkpoint_path: Path
    fold_id: Union[Literal["all"], int]
    device: Optional[Literal["cpu", "cuda", "mps"]] = "cuda"
    apheris_dataset_subdirs: Dict[str, str] = {}
    checkpoint_name: Literal["checkpoint_final.pth", "checkpoint_best.pth"] = (
        "checkpoint_final.pth"
    )

checkpoint_path is the location of the checkpoint on the Compute Gateway. Note that it is currently necessary to build a custom model image containing the checkpoint.

fold_id is the fold of the dataset to use - the dataset will be split and only the specified part of the split will be used. This must match whatever is in your checkpoint.

device is the hardware device on which to execute. Generally this should be cuda. On M* Apple hardware, you can use mps to accelerate inference.

apheris_dataset_subdirs maps Apheris dataset IDs to subdirectories. For datasets provided as IDs (as they will be in remote data runs), you can provide a subdirectory path inside the dataset that contains the data. For example, if your dataset is an nnU-Net formatted dataset, the inference data is inside imagesTs. The subdirectory path must not be absolute and must not point outside the dataset root, or the run will raise an error.

checkpoint_name selects whether to use the best or final weights from training. This must match whatever is in your checkpoint.
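The constraint on apheris_dataset_subdirs values (relative, and not escaping the dataset root) can be checked locally before submitting. This is a minimal sketch of such a check; the function name is hypothetical, it works on the path text only (no symlink resolution), and the platform's own validation may differ.

```python
from pathlib import PurePosixPath


def is_safe_subdir(subdir: str) -> bool:
    """True if subdir is relative and never climbs above the dataset root."""
    path = PurePosixPath(subdir)
    if path.is_absolute():
        return False
    depth = 0
    for part in path.parts:
        if part == "..":
            depth -= 1
            if depth < 0:  # stepped outside the dataset root
                return False
        else:
            depth += 1
    return True


for candidate in ("imagesTs", ".", "/data/imagesTs", "../other_dataset"):
    print(candidate, is_safe_subdir(candidate))
```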

Example:

{
    "mode": "inference",
    "device": "cpu",
    "fold_id": 0,
    "checkpoint_path": "/local/workspace/nnUNet_results/Dataset001_XYZ/nnUNetTrainer__nnUNetPlans__2d",
    "apheris_dataset_subdirs": {
        "medical-decathlon-task004-hippocampus-a_gateway-1_org-1": "imagesTs"
    }
}

Installation Instructions

To run this example, you'll need to install the Apheris CLI and its prerequisites.

The CLI installation instructions are covered only briefly here. For a more detailed guide, please see Guide: Getting started with Apheris CLI.

First, create a virtual environment then install the latest version of the Apheris Authentication and Apheris CLI packages using Python Wheels provided by your Customer Solutions Engineer (replacing x.y.z with the version number of your wheel files):

python3 -m venv apheris_nnunet
source apheris_nnunet/bin/activate
pip install apheris_auth-x.y.z-py3-none-any.whl
pip install apheris_cli-x.y.z-py3-none-any.whl

Important

With Apheris 3.2, there have been some large changes to the underlying structure of the Apheris-provided packages.

  • Apheris SDK has been removed and should no longer be installed.
  • All Apheris packages are now prefixed with apheris- (e.g. cli -> apheris-cli).
  • The authentication and authorisation components have moved to the new apheris-auth package, interaction code has moved to apheris-cli and the low-level statistics functions have moved to apheris-statistics.

If your environment has the apheris and cli packages installed from prior versions of Apheris, please first uninstall them as follows:

$ pip uninstall apheris
$ pip uninstall cli

You do not need to install anything else to run remote jobs on Apheris - the model is pre-built inside the environment, so you just have to use the CLI to provision resources and run the job. We'll cover that in more detail in a later section.

However, should you wish to use the simulator for trying out your training runs on local or dummy data before submitting them for execution on real data, you will need to install the Apheris nnU-Net requirements too. The following section will guide you through that installation and show you how to train a model using the simulator.

Note

The simulator wrapper exists in the Apheris nnU-Net model repository. To gain access to this repository, please speak to your Apheris representative.

Clone the nnU-Net model repository, following the instructions provided by your Apheris representative.

In the simulator directory of this repository, you will find scripts to wrap the NVIDIA FLARE simulator with the Apheris utilities and allow you to download dummy data using the Apheris platform. These have some additional requirements you need to install:

pip install -r <path_to_apheris_nnunet>/requirements.txt

You should also add the src directory of the repository to your PYTHONPATH:

export PYTHONPATH=<path_to_apheris_nnunet>/src

Running nnU-Net locally using the Simulator

In the simulator directory of the repository, you will find scripts to wrap the NVIDIA FLARE simulator with the Apheris utilities and allow you to download dummy data using the Apheris platform. This is helpful for testing / debugging your applications before running them on real world data.

The NVIDIA FLARE Simulator will create an NVIDIA FLARE workspace on your local machine. This mimics the behaviour of the production platform, except that the server and all clients are held on your local machine.

The simulation wrapper will create this path if it doesn't exist - you just need to supply a valid path, e.g. /path/to/workspace.

Because all datasets to be aggregated are assumed to share the same ID, and nnU-Net requires that each dataset ID exists only once, we currently support only a single dataset / gateway in simulated training mode. This limitation does not exist in remote execution mode.

Working with dummy data

Datasets can include what's known as dummy data, which serves as a stand-in for real data. This dummy data is designed to be representative of the actual data but without containing any sensitive information. It's recommended that when data custodians register their datasets, they also provide this dummy data alongside it.

For this tutorial, you will use the Medical Decathlon's "Hippocampus" dataset. This has been pre-registered in the Apheris environment and is available for you to experiment with. It has been split into 2 parts, and each part is registered to a different organisation.

The IDs for these datasets are:

  • medical-decathlon-task004-hippocampus-a_gateway-1_org-1
  • medical-decathlon-task004-hippocampus-b_gateway-2_org-2

Each of these datasets has a small number of images stored as dummy data, in the correct format for nnUNetv2.

The --payload parameter is the same as you would use when submitting a job with the CLI. It is either a JSON string or the path of a JSON file. The content contains the parameters that are required when submitting your job, in this case, training.

$ python3 -m simulator /path/to/nnunet_dataset_directory /path/to/workspace \
  --apheris-dataset-id medical-decathlon-task004-hippocampus-a_gateway-1_org-1 \
  --payload '{"mode":"training","device":"mps","num_rounds":2, "model_configuration": "2d", "dataset_id": 4}'

The simulator harness will then use the Apheris utilities to download and extract the dummy data for medical-decathlon-task004-hippocampus-a_gateway-1_org-1 onto the local machine and begin training. The value provided for dataset_id in the payload should match the data ID inside the downloaded dataset.

Starting from scratch: Formatting a dataset and using it locally

In this section, you'll start from scratch: downloading an example dataset from the Medical Decathlon Website, pre-formatting it for nnUNetv2 as per the instructions in the nnUNetv2 repository, and finally running a training job using the simulator.

First, visit the Medical Decathlon Website and click to enter the Google Drive. Download Task04_Hippocampus.tar into /path/to/Task04_Hippocampus.tar.

Extract the tarball:

$ tar -xf Task04_Hippocampus.tar

Next, you need to take that raw dataset and format it for nnU-Net. Helpfully, the nnU-Net repo provides a script to do this for the Medical Decathlon datasets: nnUNetv2_convert_MSD_dataset.

Before running the script, you need to set some environment variables to allow it to find the data.

Here, we'll assume you wish to store all the nnU-Net data inside /path/to/nnunet_dataset_directory:

export nnUNet_raw=/path/to/nnunet_dataset_directory/nnUNet_raw
export nnUNet_preprocessed=/path/to/nnunet_dataset_directory/nnUNet_preprocessed
export nnUNet_results=/path/to/nnunet_dataset_directory/nnUNet_results

For more information on these environment variables, please see the path environment variables documentation in the nnUnetv2 repository.

Now you can run the script to convert your dataset to an nnUnetv2 compatible dataset (this will create the directory <nnUNet_raw>/Dataset004_Hippocampus):

nnUNetv2_convert_MSD_dataset -i /path/to/Task04_Hippocampus -np 8

At this point, the dataset is correctly configured and present in the nnUNet_raw directory, so you can now use the simulator on the raw files with the following command:

$ python3 -m simulator /path/to/nnunet_dataset_directory /path/to/workspace  \
  --local-datasets <nnUNet_raw>/Dataset004_Hippocampus \
  --payload '{"mode":"training","device":"mps","num_rounds":2, "model_configuration": "2d", "dataset_id": 4}'

Here, since the dataset is already downloaded and configured for nnU-Net, we use the --local-datasets parameter to point to it, and again ensure that dataset_id in the payload matches the ID from the data (in this case, 4 to match Dataset004_Hippocampus).

Note

nnU-Net assumes that only one dataset exists with a given ID inside the nnUNet_raw directory, so it's important to keep this in mind when preparing multiple datasets.

Local datasets can be either a correctly formatted directory, or a zip-file containing a correctly formatted directory. If a zip-file is provided, the simulator wrapper will extract it into the nnUNet_raw directory.
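Since nnU-Net expects each dataset ID to appear only once inside nnUNet_raw, a quick scan of that directory can catch clashes before a run. The sketch below is one way to do this; the function name is hypothetical, and the demonstration builds a throwaway directory with made-up dataset names rather than touching a real nnUNet_raw.

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path


def find_duplicate_ids(nnunet_raw: Path) -> dict:
    """Map each dataset ID that occurs more than once to its directory names."""
    by_id = defaultdict(list)
    for entry in sorted(nnunet_raw.iterdir()):
        match = re.match(r"Dataset(\d+)_", entry.name)
        if entry.is_dir() and match:
            by_id[int(match.group(1))].append(entry.name)
    return {ds_id: names for ds_id, names in by_id.items() if len(names) > 1}


# Demonstration on a temporary directory standing in for nnUNet_raw.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("Dataset004_SiteA", "Dataset004_SiteB", "Dataset005_Other"):
        (root / name).mkdir()
    duplicates = find_duplicate_ids(root)

print(duplicates)
```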

Using the simulator for inference

You must first have downloaded a suitable checkpoint from a training run with appropriate data.

Provide the path to the checkpoint as the path to the configuration of the model you wish to use, e.g.:

/local/workspace/nnUNet_results/Dataset001_XYZ/nnUNetTrainer__nnUNetPlans__2d

Similarly to the training example, you can provide your data as Apheris datasets (with dummy data), local zip files to extract, or local directories.

At inference time, it is not necessary to have your data structured as per nnunet's training format, but if you do, you can provide the subdirectory containing the training images using the --local-dataset-subdirs argument (for both local directories and zip files). If provided, the number of subdirectories must equal the number of provided datasets of that type (use "." for a dataset that doesn't have a subdirectory if you need to supply this for others).

For dummy datasets downloaded using the Apheris utilities, you should provide the subdirectories in the payload as you would when running on remote data. The simulator will combine the subdir paths for local and dummy data before executing.

It is possible to provide a combination of any number of apheris, local zip and local directory datasets. The simulator will spawn one site per provided dataset and run them all in parallel. Please bear in mind that running too many parallel jobs is likely to stretch the resources of the local system.

The payload for inference can be quite long, so you can store it in a JSON file to make things easier.

Assume you have a file, payload.json, defining the inference payload as follows:

{
    "mode": "inference",
    "device": "cpu",
    "fold_id": 0,
    "checkpoint_path": "/local/workspace/nnUNet_results/Dataset001_XYZ/nnUNetTrainer__nnUNetPlans__2d",
    "apheris_dataset_subdirs": {
        "medical-decathlon-task004-hippocampus-a_gateway-1_org-1": "imagesTs"
    }
}

You can then pass this file to the simulator using the --payload parameter:

$ python -m simulator /path/to/nnunet_dataset_directory /path/to/workspace \
  --apheris-dataset-ids medical-decathlon-task004-hippocampus-a_gateway-1_org-1 \
  --local-datasets /path/to/local/data/Dataset005_MoreData \
  --local-dataset-subdirs "imagesTs" \
  --payload payload.json

The results will be stored in the server's workspace path:

/path/to/workspace/simulate_job/app_server/inference_results
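If you prefer to generate payload.json from Python rather than writing it by hand, a sketch along the following lines would work. It uses only the standard library and the example values from this section; the temporary directory is there purely so the example cleans up after itself, and in practice you would write to a path of your choosing.

```python
import json
import tempfile
from pathlib import Path

# Example inference payload, using the values shown in this section.
payload = {
    "mode": "inference",
    "device": "cpu",
    "fold_id": 0,
    "checkpoint_path": "/local/workspace/nnUNet_results/Dataset001_XYZ/nnUNetTrainer__nnUNetPlans__2d",
    "apheris_dataset_subdirs": {
        "medical-decathlon-task004-hippocampus-a_gateway-1_org-1": "imagesTs"
    },
}

with tempfile.TemporaryDirectory() as tmp:
    payload_file = Path(tmp) / "payload.json"
    # Write the payload, then read it back to confirm it is valid JSON.
    payload_file.write_text(json.dumps(payload, indent=4), encoding="utf-8")
    round_trip = json.loads(payload_file.read_text(encoding="utf-8"))

print(round_trip["mode"])
```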

Running nnUNetv2 in the Apheris environment

Now that you've tested your model locally, you can use the Apheris environment to run it on real data.

This section will introduce the Apheris CLI and show you how to run training on datasets that are registered via the Apheris Governance Portal.

The CLI: Getting help

The CLI provides documentation on the commands and their arguments. To access this, use the --help flag from the terminal CLI. This can either be done at root-level, or for a specific subset of commands.

apheris --help

You can also find a list of all the supported commands in the Python API Reference document.

Login

To interact with the Apheris environment, you must first log in. You can do this either using the terminal CLI directly, or from the CLI's Python interface as shown below:

import apheris
import aphcli
from aphcli.api import compute, job, models, datasets
from pathlib import Path
import time
apheris.login()

Output:

Logging in to your company account...
Apheris:
Authenticating with Apheris Cloud Platform...
Please continue the authorization process in your browser.
Gateway:
Authenticating with Apheris Compute environments...
Please continue the authorization process in your browser.

Login was successful

You can check your login status at any time, using the Apheris CLI:

import aphcli.utils
aphcli.utils.get_login_status()

This will return a tuple containing 4 values:

(is_logged_in: bool, user_email: str, user_org: str, user_environment: str)

For example:

(True, 'user.name@apheris.com', 'Apheris', 'production')
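When using this tuple in a script, giving the four positions names can make the code easier to read. The sketch below wraps the example tuple shown above in a NamedTuple; the LoginStatus class is a hypothetical helper for illustration and is not part of the Apheris CLI.

```python
from typing import NamedTuple


class LoginStatus(NamedTuple):
    """Hypothetical wrapper naming the four values of the status tuple."""
    is_logged_in: bool
    user_email: str
    user_org: str
    user_environment: str


# The example tuple from above; in a script this would be the returned value.
raw_status = (True, "user.name@apheris.com", "Apheris", "production")
status = LoginStatus(*raw_status)

if status.is_logged_in:
    print(f"Logged in as {status.user_email} ({status.user_environment})")
```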

Viewing Available Datasets and Models

Important

Apheris 3.2 introduces a number of API changes in order to simplify the package architecture. Please note that the commands below may differ from those used in earlier versions.

Once logged into Apheris, you may have access to a number of datasets based on the asset policies that have named you as a beneficiary. To see what datasets you have access to, you can use the datasets API to list them, and their associated data custodian:

datasets.list_datasets()

Output:

+-----+---------------------------------------------------------+--------------+----------------------+
| idx |                         dataset_id                      | organization |    data custodian    |
+-----+---------------------------------------------------------+--------------+----------------------+
|  0  | medical-decathlon-task004-hippocampus-a_gateway-1_org-1 |     Org 2    |      Orsino Hoek     |
|  1  | medical-decathlon-task004-hippocampus-b_gateway-2_org-2 |     Org 1    |   Agathe McFarland   |
+-----+---------------------------------------------------------+--------------+----------------------+

Note

Prior to Apheris 3.2, the list returned by apheris.list_datasets() was a Pandas DataFrame. To reduce the dependency overhead of the CLI, results are now returned as a prettytable.PrettyTable by default. It is also no longer possible to return a RemoteData object, as that has been removed from the user-facing API.

The list returned by datasets.list_datasets() is a prettytable.PrettyTable, which is designed for producing a human-readable list. To use the list programmatically, you can call datasets.list_datasets(to_table=False), which will instead return a list of dictionaries, with each element containing the details about the dataset.

For prediction with nnU-Net, this tutorial will use the Medical Decathlon's Hippocampus dataset, so the following command can be used to filter the list of datasets to include only those with hippocampus in their ID:

all_datasets = datasets.list_datasets(to_table=False)
hippocampus_datasets = [d["slug"] for d in all_datasets if "hippocampus" in d["slug"]]

Output:

['medical-decathlon-task004-hippocampus-a_gateway-1_org-1', 'medical-decathlon-task004-hippocampus-b_gateway-2_org-2']

You can also list which models are available in the model registry using the models API:

models.list_models()

Output:

+-----+---------------------------+-------------------------------------+
|  id |            name           |               version               |
+-----+---------------------------+-------------------------------------+
|  0  |       apheris-nnunet      |                0.4.0                |
|  1  |     apheris-statistics    |                0.12.0               |
| ... |            ...            |                 ...                 |
+-----+---------------------------+-------------------------------------+

Compute Spec

A "Compute Spec" is a specialized contract that encapsulates the execution environment and parameters for securely running statistics functions and machine learning models on specified datasets. Think of it as a blueprint that includes:

  • Dataset ID(s): Each Compute Spec is linked to a specific dataset or datasets by an identifier. This ID is used to identify which Compute Gateways the code runs on.
  • Model: This is defined through a Docker image that contains the pre-configured environment in which the code will execute. It ensures consistency and reproducibility by packaging the code, runtime, system tools, system libraries, and settings.
  • Max Compute resources: Here, the Compute Spec details the required computing resources, specifically the type and number of virtual CPUs, the amount of RAM, and any GPU requirements. This ensures that the computation will run on infrastructure that's equipped to handle the task's demands.

In essence, a Compute Spec is a comprehensive contract that delineates how, where, and on what data a piece of code can execute within the respective Compute Gateway.

To define the compute spec, we specify the dataset(s), hardware infrastructure requirements, and model.

Since this example shows training on a relatively simple nnU-Net model structure with a small dataset, the compute requirements are relatively low, though we still require a GPU. For training 3d models, at least 12GB of gateway memory are required.

You'll see that the compute requirements on the server are much lower - this is because only aggregation is performed on the server, which is a relatively low complexity operation.

Important

From Apheris 3.2, the withCustomCode field has been removed from the compute spec. Please update your scripts accordingly.

client_resource_cfg = {
    "client_n_cpu": 1,
    "client_n_gpu": 1,
    "client_memory": 8000
}

server_resource_cfg = {
    "server_n_cpu": 0.25,
    "server_n_gpu": 0,
    "server_memory": 2000
}

# These datasets are on GW2 and GW1 respectively. You could also run with only one of them
compute_spec_id = compute.create_from_args(
    [
        'medical-decathlon-task004-hippocampus-a_gateway-1_org-1',
        'medical-decathlon-task004-hippocampus-b_gateway-2_org-2'
    ],
    **client_resource_cfg,
    **server_resource_cfg,
    model_id="apheris-nnunet",
    model_version="0.4.0"
)

Important

The model version provided above is the latest at time of writing, but please use models.list_models() to find the versions to which you have access.

The compute spec ID will be returned as a UUID object:

UUID('d086dce5-d89e-45e9-a6ff-82d13872ad63')

If you're not sure what values you want to put in the compute spec, you can use interactive mode to guide you through the creation. For example, the following snippet doesn't provide the datasets, so you'll be asked to select them:

from aphcli.api.models import Model

cs = compute.ComputeSpec(
    model=Model("apheris-nnunet", "0.4.0"),
    **client_resource_cfg, 
    **server_resource_cfg,
)
cs.ask_for_empty_inputs()

Output:

## List of available datasets:

+-----+---------------------------------------------------------+--------------+---------------------------+
| idx |                        dataset_id                       | organization |       data custodian      |
+-----+---------------------------------------------------------+--------------+---------------------------+
|  0  |          cancer-medical-images_gateway-2_org-2          |    Org 2     |        Orsino Hoek        |
|  1  |          pneumonia-x-ray-images_gateway-2_org-2         |    Org 2     |        Orsino Hoek        |
|  2  |            covid-19-patients_gateway-1_org-1            |    Org 1     |      Agathe McFarland     |
|  3  | medical-decathlon-task004-hippocampus-a_gateway-1_org-1 |    Org 1     |      Agathe McFarland     |
|  4  | medical-decathlon-task004-hippocampus-b_gateway-2_org-2 |    Org 2     |        Orsino Hoek        |
| ... |                           ...                           |     ...      |            ...            |
+-----+---------------------------------------------------------+--------------+---------------------------+

Please list the indices of all datasets that you want to include. Separate the numbers by comma, semicolon or space:
:3,4

You have selected following indices: [3, 4]

They correspond to following dataset IDs:
 ['medical-decathlon-task004-hippocampus-a-files_gateway-1_org-1', 'medical-decathlon-task004-hippocampus-b_gateway-2_org-2']

Now you can run the following to create the compute spec:

compute.create(cs)

Once created, you can see the full JSON that was used to create a compute spec using the get function:

spec = compute.get()
print(spec.to_json())

Output:

{
   "datasets": [
      "medical-decathlon-task004-hippocampus-a-files_gateway-1_org-1",
      "medical-decathlon-task004-hippocampus-b-files_gateway-2_org-2"
   ],
   "resources": {
      "clients": {
         "cpu": 1,
         "gpu": 1,
         "memory": 8000
      },
      "server": {
         "cpu": 0.25,
         "gpu": 0,
         "memory": 2000
      }
   },
   "model": {
      "id": "apheris-nnunet",
      "version": "0.4.0"
   }
}

You could save this JSON to a file and later use it with the CLI when creating a new compute spec to save writing the full specification each time. This requires the terminal interface:

apheris compute create --json /path/to/json/file.json

You can learn more about the terminal interface in the Getting started with Apheris CLI guide.
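If you want to write such a JSON file yourself, the structure shown in the output above can be assembled with the standard library. This is a sketch only: the helper name is hypothetical, and it simply mirrors the layout of the printed JSON rather than calling any Apheris API.

```python
import json


def build_compute_spec(datasets, model_id, model_version, clients, server):
    """Assemble a dict with the same layout as the compute spec JSON above."""
    return {
        "datasets": list(datasets),
        "resources": {"clients": dict(clients), "server": dict(server)},
        "model": {"id": model_id, "version": model_version},
    }


spec_dict = build_compute_spec(
    datasets=[
        "medical-decathlon-task004-hippocampus-a_gateway-1_org-1",
        "medical-decathlon-task004-hippocampus-b_gateway-2_org-2",
    ],
    model_id="apheris-nnunet",
    model_version="0.4.0",
    clients={"cpu": 1, "gpu": 1, "memory": 8000},
    server={"cpu": 0.25, "gpu": 0, "memory": 2000},
)

# Text that could be saved to a file and passed to the --json option.
spec_json = json.dumps(spec_dict, indent=3)
print(spec_json)
```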

Activating the Compute Spec

Once a compute spec is ready_to_activate, you can activate it. This deploys a cluster of Compute Clients and an Orchestrator (the central server for aggregation). When you call the compute.activate() function without the compute_spec_id argument, the CLI uses a cached one: the last compute_spec_id that was used successfully.

Important

Once a compute spec is activated, the resources assigned to it are reserved and the infrastructure is live. It is good practice to deactivate your compute spec once finished with it to avoid capacity issues and/or incurring additional infrastructure costs.

Let's activate a compute spec!

compute.activate()

Output:

On 2024-04-02 20:05:36 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2

There will be a short delay (up to 2-3 mins on first run) while the compute spec is activated.

You can check the status of this as follows:

compute.get_activation_status()

Note

Prior to Apheris 3.2, this function was called compute.get_deploy_status(). That function still exists for 3.2, but will output a deprecation warning and will be removed for 3.3. Please update your scripts accordingly.

Output:

On 2024-04-02 20:14:49 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2
'creating'

Note

The internal compute spec creation mechanism has changed with Apheris 3.2 and the compute spec is ready to activate immediately after creation. Therefore the "stage" field of the status response has been removed, as has the verbose mode of apheris compute status.

Activation status will be one of the following values:

  • waiting_activation: The compute spec has been created, but has never been activated.
  • creating: The infrastructure has been requested and is starting up. This is the initial status after calling compute.activate().
  • running: The compute spec is ready for workloads to be submitted.
  • deleting: The infrastructure is shutting down. This is what you see after calling compute.deactivate().
  • shutdown: The shut down has completed.
  • failed: An error has occurred while provisioning infrastructure.

We can poll the status of the compute spec until it's ready:

compute.wait_until_running(compute_spec_id)
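For illustration, the polling described above could be sketched as follows, with a timeout and an early exit when the status can no longer become running. This is not how compute.wait_until_running is implemented; the helper name wait_for_running and its parameters are hypothetical, and it assumes you pass in a callable that returns one of the status strings listed above.

```python
import time
from typing import Callable

# Statuses from which the compute spec will not reach "running" on its own.
STOP_STATES = {"failed", "deleting", "shutdown"}


def wait_for_running(
    get_status: Callable[[], str],
    timeout_s: float = 600.0,
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns "running" (hypothetical helper)."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "running":
            return status
        if status in STOP_STATES:
            raise RuntimeError(f"Compute spec entered state {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout_s} seconds")
        sleep(poll_s)
```

Assuming compute.get_activation_status returns the status string as in the output above, a call might look like wait_for_running(compute.get_activation_status).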

There is also a longer-form status command that retrieves more detailed information about the state of the Orchestrator and the Compute Gateways involved in your compute spec.

In particular, this can be helpful to troubleshoot issues, such as out-of-capacity errors, which would be shown as below:

compute.get_status()

Output:

{
    "status": "creating",
    "message": {
        "5a873d08-7f58-4a5f-953c-87a1b7726a0b": "The computation has been scheduled: 0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient nvidia.com/gpu. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod.",
        "f44f2052-659a-43fd-84f8-8942627d222c": "The computation has been scheduled: no details",
        "orchestrator": "The computation has been scheduled: no details"
    }
}
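When several Gateways are involved, it can help to pull out only the components that report a capacity problem. The sketch below assumes the status is available as a Python dictionary with the same shape as the output above; the helper name find_capacity_issues is hypothetical, and the "Insufficient <resource>" wording is taken from the example message rather than from a documented format.

```python
import re
from typing import Dict, List


def find_capacity_issues(status: dict) -> Dict[str, List[str]]:
    """Map each component to the resources its message reports as insufficient."""
    issues: Dict[str, List[str]] = {}
    for component, message in status.get("message", {}).items():
        # Matches e.g. "Insufficient cpu" or "Insufficient nvidia.com/gpu."
        found = re.findall(r"Insufficient ([\w./-]+)", message)
        resources = sorted({name.rstrip(".,") for name in found})
        if resources:
            issues[component] = resources
    return issues
```

Components whose message contains no such phrase (such as "no details" above) are simply left out of the result.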

Listing your compute specs

To help you track which compute specs you have created, and their states, you can use the CLI to retrieve a list of your compute specs. By default this just shows the IDs of all your compute specs (ordered newest to oldest):

compute.list_compute_specs()

Important

Note that with Apheris 3.2, the compute spec format has changed considerably, with a number of fields related to custom code behaviour removed from the returned object.

The output of the command above is a list of compute spec objects, which contain the data from the compute spec and the approval status information:

{
    "id": "8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2",
    "createdAt": "2024-04-02T18:55:41.22733804Z",
    "createdBy": {
       ...
    },
    "updatedAt": "2024-04-02T18:55:41.22733804Z",
    "contextURI": "",
    "datasets": ["medical-decathlon-task004-hippocampus-a_gateway-1_org-1"],
    "model": {"id": "apheris-nnunet", "version": "1.0.0"},
    "resources": {
        "server": {"cpu": 0.25, "memory": 2000, "gpu": 0, "Replicas": 1},
        "clients": {"cpu": 1, "memory": 8000, "gpu": 1, "Replicas": 1},
    },
}
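If you work with many compute specs, a one-line summary per spec can be easier to scan than the full object. The following is a minimal sketch that assumes each compute spec is a dictionary with the fields shown in the example above; summarise_spec is a hypothetical helper, not part of the Apheris CLI.

```python
def summarise_spec(spec: dict) -> str:
    """Return a one-line summary of a compute spec dictionary."""
    model = spec.get("model", {})
    clients = spec.get("resources", {}).get("clients", {})
    return (
        f"{spec['id']} | {model.get('id', '?')}:{model.get('version', '?')} | "
        f"{len(spec.get('datasets', []))} dataset(s) | "
        f"client GPU: {clients.get('gpu', 0)}"
    )
```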

If you want more information, including the current status of the compute specs, you can use the helper function get_compute_specs_details. We suggest limiting this query to around 10 compute specs for the sake of speed.

compute_specs = compute.list_compute_specs()

# Extract the 10 latest compute specs
compute_specs = compute_specs[:10]

# Get the extra data
compute_spec_data = compute.get_compute_specs_details(compute_specs)

Using the bash CLI, you can simply provide the --verbose flag to get a formatted table (use --very-verbose to prevent truncation of fields).

$ apheris compute list -v --limit 2
+--------------------------------------+---------------------+----------------------+-------------------------------+------------------+------------+
|                  ID                  |       Created       |        Model         |            Datasets           | Resources        | Activation |
|                                      |                     |                      |                               |                  |   Status   |
+--------------------------------------+---------------------+----------------------+-------------------------------+------------------+------------+
| 15fdbcaa-80cb-4bb2-9b55-fdb1111d6284 | 2024-06-24 11:29:22 | apheris-nnunet:0.4.0 |           2 datasets          | Orchestrator:    |  shutdown  |
|                                      |                     |                      |                               |   CPU: 1         |            |
|                                      |                     |                      |                               |   GPU: 0         |            |
|                                      |                     |                      |                               |   Memory: 2000MB |            |
|                                      |                     |                      |                               | Gateway:         |            |
|                                      |                     |                      |                               |   CPU: 1         |            |
|                                      |                     |                      |                               |   GPU: 0         |            |
|                                      |                     |                      |                               |   Memory: 2000MB |            |
+--------------------------------------+---------------------+----------------------+-------------------------------+------------------+------------+
| 0e2ac20b-de6b-4fa3-880b-44ec66368775 | 2024-06-24 11:41:38 | apheris-nnunet:0.4.0 | medical-decat...teway-1_org-1 | Orchestrator:    |  shutdown  |
|                                      |                     |                      |                               |   CPU: 1         |            |
|                                      |                     |                      |                               |   GPU: 0         |            |
|                                      |                     |                      |                               |   Memory: 2000MB |            |
|                                      |                     |                      |                               | Gateway:         |            |
|                                      |                     |                      |                               |   CPU: 1         |            |
|                                      |                     |                      |                               |   GPU: 1         |            |
|                                      |                     |                      |                               |   Memory: 8000MB |            |
+--------------------------------------+---------------------+----------------------+-------------------------------+------------------+------------+

Submitting Jobs

Models in the Apheris Model Registry accept parameters to allow them to be configured or to perform different tasks.

To submit workloads to the Apheris environment, you use the jobs API to submit those parameters as a Python dictionary (or a JSON object if using the bash CLI) to the Secure Runtime inside the Apheris Orchestrator.

The Secure Runtime then validates those parameters to ensure they are acceptable for the model (checking types, numerical bounds, etc.), before submitting a federation job for execution on the Orchestrator and Compute Gateways.

Here, you'll submit a simple 2D training job using the datasets you defined in the compute spec:

params = {
    "mode": "training",
    "model_configuration": "2d", # 2d / 3d / 3d_fullres
    "dataset_id": 4, # The nnU-Net dataset ID, i.e. Dataset004...
    "num_rounds": 1, # The number of federation rounds
    "device": "cuda" # Train on GPU
}

job_id = job.submit(params)

Similar to the compute specs, the job_id you will receive back is a UUID object.

To submit a job with the terminal CLI, you provide the arguments as a JSON payload:

$ apheris job run --payload '{"mode": "training", "model_config": "2d", "dataset_id": 4, "num_rounds": 1, "device": "cpu"}'

On 2024-04-02 20:14:49 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2
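If you keep your parameters in Python but want to run the job from a terminal, you can generate the JSON payload rather than typing it by hand, which avoids quoting mistakes. This is a minimal sketch: build_job_run_command is a hypothetical helper, and it only assembles the apheris job run --payload command shown above as a string. The parameter names in the dictionary must match what the model expects.

```python
import json
import shlex


def build_job_run_command(params: dict) -> str:
    """Build a shell-safe `apheris job run` command from a parameter dictionary."""
    payload = json.dumps(params)
    # shlex.quote wraps the JSON in single quotes so the shell passes it through unchanged.
    return f"apheris job run --payload {shlex.quote(payload)}"


print(build_job_run_command({"mode": "training", "num_rounds": 1}))
```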

You can check the status of your job using job.status().

job.status()

Output:

On 2024-04-02 20:14:49 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2

On 2024-04-02 20:45:43 you have used the job ID `2312560a-29a4-4aa4-9166-ba57852b0e03`.
We will use the job ID 2312560a-29a4-4aa4-9166-ba57852b0e03.
'RUNNING'

When you submit a job, or run one of the job functions below, the ID used is cached and used by default for further commands.

Alternatively, to check the status of a different job simply enter its ID as a parameter.

You can also list all jobs you've submitted since activating the compute spec, along with their statuses:

job.list_jobs()

Output:

On 2024-04-02 20:14:49 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2
[{'duration': '0:01:21.143068', 'job_id': '2312560a-29a4-4aa4-9166-ba57852b0e03', 'job_name': 'training', 'status': 'RUNNING', 'submit_time': '2024-04-02T19:45:43.587616+00:00'}]
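The returned list can be filtered like any other list of dictionaries. The sketch below assumes the entries have the keys shown in the output above and that durations are under a day, formatted as H:MM:SS.ffffff. Both helper names are hypothetical.

```python
from datetime import timedelta
from typing import Dict, List


def parse_duration(text: str) -> timedelta:
    """Parse a duration such as '0:01:21.143068' (assumes less than one day)."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


def jobs_with_status(jobs: List[Dict[str, str]], status: str) -> List[str]:
    """Return the job IDs whose status matches exactly."""
    return [job["job_id"] for job in jobs if job["status"] == status]
```

For example, jobs_with_status(jobs, "RUNNING") gives the IDs of all jobs that are still computing.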

Monitoring your job

You can view the logs of your currently running training job using the job.logs() command:

>>> job.logs()

On 2024-04-02 20:14:49 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2

On 2024-04-02 20:45:43 you have used the job ID `2312560a-29a4-4aa4-9166-ba57852b0e03`.
We will use the job ID 2312560a-29a4-4aa4-9166-ba57852b0e03.
"2024-04-02 19:45:47,056 - runner_process - INFO - Runner_process started.\n2024-04-02 19:45:47,095 - CoreCell - INFO - server.2312560a-29a4-4aa4-9166-ba57852b0e03: created backbone internal connector to tcp://localhost:13461 on parent\n2024-04-02 19:45:47,095 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE tcp://localhost:13461] is starting\n2024-04-02 19:45:47,097 - Cell - INFO - Register blob CB for channel='server_command', topic='*'\n2024-04-02 19:45:47,097 - Cell - INFO - Register blob CB for channel='aux_communication', topic='*'\n2024-04-02 19:45:47,097 - ServerCommandAgent - INFO - ServerCommandAgent cell register_request_cb: server.2312560a-29a4-4aa4-9166-ba57852b0e03\n2024-04-02 19:45:47,102 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 127.0.0.1:55436 => 127.0.0.1:13461] is created: PID: 30\n2024-04-02 19:45:47,153 - ServerRunner - INFO - [identity=Unnamed_project_54d5653d-93af-4433-a32d-7774ca74f3dc, run=2312560a-29a4-4aa4-9166-ba57852b0e03]: Server runner starting ...\n2024

Important

job.logs() is only applicable to jobs that are currently in the RUNNING state, as it streams the logs directly from the computation container. To download logs for a job that has finished (whether it completed, errored or was aborted), please use apheris job download-results.

In order to protect the privacy of the data on the Compute Gateway, you only have access to the Orchestrator logs; however Apheris models will send sanitised logs from the Compute Gateways to the Orchestrator, so you can see what's happening there too.

You can find these with the prefix "ClientLogForwarder", they look like this:

"2024-04-02 19:48:20,444 - ClientLogForwarder - INFO - [identity=Unnamed_project_54d5653d-93af-4433-a32d-7774ca74f3dc, run=2312560a-29a4-4aa4-9166-ba57852b0e03, wf=fingerprint]: Message from '5a873d08-7f58-4a5f-953c-87a1b7726a0b [INFO]': Fingerprint stage complete, sending to server"

If your job errors on the Gateway in an unexpected way, the Apheris wrapper will catch the error and sanitise it before returning it to you via the ClientLogForwarder. This means you'll be able to see where in the code the error happens, but not the values of any variables.
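Because the Orchestrator log mixes server messages with forwarded Gateway messages, it can be useful to extract only the ClientLogForwarder entries. The sketch below assumes the logs are available as a single newline-separated string and that forwarded lines follow the format of the example above. The helper name and the regular expression are illustrative assumptions, not a documented log format.

```python
import re
from typing import List, Tuple

# Pattern derived from the example line above, not from a documented format.
FORWARDED = re.compile(
    r"ClientLogForwarder - \w+ - .*?"
    r"Message from '(?P<gateway>[0-9a-f-]+) \[(?P<level>\w+)\]': (?P<text>.*)"
)


def forwarded_messages(logs: str) -> List[Tuple[str, str, str]]:
    """Return (gateway_id, level, message) for each forwarded Gateway log line."""
    messages = []
    for line in logs.splitlines():
        match = FORWARDED.search(line)
        if match:
            messages.append((match["gateway"], match["level"], match["text"]))
    return messages
```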

When a job completes, it will report its status as FINISHED:COMPLETED. Other statuses you might encounter are:

  • SUBMITTED: The job has been sent to the system, and it is queued for running. If you see a job sitting in this state for prolonged periods, it might indicate an issue with the activated infrastructure.
  • RUNNING: The job has been started and is currently computing.
  • FINISHED:COMPLETED: The job has successfully finished.
  • FINISHED:ABORTED: The job was aborted (using apheris job abort while running).
  • FINISHED:EXECUTION_EXCEPTION: The job errored. To understand why, you can use apheris job download-results and investigate the logs.
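The statuses above follow a PHASE or PHASE:OUTCOME pattern, so a script can separate "has the job stopped?" from "did it succeed?". The helpers below are a small sketch with hypothetical names, assuming the status is the string returned by the status call.

```python
from typing import Optional, Tuple


def split_job_status(status: str) -> Tuple[str, Optional[str]]:
    """Split e.g. 'FINISHED:ABORTED' into ('FINISHED', 'ABORTED')."""
    phase, _, outcome = status.partition(":")
    return phase, (outcome or None)


def job_succeeded(status: str) -> bool:
    """True only for a job that finished without being aborted or erroring."""
    return split_job_status(status) == ("FINISHED", "COMPLETED")
```

Checking job_succeeded before downloading results helps you distinguish a trained checkpoint from a workspace that only contains logs of a failed run.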

You can poll for the job status until it shows the job is FINISHED, then download the result. This should take roughly 10 minutes for a single federation round of training (1 epoch over the Hippocampus dataset):

import time

# Poll every 10 seconds until the job reaches any FINISHED state
while True:
    status = job.status()
    print(status)
    if "FINISHED" in status:
        break
    time.sleep(10)

Downloading Results

The result is the workspace of the NVIDIA FLARE server. This contains:

  • The logs from the server
  • Any files that were stored in the workspace as part of the task (in this case, the aggregated final checkpoint)
  • The job configurations that were created by the secure_runtime and define the workflow
  • Some metadata about the run (statistics such as round latency, etc.)

Once the status is FINISHED:COMPLETED, you can download the results using the download_results command:

from pathlib import Path

download_path = Path("./training_results")
job.download_results(download_path)

This will download the workspace as described above, including the aggregated checkpoint inside the app_server directory:

$ tree training_results/workspace/
training_results/workspace/
├── app_server
│   ├── <name of checkpoint>
│   │   └── nnUNetTrainer__nnUNetPlans__2d
│   │       ├── dataset.json
│   │       ├── dataset_fingerprint.json
│   │       ├── fold_all
│   │       │   └── checkpoint_final.pth
│   │       └── plans.json
│   ├── FL_global_model.pt
│   └── config
│       ├── config_fed_client.json
│       └── config_fed_server.json
├── fl_app.txt
├── log.txt
├── meta.json
└── stats_pool_summary.json

In the output above, you can see a directory <name of checkpoint>. This is the trained nnU-Net checkpoint, which can now be used for local and/or centralised inference, or to initialise further centralised training.
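Since the checkpoint directory name varies, you may prefer to locate the final checkpoint programmatically instead of hard-coding the path. This is a minimal sketch using only the standard library; find_final_checkpoints is a hypothetical helper, and it assumes the file is named checkpoint_final.pth as in the tree above.

```python
from pathlib import Path
from typing import List, Union


def find_final_checkpoints(results_dir: Union[str, Path]) -> List[Path]:
    """Recursively find every checkpoint_final.pth below results_dir."""
    return sorted(Path(results_dir).rglob("checkpoint_final.pth"))
```

For the download above, find_final_checkpoints("./training_results") would be expected to return a single path inside the app_server directory.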

Important

The storage attached to a compute spec is not persistent. This means that when you deactivate a compute spec, any results stored on the compute pod are deleted along with the deployment.

Please make sure you download any results or logs you might need before deactivation.

Clean up

Now that your job is completed, it is important to deactivate the compute spec to shut down the attached infrastructure:

compute.deactivate()
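In scripts, it is easy to skip deactivation when an earlier step raises an exception. One way to guard against that is a small context manager that always runs the deactivation step. The sketch below is an illustration rather than part of the Apheris CLI: activated is a hypothetical helper that takes the activate and deactivate functions as arguments.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def activated(activate: Callable[[], object], deactivate: Callable[[], object]) -> Iterator[None]:
    """Call activate() on entry and always call deactivate() on exit."""
    activate()
    try:
        yield
    finally:
        # Runs even if the body raised, so the infrastructure is not left live.
        deactivate()
```

Under that assumption, a workflow could be wrapped as `with activated(compute.activate, compute.deactivate): ...`, with job submission and result download inside the block.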

Summary

In this tutorial, you have been introduced to the nnU-Net machine learning model as an example of how to run your model on the Apheris environment.

You have seen how to use the NVIDIA FLARE Simulator to test your model locally on dummy data, then been guided through how to run the same workload on the Apheris environment with real data using the Apheris Python API.

If you'd like to learn more about how to interact with the Apheris environment, we recommend the Getting started with Apheris CLI guide. To find out more about how to use the Statistics package to analyse data in a secure and privacy-preserving way, check out our guide to Simulating and Running Statistics.