
Machine Learning with Apheris using the NVIDIA FLARE Simulator and Python CLI

This guide will walk you through how to run a model on the Apheris Platform using the CLI, including simulating your workflow on dummy data or data on your local file system.

The CLI provides both terminal and Python interfaces to interact with the Apheris 3.0 platform. It can be used to create, activate and deactivate compute specs, and to submit and monitor compute jobs.

In this tutorial, you will learn about the Apheris CLI and how it can be used to train a model for segmentation of Hippocampus MRI images into anterior and posterior regions using the nnU-Net model.

What is nnU-Net?

To learn more about Apheris Federated nnU-Net, please check out our Model Registry: nnU-Net guide.

Installation Instructions

To run this example, you'll need to install the Apheris CLI and its prerequisites.

First, install the Apheris CLI package by following the Quickstart: Installing the Apheris CLI guide.

To run remote jobs on Apheris, that's it! You do not need to install anything else - the model is pre-built inside the environment, so you just have to use the CLI to provision resources and run the job. We'll cover that in more detail in a later section.

However, should you wish to use the simulator for trying out your training runs on local or dummy data before submitting them for execution on real data, you will need to install the Apheris nnU-Net requirements too. The following section will guide you through that installation and show you how to train a model using the simulator.

Note

The simulator wrapper exists in the Apheris nnU-Net model repository. To gain access to this repository, please speak to your Apheris representative.

Clone the nnU-Net model repository, following the instructions provided by your Apheris representative.

In the simulator directory of this repository, you will find scripts to wrap the NVIDIA FLARE simulator with the Apheris utilities and allow you to download dummy data using the Apheris platform. These have some additional requirements you need to install:

pip install -r <path_to_apheris_nnunet>/requirements.txt

You should also add the src directory of the repository to your PYTHONPATH:

export PYTHONPATH=<path_to_apheris_nnunet>/src

Running nnU-Net locally using the Simulator

As noted above, the simulator directory of the repository contains scripts that wrap the NVIDIA FLARE simulator with the Apheris utilities and allow you to download dummy data using the Apheris platform. This is helpful for testing and debugging your applications before running them on real-world data.

The NVIDIA FLARE Simulator will create an NVIDIA FLARE workspace on your local machine. This mimics the behaviour of the production platform, except that the server and all clients are held on your local machine.

The simulation wrapper will create this path if it doesn't exist - you just need to supply a valid path, e.g. /path/to/workspace.

Because all datasets to be aggregated are assumed to share the same ID, and nnU-Net requires that each dataset ID exist only once, we currently only support a single dataset / gateway in simulated training mode. This limitation does not exist in remote execution mode.

Working with dummy data

Datasets can include what's known as dummy data, which serves as a stand-in for real data. This dummy data is designed to be representative of the actual data but without containing any sensitive information. It's recommended that when data custodians register their datasets, they also provide this dummy data alongside it.

For this tutorial, you will use the Medical Decathlon's "Hippocampus" dataset. This has been pre-registered in the Apheris environment and is available for you to experiment with. It has been split into 2 parts, and each part is registered to a different organisation.

Before you can access Apheris dummy data, you need to provide your environment configuration. This is done using a Python package called dotenv, which is installed as a dependency of the CLI and allows easy configuration of environment variables from a file.

Please contact your Apheris representative to get access to the dotenv file for your environment. This file contains environment variables for your specific environment, including the URL the CLI should connect to. You can set these environment variables by running dotenv -f <name_of_environment_file> run $SHELL. If you are running this command from a conda environment, please note that you may need to reactivate the conda environment after running this command.
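If you prefer to stay in Python (for example, in a notebook), the same file can be loaded in-process with python-dotenv. A minimal sketch, where "apheris.env" is a placeholder for the file name your representative provides:

from dotenv import load_dotenv

# "apheris.env" is a placeholder for the environment file provided by Apheris.
load_dotenv("apheris.env")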

Now log into Apheris using the CLI:

$ apheris login
Logging in to your company account...
Apheris:
Authenticating with Apheris Cloud Platform...
Please continue the authorization process in your browser.

Gateway:
Authenticating with Apheris Compute environments...
Please continue the authorization process in your browser.


Login was successful
You are logged in:
 e-mail:  your.name@your-company.com
 organization: your_organisation
 environment: your_environment

The IDs for the datasets you'll use are:

  • medical-decathlon-task004-hippocampus-a_gateway-1_org-1
  • medical-decathlon-task004-hippocampus-b_gateway-2_org-2

Each of these datasets has a small number of images stored as dummy data, in the correct format for nnU-Net v2.


The --payload parameter is the same as you would use when submitting a job with the CLI. It is either a JSON string or the path of a JSON file, and contains the parameters required for your job; in this case, training.

$ python3 -m simulator /path/to/nnunet_dataset_directory /path/to/workspace \
  --apheris-dataset-id medical-decathlon-task004-hippocampus-a_gateway-1_org-1 \
  --payload '{"mode":"training","device":"cpu","num_rounds":"2", "model_configuration": "2d", "dataset_id": 4}'

The simulator harness will then use the Apheris utilities to download and extract the dummy data for medical-decathlon-task004-hippocampus-a_gateway-1_org-1 onto the local machine and begin training. The value provided for dataset_id in the payload should match the data ID inside the downloaded dataset.

If you have GPUs available and CUDA installed in your environment, you could set device to cuda to speed up training. Similarly, on M-series Apple hardware, you can set it to mps to use hardware acceleration.
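If you want to pick the device programmatically when building the payload, a small sketch like the following works, assuming PyTorch is installed (it is a dependency of nnU-Net):

import torch

# Choose the strongest accelerator available on this machine.
if torch.cuda.is_available():
    device = "cuda"   # NVIDIA GPU with CUDA
elif torch.backends.mps.is_available():
    device = "mps"    # Apple M-series GPU
else:
    device = "cpu"

payload = {
    "mode": "training",
    "device": device,
    "num_rounds": "2",
    "model_configuration": "2d",
    "dataset_id": 4,
}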

Your results will be stored in the workspace directory you provided above: /path/to/workspace.

The workspace includes the output for the simulated Compute Gateways and Orchestrator. You can find the trained checkpoint in the Orchestrator outputs:

/path/to/workspace/simulate_job/app_server/Dataset004_Dataset004_Hippocampus_dummyA/nnUNetTrainer__nnUNetPlans__2d/

Starting from scratch: Formatting a dataset and using it locally

In this section, you'll start from scratch: downloading an example dataset from the Medical Decathlon website, pre-formatting it for nnU-Net v2 as per the instructions in the nnU-Net v2 repository, and finally running a training job using the simulator.

First, visit the Medical Decathlon website and click "Data" then "Google Drive". Download Task04_Hippocampus.tar into /path/to/Task04_Hippocampus.tar.

Extract the tarball with: tar -xf Task04_Hippocampus.tar

Next you need to take that raw dataset and format it for nnU-Net. Helpfully, the nnU-Net repo provides a script to do this for the Medical Decathlon datasets: nnUNetv2_convert_MSD_dataset.

Before running the script, you need to set some environment variables to allow it to find the data.

Here, we'll assume you wish to store all the nnU-Net data inside /path/to/nnunet_dataset_directory:

export nnUNet_raw=/path/to/nnunet_dataset_directory/nnUNet_raw
export nnUNet_preprocessed=/path/to/nnunet_dataset_directory/nnUNet_preprocessed
export nnUNet_results=/path/to/nnunet_dataset_directory/nnUNet_results

For more information on these environment variables, please see the path environment variables documentation in the nnU-Net v2 repository.
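If you are working from Python rather than the shell, an equivalent sketch sets the same variables in-process (nnU-Net reads them when it is imported, so set them first):

import os

# Point nnU-Net at the same directory layout as the shell exports above.
base = "/path/to/nnunet_dataset_directory"
os.environ["nnUNet_raw"] = f"{base}/nnUNet_raw"
os.environ["nnUNet_preprocessed"] = f"{base}/nnUNet_preprocessed"
os.environ["nnUNet_results"] = f"{base}/nnUNet_results"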

Now you can run the script to convert your dataset to an nnU-Net v2 compatible dataset (this will create the directory <nnUNet_raw>/Dataset004_Hippocampus):

nnUNetv2_convert_MSD_dataset -i /path/to/Task04_Hippocampus -np 8

At this point, the dataset is correctly configured and present in the nnUNet_raw directory; so now you can use the simulator on the raw files with the following command:

$ python3 -m simulator /path/to/nnunet_dataset_directory /path/to/workspace  \
  --local-datasets <nnUNet_raw>/Dataset004_Hippocampus \
  --payload '{"mode":"training","device":"cpu","num_rounds":"2", "model_configuration": "2d", "dataset_id": 4}'

Here, since the dataset is already downloaded and configured for nnU-Net v2, we use the --local-datasets parameter to point to it, and again ensure that dataset_id in the payload matches the ID of the data; in this case, 4 to match Dataset004_Hippocampus.

Note

nnU-Net assumes that only one dataset exists with a given ID inside the nnUNet_raw directory, so it's important to keep this in mind when preparing multiple datasets.

Local datasets can be either a correctly formatted directory, or a zip-file containing a correctly formatted directory. If a zip-file is provided, the simulator wrapper will extract it into the nnUNet_raw directory.
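For example, a formatted dataset directory can be packaged into such a zip-file with a short sketch (paths are illustrative):

import shutil

# Create Dataset004_Hippocampus.zip with the dataset directory as its top-level entry.
shutil.make_archive(
    "Dataset004_Hippocampus",
    "zip",
    root_dir="/path/to/nnunet_dataset_directory/nnUNet_raw",
    base_dir="Dataset004_Hippocampus",
)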

Using the simulator for inference

You must first have downloaded a suitable checkpoint from a training run with appropriate data.

Provide the checkpoint as the path to the configuration of the model you wish to use, e.g.:

/local/workspace/nnUNet_results/Dataset001_XYZ/nnUNetTrainer__nnUNetPlans__2d

As in the training example, you can provide your data as Apheris datasets (with dummy data), local zip files to extract, or directories on the local file system.

At inference time, it is not necessary to have your data structured as per nnU-Net v2's training format, but if you do, you can provide the subdirectory containing the images using the --local-dataset-subdirs argument (for both local directories and zip files). If provided, the number of subdirectories must equal the number of provided datasets of that type (use "." for a dataset that doesn't have a subdirectory if you need to supply this for others).

For dummy datasets downloaded using Apheris, you should provide the subdirectories in the payload as you would when running the simulator on remote data. The simulator will combine the subdirectory paths for local and dummy data before executing.

It is possible to provide a combination of any number of Apheris, local zip, and local directory datasets. The simulator will spawn one site per provided dataset and run them all in parallel. Please bear in mind that running too many parallel jobs is likely to stretch the resources of the local system.

The payload for inference can be quite long, so you can store it in a JSON file to make things easier.

Assume you have a file, payload.json, defining the inference payload as follows:

{
    "mode": "inference",
    "device": "cpu",
    "fold_id": 0,
    "checkpoint_path": "/local/workspace/nnUNet_results/Dataset001_XYZ/nnUNetTrainer__nnUNetPlans__2d",
    "apheris_dataset_subdirs": {
        "medical-decathlon-task004-hippocampus-a_gateway-1_org-1": "imagesTs"
    }
}

Then invoke the simulator:

$ python -m simulator /path/to/nnunet_dataset_directory /path/to/workspace \
  --apheris-dataset-ids medical-decathlon-task004-hippocampus-a_gateway-1_org-1 \
  --local-datasets /path/to/local/data/Dataset005_MoreData \
  --local-dataset-subdirs "imagesTs" \
  --payload payload.json

The results will be stored in the server's workspace path:

/path/to/workspace/simulate_job/app_server/inference_results

Running nnU-Net v2 in the Apheris environment

Now that you've tested your model locally, you can use the Apheris environment to run it on real data.

This section will introduce the Apheris CLI and show you how to run training on datasets that are registered via the Apheris Governance Portal.

The CLI: Getting help

The CLI provides documentation on the commands and their arguments. To access this, use the --help flag from the terminal CLI. This can either be done at root-level, or for a specific subset of commands.

apheris --help

You'll find reference guides for the Terminal CLI and the Python API in the Reference Guides section of the navigation.

For the rest of this guide, you'll use the Python API. It might be helpful, though not necessary, to work through the following steps in a Jupyter Notebook.

Login

To interact with the Apheris environment, you must first log in. You can either do this using the CLI directly, or from the CLI's Python API as you'll see below:

import apheris
import aphcli
from aphcli.api import compute, job, models, datasets
from pathlib import Path
import time

apheris.login()

Output:

Logging in to your company account...
Apheris:
Authenticating with Apheris Cloud Platform...
Please continue the authorization process in your browser.

Gateway:
Authenticating with Apheris Compute environments...
Please continue the authorization process in your browser.

Login was successful

You can check your login status at any time, using the Apheris CLI:

import aphcli.utils
aphcli.utils.get_login_status()

This will return a tuple containing 4 values:

(is_logged_in: bool, user_email: str, user_org: str, user_environment: str)

For example:

(True, 'user.name@apheris.com', 'Apheris', 'production')
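As a small usage sketch, you can unpack the tuple to re-authenticate only when needed:

import apheris
import aphcli.utils

# Log in again only if the cached session is no longer valid.
is_logged_in, email, organization, environment = aphcli.utils.get_login_status()
if not is_logged_in:
    apheris.login()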

Viewing Available Datasets and Models

Once logged into Apheris, you may have access to a number of datasets based on the asset policies that have named you as a beneficiary. To see what datasets you have access to, you can use the datasets API to list them, and their associated data custodian:

datasets.list_datasets()

Output:

+-----+---------------------------------------------------------+--------------+----------------------+
| idx |                         dataset_id                      | organization |    data custodian    |
+-----+---------------------------------------------------------+--------------+----------------------+
|  0  | medical-decathlon-task004-hippocampus-a_gateway-1_org-1 |     Org 1    |   Agathe McFarland   |
|  1  | medical-decathlon-task004-hippocampus-b_gateway-2_org-2 |     Org 2    |      Orsino Hoek     |
+-----+---------------------------------------------------------+--------------+----------------------+

The object returned by datasets.list_datasets() is a prettytable.PrettyTable, which is designed to produce a human-readable listing. To use the list programmatically, you can call datasets.list_datasets(to_table=False), which will instead return a list of dictionaries, with each element containing the details of one dataset.

For prediction with nnU-Net, this tutorial will use the Medical Decathlon's Hippocampus dataset, so the following command can be used to filter the list of datasets to include only those with "hippocampus" in their ID:

all_datasets = datasets.list_datasets(to_table=False)
hippocampus_datasets = [d["slug"] for d in all_datasets if "hippocampus" in d["slug"]]

Output:

['medical-decathlon-task004-hippocampus-a_gateway-1_org-1', 'medical-decathlon-task004-hippocampus-b_gateway-2_org-2']

You can also list which models are available in the model registry using the models API:

models.list_models()

Output:

+-----+---------------------------+-------------------------------------+
|  id |            name           |               version               |
+-----+---------------------------+-------------------------------------+
|  0  |       apheris-nnunet      |                0.8.0                |
|  1  |     apheris-statistics    |                0.19.0               |
| ... |            ...            |                 ...                 |
+-----+---------------------------+-------------------------------------+

Compute Spec

A "Compute Spec" is a specialized contract that encapsulates the execution environment and parameters for securely running statistics functions and machine learning models on specified datasets. Think of it as a blueprint that includes:

  • Dataset ID(s): Each Compute Spec is linked to a specific dataset or datasets by an identifier. This ID is used to identify which Compute Gateways the code runs on.
  • Model: This is defined through a Docker image that contains the pre-configured environment in which the code will execute. It ensures consistency and reproducibility by packaging the code, runtime, system tools, system libraries, and settings.
  • Max Compute resources: Here, the Compute Spec details the required computing resources; specifically the type and number of virtual CPUs, the amount of RAM, and any GPU requirements. This ensures that the computation will run on infrastructure that's equipped to handle the task's demands.

In essence, a Compute Spec is a comprehensive contract that delineates how, where, and on what data a piece of code can execute within the respective Compute Gateway.

To define the compute spec, we specify the dataset(s), hardware infrastructure requirements, and model.

Since this example shows training on a relatively simple nnU-Net model structure with a small dataset, the compute requirements are relatively low, though we still require a GPU. For training 3d models, at least 12 GB of gateway memory is required.

You'll see that the compute requirements on the server are much lower - this is because only aggregation is performed on the server, which is a relatively low complexity operation.

client_resource_cfg = {
    "client_n_cpu": 1,
    "client_n_gpu": 1,
    "client_memory": 8000
}

server_resource_cfg = {
    "server_n_cpu": 0.25,
    "server_n_gpu": 0,
    "server_memory": 2000
}

# These datasets are on GW1 and GW2 respectively. You could also run with only one of them
compute_spec_id = compute.create_from_args(
    [
        'medical-decathlon-task004-hippocampus-a_gateway-1_org-1',
        'medical-decathlon-task004-hippocampus-b_gateway-2_org-2'
    ],
    **client_resource_cfg,
    **server_resource_cfg,
    model_id="apheris-nnunet",
    model_version="0.8.0"
 )

Important

The model version provided above is the latest at time of writing, but please use models.list_models() to find the versions to which you have access.

The compute spec ID will be returned as a UUID object:

UUID('d086dce5-d89e-45e9-a6ff-82d13872ad63')

If you're not sure what values you want to put in the compute spec, you can use interactive mode to guide you through the creation. For example, the following snippet doesn't provide the model, so you'll be asked to provide it:

from aphcli.api.models import Model

cs = compute.ComputeSpec(
    model=Model("apheris-nnunet", "0.8.0"),
    **client_resource_cfg, 
    **server_resource_cfg,
)
cs.ask_for_empty_inputs()

Output:

## List of available datasets:

+-----+---------------------------------------------------------+--------------+---------------------------+
| idx |                        dataset_id                       | organization |       data custodian      |
+-----+---------------------------------------------------------+--------------+---------------------------+
|  0  |          cancer-medical-images_gateway-2_org-2          |    Org 2     |        Orsino Hoek        |
|  1  |          pneumonia-x-ray-images_gateway-2_org-2         |    Org 2     |        Orsino Hoek        |
|  2  |            covid-19-patients_gateway-1_org-1            |    Org 1     |      Agathe McFarland     |
|  3  | medical-decathlon-task004-hippocampus-a_gateway-1_org-1 |    Org 1     |      Agathe McFarland     |
|  4  | medical-decathlon-task004-hippocampus-b_gateway-2_org-2 |    Org 2     |        Orsino Hoek        |
| ... |                           ...                           |     ...      |            ...            |
+-----+---------------------------------------------------------+--------------+---------------------------+...

Please list the indices of all datasets that you want to include. Separate the numbers by comma, semicolon or space:
:3,4

You have selected following indices: [3, 4]

They correspond to following dataset IDs:
 ['medical-decathlon-task004-hippocampus-a-files_gateway-1_org-1', 'medical-decathlon-task004-hippocampus-b_gateway-2_org-2']

Now you can run the following to create the compute spec:

compute.create(cs)

Once created, you can see the full JSON that was used to create a compute spec using the get function:

spec = compute.get()
print(spec.to_json())

Output:

{
   "datasets": [
      "medical-decathlon-task004-hippocampus-a-files_gateway-1_org-1",
      "medical-decathlon-task004-hippocampus-b-files_gateway-2_org-2"
   ],
   "resources": {
      "clients": {
         "cpu": 1,
         "gpu": 1,
         "memory": 8000
      },
      "server": {
         "cpu": 0.25,
         "gpu": 0,
         "memory": 2000
      }
   },
   "model": {
      "id": "apheris-nnunet",
      "version": "0.8.0"
   }
}

You could save this JSON to a file and later use it with the CLI when creating a new compute spec to save writing the full specification each time. This requires the terminal interface:

apheris compute create --json /path/to/json/file.json
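From the Python API, you could produce such a file from the spec object shown above; a minimal sketch, assuming spec.to_json() returns the JSON string printed earlier:

from pathlib import Path

# Persist the compute spec JSON for reuse with `apheris compute create --json ...`
Path("compute_spec.json").write_text(spec.to_json())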

You can learn more about the terminal interface in the Getting started with Apheris CLI guide.

Activating the Compute Spec

Once a compute spec is ready_to_activate, you can activate it. That means deploying a cluster of Compute Clients and an Orchestrator (the central server for aggregation). When you call the compute.activate() function without the compute_spec_id argument, the CLI uses a cached one: the last compute_spec_id that was used successfully.
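For example, to activate a specific compute spec explicitly rather than relying on the cache (the UUID below is the illustrative one from earlier):

from uuid import UUID

# Pass the ID explicitly instead of using the cached compute_spec_id.
compute.activate(compute_spec_id=UUID("d086dce5-d89e-45e9-a6ff-82d13872ad63"))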

Important

Once a compute spec is activated, the resources assigned to it are reserved and the infrastructure is live. It is good practice to deactivate your compute spec once finished with it to avoid capacity issues and/or incurring additional infrastructure costs.

Let's activate a compute spec!

compute.activate()

Output:

On 2024-04-02 20:05:36 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2

There will be a short delay (up to 2-3 mins on first run) while the compute spec is activated.

You can check the status of this as follows:

compute.get_activation_status()

Output:

On 2024-04-02 20:14:49 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2
'creating'

Activation status will be one of the following values:

  • waiting_activation: The compute spec has been created, but has never been activated.
  • creating: The infrastructure has been requested and is starting up. This is the initial status after calling compute.activate().
  • running: The compute spec is ready for workloads to be submitted.
  • deleting: The infrastructure is shutting down. This is what you see after calling compute.deactivate().
  • shutdown: The shut down has completed.
  • failed: An error has occurred while provisioning infrastructure.

We can poll the status of the compute spec until it's ready:

compute.wait_until_running(compute_spec_id)
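If you prefer explicit control over polling, a minimal hand-rolled sketch using the cached compute spec ID and the status values listed above:

import time

# Poll every 30 seconds until the infrastructure is up (or has failed).
status = compute.get_activation_status()
while status not in ("running", "failed"):
    time.sleep(30)
    status = compute.get_activation_status()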

There is a longer-form status command that will retrieve more detailed information about the specific state of the Orchestrator and Compute Gateways involved in your compute spec.

In particular, this can be helpful to troubleshoot issues, such as out-of-capacity errors, which would be shown as follows:

compute.get_status()

Output:

{
    "status": "creating",
    "message": {
        "5a873d08-7f58-4a5f-953c-87a1b7726a0b": "The computation is pending due to insufficient resources. It will resume once the necessary resources become available.",
        "f44f2052-659a-43fd-84f8-8942627d222c": "The computation has been scheduled: no details",
        "orchestrator": "The computation has been scheduled: no details"
    },
    "details": {
        "5a873d08-7f58-4a5f-953c-87a1b7726a0b": "The computation has been scheduled: 0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient nvidia.com/gpu. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod.",
        "f44f2052-659a-43fd-84f8-8942627d222c": "The computation is running",
        "orchestrator": "The computation is running"
    }
}

Listing your compute specs

To help you track which compute specs you have created, and their states, you can use the CLI to retrieve a list of your compute specs, ordered newest to oldest:

compute.list_compute_specs()

The output of the command above is a list of compute spec objects, which contain the data from the compute spec and the approval status information:

{
    "id": "8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2",
    "createdAt": "2024-04-02T18:55:41.22733804Z",
    "createdBy": {
       ...
    },
    "updatedAt": "2024-04-02T18:55:41.22733804Z",
    "contextURI": "",
    "datasets": ["medical-decathlon-task004-hippocampus-a_gateway-1_org-1"],
    "model": {"id": "apheris-nnunet", "version": "1.0.0"},
    "resources": {
        "server": {"cpu": 0.25, "memory": 2000, "gpu": 0, "Replicas": 1},
        "clients": {"cpu": 1, "memory": 8000, "gpu": 1, "Replicas": 1},
    },
}

If you want more information, including the current status of the compute specs, you can use the helper function get_compute_specs_details. We would suggest limiting this query to 10 or so compute specs for the sake of speed.

compute_specs = compute.list_compute_specs()

# Extract the 10 latest compute specs
compute_specs = compute_specs[:10]

# Get the extra data
compute_spec_data = compute.get_compute_specs_details(compute_specs)

You can retrieve the Compute Spec list as a table using the apheris.list_compute_specs() shorthand:

apheris.list_compute_specs(limit=2, verbose=True)

Output:

+--------------------------------------+---------------------+----------------------+-------------------------------+------------------+------------+
|                  ID                  |       Created       |        Model         |            Datasets           | Resources        | Activation |
|                                      |                     |                      |                               |                  |   Status   |
+--------------------------------------+---------------------+----------------------+-------------------------------+------------------+------------+
| 15fdbcaa-80cb-4bb2-9b55-fdb1111d6284 | 2024-06-24 11:29:22 | apheris-nnunet:0.8.0 |           2 datasets          | Orchestrator:    |  shutdown  |
|                                      |                     |                      |                               |   CPU: 1         |            |
|                                      |                     |                      |                               |   GPU: 0         |            |
|                                      |                     |                      |                               |   Memory: 2000MB |            |
|                                      |                     |                      |                               | Gateway:         |            |
|                                      |                     |                      |                               |   CPU: 1         |            |
|                                      |                     |                      |                               |   GPU: 0         |            |
|                                      |                     |                      |                               |   Memory: 2000MB |            |
+--------------------------------------+---------------------+----------------------+-------------------------------+------------------+------------+
| 0e2ac20b-de6b-4fa3-880b-44ec66368775 | 2024-06-24 11:41:38 | apheris-nnunet:0.8.0 | medical-decat...teway-1_org-1 | Orchestrator:    |  shutdown  |
|                                      |                     |                      |                               |   CPU: 1         |            |
|                                      |                     |                      |                               |   GPU: 0         |            |
|                                      |                     |                      |                               |   Memory: 2000MB |            |
|                                      |                     |                      |                               | Gateway:         |            |
|                                      |                     |                      |                               |   CPU: 1         |            |
|                                      |                     |                      |                               |   GPU: 1         |            |
|                                      |                     |                      |                               |   Memory: 8000MB |            |
+--------------------------------------+---------------------+----------------------+-------------------------------+------------------+------------+

Submitting Jobs

Models in the Apheris Model Registry accept parameters to allow them to be configured or to perform different tasks.

To submit workloads to the Apheris environment, you use the jobs API to submit those parameters as a Python dictionary (or a JSON object if using the terminal CLI) to the Secure Runtime inside the Apheris Orchestrator.

The Secure Runtime then validates those parameters to ensure they are acceptable for the model (checking types, numerical bounds, etc.), before submitting a federation job for execution on the Orchestrator and Compute Gateways.

Here, you'll submit a simple 2D training job using the datasets you defined in the compute spec:

params = {
    "mode": "training",
    "model_configuration": "2d", # 2d / 3d / 3d_fullres
    "dataset_id": 4, # The nnU-Net dataset ID, i.e. Dataset004...
    "num_rounds": 1, # The number of federation rounds
    "device": "cuda" # Train on GPU
}

job_id = job.submit(params)

Similar to the compute specs, the job_id you will receive back is a UUID object.

To submit a job with the terminal CLI, you provide the arguments as a JSON payload:

$ apheris job run --payload '{"mode": "training", "model_configuration": "2d", "dataset_id": 4, "num_rounds": 1, "device": "cpu"}'

On 2024-04-02 20:14:49 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2

You can check the status of your job using job.status().

job.status()

Output:

On 2024-04-02 20:14:49 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2

On 2024-04-02 20:45:43 you have used the job ID `2312560a-29a4-4aa4-9166-ba57852b0e03`.
We will use the job ID 2312560a-29a4-4aa4-9166-ba57852b0e03.
'running'

When you submit a job, or run one of the job functions below, the job ID is cached and used by default for subsequent commands.

Alternatively, to check the status of a different job, simply pass its ID as a parameter.
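For example, passing the ID returned by job.submit() earlier:

job.status(job_id)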

You can also list all jobs you've submitted since activating the compute spec, along with their statuses:

job.list_jobs()

Output:

On 2024-04-02 20:14:49 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2
[Job(duration='1m6.46005728s', id=UUID('2312560a-29a4-4aa4-9166-ba57852b0e03'), status='running', created_at=datetime.datetime(2024, 8, 19, 22, 22, 31), compute_spec_id=UUID('4cfe632b-f1e1-4329-8618-096e8c30554c'))]

Note

In Apheris 3.3, the return type of job.get() and job.list_jobs() has changed to a Pydantic model Job, which improves the quality of response parsing from the Orchestrator and allows client-side validation of the returned data.
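Since Job is a Pydantic model, the fields shown in the listing above are available as plain attributes; for example:

# Inspect the most recent job using the fields from the Job model.
latest_job = job.list_jobs()[0]
print(latest_job.id, latest_job.status, latest_job.duration)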

Monitoring your job

You can view the logs for your currently running training job using the job.logs() command:

job.logs()

On 2024-04-02 20:14:49 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2

On 2024-04-02 20:45:43 you have used the job ID `2312560a-29a4-4aa4-9166-ba57852b0e03`.
We will use the job ID 2312560a-29a4-4aa4-9166-ba57852b0e03.
"2024-04-02 19:45:47,056 - runner_process - INFO - Runner_process started.\n2024-04-02 19:45:47,095 - CoreCell - INFO - server.2312560a-29a4-4aa4-9166-ba57852b0e03: created backbone internal connector to tcp://localhost:13461 on parent\n2024-04-02 19:45:47,095 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE tcp://localhost:13461] is starting\n2024-04-02 19:45:47,097 - Cell - INFO - Register blob CB for channel='server_command', topic='*'\n2024-04-02 19:45:47,097 - Cell - INFO - Register blob CB for channel='aux_communication', topic='*'\n2024-04-02 19:45:47,097 - ServerCommandAgent - INFO - ServerCommandAgent cell register_request_cb: server.2312560a-29a4-4aa4-9166-ba57852b0e03\n2024-04-02 19:45:47,102 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 127.0.0.1:55436 => 127.0.0.1:13461] is created: PID: 30\n2024-04-02 19:45:47,153 - ServerRunner - INFO - [identity=Unnamed_project_54d5653d-93af-4433-a32d-7774ca74f3dc, run=2312560a-29a4-4aa4-9166-ba57852b0e03]: Server runner starting ...\n2024

Important

job.logs() is only applicable for jobs that are currently in the running state, as it streams the logs directly from the computation container. To download logs for a job that has completed (either successful, errored or aborted), please use apheris job download-results.

In order to protect the privacy of the data on the Compute Gateway, you only have access to the Orchestrator logs; however Apheris models will send sanitised logs from the Compute Gateways to the Orchestrator, so you can see what's happening there too.

You can find these with the prefix "ClientLogForwarder"; they look like this:

"2024-04-02 19:48:20,444 - ClientLogForwarder - INFO - [identity=Unnamed_project_54d5653d-93af-4433-a32d-7774ca74f3dc, run=2312560a-29a4-4aa4-9166-ba57852b0e03, wf=fingerprint]: Message from '5a873d08-7f58-4a5f-953c-87a1b7726a0b [INFO]': Fingerprint stage complete, sending to server"

If your job errors on the Gateway in an unexpected way, the Apheris wrapper will catch the error and sanitise it before returning it to you via the ClientLogForwarder. This means you'll be able to see where in the code the error happens, but not the values of any variables.

When a job completes, it will report its status as finished.completed. Other statuses you might encounter are:

  • submitted: The job has been sent to the system and it is queued for running. If you see a job sitting in this state for prolonged periods, it might indicate an issue with the activated infrastructure.
  • running: The job has been started and is currently computing.
  • finished.completed: The job has successfully finished.
  • finished.aborted: The job was aborted (using apheris job abort) while running.
  • finished.execution_exception: The job errored. To understand why, you can use apheris job logs and investigate the logs.

Note

With Apheris 3.3, the job statuses have been refactored to lower-case and are now .-separated.

You can poll for the job status until it shows the job is finished, then download the result. This should take roughly 10 minutes for a single federation round of training (1 epoch over the Hippocampus dataset):

job.wait_until_job_finished()

Downloading Results

The result is the workspace of the NVIDIA FLARE server. This contains:

  • The logs from the server
  • Any files that were stored in the workspace as part of the task. In this case, the aggregated final checkpoint
  • The job configurations that were created by the secure_runtime and define the workflow.
  • Some metadata about the run (round latency statistics, etc.).

Once the status is finished.completed, you can download the results using the download_results command:

download_path = Path("./training_results")
job.download_results(download_path)

This will download the workspace as described above, including the aggregated checkpoint inside the app_server directory:

$ tree training_results/workspace/
training_results/workspace/
β”œβ”€β”€ app_server
β”‚   β”œβ”€β”€ <name of checkpoint>
β”‚   β”‚   └── nnUNetTrainer__nnUNetPlans__2d
β”‚   β”‚       β”œβ”€β”€ dataset.json
β”‚   β”‚       β”œβ”€β”€ dataset_fingerprint.json
β”‚   β”‚       β”œβ”€β”€ fold_all
β”‚   β”‚       β”‚   └── checkpoint_final.pth
β”‚   β”‚       └── plans.json
β”‚   β”œβ”€β”€ FL_global_model.pt
β”‚   └── config
β”‚       β”œβ”€β”€ config_fed_client.json
β”‚       └── config_fed_server.json
β”œβ”€β”€ fl_app.txt
β”œβ”€β”€ log.txt
β”œβ”€β”€ meta.json
└── stats_pool_summary.json

In the output above, you can see a directory <name of checkpoint>. This is the trained checkpoint from nnU-Net v2, which can now be used for local and/or centralised inference, or to initialise further centralised training.

Important

The storage attached to a compute spec is not persistent. This means that when you deactivate a compute spec, any results stored on the compute pod are deleted along with the deployment.

Please ensure you download any results or logs you might need before deactivation.

Clean up

Now that your job is completed, it is important to deactivate the compute spec to shut down the attached infrastructure:

compute.deactivate()

Summary

In this tutorial, you have been introduced to the nnU-Net machine learning model as an example of how to run your model on the Apheris environment.

You have seen how to use the NVIDIA FLARE Simulator to test your model locally on dummy data, then been guided through how to run the same workload on the Apheris environment with real data using the Apheris Python API.

If you'd like to learn more about how to interact with the Apheris environment, we'd recommend the Getting started with Apheris CLI guide, or to find out more about how to use the Statistics package to analyse data in a secure and privacy-preserving way, check out our guide to Simulating and Running Statistics.