Machine Learning with Apheris using the NVIDIA FLARE Simulator and Python CLI
This guide will walk you through how to run a model on the Apheris Platform using the CLI, including simulating your workflow on dummy data or data on your local file system.
The CLI provides both terminal and Python interfaces to interact with the Apheris 3.0 platform. It can be used to create, activate and deactivate Compute Specs, and to submit and monitor compute jobs.
In this tutorial, you will learn about the Apheris CLI and how it can be used to train a model for segmentation of Hippocampus MRI images into anterior and posterior regions using the nnU-Net model.
What is nnU-Net?
To learn more about Apheris Federated nnU-Net, please check out our Model Registry: nnU-Net guide.
Installation Instructions
To run this example, you'll need to install the Apheris CLI and its prerequisites.
First, install the Apheris CLI package by following the Quickstart: Installing the Apheris CLI guide.
That's all you need to run remote jobs on Apheris! You do not need to install anything else: the model is pre-built inside the environment, so you just have to use the CLI to provision resources and run the job. We'll cover that in more detail in a later section.
However, should you wish to use the simulator for trying out your training runs on local or dummy data before submitting them for execution on real data, you will need to install the Apheris nnU-Net requirements too. The following section will guide you through that installation and show you how to train a model using the simulator.
Note
The simulator wrapper exists in the Apheris nnU-Net model repository. To gain access to this repository, please speak to your Apheris representative.
Clone the nnU-Net model repository, following the instructions provided by your Apheris representative.
In the simulator directory of this repository, you will find scripts that wrap the NVIDIA FLARE simulator with the Apheris utilities and allow you to download dummy data using the Apheris platform. These have some additional requirements you need to install:
pip install -r <path_to_apheris_nnunet>/requirements.txt
You should also add the src directory of the repository to your PYTHONPATH:
export PYTHONPATH=<path_to_apheris_nnunet>/src
Running nnU-Net locally using the Simulator
In the simulator directory of the repository, you will find scripts that wrap the NVIDIA FLARE simulator with the Apheris utilities and allow you to download dummy data using the Apheris platform. This is helpful for testing and debugging your applications before running them on real-world data.
The NVIDIA FLARE Simulator will create an NVIDIA FLARE workspace on your local machine. This mimics the behaviour of the production platform, except that the server and all clients are held on your local machine.
The simulation wrapper will create this path if it doesn't exist - you just need to supply a valid path, e.g. /path/to/workspace.
Federated training assumes that all datasets to be aggregated share the same dataset ID, while nnU-Net requires that each dataset ID exists only once. For this reason, simulated training mode currently supports only a single dataset / gateway. This limitation does not exist in remote execution mode.
Working with dummy data
Datasets can include what's known as dummy data, which serves as a stand-in for real data. This dummy data is designed to be representative of the actual data but without containing any sensitive information. It's recommended that when data custodians register their datasets, they also provide this dummy data alongside it.
For this tutorial, you will use the Medical Decathlon's "Hippocampus" dataset. This has been pre-registered in the Apheris environment and is available for you to experiment with. It has been split into 2 parts, and each part is registered to a different organisation.
Before you can access Apheris dummy data, you need to provide your environment configuration. This is done using a Python package called dotenv, which is installed as a dependency of the CLI and allows easy configuration of environment variables from a file.
Please contact your Apheris representative to get access to the dotenv file for your environment. This file contains environment variables for your specific environment, including which URL the CLI should connect to. You can set these environment variables by running dotenv -f <name_of_environment_file> run $SHELL. If you are running this command from a conda environment, please note that you may need to reactivate the conda environment after running this command.
Now log into Apheris using the CLI:
$ apheris login
Logging in to your company account...
Apheris:
Authenticating with Apheris Cloud Platform...
Please continue the authorization process in your browser.
Login was successful
You are logged in:
e-mail: your.name@your-company.com
organization: your_organisation
environment: your_environment
The IDs for the datasets you'll use are:
medical-decathlon-task004-hippocampus-a_gateway-1_org-1
medical-decathlon-task004-hippocampus-b_gateway-2_org-2
Each of these datasets has a small number of images stored as dummy data, in the correct format for nnU-Net v2.
The --payload parameter is the same as you would use when submitting a job with the CLI. It is either a JSON string or the path of a JSON file. The content contains the parameters that are required when submitting your job, in this case, training.
$ python3 -m simulator /path/to/nnunet_dataset_directory /path/to/workspace \
--apheris-dataset-id medical-decathlon-task004-hippocampus-a_gateway-1_org-1 \
--payload '{"mode":"training","device":"cpu","num_rounds":"2", "model_configuration": "2d", "dataset_id": 4}'
The simulator harness will then use the Apheris utilities to download and extract the dummy data for medical-decathlon-task004-hippocampus-a_gateway-1_org-1 onto the local machine and begin training. The value provided for dataset_id in the payload should match the data ID inside the downloaded dataset.
If you have GPUs available and CUDA installed in your environment, you could set device to cuda to speed up training. Similarly, on M-series Apple hardware, you can set it to mps to use hardware acceleration.
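Hand-writing JSON inside shell quotes is easy to get wrong, so one option is to build the payload in Python and pass it to --payload either as a quoted string or as a file path. The sketch below uses only the standard library and the payload values from the command above; the file name training_payload.json is an arbitrary, hypothetical choice.

```python
import json
import shlex
from pathlib import Path

# Payload values copied from the simulator command above.
payload = {
    "mode": "training",
    "device": "cpu",  # or "cuda" / "mps", as described above
    "num_rounds": "2",
    "model_configuration": "2d",
    "dataset_id": 4,
}

# Option 1: a JSON string, quoted so it can be pasted safely into a shell.
payload_arg = shlex.quote(json.dumps(payload))

# Option 2: a JSON file whose path is passed to --payload instead.
payload_file = Path("training_payload.json")  # hypothetical file name
payload_file.write_text(json.dumps(payload, indent=2))

print(f"--payload {payload_arg}")
print(f"--payload {payload_file}")
```

Either printed argument can be appended to the python3 -m simulator command in place of the hand-written payload.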
Your results will be stored in the workspace directory you provided above: /path/to/workspace.
The workspace includes the output for the simulated Compute Gateways and Orchestrator. You can find the trained checkpoint in the Orchestrator outputs:
/path/to/workspace/simulate_job/app_server/Dataset004_Dataset004_Hippocampus_dummyA/nnUNetTrainer__nnUNetPlans__2d/
Starting from scratch: Formatting a dataset and using it locally
In this section, you'll start from scratch: downloading an example dataset from the Medical Decathlon Website, pre-formatting it for nnU-Net v2 as per the instructions in the nnU-Net v2 repository, and finally running a training job using the simulator.
First, visit the Medical Decathlon Website and click "Data" then "Google Drive". Download Task04_Hippocampus.tar into /path/to/Task04_Hippocampus.tar.
Extract the tarball with: tar -xf Task04_Hippocampus.tar
Next you need to take that raw dataset and format it for nnU-Net. Helpfully, the nnU-Net repo provides a script to do this for the Medical Decathlon datasets: nnUNetv2_convert_MSD_dataset.
Before running the script, you need to set some environment variables to allow it to find the data.
Here, we'll assume you wish to store all the nnU-Net data inside /path/to/nnunet_dataset_directory:
export nnUNet_raw=/path/to/nnunet_dataset_directory/nnUNet_raw
export nnUNet_preprocessed=/path/to/nnunet_dataset_directory/nnUNet_preprocessed
export nnUNet_results=/path/to/nnunet_dataset_directory/nnUNet_results
For more information on these environment variables, please see the path environment variables documentation in the nnU-Net v2 repository.
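If you drive nnU-Net from Python (for example in a notebook), the same three variables can be set programmatically. The following is a minimal sketch using only the standard library; configure_nnunet_paths is a hypothetical helper name, not part of nnU-Net or the Apheris CLI.

```python
import os
from pathlib import Path


def configure_nnunet_paths(base_dir: str) -> dict:
    """Set nnUNet_raw, nnUNet_preprocessed and nnUNet_results under base_dir."""
    base = Path(base_dir)
    paths = {}
    for name in ("nnUNet_raw", "nnUNet_preprocessed", "nnUNet_results"):
        target = base / name
        target.mkdir(parents=True, exist_ok=True)  # create the folder if missing
        os.environ[name] = str(target)  # visible to this process and its children
        paths[name] = target
    return paths
```

Calling configure_nnunet_paths("/path/to/nnunet_dataset_directory") mirrors the export commands above. Note that this only affects the current Python process and any subprocesses it starts, not the shell that launched it.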
Now you can run the script to convert your dataset to an nnU-Net v2 compatible dataset (this will create the directory <nnUNet_raw>/Dataset004_Hippocampus):
nnUNetv2_convert_MSD_dataset -i /path/to/Task04_Hippocampus -np 8
At this point, the dataset is correctly configured and present in the nnUNet_raw directory, so you can now use the simulator on the raw files with the following command:
$ python3 -m simulator /path/to/nnunet_dataset_directory /path/to/workspace \
--local-datasets <nnUNet_raw>/Dataset004_Hippocampus \
--payload '{"mode":"training","device":"cpu","num_rounds":"2", "model_configuration": "2d", "dataset_id": 4}'
Here, since the dataset is already downloaded and configured for nnU-Net v2, we use the --local-datasets parameter to point to it, and again ensure that dataset_id in the payload matches the ID of the data, in this case 4 to match Dataset004_Hippocampus.
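Since a mismatch between dataset_id and the dataset folder is easy to introduce, it can help to derive the ID from the folder name. The sketch below assumes the nnU-Net v2 naming convention of DatasetXXX_Name with a three-digit ID; nnunet_dataset_id is a hypothetical helper written for this guide.

```python
import re
from pathlib import Path


def nnunet_dataset_id(dataset_dir: str) -> int:
    """Return the numeric ID encoded in an nnU-Net v2 dataset folder name."""
    name = Path(dataset_dir).name
    match = re.fullmatch(r"Dataset(\d{3})_.+", name)
    if match is None:
        raise ValueError(f"{name!r} does not look like 'DatasetXXX_Name'")
    return int(match.group(1))


print(nnunet_dataset_id("/path/to/nnUNet_raw/Dataset004_Hippocampus"))  # prints 4
```

The returned value can be used directly for the dataset_id field when building the payload.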
Note
nnU-Net assumes that only one dataset exists with a given ID inside the nnUNet_raw directory, so it's important to keep this in mind when preparing multiple datasets.
Local datasets can be either a correctly formatted directory, or a zip-file containing a correctly formatted directory. If a zip-file is provided, the simulator wrapper will extract it into the nnUNet_raw directory.
Using the simulator for inference
You must first have downloaded a suitable checkpoint from a training run with appropriate data.
Provide the path to the checkpoint as the path to the configuration of the model you wish to use, e.g.:
/local/workspace/nnUNet_results/Dataset001_XYZ/nnUNetTrainer__nnUNetPlans__2d
Similarly to the training example, you can provide your data either as Apheris datasets (with dummy data), local zip files to extract or a local file system.
At inference time, it is not necessary to have your data structured as per nnU-Net v2's training format, but if you do, you can provide the subdirectory containing the training images using the --local-dataset-subdirs argument (for both local directories and zip files). If provided, the number of subdirectories must equal the number of provided datasets of that type (use "." for a dataset that doesn't have a subdirectory if you need to supply this for others).
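The pairing rule for --local-dataset-subdirs can be illustrated with a few lines of Python. This is only a sketch of the rule as described above, not the simulator's own implementation; pair_datasets_with_subdirs is a hypothetical name, and the second dataset path in the usage example is made up for illustration.

```python
def pair_datasets_with_subdirs(datasets, subdirs=None):
    """Pair each dataset with its image subdirectory ('.' means no subdirectory)."""
    if subdirs is None:
        # No subdirectories supplied at all: every dataset is used as-is.
        return {dataset: None for dataset in datasets}
    if len(subdirs) != len(datasets):
        raise ValueError(
            f"Got {len(subdirs)} subdirectories for {len(datasets)} datasets; "
            "use '.' for datasets without a subdirectory."
        )
    return {
        dataset: (None if subdir == "." else subdir)
        for dataset, subdir in zip(datasets, subdirs)
    }


pairs = pair_datasets_with_subdirs(
    ["/path/to/local/data/Dataset005_MoreData", "/path/to/local/data/flat_images"],
    ["imagesTs", "."],
)
print(pairs)
```

Here the first dataset reads its images from imagesTs, while the second is read from its top-level directory.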
For dummy datasets downloaded using Apheris, you should provide the subdirectories in the payload as you would when running the simulator on remote data. The simulator will combine the subdirectory paths for local and dummy data before executing.
It is possible to provide a combination of any number of Apheris, local zip and local directory datasets. The simulator will spawn one site per provided dataset and run them all in parallel. Please bear in mind that running too many parallel jobs is likely to stretch the resources of the local system.
The payload for inference can be quite long, so you can store it in a JSON file to make things easier.
Assume you have a file, payload.json, defining the inference payload as follows:
{
    "mode": "inference",
    "device": "cpu",
    "fold_id": 0,
    "checkpoint_path": "/local/workspace/nnUNet_results/Dataset001_XYZ/nnUNetTrainer__nnUNetPlans__2d",
    "apheris_dataset_subdirs": {
        "medical-decathlon-task004-hippocampus-a_gateway-1_org-1": "imagesTs"
    }
}
$ python -m simulator /path/to/nnunet_dataset_directory /path/to/workspace \
--apheris-dataset-ids medical-decathlon-task004-hippocampus-a_gateway-1_org-1 \
--local-datasets /path/to/local/data/Dataset005_MoreData \
--local-dataset-subdirs "imagesTs" \
--payload payload.json
The results will be stored in the server's workspace path:
/path/to/workspace/simulate_job/app_server/inference_results
Running nnU-Net v2 in the Apheris environment
Now that you've tested your model locally, you can use the Apheris environment to run it on real data.
This section will introduce the Apheris CLI and show you how to run training on datasets that are registered via the Apheris Governance Portal.
The CLI: Getting help
The CLI provides documentation on the commands and their arguments. To access this, use the --help flag from the terminal CLI. This can either be done at root-level, or for a specific subset of commands.
apheris --help
You'll find reference guides for the Terminal CLI and the Python API in the Reference Guides section of the navigation.
For the rest of this guide, you'll use the Python API. It might be helpful, though not necessary, to work through the following steps in a Jupyter Notebook.
Login
To interact with the Apheris environment, you must first log in. You can either do this using the CLI directly, or from the CLI's Python API as you'll see below:
import apheris
import aphcli
from aphcli.api import compute, job, models, datasets
from pathlib import Path
import time
apheris.login()
Logging in to your company account...
Apheris:
Authenticating with Apheris Cloud Platform...
Please continue the authorization process in your browser.
Login was successful
You can check your login status at any time, using the Apheris CLI:
import aphcli.utils
aphcli.utils.get_login_status()
This will return a tuple containing 4 values:
(is_logged_in: bool, user_email: str, user_org: str, user_environment: str)
For example:
(True, 'user.name@apheris.com', 'Apheris', 'production')
Viewing Available Datasets and Models
Once logged into Apheris, you may have access to a number of datasets based on the asset policies that have named you as a beneficiary. To see what datasets you have access to, you can use the datasets API to list them, and their associated data custodian:
datasets.list_datasets()
Output:
+-----+---------------------------------------------------------+--------------+----------------------+
| idx | dataset_id                                              | organization | data custodian       |
+-----+---------------------------------------------------------+--------------+----------------------+
| 0   | medical-decathlon-task004-hippocampus-a_gateway-1_org-1 | Org 1        | Agathe McFarland     |
| 1   | medical-decathlon-task004-hippocampus-b_gateway-2_org-2 | Org 2        | Orsino Hoek          |
+-----+---------------------------------------------------------+--------------+----------------------+
The list returned by datasets.list_datasets() is a prettytable.PrettyTable, which is designed for producing a human-readable list. To use the list programmatically, you can call datasets.list_datasets(to_table=False), which will instead return a list of dictionaries, with each element containing the details about the dataset.
For prediction with nnU-Net, this tutorial uses the Medical Decathlon's Hippocampus dataset, so the following command can be used to filter the list of datasets to include only those with hippocampus in their ID:
all_datasets = datasets.list_datasets(to_table=False)
hippocampus_datasets = [d["slug"] for d in all_datasets if "hippocampus" in d["slug"]]
Output:
['medical-decathlon-task004-hippocampus-a_gateway-1_org-1', 'medical-decathlon-task004-hippocampus-b_gateway-2_org-2']
You can also list which models are available in the Model Registry using the models API:
models.list_models()
Output:
+-----+---------------------------+-------------------------------------+
| id  | name                      | version                             |
+-----+---------------------------+-------------------------------------+
| 0   | apheris-nnunet            | 0.10.0                              |
| 1   | apheris-statistics        | 0.24.0                              |
| ... | ...                       | ...                                 |
+-----+---------------------------+-------------------------------------+
Compute Spec
A "Compute Spec" is a specialized contract that encapsulates the execution environment and parameters for securely running statistics functions and machine learning models on specified datasets. Think of it as a blueprint that includes:
- Dataset ID(s): Each Compute Spec is linked to a specific dataset or datasets by an identifier. This ID is used to identify which Compute Gateways the code runs on
- Model: This is defined through a Docker image that contains the pre-configured environment in which the code will execute. It ensures consistency and reproducibility by packaging the code, runtime, system tools, system libraries, and settings.
- Max Compute resources: Here, the Compute Spec details the required computing resources; specifically the type and number of virtual CPUs, the amount of RAM, and any GPU requirements. This ensures that the computation will run on infrastructure that's equipped to handle the task's demands.
In essence, a Compute Spec is a comprehensive contract that delineates how, where, and on what data a piece of code can execute within the respective Compute Gateway.
To define the Compute Spec, we specify the dataset(s), hardware infrastructure requirements, and model.
Since this example shows training on a relatively simple nnU-Net model structure with a small dataset, the compute requirements are relatively low, though we still require a GPU. For training 3d models, at least 12GB of gateway memory are required.
You'll see that the compute requirements on the server are much lower - this is because only aggregation is performed on the server, which is a relatively low complexity operation.
client_resource_cfg = {
    "client_n_cpu": 1,
    "client_n_gpu": 1,
    "client_memory": 8000
}
server_resource_cfg = {
    "server_n_cpu": 0.25,
    "server_n_gpu": 0,
    "server_memory": 2000
}
# These datasets are on GW1 and GW2 respectively. You could also run with only one of them
compute_spec_id = compute.create_from_args(
    [
        'medical-decathlon-task004-hippocampus-a_gateway-1_org-1',
        'medical-decathlon-task004-hippocampus-b_gateway-2_org-2'
    ],
    **client_resource_cfg,
    **server_resource_cfg,
    model_id="apheris-nnunet",
    model_version="0.10.0"
)
Important
The model version provided above is the latest at time of writing, but please use models.list_models() to find the versions to which you have access.
The Compute Spec ID will be returned as a UUID object:
UUID('d086dce5-d89e-45e9-a6ff-82d13872ad63')
If you're not sure what values you want to put in the Compute Spec, you can use interactive mode to guide you through the creation. For example, the following snippet doesn't provide the datasets, so you'll be asked to select them:
from aphcli.api.models import Model
cs = compute.ComputeSpec(
    model=Model("apheris-nnunet", "0.10.0"),
    **client_resource_cfg,
    **server_resource_cfg,
)
cs.ask_for_empty_inputs()
Output:
## List of available datasets:
+-----+---------------------------------------------------------+--------------+---------------------------+
| idx | dataset_id                                              | organization | data custodian            |
+-----+---------------------------------------------------------+--------------+---------------------------+
| 0   | cancer-medical-images_gateway-2_org-2                   | Org 2        | Orsino Hoek               |
| 1   | pneumonia-x-ray-images_gateway-2_org-2                  | Org 2        | Orsino Hoek               |
| 2   | covid-19-patients_gateway-1_org-1                       | Org 1        | Agathe McFarland          |
| 3   | medical-decathlon-task004-hippocampus-a_gateway-1_org-1 | Org 1        | Agathe McFarland          |
| 4   | medical-decathlon-task004-hippocampus-b_gateway-2_org-2 | Org 2        | Orsino Hoek               |
| ... | ...                                                     | ...          | ...                       |
+-----+---------------------------------------------------------+--------------+---------------------------+
...
Please list the indices of all datasets that you want to include. Separate the numbers by comma, semicolon or space:
:3,4
You have selected following indices: [3, 4]
They correspond to following dataset IDs:
['medical-decathlon-task004-hippocampus-a_gateway-1_org-1', 'medical-decathlon-task004-hippocampus-b_gateway-2_org-2']
Now you can run the following to create the Compute Spec:
compute.create(cs)
Once created, you can see the full JSON that was used to create a Compute Spec using the get function:
spec = compute.get()
print(spec.to_json())
Output:
{
    "datasets": [
        "medical-decathlon-task004-hippocampus-a_gateway-1_org-1",
        "medical-decathlon-task004-hippocampus-b_gateway-2_org-2"
    ],
    "resources": {
        "clients": {
            "cpu": 1,
            "gpu": 1,
            "memory": 8000
        },
        "server": {
            "cpu": 0.25,
            "gpu": 0,
            "memory": 2000
        }
    },
    "model": {
        "id": "apheris-nnunet",
        "version": "0.10.0"
    }
}
You could save this JSON to a file and later use it with the CLI when creating a new Compute Spec to save writing the full specification each time. This requires the terminal interface:
apheris compute create --json /path/to/json/file.json
You can learn more about the terminal interface in the Getting started with Apheris CLI guide.
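Saving the specification from Python could look like the sketch below. It assumes that spec.to_json() returns a JSON string, as the printed output above suggests; save_compute_spec_json and the file name compute_spec.json are hypothetical names chosen for illustration.

```python
import json
from pathlib import Path


def save_compute_spec_json(spec_json: str, path: str) -> Path:
    """Validate a Compute Spec JSON string and write it to disk."""
    parsed = json.loads(spec_json)  # raises json.JSONDecodeError if malformed
    out = Path(path)
    out.write_text(json.dumps(parsed, indent=2))
    return out


# Usage with the objects from this guide (assumes to_json() returns a string):
#   save_compute_spec_json(spec.to_json(), "compute_spec.json")
# followed on the terminal by:
#   apheris compute create --json compute_spec.json
```

Parsing before writing means a truncated or hand-edited specification fails early, rather than when the file is later passed to the terminal CLI.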
Activating the Compute Spec
Once a Compute Spec is ready_to_activate, you can activate it. Activation deploys a cluster of Compute Clients and an Orchestrator (the central server for aggregation). When you call the compute.activate() function without the compute_spec_id argument, the CLI uses a cached one. This is the last compute_spec_id that was used successfully.
Important
Once a Compute Spec is activated, the resources assigned to it are reserved and the infrastructure is live. It is good practice to deactivate your Compute Spec once finished with it to avoid capacity issues and/or incurring additional infrastructure costs.
Let's activate a Compute Spec!
compute.activate()
Output:
On 2024-04-02 20:05:36 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2
There will be a short delay (up to 2-3 mins on first run) while the Compute Spec is activated.
You can check the status of this as follows:
compute.get_activation_status()
Output:
On 2024-04-02 20:14:49 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2
'creating'
Activation status will be one of the following values:
- waiting_activation: The Compute Spec has been created, but has never been activated.
- creating: The infrastructure has been requested and is starting up. This is the initial status after calling compute.activate().
- running: The Compute Spec is ready for workloads to be submitted.
- deleting: The infrastructure is shutting down. This is what you see after calling compute.deactivate().
- shutdown: The shutdown has completed.
- failed: An error has occurred while provisioning infrastructure.
We can poll the status of the Compute Spec until it's ready:
compute.wait_until_running(compute_spec_id)
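compute.wait_until_running does the polling for you. If you want a custom timeout or your own failure handling, the equivalent logic can be sketched as a generic loop over a status callable. This is an illustration under stated assumptions, not the CLI's implementation: wait_for_status and its parameters are hypothetical, and it assumes compute.get_activation_status returns one of the status strings listed above.

```python
import time


def wait_for_status(get_status, target="running", failed=("failed",),
                    timeout_s=600.0, poll_interval_s=10.0, sleep=time.sleep):
    """Poll get_status() until it returns target, a failure status, or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == target:
            return status
        if status in failed:
            raise RuntimeError(f"Stopped waiting: status is {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout_s} seconds")
        sleep(poll_interval_s)  # wait before asking again


# Usage with the API from this guide:
#   wait_for_status(compute.get_activation_status, timeout_s=300)
```

Raising on failed avoids waiting for the full timeout when provisioning has already gone wrong.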
There is a longer-form status command, that will retrieve more detailed information about the specific state of the Orchestrator and Compute Gateways involved in your Compute Spec.
In particular, this can be helpful to troubleshoot issues, such as out-of-capacity errors, which would be shown as below:
compute.get_status()
Output:
{
    "status": "creating",
    "message": {
        "5a873d08-7f58-4a5f-953c-87a1b7726a0b": "The computation is pending due to insufficient resources. It will resume once the necessary resources become available.",
        "f44f2052-659a-43fd-84f8-8942627d222c": "The computation has been scheduled: no details",
        "orchestrator": "The computation has been scheduled: no details"
    },
    "details": {
        "5a873d08-7f58-4a5f-953c-87a1b7726a0b": "The computation has been scheduled: 0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient nvidia.com/gpu. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod.",
        "f44f2052-659a-43fd-84f8-8942627d222c": "The computation is running",
        "orchestrator": "The computation is running"
    }
}
Listing your Compute Specs
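When several Compute Gateways are involved, it can be quicker to pick out the components that are short of capacity than to read the whole structure. The sketch below assumes the status has been parsed into a Python dictionary with the layout shown above; components_short_of_resources is a hypothetical helper, and the example data is an abbreviated copy of that output.

```python
def components_short_of_resources(status: dict) -> list:
    """Return the IDs of components whose message or details mention a shortage."""
    flagged = set()
    for section in ("message", "details"):
        for component_id, text in status.get(section, {}).items():
            if "insufficient" in text.lower():
                flagged.add(component_id)
    return sorted(flagged)


# Abbreviated copy of the example output above.
example_status = {
    "status": "creating",
    "message": {
        "5a873d08-7f58-4a5f-953c-87a1b7726a0b": "The computation is pending due to insufficient resources.",
        "orchestrator": "The computation has been scheduled: no details",
    },
    "details": {
        "5a873d08-7f58-4a5f-953c-87a1b7726a0b": "0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient nvidia.com/gpu.",
        "orchestrator": "The computation is running",
    },
}
print(components_short_of_resources(example_status))
```

For this example, only the gateway 5a873d08-7f58-4a5f-953c-87a1b7726a0b is reported.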
To help you track which Compute Specs you have created, and their states, you can use the CLI to retrieve a list of your Compute Specs. By default this just shows the IDs of all your Compute Specs (ordered newest to oldest):
compute.list_compute_specs()
The output of the command above is a list of Compute Spec objects, which contain the data from the Compute Spec and the approval status information:
{
    "id": "8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2",
    "createdAt": "2024-04-02T18:55:41.22733804Z",
    "createdBy": {
        ...
    },
    "updatedAt": "2024-04-02T18:55:41.22733804Z",
    "contextURI": "",
    "datasets": ["medical-decathlon-task004-hippocampus-a_gateway-1_org-1"],
    "model": {"id": "apheris-nnunet", "version": "1.0.0"},
    "resources": {
        "server": {"cpu": 0.25, "memory": 2000, "gpu": 0, "Replicas": 1},
        "clients": {"cpu": 1, "memory": 8000, "gpu": 1, "Replicas": 1},
    },
}
If you want more information, including the current status of the Compute Specs, you can use the helper function get_compute_specs_details. We would suggest limiting this query to 10 or so Compute Specs for the sake of speed.
compute_specs = compute.list_compute_specs()
# Extract the 10 latest Compute Specs
compute_specs = compute_specs[:10]
# Get the extra data
compute_spec_data = compute.get_compute_specs_details(compute_specs)
You can retrieve the Compute Spec list as a table using the apheris.list_compute_specs() shorthand:
apheris.list_compute_specs(limit=2, verbose=True)
Output:
+--------------------------------------+---------------------+-----------------------+-------------------------------+------------------+------------+
| ID                                   | Created             | Model                 | Datasets                      | Resources        | Activation |
|                                      |                     |                       |                               |                  | Status     |
+--------------------------------------+---------------------+-----------------------+-------------------------------+------------------+------------+
| 15fdbcaa-80cb-4bb2-9b55-fdb1111d6284 | 2024-06-24 11:29:22 | apheris-nnunet:0.10.0 | 2 datasets                    | Orchestrator:    | shutdown   |
|                                      |                     |                       |                               |   CPU: 1         |            |
|                                      |                     |                       |                               |   GPU: 0         |            |
|                                      |                     |                       |                               |   Memory: 2000MB |            |
|                                      |                     |                       |                               | Gateway:         |            |
|                                      |                     |                       |                               |   CPU: 1         |            |
|                                      |                     |                       |                               |   GPU: 0         |            |
|                                      |                     |                       |                               |   Memory: 2000MB |            |
+--------------------------------------+---------------------+-----------------------+-------------------------------+------------------+------------+
| 0e2ac20b-de6b-4fa3-880b-44ec66368775 | 2024-06-24 11:41:38 | apheris-nnunet:0.10.0 | medical-decat...teway-1_org-1 | Orchestrator:    | shutdown   |
|                                      |                     |                       |                               |   CPU: 1         |            |
|                                      |                     |                       |                               |   GPU: 0         |            |
|                                      |                     |                       |                               |   Memory: 2000MB |            |
|                                      |                     |                       |                               | Gateway:         |            |
|                                      |                     |                       |                               |   CPU: 1         |            |
|                                      |                     |                       |                               |   GPU: 1         |            |
|                                      |                     |                       |                               |   Memory: 8000MB |            |
+--------------------------------------+---------------------+-----------------------+-------------------------------+------------------+------------+
Submitting Jobs
Models in the Apheris Model Registry accept parameters to allow them to be configured or to perform different tasks.
To submit workloads to the Apheris environment, you use the job API to submit those parameters as a Python dictionary (or a JSON object if using the terminal CLI) to the Secure Runtime inside the Apheris Orchestrator.
The Secure Runtime then validates those parameters to ensure they are acceptable for the model (checking types, numerical bounds, etc.), before submitting a federation job for execution on the Orchestrator and Compute Gateways.
Here, you'll submit a simple 2D training job using the datasets you defined in the Compute Spec:
params = {
    "mode": "training",
    "model_configuration": "2d",  # 2d / 3d / 3d_fullres
    "dataset_id": 4,  # The nnU-Net dataset ID, i.e. Dataset004...
    "num_rounds": 1,  # The number of federation rounds
    "device": "cuda"  # Train on GPU
}
job_id = job.submit(params)
Similar to the Compute Specs, the job_id you will receive back is a UUID object.
To submit a job with the terminal CLI, you provide the arguments as a JSON payload:
$ apheris job run --payload '{"mode": "training", "model_configuration": "2d", "dataset_id": 4, "num_rounds": 1, "device": "cpu"}'
On 2024-04-02 20:14:49 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2
You can check the status of your job using job.status().
job.status()
Output:
On 2024-04-02 20:14:49 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2
On 2024-04-02 20:45:43 you have used the job ID `2312560a-29a4-4aa4-9166-ba57852b0e03`.
We will use the job ID 2312560a-29a4-4aa4-9166-ba57852b0e03.
'running'
When you submit a job, or run one of the job functions below, the ID used is cached and used by default for further commands.
Alternatively, to check the status of a different job simply enter its ID as a parameter.
You can also list all jobs you've submitted since activating the Compute Spec, along with their statuses:
job.list_jobs()
Output:
On 2024-04-02 20:14:49 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2
[Job(duration='1m6.46005728s', id=UUID('2312560a-29a4-4aa4-9166-ba57852b0e03'), status='running', created_at=datetime.datetime(2024, 8, 19, 22, 22, 31), compute_spec_id=UUID('4cfe632b-f1e1-4329-8618-096e8c30554c'))]
Monitoring your job
You can view the logs of your currently running training job using the job.logs() command:
>>> job.logs()
On 2024-04-02 20:14:49 you have used the `compute_spec_id` 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2.
We will use following `compute_spec_id`: 8adabfc9-6b7d-4e4b-ae1e-cc290bce7da2
On 2024-04-02 20:45:43 you have used the job ID `2312560a-29a4-4aa4-9166-ba57852b0e03`.
We will use the job ID 2312560a-29a4-4aa4-9166-ba57852b0e03.
"2024-04-02 19:45:47,056 - runner_process - INFO - Runner_process started.\n2024-04-02 19:45:47,095 - CoreCell - INFO - server.2312560a-29a4-4aa4-9166-ba57852b0e03: created backbone internal connector to tcp://localhost:13461 on parent\n2024-04-02 19:45:47,095 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE tcp://localhost:13461] is starting\n2024-04-02 19:45:47,097 - Cell - INFO - Register blob CB for channel='server_command', topic='*'\n2024-04-02 19:45:47,097 - Cell - INFO - Register blob CB for channel='aux_communication', topic='*'\n2024-04-02 19:45:47,097 - ServerCommandAgent - INFO - ServerCommandAgent cell register_request_cb: server.2312560a-29a4-4aa4-9166-ba57852b0e03\n2024-04-02 19:45:47,102 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 127.0.0.1:55436 => 127.0.0.1:13461] is created: PID: 30\n2024-04-02 19:45:47,153 - ServerRunner - INFO - [identity=Unnamed_project_54d5653d-93af-4433-a32d-7774ca74f3dc, run=2312560a-29a4-4aa4-9166-ba57852b0e03]: Server runner starting ...\n2024
Important
job.logs() is only applicable for jobs that are currently in the running state, as it streams the logs directly from the computation container.
To download logs for a job that has completed (whether successful, errored or aborted), please use apheris job download-results.
In order to protect the privacy of the data on the Compute Gateway, you only have access to the Orchestrator logs; however Apheris models will send sanitised logs from the Compute Gateways to the Orchestrator, so you can see what's happening there too.
You can find these with the prefix "ClientLogForwarder". They look like this:
"2024-04-02 19:48:20,444 - ClientLogForwarder - INFO - [identity=Unnamed_project_54d5653d-93af-4433-a32d-7774ca74f3dc, run=2312560a-29a4-4aa4-9166-ba57852b0e03, wf=fingerprint]: Message from '5a873d08-7f58-4a5f-953c-87a1b7726a0b [INFO]': Fingerprint stage complete, sending to server"
If your job errors on the Gateway in an unexpected way, the Apheris wrapper will catch the error and sanitise it before returning it to you via the ClientLogForwarder. This means you'll be able to see where in the code the error happens, but not the values of any variables.
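Because Orchestrator and forwarded gateway messages are interleaved, it can be handy to filter for the forwarded lines only. The sketch below assumes that the logs are available as one newline-separated string, as in the output shown earlier, and that forwarded lines follow the "Message from '<gateway ID> [<LEVEL>]': <text>" layout of the example above; forwarded_client_messages is a hypothetical helper.

```python
import re

# Layout of the forwarded message in the example above:
#   Message from '<gateway ID> [<LEVEL>]': <text>
_FORWARDED = re.compile(r"Message from '(?P<site>\S+) \[(?P<level>\w+)\]': (?P<text>.*)")


def forwarded_client_messages(logs: str) -> list:
    """Extract (site, level, text) tuples from ClientLogForwarder log lines."""
    messages = []
    for line in logs.splitlines():
        if " - ClientLogForwarder - " not in line:
            continue  # not a line forwarded from a Compute Gateway
        match = _FORWARDED.search(line)
        if match:
            messages.append((match["site"], match["level"], match["text"]))
    return messages
```

Filtering on the level (for example keeping only ERROR entries) is then a one-line list comprehension over the returned tuples.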
When a job completes, it will report its status as finished.completed. Other statuses you might encounter are:
- submitted: The job has been sent to the system and it is queued for running. If you see a job sitting in this state for prolonged periods, it might indicate an issue with the activated infrastructure.
- running: The job has been started and is currently computing.
- finished.completed: The job has successfully finished.
- finished.aborted: The job was aborted (using apheris job abort) while running.
- finished.execution_exception: The job errored. To understand why, you can use apheris job logs and investigate the logs.
You can poll for the job status until it shows the job is finished, then download the result. This should take roughly 10 minutes for a single federation round of training (1 epoch over the Hippocampus dataset):
job.wait_until_job_finished()
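A finished job is not necessarily a successful one, so it is worth checking the status before relying on the results. The helpers below are a minimal sketch based only on the status strings listed above; is_finished and is_successful are hypothetical names.

```python
def is_finished(status: str) -> bool:
    """True for any terminal status (completed, aborted or errored)."""
    return status.startswith("finished.")


def is_successful(status: str) -> bool:
    """True only when the job completed without being aborted or erroring."""
    return status == "finished.completed"


for status in ("submitted", "running", "finished.completed",
               "finished.aborted", "finished.execution_exception"):
    print(f"{status:30} finished={is_finished(status)} ok={is_successful(status)}")

# Usage with the API from this guide (job.status() returns the status string):
#   if is_successful(job.status()):
#       ...download the results...
```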
Downloading Results
The result is the workspace of the NVIDIA FLARE server. This contains:
- The logs from the server
- Any files that were stored in the workspace as part of the task. In this case, the aggregated final checkpoint
- The job configurations that were created by the Secure Runtime and define the workflow.
- Some metadata about the run (statistics round latency, etc.).
Once the status is finished.completed, you can download the results using the download_results command:
download_path = Path("./training_results")
job.download_results(download_path)
This will download the workspace as described above, including the aggregated checkpoint inside the app_server directory:
$ tree training_results/workspace/
training_results/workspace/
├── app_server
│   ├── <name of checkpoint>
│   │   └── nnUNetTrainer__nnUNetPlans__2d
│   │       ├── dataset.json
│   │       ├── dataset_fingerprint.json
│   │       ├── fold_all
│   │       │   └── checkpoint_final.pth
│   │       └── plans.json
│   ├── FL_global_model.pt
│   └── config
│       ├── config_fed_client.json
│       └── config_fed_server.json
├── fl_app.txt
├── log.txt
├── meta.json
└── stats_pool_summary.json
In the output above, you can see a directory <name of checkpoint>. This is the trained checkpoint from nnU-Net v2, which can now be used for local and/or centralised inference or to initialise further centralised training.
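To reuse the checkpoint (for example as checkpoint_path in the simulator's inference payload), you need the model folder that contains the fold directory. The sketch below locates it by searching the downloaded workspace for checkpoint_final.pth, relying on the directory layout shown above; find_model_dirs is a hypothetical helper.

```python
from pathlib import Path


def find_model_dirs(results_dir: str) -> list:
    """Return the model folders that contain a fold with checkpoint_final.pth."""
    checkpoints = Path(results_dir).rglob("checkpoint_final.pth")
    # checkpoint_final.pth lives in <model folder>/<fold>/, so go up two levels.
    return sorted({checkpoint.parent.parent for checkpoint in checkpoints})


# Usage with the download path from above:
#   find_model_dirs("./training_results")
```

For the tree above, this would return the nnUNetTrainer__nnUNetPlans__2d folder inside <name of checkpoint>.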
Important
The storage attached to a Compute Spec is not persistent. This means that when you deactivate a Compute Spec, any results stored on the compute pod are deleted along with the deployment.
Please make sure you download any results or logs you might need before deactivation.
Clean up
Now that your job is completed, it is important to deactivate the Compute Spec to shut down the attached infrastructure:
compute.deactivate()
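To make sure the infrastructure is shut down even if a step in between raises an exception, the activate and deactivate calls can be wrapped in a small context manager. This is a generic sketch, not part of the Apheris API: activated is a hypothetical helper that accepts any two callables, such as compute.activate and compute.deactivate from this guide.

```python
from contextlib import contextmanager


@contextmanager
def activated(activate, deactivate):
    """Call activate() on entry and always call deactivate() on exit."""
    activate()
    try:
        yield
    finally:
        deactivate()  # runs even if the body raises


# Usage with the API from this guide:
#   with activated(compute.activate, compute.deactivate):
#       ...wait until running, submit the job, wait for it to finish...
#       ...download the results before leaving the block...
```

Because results stored on the compute pod are deleted on deactivation, any download has to happen inside the with block.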
Summary
In this tutorial, you have been introduced to the nnU-Net machine learning model as an example of how to run your model on the Apheris environment.
You have seen how to use the NVIDIA FLARE Simulator to test your model locally on dummy data, then been guided through how to run the same workload on the Apheris environment with real data using the Apheris Python API.
If you'd like to learn more about how to interact with the Apheris environment, we'd recommend the Getting started with Apheris CLI guide, or to find out more about how to use the Statistics package to analyse data in a secure and privacy-preserving way, check out our guide to Simulating and Running Statistics.