Guide: Getting started with Apheris CLI🔗
In this tutorial, you will explore the Apheris Command Line Interface (CLI). It is a tool for Machine Learning engineers and Data Scientists to define Federated computations, launch them and get the results. It does not provide functionality for Data Custodians. So, in this guide, you take the role of a Data Scientist.
You will first learn how to access the reference, then you will see how to log in and out of Apheris and how to check your login status.
You will explore the datasets that you have access to. Then you will define the settings for a Federated computation - we call it a "Compute Spec" - and launch it.
This tutorial is written using the terminal version of the CLI, which provides a set of operations you can run from your terminal using the `apheris` command.
The Apheris CLI package also includes a lower-level Python API, which can be used from a Jupyter Notebook. To learn more about that, please see our guide on simulating and running ML workloads.
Installation🔗
Before continuing with this guide, please follow Quickstart: Installing the Apheris CLI to ensure you have the CLI and its dependencies installed.
Access reference🔗
Let's start with a look into the reference. Run `apheris` or `apheris --help` to get an overview of the commands.
$ apheris --help
Usage: apheris [OPTIONS] COMMAND [ARGS]...
╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to copy it or │
│ customize the installation. │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────╮
│ compute Use the sub-commands to interact with Compute Specs. │
│ datasets Use the sub-commands to interact with datasets. │
│ job Use the sub-commands to interact with jobs. │
│ login Interactive login to the Apheris platform. You will be forwarded to a │
│ website. For machine to machine applications (m2m), make sure the │
│ environment variables `APH_SERVICE_USER_CLIENT_ID` and │
│ `APH_SERVICE_USER_CLIENT_SECRET` are set. Call `apheris login status` to │
│ check your login status. │
│ logout Log out of the Apheris platform. │
│ models Interact with the Model Registry. │
│ version Print the version of the Apheris CLI. │
╰────────────────────────────────────────────────────────────────────────────────────────╯
The first line tells you that all commands start with `apheris`, followed by a command and potentially sub-commands or arguments. You see hints on how to install and use auto-completion, and you learn that `--help` can be appended to show the reference. This is followed by a list of all commands.
You can append `--help` to any command to see its reference. As an example, let us have a look at the `apheris datasets` command.
$ apheris datasets --help
Usage: apheris datasets [OPTIONS] COMMAND [ARGS]...
Call `apheris datasets list` to show all datasets. Then call `apheris datasets describe
<dataset_id>` to show details for a particular one.
╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────╮
│ describe Show information on a single dataset. │
│ list List all datasets that you have access to. │
╰────────────────────────────────────────────────────────────────────────────────────────╯
Login🔗
Whenever you interact with the Apheris environment, you need to be logged in. When you call the `apheris login` command, you are forwarded to a website where you need to provide the name of the organization your user is associated with (this name has been provided to you by your Apheris representative). Then, depending on the configuration of your organization, you sign in either by entering the credentials that were provided to you or by using SSO.
$ apheris login
Logging in to your company account...
Apheris:
Authenticating with Apheris Cloud Platform...
Please continue the authorization process in your browser.
Login was successful
You are logged in:
e-mail: your.name@your-company.com
organization: your_organisation
environment: your_environment
You can check your current login status.
$ apheris login status
You are logged in:
e-mail: your.name@your-company.com
organization: your_organisation
environment: your_environment
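In scripts, it can be handy to read these fields programmatically. The following is a minimal Python sketch, assuming the logged-in output has exactly the `key: value` layout shown above. The helper name `parse_login_status` is illustrative and not part of the Apheris package, and the output of a logged-out session is not covered here.

```python
def parse_login_status(output: str) -> dict:
    """Collect the 'key: value' lines of `apheris login status` into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        # Skip lines without a value, such as the "You are logged in:" header.
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


# Sample text copied from the example output above.
sample = """You are logged in:
e-mail: your.name@your-company.com
organization: your_organisation
environment: your_environment"""

status = parse_login_status(sample)
print(status["organization"])
```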
When you are done with your work, it is recommended to log out.
$ apheris logout
Logging out from Apheris Cloud Platform session
Logging out from Apheris Compute environments session
Successfully logged out
Login for machine-to-machine workflows🔗
It is possible to run the Apheris CLI in a non-interactive way, e.g. for scripted environments. In such cases, you need to avoid manual login via the website.
In this situation, you can set the environment variables `APH_SERVICE_USER_CLIENT_ID` and `APH_SERVICE_USER_CLIENT_SECRET` and then call `apheris login`. No user interaction is needed.
Please contact your Apheris representative to obtain the values for these variables.
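A scripted login could check for both variables before invoking the CLI, so that a missing value fails early with a clear message. This is only a sketch: the function names are hypothetical, and it assumes the `apheris` executable is on the `PATH`.

```python
import os
import subprocess

# The two variable names come from the CLI help text shown earlier.
REQUIRED_VARS = ("APH_SERVICE_USER_CLIENT_ID", "APH_SERVICE_USER_CLIENT_SECRET")


def missing_m2m_variables(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def login_non_interactively():
    """Run `apheris login` only if both service-user variables are set."""
    missing = missing_m2m_variables()
    if missing:
        raise RuntimeError("Set these environment variables first: " + ", ".join(missing))
    # check=True raises CalledProcessError if the CLI exits with an error.
    subprocess.run(["apheris", "login"], check=True)
```

Calling `login_non_interactively()` at the start of a pipeline step would then replace the browser-based flow.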
Explore datasets🔗
On the Apheris platform, Data Custodians can register datasets.
Once the dataset is registered, the Data Custodian can allow Data Scientists of other organizations to run pre-defined models on these datasets by creating an asset policy, with the Data Scientist as a beneficiary.
Let us assume that this has already been done and you have been given access to some datasets. You can explore the datasets that you have access to using the CLI.
First, you can have a look at the help documentation:
$ apheris datasets --help
Usage: apheris datasets [OPTIONS] COMMAND [ARGS]...
Use the sub-commands to interact with datasets.
╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────╮
│ describe Show information on a single dataset. │
│ list List all datasets that you have access to. │
╰────────────────────────────────────────────────────────────────────────────────────────╯
Now call the `list` command to see all the datasets you have access to (note that you might see a different list from the one below):
$ apheris datasets list
+-----+---------------------------------------------------------+--------------+---------------------------+
| idx | dataset_id | organization | data custodian |
+-----+---------------------------------------------------------+--------------+---------------------------+
| 0 | cancer-medical-images_gateway-2_org-2 | Org 2 | Orsino Hoek |
| 1 | pneumonia-x-ray-images_gateway-2_org-2 | Org 2 | Orsino Hoek |
| 2 | covid-19-patients_gateway-1_org-1 | Org 1 | Agathe McFarland |
| 3 | medical-decathlon-task004-hippocampus-a_gateway-1_org-1 | Org 1 | Agathe McFarland |
| 4 | medical-decathlon-task004-hippocampus-b_gateway-2_org-2 | Org 2 | Orsino Hoek |
| ... | ... | ... | ... |
+-----+---------------------------------------------------------+--------------+---------------------------+
This shows a list of all datasets that you have access to. Further, you see the organization that the datasets belong to and the name of the person who registered the dataset (here shown as "data custodian").
Let's take a closer look at a single dataset!
$ apheris datasets describe medical-decathlon-task004-hippocampus-b_gateway-2_org-2
{
"slug": "medical-decathlon-task004-hippocampus-b_gateway-2_org-2",
"name": "Medical Decathlon Task004 Hippocampus B",
"description": "HS2 Dataset Patient used for testing",
"owner": {
"email": "o.hoek.demo@apheris.com",
"full_name": "Orsino Hoek"
},
"organization": "Org 2",
"node": {
"id": "...",
"name": "...",
"aws_account": "...",
"public_key": "..."
},
"data": {
"version": 1,
"real_data": {
"files": {
"Dataset004_Hippocampus_B.zip": "s3://..."
}
},
"dummy_data": {
"files": {
"Dataset004_Hippocampus_dummyB.zip": "s3://..."
}
}
},
"updated_at": "2024-03-27T16:09:10.629778Z",
"created_at": "2024-01-17T13:42:36.659787Z"
}
For the dataset in scope, you see detailed information: the dataset_id (here called "slug"), a text description and owner information.
The "node" section holds information on the Gateway where the data resides.
You find the URL of the actual "real" data that the custodian has registered. You cannot download the "real" data as it cannot leave the Gateway.
To allow you to test your model on representative data, the Data Custodian may upload "Dummy Data", which should be similar in structure to the real data, but not sensitive. When you call the `describe` command, the Dummy Data is downloaded to your home folder (`~/.apheris/RemoteData/...`) and can be accessed later using the NVIDIA FLARE simulator mode provided with your chosen model.
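To see at a glance which files are registered as real and dummy data, the JSON can be processed with a few lines of Python. This sketch assumes the `describe` output is valid JSON with the structure shown above; `list_dataset_files` is an illustrative helper, not an Apheris function.

```python
import json


def list_dataset_files(description: dict) -> dict:
    """Return the file names registered as real and as dummy data."""
    data = description.get("data", {})
    return {
        kind: sorted(data.get(kind, {}).get("files", {}))
        for kind in ("real_data", "dummy_data")
    }


# Trimmed copy of the example output above.
sample = json.loads("""
{
  "slug": "medical-decathlon-task004-hippocampus-b_gateway-2_org-2",
  "data": {
    "version": 1,
    "real_data": {"files": {"Dataset004_Hippocampus_B.zip": "s3://..."}},
    "dummy_data": {"files": {"Dataset004_Hippocampus_dummyB.zip": "s3://..."}}
  }
}
""")

files = list_dataset_files(sample)
print(files["dummy_data"])
```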
Models🔗
The Apheris Model Registry contains a number of machine learning and data science models which have been written by Apheris to support federation.
You can find out more about these using the `apheris models` commands. First, take a look at the documentation:
$ apheris models --help
Usage: apheris models [OPTIONS] COMMAND [ARGS]...
Interact with the Model Registry.
╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────╮
│ list List models that are available in the Model Registry. │
╰────────────────────────────────────────────────────────────────────────────────────────╯
With `apheris models list`, you can get information on all models that you have access to.
Data Custodians can allow one or more models to run on one or more of their datasets by defining the asset policy.
$ apheris models list
+-----+---------------------------+-------------------------------------+
| id | name | version |
+-----+---------------------------+-------------------------------------+
| 0 | apheris-nnunet | 0.10.0 |
| 1 | apheris-statistics | 0.24.0 |
| ... | ... | ... |
+-----+---------------------------+-------------------------------------+
For example, you have access to the model `apheris-nnunet`, which is a Machine Learning model for image segmentation (see here).
You also have access to the Apheris Statistics package, which provides privacy-sensitive federated statistics functionality. You can find more details about supported statistical functions and privacy controls on the controls for Apheris Stats page.
Creating a Compute Spec🔗
A Compute Spec is a specialized contract that encapsulates the execution environment and parameters for securely running statistics functions and machine learning models on specified datasets. Think of it as a blueprint that includes:
- Dataset ID(s): Each Compute Spec is linked to one or more datasets by their identifiers. These IDs determine which Compute Gateways the code runs on.
- Model: This is defined through a Docker image that contains the pre-configured environment in which the code will execute. It ensures consistency and reproducibility by packaging the code, runtime, system tools, system libraries, and settings.
- Maximum Compute resources: The Compute Spec details the required computing resources, specifically the type and number of virtual CPUs, the amount of RAM, and any GPU requirements. This ensures that the computation will run on infrastructure that is equipped to handle the task's demands.
In essence, a Compute Spec is a comprehensive contract that delineates how, where, and on what data a piece of code can execute within the respective Compute Gateway.
To define the Compute Spec, we specify the dataset(s), hardware infrastructure requirements, and model.
Important
Please note that a federated computation across multiple datasets requires that each dataset resides in a different Gateway.
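Before creating a Compute Spec across several datasets, you could check this constraint on the `describe` output shown earlier. The sketch below assumes that the `node.id` field identifies the Gateway, which the "node" section suggests but this guide does not state explicitly. The function name and the IDs in the usage note are hypothetical.

```python
from collections import defaultdict


def datasets_sharing_a_gateway(descriptions):
    """Group dataset slugs by node ID and return the groups with more than one dataset.

    `descriptions` is a list of dicts as printed by `apheris datasets describe`.
    Assumption: `node.id` identifies the Gateway that hosts the dataset.
    """
    by_gateway = defaultdict(list)
    for description in descriptions:
        by_gateway[description["node"]["id"]].append(description["slug"])
    return {gateway: slugs for gateway, slugs in by_gateway.items() if len(slugs) > 1}
```

An empty result means every dataset sits on its own Gateway. For example, two descriptions with node IDs `gw-1` and `gw-2` yield `{}`, while adding a third dataset on `gw-1` reports that Gateway together with both of its slugs.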
Let's start with a look at the reference:
$ apheris compute --help
Usage: apheris compute [OPTIONS] COMMAND [ARGS]...
Use the sub-commands to interact with Compute Specs.
╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ─────────────────────────────────────────────────────────────────────────────╮
│ activate Activate a Compute Specification. This will spin up a cluster of │
│ Compute Clients and Compute Aggregators. │
│ activate-status Get information on the status of the activation of a compute │
│ specification. │
│ create Create a Compute Specification on the Apheris orchestrator. All │
│ parameters that are not passed as command line arguments will be │
│ interactively queried. │
│ deactivate Deactivate a Compute Specification - stops any running jobs and │
│ shuts down any infrastructure that was brought up for this compute │
│ spec. Use this if you have spun up a cluster of Compute Clients and │
│ Compute Aggregators, and don't need it anymore. Provided the compute │
│ specification remains approved, you can use activate to reinstate │
│ the infrastructure if needed at a later time. │
│ get Get a Compute Specification from the Apheris orchestrator. │
│ list List all your Compute Specifications. │
│ status Get the status of a Compute Specification. │
╰────────────────────────────────────────────────────────────────────────────────────────╯
A Compute Spec can be created interactively by running `apheris compute create`. You will be asked for each parameter, one after another. It is also possible to provide all arguments through the command line. If you only provide some arguments via the command line, the remaining ones will be asked for interactively.
Let's have a look at all the arguments for Compute Spec creation:
$ apheris compute create --help
Usage: apheris compute create [OPTIONS]
Create a Compute Specification on the Apheris orchestrator. All parameters that are not
passed as command line arguments will be interactively queried.
╭─ Options ──────────────────────────────────────────────────────────────────────────────╮
│ --dataset_ids TEXT Comma-separated dataset IDs, e.g. │
│ `-dataset_ids=id1,id2,id3` │
│ [default: None] │
│ --ignore_limits The CLI sets some expected bounds for requested │
│ infrastructure resources. Use this flag to override │
│ the validation checks if your model requires more │
│ resources. │
│ --client_n_cpu FLOAT number of vCPUs of Compute Clients [default: None] │
│ --client_n_gpu INTEGER number of GPUs of Compute Clients [default: None] │
│ --client_memory INTEGER memory of Compute Clients [MByte] [default: None] │
│ --server_n_cpu FLOAT number of vCPUs of Compute Aggregators │
│ [default: None] │
│ --server_n_gpu INTEGER number of GPUs of Compute Aggregators │
│ [default: None] │
│ --server_memory INTEGER memory of Compute Aggregators [MByte] │
│ [default: None] │
│ --model_id TEXT A model ID, e.g. statistics [default: None] │
│ --model_version TEXT The version of the model to use, e.g. v0.0.5 │
│ [default: None] │
│ --json PATH File path to json file that describes a compute │
│ spec. Please use the interactive workflow once to │
│ learn about the expected format. If specified, all │
│ other arguments (except for `force`) must be None to │
│ avoid clashes. │
│ [default: None] │
│ --force -f Do not ask if user is certain, and do not ask for │
│ arguments interactively. │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────╯
Important
In the following example, the stated model version is the latest at the time of writing, but please use `apheris models list` to find the versions to which you have access.
$ apheris compute create \
--dataset_ids=medical-decathlon-task004-hippocampus-a_gateway-1_org-1,medical-decathlon-task004-hippocampus-b_gateway-2_org-2 \
--client_n_cpu=1 \
--client_n_gpu=1 \
--client_memory=8000 \
--server_n_cpu=1 \
--server_n_gpu=0 \
--server_memory=2000 \
--model_id=apheris-nnunet \
--model_version=0.10.0 \
-f
# Create Compute Specification.
We successfully created a Compute Specification. Please note the ID defe5013-2c73-4eb9-be52-1ae7aed841ff
{
"id": "defe5013-2c73-4eb9-be52-1ae7aed841ff",
"createdAt": "2024-03-28T12:13:00.259214188Z",
"createdBy": {
...
},
"updatedAt": "2024-03-28T12:13:00.259214188Z",
"datasets": [
"medical-decathlon-task004-hippocampus-a_gateway-1_org-1",
"medical-decathlon-task004-hippocampus-b_gateway-2_org-2"
],
"model": {
"id": "apheris-nnunet",
"version": "0.10.0"
},
"resources": {
"server": {
"cpu": 1,
"memory": 2000,
"gpu": 0,
"Replicas": 1
},
"clients": {
"cpu": 1,
"memory": 8000,
"gpu": 1,
"Replicas": 1
}
}
}
You see all the information that you have just entered plus some information from the response. The response tells us the Compute Spec ID. This is an important reference, as it is used whenever you wish to activate / deactivate your Compute Spec, or submit jobs.
Here, `-f` is appended to the command to skip the interactive questions. The long form `--force` is equivalent.
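When the same command is issued from a script, assembling the argument list in code avoids quoting and line-continuation mistakes. The following Python sketch only uses the options listed in the help output above. The function name and the shape of the `client` and `server` dictionaries are illustrative choices.

```python
import shlex


def compute_create_command(dataset_ids, model_id, model_version, client, server, force=True):
    """Build the argv list for `apheris compute create`.

    `client` and `server` are dicts with the keys n_cpu, n_gpu and memory (MByte).
    """
    args = ["apheris", "compute", "create", "--dataset_ids=" + ",".join(dataset_ids)]
    for role, resources in (("client", client), ("server", server)):
        args.append(f"--{role}_n_cpu={resources['n_cpu']}")
        args.append(f"--{role}_n_gpu={resources['n_gpu']}")
        args.append(f"--{role}_memory={resources['memory']}")
    args.append(f"--model_id={model_id}")
    args.append(f"--model_version={model_version}")
    if force:
        args.append("-f")  # skip confirmation and interactive prompts
    return args


command = compute_create_command(
    [
        "medical-decathlon-task004-hippocampus-a_gateway-1_org-1",
        "medical-decathlon-task004-hippocampus-b_gateway-2_org-2",
    ],
    "apheris-nnunet",
    "0.10.0",
    client={"n_cpu": 1, "n_gpu": 1, "memory": 8000},
    server={"n_cpu": 1, "n_gpu": 0, "memory": 2000},
)
print(shlex.join(command))
```

The resulting list can be passed directly to `subprocess.run(command, check=True)`.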
Another option to create a Compute Spec is to reference a JSON file with all required information.
$ cat > my_compute_spec.json <<EOL
{
"datasets": [
"medical-decathlon-task004-hippocampus-a_gateway-1_org-1",
"medical-decathlon-task004-hippocampus-b_gateway-2_org-2"
],
"model": {
"id": "apheris-nnunet",
"version": "0.10.0"
},
"resources": {
"server": {
"cpu": 1,
"memory":2000,
"gpu": 0,
"Replicas": 1
},
"clients": {
"cpu": 1,
"memory": 8000,
"gpu": 1,
"Replicas": 1
}
}
}
EOL
$ apheris compute create --json my_compute_spec.json -f
# Create Compute Specification.
We successfully created a Compute Specification. Please note the ID defe5013-2c73-4eb9-be52-1ae7aed841ff
{
"id": "defe5013-2c73-4eb9-be52-1ae7aed841ff",
"createdAt": "2024-03-28T12:13:00.259214188Z",
...
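The JSON file can also be generated rather than typed by hand. This is a sketch that mirrors the field names of the example file above, including `"Replicas": 1`, which is copied from the example without further interpretation. The helper names are hypothetical.

```python
import json
from pathlib import Path


def build_compute_spec(datasets, model_id, model_version, server, clients):
    """Return a dict in the same shape as the my_compute_spec.json example."""

    def resources(cpu, memory, gpu):
        # "Replicas": 1 is mirrored from the example file.
        return {"cpu": cpu, "memory": memory, "gpu": gpu, "Replicas": 1}

    return {
        "datasets": list(datasets),
        "model": {"id": model_id, "version": model_version},
        "resources": {"server": resources(**server), "clients": resources(**clients)},
    }


def write_compute_spec(spec, path):
    """Write the spec as indented JSON so it can be passed to `--json`."""
    Path(path).write_text(json.dumps(spec, indent=2))


spec = build_compute_spec(
    [
        "medical-decathlon-task004-hippocampus-a_gateway-1_org-1",
        "medical-decathlon-task004-hippocampus-b_gateway-2_org-2",
    ],
    "apheris-nnunet",
    "0.10.0",
    server={"cpu": 1, "memory": 2000, "gpu": 0},
    clients={"cpu": 1, "memory": 8000, "gpu": 1},
)
```

After `write_compute_spec(spec, "my_compute_spec.json")`, the file can be used with `apheris compute create --json my_compute_spec.json -f` as shown above.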
You can use the `apheris compute status` command to get information on the status of a Compute Spec. You can pass a `compute_spec_id` argument to specify a Compute Spec ID, or omit it to get information on the cached Compute Spec ID.
$ apheris compute status
On 2024-03-28 12:17:06 you have used the `compute_spec_id` defe5013-2c73-4eb9-be52-1ae7aed841ff.
We will use following `compute_spec_id`: defe5013-2c73-4eb9-be52-1ae7aed841ff
Activation status: waiting_activation
Activating a Compute Spec🔗
The next step after creating your Compute Spec is to "activate" it. Activation deploys a cluster of Compute Clients and an Orchestrator (central server for aggregation). When you call the `apheris compute activate` command without the `compute_spec_id` argument, the CLI uses a cached one. This is the last `compute_spec_id` that was used successfully.
Important
Once a Compute Spec is activated, the resources assigned to it are reserved and the infrastructure is live. It is good practice to deactivate your Compute Spec once finished with it to avoid capacity issues and/or incurring additional infrastructure costs.
Let's activate a Compute Spec!
$ apheris compute activate -f
On 2024-03-28 12:16:03 you have used the `compute_spec_id` defe5013-2c73-4eb9-be52-1ae7aed841ff.
We will use following `compute_spec_id`: defe5013-2c73-4eb9-be52-1ae7aed841ff
Successfully requested activation of the Compute Specification defe5013-2c73-4eb9-be52-1ae7aed841ff!
{
"workflowId": "deploy_defe5013-2c73-4eb9-be52-1ae7aed841ff",
"runId": "6e8bf2d8-3fcf-467e-8b71-83e5d1b2c276"
}
Now if you run `apheris compute status`, you'll see the `Activation status` field has changed to `creating`:
$ apheris compute status
On 2024-03-28 12:17:06 you have used the `compute_spec_id` defe5013-2c73-4eb9-be52-1ae7aed841ff.
We will use following `compute_spec_id`: defe5013-2c73-4eb9-be52-1ae7aed841ff
Activation status: creating
Interpreting the status of your Compute Spec🔗
When you activate a Compute Spec, the infrastructure is provisioned and deployed. The `Activation status` field shows you the current state of the deployed infrastructure that is attached to this Compute Spec.
This field has one of the following values:
- `waiting_activation`: The Compute Spec has been created, but has never been activated.
- `creating`: The infrastructure has been requested and is starting up. This is the initial status after calling `compute.activate()`.
- `running`: The Compute Spec is ready for workloads to be submitted.
- `deleting`: The infrastructure is shutting down. This is what you see after calling `compute.deactivate()`.
- `shutdown`: The shutdown has completed.
- `failed`: An error has occurred while provisioning infrastructure.
Please call the status command again until the `Activation status` has switched to `running`. This means that the cluster is up and running and ready for you to submit jobs!
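Repeating the status call by hand can be replaced by a small polling loop. The Python sketch below parses the `Activation status:` line from the CLI output and waits until it reads `running`. How the output is fetched is left to a callable you supply. The function names, the polling interval and the decision to stop on `failed`, `deleting` and `shutdown` are assumptions of this sketch, not documented CLI behaviour.

```python
import re
import time

ACTIVATION_RE = re.compile(r"^Activation status:\s*(\S+)", re.MULTILINE)


def parse_activation_status(output):
    """Extract the value of the 'Activation status:' line, or None if absent."""
    match = ACTIVATION_RE.search(output)
    return match.group(1) if match else None


def wait_until_running(fetch_output, interval_s=30.0, max_checks=40, sleep=time.sleep):
    """Poll until the activation status is 'running'.

    `fetch_output` is any callable that returns the text printed by
    `apheris compute status`.
    """
    for _ in range(max_checks):
        status = parse_activation_status(fetch_output())
        if status == "running":
            return status
        if status in {"failed", "deleting", "shutdown"}:
            raise RuntimeError(f"Activation cannot complete, status is {status!r}")
        sleep(interval_s)
    raise TimeoutError("Compute Spec did not reach 'running' in time")
```

One possible `fetch_output` is `lambda: subprocess.run(["apheris", "compute", "status"], capture_output=True, text=True).stdout`, after importing `subprocess`.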
You can see more details about the status of your Compute Spec by running `apheris compute status` with the `--verbose` flag. This will show you information about the specific status of the Orchestrator and Compute Gateways, under the `Activation details` heading.
In particular, this can be helpful to troubleshoot issues, such as out-of-capacity errors, which would be shown as below:
$ apheris compute status --verbose defe5013-2c73-4eb9-be52-1ae7aed841ff
Activation status: creating
Activation details:
* [5a873d08-7f58-4a5f-953c-87a1b7726a0b]: The computation is pending due to insufficient resources. It will resume once the necessary resources become available.
* [f44f2052-659a-43fd-84f8-8942627d222c]: The computation has been scheduled: no details
* [orchestrator]: The computation has been scheduled: no details
In this example, you can see three bullet points. Each of them has the same format:
* [<GATEWAY ID>]: <STATUS SUMMARY>: <DETAILED INFORMATION>
So in the above example, you can see that the Gateway with ID `5a873d08-7f58-4a5f-953c-87a1b7726a0b` doesn't have sufficient resources available (here, GPU), so the computation on that Gateway stays pending until the resources become available.
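The bullet format lends itself to simple parsing, for instance to list the components that are waiting for resources. This is a sketch based only on the three example lines above. Note that the first line carries no separate detail part, so the detail is treated as optional. The names used are illustrative.

```python
import re
from typing import NamedTuple


class ActivationDetail(NamedTuple):
    component: str  # a Gateway ID or "orchestrator"
    summary: str
    detail: str     # empty if the line has no detail part


LINE_RE = re.compile(r"^\*\s*\[(?P<component>[^\]]+)\]:\s*(?P<rest>.*)$")


def parse_activation_details(output):
    """Parse the '* [<ID>]: <summary>: <detail>' bullets of the verbose status."""
    details = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        summary, _, detail = match.group("rest").partition(": ")
        details.append(ActivationDetail(match.group("component"), summary, detail))
    return details


sample = """Activation status: creating
Activation details:
* [5a873d08-7f58-4a5f-953c-87a1b7726a0b]: The computation is pending due to insufficient resources. It will resume once the necessary resources become available.
* [f44f2052-659a-43fd-84f8-8942627d222c]: The computation has been scheduled: no details
* [orchestrator]: The computation has been scheduled: no details"""

details = parse_activation_details(sample)
pending = [d.component for d in details if "insufficient resources" in d.summary]
print(pending)
```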
Listing your Compute Specs🔗
To get a list of all your Compute Spec IDs, use `apheris compute list`. This will provide you with a list of all your Compute Specs, when they were created and their current activation status. The list is sorted by creation date in ascending order.
$ apheris compute list
+--------------------------------------+---------------------+--------------------+
| ID | Created | Activation Status |
+--------------------------------------+---------------------+--------------------+
| ... | ... | ... |
| 8f7fc888-2175-4ced-9d8b-81cf24848f8e | 2024-06-11 08:11:04 | shutdown |
| 47f54624-5373-487c-a310-ea522ab99fb4 | 2024-06-20 15:26:09 | waiting_activation |
| 5df38d4c-c7d1-4eff-b4fd-2d8528d67547 | 2024-06-20 15:27:49 | shutdown |
| 17777b1d-e2f2-467e-95cd-190add7c9945 | 2024-06-20 16:23:18 | shutdown |
| 36879f91-7eef-4e4c-9e2b-755d06ca9eac | 2024-06-21 10:20:20 | shutdown |
| 6b36c615-a2b2-4e8d-945d-8aaff8c0d9e5 | 2024-06-24 11:21:34 | waiting_activation |
| 90b78d9c-2839-449f-b8fa-83aabcd25c81 | 2024-06-24 11:22:33 | waiting_activation |
| 15fdbcaa-80cb-4bb2-9b55-fdb1111d6284 | 2024-06-24 11:29:22 | shutdown |
| 0e2ac20b-de6b-4fa3-880b-44ec66368775 | 2024-06-24 11:41:38 | running |
+--------------------------------------+---------------------+--------------------+
As the list of Compute Specs you own grows, you might wish to limit the number of rows that are displayed, which can be done using the `--limit` argument:
$ apheris compute list --limit 5
+--------------------------------------+---------------------+--------------------+
| ID | Created | Activation Status |
+--------------------------------------+---------------------+--------------------+
| 36879f91-7eef-4e4c-9e2b-755d06ca9eac | 2024-06-21 10:20:20 | shutdown |
| 6b36c615-a2b2-4e8d-945d-8aaff8c0d9e5 | 2024-06-24 11:21:34 | waiting_activation |
| 90b78d9c-2839-449f-b8fa-83aabcd25c81 | 2024-06-24 11:22:33 | waiting_activation |
| 15fdbcaa-80cb-4bb2-9b55-fdb1111d6284 | 2024-06-24 11:29:22 | shutdown |
| 0e2ac20b-de6b-4fa3-880b-44ec66368775 | 2024-06-24 11:41:38 | running |
+--------------------------------------+---------------------+--------------------+
Showing 5 out of 127 Compute Specs. To show all, remove the '--limit' flag.
Note
If you are listing a large number of Compute Specs, there may be some delay while the activation statuses are gathered. You can reduce this time by limiting the number of Compute Specs you show, using the `--limit` argument as shown above.
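To find Compute Specs that are still running (and may therefore still reserve resources), the table can be parsed in a script. The sketch below assumes the non-verbose table layout shown above, with one row per line. It is not meant for the multi-line cells of the verbose listing, and the function name is illustrative.

```python
def parse_compute_list(output: str) -> list:
    """Turn the table printed by `apheris compute list` into a list of dicts."""
    rows, header = [], None
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +---+ borders and any footer text
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first table row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# Two rows copied from the example output above.
sample = """
+--------------------------------------+---------------------+--------------------+
| ID                                   | Created             | Activation Status  |
+--------------------------------------+---------------------+--------------------+
| 15fdbcaa-80cb-4bb2-9b55-fdb1111d6284 | 2024-06-24 11:29:22 | shutdown           |
| 0e2ac20b-de6b-4fa3-880b-44ec66368775 | 2024-06-24 11:41:38 | running            |
+--------------------------------------+---------------------+--------------------+
"""

running = [row["ID"] for row in parse_compute_list(sample) if row["Activation Status"] == "running"]
print(running)
```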
You can use `-v` or `--verbose` to get more information about your Compute Specs, including the model and datasets used and the requested maximum resources.
$ apheris compute list -v --limit 2
+--------------------------------------+---------------------+----------------------+-------------------------------+------------------+------------+
| ID | Created | Model | Datasets | Resources | Activation |
| | | | | | Status |
+--------------------------------------+---------------------+----------------------+-------------------------------+------------------+------------+
| 15fdbcaa-80cb-4bb2-9b55-fdb1111d6284 | 2024-06-24 11:29:22 | apheris-nnunet:0.10.0 | 2 datasets | Orchestrator: | shutdown |
| | | | | CPU: 1 | |
| | | | | GPU: 0 | |
| | | | | Memory: 2000MB | |
| | | | | Gateway: | |
| | | | | CPU: 1 | |
| | | | | GPU: 0 | |
| | | | | Memory: 2000MB | |
+--------------------------------------+---------------------+----------------------+-------------------------------+------------------+------------+
| 0e2ac20b-de6b-4fa3-880b-44ec66368775 | 2024-06-24 11:41:38 | apheris-nnunet:0.10.0 | medical-decat...teway-1_org-1 | Orchestrator: | shutdown |
| | | | | CPU: 1 | |
| | | | | GPU: 0 | |
| | | | | Memory: 2000MB | |
| | | | | Gateway: | |
| | | | | CPU: 1 | |
| | | | | GPU: 1 | |
| | | | | Memory: 8000MB | |
+--------------------------------------+---------------------+----------------------+-------------------------------+------------------+------------+
Showing 2 out of 127 Compute Specs. To show all, remove the '--limit' flag.
For more details on a Compute Spec, call `apheris compute get`.
When you don't need the activated cluster anymore, use `apheris compute deactivate` to shut the deployment down and free resources:
$ apheris compute deactivate -f
On 2024-03-28 12:17:06 you have used the `compute_spec_id` defe5013-2c73-4eb9-be52-1ae7aed841ff.
We will use following `compute_spec_id`: defe5013-2c73-4eb9-be52-1ae7aed841ff
Successfully shutdown the deployment of the Compute Specification defe5013-2c73-4eb9-be52-1ae7aed841ff!
Submit Jobs🔗
Once the Compute Spec is active (its activation status is `running`), you can submit a job. Each model takes its own arguments, which are passed via the `payload` argument. The following command launches a training job for image segmentation.
$ apheris job run \
--payload '{"mode": "training", "model_configuration": "2d", "dataset_id": 4, "num_rounds": 1}' \
--force
On 2024-03-28 12:17:06 you have used the `compute_spec_id` defe5013-2c73-4eb9-be52-1ae7aed841ff.
We will use following `compute_spec_id`: defe5013-2c73-4eb9-be52-1ae7aed841ff
The job was submitted! The job ID is f77d5dc7-a2e7-4a2a-827d-49a2131b1ffe
This will submit an nnU-Net training job. You can learn more about what these parameters mean in the Guide to Machine Learning on Apheris.
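Because the payload is a JSON string inside a shell command, quoting errors are easy to make. One way around this in scripts is to build the payload as a Python dictionary and serialize it with `json.dumps`. The sketch below reuses the payload keys from the example above. The function name is hypothetical.

```python
import json


def job_run_command(payload: dict, force: bool = True) -> list:
    """Build the argv list for `apheris job run` with a JSON-encoded payload."""
    args = ["apheris", "job", "run", "--payload", json.dumps(payload)]
    if force:
        args.append("--force")
    return args


# Same payload as in the command above.
payload = {"mode": "training", "model_configuration": "2d", "dataset_id": 4, "num_rounds": 1}
command = job_run_command(payload)
print(command)
```

Passing the list to `subprocess.run(command, check=True)` hands the payload to the CLI as a single argument, so no shell quoting is involved.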
You can check the status of your job using `apheris job status`.
When you submit a job, or run one of the job commands below, the job ID used is cached and used by default for further commands.
Alternatively, to check the status of a different job, simply pass its ID as a parameter.
$ apheris job status
On 2024-03-28 12:17:06 you have used the `compute_spec_id` defe5013-2c73-4eb9-be52-1ae7aed841ff.
We will use following `compute_spec_id`: defe5013-2c73-4eb9-be52-1ae7aed841ff
On 2024-03-28 12:47:23 you have used the job ID `f77d5dc7-a2e7-4a2a-827d-49a2131b1ffe`.
We will use the job ID f77d5dc7-a2e7-4a2a-827d-49a2131b1ffe.
status: running
You can get the logs of your currently running job using `apheris job logs`:
$ apheris job logs
On 2024-03-28 12:17:06 you have used the `compute_spec_id` defe5013-2c73-4eb9-be52-1ae7aed841ff.
We will use following `compute_spec_id`: defe5013-2c73-4eb9-be52-1ae7aed841ff
On 2024-03-28 12:47:23 you have used the job ID `f77d5dc7-a2e7-4a2a-827d-49a2131b1ffe`.
We will use the job ID f77d5dc7-a2e7-4a2a-827d-49a2131b1ffe.
2024-03-28 12:47:26,266 - runner_process - INFO - Runner_process started.
2024-03-28 12:47:26,294 - CoreCell - INFO - server.f77d5dc7-a2e7-4a2a-827d-49a2131b1ffe: created backbone internal connector to tcp://localhost:42225 on parent
2024-03-28 12:47:26,294 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE tcp://localhost:42225] is starting
2024-03-28 12:47:26,295 - Cell - INFO - Register blob CB for channel='server_command', topic='*'
2024-03-28 12:47:26,296 - Cell - INFO - Register blob CB for channel='aux_communication', topic='*'
2024-03-28 12:47:26,296 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connection [CN00002 127.0.0.1:54850 => 127.0.0.1:42225] is created: PID: 34
2024-03-28 12:47:26,296 - ServerCommandAgent - INFO - ServerCommandAgent cell register_request_cb: server.f77d5dc7-a2e7-4a2a-827d-49a2131b1ffe
2024-03-28 12:47:26,345 - ServerRunner - INFO - [identity=Unnamed_project_2ba9e3c1-356e-43b4-827b-cf99c6ffe28d, run=f77d5dc7-a2e7-4a2a-827d-49a2131b1ffe]: Server runner starting ...
2024-03-28 12:47:26,345 - ServerRunner - INFO - [identity=Unnamed_project_2ba9e3c1-356e-43b4-827b-cf99c6ffe28d, run=f77d5dc7-a2e7-4a2a-827d-49a2131b1ffe]: starting workflow fingerprint (<class 'server.fingerprint_workflow.FingerprintWorkflow'>) ...
2024-03-28 12:47:26,346 - ServerRunner - INFO - [identity=Unnamed_project_2ba9e3c1-356e-43b4-827b-cf99c6ffe28d, run=f77d5dc7-a2e7-4a2a-827d-49a2131b1ffe, wf=fingerprint]: Workflow fingerprint (<class 'server.fingerprint_workflow.FingerprintWorkflow'>) started
Important
`apheris job logs` is only applicable for jobs that are currently in the `running` state, as it streams the logs directly from the computation container.
To download logs for a job that has finished (whether it completed successfully, errored or was aborted), please use `apheris job download-results`.
If needed, you can abort the job with `apheris job abort`.
You can also list all jobs.
$ apheris job list
+--------------------------------------+------------+--------------------+---------------------+
| id | duration | status | created_at |
+--------------------------------------+------------+--------------------+---------------------+
| 00ab7cac-5901-4cd1-b44c-c5952d970ab4 | 10.618251s | finished.completed | 2024-08-19 13:49:20 |
| 349c8025-fbf1-4a71-b5c5-f4f402ad9105 | 10.377161s | finished.completed | 2024-08-19 13:48:38 |
+--------------------------------------+------------+--------------------+---------------------+
When a job completes, it will report its status as `finished.completed`. The statuses you might encounter are:
- `submitted`: The job has been sent to the system and it is queued for running. If you see a job sitting in this state for prolonged periods, it might indicate an issue with the activated infrastructure.
- `running`: The job has been started and is currently computing.
- `finished.completed`: The job has successfully finished.
- `finished.aborted`: The job was aborted (using `apheris job abort`) while running.
- `finished.execution_exception`: The job errored. To understand why, investigate the logs. As `apheris job logs` only streams logs for running jobs, download the logs of a finished job with `apheris job download-results`.
You can poll for the job status until it shows the job is `finished`, then download the result. This should take roughly 10 minutes for a single federation round of training (1 epoch over the Hippocampus dataset):
$ apheris job status
On 2024-03-28 12:17:06 you have used the `compute_spec_id` defe5013-2c73-4eb9-be52-1ae7aed841ff.
We will use following `compute_spec_id`: defe5013-2c73-4eb9-be52-1ae7aed841ff
On 2024-03-28 12:47:23 you have used the job ID `f77d5dc7-a2e7-4a2a-827d-49a2131b1ffe`.
We will use the job ID f77d5dc7-a2e7-4a2a-827d-49a2131b1ffe.
status: finished.completed
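In a script, the decision whether to keep polling or to download results can be derived from the `status:` line. The following Python sketch relies on the status names listed above, where every terminal state starts with `finished.`. The helper names are illustrative.

```python
import re

JOB_STATUS_RE = re.compile(r"^status:\s*(\S+)", re.MULTILINE)


def parse_job_status(output):
    """Extract the value of the 'status:' line from `apheris job status` output."""
    match = JOB_STATUS_RE.search(output)
    return match.group(1) if match else None


def is_finished(status):
    """True for finished.completed, finished.aborted and finished.execution_exception."""
    return status is not None and status.startswith("finished.")


def succeeded(status):
    """True only if the job completed without being aborted or failing."""
    return status == "finished.completed"


# Last lines of the example output above.
sample = """We will use the job ID f77d5dc7-a2e7-4a2a-827d-49a2131b1ffe.
status: finished.completed"""

status = parse_job_status(sample)
print(status, is_finished(status), succeeded(status))
```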
Now that your job is finished, you can download the results to your machine.
The result is the workspace of the NVFlare server. This contains:
- The logs from the server
- Any files that were stored in the workspace as part of the task. In this case, the aggregated final checkpoint
- The job configurations that were created by the secure_runtime and define the workflow.
- Some metadata about the run (statistics round latency, etc.).
By default, the workspace is downloaded to `<current_directory>/job_results`, but you can change this behavior by providing the path as the next argument:
$ apheris job download-results /path/to/store/results
On 2024-03-28 12:17:06 you have used the `compute_spec_id` defe5013-2c73-4eb9-be52-1ae7aed841ff.
We will use following `compute_spec_id`: defe5013-2c73-4eb9-be52-1ae7aed841ff
On 2024-03-28 12:47:23 you have used the job ID `f77d5dc7-a2e7-4a2a-827d-49a2131b1ffe`.
We will use the job ID f77d5dc7-a2e7-4a2a-827d-49a2131b1ffe.
Successfully downloaded job outputs to /path/to/store/results
Important
The storage attached to a Compute Spec is not persistent. This means that when you deactivate a Compute Spec, any results stored on the compute pod are deleted along with the deployment.
Please make sure to download any results or logs you might need before deactivation.
You can take a look at the contents here:
$ tree /path/to/store/results/workspace
/path/to/store/results/workspace
├── app_server
│   ├── Dataset004_Dataset004_Hippocampus_B_Dataset004_Hippocampus_A
│   │   └── nnUNetTrainer__nnUNetPlans__2d
│   │       ├── dataset.json
│   │       ├── dataset_fingerprint.json
│   │       ├── fold_all
│   │       │   └── checkpoint_final.pth
│   │       └── plans.json
│   ├── FL_global_model.pt
│   └── config
│       ├── config_fed_client.json
│       └── config_fed_server.json
├── fl_app.txt
├── log.txt
├── meta.json
└── stats_pool_summary.json
3 directories, 7 files
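To locate the model files in a downloaded workspace without walking the tree by hand, a short search with `pathlib` can help. This sketch assumes that checkpoints use the `.pth` and `.pt` extensions seen in the listing above. Other models may store their results differently, and the function name is illustrative.

```python
from pathlib import Path


def find_checkpoints(workspace, suffixes=(".pth", ".pt")):
    """Return the checkpoint files below `workspace` as sorted relative paths."""
    root = Path(workspace)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.suffix in suffixes
    )
```

For the workspace shown above, `find_checkpoints("/path/to/store/results/workspace")` would be expected to list `app_server/FL_global_model.pt` and the `checkpoint_final.pth` file under `fold_all`.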
Clean up🔗
Now that your job is completed, it is important to deactivate the Compute Spec to shut down the attached infrastructure:
$ apheris compute deactivate --force
On 2024-03-28 12:17:06 you have used the `compute_spec_id` defe5013-2c73-4eb9-be52-1ae7aed841ff.
We will use following `compute_spec_id`: defe5013-2c73-4eb9-be52-1ae7aed841ff
Successfully shutdown the deployment of the Compute Specification defe5013-2c73-4eb9-be52-1ae7aed841ff!