Deploying using the Installer
This how-to outlines the setup of an Apheris Compute Gateway using the deploy_gateway installer script.
It is aimed at operators in charge of setting up a Compute Gateway. The deploy_gateway script is suited for both virtual machine and bare-metal deployments. Please make sure to read through the prerequisites before beginning a setup.
Prerequisites
We recommend deploying the single-instance Compute Gateway on a virtual machine (VM). The VM can run on a dedicated on-prem server or in the cloud (e.g., EC2 on AWS, VM on Azure, GCE on GCP).
All the data files you wish to register with your Gateway must be accessible from the system hosting your Compute Gateway.
Supported Operating Systems
The Apheris Compute Gateway deploy_gateway installer can be executed on various Linux distributions. Apheris currently supports:
OS | minimum | recommended |
---|---|---|
Red Hat | 8 | 9 |
Ubuntu | 20.04 | 22.04 |
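The minimum versions in the table above can be checked before running the installer. The following is a minimal sketch, not part of deploy_gateway: the helper names version_ge and is_supported_os are hypothetical, and the sketch assumes GNU sort (for the -V option) and that the distribution ID matches the ID field of /etc/os-release (rhel for Red Hat, ubuntu for Ubuntu).

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with deploy_gateway).

# Returns 0 if version $1 is greater than or equal to version $2.
# Relies on GNU sort -V for version-aware ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage: is_supported_os <ID> <VERSION_ID>
# Compares against the minimum versions listed in the table above.
is_supported_os() {
  case "$1" in
    rhel)   version_ge "$2" "8" ;;
    ubuntu) version_ge "$2" "20.04" ;;
    *)      return 1 ;;
  esac
}
```

On a target host, the values could be read with `. /etc/os-release` and passed as `is_supported_os "$ID" "$VERSION_ID"`.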
VM Specification
The VM resource specifications vary based on the type of workloads you plan to run on the system and the number of jobs you intend to run simultaneously. A recommended starting point is to allocate the following resources:
Resource | recommended |
---|---|
CPU | 8 |
RAM | 32 GB |
Disk | 200 GB |
Take into account the CPU and memory requirements specified in the container's resource request for each workload, as well as the GPU requirements if the job demands it.
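As a rough illustration, the recommended starting point can be compared with the resources of a host. This is a sketch only: meets_recommended is a hypothetical helper, it takes whole numbers as arguments, and the thresholds are the recommended values from the table above rather than hard minimums.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with deploy_gateway).
# Usage: meets_recommended <cpus> <ram_gb> <disk_gb>
# Prints one line per resource below the recommended starting point
# and returns 1 if any resource falls short.
meets_recommended() {
  local cpus="$1" ram_gb="$2" disk_gb="$3" status=0
  [ "$cpus" -ge 8 ]      || { echo "CPU: $cpus < 8"; status=1; }
  [ "$ram_gb" -ge 32 ]   || { echo "RAM: $ram_gb GB < 32 GB"; status=1; }
  [ "$disk_gb" -ge 200 ] || { echo "Disk: $disk_gb GB < 200 GB"; status=1; }
  return "$status"
}
```

The arguments could be taken from tools such as nproc, free, and df. Memory reported by the operating system is usually slightly below the nominal VM size, so a near miss on RAM is best read as informational.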
Throughput considerations
Here are some guidelines to help make sure the instance is sized appropriately:
- If the computations require GPU usage, specific support from Apheris may be required. We currently support the setup of the NVIDIA drivers and plugins for Ubuntu distributions. Please contact your Apheris representative for more information.
- During operations, the size required for each computation can be monitored in the agent logs or in the computation pods themselves. Some helpful commands to assess these are:
  kubectl logs -n apheris deploy/gateway-agent-agent | jq 'select(.request and .request.id and .request.resources) | {request_id: .request.id, resources: .request.resources}'
  - these logs are in JSON and show computation requests, including the fields cpu and memory for resources;
  kubectl get po -n apheris -l apheris_job_id -o jsonpath='{.items[*].spec.containers[*].resources}'
  - this shows the resources requested by each running computation pod.
The frequency and size of these requests should be weighed against cluster capacity. The cluster capacity can be monitored with kubectl describe node. The output shows how much of the node's allocatable resources is occupied, in terms of CPU and memory.
If the cluster is at capacity, new computations are scheduled but remain pending until resources become available. If this keeps happening, consider increasing the VM size if possible, or check whether old workloads are still running and should have been deactivated. Increase the size in increments that match the memory requests. For example, if 7 GB is available and each incoming computation requests 4 GB, only one computation fits; increasing the VM size by 1 GB (to 8 GB) allows one more computation to run concurrently.
CPU requests are less likely to break a computation, since computations scale to the available CPU. However, if the CPU requests are too low, computation times might become impractical.
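The memory arithmetic above can be written down as a small helper. This is an illustrative sketch: extra_memory_gb is a hypothetical function, it works on whole gigabytes, and it ignores memory used by system components on the node.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with deploy_gateway).
# Usage: extra_memory_gb <available_gb> <request_gb> <concurrent_jobs>
# Prints how many additional GB are needed so that <concurrent_jobs>
# computations, each requesting <request_gb>, fit at the same time.
extra_memory_gb() {
  local available_gb="$1" request_gb="$2" jobs="$3" needed
  needed=$(( request_gb * jobs - available_gb ))
  # No increase is needed if the requests already fit.
  [ "$needed" -gt 0 ] || needed=0
  echo "$needed"
}
```

With the numbers from the example, `extra_memory_gb 7 4 2` prints 1: two computations need 8 GB in total, which is 1 GB more than the 7 GB available.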
Software
To run the deploy_gateway script you will need bash and curl installed.
You can install both with the system default package management tool:
- Red Hat
yum install bash curl
- Ubuntu
apt install bash curl
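To confirm that the required tools are present before starting, a check along the following lines can be used. It is a minimal sketch; missing_tools is a hypothetical helper and not part of the installer.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with deploy_gateway).
# Usage: missing_tools <tool>...
# Prints each tool that is not found on PATH and returns 1 if any is missing.
missing_tools() {
  local tool status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}
```

For the prerequisites above, the call would be `missing_tools bash curl`; no output and a zero exit status mean both tools are available.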
Networking
During setup and upgrades, the Gateway accesses endpoints to download additional components.
During normal runtime, the Gateway communicates with:
- Apheris-managed services
- Auth0 for user authentication
- External Docker registries to pull Apheris / Customer custom model images
The exact list of endpoints that must be accessible from the Gateway depends on your platform setup. The endpoints for the default options are listed under endpoints.
Please don't hesitate to contact your Apheris representative if you need more details or the exact endpoints that must be accessible for your platform setup.
Hardening notes
During setup and upgrades, the deploy_gateway script modifies some OS kernel settings to meet a small set of required CIS controls.
Additionally, it configures Pod Security Admission to enforce the restricted policy in the apheris namespace.
Firewall
During setup and upgrades, the deploy_gateway script modifies the firewall rules if
- Red Hat: firewalld is active
- Ubuntu: ufw is active
Please see the official K3s documentation for more details.
If you use a different firewall utility, for example iptables, please make sure that traffic is allowed to port 6443 and that traffic can be routed from the internal Kubernetes service kubernetes.default.svc.cluster.local:443 to the control plane port localhost:6443, according to Inbound Rules for K3s Server Nodes.
Firewall troubleshooting
For troubleshooting purposes, you may want to temporarily disable your firewall by performing the following steps:
- Back up your existing iptables rules:
  sudo iptables-save > "$HOME/firewall_rules.backup"
- Flush the existing iptables rules:
  sudo iptables -P INPUT ACCEPT
  sudo iptables -P OUTPUT ACCEPT
  sudo iptables -P FORWARD ACCEPT
  sudo iptables -F
  sudo iptables -X
- Troubleshoot.
- Restore the iptables rules:
  sudo iptables-restore < "$HOME/firewall_rules.backup"
Please don't hesitate to reach out to Apheris support if you need assistance.
HTTP proxy
If your network requires a proxy to access the internet, you will need to configure the proxy settings for the Gateway. Please refer to the Helm chart's README, specifically the Proxies section, for guidance on setting the appropriate values.
Additional details about configuring K3s to work behind an HTTP proxy can be found in the K3s documentation.
Deliverables
Your Apheris representative provides you with the following files:
- deploy_gateway: The script that deploys the Compute Gateway. This should not be modified!
- values.yaml: Configuration specific to your Compute Gateway.
Apheris shares the Gateway-specific values.yaml securely via a password-protected Bitwarden Send link.
Please note that the Bitwarden link has an expiration time and is accessible exactly once, and that the password is always shared via a different channel than the link itself.
Configuration
You can customize the configuration in values.yaml to fit your specific setup. The following subsections explain the available options. If you change values in the values.yaml, you need to run the deploy_gateway script again (cf. Setup) to apply the new values.
Example
A complete example values.yaml file looks roughly like:
cilium:
  egressToHost: true
tenant: APHERIS_PROVIDED_TENANT_ID
gpu:
  enabled: false
auth:
  domain: auth.app.apheris.net
  orchestrator:
    clientId: "REDACTED"
    clientSecret: "REDACTED"
dal:
  sources:
    files:
      folder: "/home/datastore"
helmRepoUsername: "REDACTED"
helmRepoPassword: "REDACTED"
The REDACTED values as well as the tenant value will be provided by Apheris.
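Before running the installer, it can help to confirm that no placeholder is left in the file. The sketch below assumes that the placeholders in your values.yaml are spelled as in the example above (REDACTED and APHERIS_PROVIDED_TENANT_ID); the file you receive may use different markers. has_placeholders is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with deploy_gateway).
# Usage: has_placeholders <values-file>
# Prints the numbered lines that still contain a placeholder and returns 0
# if at least one is found; returns non-zero if the file looks filled in.
# Assumption: placeholders are spelled as in the example values.yaml.
has_placeholders() {
  grep -nE 'REDACTED|APHERIS_PROVIDED_TENANT_ID' "$1"
}
```

A typical use would be `if has_placeholders values.yaml; then echo "values.yaml still contains placeholders"; fi`.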
Specifying where the real data resides
Your real data needs to be in a specific location that is accessible to the gateway. You can specify the path in the dal.sources value in the values.yaml file. We explain two scenarios here: real data residing on a local filesystem, and real data residing in an external S3 bucket.
Local filesystem
The Apheris Data Access Layer (DAL) can access any folder that is locally accessible as a source for datasets. This enables scenarios where the datasets reside on a NAS or SAN and are attached to the host that the Apheris Data Access Layer runs on.
To use a folder as dataset source, it needs to be configured in the values.yaml via the dal.sources.files.folder value. Note that dal.sources.files.folder specifies the root folder, which can contain many datasets.
An example values.yaml snippet looks like:
dal:
  sources:
    files:
      folder: /ABSOLUTE/PATH/TO/THE/DATASETS/ROOT
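A quick sanity check of the configured root folder might look like the following sketch. check_dataset_root is a hypothetical helper; it only verifies that the path is absolute, exists as a directory, and is readable by the current user, which does not guarantee that the DAL itself can read it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with deploy_gateway).
# Usage: check_dataset_root <path>
# Returns 0 if <path> is absolute, is an existing directory, and is readable
# by the current user; otherwise prints the reason and returns 1.
check_dataset_root() {
  case "$1" in
    /*) ;;
    *)  echo "not an absolute path: $1"; return 1 ;;
  esac
  [ -d "$1" ] || { echo "not a directory: $1"; return 1; }
  [ -r "$1" ] || { echo "not readable: $1"; return 1; }
}
```

For the default location used elsewhere in this guide, the call would be `check_dataset_root /home/datastore`.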
External S3 bucket
The Apheris DAL can access any bucket hosted on an S3 API compatible solution.
To use an S3 bucket as dataset source, it needs to be configured in the values.yaml via the dal.sources.s3 value.
An example values.yaml snippet looks like:
dal:
  sources:
    s3:
      - EXAMPLE_BUCKET_NAME_1
      - EXAMPLE_BUCKET_NAME_2
      - ...
You need to ensure that the DAL has access to the respective S3 bucket. We provide an example specific to AWS S3 below. For S3-compatible solutions, please refer to the documentation of the S3 solution you are using (for instance AWS, MinIO, ...) for details on how to provide access.
Example for AWS: Giving DAL access to AWS S3
You will need to grant the Apheris Data Access Layer (DAL) access to the respective S3 bucket via the bucket's policy.
An example bucket policy for AWS S3, assuming you are using IAM roles for service accounts, looks like:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::ACCOUNT_ID:role/EXAMPLE_DAL_IAM_ROLE_NAME"
        ]
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::EXAMPLE_BUCKET_NAME",
        "arn:aws:s3:::EXAMPLE_BUCKET_NAME/*"
      ]
    }
  ]
}
Optional: Disabling Cilium
For some specific networking use cases you may want to disable Cilium, which can be achieved by adding the following lines to the values.yaml of your Compute Gateway:
cilium:
  enabled: false
You can enable Cilium again by either removing those values from values.yaml or changing false to true, and rerunning the deploy_gateway script.
Note: Under some circumstances, Cilium might not be able to identify the K3s Kubernetes control plane as the kube-apiserver entity. This issue prevents the Compute Gateway agent from communicating with the Kubernetes API. As a workaround, you can enable communication to the host port where K3s binds the control plane (6443). This can be achieved by setting the following value:
cilium:
  egressToHost: true
Optional: Enabling NVIDIA GPU
For systems equipped with an NVIDIA GPU card, the deploy_gateway script can install the required NVIDIA driver and Kubernetes plugin if you enable this option in the values.yaml:
gpu:
  enabled: true
Optional: Enabling DAL storage
The DAL can be enabled to store intermediate data written by computations. This data does not leave the gateway.
You can specify a local storage path for the DAL to use, following the example below:
dal:
  persistence:
    enabled: true
    local:
      hostPath: "/path/to/local/storage"
Optional: Enable Asset Policy Signature Validation
Please refer to the guide.
Setup
The setup consists of running the deploy_gateway script on a target system, pointing it to the correct values.yaml. You need to perform the setup for each Compute Gateway separately, specifying the correct Gateway-specific values in the YAML file.
To do so, you first need to transfer the deliverables .zip to the target system. We recommend moving it to a dedicated directory.
Execute the following inside the directory that contains the .zip archive:
- Unpack the files from the .zip archive manually or using the unzip utility. An example command would be:
  unzip apheris-gateway-<VERSION>.zip -d apheris-gateway-<VERSION>
  where you should replace <VERSION> with the shipped version of the .zip.
- In the values.yaml, replace clientId and clientSecret (in the auth.orchestrator section), as well as helmRepoUsername and helmRepoPassword, with the actual secrets from the Bitwarden link.
- Check that the file deploy_gateway is executable. If not, please make it executable by running chmod +x deploy_gateway.
- Execute the deploy_gateway script as root:
  ./deploy_gateway
  The deploy_gateway script defaults to using the values.yaml file in the directory it is executed in. You can use the --helm-values-file-path parameter to point to an alternative values file.
  In addition to deploying the gateway, the script downloads public datasets from Apheris to the local filesystem, either to /home/datastore or to the folder path specified in the values.yaml. These datasets are used to run smoke tests to verify that the gateway works properly.
- Validate the deployed versions. The following message should be displayed in the console:
  Deployed versions:
  ------------------
  agent: X.Y.Z
  ------------------
- Validate that the health checks completed. The following message should be displayed in the console if the checks have been successful:
  All services are ready.
After these steps are completed, your gateway deployment is complete. You can then proceed to register the datasets in the Governance Portal, as described in the managing datasets page. Please also contact your Apheris representative, so that the Apheris team can help with running smoke tests on the freshly deployed Apheris Compute Gateway to verify it is properly set up.
Upgrade
To perform an upgrade, please run the newer version of the installer on the same host.
Please only upgrade from one release to the next and do not skip several minor or major versions.
Please don't hesitate to contact your Apheris representative if you need more details.
Uninstall
Execute /usr/bin/k3s-uninstall.sh to remove K3s from the host.
In case you set up the Compute Gateway with Cilium (the default), execute the following to remove Cilium-related modifications:
ip link delete cilium_host
ip link delete cilium_net 2>/dev/null || true
ip link delete cilium_vxlan
if [[ -f "/etc/redhat-release" || -f "/etc/centos-release" ]]; then
  yum install -y iptables >/dev/null
else
  apt-get install -y iptables >/dev/null
fi
iptables-save | grep -iv cilium | iptables-restore
ip6tables-save | grep -iv cilium | ip6tables-restore
Operating
The following is a loose collection of operations that may help you when operating the gateway.
Versions of Apheris components
The following command prints the version of the Gateway agent used on your Gateway:
kubectl get deploy -n apheris -l apheris-component=agent -o go-template='{{if and (index .items 0).spec.template.spec.containers (index (index .items 0).spec.template.spec.containers 0).image }}{{ (index (index .items 0).spec.template.spec.containers 0).image }}{{ "\n" }}{{end}}' 2>/dev/null | sed 's/.*:/agent:\t/';
Example output looks like this:
agent: X1.Y1.Z1
Tail Gateway agent logs
To view the Gateway agent logs, run the following command:
kubectl logs --namespace=apheris --selector=apheris-component=agent --follow
The Gateway agent logs are in JSON format (jsonlines, to be precise).
We recommend piping the output of the above command into jq to get pretty-printed output that is easier to work with.
Keep in mind that the Gateway agent logs are rotated by the Kubernetes node, so you might not see the full history if your agent has been running for a while.
What else is running on my Gateway?
To check what is running on the Kubernetes cluster that hosts the Compute Gateway, run the following command:
kubectl get pod --all-namespaces
On an idle Gateway, the output should look similar to:
NAMESPACE NAME READY STATUS RESTARTS AGE
cilium cilium-operator-76c55fc6b6-hlfr5 1/1 Running 0 9m17s
cilium cilium-wzl8x 1/1 Running 0 9m17s
kube-system local-path-provisioner-957fdf8bc-kzgz6 1/1 Running 0 9m17s
kube-system coredns-77ccd57875-vzghv 1/1 Running 0 9m17s
apheris apheris-gateway-agent-76b585556f-mfhx6 1/1 Running 0 7m48s
Please note that Cilium may be disabled on your Compute Gateway, depending on the specifics of your networking setup and your agreement with Apheris (see Disabling Cilium for more details).
What about those computations?
To list all the pods running in the apheris namespace for a particular APHERIS_JOB_ID, execute the following command:
kubectl get pod --namespace=apheris -l 'apheris_job_id=<APHERIS_JOB_ID>'
An example output might look similar to:
NAME READY STATUS RESTARTS AGE
c1239352-6e3f-4254-947c-53121439aaf7-774b5b98cb-m6zwz 2/2 Running 0 139m
...
Troubleshooting
If you struggle to find the information you need or are facing any other issues, please reach out to Apheris support via your dedicated support channel or by emailing support@apheris.com.