Deploying using the Installer

This how-to outlines the setup of an Apheris Compute Gateway using the deploy_gateway installer script.

It is aimed at operators in charge of setting up a Compute Gateway. The deploy_gateway script is suited for both Virtual Machine and bare-metal deployments. Please make sure to read through the prerequisites before beginning a setup.

Prerequisites

We recommend deploying the single-instance Compute Gateway on a virtual machine (VM). The VM can run on a dedicated on-prem server or in the cloud (e.g., EC2 on AWS, VM on Azure, GCE on GCP).

All the data files you wish to register with your Gateway must be accessible from the system hosting your Compute Gateway.

Supported Operating Systems

The Apheris Compute Gateway deploy_gateway installer can be executed on various Linux distributions. Apheris currently supports:

OS	minimum	recommended
Red Hat	8	9
Ubuntu	20.04	22.04
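
Before running the installer, it can help to confirm that the host matches the table above. The following is a minimal sketch and not part of the Apheris tooling: the function names are hypothetical, and it assumes that /etc/os-release reports ID=rhel on Red Hat and ID=ubuntu on Ubuntu, and that GNU sort with version sorting (-V) is available.

```shell
# Hypothetical pre-flight helper; not shipped with deploy_gateway.

# version_ge A B: succeeds when version A is greater than or equal to B.
# Relies on the version sort (-V) of GNU sort.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# is_supported_os ID VERSION_ID: compares against the minimum versions
# listed in the table above (Red Hat 8, Ubuntu 20.04).
is_supported_os() {
  case "$1" in
    rhel)   version_ge "$2" "8" ;;
    ubuntu) version_ge "$2" "20.04" ;;
    *)      return 1 ;;
  esac
}

# Check the current host, if /etc/os-release is present.
if [ -r /etc/os-release ]; then
  (
    . /etc/os-release
    if is_supported_os "${ID:-unknown}" "${VERSION_ID:-0}"; then
      echo "OS ${ID:-unknown} ${VERSION_ID:-} meets the documented minimum."
    else
      echo "OS ${ID:-unknown} ${VERSION_ID:-} is not in the supported list."
    fi
  )
fi
```

Derivatives that report a different ID are treated as unsupported by this sketch, since the table only lists Red Hat and Ubuntu.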

VM Specification

The VM resource specifications vary based on the type of workloads you plan to run on the system and the number of jobs you intend to run simultaneously. A recommended starting point is to allocate the following resources:

Resource	recommended
CPU	8
RAM	32 GB
Disk	200 GB

You will need to take into account the specific resource requirements of the computations you plan to execute on that Apheris Compute Gateway. This includes CPU and memory requirements and might include GPU requirements.
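
To compare a host against the recommended starting point, a small report like the one below may be enough. This is a sketch under stated assumptions: the function names are hypothetical, it assumes a Linux host with nproc, /proc/meminfo and a POSIX df, and it only prints values without deciding whether they are sufficient for your workloads.

```shell
# Hypothetical helper; prints host resources next to the recommended values.

# kib_to_gib KIB: converts KiB to GiB, rounded to the nearest integer.
kib_to_gib() {
  echo $(( ($1 + 524288) / 1048576 ))
}

# report_resources [PATH]: PATH selects the filesystem to report (default /).
report_resources() {
  cpus="$(nproc)"
  # MemTotal is usually slightly below the installed RAM.
  mem_kib="$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
  # Total size (in KiB) of the filesystem that holds the given path.
  disk_kib="$(df -Pk "${1:-/}" | awk 'NR == 2 {print $2}')"
  echo "CPU:  ${cpus} (recommended: 8)"
  echo "RAM:  $(kib_to_gib "$mem_kib") GiB (recommended: 32 GB)"
  echo "Disk: $(kib_to_gib "$disk_kib") GiB on ${1:-/} (recommended: 200 GB)"
}
```

Calling report_resources / prints the three values for the root filesystem. Pass a different path if the Gateway data will live on another volume.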

The Apheris Compute Gateway Installer currently supports the setup of NVIDIA GPU drivers on Ubuntu distributions only. Please contact your Apheris representative if you need additional assistance or plan to use GPUs on RHEL-based distributions.

Software

To run the deploy_gateway script you will need bash and curl installed.

You can install both with the system default package management tool:

Red Hat
```
yum install bash curl
```

Ubuntu
```
apt install bash curl
```
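
To check up front whether both tools are already present, a sketch like the following can be used. The function name missing_tools is hypothetical and not part of deploy_gateway.

```shell
# Hypothetical pre-flight helper: prints each tool that is not on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

Running missing_tools bash curl prints nothing when both tools are installed and otherwise prints the names that still need to be installed.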

Networking

During setup and upgrades, the Gateway accesses endpoints to download additional components.

During normal runtime, the Gateway communicates with:

Apheris-managed services
Auth0 for user authentication
External Docker registries to pull Apheris and customer-provided custom model images

The exact list of endpoints that must be accessible from the Gateway depends on your platform setup. The list for the default options is available at endpoints.

Please don’t hesitate to contact your Apheris representative if you need more details and to receive the exact endpoints that are required to be accessible for your platform setup.
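
Once you have the endpoint list for your setup, reachability from the Gateway host can be checked with curl. The sketch below makes several assumptions: the function names are hypothetical, the endpoints are HTTP(S) URLs, and you have written them one per line into a file of your own (no real endpoint is listed here on purpose).

```shell
# Hypothetical connectivity check; the endpoint list itself must come from
# Apheris, so this sketch reads URLs from a file that you create yourself.

# check_endpoint URL: succeeds if an HTTP(S) connection can be established.
# Any HTTP status counts as reachable; only connection-level failures
# (DNS, TCP, TLS, timeout) make curl exit with a non-zero status here.
check_endpoint() {
  curl --silent --head --max-time 10 --output /dev/null "$1"
}

# check_endpoints_file FILE: checks every non-empty, non-comment line.
check_endpoints_file() {
  failed=0
  while IFS= read -r url; do
    case "$url" in ''|'#'*) continue ;; esac
    if check_endpoint "$url"; then
      echo "ok      $url"
    else
      echo "FAILED  $url"
      failed=1
    fi
  done < "$1"
  return "$failed"
}
```

For example, check_endpoints_file endpoints.txt prints one line per URL and returns a non-zero status if any of them could not be reached. If the host uses an HTTP proxy, curl picks it up from the usual proxy environment variables.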

Hardening notes

During setup and upgrades, the deploy_gateway script modifies some OS kernel settings to meet a small set of required CIS controls.

Additionally, it configures the Pod Security Admission to enforce the restricted policy in the apheris namespace.

Firewall

During setup and upgrades, the deploy_gateway script modifies the firewall rules if one of the following firewalls is active:

Red Hat: firewalld
Ubuntu: ufw

Please see the official K3s documentation for more details.

If you use any other firewall utility, for example iptables, please make sure that traffic to port 6443 is allowed and that traffic can be routed from the internal Kubernetes service kubernetes.default.svc.cluster.local:443 to the control plane port localhost:6443, according to Inbound Rules for K3s Server Nodes.
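
To find out which of these cases applies to a host, a check along the following lines may help. It is a sketch with a hypothetical function name; it assumes systemd manages firewalld and that the command is run as root, because ufw status needs root privileges.

```shell
# Hypothetical helper: reports which supported firewall front end is active.
# Run as root; `ufw status` needs root privileges to report its state.
detect_firewall() {
  if command -v systemctl >/dev/null 2>&1 \
      && systemctl is-active --quiet firewalld 2>/dev/null; then
    echo "firewalld"
  elif command -v ufw >/dev/null 2>&1 \
      && ufw status 2>/dev/null | grep -q '^Status: active'; then
    echo "ufw"
  else
    echo "none"
  fi
}
```

An output of none only means that neither firewalld nor ufw is active. Rules managed by other tools still need to be reviewed manually for port 6443.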

HTTP proxy

If your network requires a proxy to access the internet, you will need to configure the proxy settings for the Gateway. Please refer to the Helm chart's README, specifically the Proxies section, for guidance on setting the appropriate values.

You can find additional details about configuring K3s to work behind an HTTP proxy in the K3s documentation.

Deliverables

Your Apheris representative provides you with the following files:

deploy_gateway: The script that deploys the Compute Gateway. This should not be modified!
values.yaml: Configuration specific to your Compute Gateway.

Apheris shares the Gateway-specific values.yaml securely via a password-protected Bitwarden Send link. Please note that the Bitwarden link expires, can be accessed exactly once, and that the password is always shared via a different channel than the link itself.

Configuration

You can customize the configuration in values.yaml to fit your specific setup. The following subsections explain the available options. If you change values in values.yaml, you need to run the deploy_gateway script again (see Setup) to apply the new values.

Example

A complete example values.yaml file looks roughly like:

cilium:
  egressToHost: true

tenant: APHERIS_PROVIDED_TENANT_ID

job:
  gpu: false

auth:
  domain: auth.app.apheris.net
  orchestrator:
    clientId: "REDACTED"
    clientSecret: "REDACTED"

dal:
  sources:
    files:
      folder: "/home/datastore"

helmRepoUsername: "REDACTED"
helmRepoPassword: "REDACTED"

Apheris provides the values shown as REDACTED as well as the tenant value.
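
If you start from the example above rather than from the file shared by Apheris, it is easy to leave a placeholder in place. The following sketch (the function name is hypothetical) lists lines that still contain the placeholder strings used in this example.

```shell
# Hypothetical pre-flight helper: lists lines in a values file that still
# contain the placeholders used in the example above.
find_placeholders() {
  grep -n -E 'REDACTED|APHERIS_PROVIDED_TENANT_ID' "$1"
}
```

find_placeholders values.yaml prints the matching lines with their line numbers and returns a non-zero status when no placeholder is left.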

Specifying where the real data resides

Your real data needs to be in a location that is accessible to the Gateway. You can specify the location in the dal.sources value in the values.yaml file. We explain two scenarios here: real data residing on a local filesystem, and real data residing in an external S3 bucket.

Local filesystem

The Apheris Data Access Layer (DAL) can access any folder that is locally accessible as a source for datasets. This enables scenarios where the datasets reside on a NAS or SAN and are attached to the host that the Apheris Data Access Layer runs on.

To use a folder as dataset source, it needs to be configured in the values.yaml via the dal.sources.files.folder value. Note that dal.sources.files.folder specifies the root folder, which can contain many datasets.

An example values.yaml snippet looks like:

dal:
  sources:
    files:
      folder: /ABSOLUTE/PATH/TO/THE/DATASETS/ROOT
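
Before setting the value, you may want to confirm that the folder exists on the host. The sketch below uses a hypothetical function name and checks access as the current user only. Whether the DAL itself can read the folder depends on how the Gateway mounts it, which this check does not cover.

```shell
# Hypothetical pre-flight helper for dal.sources.files.folder.
# Succeeds only for an absolute path to an existing, readable directory.
check_dataset_root() {
  case "$1" in
    /*) ;;
    *) echo "not an absolute path: $1" >&2; return 1 ;;
  esac
  [ -d "$1" ] || { echo "not a directory: $1" >&2; return 1; }
  [ -r "$1" ] && [ -x "$1" ] || { echo "not readable: $1" >&2; return 1; }
}
```

For example, check_dataset_root /home/datastore returns success silently and prints a short reason to stderr otherwise.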

External S3 bucket

The Apheris DAL can access any bucket hosted on an S3 API compatible solution.

To use an S3 bucket as dataset source, it needs to be configured in the values.yaml via the dal.sources.s3 value.

An example values.yaml snippet looks like:

dal:
  sources:
    s3:
      - EXAMPLE_BUCKET_NAME_1
      - EXAMPLE_BUCKET_NAME_2
      - ...

You need to ensure that the DAL has access to the respective S3 bucket. We provide an AWS S3 example below. For other S3-compatible solutions (for instance MinIO), please refer to the documentation of the solution you are using for details on how to provide access.

Example for AWS: Giving DAL access to AWS S3

You will need to grant the Apheris Data Access Layer (DAL) access to the respective S3 bucket via the bucket's IAM policy.

An example bucket policy for AWS S3, assuming you are using IAM roles for service accounts, looks like:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::ACCOUNT_ID:role/EXAMPLE_DAL_IAM_ROLE_NAME"
        ]
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::EXAMPLE_BUCKET_NAME",
        "arn:aws:s3:::EXAMPLE_BUCKET_NAME/*"
      ]
    }
  ]
}
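
To avoid editing the JSON by hand for each bucket, the policy above can be rendered from its three placeholders. This is a sketch with a hypothetical function name. It reproduces the example policy as is and does not add any statements.

```shell
# Hypothetical helper: renders the bucket policy shown above for concrete
# values. Arguments: AWS account ID, IAM role name, bucket name.
render_bucket_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": ["arn:aws:iam::$1:role/$2"]
      },
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::$3", "arn:aws:s3:::$3/*"]
    }
  ]
}
EOF
}
```

The rendered file can then be applied with the AWS CLI, for example by redirecting the output to policy.json and running aws s3api put-bucket-policy --bucket EXAMPLE_BUCKET_NAME --policy file://policy.json. Keep in mind that this command replaces any policy that is already attached to the bucket.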

Optional: Disabling Cilium

For some specific networking use cases you may want to disable Cilium, which can be achieved by adding the following lines in values.yaml of your Compute Gateway:

cilium:
  enabled: false

You can enable Cilium again by either removing those values from values.yaml or changing false to true, and then rerunning the deploy_gateway script.

Note: Under some circumstances, Cilium might not be able to identify the K3s Kubernetes control plane as the kube-apiserver entity. This issue prevents the Compute Gateway agent from communicating with the Kubernetes API. As a workaround, you can allow communication to the host port on which K3s binds the control plane (6443) by setting the following value:

cilium:
  egressToHost: true

Optional: Enabling NVIDIA GPU

For systems equipped with an NVIDIA GPU card, the deploy_gateway script can install the required NVIDIA driver and Kubernetes plugin if you enable this option in values.yaml:

job:
  gpu: true
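
Two quick checks can help around this option: whether the host exposes an NVIDIA device at all, and, after deployment, whether Kubernetes advertises a GPU. Both functions below are hypothetical sketches. The first assumes lspci (package pciutils) is installed. The second assumes that the installed plugin exposes the resource name nvidia.com/gpu, as the upstream NVIDIA device plugin does, which is not confirmed for this installer.

```shell
# Hypothetical helper: succeeds if `lspci` output on stdin lists an
# NVIDIA device.
has_nvidia_device() {
  grep -qi 'nvidia'
}

# Prints the GPU count each node advertises to Kubernetes. Assumes the
# plugin installed by deploy_gateway exposes the resource name
# nvidia.com/gpu, as the upstream NVIDIA device plugin does.
allocatable_gpus() {
  kubectl get nodes \
    --output=jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
}
```

For example, lspci | has_nvidia_device && echo "NVIDIA device found" can be run before enabling the option, and allocatable_gpus after the deployment has finished.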

Optional: Enabling DAL storage

The DAL can be enabled to store intermediate data written by computations. This data does not leave the gateway.

You can specify a local storage path for the DAL to use following the example below:

dal:
  persistence:
    enabled: true
    local:
      hostPath: "/path/to/local/storage"

Optional: Enable Asset Policy Signature Validation

Please refer to the guide.

Setup

The setup consists of running the deploy_gateway script on the target system with the correct values.yaml file.

Check that deploy_gateway is executable. If it is not, make it executable by running chmod +x deploy_gateway.
Execute deploy_gateway as root.
```
./deploy_gateway
```
deploy_gateway defaults to using the values.yaml file in the directory it is executed in. You can use the --helm-values-file-path parameter to point to an alternative values file.

In addition to deploying the gateway, the script will download public datasets from Apheris to the local filesystem. These datasets are used to run smoke tests to verify that the gateway works properly.

Validate the deployed versions. The following message should be displayed in the console:

Deployed versions:
------------------
agent chart: X.Y.Z
agent:       X.Y.Z
------------------

Validate the health checks completion. The following message should be displayed in the console if the checks have been successful:
```
All services are ready.
```
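
If you want to keep a record of the installer output and check for the success message afterwards, a wrapper like the following may be useful. It is a sketch: the function names and the log file name are hypothetical, and passing the value of --helm-values-file-path as a separate argument is an assumption about the flag's syntax.

```shell
# Hypothetical wrapper around the documented invocation; run as root.
# Usage: run_deploy /path/to/values.yaml [logfile]
run_deploy() {
  # The flag is documented above; passing its value as a separate
  # argument (rather than with "=") is an assumption.
  ./deploy_gateway --helm-values-file-path "$1" 2>&1 \
    | tee "${2:-deploy_gateway.log}"
}

# deployment_ready LOGFILE: succeeds if the success message was logged.
deployment_ready() {
  grep -qF 'All services are ready.' "$1"
}
```

Because the output is piped through tee, the exit status of run_deploy is that of tee, not of the installer, so use deployment_ready deploy_gateway.log to check the result. The log may contain sensitive values, so store it accordingly.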

Once these steps are done, your Gateway deployment is complete. You can then proceed to register the datasets in the Governance Portal, as described in the managing datasets page.

Please also contact your Apheris representative, so that the Apheris team can help with running smoke tests on the freshly deployed Apheris Compute Gateway to verify it is properly set up.

Upgrade

To perform an upgrade, run the newer version of the installer on the host again.

Please only perform upgrades from one release to the next and do not jump several minor or major versions.

Please don’t hesitate to contact your Apheris representative if you need more details.

Uninstall

Execute /usr/bin/k3s-uninstall.sh to remove K3s from the host.

In case you set up the Compute Gateway with Cilium (the default), execute the following to remove Cilium related modifications:

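# Remove the network interfaces created by Cilium.
ip link delete cilium_host
# cilium_net may already be gone at this point, so errors are ignored.
ip link delete cilium_net 2>/dev/null || true
ip link delete cilium_vxlan

# Make sure the iptables tooling is installed.
if [[ -f "/etc/redhat-release" || -f "/etc/centos-release" ]]; then
  yum install -y iptables >/dev/null
else
  apt-get install -y iptables >/dev/null
fi
# Drop every IPv4 and IPv6 rule that mentions Cilium and reload the rest.
iptables-save | grep -iv cilium | iptables-restore
ip6tables-save | grep -iv cilium | ip6tables-restore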
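
To confirm that the cleanup removed the Cilium interfaces, the output of ip -o link show can be filtered as sketched below. The function name is hypothetical, and the check assumes that all interfaces created by Cilium have names starting with cilium, as the three names above do.

```shell
# Hypothetical check: reads `ip -o link show` output on stdin and prints
# the names of any remaining interfaces whose name starts with "cilium".
leftover_cilium_links() {
  awk -F': ' '$2 ~ /^cilium/ {sub(/@.*/, "", $2); print $2}'
}
```

Running ip -o link show | leftover_cilium_links should print nothing after a successful cleanup. Similarly, iptables-save | grep -ci cilium reports how many rules still mention Cilium.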

Operating

The following is a loose collection of operations that may help you when operating the gateway.

Versions of Apheris components

The following command prints the version of the Gateway agent used on your Gateway:

kubectl get deploy -n apheris -l apheris-component=agent -o go-template='{{if and (index .items 0).spec.template.spec.containers (index (index .items 0).spec.template.spec.containers 0).image }}{{ (index (index .items 0).spec.template.spec.containers 0).image }}{{ "\n" }}{{end}}' 2>/dev/null | sed 's/.*:/agent:\t/';

Example output looks like this:

agent:   X1.Y1.Z1
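
If the go-template is hard to read, the same information can be obtained with a jsonpath query and a small tag extraction. This is a sketch with hypothetical function names. It uses the same namespace and label as the command above and fails with an error if no matching deployment exists.

```shell
# Hypothetical helper: extracts the tag from an image reference such as
# registry.example.com/agent:1.2.3 (does not handle digest references).
image_tag() {
  echo "${1##*:}"
}

# Prints the agent image via a jsonpath query; same labels and namespace
# as the command above.
agent_image() {
  kubectl get deploy --namespace=apheris --selector=apheris-component=agent \
    --output=jsonpath='{.items[0].spec.template.spec.containers[0].image}'
}
```

For example, image_tag "$(agent_image)" prints only the version part of the agent image.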

Tail Gateway agent logs

To view the Gateway agent logs, run the following command:

kubectl logs --namespace=apheris --selector=apheris-component=agent --follow

The Gateway agent logs are in JSON Lines format (one JSON object per line).

We recommend piping the output of the above command into jq to get pretty-printed output that is easier to work with.

Keep in mind that the Gateway agent logs get rotated by the Kubernetes node, so you might not have the full view in case your agent runs for a while.
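
A plain jq call stops at the first line that is not valid JSON. The sketch below (the function name is hypothetical) pretty-prints JSON lines and passes any other line through unchanged. It makes no assumption about the field names in the agent logs.

```shell
# Hypothetical helper: pretty-prints JSON Lines from stdin with jq and
# passes through any line that is not valid JSON. Falls back to plain
# output when jq is not installed.
pretty_logs() {
  if command -v jq >/dev/null 2>&1; then
    jq --raw-input --raw-output 'fromjson? // .'
  else
    cat
  fi
}
```

Usage: kubectl logs --namespace=apheris --selector=apheris-component=agent --follow | pretty_logs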

What else is running on my Gateway?

To check what is running on the Kubernetes cluster that hosts the Compute Gateway, run the following command:

kubectl get pod --all-namespaces

On an idle Gateway, the output should look similar to:

NAMESPACE     NAME                                             READY   STATUS    RESTARTS        AGE
cilium        cilium-operator-76c55fc6b6-hlfr5                 1/1     Running   0               9m17s
cilium        cilium-wzl8x                                     1/1     Running   0               9m17s
kube-system   local-path-provisioner-957fdf8bc-kzgz6           1/1     Running   0               9m17s
kube-system   coredns-77ccd57875-vzghv                         1/1     Running   0               9m17s
apheris       apheris-gateway-agent-76b585556f-mfhx6           1/1     Running   0               7m48s

Note that Cilium may be disabled on your Compute Gateway, depending on your networking setup and your agreement with Apheris (see Disabling Cilium for more details). In that case, the cilium pods do not appear in the output.
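
To spot problems quickly in a longer pod list, the output can be filtered for pods that are not in a healthy state. This is a sketch with a hypothetical function name. It treats Running and Completed as healthy, which is an assumption that fits the idle output shown above.

```shell
# Hypothetical helper: reads `kubectl get pod --all-namespaces --no-headers`
# output on stdin and prints pods whose STATUS is neither Running nor
# Completed, as "namespace/name status".
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" {print $1 "/" $2, $4}'
}
```

Usage: kubectl get pod --all-namespaces --no-headers | unhealthy_pods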

What about those computations?

To list all the pods running in the apheris namespace for a particular APHERIS_JOB_ID, execute the following command:

kubectl get pod --namespace=apheris --selector='apheris_job_id=<APHERIS_JOB_ID>'

An example output might look similar to:

NAME                                                    READY   STATUS    RESTARTS   AGE
c1239352-6e3f-4254-947c-53121439aaf7-774b5b98cb-m6zwz   2/2     Running   0          139m
...

How many resources are those using?

To see the resource requests and limits for each running computation pod, execute the following command:

kubectl get pod --namespace=apheris --selector=apheris_job_id --output=jsonpath='{.items[*].spec.containers[*].resources}'
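
The jsonpath output above is a single line of JSON fragments. As an alternative, the same fields can be rendered as a table with custom columns. This is a sketch: the function name and the KUBECTL override are hypothetical, while the namespace and selector are taken from the command above.

```shell
# Hypothetical wrapper: the same query as above, rendered as a table.
# KUBECTL can be overridden, for example to point at a different binary.
KUBECTL="${KUBECTL:-kubectl}"

job_resources() {
  "$KUBECTL" get pod --namespace=apheris --selector=apheris_job_id \
    --output=custom-columns='POD:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_LIM:.spec.containers[*].resources.limits.memory'
}
```

Each row is one pod. For pods with several containers, the values of all containers appear comma-separated in the respective column.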

Troubleshooting

If you cannot find the information you need or are facing any other issues, please open a ticket on the Apheris Help Center, or reach out to Apheris support via your dedicated support channel or by email at support@apheris.com.