
Deploying using the Installer

This HOWTO outlines the setup of an Apheris Compute Gateway using the deploy_gateway installer script.

It is aimed at operators in charge of setting up a Compute Gateway. The deploy_gateway script is suited for both Virtual Machine and bare-metal deployments. Please make sure to read through the prerequisites before beginning a setup.

Prerequisites

We recommend deploying the single-instance Compute Gateway on a virtual machine (VM). The VM can be set up on a dedicated on-prem server or in the cloud (e.g., EC2 on AWS, VM on Azure, GCE on GCP).

All the data files you wish to register with your Gateway must be accessible from the system hosting your Compute Gateway.

Supported Operating Systems

The Apheris Compute Gateway deploy_gateway installer can be executed on various Linux distributions. Apheris currently supports:

OS        Minimum   Recommended
Red Hat   8         9
Ubuntu    20.04     22.04
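
A host can be compared against the table above by reading /etc/os-release. The following is a minimal sketch and not part of the deploy_gateway script: the function names are hypothetical, and it assumes GNU sort (for the -V option) and that Red Hat hosts report ID=rhel.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if the distribution ID and version meet the
# documented minimum (Red Hat 8, Ubuntu 20.04), non-zero otherwise.
is_supported_os() {
  local id="$1" version="$2" min
  case "$id" in
    rhel)   min="8" ;;
    ubuntu) min="20.04" ;;
    *)      return 1 ;;
  esac
  # sort -V orders version strings; the minimum must sort first (or be equal).
  [ "$(printf '%s\n%s\n' "$min" "$version" | sort -V | head -n 1)" = "$min" ]
}

# Checks the current host; /etc/os-release defines ID and VERSION_ID.
check_host_os() {
  ( . /etc/os-release && is_supported_os "$ID" "$VERSION_ID" )
}
```

Running `check_host_os && echo "supported"` on the target system gives a quick indication before starting the setup.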

VM Specification

The VM resource specifications vary based on the type of workloads you plan to run on the system and the number of jobs you intend to run simultaneously. A recommended starting point is to allocate the following resources:

Resource   Recommended
CPU        8
RAM        32 GB
Disk       200 GB

Take into account the CPU and memory requirements specified in the container's resource request for each workload, as well as the GPU requirements if the job demands it.
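
The recommended starting point can be checked with a small helper. This is an illustrative sketch only: check_vm_size is a hypothetical name, and the usage comments assume a Linux host with GNU coreutils and that the root filesystem is the relevant disk.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compares measured values against the recommended
# starting point from this guide (8 CPUs, 32 GB RAM, 200 GB disk).
# Prints one line per shortfall and returns non-zero if anything is below it.
check_vm_size() {
  local cpus="$1" mem_gb="$2" disk_gb="$3" rc=0
  [ "$cpus" -ge 8 ]      || { echo "CPU: $cpus < 8"; rc=1; }
  [ "$mem_gb" -ge 32 ]   || { echo "RAM: ${mem_gb} GB < 32 GB"; rc=1; }
  [ "$disk_gb" -ge 200 ] || { echo "Disk: ${disk_gb} GB < 200 GB"; rc=1; }
  return "$rc"
}

# Possible usage on a Linux host (memory is rounded to the nearest GiB,
# because the kernel reports slightly less than the nominal size):
#   cpus=$(nproc)
#   mem_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024 + 0.5)}' /proc/meminfo)
#   disk_gb=$(df -BG --output=size / | tail -n 1 | tr -dc '0-9')
#   check_vm_size "$cpus" "$mem_gb" "$disk_gb"
```

The result is advisory: workloads with larger resource requests, or GPU jobs, may need more than this baseline.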

Throughput considerations

Here are some guidelines to help make sure the instance is sized appropriately:

  • If the computations require GPU usage, specific support from Apheris may be required. We currently support the setup of the NVIDIA drivers and plugins for Ubuntu distributions. Please contact your Apheris representative for more information.
  • During operations, the size required for each computation can be monitored in agent logs or in computation pods themselves. Some helpful commands to assess these are:

    * kubectl logs -n apheris deploy/gateway-agent-agent | jq 'select(.request and .request.id and .request.resources) | {request_id: .request.id, resources: .request.resources}' - these logs are in JSON and show computation requests, including the cpu and memory fields of the requested resources;

    * kubectl get po -n apheris -l apheris_job_id -o jsonpath='{.items[*].spec.containers[*].resources}' - this will show the resources requested by each running computation pod.

    The frequency and size of these requests should be weighed against cluster capacity. The cluster capacity can be monitored with kubectl describe node. The output shows how much of the node's allocatable CPU and memory is currently occupied.

If the cluster is at capacity, new computations are scheduled but remain pending until resources become available. If this happens repeatedly, check whether old workloads are still running and should be deactivated, or increase the VM size if possible. Size increases are best planned around the memory requests of the computations. For example, if 7 GB are available and each incoming computation requests 4 GB, only one computation fits; increasing the VM size by 1 GB allows a second computation to run concurrently.
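
The sizing arithmetic from the example can be written down as two small functions. This is a sketch with hypothetical function names; it works in whole GB and considers memory only.

```shell
#!/usr/bin/env bash
# Number of computations that fit side by side, by memory request (whole GB).
concurrent_jobs() {
  local available_gb="$1" request_gb="$2"
  echo $(( available_gb / request_gb ))
}

# Extra memory (GB) needed so that exactly one more computation fits.
extra_gb_for_one_more() {
  local available_gb="$1" request_gb="$2"
  echo $(( request_gb - available_gb % request_gb ))
}

concurrent_jobs 7 4         # prints 1
extra_gb_for_one_more 7 4   # prints 1
concurrent_jobs 8 4         # prints 2
```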

CPU requests are less likely to break a computation, since the computations will scale to the available CPU. However, if the CPU requests are too low, the computation timings might become impractical.

Software

To run the deploy_gateway script you will need bash and curl installed.

You can install both with the system default package management tool:

  • Red Hat
yum install bash curl
  • Ubuntu
apt install bash curl
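
To confirm that both tools are available before running the installer, a check along the following lines can be used. The function name missing_tools is hypothetical.

```shell
#!/usr/bin/env bash
# Prints the name of every given command that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing=$(missing_tools bash curl)
if [ -n "$missing" ]; then
  echo "Please install: $missing"
fi
```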

Networking

During setup and upgrades, the Gateway accesses endpoints to download additional components.

During normal runtime, the Gateway communicates with:

  • Apheris-managed services
  • Auth0 for user authentication
  • External Docker registries to pull Apheris / Customer custom model images

The exact list of endpoints that must be accessible from the Gateway depends on your platform setup. Currently, the following domains should be allowed when running the deploy_gateway script with default options:

amazonaws.com (public tutorial dataset)
<tenant>.apheris.net (orchestrator)
auth0.com (authentication)
cilium.io (cilium helm chart)
cloudfront.net (dockerhub)
docker.com (dockerhub)
docker.io (dockerhub)
github.com (github hosted files)
github.io (github hosted files)
githubusercontent.com (github hosted files)
helm.sh (helm binary)
k3s.io (K3s installer)
keybase.io (gpg keys)
quay.io (Apheris and Cilium images)

Please contact your Apheris representative if you need more details or the exact list of endpoints that must be accessible for your platform setup.
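
Outbound connectivity can be probed with curl before running the installer. The sketch below is illustrative: the host names are assumptions chosen as examples under some of the listed domains, not the exact endpoints used by the deploy_gateway script, and all function names are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative hosts under some of the domains listed above. These are
# assumptions for the sketch, not the exact endpoints used by the script.
probe_hosts() {
  cat <<'EOF'
get.k3s.io
get.helm.sh
quay.io
github.com
EOF
}

# Returns 0 if an HTTPS connection to the host can be established.
# Any HTTP status counts as reachable; only connection failures are errors.
can_reach() {
  curl --silent --output /dev/null --connect-timeout 5 --max-time 10 "https://$1"
}

# Prints one result line per host.
probe_all() {
  local host
  probe_hosts | while read -r host; do
    if can_reach "$host"; then echo "ok    $host"; else echo "FAIL  $host"; fi
  done
}
```

Calling `probe_all` on the target system lists which of the example hosts are reachable. curl honors the usual proxy environment variables, so the same check can be used behind an HTTP proxy.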

Hardening notes

During setup and upgrades, the deploy_gateway script modifies some OS kernel settings in order to meet a small set of required CIS controls.

Additionally, it configures the Pod Security Admission to enforce the restricted policy in the apheris namespace.

Firewall

During setup and upgrades, the deploy_gateway script modifies the firewall rules if:

  • Red Hat: firewalld is active
  • Ubuntu: ufw is active

Please see the official K3s documentation for more details.

If another firewall utility is used, for example iptables, please make sure that traffic to port 6443 is allowed and that traffic can be routed from the internal Kubernetes service kubernetes.default.svc.cluster.local:443 to the control plane port localhost:6443, according to Inbound Rules for K3s Server Nodes.
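
A basic check of whether the control plane port accepts TCP connections from the host itself can look as follows. This is a sketch under assumptions: port_open is a hypothetical name, it relies on the bash-specific /dev/tcp feature and the coreutils timeout command, and it does not verify routing from inside the cluster.

```shell
#!/usr/bin/env bash
# Returns 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so bash is invoked explicitly.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Possible usage on the Gateway host:
#   port_open localhost 6443 && echo "port 6443 accepts connections"
```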

Firewall troubleshooting

For troubleshooting purposes, you may want to temporarily disable your firewall by performing the following steps:

  1. Backup your existing iptables rules:

    sudo iptables-save > "$HOME/firewall_rules.backup"
    
  2. Flush the existing iptables rules:

    sudo iptables -P INPUT ACCEPT
    sudo iptables -P OUTPUT ACCEPT
    sudo iptables -P FORWARD ACCEPT
    sudo iptables -F
    sudo iptables -X
    
  3. Troubleshoot

  4. Restore iptables rules:

    sudo iptables-restore < "$HOME/firewall_rules.backup"
    

Please don’t hesitate to reach out to Apheris support if you need assistance.

HTTP proxy

If your network requires a proxy to access the internet, you will need to configure the proxy settings for the Gateway. Please refer to the Helm chart's README, specifically the Proxies section, for guidance on setting the appropriate values.

Additional details about configuring K3s to work behind an HTTP proxy can be found in the official K3s documentation.

Deliverables

Your Apheris representative provides you with the following files:

  • deploy_gateway
  • values.yaml

The deploy_gateway script deploys the gateway and should not be modified. The values.yaml file contains the configuration values specific to your gateway and can be modified.

Apheris shares those files as a .zip archive via your dedicated support channel.

The Gateway-specific secrets are shared by Apheris via a Bitwarden Send link, and the password for the link is shared via a separate channel for security reasons. The link has an expiration time and can be opened exactly once.

The secrets have a JSON format like:

[
   {
      "clientId": "clientId",
      "clientSecret": "clientSecret",
      "helmRepoUsername": "helmRepoUsername",
      "helmRepoPassword": "helmRepoPassword"
   },
   ...
]
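
Because the link can be opened only once, it can be worth checking the saved secrets for completeness right away. The following sketch assumes jq is installed and that the secrets were saved as a JSON array of objects as shown above; secrets_complete is a hypothetical name, and the check does not print any secret values.

```shell
#!/usr/bin/env bash
# Reads the secrets JSON from stdin. Exits 0 only if the array is non-empty
# and every entry has all four expected fields present (non-null).
secrets_complete() {
  jq -e 'length > 0 and all(.[];
    .clientId and .clientSecret and .helmRepoUsername and .helmRepoPassword)' \
    >/dev/null
}

# Possible usage (the file name is an example):
#   secrets_complete < secrets.json && echo "all fields present"
```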

Configuration

You can customize the configuration in values.yaml to fit your specific setup. The following subsections explain the possibilities you have. If you change values in the values.yaml, you need to run the deploy_gateway script again (cf. Setup) to apply the new values.

Setting your tenant ID

Apheris will provide the tenant ID related to your Compute Gateway. It must be set in the values.yaml file as follows:

tenant: tenantID

Specifying where the real data resides

Your real data needs to be in a location that is accessible to the gateway. You can specify that location via the dal.sources value in the values.yaml file. Two scenarios are explained here: real data residing on a local filesystem, and real data residing in an external S3 bucket.

Local filesystem

The Apheris Data Access Layer (DAL) can access any folder that is locally accessible as a source for datasets. This enables scenarios where the datasets reside on a NAS or SAN and are attached to the host that the Apheris Data Access Layer runs on.

To use a folder as dataset source, it needs to be configured in the values.yaml via the dal.sources.files.folder value. Note that dal.sources.files.folder specifies the root folder, which can contain many datasets.

An example values.yaml snippet looks like:

dal:
  sources:
    files:
      folder: /ABSOLUTE/PATH/TO/THE/DATASETS/ROOT
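
Before setting the value, the folder can be checked on the host. This is a minimal sketch with a hypothetical function name. It checks readability for the current user only, which may differ from the user the DAL runs as.

```shell
#!/usr/bin/env bash
# Returns 0 if the path is absolute and an existing, readable directory.
valid_dataset_root() {
  case "$1" in
    /*) [ -d "$1" ] && [ -r "$1" ] ;;
    *)  return 1 ;;
  esac
}

# Possible usage:
#   valid_dataset_root /ABSOLUTE/PATH/TO/THE/DATASETS/ROOT || echo "check the path"
```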

External S3 bucket

The Apheris DAL can access any bucket hosted on an S3 API compatible solution.

To use an S3 bucket as dataset source, it needs to be configured in the values.yaml via the dal.sources.s3 value.

An example values.yaml snippet looks like:

dal:
  sources:
    s3:
      - EXAMPLE_BUCKET_NAME_1
      - EXAMPLE_BUCKET_NAME_2
      - ...

You need to ensure that the DAL has access to the respective S3 bucket. An example specific to AWS S3 is provided below. For S3-compatible solutions, please refer to the documentation of the solution you are using (for instance AWS or MinIO) for details on how to provide access.

Example for AWS: Giving DAL access to AWS S3

You will need to grant the Apheris Data Access Layer (DAL) access to the respective S3 bucket via the bucket's IAM policy.

An example bucket policy for AWS S3, assuming you are using IAM roles for service accounts, looks like:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::ACCOUNT_ID:role/EXAMPLE_DAL_IAM_ROLE_NAME"
        ]
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::EXAMPLE_BUCKET_NAME",
        "arn:aws:s3:::EXAMPLE_BUCKET_NAME/*"
      ]
    }
  ]
}

Optional: Disabling Cilium

For some specific networking use cases you may want to disable Cilium, which can be achieved by adding the following lines in values.yaml of your Compute Gateway:

cilium:
  enabled: false

You can enable Cilium again by either removing those values from values.yaml or changing false to true, and then rerunning the deploy_gateway script.

Note: Under some circumstances, Cilium might not be able to identify the K3s kubernetes control plane as the kube-apiserver entity. This issue will prevent the Compute Gateway agent from communicating with the Kubernetes API. As a workaround, you can enable communications to the host port where K3s binds the control plane (6443). This can be achieved by setting the following value:

cilium:
  egressToHost: true

Optional: Enabling NVIDIA GPU

For systems equipped with an NVIDIA GPU card, the deploy_gateway script can install the required NVIDIA driver and Kubernetes plugin if you enable this option in the values.yaml:

gpu:
  enabled: true
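
Whether the host exposes an NVIDIA device at all can be checked before enabling the option. The sketch below assumes the lspci utility is available on the host; has_nvidia_device is a hypothetical name.

```shell
#!/usr/bin/env bash
# Reads `lspci` output on stdin; returns 0 if an NVIDIA device is listed.
has_nvidia_device() {
  grep -qi 'nvidia'
}

# Possible usage on the host:
#   lspci | has_nvidia_device && echo "NVIDIA device detected"
#
# After deployment, and assuming the plugin advertises the standard
# nvidia.com/gpu resource, the allocatable GPU count can be inspected with:
#   kubectl get node -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
```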

Optional: Enable Asset Policy Signature Validation

Please refer to the guide

Configuration reference

The following values.yaml file can be used as a reference to guide you while setting its contents:

cilium:
  egressToHost: true

tenant: tenantID

gpu:
  enabled: false

auth:
  domain: auth.app.apheris.net
  orchestrator:
    clientId: "REDACTED"
    clientSecret: "REDACTED"

dal:
  sources:
    files:
      folder: "/home/datastore"

helmRepoUsername: "REDACTED"
helmRepoPassword: "REDACTED"

Setup

The setup consists of running the deploy_gateway script on a target system, pointing it to the correct values.yaml. You need to perform the setup for each Compute Gateway separately, specifying the correct Gateway-specific values in the YAML file.

To do so you will need to transfer the deliverables .zip to the target system first. We recommend moving it to a dedicated directory.

Execute the following inside of the directory that the .zip archive is in:

  1. Unpack the files from the .zip archive, either manually or using the unzip utility. An example command would be:

    unzip apheris-gateway-<VERSION>.zip -d apheris-gateway-<VERSION>
    

    where you should replace <VERSION> with the shipped version of the .zip.

  2. In the values.yaml, replace clientId and clientSecret (in the auth:orchestrator section), as well as helmRepoUsername and helmRepoPassword with the actual secrets pair from the Bitwarden link.

  3. Check that the file deploy_gateway is executable - if not, please make it executable by running chmod +x deploy_gateway.

  4. Execute the deploy_gateway script as root.

    ./deploy_gateway
    

    The deploy_gateway script defaults to using the values.yaml file in the directory it is executed in. You can use the --helm-values-file-path parameter to point to an alternative values file.

    In addition to deploying the gateway, the script downloads public datasets from Apheris to the local filesystem, either to /home/datastore or to the folder path specified in the values.yaml. These datasets are used to run smoke tests in order to verify that the gateway works properly.

  5. Validate the deployed versions. The following message should be displayed in the console:

    Deployed versions:
    ------------------
    agent:   X.Y.Z
    ------------------
    
  6. Validate the health checks completion. The following message should be displayed in the console if the checks have been successful:

    All services are ready.
    

After these steps, your gateway deployment is complete. You can then proceed to register the datasets in the Governance Portal, as described in the managing datasets page. Please also contact your Apheris representative, so that the Apheris team can help with running smoke tests on the freshly deployed Apheris Compute Gateway to verify that it is properly set up.
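
Steps 2 and 3 can be double-checked with a small pre-flight helper before the script is executed. This is a sketch, not part of the deliverables: preflight is a hypothetical name, and it assumes that unfilled secrets in values.yaml still carry the REDACTED placeholder shown in the configuration reference.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for the unpacked deliverables directory.
# Prints each problem found and returns non-zero if there is at least one.
preflight() {
  local dir="$1" rc=0
  [ -f "$dir/deploy_gateway" ] || { echo "missing: deploy_gateway"; rc=1; }
  [ -x "$dir/deploy_gateway" ] || { echo "not executable: deploy_gateway"; rc=1; }
  [ -f "$dir/values.yaml" ]    || { echo "missing: values.yaml"; rc=1; }
  # Assumption: unfilled secrets still carry the REDACTED placeholder.
  if grep -q 'REDACTED' "$dir/values.yaml" 2>/dev/null; then
    echo "values.yaml still contains REDACTED placeholders"; rc=1
  fi
  return "$rc"
}

# Possible usage from the directory that contains the unpacked files:
#   preflight . && sudo ./deploy_gateway
```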

Operating

The following is a loose collection of operations that may help you when operating the gateway.

Versions of Apheris components

The following command prints the version of the Gateway agent used on your Gateway:

kubectl get deploy -n apheris -l apheris-component=agent -o go-template='{{if and (index .items 0).spec.template.spec.containers (index (index .items 0).spec.template.spec.containers 0).image }}{{ (index (index .items 0).spec.template.spec.containers 0).image }}{{ "\n" }}{{end}}' 2>/dev/null | sed 's/.*:/agent:\t/';

Example output looks like this:

agent:   X1.Y1.Z1
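
The version is the tag of the agent's container image. As a sketch of the same idea in smaller pieces, the image reference can be read with a jsonpath query and the tag extracted in the shell. image_tag is a hypothetical helper and assumes a reference of the form name:tag rather than a digest.

```shell
#!/usr/bin/env bash
# Extracts the tag from an image reference of the form name:tag by removing
# everything up to and including the last colon.
image_tag() {
  echo "${1##*:}"
}

# Possible usage on the Gateway host:
#   image=$(kubectl get deploy -n apheris -l apheris-component=agent \
#     -o jsonpath='{.items[0].spec.template.spec.containers[0].image}')
#   printf 'agent:\t%s\n' "$(image_tag "$image")"
```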

Tail Gateway agent logs

To view the Gateway agent logs, run the following command:

kubectl logs --namespace=apheris --selector=apheris-component=agent --follow

The Gateway agent logs are in JSON format (JSON Lines, to be precise).

We recommend piping the output of the above command into jq to have a pretty printed output that is easier to work with.

Keep in mind that the Gateway agent logs get rotated by the Kubernetes node, so you might not see the full history if your agent has been running for a while.
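
A possible jq wrapper for this is sketched below. pretty_logs is a hypothetical name; the filter pretty-prints every line that parses as JSON and passes any other line through unchanged, so that an occasional non-JSON line does not abort the stream.

```shell
#!/usr/bin/env bash
# Pretty-prints JSON lines from stdin and passes non-JSON lines through as-is.
# --unbuffered flushes after each line, which is useful together with --follow.
pretty_logs() {
  jq --unbuffered -rR '. as $line | try fromjson catch $line'
}

# Possible usage:
#   kubectl logs --namespace=apheris --selector=apheris-component=agent --follow | pretty_logs
```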

What else is running on my Gateway?

To check what is running on the Kubernetes cluster that hosts the Compute Gateway, run the following command:

kubectl get pod --all-namespaces

You should get a similar output on an idle Gateway:

NAMESPACE     NAME                                             READY   STATUS    RESTARTS        AGE
cilium        cilium-operator-76c55fc6b6-hlfr5                 1/1     Running   0               9m17s
cilium        cilium-wzl8x                                     1/1     Running   0               9m17s
kube-system   local-path-provisioner-957fdf8bc-kzgz6           1/1     Running   0               9m17s
kube-system   coredns-77ccd57875-vzghv                         1/1     Running   0               9m17s
apheris       apheris-gateway-agent-76b585556f-mfhx6           1/1     Running   0               7m48s

Note that Cilium may be disabled on your Compute Gateway, depending on the specifics of your networking setup and your agreement with Apheris (see Disabling Cilium for more details).
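
To spot pods that are not in a healthy state, the pod listing can be filtered on the STATUS column. This is a minimal sketch with a hypothetical function name. It expects the output of the command above with the --no-headers option added.

```shell
#!/usr/bin/env bash
# Reads `kubectl get pod --all-namespaces --no-headers` output on stdin and
# prints namespace/name for every pod whose STATUS is not Running or Completed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 }'
}

# Possible usage:
#   kubectl get pod --all-namespaces --no-headers | unhealthy_pods
```

An empty result means that every listed pod is either Running or Completed.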

What about those computations?

To list all the pods running in the apheris namespace for a particular APHERIS_JOB_ID, execute the following command:

kubectl get pod --namespace=apheris -l 'apheris_job_id=<APHERIS_JOB_ID>'

An example output might look similar to:

NAME                                                    READY   STATUS    RESTARTS   AGE
c1239352-6e3f-4254-947c-53121439aaf7-774b5b98cb-m6zwz   2/2     Running   0          139m
...

Troubleshooting

If you struggle to find the information you need or are facing any other issues, please reach out to Apheris support via your dedicated support channel or by emailing support@apheris.com.