Monitoring Compute Gateways
Basic overview in the Governance Portal
The Governance Portal shows the most important settings of your Compute Gateway(s) as well as their running status.
Compute Gateway logs
Log Event Ingestion
The Apheris Compute Gateway components emit logs in `jsonline` format to `stdout`/`stderr`.
This integrates with any logging system that is tailored towards Kubernetes.
Note
No Apheris component maintains log files.
Log shipping, ingestion, and indexing are out of scope of this guide because the setup depends on the logging system you use. Please refer to the documentation of your specific logging system for instructions.
If you have further questions, please contact your Apheris representative or reach out via support@apheris.com.
Log Event Format
All Compute Gateway components produce logs in `jsonline` format (one JSON document per log event, on a single line) and emit them to the container's (and pod's) stdout.
The logs are leveled; the default level is `info`. The log levels can be set via the `agent.logLevel` and `dal.logLevel` Helm values.
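Independently of the level configured on the gateway, a consumer can filter events by severity after the fact. The following is a minimal Python sketch, not part of the Apheris tooling. The function name `at_least` and the severity ordering in `LEVEL_ORDER` are assumptions made for illustration; this guide only shows the levels `info` and `error`.

```python
import json

# Assumed severity ordering; only "info" and "error" appear in this guide.
LEVEL_ORDER = {"debug": 10, "info": 20, "warn": 30, "error": 40}


def at_least(lines, min_level="info"):
    """Yield parsed log events whose level is at or above min_level."""
    threshold = LEVEL_ORDER[min_level]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        event = json.loads(line)
        # Events with an unknown level are kept rather than silently dropped.
        rank = LEVEL_ORDER.get(event.get("level"), threshold)
        if rank >= threshold:
            yield event
```

For example, `list(at_least(open("gateway.log"), "error"))` would return only error events from a hypothetical file `gateway.log` that holds captured container output.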
Fields
field | description |
---|---|
level | the log level of the event |
ts | timestamp of the event as Unix epoch seconds (with fractional part) |
msg | main message |
error | (optional) error message if present |
stacktrace | (optional) stacktrace if present |
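The fields above map naturally onto a small typed record. The following Python sketch parses one log line under the assumption that `ts` is a Unix timestamp in seconds, as the example events in this guide suggest. The names `LogEvent` and `parse_event` are hypothetical and not part of any Apheris package; fields not listed in the table (such as `caller` or `agent_id`) are collected in `extra`.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class LogEvent:
    level: str
    ts: datetime
    msg: str
    error: Optional[str] = None
    stacktrace: Optional[str] = None
    extra: dict = field(default_factory=dict)


def parse_event(line: str) -> LogEvent:
    """Parse a single jsonline log event into a LogEvent."""
    doc = json.loads(line)
    return LogEvent(
        level=doc.pop("level"),
        # Convert epoch seconds to a timezone-aware UTC datetime.
        ts=datetime.fromtimestamp(doc.pop("ts"), tz=timezone.utc),
        msg=doc.pop("msg"),
        # Treat an empty error string the same as a missing error.
        error=doc.pop("error", None) or None,
        stacktrace=doc.pop("stacktrace", None),
        # Everything else (component-specific fields) is kept as-is.
        extra=doc,
    )
```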
Gateway Agent Logs
The following examples are reformatted for readability.
An example error event:
{
  "level": "error",
  "ts": 1686125489.6039624,
  "caller": "app/result_adapter.go:19",
  "msg": "receiving event",
  "agent_id": "35d1f1d5-318a-458e-9432-97d892c6c296",
  "error": "Get \"http://orchestrator/computations\": dial tcp: lookup orchestrator on 10.96.0.10:53: server misbehaving",
  "stacktrace": "main.resultAdapter.func1\n\t/go/src/app/result_adapter.go:19"
}
An example computation request event:
{
  "level": "info",
  "ts": 1710169487.2425287,
  "caller": "agent/computation_pipeline.go:186",
  "msg": "computation request",
  "agent_id": "c4e84dc3-3248-44b2-890b-b4b6f0b472d0",
  "request": {
    "id": "a1f76a60-300c-43cd-af9a-f7f3cfec9e69",
    "resources": {
      "cpu": 0.5,
      "memory": 500
    },
    "authentication": {
      "userSession": "..."
    },
    "execution": {
      "image": "quay.io/apheris/statistics:0.3.0",
      "dataSources": [
        {
          "path": "s3://apheris-tutorials-data/whas/worcester/data.csv",
          "key": "whas1_gateway-1_org-1"
        }
      ],
      "Parameters": {
        "NvflareParameters": {
          "arguments": "-u -m nvflare.private.fed.app.client.client_train -m /workspace -s fed_client.json --set secure_train=true uid=f44f2052-659a-43fd-84f8-8942627d222c org=org_yJz0JV5nAkFTkyl9 config_folder=config",
          "deploymentID": "88aaf187-3ca2-4460-9271-359b1a4ef57d"
        }
      },
      "Statement": {
        "NvflareStatement": {
          "command": "/usr/local/bin/python3"
        }
      }
    },
    "replicas": 1
  }
}
An example heartbeat error event:
{
  "level": "error",
  "ts": 1687431881.4542866,
  "caller": "app/main.go:179",
  "msg": "heartbeat",
  "agent_id": "972a5b9d-d67e-4474-a3fb-1240cbfedd67",
  "error": "error response from server: <html>\r\n<head><title>504 Gateway Time-out</title></head>\r\n<body>\r\n<center><h1>504 Gateway Time-out</h1></center>\r\n</body>\r\n</html>\r\n",
  "stacktrace": "main.main.func4\n\t/go/src/app/main.go:179\ngithub.com/apheris/node-agent/pkg/orchestrator.Client.GatewayHeartbeat.func1\n\t/go/src/app/pkg/orchestrator/orchestrator.go:155"
}
Notable events
msg field | level | when | description |
---|---|---|---|
"configuration" | info | once at startup | agent configuration |
"computation request" | info | for every computation request event | the entire payload of the computation request event |
"heartbeat" | error | for every heartbeat error event | the error message and the stacktrace of the heartbeat error event |
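The notable agent events can be condensed into one-line summaries, for example for alerting or a quick review. The following Python sketch relies only on the field names visible in the example events in this section; the function name `summarize_agent_event` is hypothetical, and the payload layout of a computation request is assumed to match the example shown here.

```python
import json


def summarize_agent_event(line):
    """Return a short summary for a notable agent event, or None otherwise."""
    event = json.loads(line)
    msg, level = event.get("msg"), event.get("level")

    if msg == "heartbeat" and level == "error":
        # The error can be a multi-line HTML body; keep only its first line.
        error_lines = event.get("error", "").splitlines()
        reason = error_lines[0] if error_lines else "unknown"
        return "heartbeat failed: " + reason

    if msg == "computation request":
        request = event.get("request", {})
        execution = request.get("execution", {})
        keys = [ds.get("key") for ds in execution.get("dataSources", [])]
        return (
            f"computation {request.get('id')} "
            f"image={execution.get('image')} datasets={keys}"
        )

    return None  # not one of the notable events handled here
```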
Data Access Layer (DAL) Logs
The following examples are reformatted for readability.
An example data access log event:
{
  "level": "info",
  "ts": 1709735744.1490877,
  "caller": "dal/http_middleware.go:58",
  "msg": "request",
  "instance_id": "64646a19-62f0-43c1-9c6a-30844a31f749",
  "http_status": 200,
  "http_method": "GET",
  "url": "/datasets/s3://apheris-tutorials-data/whas/worcester/data.csv",
  "request_duration": 0.327430324,
  "error": ""
}
Notable events
msg field | level | when | description |
---|---|---|---|
"configuration" | info | once at startup | agent configuration |
"request" | info | for every request for a dataset that DAL (Data Access Layer) serves | includes the dataset url (as url field) |
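Because every dataset request is logged with its `url`, the DAL logs can serve as a simple data access audit trail. The following Python sketch aggregates `request` events per dataset URL. The function name `dataset_access_stats` is hypothetical, and counting a request as failed when `http_status` is 400 or above or `error` is non-empty is an assumption, not documented behavior.

```python
import json
from collections import defaultdict


def dataset_access_stats(lines):
    """Aggregate DAL "request" events per dataset URL."""
    stats = defaultdict(
        lambda: {"count": 0, "failed": 0, "total_duration": 0.0}
    )
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        event = json.loads(line)
        if event.get("msg") != "request":
            continue  # ignore other events such as "configuration"
        entry = stats[event.get("url", "<unknown>")]
        entry["count"] += 1
        entry["total_duration"] += float(event.get("request_duration", 0.0))
        # Assumption: status >= 400 or a non-empty error marks a failure.
        if event.get("http_status", 0) >= 400 or event.get("error"):
            entry["failed"] += 1
    return dict(stats)
```

Dividing `total_duration` by `count` gives the mean request duration per dataset, in whatever unit the DAL uses for `request_duration`.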