XGBoost Model Tutorial🔗
This model provides a training workflow for a Gradient Boosting Classifier based on the XGBoost library. XGBoost (Extreme Gradient Boosting) is a powerful, scalable machine learning library designed for structured/tabular data, renowned for its speed and performance in predictive modeling competitions such as Kaggle. It builds an ensemble of decision trees using gradient boosting, optimizing for accuracy and efficiency while supporting parallelization and distributed computing. Federation is inherently part of the XGBoost training architecture used by this model, which integrates tightly with NVIDIA FLARE through its XGBoost components.
Installation🔗
First, please follow Quickstart: Installing the Apheris CLI to ensure you have the CLI and its dependencies installed.
The Apheris XGBoost model can take FederatedDataFrames as input. Therefore, please also install the Apheris Statistics wheel, as described in Simulating and Running Statistics Workflows.
Finally, install the Apheris XGBoost wheel as shown below (replace x.y.z with the version number of your wheel file):
pip install apheris_xgboost-x.y.z-py3-none-any.whl
The Guide🔗
import apheris_xgboost
import json
from pathlib import Path
from tempfile import TemporaryDirectory
import matplotlib.pyplot as plt
import matplotlib
from matplotlib.pylab import rcParams
import pandas as pd
from apheris_xgboost.api_client import fit_xgboost
from apheris_xgboost.histogram_v2 import apheris_fed_xgb_histogram_executor
from xgboost import plot_tree, plot_importance
from xgboost.core import Booster
from apheris_xgboost.session import XGBoostLocalSession, XGBoostRemoteSession
from apheris_stats.simple_stats.util import FederatedDataFrame
from apheris_auth.remote_data import RemoteData
from aphcli.api import compute as compute_api
from aphcli.api import job as jobapi
from aphcli import models, job, compute, datasets
from apheris_auth import login
from apheris_xgboost.secure_runtime.job import Job
from apheris_xgboost.secure_runtime import template
def visualize_model(model, num_tree):
    # Serialize the model dict to a temporary JSON file, load it into an
    # XGBoost Booster and plot the requested tree.
    with TemporaryDirectory() as tmp_model_dir:
        modelfile = Path(tmp_model_dir) / "model.json"
        modelfile.write_text(json.dumps(model))
        booster = Booster()
        booster.load_model(fname=str(modelfile))
        rcParams['figure.figsize'] = 80, 50
        plot_tree(booster, num_trees=num_tree)


def convert_model(model: dict) -> pd.DataFrame:
    # Load the model dict into a Booster and return its trees as a DataFrame.
    with TemporaryDirectory() as tmp_model_dir:
        modelfile = Path(tmp_model_dir) / "model.json"
        modelfile.write_text(json.dumps(model))
        booster = Booster()
        booster.load_model(fname=str(modelfile))
        return booster.trees_to_dataframe()
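Both helper functions repeat the same step of writing the model dictionary to a temporary JSON file and loading it into an XGBoost Booster. If you prefer, you can factor that step out; the helper below is a small sketch based only on the imports above (it is not part of the Apheris package) and is reused in later snippets of this guide.

def dict_to_booster(model: dict) -> Booster:
    # Write the model dict to a temporary JSON file and load it into an XGBoost Booster.
    with TemporaryDirectory() as tmp_model_dir:
        modelfile = Path(tmp_model_dir) / "model.json"
        modelfile.write_text(json.dumps(model))
        booster = Booster()
        booster.load_model(fname=str(modelfile))
    return booster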
compute_spec_id = None  # set this to an existing Compute Spec ID to reuse it
First, create a Compute Spec using the Apheris CLI (see here for more information on the CLI):
if compute_spec_id is None:
    compute_spec_id = compute.create(
        dataset_ids="whas1_gateway-1_org-1,whas2_gateway-2_org-2",
        model_id='apheris-xgboost',
        model_version="0.6.0",
        client_memory=1024,
        client_n_cpu=1,
        client_n_gpu=0,
        server_memory=1024,
        server_n_cpu=1,
        server_n_gpu=0,
    )
Now, check whether the Compute Spec is already running and activate it if it is not:
if compute_api.get_activation_status(compute_spec_id) != 'running':
    compute.activate(compute_spec_id)
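Activation can take a while. If you want your script to block until the Compute Spec is ready, you can poll its status; the snippet below is a minimal sketch that only relies on the get_activation_status call used above (the timeout and poll interval are arbitrary choices).

import time

# Poll until the Compute Spec reports 'running', with a simple timeout.
timeout_s = 600
poll_interval_s = 15
waited_s = 0
while compute_api.get_activation_status(compute_spec_id) != 'running':
    if waited_s >= timeout_s:
        raise TimeoutError(f"Compute Spec {compute_spec_id} is still not running after {timeout_s}s")
    time.sleep(poll_interval_s)
    waited_s += poll_interval_s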
Create XGBoostSession🔗
For test runs in simulator mode, you can use a local XGBoostLocalSession, which is created with a mapping of client IDs to dataset filenames. The local path to the datasets is set via the dataset_root variable; in our case, the local data is in a data folder in the home directory.
The remote session XGBoostRemoteSession is created with a list of dataset IDs and a Compute Spec ID.
# insert your local file path here
local_path = f"{Path.home()}/data"

filenames_per_client = {
    "site-1": ["whas1-data.csv"],
    "site-2": ["whas2-data.csv"],
}

local_session = XGBoostLocalSession(
    dataset_mapping=filenames_per_client,
    workspace="/tmp/xgboost",
    dataset_root=local_path,
)

remote_session = XGBoostRemoteSession(
    compute_spec_id,
    dataset_ids=["whas1_gateway-1_org-1", "whas2_gateway-2_org-2"],
)
Run Model Training via Python API🔗
To start the training with configurable XGBoost parameters, use the API function fit_xgboost. The session parameter controls whether a local simulator run or a remote run is triggered. The configurable XGBoost parameters are:
num_rounds
eta
max_depth
objective
eval_metric
num_parallel_tree
enable_categorical
tree_method
nthread
The number of trees in the final model is the product of num_rounds and num_parallel_tree; for example, the runs below use num_rounds=1 with num_parallel_tree=5 and therefore produce 5 trees. For further information about the XGBoost parameters, see the XGBoost documentation.
target_col = "fstat"
cols = [
"afb", "age", "bmi", "chf"
]
# local training
model = fit_xgboost(
num_rounds=1,
target_col=target_col,
cols=cols,
eta=0.2,
eval_metric="mae",
max_depth=3,
num_parallel_tree=5,
session=local_session,
)
model
# remote training
model = fit_xgboost(
    num_rounds=1,
    target_col=target_col,
    cols=cols,
    eta=0.2,
    eval_metric="mae",
    max_depth=3,
    num_parallel_tree=5,
    session=remote_session,
)
model
{'learner': {'attributes': {'best_iteration': '1',
'best_score': '0.3502602116332028'},
'feature_names': ['afb', 'age', 'bmi', 'chf'],
'feature_types': ['float', 'float', 'float', 'float'],
'gradient_booster': {'model': {'gbtree_model_param': {'num_parallel_tree': '5',
'num_trees': '5'},
'iteration_indptr': [0, 5],
'tree_info': [0, 0, 0, 0, 0],
'trees': [{'base_weights': [
6.802704e-09,
-0.09351389,
...
0.0054957066],
'categories': [],
'categories_nodes': [],
'categories_segments': [],
'categories_sizes': [],
'default_left': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'id': 0,
'left_children': [1, 3, 5, 7, 9, 11, 13, -1, -1, -1, -1, -1, -1, -1, -1],
'loss_changes': [ 2.6578097, 1.9690572, ... 0.0],
'parents': ...,
'right_children': ...,
'split_conditions': ...,
'split_indices': ...,
'split_type': ...,
'sum_hessian': ...,
'tree_param': {'num_deleted': '0',
'num_feature': '4',
'num_nodes': '15',
'size_leaf_vector': '1'}},
...
]},
'name': 'gbtree'},
'learner_model_param': {'base_score': '6.557377E-1',
'boost_from_average': '1',
'num_class': '0',
'num_feature': '4',
'num_target': '1'},
'objective': {'name': 'reg:squarederror',
'reg_loss_param': {'scale_pos_weight': '1'}}},
'version': [2, 1, 1]}
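The returned model is a plain XGBoost model dictionary, so you can also use it for local inference outside of Apheris with standard XGBoost APIs. The snippet below is a sketch that reuses the dict_to_booster helper defined earlier; the local CSV path is an assumption and only works if you have the data available on your machine.

from xgboost import DMatrix

# Load the federated model into a Booster and score a local dataframe with it.
booster = dict_to_booster(model)
local_df = pd.read_csv(f"{Path.home()}/data/whas1-data.csv")
predictions = booster.predict(DMatrix(local_df[cols]))
print(predictions[:10])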
Inspect Trees from the resulting Model🔗
visualize_model(model, num_tree=1)
convert_model(model)
|    | Tree | Node | ID   | Feature | Split    | Yes | No   | Missing | Gain      | Cover | Category |
|----|------|------|------|---------|----------|-----|------|---------|-----------|-------|----------|
| 0  | 0    | 0    | 0-0  | age     | 72.00000 | 0-1 | 0-2  | 0-2     | 19.258240 | 384.0 | NaN      |
| 1  | 0    | 1    | 0-1  | chf     | 1.00000  | 0-3 | 0-4  | 0-4     | 5.252416  | 183.0 | NaN      |
| 2  | 0    | 2    | 0-2  | age     | 86.00000 | 0-5 | 0-6  | 0-6     | 3.807554  | 201.0 | NaN      |
| 3  | 0    | 3    | 0-3  | bmi     | 23.40377 | 0-7 | 0-8  | 0-8     | 0.895349  | 149.0 | NaN      |
| 4  | 0    | 4    | 0-4  | age     | 68.00000 | 0-9 | 0-10 | 0-10    | 1.468441  | 34.0  | NaN      |
| ...| ...  | ...  | ...  | ...     | ...      | ... | ...  | ...     | ...       | ...   | ...      |
| 75 | 4    | 10   | 4-10 | Leaf    | NaN      | NaN | NaN  | NaN     | -0.010831 | 2.0   | NaN      |
| 76 | 4    | 11   | 4-11 | Leaf    | NaN      | NaN | NaN  | NaN     | 0.001604  | 120.0 | NaN      |
| 77 | 4    | 12   | 4-12 | Leaf    | NaN      | NaN | NaN  | NaN     | -0.000931 | 205.0 | NaN      |
| 78 | 4    | 13   | 4-13 | Leaf    | NaN      | NaN | NaN  | NaN     | 0.003282  | 25.0  | NaN      |
| 79 | 4    | 14   | 4-14 | Leaf    | NaN      | NaN | NaN  | NaN     | 0.007388  | 10.0  | NaN      |
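Besides the individual trees, you can also look at aggregated feature importances. The plot_importance function imported at the top of the guide works directly on a Booster; the snippet below is a sketch that again uses the dict_to_booster helper from above.

# Plot how often each feature is used for splitting across all trees.
booster = dict_to_booster(model)
rcParams['figure.figsize'] = 12, 8
plot_importance(booster)
plt.show()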
Error on client🔗
Passing a column name that does not exist in the datasets (note the misspelled chff below) makes the training fail on the clients, i.e. the Gateways in a remote run or the simulated sites in a local run. The failure is surfaced as a RuntimeError, which you can catch:

cols = [
    "afb", "age", "bmi", "chff"
]

try:
    model = fit_xgboost(
        num_rounds=10,
        target_col=target_col,
        cols=cols,
        eta=0.2,
        eval_metric="mae",
        max_depth=3,
        num_parallel_tree=5,
        session=local_session,
    )
except RuntimeError as e:
    print(e)
try:
    model = fit_xgboost(
        num_rounds=10,
        target_col=target_col,
        cols=cols,
        eta=0.2,
        eval_metric="mae",
        max_depth=3,
        num_parallel_tree=5,
        session=remote_session,
    )
except RuntimeError as e:
    print(e)
Preprocessing with FederatedDataFrame🔗
With the datasets parameter in fit_xgboost, you can specify a list of FederatedDataFrames. If no preprocessing is needed, this parameter can be None, which means the training runs over all datasets from the Compute Spec.
A FederatedDataFrame always refers to a dataset ID. In remote sessions, you can use the RemoteData object from apheris_auth to obtain all relevant information that is associated with a dataset ID. In a local session, the dataset IDs can be chosen arbitrarily, but you need to provide mappings to their local paths and to the Gateway IDs, which can also be named arbitrarily in a local session because they are only simulated.
To use the FederatedDataFrame in the same way in both remote and local sessions, the example below names the dataset IDs identically to the remote dataset IDs.
local_session = XGBoostLocalSession(
    dataset_mapping=filenames_per_client,
    ds2gw_dict={'whas1_gateway-1_org-1': 'site-1', 'whas2_gateway-2_org-2': 'site-2'},
    workspace="/tmp/xgboost",
    dataset_root=local_path,
)
Once the FederatedDataFrames are created, you can proceed with the preprocessing as usual. In the following, you'll use the FederatedDataFrame to filter the data to patients older than 72.
fdf1 = FederatedDataFrame("whas1_gateway-1_org-1")
fdf2 = FederatedDataFrame("whas2_gateway-2_org-2")
fdf1 = fdf1[fdf1["age"] > 72]
fdf2 = fdf2[fdf2["age"] > 72]
datasets = [fdf1, fdf2]
cols = [
"afb", "age", "bmi", "chf"
]
model = fit_xgboost(
datasets=datasets,
num_rounds=10,
target_col=target_col,
cols=cols,
eta=0.2,
eval_metric="mae",
max_depth=3,
num_parallel_tree=5,
session=local_session,
)
# remote training on the filtered data
model = fit_xgboost(
    datasets=datasets,
    num_rounds=10,
    target_col=target_col,
    cols=cols,
    eta=0.2,
    eval_metric="mae",
    max_depth=3,
    num_parallel_tree=5,
    session=remote_session,
)
Comparing the first trees of the model trained on filtered data, you can see that the age decision bounds reflect the new distribution.
visualize_model(model, num_tree=1)
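You can also compare the split thresholds numerically using the convert_model helper from the beginning of the guide; after filtering to patients older than 72, the thresholds of the age splits should reflect the restricted age range.

tree_df = convert_model(model)
# Show only the splits on the age feature and their thresholds.
print(tree_df[tree_df["Feature"] == "age"][["Tree", "Node", "Split"]])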
Summary🔗
In this tutorial, you have been introduced to the Apheris XGBoost model, and learned how to train some basic tree-based models.
If you'd like to learn more about how to interact with the Apheris environment, we recommend the Getting started with Apheris CLI guide. To find out more about how to use the Statistics package to analyse data in a secure and privacy-preserving way, check out our guide to Simulating and Running Statistics.