XGBoost Model Tutorial

This model contains a training workflow for a gradient boosting classifier based on the XGBoost library. XGBoost (Extreme Gradient Boosting) is a scalable machine learning library designed for structured/tabular data, known for its speed and strong results in predictive modeling competitions such as those on Kaggle. It builds an ensemble of decision trees using gradient boosting and supports parallel and distributed training. Federation is built into the XGBoost training architecture, and the model integrates tightly with NVIDIA FLARE through FLARE's XGBoost components.

Installation

First, please follow Quickstart: Installing the Apheris CLI to ensure you have the CLI and its dependencies installed. The Apheris XGBoost model can take FederatedDataFrames as input. Therefore, please also install the Apheris Statistics wheel, as described in Simulating and Running Statistics Workflows.

Finally, install the Apheris XGBoost wheel as below (replace x.y.z with the version number of your wheel file):

pip install apheris_xgboost-x.y.z-py3-none-any.whl
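After installation, a quick check that the packages can be imported may save a failed run later. The sketch below is not part of the Apheris tooling. It uses only the standard library and assumes the import names that appear in this guide (aphcli, apheris_stats, apheris_xgboost); the helper name missing_packages is made up for illustration.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Import names taken from the import statements used later in this guide.
required = ["aphcli", "apheris_stats", "apheris_xgboost"]
missing = missing_packages(required)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If a name is reported as missing, revisit the corresponding installation step before continuing.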

The Guide

import apheris_xgboost
import json
from pathlib import Path
from tempfile import TemporaryDirectory
import matplotlib.pyplot as plt
import matplotlib
from matplotlib.pylab import rcParams
import pandas as pd

from apheris_xgboost.api_client import fit_xgboost
from apheris_xgboost.histogram_v2 import apheris_fed_xgb_histogram_executor
from xgboost import plot_tree, plot_importance
from xgboost.core import Booster
from apheris_xgboost.session import XGBoostLocalSession, XGBoostRemoteSession

from apheris_stats.simple_stats.util import FederatedDataFrame
from apheris_auth.remote_data import RemoteData
from aphcli.api import compute as compute_api
from aphcli.api import job as jobapi
from aphcli import models, job, compute, datasets
from apheris_auth import login

from apheris_xgboost.secure_runtime.job import Job
from apheris_xgboost.secure_runtime import template


def visualize_model(model, num_tree):
    """Plot the tree with index num_tree from a model dictionary."""
    with TemporaryDirectory() as tmp_model_dir:
        # Write the model to a temporary JSON file so the Booster can load it.
        modelfile = Path(tmp_model_dir) / "model.json"
        modelfile.write_text(json.dumps(model))
        booster = Booster()
        booster.load_model(fname=str(modelfile))
        rcParams["figure.figsize"] = (80, 50)
        plot_tree(booster, num_trees=num_tree)


def convert_model(model: dict) -> pd.DataFrame:
    """Return the trees of a model dictionary as a pandas DataFrame."""
    with TemporaryDirectory() as tmp_model_dir:
        modelfile = Path(tmp_model_dir) / "model.json"
        modelfile.write_text(json.dumps(model))
        booster = Booster()
        booster.load_model(fname=str(modelfile))
        return booster.trees_to_dataframe()


# Set this to the ID of an existing Compute Spec to reuse it.
compute_spec_id = None

First, create a Compute Spec using the Apheris CLI (see the Apheris CLI documentation for more information):

# compute_spec_id must hold the ID of the Compute Spec for the steps that follow.
if compute_spec_id is None:
    compute.create(
        dataset_ids="whas1_gateway-1_org-1,whas2_gateway-2_org-2",
        model_id='apheris-xgboost',
        model_version="0.6.0",
        client_memory=1024,
        client_n_cpu=1,
        client_n_gpu=0,
        server_memory=1024,
        server_n_cpu=1,
        server_n_gpu=0,
    )

Next, check whether the Compute Spec is running, and activate it if it is not:

if compute_api.get_activation_status(compute_spec_id) != 'running':
    compute.activate(compute_spec_id)
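Activation may take some time, so it can be useful to wait until the status is 'running' before creating a session. The following is a minimal, generic polling sketch rather than an Apheris API: wait_for_status is a hypothetical helper, and the status sequence used in the demo (including the name "creating") is invented so that the block runs on its own.

```python
import time


def wait_for_status(get_status, target="running", timeout=300.0, interval=5.0):
    """Poll get_status() until it returns target or the timeout elapses.

    Returns True if the target status was reached, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_status() == target:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for the real status call so the sketch runs on its own.
# The intermediate status name "creating" is invented for this demo.
_fake_statuses = iter(["creating", "creating", "running"])
reached = wait_for_status(lambda: next(_fake_statuses), timeout=5.0, interval=0.01)
print(reached)
```

In the tutorial setting, the callable would wrap the status call used above, for example lambda: compute_api.get_activation_status(compute_spec_id).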

Create XGBoostSession

For test runs in simulator mode, you can use a local XGBoostLocalSession, which is created with a mapping of client IDs to dataset filenames. The local path to the datasets is set with the dataset_root parameter; in this example, the local data is in a data folder in the home directory. The remote session, XGBoostRemoteSession, is created with a Compute Spec ID and a list of dataset IDs.

# insert your local file path here
local_path = f"{Path.home()}/data"
filenames_per_client = {
    "site-1": ["whas1-data.csv"],
    "site-2": ["whas2-data.csv"],
}
local_session = XGBoostLocalSession(
    dataset_mapping=filenames_per_client,
    workspace="/tmp/xgboost",
    dataset_root=local_path,
)
remote_session = XGBoostRemoteSession(
    compute_spec_id,
    dataset_ids=["whas1_gateway-1_org-1", "whas2_gateway-2_org-2"],
)

Run Model Training via Python API

To start the training with configurable XGBoost parameters, use the API function fit_xgboost. The session parameter controls whether a local simulator run or a remote run is triggered. The configurable XGBoost parameters are:

  • num_rounds
  • eta
  • max_depth
  • objective
  • eval_metric
  • num_parallel_tree
  • enable_categorical
  • tree_method
  • nthread

The number of trees in the final model is the product of num_rounds and num_parallel_tree. For further information about XGBoost parameters, see the XGBoost documentation.
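As a small worked example of this product rule, the sketch below computes the expected tree count and compares it with the number of trees stored in a model dictionary. The helper names are made up for illustration, and toy_model is a hypothetical stand-in that only mimics the nesting of the model dictionary shown later in this guide.

```python
def expected_num_trees(num_rounds, num_parallel_tree=1):
    """Number of trees in the final model: one forest per boosting round."""
    return num_rounds * num_parallel_tree


def actual_num_trees(model):
    """Count the trees stored in a model dictionary in XGBoost's JSON layout."""
    return len(model["learner"]["gradient_booster"]["model"]["trees"])


# Hypothetical stand-in with the same nesting as a trained model; the tree
# contents are left empty because only the count matters here.
toy_model = {"learner": {"gradient_booster": {"model": {"trees": [{}] * 5}}}}

print(expected_num_trees(num_rounds=1, num_parallel_tree=5))   # 5
print(expected_num_trees(num_rounds=10, num_parallel_tree=5))  # 50
print(actual_num_trees(toy_model))                              # 5
```

With num_rounds=1 and num_parallel_tree=5, as in the first training call of this guide, the result is 5, which matches the 'num_trees': '5' entry in the model output.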

target_col = "fstat"
cols = ["afb", "age", "bmi", "chf"]

# local training
model = fit_xgboost(
    num_rounds=1,
    target_col=target_col,
    cols=cols,
    eta=0.2,
    eval_metric="mae",
    max_depth=3,
    num_parallel_tree=5,
    session=local_session,
)

model
# remote training
model = fit_xgboost(
    num_rounds=1,
    target_col=target_col,
    cols=cols,
    eta=0.2,
    eval_metric="mae",
    max_depth=3,
    num_parallel_tree=5,
    session=remote_session,
)
model
    {'learner': {'attributes': {'best_iteration': '1',
       'best_score': '0.3502602116332028'},
      'feature_names': ['afb', 'age', 'bmi', 'chf'],
      'feature_types': ['float', 'float', 'float', 'float'],
      'gradient_booster': {'model': {'gbtree_model_param': {'num_parallel_tree': '5',
         'num_trees': '5'},
        'iteration_indptr': [0, 5],
        'tree_info': [0,
         0,
         0,
         0,
         0,],
        'trees': [{'base_weights': [
            6.802704e-09,
            -0.09351389,
            ...
           0.0054957066],
          'categories': [],
          'categories_nodes': [],
          'categories_segments': [],
          'categories_sizes': [],
          'default_left': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
          'id': 0,
          'left_children': [1, 3, 5, 7, 9, 11, 13, -1, -1, -1, -1, -1, -1, -1, -1],
          'loss_changes': [ 2.6578097, 1.9690572, ... 0.0],
          'parents': ...,
          'right_children': ...,
          'split_conditions': ...,
          'split_indices': ...,
          'split_type': ...,
          'sum_hessian': ...,
          'tree_param': {'num_deleted': '0',
           'num_feature': '4',
           'num_nodes': '15',
           'size_leaf_vector': '1'}},
        ...
      ]},
       'name': 'gbtree'},
      'learner_model_param': {'base_score': '6.557377E-1',
       'boost_from_average': '1',
       'num_class': '0',
       'num_feature': '4',
       'num_target': '1'},
      'objective': {'name': 'reg:squarederror',
       'reg_loss_param': {'scale_pos_weight': '1'}}},
     'version': [2, 1, 1]}

Inspect Trees from resulting Model

visualize_model(model, num_tree=1)

[Figure: plot of the selected tree]

convert_model(model)
     Tree  Node    ID  Feature     Split  Yes    No  Missing       Gain  Cover  Category
  0     0     0   0-0      age  72.00000  0-1   0-2      0-2  19.258240  384.0       NaN
  1     0     1   0-1      chf   1.00000  0-3   0-4      0-4   5.252416  183.0       NaN
  2     0     2   0-2      age  86.00000  0-5   0-6      0-6   3.807554  201.0       NaN
  3     0     3   0-3      bmi  23.40377  0-7   0-8      0-8   0.895349  149.0       NaN
  4     0     4   0-4      age  68.00000  0-9  0-10     0-10   1.468441   34.0       NaN
...   ...   ...   ...      ...       ...  ...   ...      ...        ...    ...       ...
 75     4    10  4-10     Leaf       NaN  NaN   NaN      NaN  -0.010831    2.0       NaN
 76     4    11  4-11     Leaf       NaN  NaN   NaN      NaN   0.001604  120.0       NaN
 77     4    12  4-12     Leaf       NaN  NaN   NaN      NaN  -0.000931  205.0       NaN
 78     4    13  4-13     Leaf       NaN  NaN   NaN      NaN   0.003282   25.0       NaN
 79     4    14  4-14     Leaf       NaN  NaN   NaN      NaN   0.007388   10.0       NaN
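The tree table can be summarized further, for example by adding up the gain of all splits per feature. The sketch below does this in plain Python on a handful of records copied from the output above (five split rows and one leaf row); gain_per_feature is a hypothetical helper. In the output of trees_to_dataframe, the Gain column of a Leaf row holds the leaf value rather than a gain, so leaf rows are skipped. With a real DataFrame, convert_model(model).to_dict("records") would yield records of this shape.

```python
from collections import defaultdict


def gain_per_feature(records):
    """Sum the split gain per feature, ignoring leaf rows."""
    totals = defaultdict(float)
    for row in records:
        if row["Feature"] == "Leaf":
            continue  # for leaves, "Gain" stores the leaf value, not a gain
        totals[row["Feature"]] += row["Gain"]
    return dict(totals)


# Rows copied from the table above, reduced to the two columns needed here.
records = [
    {"Feature": "age", "Gain": 19.258240},
    {"Feature": "chf", "Gain": 5.252416},
    {"Feature": "age", "Gain": 3.807554},
    {"Feature": "bmi", "Gain": 0.895349},
    {"Feature": "age", "Gain": 1.468441},
    {"Feature": "Leaf", "Gain": -0.010831},
]

# Print features from highest to lowest total gain.
for feature, gain in sorted(gain_per_feature(records).items(), key=lambda kv: -kv[1]):
    print(f"{feature}: {gain:.3f}")
```

Because only a few rows are included, the totals illustrate the calculation and are not the importances of the full model.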

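Error on client

# "chff" is a misspelled column name (the data uses "chf"), so training is expected to fail on the clients.
cols = ["afb", "age", "bmi", "chff"]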

# local run: a failure on a client is expected to surface as a RuntimeError
try:
    model = fit_xgboost(
        num_rounds=10,
        target_col=target_col,
        cols=cols,
        eta=0.2,
        eval_metric="mae",
        max_depth=3,
        num_parallel_tree=5,
        session=local_session,
    )
except RuntimeError as e:
    print(e)

# remote run
try:
    model = fit_xgboost(
        num_rounds=10,
        target_col=target_col,
        cols=cols,
        eta=0.2,
        eval_metric="mae",
        max_depth=3,
        num_parallel_tree=5,
        session=remote_session,
    )
except RuntimeError as e:
    print(e)
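For local simulator runs, a misspelled column can also be caught before training starts by comparing the requested columns with the header of the CSV file. This is a minimal sketch under the assumption that the local datasets are CSV files with a header row; missing_columns is a hypothetical helper, not part of apheris_xgboost, and the approach does not apply to remote data, which is not directly readable.

```python
import csv
import tempfile
from pathlib import Path


def missing_columns(csv_path, requested):
    """Return the requested column names that are absent from the CSV header."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    return [col for col in requested if col not in header]


# Tiny made-up file so the sketch runs on its own; real data would live in
# the dataset_root folder of the local session.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.csv"
    sample.write_text("afb,age,bmi,chf,fstat\n0,72,25.1,0,1\n")
    unknown = missing_columns(sample, ["afb", "age", "bmi", "chff"])

print(unknown)  # ['chff']
```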

Preprocessing with FederatedDataFrame

With the datasets parameter in fit_xgboost, you can specify a list of FederatedDataFrames. If no preprocessing is needed, this parameter can be None, in which case the training runs over all datasets from the Compute Spec.

A FederatedDataFrame always refers to a dataset ID. In remote sessions, you can use the RemoteData object from apheris_auth to obtain all relevant information that is associated with a dataset ID. In a local session, the dataset IDs can be chosen arbitrarily, but you need to provide mappings to their local paths and to the Gateway IDs, which can also be named arbitrarily in a local session as they are only simulated.

To use the FederatedDataFrame in the same way in both remote and local sessions, the example below names the dataset IDs identically to the remote dataset IDs.

local_session = XGBoostLocalSession(
    dataset_mapping=filenames_per_client,
    ds2gw_dict={'whas1_gateway-1_org-1': 'site-1', 'whas2_gateway-2_org-2': 'site-2'},
    workspace="/tmp/xgboost",
    dataset_root=local_path,
)
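Judging from the example above, the values of ds2gw_dict (the simulated Gateway names) correspond to the keys of dataset_mapping. Whether XGBoostLocalSession enforces this is not stated here, so the sketch below shows a simple consistency check that could be run beforehand. The helper unmapped_gateways is hypothetical, and the dataset name some_dataset and the Gateway name site-3 are invented for the second call.

```python
def unmapped_gateways(dataset_mapping, ds2gw_dict):
    """Return gateway names used in ds2gw_dict that have no files in dataset_mapping."""
    return sorted(set(ds2gw_dict.values()) - set(dataset_mapping))


filenames_per_client = {
    "site-1": ["whas1-data.csv"],
    "site-2": ["whas2-data.csv"],
}
ds2gw = {"whas1_gateway-1_org-1": "site-1", "whas2_gateway-2_org-2": "site-2"}

print(unmapped_gateways(filenames_per_client, ds2gw))  # []
print(unmapped_gateways(filenames_per_client, {"some_dataset": "site-3"}))  # ['site-3']
```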

Once the FederatedDataFrames are created, you can proceed with the preprocessing as usual. In the following, you'll use the FederatedDataFrame to filter the data to patients older than 72.

fdf1 = FederatedDataFrame("whas1_gateway-1_org-1")
fdf2 = FederatedDataFrame("whas2_gateway-2_org-2")

fdf1 = fdf1[fdf1["age"] > 72]
fdf2 = fdf2[fdf2["age"] > 72]
datasets = [fdf1, fdf2]

cols = ["afb", "age", "bmi", "chf"]

# local training
model = fit_xgboost(
    datasets=datasets,
    num_rounds=10,
    target_col=target_col,
    cols=cols,
    eta=0.2,
    eval_metric="mae",
    max_depth=3,
    num_parallel_tree=5,
    session=local_session,
)

# remote training
model = fit_xgboost(
    datasets=datasets,
    num_rounds=10,
    target_col=target_col,
    cols=cols,
    eta=0.2,
    eval_metric="mae",
    max_depth=3,
    num_parallel_tree=5,
    session=remote_session,
)

Comparing the first trees of the models trained on the unfiltered and the filtered data, you can see that the age split thresholds reflect the new distribution.

visualize_model(model, num_tree=1)

[Figure: plot of the selected tree of the model trained on the filtered data]
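Besides reading the thresholds off the plot, they can be collected directly from the model dictionary. The sketch below assumes the XGBoost JSON layout shown in the model output earlier in this guide (feature_names, left_children, split_indices, split_conditions); split_thresholds is a hypothetical helper and toy_model is a made-up miniature model used only to keep the block self-contained.

```python
def split_thresholds(model, feature):
    """Collect the split thresholds used for one feature across all trees."""
    learner = model["learner"]
    feature_index = learner["feature_names"].index(feature)
    thresholds = []
    for tree in learner["gradient_booster"]["model"]["trees"]:
        nodes = zip(tree["left_children"], tree["split_indices"], tree["split_conditions"])
        for left_child, split_index, condition in nodes:
            # Leaves have no children (-1); their split_conditions entry
            # stores the leaf value, so they are skipped.
            if left_child != -1 and split_index == feature_index:
                thresholds.append(condition)
    return sorted(thresholds)


# Made-up miniature model with one tree: the root splits on age at 72,
# its left child splits on chf at 1, and the remaining nodes are leaves.
toy_model = {
    "learner": {
        "feature_names": ["afb", "age", "bmi", "chf"],
        "gradient_booster": {"model": {"trees": [{
            "left_children": [1, 3, -1, -1, -1],
            "split_indices": [1, 3, 0, 0, 0],
            "split_conditions": [72.0, 1.0, 0.01, -0.02, 0.03],
        }]}},
    }
}

print(split_thresholds(toy_model, "age"))  # [72.0]
```

Calling split_thresholds(model, "age") on the models trained with and without the filter would give two lists of thresholds that can be compared directly.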

Summary

In this tutorial, you have been introduced to the Apheris XGBoost model and learned how to train some basic tree-based models.

If you'd like to learn more about how to interact with the Apheris environment, we recommend the Getting started with Apheris CLI guide. To find out more about how to use the Statistics package to analyse data in a secure and privacy-preserving way, check out our guide to Simulating and Running Statistics.