Using fairlib with Text inputs

Welcome to the fairlib interactive tutorial

In this tutorial, we will:

  • Show how to install fairlib and prepare a preprocessed sentiment analysis dataset.

  • Show how to train a model with or without debiasing.

  • Show how to analyze the results, including creating tables and figures.

  • Show how to run experiments over customized datasets.

1. Installation

!pip install fairlib
Collecting fairlib
  Downloading fairlib-0.0.3-py3-none-any.whl (63 kB)
     |████████████████████████████████| 63 kB 1.2 MB/s 
Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
     |████████████████████████████████| 4.0 MB 14.6 MB/s 
Requirement already satisfied: PyYAML in /usr/local/lib/python3.7/dist-packages (from fairlib) (3.13)
Collecting pickle5
  Downloading pickle5-0.0.12-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (256 kB)
     |████████████████████████████████| 256 kB 36.9 MB/s 
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from fairlib) (1.21.6)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (from fairlib) (3.2.2)
Requirement already satisfied: seaborn in /usr/local/lib/python3.7/dist-packages (from fairlib) (0.11.2)
Requirement already satisfied: docopt in /usr/local/lib/python3.7/dist-packages (from fairlib) (0.6.2)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from fairlib) (4.64.0)
Requirement already satisfied: torch in /usr/local/lib/python3.7/dist-packages (from fairlib) (1.11.0+cu113)
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from fairlib) (1.3.5)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (from fairlib) (1.0.2)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->fairlib) (2.8.2)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->fairlib) (1.4.2)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib->fairlib) (0.11.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->fairlib) (3.0.8)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from kiwisolver>=1.0.1->matplotlib->fairlib) (4.2.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.1->matplotlib->fairlib) (1.15.0)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas->fairlib) (2022.1)
Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->fairlib) (1.4.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->fairlib) (3.1.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->fairlib) (1.1.0)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from transformers->fairlib) (21.3)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
     |████████████████████████████████| 6.6 MB 44.3 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
     |████████████████████████████████| 77 kB 5.2 MB/s 
Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from transformers->fairlib) (3.6.0)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.7/dist-packages (from transformers->fairlib) (2019.12.20)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
     |████████████████████████████████| 895 kB 18.7 MB/s 
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from transformers->fairlib) (4.11.3)
Collecting PyYAML
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
     |████████████████████████████████| 596 kB 38.3 MB/s 
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from transformers->fairlib) (2.23.0)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->transformers->fairlib) (3.8.0)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->transformers->fairlib) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->transformers->fairlib) (2021.10.8)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->transformers->fairlib) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->transformers->fairlib) (3.0.4)
Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers->fairlib) (7.1.2)
Installing collected packages: PyYAML, tokenizers, sacremoses, huggingface-hub, transformers, pickle5, fairlib
  Attempting uninstall: PyYAML
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
Successfully installed PyYAML-6.0 fairlib-0.0.3 huggingface-hub-0.5.1 pickle5-0.0.12 sacremoses-0.0.49 tokenizers-0.12.1 transformers-4.18.0
import fairlib

2. Prepare Dataset

In this notebook, we will be using the Moji dataset, where each tweet is annotated with a binary sentiment label (happy versus sad) and a binary race label (AAE versus SAE). The following are random examples from the Moji dataset.

| Text | Sentiment | Race |
| --- | --- | --- |
| Dfl somebody said to me yesterday that how can u u have a iPhone or an S3 an ur phone off dfl | Positive | AAE |
| smh I bet maybe u just don’t care bout poor boo no more | Negative | AAE |
| I actually put jeans on today and I already wanna go put on leggings or yogas | Positive | SAE |
| I’m sitting next to the most awkward couple on the plane like they are making out and holding hands , I just can’t | Negative | SAE |

For simplicity, here we directly use the encoded Moji dataset provided by Ravfogel et al. (2020). The original tweets are encoded with the pre-trained DeepMoji model as 2304-dimensional vectors and grouped by target class and race label. The following cell creates a data directory and downloads the pre-processed data into it.

!mkdir -p data/deepmoji
!wget 'https://storage.googleapis.com/ai2i/nullspace/deepmoji/pos_pos.npy' -P 'data/deepmoji'
!wget 'https://storage.googleapis.com/ai2i/nullspace/deepmoji/pos_neg.npy' -P 'data/deepmoji'
!wget 'https://storage.googleapis.com/ai2i/nullspace/deepmoji/neg_pos.npy' -P 'data/deepmoji'
!wget 'https://storage.googleapis.com/ai2i/nullspace/deepmoji/neg_neg.npy' -P 'data/deepmoji'
--2022-05-02 02:46:21--  https://storage.googleapis.com/ai2i/nullspace/deepmoji/pos_pos.npy
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.135.128, 74.125.142.128, 74.125.195.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.135.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 405494864 (387M) [application/octet-stream]
Saving to: ‘data/deepmoji/pos_pos.npy’

pos_pos.npy         100%[===================>] 386.71M  5.59MB/s    in 67s     

2022-05-02 02:47:29 (5.77 MB/s) - ‘data/deepmoji/pos_pos.npy’ saved [405494864/405494864]

--2022-05-02 02:47:29--  https://storage.googleapis.com/ai2i/nullspace/deepmoji/pos_neg.npy
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.142.128, 74.125.195.128, 2607:f8b0:400e:c0d::80, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.142.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 405504080 (387M) [application/octet-stream]
Saving to: ‘data/deepmoji/pos_neg.npy’

pos_neg.npy         100%[===================>] 386.72M  6.03MB/s    in 61s     

2022-05-02 02:48:31 (6.34 MB/s) - ‘data/deepmoji/pos_neg.npy’ saved [405504080/405504080]

--2022-05-02 02:48:31--  https://storage.googleapis.com/ai2i/nullspace/deepmoji/neg_pos.npy
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.195.128, 173.194.202.128, 74.125.20.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.195.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 405494864 (387M) [application/octet-stream]
Saving to: ‘data/deepmoji/neg_pos.npy’

neg_pos.npy         100%[===================>] 386.71M  5.79MB/s    in 67s     

2022-05-02 02:49:40 (5.75 MB/s) - ‘data/deepmoji/neg_pos.npy’ saved [405494864/405494864]

--2022-05-02 02:49:40--  https://storage.googleapis.com/ai2i/nullspace/deepmoji/neg_neg.npy
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.195.128, 173.194.202.128, 74.125.20.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.195.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 405504080 (387M) [application/octet-stream]
Saving to: ‘data/deepmoji/neg_neg.npy’

neg_neg.npy         100%[===================>] 386.72M  6.88MB/s    in 59s     

2022-05-02 02:50:40 (6.55 MB/s) - ‘data/deepmoji/neg_neg.npy’ saved [405504080/405504080]

We split the dataset into the train, dev, and test sets.

fairlib.utils.seed_everything(2022)
import numpy as np
import os

def read_data_file(input_file: str):
    # Load the encoded vectors for one (sentiment, race) group and shuffle them.
    vecs = np.load(input_file)
    np.random.shuffle(vecs)

    # First 40k instances for training, next 2k for dev, next 2k for test.
    return vecs[:40000], vecs[40000:42000], vecs[42000:44000]

in_dir = "data/deepmoji"
out_dir = "data/deepmoji"

os.makedirs(out_dir, exist_ok=True)

for split in ['pos_pos', 'pos_neg', 'neg_pos', 'neg_neg']:
    train, dev, test = read_data_file(in_dir + '/' + split + '.npy')
    for split_dir, data in zip(['train', 'dev', 'test'], [train, dev, test]):
        os.makedirs(out_dir + '/' + split_dir, exist_ok=True)
        np.save(out_dir + '/' + split_dir + '/' + split + '.npy', data)
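
As an optional sanity check (not part of the original tutorial), we can confirm that the saved shards contain 2304-dimensional vectors with the expected numbers of rows:

import numpy as np

# Each group file was split above into 40k/2k/2k train/dev/test rows.
print(np.load("data/deepmoji/train/pos_pos.npy").shape)  # expected: (40000, 2304)
print(np.load("data/deepmoji/dev/pos_pos.npy").shape)    # expected: (2000, 2304)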

3. Standard Usage

So far, we have installed fairlib and prepared the dataset for training. Now let’s take a look at an example of training a standard sentiment analysis model naively, without debiasing.

Before moving to the training, we first define a list of hyperparameters that will be repeatedly used in this tutorial.

Shared_options = {
    # The name of the dataset, corresponding dataloader will be used,
    "dataset":  "Moji",

    # Specify the path to the input data
    "data_dir": "data/deepmoji",

    # Device for computing, -1 is the cpu; non-negative numbers indicate GPU id.
    "device_id":    -1,

    # The default path for saving experimental results
    "results_dir":  r"results",

    # Will be used for saving experimental results
    "project_dir":  r"dev",

    # We will focus on the TPR GAP, which implies Equalized Odds for binary classification.
    "GAP_metric_name":  "TPR_GAP",

    # The overall performance will be measured as accuracy
    "Performance_metric_name":  "accuracy",

    # Model selection is based on distance to optimum (DTO); see Section 4 of our paper for more details
    "selection_criterion":  "DTO",

    # Default dirs for saving checkpoints
    "checkpoint_dir":   "models",
    "checkpoint_name":  "checkpoint_epoch",

    # Number of parallel jobs for loading experimental results
    "n_jobs":   1,
}
!rm -rf results

Without an explicitly specified debiasing approach, fairlib by default trains and evaluates a binary MLP classifier. As a result, we only need to define:

  1. Path to the dataset.

  2. Dataset name, which will be used to initialize built-in dataloaders.

  3. Experiment id, which identifies the current experiment; experimental results with the same exp_id will be saved in the same directory.

args = {
    "dataset":Shared_options["dataset"], 
    "data_dir":Shared_options["data_dir"],
    "device_id":Shared_options["device_id"],

    # Give a name to the exp, which will be used in the path
    "exp_id":"vanilla",
}

# Init the argument
options = fairlib.BaseOptions()
state = options.get_state(args=args, silence=True)
INFO:root:Unexpected args: ['-f', '/root/.local/share/jupyter/runtime/kernel-94fdc7f5-0523-4a20-889a-536d5b502976.json']
INFO:root:Logging to ./results/dev/Moji/vanilla/output.log


2022-05-02 02:50:46 [INFO ]  ======================================== 2022-05-02 02:50:46 ========================================
2022-05-02 02:50:46 [INFO ]  Base directory is ./results/dev/Moji/vanilla
Loaded data shapes: (99998, 2304), (99998,), (99998,)
Loaded data shapes: (8000, 2304), (8000,), (8000,)
Loaded data shapes: (7998, 2304), (7998,), (7998,)

Train a model without explicit debiasing

Given the 2304d encoded text representations, the default model in fairlib is a 3-layer MLP classifier with Tanh activation functions in between.

To customize the MLP architecture, we can specify the hyperparameters in the state as follows:

state.hidden_size = 300
state.n_hidden = 2
state.activation_function = "Tanh"

Please see the model architecture section for more details about these hyperparameters.

fairlib.utils.seed_everything(2022)

# Init Model
model = fairlib.networks.get_main_model(state)
2022-05-02 02:50:47 [INFO ]  MLP( 
2022-05-02 02:50:47 [INFO ]    (output_layer): Linear(in_features=300, out_features=2, bias=True)
2022-05-02 02:50:47 [INFO ]    (AF): Tanh()
2022-05-02 02:50:47 [INFO ]    (hidden_layers): ModuleList(
2022-05-02 02:50:47 [INFO ]      (0): Linear(in_features=2304, out_features=300, bias=True)
2022-05-02 02:50:47 [INFO ]      (1): Tanh()
2022-05-02 02:50:47 [INFO ]      (2): Linear(in_features=300, out_features=300, bias=True)
2022-05-02 02:50:47 [INFO ]      (3): Tanh()
2022-05-02 02:50:47 [INFO ]    )
2022-05-02 02:50:47 [INFO ]    (criterion): CrossEntropyLoss()
2022-05-02 02:50:47 [INFO ]  )
2022-05-02 02:50:47 [INFO ]  Total number of parameters: 782402 

A list of hyperparameters has been predefined in fairlib, so we can now directly train a model with the model class’s built-in train_self method.

Please see the link for all hyperparameters associated with model training.

model.train_self()
2022-05-02 02:50:48 [INFO ]  Epoch:    0 [      0/  99998 ( 0%)]	Loss: 0.6906	 Data Time: 0.02s	Train Time: 0.20s
2022-05-02 02:50:51 [INFO ]  Epoch:    0 [  51200/  99998 (51%)]	Loss: 0.3926	 Data Time: 0.37s	Train Time: 3.35s
2022-05-02 02:50:56 [INFO ]  Evaluation at Epoch 0
2022-05-02 02:50:56 [INFO ]  Validation accuracy: 72.55	macro_fscore: 72.44	micro_fscore: 72.55	TPR_GAP: 40.07	FPR_GAP: 40.07	PPR_GAP: 39.10	
2022-05-02 02:50:56 [INFO ]  Test accuracy: 71.41	macro_fscore: 71.30	micro_fscore: 71.41	TPR_GAP: 39.01	FPR_GAP: 39.01	PPR_GAP: 37.84	
2022-05-02 02:50:56 [INFO ]  Epoch:    1 [      0/  99998 ( 0%)]	Loss: 0.4105	 Data Time: 0.02s	Train Time: 0.06s
2022-05-02 02:50:59 [INFO ]  Epoch:    1 [  51200/  99998 (51%)]	Loss: 0.4156	 Data Time: 0.39s	Train Time: 3.32s
2022-05-02 02:51:03 [INFO ]  Evaluation at Epoch 1
2022-05-02 02:51:03 [INFO ]  Validation accuracy: 72.36	macro_fscore: 72.32	micro_fscore: 72.36	TPR_GAP: 39.81	FPR_GAP: 39.81	PPR_GAP: 39.27	
2022-05-02 02:51:03 [INFO ]  Test accuracy: 71.01	macro_fscore: 70.98	micro_fscore: 71.01	TPR_GAP: 39.40	FPR_GAP: 39.40	PPR_GAP: 38.64	
2022-05-02 02:51:03 [INFO ]  Epoch:    2 [      0/  99998 ( 0%)]	Loss: 0.3433	 Data Time: 0.01s	Train Time: 0.06s
2022-05-02 02:51:07 [INFO ]  Epoch:    2 [  51200/  99998 (51%)]	Loss: 0.3734	 Data Time: 0.37s	Train Time: 3.31s
2022-05-02 02:51:11 [INFO ]  Epochs since last improvement: 1
2022-05-02 02:51:11 [INFO ]  Evaluation at Epoch 2
2022-05-02 02:51:11 [INFO ]  Validation accuracy: 72.42	macro_fscore: 72.37	micro_fscore: 72.42	TPR_GAP: 40.91	FPR_GAP: 40.91	PPR_GAP: 40.20	
2022-05-02 02:51:11 [INFO ]  Test accuracy: 70.98	macro_fscore: 70.93	micro_fscore: 70.98	TPR_GAP: 40.21	FPR_GAP: 40.21	PPR_GAP: 39.39	
2022-05-02 02:51:11 [INFO ]  Epoch:    3 [      0/  99998 ( 0%)]	Loss: 0.3773	 Data Time: 0.02s	Train Time: 0.06s
2022-05-02 02:51:15 [INFO ]  Epoch:    3 [  51200/  99998 (51%)]	Loss: 0.3479	 Data Time: 0.37s	Train Time: 3.28s
2022-05-02 02:51:19 [INFO ]  Epochs since last improvement: 2
2022-05-02 02:51:19 [INFO ]  Evaluation at Epoch 3
2022-05-02 02:51:19 [INFO ]  Validation accuracy: 72.09	macro_fscore: 71.92	micro_fscore: 72.09	TPR_GAP: 41.54	FPR_GAP: 41.54	PPR_GAP: 40.17	
2022-05-02 02:51:19 [INFO ]  Test accuracy: 71.17	macro_fscore: 71.02	micro_fscore: 71.17	TPR_GAP: 40.32	FPR_GAP: 40.32	PPR_GAP: 38.96	
2022-05-02 02:51:19 [INFO ]  Epoch:    4 [      0/  99998 ( 0%)]	Loss: 0.3839	 Data Time: 0.01s	Train Time: 0.06s
2022-05-02 02:51:23 [INFO ]  Epoch:    4 [  51200/  99998 (51%)]	Loss: 0.3499	 Data Time: 0.39s	Train Time: 3.30s
2022-05-02 02:51:26 [INFO ]  Epochs since last improvement: 3
2022-05-02 02:51:27 [INFO ]  Evaluation at Epoch 4
2022-05-02 02:51:27 [INFO ]  Validation accuracy: 71.50	macro_fscore: 71.43	micro_fscore: 71.50	TPR_GAP: 42.76	FPR_GAP: 42.76	PPR_GAP: 42.00	
2022-05-02 02:51:27 [INFO ]  Test accuracy: 70.49	macro_fscore: 70.43	micro_fscore: 70.49	TPR_GAP: 41.37	FPR_GAP: 41.37	PPR_GAP: 40.51	
2022-05-02 02:51:27 [INFO ]  Epoch:    5 [      0/  99998 ( 0%)]	Loss: 0.3746	 Data Time: 0.01s	Train Time: 0.07s
2022-05-02 02:51:31 [INFO ]  Epoch:    5 [  51200/  99998 (51%)]	Loss: 0.3748	 Data Time: 0.38s	Train Time: 3.30s
2022-05-02 02:51:34 [INFO ]  Epochs since last improvement: 4
2022-05-02 02:51:35 [INFO ]  Evaluation at Epoch 5
2022-05-02 02:51:35 [INFO ]  Validation accuracy: 72.67	macro_fscore: 72.60	micro_fscore: 72.67	TPR_GAP: 39.17	FPR_GAP: 39.17	PPR_GAP: 38.35	
2022-05-02 02:51:35 [INFO ]  Test accuracy: 71.69	macro_fscore: 71.62	micro_fscore: 71.69	TPR_GAP: 37.97	FPR_GAP: 37.97	PPR_GAP: 36.91	
2022-05-02 02:51:35 [INFO ]  Epoch:    6 [      0/  99998 ( 0%)]	Loss: 0.3624	 Data Time: 0.01s	Train Time: 0.07s
2022-05-02 02:51:38 [INFO ]  Epoch:    6 [  51200/  99998 (51%)]	Loss: 0.3529	 Data Time: 0.37s	Train Time: 3.28s
2022-05-02 02:51:42 [INFO ]  Epochs since last improvement: 5
2022-05-02 02:51:42 [INFO ]  Evaluation at Epoch 6
2022-05-02 02:51:42 [INFO ]  Validation accuracy: 72.70	macro_fscore: 72.62	micro_fscore: 72.70	TPR_GAP: 38.29	FPR_GAP: 38.29	PPR_GAP: 37.50	
2022-05-02 02:51:42 [INFO ]  Test accuracy: 71.76	macro_fscore: 71.70	micro_fscore: 71.76	TPR_GAP: 37.59	FPR_GAP: 37.59	PPR_GAP: 36.79	

After each iteration (epoch), evaluation results over the validation set and test set will be printed, including metrics for both performance and fairness.

  • Performance metrics: accuracy, macro F1 score, and micro F1 score

  • Bias metrics: RMS-aggregated TPR GAP, RMS-aggregated FPR GAP, and RMS-aggregated PPR GAP.

Briefly, these bias metrics measure how differently the protected groups are treated by the model. For example, the TPR GAP measures the difference in True Positive Rate between AAE and SAE tweets. All three bias metrics should be 0 for a perfectly fair model.

Moreover, GAP metrics can be aligned with well-known fairness criteria: the TPR GAP corresponds to Equal Opportunity, while the TPR GAP and FPR GAP together capture the Equalized Odds criterion. Please refer to Barocas et al. (2019) for more details.

It can be seen that the naively trained model achieves around 72% accuracy but a TPR GAP of around 39%, which is far from ideal.
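
To make the aggregation concrete, below is a minimal sketch (not fairlib's implementation) of an RMS-aggregated TPR GAP for two protected groups, computed directly from labels and predictions:

import numpy as np

def rms_tpr_gap(y_true, y_pred, protected):
    # For each target class, compute the per-group True Positive Rate (recall),
    # take the gap between the two groups, and aggregate the gaps by root mean square.
    gaps = []
    for c in np.unique(y_true):
        group_tprs = []
        for g in np.unique(protected):
            mask = (y_true == c) & (protected == g)
            group_tprs.append((y_pred[mask] == c).mean())
        gaps.append(group_tprs[0] - group_tprs[1])
    return float(np.sqrt(np.mean(np.square(gaps))))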

Bias mitigation through balanced training and adversarial training

To mitigate bias in sentiment analysis, we show an example of employing balanced training and adversarial training simultaneously.

For balanced training, we resample each group of instances with different probabilities, corresponding to the Equal Opportunity fairness criterion (Han et al. 2021).
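
To give an intuition for what Equal Opportunity resampling could look like, here is an illustrative sketch (fairlib implements balanced training internally via the BT and BTObj options; the helper below is an assumption, not fairlib's code). Within each sentiment class, every protected group is downsampled to the size of the smallest group, so that the protected attribute is uniformly distributed per class:

import numpy as np

def eo_resample_indices(y, g, seed=2022):
    # Within each class y == c, downsample every protected group to the size of the
    # smallest group in that class, so that p(g | y) becomes uniform.
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(y):
        group_indices = [np.flatnonzero((y == c) & (g == gi)) for gi in np.unique(g)]
        n = min(len(idx) for idx in group_indices)
        for idx in group_indices:
            keep.append(rng.choice(idx, size=n, replace=False))
    return np.concatenate(keep)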

Adversarial training (Li et al. 2018) is applied at training time: an extra adversary component is trained to identify the protected label (AAE versus SAE in this tutorial) from the intermediate representations of the sentiment analysis model. The sentiment analysis model is in turn trained to fool the adversary, i.e., to remove race information from its intermediate representations, and thus make fairer predictions.
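
The adversary printed later in this section contains a GradientReversal module. The core trick, sketched generically below (an illustrative assumption, not fairlib's code), is an operation that behaves as the identity in the forward pass but negates and scales the gradient in the backward pass, so that the main model is updated to make the adversary's job harder:

import torch

class GradientReverse(torch.autograd.Function):
    # Identity in the forward pass; negated, scaled gradient in the backward pass.
    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambda_ * grad_output, None

# Toy usage: an adversary head that sends reversed gradients back into the shared hidden states.
hidden = torch.randn(4, 300, requires_grad=True)
adversary_head = torch.nn.Linear(300, 2)
adv_logits = adversary_head(GradientReverse.apply(hidden, 1.0))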

Balanced training and adversarial training are built-in methods in fairlib, so we can adopt them by simply specifying the corresponding arguments, as shown in the following cell.

  • A list of supported bias mitigation methods is shown here

  • The usage file introduces further options associated with each debiasing method, such as the adversary architecture and balanced training strategies.

debiasing_args = {
    "dataset":Shared_options["dataset"], 
    "data_dir":Shared_options["data_dir"],
    "device_id":Shared_options["device_id"],

    # Exp name
    "exp_id":"BT_Adv",

    # Perform adversarial training if True
    "adv_debiasing":True,

    # Specify the hyperparameters for Balanced Training
    "BT":"Resampling",
    "BTObj":"EO",
}

debias_options = fairlib.BaseOptions()
debias_state = debias_options.get_state(args=debiasing_args, silence=True)

fairlib.utils.seed_everything(2022)

debias_model = fairlib.networks.get_main_model(debias_state)
2022-05-02 02:51:42 [INFO ]  Unexpected args: ['-f', '/root/.local/share/jupyter/runtime/kernel-94fdc7f5-0523-4a20-889a-536d5b502976.json']
2022-05-02 02:51:42 [INFO ]  Logging to ./results/dev/Moji/BT_Adv/output.log
2022-05-02 02:51:42 [INFO ]  ======================================== 2022-05-02 02:51:42 ========================================
2022-05-02 02:51:42 [INFO ]  Base directory is ./results/dev/Moji/BT_Adv
Loaded data shapes: (39996, 2304), (39996,), (39996,)
Loaded data shapes: (8000, 2304), (8000,), (8000,)
Loaded data shapes: (7996, 2304), (7996,), (7996,)
2022-05-02 02:51:51 [INFO ]  SubDiscriminator( 
2022-05-02 02:51:51 [INFO ]    (grad_rev): GradientReversal()
2022-05-02 02:51:51 [INFO ]    (output_layer): Linear(in_features=300, out_features=2, bias=True)
2022-05-02 02:51:51 [INFO ]    (AF): ReLU()
2022-05-02 02:51:51 [INFO ]    (hidden_layers): ModuleList(
2022-05-02 02:51:51 [INFO ]      (0): Linear(in_features=300, out_features=300, bias=True)
2022-05-02 02:51:51 [INFO ]      (1): ReLU()
2022-05-02 02:51:51 [INFO ]      (2): Linear(in_features=300, out_features=300, bias=True)
2022-05-02 02:51:51 [INFO ]      (3): ReLU()
2022-05-02 02:51:51 [INFO ]    )
2022-05-02 02:51:51 [INFO ]    (criterion): CrossEntropyLoss()
2022-05-02 02:51:51 [INFO ]  )
2022-05-02 02:51:51 [INFO ]  Total number of parameters: 181202 

2022-05-02 02:51:51 [INFO ]  Discriminator built!
2022-05-02 02:51:51 [INFO ]  MLP( 
2022-05-02 02:51:51 [INFO ]    (output_layer): Linear(in_features=300, out_features=2, bias=True)
2022-05-02 02:51:51 [INFO ]    (AF): Tanh()
2022-05-02 02:51:51 [INFO ]    (hidden_layers): ModuleList(
2022-05-02 02:51:51 [INFO ]      (0): Linear(in_features=2304, out_features=300, bias=True)
2022-05-02 02:51:51 [INFO ]      (1): Tanh()
2022-05-02 02:51:51 [INFO ]      (2): Linear(in_features=300, out_features=300, bias=True)
2022-05-02 02:51:51 [INFO ]      (3): Tanh()
2022-05-02 02:51:51 [INFO ]    )
2022-05-02 02:51:51 [INFO ]    (criterion): CrossEntropyLoss()
2022-05-02 02:51:51 [INFO ]  )
2022-05-02 02:51:51 [INFO ]  Total number of parameters: 782402 

It can be seen from the last cell that the training dataset is smaller than before (40k versus 100k) due to the resampling for balanced training, and that an MLP adversary has been initialized for adversarial debiasing.

The training process is the same as for the vanilla model: we call the train_self function again to train a model with bias mitigation.

# Around 90s
debias_model.train_self()
2022-05-02 02:51:51 [INFO ]  Epoch:    0 [      0/  39996 ( 0%)]	Loss: 0.0007	 Data Time: 0.02s	Train Time: 0.20s
2022-05-02 02:51:59 [INFO ]  Evaluation at Epoch 0
2022-05-02 02:51:59 [INFO ]  Validation accuracy: 74.26	macro_fscore: 73.70	micro_fscore: 74.26	TPR_GAP: 19.22	FPR_GAP: 19.22	PPR_GAP: 16.42	
2022-05-02 02:51:59 [INFO ]  Test accuracy: 74.32	macro_fscore: 73.82	micro_fscore: 74.32	TPR_GAP: 18.77	FPR_GAP: 18.77	PPR_GAP: 15.08	
2022-05-02 02:52:00 [INFO ]  Epoch:    1 [      0/  39996 ( 0%)]	Loss: -0.1897	 Data Time: 0.01s	Train Time: 0.21s
2022-05-02 02:52:08 [INFO ]  Epochs since last improvement: 1
2022-05-02 02:52:08 [INFO ]  Evaluation at Epoch 1
2022-05-02 02:52:08 [INFO ]  Validation accuracy: 74.70	macro_fscore: 74.62	micro_fscore: 74.70	TPR_GAP: 9.40	FPR_GAP: 9.40	PPR_GAP: 5.25	
2022-05-02 02:52:08 [INFO ]  Test accuracy: 74.09	macro_fscore: 73.95	micro_fscore: 74.09	TPR_GAP: 9.68	FPR_GAP: 9.68	PPR_GAP: 2.50	
2022-05-02 02:52:09 [INFO ]  Epoch:    2 [      0/  39996 ( 0%)]	Loss: -0.1648	 Data Time: 0.01s	Train Time: 0.20s
2022-05-02 02:52:17 [INFO ]  Evaluation at Epoch 2
2022-05-02 02:52:17 [INFO ]  Validation accuracy: 75.55	macro_fscore: 75.54	micro_fscore: 75.55	TPR_GAP: 12.97	FPR_GAP: 12.97	PPR_GAP: 10.40	
2022-05-02 02:52:17 [INFO ]  Test accuracy: 75.49	macro_fscore: 75.49	micro_fscore: 75.49	TPR_GAP: 12.18	FPR_GAP: 12.18	PPR_GAP: 7.55	
2022-05-02 02:52:17 [INFO ]  Epoch:    3 [      0/  39996 ( 0%)]	Loss: -0.2571	 Data Time: 0.01s	Train Time: 0.20s
2022-05-02 02:52:25 [INFO ]  Epochs since last improvement: 1
2022-05-02 02:52:26 [INFO ]  Evaluation at Epoch 3
2022-05-02 02:52:26 [INFO ]  Validation accuracy: 75.81	macro_fscore: 75.80	micro_fscore: 75.81	TPR_GAP: 15.98	FPR_GAP: 15.98	PPR_GAP: 14.17	
2022-05-02 02:52:26 [INFO ]  Test accuracy: 75.35	macro_fscore: 75.35	micro_fscore: 75.35	TPR_GAP: 15.25	FPR_GAP: 15.25	PPR_GAP: 11.83	
2022-05-02 02:52:26 [INFO ]  Epoch:    4 [      0/  39996 ( 0%)]	Loss: -0.1475	 Data Time: 0.01s	Train Time: 0.21s
2022-05-02 02:52:35 [INFO ]  Evaluation at Epoch 4
2022-05-02 02:52:35 [INFO ]  Validation accuracy: 75.83	macro_fscore: 75.80	micro_fscore: 75.83	TPR_GAP: 11.29	FPR_GAP: 11.29	PPR_GAP: 7.70	
2022-05-02 02:52:35 [INFO ]  Test accuracy: 75.48	macro_fscore: 75.46	micro_fscore: 75.48	TPR_GAP: 11.94	FPR_GAP: 11.94	PPR_GAP: 5.33	
2022-05-02 02:52:35 [INFO ]  Epoch:    5 [      0/  39996 ( 0%)]	Loss: -0.1744	 Data Time: 0.01s	Train Time: 0.20s
2022-05-02 02:52:43 [INFO ]  Evaluation at Epoch 5
2022-05-02 02:52:43 [INFO ]  Validation accuracy: 75.56	macro_fscore: 75.38	micro_fscore: 75.56	TPR_GAP: 15.65	FPR_GAP: 15.65	PPR_GAP: 13.37	
2022-05-02 02:52:43 [INFO ]  Test accuracy: 75.26	macro_fscore: 75.11	micro_fscore: 75.26	TPR_GAP: 15.42	FPR_GAP: 15.42	PPR_GAP: 11.56	
2022-05-02 02:52:43 [INFO ]  Epoch:    6 [      0/  39996 ( 0%)]	Loss: -0.2199	 Data Time: 0.01s	Train Time: 0.20s
2022-05-02 02:52:52 [INFO ]  Epochs since last improvement: 1
2022-05-02 02:52:52 [INFO ]  Evaluation at Epoch 6
2022-05-02 02:52:52 [INFO ]  Validation accuracy: 75.70	macro_fscore: 75.68	micro_fscore: 75.70	TPR_GAP: 9.72	FPR_GAP: 9.72	PPR_GAP: 5.60	
2022-05-02 02:52:52 [INFO ]  Test accuracy: 75.36	macro_fscore: 75.36	micro_fscore: 75.36	TPR_GAP: 10.30	FPR_GAP: 10.30	PPR_GAP: 3.00	
2022-05-02 02:52:52 [INFO ]  Epoch:    7 [      0/  39996 ( 0%)]	Loss: -0.1880	 Data Time: 0.01s	Train Time: 0.20s
2022-05-02 02:53:00 [INFO ]  Epochs since last improvement: 2
2022-05-02 02:53:01 [INFO ]  Evaluation at Epoch 7
2022-05-02 02:53:01 [INFO ]  Validation accuracy: 75.76	macro_fscore: 75.71	micro_fscore: 75.76	TPR_GAP: 22.58	FPR_GAP: 22.58	PPR_GAP: 21.02	
2022-05-02 02:53:01 [INFO ]  Test accuracy: 74.92	macro_fscore: 74.88	micro_fscore: 74.92	TPR_GAP: 21.33	FPR_GAP: 21.33	PPR_GAP: 18.93	
2022-05-02 02:53:01 [INFO ]  Epoch:    8 [      0/  39996 ( 0%)]	Loss: -0.2015	 Data Time: 0.01s	Train Time: 0.20s
2022-05-02 02:53:09 [INFO ]  Epochs since last improvement: 3
2022-05-02 02:53:10 [INFO ]  Evaluation at Epoch 8
2022-05-02 02:53:10 [INFO ]  Validation accuracy: 75.64	macro_fscore: 75.46	micro_fscore: 75.64	TPR_GAP: 17.73	FPR_GAP: 17.73	PPR_GAP: 15.37	
2022-05-02 02:53:10 [INFO ]  Test accuracy: 75.38	macro_fscore: 75.23	micro_fscore: 75.38	TPR_GAP: 17.23	FPR_GAP: 17.23	PPR_GAP: 13.93	
2022-05-02 02:53:10 [INFO ]  Epoch:    9 [      0/  39996 ( 0%)]	Loss: -0.2016	 Data Time: 0.01s	Train Time: 0.20s
2022-05-02 02:53:18 [INFO ]  Epochs since last improvement: 4
2022-05-02 02:53:18 [INFO ]  Evaluation at Epoch 9
2022-05-02 02:53:18 [INFO ]  Validation accuracy: 75.44	macro_fscore: 75.27	micro_fscore: 75.44	TPR_GAP: 17.00	FPR_GAP: 17.00	PPR_GAP: 14.62	
2022-05-02 02:53:18 [INFO ]  Test accuracy: 75.45	macro_fscore: 75.30	micro_fscore: 75.45	TPR_GAP: 17.51	FPR_GAP: 17.51	PPR_GAP: 13.83	
2022-05-02 02:53:19 [INFO ]  Epoch:   10 [      0/  39996 ( 0%)]	Loss: -0.2201	 Data Time: 0.01s	Train Time: 0.20s
2022-05-02 02:53:27 [INFO ]  Epochs since last improvement: 5
2022-05-02 02:53:27 [INFO ]  Evaluation at Epoch 10
2022-05-02 02:53:27 [INFO ]  Validation accuracy: 74.81	macro_fscore: 74.40	micro_fscore: 74.81	TPR_GAP: 18.69	FPR_GAP: 18.69	PPR_GAP: 16.07	
2022-05-02 02:53:27 [INFO ]  Test accuracy: 74.42	macro_fscore: 74.02	micro_fscore: 74.42	TPR_GAP: 18.79	FPR_GAP: 18.79	PPR_GAP: 15.03	

It can be seen that GAP scores drop significantly, confirming that the debiasing method indeed improves fairness.

Here we provide two more examples of employing different debiasing methods.

  • Only using the adversarial training for bias mitigation

    We need to remove the options that are not related to adversarial training.

    debiasing_args = {
        "dataset":Shared_options["dataset"], 
        "data_dir":Shared_options["data_dir"],
        "device_id":Shared_options["device_id"],
    
        # Exp name
        "exp_id":"Adv",
    
        # Perform adversarial training if True
        "adv_debiasing":True,
    
        # Remove the hyperparameters for Balanced Training
        # "BT":"Resampling",
        # "BTObj":"EO",
    }
    
  • Use more debiasing methods simultaneously

    In addition to balanced training and adversarial training, we can employ FairBatch (Roh et al. 2021) for bias mitigation.

    We can directly add the FairBatch options to the argument dict as follows:

        debiasing_args = {
            "dataset":Shared_options["dataset"], 
            "data_dir":Shared_options["data_dir"],
            "device_id":Shared_options["device_id"],
    
            # Exp name
            "exp_id":"BT_Adv_FairBatch",
    
            # Perform adversarial training if True
            "adv_debiasing":True,
    
            # Specify the hyperparameters for Balanced Training
            "BT":"Resampling",
            "BTObj":"EO",
    
            # Specify the hyperparameters for FairBatch
            "DyBT": "FairBatch", 
            "DyBTObj": "stratified_y" # Equivalent to the EO FairBatch in the original paper
        }
    

4. Analysis

The previous sections have demonstrated how to train a model for bias mitigation under different settings. Beyond training, several other important aspects need to be considered, including:

  • how to select the desired model when considering both fairness and performance?

  • how to compare different debiasing methods systematically?

  • how to present experimental results?

The analysis component in fairlib aims to address these problems; it can be used to retrieve results, select models, and compare methods.

Saved material

During the model training, fairlib saves results for later analysis, so let’s explore what has been stored.

The arguments specify the saving directory when initializing the state for training as: results_dir/project_dir/dataset/exp_id.

The following example shows the information that has been stored for the first epoch (i.e. epoch 0):

import torch

path = "{results_dir}/{project_dir}/{dataset}/{exp_id}/{checkpoint_dir}/{checkpoint_name}{epoch}.pth.tar"

# Path to the first epoch
path_vanilla_epoch0 = path.format(
    exp_id = "vanilla",
    epoch = "0",
    results_dir=Shared_options["results_dir"],
    project_dir=Shared_options["project_dir"],
    dataset=Shared_options["dataset"],
    checkpoint_dir=Shared_options["checkpoint_dir"],
    checkpoint_name=Shared_options["checkpoint_name"],
)

epoch_results = torch.load(path_vanilla_epoch0)
# The keys for saved items
print(epoch_results.keys())
dict_keys(['epoch', 'epochs_since_improvement', 'loss', 'valid_confusion_matrices', 'test_confusion_matrices', 'dev_evaluations', 'test_evaluations'])

The information printed during model training, such as the evaluation results over the validation and test sets, has been stored for each epoch.

print(epoch_results["dev_evaluations"])
{'accuracy': 0.7255, 'macro_fscore': 0.7244493079894455, 'micro_fscore': 0.7255, 'TPR_GAP': 0.4006709956925445, 'FPR_GAP': 0.40067100170260955, 'PPR_GAP': 0.39099999902250004}

Moreover, fairlib also saves the confusion matrices for each epoch, so any confusion-matrix-based scores can be post-calculated in later analysis without retraining the models.
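
To illustrate what post-calculation means, here is a minimal, self-contained sketch based on a hypothetical 2x2 confusion matrix (the exact structure of the stored valid_confusion_matrices and test_confusion_matrices objects may differ):

import numpy as np

# Hypothetical confusion matrix for one protected group (rows: gold label, columns: prediction).
cm = np.array([[2900, 1100],
               [1096, 2904]])

accuracy = np.trace(cm) / cm.sum()      # (TP + TN) / total
tpr_positive = cm[1, 1] / cm[1].sum()   # recall for the positive class
print(f"accuracy={accuracy:.4f}, TPR={tpr_positive:.4f}")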

from fairlib import analysis

How to select the desired model when considering both fairness and performance?

As discussed in Section 4 of our paper, there are different ways of selecting the best model, and here we use the DTO metric for epoch selection.
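
DTO measures the Euclidean distance from a (performance, fairness) point to the utopia point, where smaller is better. The sketch below assumes the utopia point (1.0, 1.0), which is consistent with the DTO values reported in the final results table later in this notebook (fairlib computes DTO internally during model selection):

import numpy as np

def dto(performance, fairness, utopia=(1.0, 1.0)):
    # Euclidean distance to the utopia point; smaller is better.
    return float(np.sqrt((utopia[0] - performance) ** 2 + (utopia[1] - fairness) ** 2))

# Example with the Vanilla test scores reported later in this notebook:
print(dto(0.722981, 0.611870))  # ~0.4768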

model_selection retrieves the experimental results for a single method, selects the desired epoch for each run, and saves the resulting DataFrame for later processing.

analysis.model_selection(
    # exp_ids starting with model_id will be treated as the same method, e.g., vanilla and adv
    model_id= ("vanilla"),

    # the tuned hyperparameters of a method, which will be used to group multiple runs together.
    # This option is generally used for differentiating models with the same debiasing method but
    # different method-specific hyperparameters, such as the strength of the adversarial loss for Adv.
    # Random seeds should not be included here, so that random runs with the same hyperparameters can
    # be aggregated to present statistics of the results.
    index_column_names = ["BT", "BTObj", "adv_debiasing"],

    # to facilitate further analysis, we store the resulting DataFrame at the specified path
    save_path = r"results/Vanilla_df.pkl",

    # The following options are predefined
    results_dir= Shared_options["results_dir"],
    project_dir= Shared_options["project_dir"]+"/"+Shared_options["dataset"],
    GAP_metric_name = Shared_options["GAP_metric_name"],
    Performance_metric_name = Shared_options["Performance_metric_name"],
    # We use DTO for epoch selection
    selection_criterion = Shared_options["selection_criterion"],
    checkpoint_dir= Shared_options["checkpoint_dir"],
    checkpoint_name= Shared_options["checkpoint_name"],
    # Whether to retrieve results in parallel
    n_jobs=Shared_options["n_jobs"],
)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished
| BT | BTObj | adv_debiasing | epoch | dev_fairness | dev_performance | dev_DTO | test_fairness | test_performance | test_DTO | opt_dir |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NaN | NaN | False | 0 | 0.617075 | 0.727 | 0.0 | 0.624079 | 0.717554 | 0.0 | results/dev/Moji/vanilla/opt.yaml |

analysis.model_selection(
    model_id= ("BT_Adv"),
    index_column_names = ["BT", "BTObj", "adv_debiasing"],
    save_path = r"results/BT_ADV_df.pkl",
    # The following options are predefined
    results_dir= Shared_options["results_dir"],
    project_dir= Shared_options["project_dir"]+"/"+Shared_options["dataset"],
    GAP_metric_name = Shared_options["GAP_metric_name"],
    Performance_metric_name = Shared_options["Performance_metric_name"],
    selection_criterion = Shared_options["selection_criterion"],
    checkpoint_dir= Shared_options["checkpoint_dir"],
    checkpoint_name= Shared_options["checkpoint_name"],
    n_jobs=Shared_options["n_jobs"],
)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished
| BT | BTObj | adv_debiasing | epoch | dev_fairness | dev_performance | dev_DTO | test_fairness | test_performance | test_DTO | opt_dir |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Resampling | EO | True | 0 | 0.902757 | 0.757 | 0.003919 | 0.896981 | 0.753627 | 0.007047 | results/dev/Moji/BT_Adv/opt.yaml |


We have preprocessed the results of a wider set of debiasing methods with the model_selection function, and the resulting DataFrames can be downloaded as follows:

!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1M0G6PyPuDC8Y_2nL9XKYCt10IUzbSvfl' -O retrived_results.tar.gz
--2022-05-02 02:53:27--  https://docs.google.com/uc?export=download&id=1M0G6PyPuDC8Y_2nL9XKYCt10IUzbSvfl
Resolving docs.google.com (docs.google.com)... 74.125.195.102, 74.125.195.138, 74.125.195.139, ...
Connecting to docs.google.com (docs.google.com)|74.125.195.102|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0g-0k-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/0bdo761p58r1ii9ush0jp9bsgtg7n5hh/1651459950000/17527887236587461918/*/1M0G6PyPuDC8Y_2nL9XKYCt10IUzbSvfl?e=download [following]
Warning: wildcards not supported in HTTP.
--2022-05-02 02:53:29--  https://doc-0g-0k-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/0bdo761p58r1ii9ush0jp9bsgtg7n5hh/1651459950000/17527887236587461918/*/1M0G6PyPuDC8Y_2nL9XKYCt10IUzbSvfl?e=download
Resolving doc-0g-0k-docs.googleusercontent.com (doc-0g-0k-docs.googleusercontent.com)... 74.125.142.132, 2607:f8b0:400e:c08::84
Connecting to doc-0g-0k-docs.googleusercontent.com (doc-0g-0k-docs.googleusercontent.com)|74.125.142.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 790461 (772K) [application/x-gzip]
Saving to: ‘retrived_results.tar.gz’

retrived_results.ta 100%[===================>] 771.93K  --.-KB/s    in 0.006s  

2022-05-02 02:53:29 (128 MB/s) - ‘retrived_results.tar.gz’ saved [790461/790461]
!tar -xf retrived_results.tar.gz

Here we demonstrate the application of final_results_df, which loads the cached results for all methods with retrive_results, selects the best hyperparameter combination for each method, and presents the results in a DataFrame.

?analysis.final_results_df
Moji_results = analysis.retrive_results("Moji", log_dir="analysis/results")
Moji_main_results = analysis.final_results_df(
    results_dict = Moji_results,
    pareto = False,
    pareto_selection = "test",
    selection_criterion = "DTO",
    return_dev = True,
    # Fairness_threshold=0.95,
    # return_conf=True,
    # save_conf_dir=r"D:\Project\Fair_NLP_Classification\analysis\reproduce\Moji"
    )
Moji_main_results
| Models | test_performance mean | test_performance std | test_fairness mean | test_fairness std | dev_performance mean | dev_performance std | dev_fairness mean | dev_fairness std | DTO | is_pareto |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GatedDAdv | 0.750163 | 0.006945 | 0.908679 | 0.021678 | 0.745600 | 0.004828 | 0.928670 | 0.022488 | 0.266004 | False |
| Adv | 0.756414 | 0.007271 | 0.893286 | 0.005623 | 0.747425 | 0.004549 | 0.912125 | 0.008507 | 0.265936 | True |
| FairBatch | 0.751488 | 0.005772 | 0.904373 | 0.008213 | 0.746050 | 0.003896 | 0.914526 | 0.006020 | 0.266276 | True |
| DAdv | 0.755464 | 0.004076 | 0.904023 | 0.011218 | 0.748550 | 0.002405 | 0.915601 | 0.005007 | 0.262697 | True |
| DelayedCLS_Adv | 0.761015 | 0.003081 | 0.882425 | 0.015918 | 0.751675 | 0.003481 | 0.899346 | 0.011417 | 0.266341 | True |
| BTEO | 0.753927 | 0.001433 | 0.877469 | 0.003756 | 0.746325 | 0.000998 | 0.896874 | 0.005401 | 0.274892 | True |
| GDEO | 0.752763 | 0.004999 | 0.892255 | 0.007860 | 0.749350 | 0.003494 | 0.912672 | 0.002766 | 0.269694 | True |
| BTGatedAdv | 0.735459 | 0.028830 | 0.866150 | 0.028232 | 0.730150 | 0.024594 | 0.886862 | 0.030537 | 0.296476 | True |
| FairSCL | 0.757314 | 0.003441 | 0.878219 | 0.004314 | 0.752825 | 0.001872 | 0.898325 | 0.002579 | 0.271527 | False |
| OldFairBatch | 0.750638 | 0.006012 | 0.905537 | 0.005046 | 0.744525 | 0.004995 | 0.917734 | 0.004761 | 0.266655 | True |
| GatedBTEO | 0.762106 | 0.002592 | 0.900764 | 0.014701 | 0.759775 | 0.003798 | 0.909445 | 0.006631 | 0.257762 | True |
| BTFairBatch | 0.746837 | 0.003407 | 0.899351 | 0.004936 | 0.743975 | 0.004236 | 0.919254 | 0.004731 | 0.272437 | False |
| GatedAdv | 0.753113 | 0.005196 | 0.890065 | 0.013302 | 0.748975 | 0.003805 | 0.910838 | 0.010314 | 0.270257 | False |
| INLP | 0.733433 | NaN | 0.855982 | NaN | 0.727625 | NaN | 0.859686 | NaN | 0.302983 | True |
| GDMean | 0.752163 | 0.002130 | 0.901389 | 0.003916 | 0.749050 | 0.001368 | 0.922430 | 0.005829 | 0.266735 | True |
| Vanilla | 0.722981 | 0.004576 | 0.611870 | 0.014356 | 0.726650 | 0.003673 | 0.632302 | 0.013370 | 0.476849 | True |

print(Moji_main_results.to_latex(index=False))
\begin{tabular}{lrrrrrrrrrl}
\toprule
        Models &  test\_performance mean &  test\_performance std &  test\_fairness mean &  test\_fairness std &  dev\_performance mean &  dev\_performance std &  dev\_fairness mean &  dev\_fairness std &      DTO &  is\_pareto \\
\midrule
     GatedDAdv &               0.750163 &              0.006945 &            0.908679 &           0.021678 &              0.745600 &             0.004828 &           0.928670 &          0.022488 & 0.266004 &      False \\
           Adv &               0.756414 &              0.007271 &            0.893286 &           0.005623 &              0.747425 &             0.004549 &           0.912125 &          0.008507 & 0.265936 &       True \\
     FairBatch &               0.751488 &              0.005772 &            0.904373 &           0.008213 &              0.746050 &             0.003896 &           0.914526 &          0.006020 & 0.266276 &       True \\
          DAdv &               0.755464 &              0.004076 &            0.904023 &           0.011218 &              0.748550 &             0.002405 &           0.915601 &          0.005007 & 0.262697 &       True \\
DelayedCLS\_Adv &               0.761015 &              0.003081 &            0.882425 &           0.015918 &              0.751675 &             0.003481 &           0.899346 &          0.011417 & 0.266341 &       True \\
          BTEO &               0.753927 &              0.001433 &            0.877469 &           0.003756 &              0.746325 &             0.000998 &           0.896874 &          0.005401 & 0.274892 &       True \\
          GDEO &               0.752763 &              0.004999 &            0.892255 &           0.007860 &              0.749350 &             0.003494 &           0.912672 &          0.002766 & 0.269694 &       True \\
    BTGatedAdv &               0.735459 &              0.028830 &            0.866150 &           0.028232 &              0.730150 &             0.024594 &           0.886862 &          0.030537 & 0.296476 &       True \\
       FairSCL &               0.757314 &              0.003441 &            0.878219 &           0.004314 &              0.752825 &             0.001872 &           0.898325 &          0.002579 & 0.271527 &      False \\
  OldFairBatch &               0.750638 &              0.006012 &            0.905537 &           0.005046 &              0.744525 &             0.004995 &           0.917734 &          0.004761 & 0.266655 &       True \\
     GatedBTEO &               0.762106 &              0.002592 &            0.900764 &           0.014701 &              0.759775 &             0.003798 &           0.909445 &          0.006631 & 0.257762 &       True \\
   BTFairBatch &               0.746837 &              0.003407 &            0.899351 &           0.004936 &              0.743975 &             0.004236 &           0.919254 &          0.004731 & 0.272437 &      False \\
      GatedAdv &               0.753113 &              0.005196 &            0.890065 &           0.013302 &              0.748975 &             0.003805 &           0.910838 &          0.010314 & 0.270257 &      False \\
          INLP &               0.733433 &                   NaN &            0.855982 &                NaN &              0.727625 &                  NaN &           0.859686 &               NaN & 0.302983 &       True \\
        GDMean &               0.752163 &              0.002130 &            0.901389 &           0.003916 &              0.749050 &             0.001368 &           0.922430 &          0.005829 & 0.266735 &       True \\
       Vanilla &               0.722981 &              0.004576 &            0.611870 &           0.014356 &              0.726650 &             0.003673 &           0.632302 &          0.013370 & 0.476849 &       True \\
\bottomrule
\end{tabular}
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path
def make_plot(plot_df, figure_name=None):
    plot_df["Fairness"] = plot_df["test_fairness mean"]
    plot_df["Accuracy"] = plot_df["test_performance mean"]

    figure = plt.figure(figsize=(6, 6), dpi = 150) 
    with sns.axes_style("white"):
        sns.lineplot(
            data=plot_df,
            x="Accuracy",
            y="Fairness",
            hue="Models",
            markers=True,
            style="Models",
        )
    if figure_name is not None:
        figure.savefig(Path(r"plots") / figure_name, dpi=960, bbox_inches="tight") 
Moji_plot_df = analysis.final_results_df(
    results_dict = Moji_results,
    pareto = True,
    # pareto = False,
    pareto_selection = "test",
    selection_criterion = None,
    return_dev = True,
    # Performance_threshold=0.72
    # num_trail=20,
    )
make_plot(Moji_plot_df)

(Figure: Fairness versus Accuracy trade-off plot for each method on the Moji dataset.)

5. Read More

  • Visualization

    • Interactive plots demonstrates how to create interactive plots for comparing different methods and for illustrating DTO and constrained selection. Figure 2 in our paper can be reproduced with this function.

    • Plot gallery presents a list of examples for presenting experimental results, e.g., hyperparameter tuning (Figure 1 in our paper) and trade-off plots with zoomed-in area (Figure 3 in our paper).

  • Customized Dataset and Models

  • Customized Metrics

    • Definitions of single metrics can be found there.

    • Metric aggregation, such as the default root mean square aggregation, can be found there.

    • The document is coming soon.

  • Customized Debiasing Methods

    • Please see the document for instructions about adding method-specific options and integrating methods with fairlib.