Tutorial: Counterfactual explanations for scorecard with binary target¶
Counterfactual explanations are a post-hoc method for explaining the predictions of machine learning models. In this tutorial, we show how to generate optimal counterfactual explanations for scorecard models.
[1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from optbinning import BinningProcess
from optbinning import Scorecard
from optbinning.scorecard import Counterfactual
We load the adult income dataset from the UCI repository (https://archive.ics.uci.edu/ml/datasets/adult). For this example, we select 8 features. The target “income” is transformed to binary (0: low-income, <=50K) and (1: high-income, >50K).
[2]:
# adult.data has no header row, so column names are assigned explicitly.
df = pd.read_csv("data/adult.data", sep=",", header=None)
columns = ["age", "workclass", "fnlwgt", "education", "education-num",
           "marital-status", "occupation", "relationship", "race", "sex",
           "capital-gain", "capital-loss", "hours-per-week", "native-country",
           "income"]
target = "income"
variable_names = ["age", "workclass", "education", "marital-status",
                  "occupation", "race", "sex", "hours-per-week"]
df.columns = columns
X = df[variable_names]
# String fields keep the space that follows each comma, hence ' >50K'.
y = (df[target].values == ' >50K').astype(int)
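The comparison against ' >50K' (with a leading space) works because the string fields in this file carry the space that follows each comma, as the query values shown later (' Private', ' 11th') also suggest. The sketch below illustrates the effect with the standard csv module on two made-up, shortened rows; the sample rows are assumptions for illustration and are not taken from the dataset.

```python
import csv
import io

# Hypothetical, shortened rows in the same "comma + space" layout.
sample = "39, State-gov, Bachelors, <=50K\n52, Self-emp-inc, HS-grad, >50K\n"

# Default parsing keeps the space that follows each comma.
raw_rows = list(csv.reader(io.StringIO(sample)))
raw_labels = [row[-1] for row in raw_rows]

# skipinitialspace=True drops whitespace right after the delimiter.
clean_rows = list(csv.reader(io.StringIO(sample), skipinitialspace=True))
clean_labels = [row[-1] for row in clean_rows]

# Binary target, mirroring y = (df[target].values == ' >50K').astype(int).
y_raw = [int(label == " >50K") for label in raw_labels]
y_clean = [int(label == ">50K") for label in clean_labels]

print(raw_labels, y_raw)
print(clean_labels, y_clean)
```

Either convention gives the same binary target, as long as the comparison string matches the way the file was parsed. This tutorial keeps the leading spaces, so all category values below include them.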
Scorecard model¶
First, we require a scorecard model. As shown in previous tutorials, we need a binning process and an estimator.
[3]:
binning_process = BinningProcess(variable_names)
estimator = LogisticRegression(solver="lbfgs", class_weight="balanced")
scorecard = Scorecard(binning_process=binning_process,
                      estimator=estimator, scaling_method="min_max",
                      scaling_method_params={"min": 300, "max": 850})
scorecard.fit(X, y)
[3]:
Scorecard(binning_process=BinningProcess(variable_names=['age', 'workclass',
'education',
'marital-status',
'occupation', 'race',
'sex',
'hours-per-week']),
estimator=LogisticRegression(class_weight='balanced'),
scaling_method='min_max',
scaling_method_params={'max': 850, 'min': 300})
Generating counterfactual explanations - binary outcome¶
Having fitted the scorecard model, we can start generating counterfactual explanations.
Single counterfactual¶
As an input data point or query, we choose the sample with the lowest probability of having a high income.
[4]:
idx_lowest = np.argmin(scorecard.predict_proba(X)[:, 1])
query = X.iloc[idx_lowest, :].to_dict()
[5]:
query
[5]:
{'age': 17,
'workclass': ' Private',
'education': ' 11th',
'marital-status': ' Never-married',
'occupation': ' Other-service',
'race': ' Black',
'sex': ' Female',
'hours-per-week': 25}
The Counterfactual class is instantiated with a scorecard model. The first step is to fit it with the data used to develop the scorecard, which computes the problem data required for the integer programming formulation. The fit method must be called once before generating counterfactual explanations.
[6]:
cf = Counterfactual(scorecard=scorecard)
[7]:
cf.fit(X)
[7]:
Counterfactual(scorecard=Scorecard(binning_process=BinningProcess(variable_names=['age',
'workclass',
'education',
'marital-status',
'occupation',
'race',
'sex',
'hours-per-week']),
estimator=LogisticRegression(class_weight='balanced'),
scaling_method='min_max',
scaling_method_params={'max': 850,
'min': 300}))
The minimum parameters required to generate counterfactuals are a query, the desired target outcome y, the outcome type, and the number of counterfactuals n_cf. Other parameters, such as the maximum number of changes max_changes, can also be specified.
[8]:
cf.generate(query=query, y=1, outcome_type="binary", n_cf=1, max_changes=3)
[8]:
Counterfactual(scorecard=Scorecard(binning_process=BinningProcess(variable_names=['age',
'workclass',
'education',
'marital-status',
'occupation',
'race',
'sex',
'hours-per-week']),
estimator=LogisticRegression(class_weight='balanced'),
scaling_method='min_max',
scaling_method_params={'max': 850,
'min': 300}))
The property status retrieves the status of the optimization solver. Additional information is retrieved with the information method, which shows three sections:
Solver statistics: number of variables and constraints, and objective value.
Objectives: values of the objective functions, by default proximity and closeness.
Timing: time statistics.
[9]:
cf.status
[9]:
'OPTIMAL'
[10]:
cf.information(print_level=2)
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Begin options
scorecard yes * U
special_missing False * d
n_jobs 1 * d
verbose False * d
End options
Status : OPTIMAL
Solver statistics
Type mip
Number of variables 42
Number of constraints 95
Objective value 16.1490
Best objective bound 16.1490
Objectives
proximity 3.7558
closeness 12.3932
Timing
Total time 0.22 sec
Fit 0.13 sec ( 60.81%)
Solver 0.08 sec ( 36.29%)
Post-processing 0.01 sec ( 7.99%)
The resulting optimal counterfactual explanation can be shown using the display method. By default, all feature values are displayed. To show only the changes with respect to the query, set the parameter show_only_changes to True. To also show the counterfactual outcome, set the parameter show_outcome to True.
[11]:
cf.display()
[11]:
| | age | workclass | education | marital-status | occupation | race | sex | hours-per-week |
---|---|---|---|---|---|---|---|---|
0 | [33.50, 35.50) | Private | [ Masters, Prof-school, Doctorate] | [ Married-AF-spouse, Married-civ-spouse] | Other-service | Black | Female | 25 |
[12]:
cf.display(show_only_changes=True)
[12]:
| | age | workclass | education | marital-status | occupation | race | sex | hours-per-week |
---|---|---|---|---|---|---|---|---|
0 | [33.50, 35.50) | - | [ Masters, Prof-school, Doctorate] | [ Married-AF-spouse, Married-civ-spouse] | - | - | - | - |
[13]:
cf.display(show_only_changes=True, show_outcome=True)
[13]:
| | age | workclass | education | marital-status | occupation | race | sex | hours-per-week | outcome |
---|---|---|---|---|---|---|---|---|---|
0 | [33.50, 35.50) | - | [ Masters, Prof-school, Doctorate] | [ Married-AF-spouse, Married-civ-spouse] | - | - | - | - | 0.505995 |
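The show_only_changes view can be reproduced by hand by comparing the query with the counterfactual feature by feature. The following is a minimal sketch in plain Python, not the library's implementation: changed_features is a hypothetical helper, and the counterfactual dictionary is copied from the display output above, with bins kept as strings.

```python
# Query taken from the tutorial output above.
query = {
    "age": 17,
    "workclass": " Private",
    "education": " 11th",
    "marital-status": " Never-married",
    "occupation": " Other-service",
    "race": " Black",
    "sex": " Female",
    "hours-per-week": 25,
}

# Counterfactual row copied from the cf.display() output; unchanged
# features are written exactly as in the query.
counterfactual = dict(query)
counterfactual.update({
    "age": "[33.50, 35.50)",
    "education": "[ Masters, Prof-school, Doctorate]",
    "marital-status": "[ Married-AF-spouse, Married-civ-spouse]",
})


def changed_features(query, counterfactual):
    """Hypothetical helper: new value per feature, or '-' if unchanged."""
    return {
        name: (counterfactual[name] if counterfactual[name] != value else "-")
        for name, value in query.items()
    }


changes = changed_features(query, counterfactual)
n_changes = sum(value != "-" for value in changes.values())

print(changes)
print(n_changes)
```

Three features differ from the query, which is consistent with max_changes=3 in the generate call above.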
Actionable features¶
Note that, in general, a scorecard includes features that are immutable or that should not be actionable. For example, we can exclude marital status, race, and sex by listing only the actionable features in the parameter actionable_features.
[14]:
cf.generate(query=query, y=1, outcome_type="binary", n_cf=1, max_changes=4,
actionable_features=["age", "workclass", "education", "occupation",
"hours-per-week"]
).display(show_only_changes=True, show_outcome=True)
[14]:
| | age | workclass | education | marital-status | occupation | race | sex | hours-per-week | outcome |
---|---|---|---|---|---|---|---|---|---|
0 | [43.50, 49.50) | - | [ Masters, Prof-school, Doctorate] | - | [ Tech-support, Protective-serv, Prof-specia... | - | - | [39.50, 41.50) | 0.503416 |
Weighted vs hierarchical method¶
Currently, two methods (weighted and hierarchical) are supported to handle multiple objective functions. By default, method is set to “weighted” because of its lower computational cost.
[15]:
cf.generate(query=query, y=1, outcome_type="binary", n_cf=1,
method="weighted", max_changes=4).display()
[15]:
| | age | workclass | education | marital-status | occupation | race | sex | hours-per-week |
---|---|---|---|---|---|---|---|---|
0 | [37.50, 40.50) | Private | [ Bachelors] | [ Married-AF-spouse, Married-civ-spouse] | [ ?, Armed-Forces, Farming-fishing] | Black | Female | 25 |
[16]:
cf.information()
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Status : OPTIMAL
Solver statistics
Type mip
Number of variables 42
Number of constraints 95
Objective value 14.0037
Best objective bound 14.0037
Objectives
proximity 4.1646
closeness 9.8391
Timing
Total time 0.38 sec
Fit 0.13 sec ( 35.64%)
Solver 0.23 sec ( 61.28%)
Post-processing 0.01 sec ( 5.02%)
When generating counterfactuals, the objectives are passed as a dictionary. If method="weighted", the dictionary values represent the weight assigned to each objective, whereas if method="hierarchical", the dictionary values represent the priority of each objective. In the following example, we assign a higher weight to closeness (0.9) than to proximity (0.1), which reduces the objective value compared to the previous run, where both weights are set to 1.
[17]:
cf.generate(query=query, y=1, outcome_type="binary", n_cf=1,
method="weighted", objectives={"proximity": 0.1, "closeness": 0.9},
max_changes=4
).display()
[17]:
| | age | workclass | education | marital-status | occupation | race | sex | hours-per-week |
---|---|---|---|---|---|---|---|---|
0 | [43.50, 49.50) | Private | [ Masters, Prof-school, Doctorate] | Never-married | [ Tech-support, Protective-serv, Prof-specia... | Black | Female | [39.50, 41.50) |
[18]:
cf.information()
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Status : OPTIMAL
Solver statistics
Type mip
Number of variables 42
Number of constraints 95
Objective value 9.0395
Best objective bound 9.0395
Objectives
proximity 5.5872
closeness 9.4231
Timing
Total time 0.36 sec
Fit 0.13 sec ( 36.87%)
Solver 0.22 sec ( 61.00%)
Post-processing 0.01 sec ( 3.50%)
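As a sanity check on the two weighted runs above, the reported objective value can be recomputed from the reported proximity and closeness values. This is a minimal sketch that assumes the weighted method minimizes a plain weighted sum of the objectives; weighted_objective is a hypothetical helper and not part of optbinning.

```python
def weighted_objective(values, weights):
    """Weighted sum of objective values (assumed form of the weighted method)."""
    return sum(weights[name] * values[name] for name in values)


# Objective values reported by cf.information() for the two weighted runs.
run_unit_weights = {"proximity": 4.1646, "closeness": 9.8391}
run_closeness_heavy = {"proximity": 5.5872, "closeness": 9.4231}

unit_total = weighted_objective(
    run_unit_weights, {"proximity": 1, "closeness": 1}
)  # reported objective value: 14.0037
heavy_total = weighted_objective(
    run_closeness_heavy, {"proximity": 0.1, "closeness": 0.9}
)  # reported objective value: 9.0395

print(round(unit_total, 4), round(heavy_total, 4))
```

Both recomputed totals agree with the reported objective values up to rounding, which supports the weighted-sum reading. It also shows that objective values obtained with different weights are not directly comparable as quality measures.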
With the hierarchical method, the highest priority is assigned to proximity by default. Therefore, we obtain the minimum proximity value up to a small degradation (priority_tol=0.1).
[19]:
cf.generate(query=query, y=1, outcome_type="binary", n_cf=1,
method="hierarchical", max_changes=4
).display()
[19]:
| | age | workclass | education | marital-status | occupation | race | sex | hours-per-week |
---|---|---|---|---|---|---|---|---|
0 | [35.50, 37.50) | Private | [ Masters, Prof-school, Doctorate] | [ Married-AF-spouse, Married-civ-spouse] | Other-service | Black | Female | [25.50, 34.50) |
[20]:
cf.information()
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Status : OPTIMAL
Solver statistics
Type mip
Number of variables 44
Number of constraints 97
Objective value 12.0849
Best objective bound 12.0849
Objectives
proximity 3.9645
closeness 12.0849
Timing
Total time 0.23 sec
Fit 0.13 sec ( 58.51%)
Solver 0.09 sec ( 38.33%)
Post-processing 0.01 sec ( 8.23%)
Conversely, if we switch the priorities and reduce the priority degradation, we obtain the same closeness objective value (9.4231) as the one generated with the weight "closeness": 0.9.
[21]:
cf.generate(query=query, y=1, outcome_type="binary", n_cf=1,
method="hierarchical", objectives={"proximity": 1, "closeness": 2},
priority_tol=0.001,
max_changes=4
).display()
[21]:
| | age | workclass | education | marital-status | occupation | race | sex | hours-per-week |
---|---|---|---|---|---|---|---|---|
0 | [43.50, 49.50) | Private | [ Masters, Prof-school, Doctorate] | Never-married | [ Tech-support, Protective-serv, Prof-specia... | Black | Female | [39.50, 41.50) |
[22]:
cf.information()
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Status : OPTIMAL
Solver statistics
Type mip
Number of variables 44
Number of constraints 97
Objective value 5.5872
Best objective bound 5.5872
Objectives
closeness 9.4231
proximity 5.5872
Timing
Total time 0.74 sec
Fit 0.13 sec ( 18.17%)
Solver 0.60 sec ( 80.79%)
Post-processing 0.01 sec ( 1.29%)
The hierarchical method becomes more interesting when several objectives come into play, as shown in the following sections.
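To make the difference between the two methods concrete, the sketch below applies both selection rules to the four (proximity, closeness) pairs reported in this tutorial. It is an illustration under stated assumptions, not the library's algorithm: the real solver searches all feasible counterfactuals rather than four candidates, priority_tol is assumed to be a relative degradation allowed on the higher-priority objective, and pick_weighted and pick_hierarchical are hypothetical helpers.

```python
# (proximity, closeness) pairs reported by cf.information() in this tutorial.
candidates = {
    "A": (3.7558, 12.3932),  # weighted, unit weights, max_changes=3
    "B": (4.1646, 9.8391),   # weighted, unit weights, max_changes=4
    "C": (5.5872, 9.4231),   # weighted 0.1 / 0.9; hierarchical, closeness first
    "D": (3.9645, 12.0849),  # hierarchical, proximity first
}


def pick_weighted(points, w_proximity, w_closeness):
    """Return the candidate with the smallest weighted sum."""
    return min(
        points,
        key=lambda k: w_proximity * points[k][0] + w_closeness * points[k][1],
    )


def pick_hierarchical(points, first, tol):
    """Lexicographic choice: `first` is the index of the top-priority objective.

    Assumption: tol is a relative degradation allowed on that objective.
    """
    second = 1 - first
    best_first = min(p[first] for p in points.values())
    limit = best_first * (1 + tol)
    eligible = {k: p for k, p in points.items() if p[first] <= limit}
    return min(eligible, key=lambda k: eligible[k][second])


weighted_unit = pick_weighted(candidates, 1, 1)
weighted_heavy = pick_weighted(candidates, 0.1, 0.9)
hier_proximity = pick_hierarchical(candidates, first=0, tol=0.1)
hier_closeness = pick_hierarchical(candidates, first=1, tol=0.001)

print(weighted_unit, weighted_heavy, hier_proximity, hier_closeness)
```

On this candidate set, each rule selects the same solution that the corresponding tutorial run reports, which is consistent with the relative-tolerance assumption but does not prove it.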
Multiple counterfactuals¶
Now, let’s focus on generating multiple counterfactual explanations for a single input data point. The present implementation is able to generate n_cf counterfactuals simultaneously while imposing diversity constraints.
For example, we generate three counterfactuals such that at most four features can be changed, and we enforce diversity of features as a hard constraint. The diversity of features guarantees a different combination of changed features for each counterfactual.
[23]:
cf.generate(query=query, y=1, outcome_type="binary", n_cf=3, max_changes=4,
hard_constraints=["diversity_features"], time_limit=10
).display(show_only_changes=True, show_outcome=True)
[23]:
| | age | workclass | education | marital-status | occupation | race | sex | hours-per-week | outcome |
---|---|---|---|---|---|---|---|---|---|
0 | [37.50, 40.50) | - | [ Bachelors] | [ Married-AF-spouse, Married-civ-spouse] | [ ?, Armed-Forces, Farming-fishing] | - | - | - | 0.508597 |
0 | [40.50, 43.50) | - | [ Bachelors] | [ Married-AF-spouse, Married-civ-spouse] | - | - | - | [34.50, 39.50) | 0.537689 |
0 | [43.50, 49.50) | - | [ Masters, Prof-school, Doctorate] | - | [ Tech-support, Protective-serv, Prof-specia... | - | - | [39.50, 41.50) | 0.503416 |
The generated counterfactuals change the following features, listed in the same order as the rows above:
age | education | marital-status | occupation
age | education | marital-status | hours-per-week
age | education | occupation | hours-per-week
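The diversity of features can be checked directly on the displayed table. The sketch below is plain Python working on rows copied from the show_only_changes output, where "-" marks an unchanged feature; it is an illustrative check, not the constraint used inside the solver.

```python
from itertools import combinations

features = ["age", "workclass", "education", "marital-status",
            "occupation", "race", "sex", "hours-per-week"]

# Rows copied from the show_only_changes output ("-" means unchanged).
rows = [
    ["[37.50, 40.50)", "-", "[ Bachelors]",
     "[ Married-AF-spouse, Married-civ-spouse]",
     "[ ?, Armed-Forces, Farming-fishing]", "-", "-", "-"],
    ["[40.50, 43.50)", "-", "[ Bachelors]",
     "[ Married-AF-spouse, Married-civ-spouse]",
     "-", "-", "-", "[34.50, 39.50)"],
    ["[43.50, 49.50)", "-", "[ Masters, Prof-school, Doctorate]", "-",
     "[ Tech-support, Protective-serv, Prof-specia...", "-", "-",
     "[39.50, 41.50)"],
]

# Set of changed features per counterfactual.
changed = [
    frozenset(name for name, value in zip(features, row) if value != "-")
    for row in rows
]

# Every pair of counterfactuals must change a different set of features.
all_distinct = all(a != b for a, b in combinations(changed, 2))
within_max_changes = all(len(s) <= 4 for s in changed)

for s in changed:
    print(sorted(s))
print(all_distinct, within_max_changes)
```

Each counterfactual changes exactly four features, and no two counterfactuals change the same set, as required by max_changes=4 and the "diversity_features" hard constraint.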
[24]:
cf.information()
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Status : FEASIBLE
Solver statistics
Type mip
Number of variables 1041
Number of constraints 450
Objective value 43.7051
Best objective bound 38.2977
Objectives
proximity 14.0688
closeness 29.6363
Timing
Total time 10.34 sec
Fit 0.13 sec ( 1.29%)
Solver 10.19 sec ( 98.54%)
Post-processing 0.02 sec ( 0.17%)
The reported status is only feasible, not optimal. When generating multiple counterfactuals with several constraints, the time required to prove optimality might be on the order of 10 to 120 seconds, so time_limit needs to be increased. However, good feasible solutions are generally found within a few seconds. In the run below, raising time_limit by 10 seconds lets the solver prove optimality, while the proximity and closeness objective values stay the same as in the 10-second run (14.0688 and 29.6363).
[25]:
cf.generate(query=query, y=1, outcome_type="binary", n_cf=3, max_changes=4,
hard_constraints=["diversity_features"], time_limit=20
).display(show_only_changes=True, show_outcome=True)
[25]:
| | age | workclass | education | marital-status | occupation | race | sex | hours-per-week | outcome |
---|---|---|---|---|---|---|---|---|---|
0 | [37.50, 40.50) | - | [ Bachelors] | [ Married-AF-spouse, Married-civ-spouse] | [ ?, Armed-Forces, Farming-fishing] | - | - | - | 0.508597 |
0 | [40.50, 43.50) | - | [ Bachelors] | [ Married-AF-spouse, Married-civ-spouse] | - | - | - | [34.50, 39.50) | 0.537689 |
0 | [43.50, 49.50) | - | [ Masters, Prof-school, Doctorate] | - | [ Tech-support, Protective-serv, Prof-specia... | - | - | [39.50, 41.50) | 0.503416 |
[26]:
cf.information()
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Status : OPTIMAL
Solver statistics
Type mip
Number of variables 1041
Number of constraints 450
Objective value 43.7051
Best objective bound 43.7051
Objectives
proximity 14.0688
closeness 29.6363
Timing
Total time 17.37 sec
Fit 0.13 sec ( 0.77%)
Solver 17.22 sec ( 99.08%)
Post-processing 0.03 sec ( 0.15%)
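The difference between the FEASIBLE and OPTIMAL runs can be quantified with the relative gap between the objective value and the best objective bound. The sketch below uses one common definition of the gap for minimization problems; solvers differ in the exact formula, so treat relative_gap as an illustrative helper rather than what the solver reports.

```python
def relative_gap(objective, bound):
    """Relative gap between the incumbent and the best bound (one common form)."""
    return abs(objective - bound) / abs(objective)


# Values reported by cf.information() for the two runs above.
gap_10s = relative_gap(43.7051, 38.2977)  # time_limit=10, status FEASIBLE
gap_20s = relative_gap(43.7051, 43.7051)  # time_limit=20, status OPTIMAL

print(f"{gap_10s:.2%}", f"{gap_20s:.2%}")
```

The objective value is identical in both runs; only the bound moves. The extra time is spent proving that the solution found earlier is optimal.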
Although the previous counterfactuals are diverse regarding the features to change, some feature values appear several times. To guarantee unique combinations of feature changes and feature values, we add the hard constraint “diversity_values”.
[27]:
cf.generate(query=query, y=1, outcome_type="binary", n_cf=3, max_changes=4,
hard_constraints=["diversity_features", "diversity_values"], time_limit=15
).display(show_only_changes=True, show_outcome=True)
[27]:
| | age | workclass | education | marital-status | occupation | race | sex | hours-per-week | outcome |
---|---|---|---|---|---|---|---|---|---|
0 | [43.50, 49.50) | - | [ Bachelors] | - | [ Exec-managerial] | - | - | [49.50, 55.50) | 0.518267 |
0 | [49.50, 54.50) | - | [ Masters, Prof-school, Doctorate] | [ Married-spouse-absent, Widowed, Divorced] | [ Tech-support, Protective-serv, Prof-specia... | - | - | - | 0.510208 |
0 | [40.50, 43.50) | - | [ Assoc-acdm, Assoc-voc] | [ Married-AF-spouse, Married-civ-spouse] | - | - | - | [41.50, 49.50) | 0.529430 |
[28]:
cf.information()
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Status : FEASIBLE
Solver statistics
Type mip
Number of variables 1065
Number of constraints 450
Objective value 47.7067
Best objective bound 39.9925
Objectives
proximity 16.0514
closeness 31.6553
Timing
Total time 15.44 sec
Fit 0.13 sec ( 0.87%)
Solver 15.28 sec ( 98.98%)
Post-processing 0.02 sec ( 0.15%)
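With this extra constraint, the objective values increase slightly: proximity from 14.0688 to 16.0514 and closeness from 29.6363 to 31.6553. Note that the status is FEASIBLE, so optimality was not proven within the 15-second time limit.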
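The effect of “diversity_values” can be seen by counting feature values that are shared by more than one counterfactual, with and without the constraint. The sketch below works on the displayed rows, restricted to the five columns that change; repeated_values is a hypothetical helper, and the check is an illustration rather than the exact constraint imposed by the solver.

```python
from collections import Counter

# Columns: age, education, marital-status, occupation, hours-per-week.
# Rows copied from the two display outputs ("-" means unchanged).
features_only = [
    ["[37.50, 40.50)", "[ Bachelors]",
     "[ Married-AF-spouse, Married-civ-spouse]",
     "[ ?, Armed-Forces, Farming-fishing]", "-"],
    ["[40.50, 43.50)", "[ Bachelors]",
     "[ Married-AF-spouse, Married-civ-spouse]", "-", "[34.50, 39.50)"],
    ["[43.50, 49.50)", "[ Masters, Prof-school, Doctorate]", "-",
     "[ Tech-support, Protective-serv, Prof-specia...", "[39.50, 41.50)"],
]
features_and_values = [
    ["[43.50, 49.50)", "[ Bachelors]", "-", "[ Exec-managerial]",
     "[49.50, 55.50)"],
    ["[49.50, 54.50)", "[ Masters, Prof-school, Doctorate]",
     "[ Married-spouse-absent, Widowed, Divorced]",
     "[ Tech-support, Protective-serv, Prof-specia...", "-"],
    ["[40.50, 43.50)", "[ Assoc-acdm, Assoc-voc]",
     "[ Married-AF-spouse, Married-civ-spouse]", "-", "[41.50, 49.50)"],
]


def repeated_values(rows):
    """Hypothetical helper: (column index, value) pairs used by several rows."""
    counts = Counter(
        (i, value) for row in rows for i, value in enumerate(row) if value != "-"
    )
    return sorted(pair for pair, n in counts.items() if n > 1)


repeats_before = repeated_values(features_only)
repeats_after = repeated_values(features_and_values)

print(repeats_before)
print(repeats_after)
```

Without the constraint, the education and marital-status values are each shared by two counterfactuals; with it, no value is repeated in the displayed solution.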
Generating counterfactual explanations - probability outcome¶
In some situations, having an outcome = 1 with a probability outcome > 0.5 might not be sufficient. For example, an algorithm might require a probability of 0.7 for acceptance.
[29]:
df_query = pd.DataFrame([query],columns=query.keys())
[30]:
scorecard.predict_proba(df_query)
[30]:
array([[9.99831890e-01, 1.68109773e-04]])
Initially, the probability of the query is close to 0. When outcome_type="probability", at least one additional constraint must be provided. The supported constraints are:
soft constraints: “diff_outcome”.
hard constraints: “min_outcome” and “max_outcome”.
In this first try, we aim to obtain a counterfactual with an outcome probability as close as possible to 0.7. The weighted approach is used, assigning unit weights to all objectives.
[31]:
cf.generate(query=df_query, y=0.7, outcome_type="probability", n_cf=2, max_changes=4,
hard_constraints=["diversity_features"],
soft_constraints={"diff_outcome": 1},
).display(show_only_changes=True, show_outcome=True)
[31]:
| | age | workclass | education | marital-status | occupation | race | sex | hours-per-week | outcome |
---|---|---|---|---|---|---|---|---|---|
0 | [23.50, 25.50) | - | [ HS-grad] | - | [ ?, Armed-Forces, Farming-fishing] | - | - | - | 0.007168 |
0 | [23.50, 25.50) | [ Private] | [ HS-grad] | - | [ ?, Armed-Forces, Farming-fishing] | - | - | - | 0.007168 |
A weight of 1 is not large enough relative to the other objective values: both counterfactuals have an outcome of about 0.007, far from the target of 0.7. A workaround is to increase the corresponding weight.
[32]:
cf.generate(query=df_query, y=0.7, outcome_type="probability", n_cf=2, max_changes=4,
hard_constraints=["diversity_features"],
soft_constraints={"diff_outcome": 100}
).display(show_only_changes=True, show_outcome=True)
[32]:
| | age | workclass | education | marital-status | occupation | race | sex | hours-per-week | outcome |
---|---|---|---|---|---|---|---|---|---|
0 | [40.50, 43.50) | - | [ Masters, Prof-school, Doctorate] | [ Married-AF-spouse, Married-civ-spouse] | - | - | - | [34.50, 39.50) | 0.697510 |
0 | [43.50, 49.50) | - | [ Masters, Prof-school, Doctorate] | [ Married-AF-spouse, Married-civ-spouse] | [ ?, Armed-Forces, Farming-fishing] | - | - | - | 0.702461 |
[33]:
cf.information()
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Status : OPTIMAL
Solver statistics
Type mip
Number of variables 521
Number of constraints 311
Objective value 32.4569
Best objective bound 32.4569
Objectives
proximity 9.1380
closeness 22.6536
diff_outcome 0.0067
Timing
Total time 2.35 sec
Fit 0.13 sec ( 5.70%)
Solver 2.20 sec ( 93.57%)
Post-processing 0.02 sec ( 0.78%)
We might, however, want to guarantee a minimum probability outcome of 0.7. To do so, we add the hard constraint “min_outcome”.
[34]:
cf.generate(query=df_query, y=0.7, outcome_type="probability", n_cf=2, max_changes=4,
hard_constraints=["diversity_features", "min_outcome"]
).display(show_only_changes=True, show_outcome=True)
[34]:
| | age | workclass | education | marital-status | occupation | race | sex | hours-per-week | outcome |
---|---|---|---|---|---|---|---|---|---|
0 | [49.50, 54.50) | - | [ Masters, Prof-school, Doctorate] | [ Married-AF-spouse, Married-civ-spouse] | - | - | - | [34.50, 39.50) | 0.733405 |
0 | [49.50, 54.50) | - | [ Masters, Prof-school, Doctorate] | [ Married-AF-spouse, Married-civ-spouse] | [ ?, Armed-Forces, Farming-fishing] | - | - | - | 0.715094 |
[35]:
cf.information()
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Status : OPTIMAL
Solver statistics
Type mip
Number of variables 521
Number of constraints 311
Objective value 31.7672
Best objective bound 31.7672
Objectives
proximity 9.2342
closeness 22.5330
Timing
Total time 1.87 sec
Fit 0.13 sec ( 7.17%)
Solver 1.72 sec ( 92.16%)
Post-processing 0.01 sec ( 0.72%)
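The trade-off between the soft constraint “diff_outcome” and the hard constraint “min_outcome” can be read off the displayed outcomes. The sketch below compares the two runs in plain Python; meets_minimum and total_deviation are hypothetical helpers, and total_deviation is only an illustrative measure that is not claimed to match the library's diff_outcome objective.

```python
target = 0.7

# Outcomes copied from the display outputs above.
soft_outcomes = [0.697510, 0.702461]  # soft constraint diff_outcome, weight 100
hard_outcomes = [0.733405, 0.715094]  # hard constraint min_outcome


def meets_minimum(outcomes, minimum):
    """True if every counterfactual reaches the required probability."""
    return all(p >= minimum for p in outcomes)


def total_deviation(outcomes, target):
    """Sum of absolute distances to the target probability (illustrative)."""
    return sum(abs(p - target) for p in outcomes)


soft_ok = meets_minimum(soft_outcomes, target)
hard_ok = meets_minimum(hard_outcomes, target)
soft_dev = total_deviation(soft_outcomes, target)
hard_dev = total_deviation(hard_outcomes, target)

print(soft_ok, round(soft_dev, 6))
print(hard_ok, round(hard_dev, 6))
```

The soft constraint lands closer to 0.7 but lets one counterfactual fall just below it, whereas the hard constraint guarantees the threshold at the cost of overshooting the target.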
Moreover, it is possible to combine the previous constraints with diversity constraints, here passed as soft constraints under the hierarchical method.
[36]:
cf.generate(query=df_query, y=0.7, outcome_type="probability", n_cf=2,
max_changes=4, method="hierarchical",
objectives={"proximity": 2, "closeness": 1},
hard_constraints=["min_outcome"],
soft_constraints={"diversity_features": 2, "diversity_values": 1}
).display(show_only_changes=True, show_outcome=True)
[36]:
| | age | workclass | education | marital-status | occupation | race | sex | hours-per-week | outcome |
---|---|---|---|---|---|---|---|---|---|
0 | [49.50, 54.50) | - | [ Bachelors] | [ Married-AF-spouse, Married-civ-spouse] | - | - | - | [55.50, inf) | 0.735013 |
0 | [43.50, 49.50) | - | [ Masters, Prof-school, Doctorate] | [ Married-AF-spouse, Married-civ-spouse] | [ Machine-op-inspct] | - | - | - | 0.719537 |
[37]:
cf.information()
optbinning (Version 0.19.0)
Copyright (c) 2019-2024 Guillermo Navas-Palencia, Apache License 2.0
Status : OPTIMAL
Solver statistics
Type mip
Number of variables 526
Number of constraints 317
Objective value -6.0000
Best objective bound -6.0000
Objectives
proximity 9.9758
diversity_features 2.0000
closeness 23.2287
diversity_values 6.0000
Timing
Total time 5.59 sec
Fit 0.13 sec ( 2.39%)
Solver 5.45 sec ( 97.37%)
Post-processing 0.01 sec ( 0.24%)
Special and missing bins¶
Scorecard models might have bins for special and missing values. The parameter special_missing allows these bins to be included as feasible solutions when generating counterfactual explanations.
[38]:
cf = Counterfactual(scorecard=scorecard, special_missing=True)
[39]:
cf.fit(X)
[39]:
Counterfactual(scorecard=Scorecard(binning_process=BinningProcess(variable_names=['age',
'workclass',
'education',
'marital-status',
'occupation',
'race',
'sex',
'hours-per-week']),
estimator=LogisticRegression(class_weight='balanced'),
scaling_method='min_max',
scaling_method_params={'max': 850,
'min': 300}),
special_missing=True)
[40]:
cf.generate(query=query, y=1, outcome_type="binary", n_cf=1, max_changes=4
).display(show_only_changes=True)
[40]:
| | age | workclass | education | marital-status | occupation | race | sex | hours-per-week |
---|---|---|---|---|---|---|---|---|
0 | [49.50, 54.50) | - | [ Assoc-acdm, Assoc-voc] | [ Married-AF-spouse, Married-civ-spouse] | - | - | - | Special |