Documentation

kuplift.bayesian_decision_tree module

class kuplift.bayesian_decision_tree.BayesianDecisionTree(control_name=None)

Bases: _Tree

The BayesianDecisionTree class implements the UB-DT algorithm described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, May). A Non-Parametric Bayesian Decision Trees for Uplift modelling. In PAKDD.

Parameters:
data: pd.DataFrame

DataFrame containing feature variables.

treatment_col: pd.Series

Treatment column.

y_col: pd.Series

Outcome column.

control_name: int or str

The name of the control value in the treatment column.

fit(data, treatment_col, y_col)

Fit an uplift decision tree model using UB-DT.

Parameters:
X_train: pd.DataFrame

DataFrame containing feature variables.

treatment_col: pd.Series

Treatment column.

y_col: pd.Series

Outcome column.

kuplift.bayesian_random_forest module

class kuplift.bayesian_random_forest.BayesianRandomForest(n_trees=10, vars_subset=False, random_state=10)

Bases: object

The BayesianRandomForest class implements the UB-RF algorithm described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, May). A Non-Parametric Bayesian Decision Trees for Uplift modelling. In PAKDD.

Parameters:
data: pd.DataFrame

DataFrame containing data.

treatment_col: pd.Series

Treatment column.

outcome_col: pd.Series

Outcome column.

n_trees: int, default 10

Number of trees in the forest.

vars_subset: bool, default False

Use a random subset of the variables for each tree in the forest.

random_state: int, default 10

Seed used by the random number generator.

fit(data, treatment_col, y_col)

Fit an uplift random forest model using UB-RF.

predict(X_test, weighted_average=False)

Predict the uplift value for each example in X_test.

Parameters:
X_test: pd.DataFrame

DataFrame containing test data.

weighted_average: bool, default False

Weight the predictions of each tree according to its cost.

Returns:
y_pred_list: ndarray, shape=(num_samples, 1)

An array containing the predicted uplift for each sample.
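
To illustrate what `weighted_average` changes, the sketch below combines per-tree uplift predictions first with a plain mean and then with a weighted mean. It is plain Python and does not call kuplift: the function `combine_tree_predictions` and the example weights are hypothetical, and how kuplift derives a weight from a tree's cost is not specified in this reference.

```python
def combine_tree_predictions(per_tree_preds, weights=None):
    """Average per-tree uplift predictions, sample by sample.

    per_tree_preds: list of lists, one inner list of predictions per tree.
    weights: optional list with one non-negative weight per tree
             (hypothetical; stands in for a cost-derived weight).
    """
    n_trees = len(per_tree_preds)
    if weights is None:
        weights = [1.0] * n_trees  # plain average: every tree counts equally
    total = sum(weights)
    n_samples = len(per_tree_preds[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, per_tree_preds)) / total
        for i in range(n_samples)
    ]


# Three trees, two samples each (made-up numbers).
preds = [[0.10, 0.30], [0.20, 0.10], [0.30, 0.20]]
plain = combine_tree_predictions(preds)
weighted = combine_tree_predictions(preds, weights=[2.0, 1.0, 1.0])
```

With equal weights both samples average to the same value; giving the first tree a larger weight pulls each sample's result toward that tree's prediction.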

kuplift.feature_selection module

class kuplift.feature_selection.FeatureSelection(control_name=None)

Bases: object

The FeatureSelection implements the feature selection algorithm ‘UMODL-FS’ described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, March). A non-parametric bayesian approach for uplift discretization and feature selection. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part V (pp. 239-254). Cham: Springer Nature Switzerland.

filter(data, treatment_col, y_col, parallelized=False, num_processes=5)

This function runs the feature selection algorithm ‘UMODL-FS’, ranking variables based on their importance in the given data.

Parameters:
data: pd.DataFrame

DataFrame containing feature variables.

treatment_col: pd.Series

Treatment column.

y_col: pd.Series

Outcome column.

parallelized: bool, default False

Whether to run the code on several processes.

num_processes: int, default 5

Number of processes to use in parallel. Only used when ‘parallelized’ is True.

Returns:
Python dictionary

Variable names mapped to their importance values (sorted).

get_features_importance_details()

After running the feature selection, this function gives access to the details of each feature: how it was discretized, the intervals, and the outcome densities in each interval.

kuplift.univariate_encoding module

class kuplift.univariate_encoding.UnivariateEncoding(control_name=None)

Bases: object

The UnivariateEncoding class implements the UMODL algorithm for uplift data encoding described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, March). A non-parametric bayesian approach for uplift discretization and feature selection. In ECML PKDD 2022.

fit(data, treatment_col, y_col, parallelized=False, num_processes=5)

fit() learns a discretisation model using the UMODL approach.

Parameters:
data: pd.DataFrame

DataFrame containing feature variables.

treatment_col: pd.Series

Treatment column.

y_col: pd.Series

Outcome column.

parallelized: bool, default False

Whether to run the code on several processes.

num_processes: int, default 5

Number of processes to use in parallel.

fit_transform(data, treatment_col, y_col, parallelized=False, num_processes=5)

fit_transform() learns a discretisation model using UMODL and transforms the data.

Parameters:
data: pd.DataFrame

DataFrame containing feature variables.

treatment_col: pd.Series

Treatment column.

y_col: pd.Series

Outcome column.

parallelized: bool, default False

Whether to run the code on several processes.

num_processes: int, default 5

Number of processes to use in parallel.

Returns:
pd.DataFrame

Pandas DataFrame that contains encoded data.

get_features_importance_details()
transform(data)

transform() applies the discretisation model learned by the fit() method.

Parameters:
data: pd.DataFrame

DataFrame containing feature variables.

Returns:
pd.DataFrame

Pandas DataFrame that contains encoded data.

kuplift.optimized_univariate_encoding module

Optimized Univariate Encoding

This module contains everything needed to perform optimized univariate variable transformation using the C++ implementation of ‘umodl’. It calls the ‘umodl’ executable as a subprocess, indirectly, through the ‘umodl’ library.

The main class of this module is ‘OptimizedUnivariateEncoding’.

Example code is available in examples/optimized_univariate_encoding.py.

class kuplift.optimized_univariate_encoding.Interval(lower: float | None = None, upper: float | None = None)

Bases: object

property catches_missing
lower: float | None = None
upper: float | None = None
class kuplift.optimized_univariate_encoding.IntervalPartition(intervals: Sequence[Interval])

Bases: Partition

Partition of type ‘intervals’.

Attributes:
intervals: Sequence[Interval]

The intervals. Each interval is a pair defining its lower bound and its upper bound (in that order). The exception to this rule is the empty interval representing ‘MISSING’ values. If present, it must be the first interval of the sequence.
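
The layout described above (an optional ‘MISSING’ interval first, then bounded intervals) can be sketched in plain Python. This is only an illustration and does not use the kuplift classes: `SketchInterval` and `interval_index` are hypothetical names, and the sketch assumes that an interval with no bounds is the ‘MISSING’ interval and that bounds are half-open (lower < value <= upper), neither of which is confirmed by this reference.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SketchInterval:
    """Hypothetical stand-in for Interval; not the kuplift class."""
    lower: Optional[float] = None
    upper: Optional[float] = None

    @property
    def catches_missing(self) -> bool:
        # Assumption: an interval with no bounds is the 'MISSING' interval.
        return self.lower is None and self.upper is None


def interval_index(intervals: Sequence[SketchInterval], value: Optional[float]) -> int:
    """Return the index of the interval that contains value."""
    missing = value is None or math.isnan(value)
    for i, itv in enumerate(intervals):
        if itv.catches_missing:
            if missing:
                return i
            continue
        # Assumption: bounds are half-open, lower < value <= upper.
        if not missing and itv.lower < value <= itv.upper:
            return i
    raise ValueError(f"no interval contains {value!r}")


parts = [
    SketchInterval(),                  # 'MISSING' interval, listed first
    SketchInterval(-math.inf, 10.0),
    SketchInterval(10.0, math.inf),
]
codes = [interval_index(parts, v) for v in (None, 3.5, 10.0, 42.0)]
```

Here each value is encoded as the position of its interval in the sequence, so missing values map to index 0 because the ‘MISSING’ interval comes first.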

property parts
transform(col)
transform_elem(elem)
class kuplift.optimized_univariate_encoding.OptimizedUnivariateEncoding

Bases: object

The OptimizedUnivariateEncoding class makes use of the external umodl tool hosted at https://github.com/UData-Orange/umodl.

Attributes:
model: dict mapping str to ValGrpPartition or IntervalPartition

The model generated by the ‘umodl’ executable. It describes the partitioning of values of informative variables into groups or intervals. It maps the informative variable names to value partitions.

levels: list of (str, float) pairs

(variable-name, variable-level) pairs in decreasing level order.

variable_cols: DataFrame

The data columns of all variables. This means all the data from the dataset but the treatment and target columns.

treatment_col: Series

The treatment column from the dataset.

target_col: Series

The target column from the dataset.

fit(data, treatment_col, target_col, maxpartnumber=None)

Learn a discretisation model using UMODL.

Parameters:
data: pd.DataFrame

Dataframe containing feature variables. Categorical variables should have the object dtype, otherwise they are processed as numerical variables.

treatment_col: pd.Series

Treatment column.

target_col: pd.Series

Outcome column.

maxpartnumber: int, default=None

The maximum number of intervals or groups. None means that the default of the ‘umodl’ program is used.

fit_transform(data, treatment_col, target_col, maxpartnumber=None)

Learn a discretisation model using UMODL and transform the data.

Parameters:
data: pd.DataFrame

Dataframe containing feature variables. Categorical variables should have the object dtype, otherwise they are processed as numerical variables.

treatment_col: pd.Series

Treatment column.

target_col: pd.Series

Outcome column.

maxpartnumber: int, default=None

The maximum number of intervals or groups. None means that the default of the ‘umodl’ program is used.

Returns:
pd.DataFrame

Pandas DataFrame that contains encoded data.

get_level(variable)

Get the level of a single variable.

Parameters:
variable: str

The variable to get the level from.

Returns:
float

The level of the specified variable.

get_levels()

Get the level of all variables.

Returns:
list[tuple[str, float]]

(variable-name, variable-level) pairs in decreasing level order.

get_partition(variable)

Get the partition corresponding to a single variable of the model.

Parameters:
variable: str

The variable name.

Returns:
ValGrpPartition | IntervalPartition

The partition corresponding to a single variable of the model.

get_target_frequencies(variable)

Get the frequencies for each (target, treatment) pair.

The frequencies are computed for a single variable.

Parameters:
variable: str

The variable name.

Returns:
pd.DataFrame
The frequencies as a DataFrame containing:
  • A column named ‘Part’ listing all the parts of the variable.

  • One column per (target, treatment) pair.

get_target_probabilities(variable)

Get the probabilities P(target|treatment) for each (target, treatment) pair.

The probabilities are computed for a single variable.

Parameters:
variable: str

The variable name.

Returns:
pd.DataFrame
The probabilities as a DataFrame containing:
  • A column named ‘Part’ listing all the parts of the variable.

  • One column per (target, treatment) pair.
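
As a rough illustration of how P(target|treatment) relates to the frequencies, the sketch below normalizes (target, treatment) counts within one part by the total count of each treatment. It is plain Python and does not call kuplift: the function name `target_probabilities` and the counts are made up, and normalizing within each part is an assumption about how the probabilities are computed.

```python
from collections import defaultdict


def target_probabilities(freqs):
    """Turn (target, treatment) counts into P(target | treatment).

    freqs: dict mapping (target, treatment) -> count, for one part.
    """
    # Total number of rows observed for each treatment in this part.
    per_treatment = defaultdict(int)
    for (_target, treatment), count in freqs.items():
        per_treatment[treatment] += count
    # Divide each count by its treatment total; skip empty treatments.
    return {
        (target, treatment): count / per_treatment[treatment]
        for (target, treatment), count in freqs.items()
        if per_treatment[treatment] > 0
    }


# Made-up counts for one part of one variable.
part_freqs = {
    (0, "control"): 80, (1, "control"): 20,
    (0, "treated"): 60, (1, "treated"): 40,
}
part_probs = target_probabilities(part_freqs)
```

For each treatment, the resulting probabilities over all targets sum to 1.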

get_uplift(reftarget, reftreatment, variable)

Get the uplift for a single variable.

See explanations of the computations in the ‘Returns’ section below.

Parameters:
reftarget

The reference target.

reftreatment

The reference treatment to which all the other treatments are compared.

variable: str

The name of the variable.

Returns:
pd.DataFrame
A DataFrame containing:
  • A column named ‘Part’ listing all the parts of the variable.

  • One column per treatment other than the reference treatment. Each column gives the difference P(reftarget|treatment) - P(reftarget|reftreatment), that is, the gain (or loss) in the probability of obtaining ‘reftarget’ as the outcome under the column’s treatment compared to the reference treatment.
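
The difference P(reftarget|treatment) - P(reftarget|reftreatment) can be worked through by hand for a single part. The sketch below is plain Python and does not call kuplift: `uplift_for_part` is a hypothetical helper and the probabilities are made-up numbers.

```python
def uplift_for_part(probs, reftarget, reftreatment):
    """Compute P(reftarget | t) - P(reftarget | reftreatment) for each t.

    probs: dict mapping (target, treatment) -> P(target | treatment)
           for a single part of a variable.
    """
    baseline = probs[(reftarget, reftreatment)]
    return {
        treatment: p - baseline
        for (target, treatment), p in probs.items()
        if target == reftarget and treatment != reftreatment
    }


# Made-up probabilities for one part, with two non-reference treatments.
probs = {
    (1, "control"): 0.20, (0, "control"): 0.80,
    (1, "email"): 0.35, (0, "email"): 0.65,
    (1, "phone"): 0.15, (0, "phone"): 0.85,
}
uplift = uplift_for_part(probs, reftarget=1, reftreatment="control")
```

In this made-up part, "email" has a positive uplift over "control" (0.35 against 0.20) and "phone" a negative one (0.15 against 0.20).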

property informative_input_variables

list of str

The names of the informative variables.

property input_variables

list of str

The names of the variables.

levels: list[tuple[str, float]]
model: dict[str, ValGrpPartition | IntervalPartition]
property noninformative_input_variables

list of str

The names of the non-informative variables.

property target_modalities

list

All the different targets from the dataset.

property target_name

str

The name of the target column.

property target_treatment_pairs

list of TargetTreatmentPair

All (target, treatment) pairs.

transform(data)

Apply the discretisation model learned by the fit() method.

Parameters:
data: pd.DataFrame

Dataframe containing feature variables.

Returns:
pd.DataFrame

Pandas Dataframe that contains encoded data.

property treatment_modalities

list

All the different treatments from the dataset.

property treatment_name

str

The name of the treatment column.

class kuplift.optimized_univariate_encoding.Partition

Bases: ABC

abstract property parts
class kuplift.optimized_univariate_encoding.TargetTreatmentPair(target: object, treatment: object)

Bases: object

Target-treatment pair.

Used to identify both a target and a treatment. This class only exists for the purpose of formatting.

target: object
treatment: object
class kuplift.optimized_univariate_encoding.ValGrp(lst)

Bases: object

class kuplift.optimized_univariate_encoding.ValGrpPartition(groups: Sequence[ValGrp], defaultgroupindex: int)

Bases: Partition

Partition of type ‘value groups’.

Attributes:
groups: Sequence[ValGrp]

The groups. Each group is an iterable of its values.

defaultgroupindex: int

The group index assigned to transformed elements that do not explicitly appear in any group.
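
The role of `defaultgroupindex` can be sketched as follows. This is plain Python, not the kuplift implementation: `group_index` is a hypothetical helper, and groups are represented here as simple lists of values.

```python
def group_index(groups, defaultgroupindex, elem):
    """Return the index of the group listing elem, else the default index."""
    for i, group in enumerate(groups):
        if elem in group:
            return i
    # elem appears in no group: fall back to the default group.
    return defaultgroupindex


# Made-up value groups for a categorical variable.
groups = [["red", "orange"], ["blue", "green"], ["black"]]
codes = [group_index(groups, 2, v) for v in ("blue", "red", "purple")]
```

The value "purple" is not listed in any group, so it receives the default group index (2 in this example).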

property parts
transform(col)
transform_elem(elem)