Documentation¶

kuplift.bayesian_decision_tree module¶

class kuplift.bayesian_decision_tree.BayesianDecisionTree(control_name=None)¶

Bases: _Tree

The BayesianDecisionTree class implements the UB-DT algorithm described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, May). A Non-Parametric Bayesian Decision Trees for Uplift modelling. In PAKDD.

Parameters:

data: pd.DataFrame: Dataframe containing feature variables.
treatment_col: pd.Series: Treatment column.
y_col: pd.Series: Outcome column.
control_name: int or str: The name of the control value in the treatment column.

fit(data, treatment_col, y_col)¶

Fit an uplift decision tree model using UB-DT

Parameters:

data: pd.DataFrame: Dataframe containing feature variables.
treatment_col: pd.Series: Treatment column.
y_col: pd.Series: Outcome column.

kuplift.bayesian_random_forest module¶

class kuplift.bayesian_random_forest.BayesianRandomForest(n_trees=10, vars_subset=False, random_state=10)¶

Bases: object

The BayesianRandomForest class implements the UB-RF algorithm described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, May). A Non-Parametric Bayesian Decision Trees for Uplift modelling. In PAKDD.

Parameters:

data: pd.DataFrame: Dataframe containing data.
treatment_col: pd.Series: Treatment column.
outcome_col: pd.Series: Outcome column.
n_trees: int, default 10: Number of trees in a forest.
vars_subset: bool, default False: Use a random subset of the variables for each tree in the forest.
random_state: int, default 10: Seed used by the random number generator.

fit(data, treatment_col, y_col)¶: Fit a decision tree algorithm.

predict(X_test, weighted_average=False)¶

Predict the uplift value for each example in X_test.

Parameters:

X_test: pd.DataFrame: Dataframe containing test data.
weighted_average: bool, default False: Weight the predictions of each tree according to its cost.

Returns:

y_pred_list: ndarray, shape=(num_samples, 1): An array containing the predicted uplift for each sample.
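To illustrate what the weighted_average option changes, here is a minimal sketch of combining per-tree uplift predictions with and without weights. It does not use kuplift: the function average_uplift and the example weights are hypothetical, and how the library derives a weight from a tree's cost is not specified on this page.

```python
def average_uplift(tree_predictions, weights=None):
    """Combine per-tree uplift predictions into one value per sample.

    tree_predictions: list of lists, one inner list per tree.
    weights: optional per-tree weights (hypothetical values; the way
             kuplift maps a tree's cost to a weight is an assumption
             left open here).
    """
    n_trees = len(tree_predictions)
    if weights is None:
        weights = [1.0] * n_trees  # plain, unweighted average
    total = sum(weights)
    n_samples = len(tree_predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, tree_predictions)) / total
        for i in range(n_samples)
    ]


# Three trees, two samples each.
preds = [[0.10, 0.30], [0.20, 0.10], [0.30, 0.20]]
plain = average_uplift(preds)
weighted = average_uplift(preds, weights=[2.0, 1.0, 1.0])
```

With equal weights every tree contributes equally; with unequal weights the first tree's predictions pull the result toward its own values.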

kuplift.feature_selection module¶

class kuplift.feature_selection.FeatureSelection(control_name=None)¶

Bases: object

The FeatureSelection class implements the feature selection algorithm ‘UMODL-FS’ described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, March). A non-parametric bayesian approach for uplift discretization and feature selection. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part V (pp. 239-254). Cham: Springer Nature Switzerland.

filter(data, treatment_col, y_col, parallelized=False, num_processes=5)¶

This function runs the feature selection algorithm ‘UMODL-FS’, ranking variables based on their importance in the given data.

Parameters:

data: pd.DataFrame: Dataframe containing feature variables.
treatment_col: pd.Series: Treatment column.
y_col: pd.Series: Outcome column.
parallelized: bool, default False: Whether to run the code on several processes.
num_processes: int, default 5: Number of processes to use in parallel. Only used when ‘parallelized’ is True.

Returns:

dict: Variable names and their corresponding importance values (sorted).

get_features_importance_details()¶: After the feature selection has been run, this function provides the details of each feature: how it was discretized, the resulting intervals, and the outcome densities in each interval.

kuplift.univariate_encoding module¶

class kuplift.univariate_encoding.UnivariateEncoding(control_name=None)¶

Bases: object

The UnivariateEncoding class implements the UMODL algorithm for uplift data encoding described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, March). A non-parametric bayesian approach for uplift discretization and feature selection. ECML PKDD

fit(data, treatment_col, y_col, parallelized=False, num_processes=5)¶

fit() learns a discretisation model using the UMODL approach.

Parameters:

data: pd.DataFrame: Dataframe containing feature variables.
treatment_col: pd.Series: Treatment column.
y_col: pd.Series: Outcome column.
parallelized: bool, default False: Whether to run the code on several processes.
num_processes: int, default 5: Number of processes to use in parallel.

fit_transform(data, treatment_col, y_col, parallelized=False, num_processes=5)¶

fit_transform() learns a discretisation model using UMODL and transforms the data.

Parameters:

data: pd.DataFrame: Dataframe containing feature variables.
treatment_col: pd.Series: Treatment column.
y_col: pd.Series: Outcome column.
parallelized: bool, default False: Whether to run the code on several processes.
num_processes: int, default 5: Number of processes to use in parallel.

Returns:

pd.DataFrame: Pandas DataFrame that contains encoded data.

get_features_importance_details()¶

transform(data)¶

transform() applies the discretisation model learned by the fit() method.

Parameters:

data: pd.DataFrame: Dataframe containing feature variables.

Returns:

pd.DataFrame: Pandas DataFrame that contains encoded data.

kuplift.optimized_univariate_encoding module¶

Optimized Univariate Encoding

This module contains everything needed to perform an optimized univariate variable transformation using the C++ implementation of ‘umodl’. It calls the ‘umodl’ executable as a subprocess, indirectly, through the ‘umodl’ library.

The main class of this module is ‘OptimizedUnivariateEncoding’.

Example code is available in examples/optimized_univariate_encoding.py.

class kuplift.optimized_univariate_encoding.Interval(lower: float | None = None, upper: float | None = None)¶

Bases: object

property catches_missing¶

lower: float | None = None¶

upper: float | None = None¶

class kuplift.optimized_univariate_encoding.IntervalPartition(intervals: Sequence[Interval])¶

Bases: Partition

Partition of type ‘intervals’.

Attributes:

intervals: Sequence[Interval]: The intervals. Each interval is a pair defining its lower bound and its upper bound (in that order). The exception to this rule is the empty interval representing ‘MISSING’ values. If present, it must be the first interval of the sequence.

property parts¶

transform(col)¶

transform_elem(elem)¶
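The interval layout described above (an optional ‘MISSING’ interval first, then bounded intervals) can be sketched in plain Python. This is an illustration, not the kuplift implementation: SketchInterval and interval_index are hypothetical names, and the sketch assumes that the interval with no bounds is the ‘MISSING’ one and that bounds are half-open (lower < value <= upper), neither of which is stated on this page.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SketchInterval:
    """Hypothetical stand-in for an interval with optional bounds."""
    lower: Optional[float] = None
    upper: Optional[float] = None

    @property
    def catches_missing(self) -> bool:
        # Assumption: an interval without any bound is the MISSING interval.
        return self.lower is None and self.upper is None


def interval_index(intervals: Sequence[SketchInterval], value) -> int:
    """Return the index of the interval that contains value."""
    missing = value is None or (isinstance(value, float) and math.isnan(value))
    for i, interval in enumerate(intervals):
        if interval.catches_missing:
            if missing:
                return i
            continue
        # Assumption: half-open bounds, lower < value <= upper.
        if not missing and interval.lower < value <= interval.upper:
            return i
    raise ValueError(f"no interval contains {value!r}")


# The MISSING interval comes first, as the attribute description requires.
intervals = [
    SketchInterval(),
    SketchInterval(-math.inf, 10.0),
    SketchInterval(10.0, math.inf),
]
codes = [interval_index(intervals, v) for v in [3.5, 10.0, 42, None, float("nan")]]
```

Each value is encoded as the position of its interval, so missing values map to index 0 in this layout.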

class kuplift.optimized_univariate_encoding.OptimizedUnivariateEncoding¶

Bases: object

The OptimizedUnivariateEncoding class makes use of the external umodl tool hosted at https://github.com/UData-Orange/umodl.

Attributes:

model: dict mapping str to ValGrpPartition or IntervalPartition: The model generated by the ‘umodl’ executable. It describes the partitioning of values of informative variables into groups or intervals. It maps the informative variable names to value partitions.
levels: list of (str, float) pairs: (variable-name, variable-level) pairs in decreasing level order.
variable_cols: DataFrame: The data columns of all variables. This means all the data from the dataset but the treatment and target columns.
treatment_col: Series: The treatment column from the dataset.
target_col: Series: The target column from the dataset.

fit(data, treatment_col, target_col, maxpartnumber=None)¶

Learn a discretisation model using UMODL.

Parameters:

data: pd.DataFrame: Dataframe containing feature variables. Categorical variables should have the object dtype, otherwise they are processed as numerical variables.
treatment_col: pd.Series: Treatment column.
target_col: pd.Series: Outcome column.
maxpartnumber: int, default=None: The maximal number of intervals or groups. None means the default of the ‘umodl’ program is used.

fit_transform(data, treatment_col, target_col, maxpartnumber=None)¶

Learn a discretisation model using UMODL and transform the data.

Parameters:

data: pd.DataFrame: Dataframe containing feature variables. Categorical variables should have the object dtype, otherwise they are processed as numerical variables.
treatment_col: pd.Series: Treatment column.
target_col: pd.Series: Outcome column.
maxpartnumber: int, default=None: The maximal number of intervals or groups. None means the default of the ‘umodl’ program is used.

Returns:

pd.DataFrame: Pandas DataFrame that contains encoded data.

get_level(variable)¶

Get the level of a single variable.

Parameters:

variable: str: The variable to get the level from.

Returns:

float: The level of the specified variable.

get_levels()¶

Get the level of all variables.

Returns:

list[tuple[str, float]]: (variable-name, variable-level) pairs in decreasing level order.

get_partition(variable)¶

Get the partition corresponding to a single variable of the model.

Parameters:

variable: str: The variable name.

Returns:

ValGrpPartition | IntervalPartition: The partition corresponding to a single variable of the model.

get_target_frequencies(variable)¶

Get the frequencies for each (target, treatment) pair.

The frequencies are computed for a single variable.

Parameters:

variable: str: The variable name.

Returns:

pd.DataFrame

The frequencies as a Dataframe containing:

A column named ‘Part’ listing all the parts of the variable.
One column per (target, treatment) pair.

get_target_probabilities(variable)¶

Get the probabilities P(target|treatment) for each (target, treatment) pair.

The probabilities are computed for a single variable.

Parameters:

variable: str: The variable name.

Returns:

pd.DataFrame

The probabilities as a Dataframe containing:

A column named ‘Part’ listing all the parts of the variable.
One column per (target, treatment) pair.
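As a worked example of how P(target|treatment) relates to the frequencies, the sketch below normalizes the counts of one part by the total count of each treatment in that part. It is plain Python, not the kuplift code: target_probabilities and the counts are hypothetical, and normalizing within each part is an assumption about how the table is built.

```python
def target_probabilities(freqs):
    """Turn (target, treatment) counts of one part into P(target | treatment).

    freqs: dict mapping (target, treatment) -> count.
    Assumption: probabilities are normalized per treatment within the part.
    """
    totals = {}
    for (_target, treatment), count in freqs.items():
        totals[treatment] = totals.get(treatment, 0) + count
    return {
        (target, treatment): count / totals[treatment]
        for (target, treatment), count in freqs.items()
    }


# Hypothetical counts for a single part of one variable.
part_freqs = {
    (0, "control"): 80,
    (1, "control"): 20,
    (0, "treated"): 60,
    (1, "treated"): 40,
}
probs = target_probabilities(part_freqs)
```

For each treatment, the probabilities over all targets sum to 1.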

get_uplift(reftarget, reftreatment, variable)¶

Get the uplift for a single variable.

See explanations of the computations in the ‘Returns’ section below.

Parameters:

reftarget: The reference target.
reftreatment: The reference treatment to which all the other treatments are compared.
variable: str: The name of the variable.

Returns:

pd.DataFrame

A Dataframe containing:

A column named ‘Part’ listing all the parts of the variable.
One column per treatment other than the reference treatment. Each column gives the difference P(reftarget|treatment) - P(reftarget|reftreatment), that is, the gain (or loss) in the probability of obtaining ‘reftarget’ as the outcome under the column’s treatment compared to the reference treatment.
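The difference P(reftarget|treatment) - P(reftarget|reftreatment) can be illustrated for a single part with a few lines of plain Python. The function uplift_for_part and the probabilities are hypothetical and only mirror the formula; they are not the kuplift implementation.

```python
def uplift_for_part(probs, reftarget, reftreatment):
    """Compute the uplift of every non-reference treatment for one part.

    probs: dict mapping (target, treatment) -> P(target | treatment).
    Returns a dict mapping treatment -> uplift relative to reftreatment.
    """
    baseline = probs[(reftarget, reftreatment)]
    return {
        treatment: p - baseline
        for (target, treatment), p in probs.items()
        if target == reftarget and treatment != reftreatment
    }


# Hypothetical probabilities for one part, with two treatments and a control.
probs = {
    (1, "control"): 0.20, (0, "control"): 0.80,
    (1, "t1"): 0.40, (0, "t1"): 0.60,
    (1, "t2"): 0.15, (0, "t2"): 0.85,
}
result = uplift_for_part(probs, reftarget=1, reftreatment="control")
```

A positive value means the treatment raises the probability of the reference target in that part; a negative value means it lowers it.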

property informative_input_variables¶

list of str

The names of the informative variables.

property input_variables¶

list of str

The names of the variables.

levels: list[tuple[str, float]]¶

model: dict[str, ValGrpPartition | IntervalPartition]¶

property noninformative_input_variables¶

list of str

The names of the non-informative variables.

property target_modalities¶

list

All the different targets from the dataset.

property target_name¶

str

The name of the target column.

property target_treatment_pairs¶

list of TargetTreatmentPair

All (target, treatment) pairs.

transform(data)¶

Apply the discretisation model learned by the fit() method.

Parameters:

data: pd.DataFrame: Dataframe containing feature variables.

Returns:

pd.DataFrame: Pandas Dataframe that contains encoded data.

property treatment_modalities¶

list

All the different treatments from the dataset.

property treatment_name¶

str

The name of the treatment column.

class kuplift.optimized_univariate_encoding.Partition¶

Bases: ABC

abstract property parts¶

class kuplift.optimized_univariate_encoding.TargetTreatmentPair(target: object, treatment: object)¶

Bases: object

Target-treatment pair.

Used to identify both a target and a treatment. This class only exists for the purpose of formatting.

target: object¶

treatment: object¶

class kuplift.optimized_univariate_encoding.ValGrp(lst)¶: Bases: object

class kuplift.optimized_univariate_encoding.ValGrpPartition(groups: Sequence[ValGrp], defaultgroupindex: int)¶

Bases: Partition

Partition of type ‘value groups’.

Attributes:

groups: Sequence[ValGrp]: The groups. Each group is an iterable of its values.
defaultgroupindex: int: The index of the group assigned to transformed elements that do not explicitly appear in any group.

property parts¶

transform(col)¶

transform_elem(elem)¶
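The role of defaultgroupindex can be shown with a minimal sketch in plain Python. The function group_index and the example groups are hypothetical; the sketch only assumes, as the attribute descriptions state, that each group is an iterable of values and that unseen values fall back to the default group.

```python
def group_index(groups, default_group_index, value):
    """Return the index of the group containing value, or the default index."""
    for i, group in enumerate(groups):
        if value in group:
            return i
    # The value does not explicitly appear in any group.
    return default_group_index


# Hypothetical value groups for a categorical variable.
groups = [["red", "orange"], ["blue"], ["green", "grey"]]
codes = [group_index(groups, 2, v) for v in ["blue", "red", "purple"]]
```

Here "purple" belongs to no group, so it is encoded with the default group index 2.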