Documentation

kuplift.bayesian_decision_tree module

class kuplift.bayesian_decision_tree.BayesianDecisionTree(control_name=None)

Bases: _Tree

The BayesianDecisionTree class implements the UB-DT algorithm described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, May). A Non-Parametric Bayesian Decision Trees for Uplift modelling. In PAKDD.

Parameters:
data: pd.DataFrame

Dataframe containing feature variables.

treatment_col: pd.Series

Treatment column.

y_col: pd.Series

Outcome column.

control_name: int or str

The name of the control value in the treatment column.

fit(data, treatment_col, y_col)

Fit an uplift decision tree model using UB-DT.

Parameters:
data: pd.DataFrame

Dataframe containing feature variables.

treatment_col: pd.Series

Treatment column.

y_col: pd.Series

Outcome column.

kuplift.bayesian_random_forest module

class kuplift.bayesian_random_forest.BayesianRandomForest(n_trees=10, vars_subset=False, random_state=10)

Bases: object

The BayesianRandomForest class implements the UB-RF algorithm described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, May). A Non-Parametric Bayesian Decision Trees for Uplift modelling. In PAKDD.

Parameters:
data: pd.DataFrame

Dataframe containing data.

treatment_col: pd.Series

Treatment column.

outcome_col: pd.Series

Outcome column.

n_trees: int, default 10

Number of trees in a forest.

vars_subset: bool, default False

Use a random subset of the variables for each tree in the forest.

random_state: int, default 10

Seed used by the random number generator.

fit(data, treatment_col, y_col)

Fit the random forest of uplift decision trees (UB-RF) to the training data.

predict(X_test, weighted_average=False)

Predict the uplift value for each example in X_test.

Parameters:
X_test: pd.DataFrame

Dataframe containing test data.

weighted_average: bool, default False

Weight the predictions of each tree according to its cost.

Returns:
y_pred_list: ndarray, shape=(num_samples, 1)

An array containing the predicted uplift for each sample.
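As a rough illustration of what weighted_average changes, the sketch below averages per-tree uplift predictions with and without weights. The function name combine_tree_predictions and the weights are hypothetical: how kuplift derives a weight from a tree's cost is not described here, so the weights are simply taken as given.

```python
def combine_tree_predictions(per_tree_uplift, weights=None):
    """Average per-tree uplift predictions sample by sample.

    per_tree_uplift: one list of uplift values per tree, all of equal length.
    weights: optional non-negative weight per tree (hypothetical; the mapping
             from tree cost to weight is not specified in this documentation).
    """
    n_trees = len(per_tree_uplift)
    if weights is None:
        weights = [1.0] * n_trees  # plain, unweighted average
    total = sum(weights)
    n_samples = len(per_tree_uplift[0])
    return [
        sum(w * tree[i] for w, tree in zip(weights, per_tree_uplift)) / total
        for i in range(n_samples)
    ]


# Two trees, two samples (made-up values).
per_tree = [[0.0, 1.0], [1.0, 0.0]]
plain = combine_tree_predictions(per_tree)             # every tree counts equally
weighted = combine_tree_predictions(per_tree, [1, 3])  # second tree counts three times as much
```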

kuplift.feature_selection module

class kuplift.feature_selection.FeatureSelection(control_name=None)

Bases: object

The FeatureSelection class implements the feature selection algorithm ‘UMODL-FS’ described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, March). A non-parametric bayesian approach for uplift discretization and feature selection. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part V (pp. 239-254). Cham: Springer Nature Switzerland.

filter(data, treatment_col, y_col, parallelized=False, num_processes=5)

This function runs the feature selection algorithm ‘UMODL-FS’, ranking variables based on their importance in the given data.

Parameters:
data: pd.DataFrame

Dataframe containing feature variables.

treatment_col: pd.Series

Treatment column.

y_col: pd.Series

Outcome column.

parallelized: bool, default False

Whether to run the code on several processes.

num_processes: int, default 5

Number of processes to use in parallel; only relevant when ‘parallelized’ is True.

Returns:
dict

Variable names mapped to their importance values (sorted).
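The shape of such a result can be pictured with a plain dictionary. The sketch below is not kuplift code: rank_features and the scores are made up, and decreasing order is an assumption, since the text only says that the result is sorted.

```python
def rank_features(importance):
    """Return a dict ordered by decreasing importance.

    Python dicts preserve insertion order, so the sorted order survives.
    """
    return dict(sorted(importance.items(), key=lambda item: item[1], reverse=True))


scores = {"age": 0.02, "income": 0.15, "region": 0.0}  # made-up importance values
ranked = rank_features(scores)
top_two = list(ranked)[:2]  # names of the two highest-ranked variables
```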

get_features_importance_details()

After running the feature selection, this function gives access to the details of each feature: how it was discretized, the intervals, and the outcome densities in each interval.

kuplift.univariate_encoding module

class kuplift.univariate_encoding.UnivariateEncoding(control_name=None)

Bases: object

The UnivariateEncoding class implements the UMODL algorithm for uplift data encoding described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, March). A non-parametric bayesian approach for uplift discretization and feature selection. In ECML PKDD 2022.

fit(data, treatment_col, y_col, parallelized=False, num_processes=5)

fit() learns a discretisation model using the UMODL approach.

Parameters:
data: pd.DataFrame

Dataframe containing feature variables.

treatment_col: pd.Series

Treatment column.

y_col: pd.Series

Outcome column.

parallelized: bool, default False

Whether to run the code on several processes.

num_processes: int, default 5

Number of processes to use in parallel.

fit_transform(data, treatment_col, y_col, parallelized=False, num_processes=5)

fit_transform() learns a discretisation model using UMODL and transforms the data.

Parameters:
data: pd.DataFrame

Dataframe containing feature variables.

treatment_col: pd.Series

Treatment column.

y_col: pd.Series

Outcome column.

parallelized: bool, default False

Whether to run the code on several processes.

num_processes: int, default 5

Number of processes to use in parallel.

Returns:
pd.DataFrame

Pandas Dataframe that contains encoded data.

get_features_importance_details()

transform(data)

transform() applies the discretisation model learned by the fit() method.

Parameters:
data: pd.DataFrame

Dataframe containing feature variables.

Returns:
pd.DataFrame

Pandas Dataframe that contains encoded data.
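A minimal usage sketch of the fit()/transform() pair, under stated assumptions: the column names "treatment" and "outcome", the helper split_features and the wrapper encode are hypothetical, and only the class path and method signatures listed above are taken from this documentation.

```python
def split_features(columns, treatment_name, outcome_name):
    """Return the feature column names: everything but treatment and outcome."""
    return [c for c in columns if c not in (treatment_name, outcome_name)]


def encode(train_df, test_df, treatment_name="treatment", outcome_name="outcome"):
    """Learn the discretisation on train_df, then encode test_df with it."""
    # Imported lazily so the helper above stays usable without kuplift installed.
    from kuplift.univariate_encoding import UnivariateEncoding

    features = split_features(train_df.columns, treatment_name, outcome_name)
    encoder = UnivariateEncoding()
    encoder.fit(train_df[features], train_df[treatment_name], train_df[outcome_name])
    # transform() only needs the feature columns.
    return encoder.transform(test_df[features])
```

Fitting on one dataframe and transforming another keeps the learned intervals fixed; fit_transform() covers the case where both are the same data.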

kuplift.optimized_univariate_encoding module

Optimized Univariate Encoding

This module contains everything needed to perform an optimized univariate variable transformation, relying on the C++ implementation of ‘umodl’. It calls the ‘umodl’ executable as a subprocess, indirectly, through the ‘umodl’ library.

The main class of this module is ‘OptimizedUnivariateEncoding’.

Example code is available in examples/optimized_univariate_encoding.py.

class kuplift.optimized_univariate_encoding.OptimizedUnivariateEncoding

Bases: object

The OptimizedUnivariateEncoding class makes use of the external umodl tool hosted at https://github.com/UData-Orange/umodl.

Attributes:
model: dict mapping str to Part

The model generated by the ‘umodl’ executable. It describes the partitioning of values of informative variables into groups or intervals. It maps the informative variable names to value partitions.

levels: list of (str, float) pairs

(variable-name, variable-level) pairs in decreasing level order.

variable_cols: DataFrame

The data columns of all variables. This means all the data from the dataset but the treatment and target columns.

treatment_col: Series

The treatment column from the dataset.

target_col: Series

The target column from the dataset.

treatment_groups: dict mapping str to dict mapping Part to int

The keys are the variable names. The values are themselves dictionaries whose keys are groups or intervals and whose values are numbers.

fit(data, treatment_col, target_col, maxparts=None)

Learn a discretisation model using UMODL.

Parameters:
data: pd.DataFrame

Dataframe containing feature variables. Categorical variables should have the object dtype, otherwise they are processed as numerical variables.

treatment_col: pd.Series

Treatment column.

target_col: pd.Series

Outcome column.

maxparts: int, default=None

The maximal number of intervals or groups. None means that the ‘umodl’ program default is used.

fit_transform(data, treatment_col, target_col, maxparts=None)

Learn a discretisation model using UMODL and transform the data.

Parameters:
data: pd.DataFrame

Dataframe containing feature variables. Categorical variables should have the object dtype, otherwise they are processed as numerical variables.

treatment_col: pd.Series

Treatment column.

target_col: pd.Series

Outcome column.

maxparts: int, default=None

The maximal number of intervals or groups. None means that the ‘umodl’ program default is used.

Returns:
pd.DataFrame

Pandas Dataframe that contains encoded data.

get_level(variable)

Get the level of a single variable.

Parameters:
variable: str

The variable to get the level from.

Returns:
float

The level of the specified variable.

get_levels()

Get the level of all variables.

Returns:
list[tuple[str, float]]

(variable-name, variable-level) pairs in decreasing level order.

get_partition(variable)

Get the partition corresponding to a single variable of the model.

Parameters:
variable: str

The variable name.

Returns:
Partition

The partition corresponding to a single variable of the model.

get_partitions()

Get the partitions of all informative input variables in the model.

Returns:
dict[str, Partition]

A dictionary mapping the informative input variable names to the partitions.

get_target_frequencies(variable)

Get the frequencies for each (target, treatment) pair.

The frequencies are computed for a single variable.

Parameters:
variable: str

The variable name.

Returns:
pd.DataFrame
The frequencies as a Dataframe containing:
  • A column named ‘Part’ listing all the parts of the variable.

  • One column per (target, treatment) pair.
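To make the layout of such a table concrete, here is a small counting sketch in plain Python. It is not the library's implementation: target_frequencies and the sample values are made up, and a nested dictionary stands in for the returned DataFrame.

```python
from collections import Counter


def target_frequencies(parts, targets, treatments):
    """Count rows per part for every (target, treatment) pair.

    Returns {part: Counter({(target, treatment): count})}.
    """
    counts = {}
    for part, target, treatment in zip(parts, targets, treatments):
        counts.setdefault(part, Counter())[(target, treatment)] += 1
    return counts


# Five rows of a single variable, already assigned to parts (made-up data).
parts = ["low", "low", "high", "high", "high"]
targets = [1, 0, 1, 1, 0]
treatments = ["A", "A", "A", "B", "B"]
freq = target_frequencies(parts, targets, treatments)
```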

get_target_probabilities(variable)

Get the probabilities P(target|treatment) for each (target, treatment) pair.

The probabilities are computed for a single variable.

Parameters:
variable: str

The variable name.

Returns:
pd.DataFrame
The probabilities as a Dataframe containing:
  • A column named ‘Part’ listing all the parts of the variable.

  • One column per (target, treatment) pair.
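One way to read P(target|treatment) for a given part is as the count of that (target, treatment) pair divided by the number of rows receiving that treatment in the part. The sketch below follows this reading; the normalisation, the function name and the counts are assumptions, not taken from the library.

```python
def target_probabilities(freq):
    """Turn counts into P(target|treatment) for ONE part of a variable.

    freq maps (target, treatment) -> number of rows.
    """
    per_treatment = {}
    for (_, treatment), n in freq.items():
        per_treatment[treatment] = per_treatment.get(treatment, 0) + n
    return {
        (target, treatment): n / per_treatment[treatment]
        for (target, treatment), n in freq.items()
        if per_treatment[treatment] > 0  # skip treatments with no rows in this part
    }


# Made-up counts for one part.
part_freq = {(1, "A"): 30, (0, "A"): 70, (1, "B"): 10, (0, "B"): 30}
probs = target_probabilities(part_freq)
```

Under this reading, the probabilities of all targets for a given treatment sum to 1 within each part.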

get_treatment_groups(variable: str | None = None) -> dict[PartInterval | PartValue | PartValueGroup, tuple[tuple[str]]] | dict[str, dict[PartInterval | PartValue | PartValueGroup, tuple[tuple[str]]]]

Get the groups of treatments for one or all variables.

Parameters:
variable: str | None

If set to None, get groups of all variables, otherwise get groups of specified variable.

Returns:
If variable is None, returns a dict mapping variable names to dictionaries mapping parts to treatment groups.
If variable is not None, returns a dict mapping parts to treatment groups.
Treatment groups are given as a tuple of tuples of strings, where each string is a treatment name.

get_uplift(reftarget, reftreatment, variable)

Get the uplift for a single variable.

See explanations of the computations in the ‘Returns’ section below.

Parameters:
reftarget

The reference target.

reftreatment

The reference treatment to which all the other treatments are compared.

variable: str

The name of the variable.

Returns:
pd.DataFrame
A Dataframe containing:
  • A column named ‘Part’ listing all the parts of the variable.

  • One column per treatment other than the reference treatment. Each column gives the difference P(reftarget|treatment) - P(reftarget|reftreatment), that is, the gain (or loss) in the probability of obtaining ‘reftarget’ as the outcome under the column’s treatment compared to the reference treatment.
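The difference described above can be computed directly from per-part conditional probabilities. The following sketch uses nested dictionaries in place of DataFrames; uplift_per_part and the probability values are illustrative only.

```python
def uplift_per_part(probabilities, reftarget, reftreatment):
    """Compute P(reftarget|treatment) - P(reftarget|reftreatment) per part.

    probabilities: {part: {(target, treatment): P(target|treatment)}}.
    Returns {part: {treatment: uplift}} for every treatment but the reference.
    """
    result = {}
    for part, probs in probabilities.items():
        baseline = probs[(reftarget, reftreatment)]
        result[part] = {
            treatment: p - baseline
            for (target, treatment), p in probs.items()
            if target == reftarget and treatment != reftreatment
        }
    return result


# Made-up probabilities for two parts of one variable.
probabilities = {
    "low": {(1, "control"): 0.25, (0, "control"): 0.75, (1, "T1"): 0.5, (0, "T1"): 0.5},
    "high": {(1, "control"): 0.5, (0, "control"): 0.5, (1, "T1"): 0.25, (0, "T1"): 0.75},
}
uplift = uplift_per_part(probabilities, reftarget=1, reftreatment="control")
```

A positive value means the treatment raises the probability of the reference target in that part; a negative value means it lowers it.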

get_variable_type(variable)

Get the type of an input variable.

get_variable_types()

Get the types of all input variables as a mapping from variable names to variable types.

property informative_input_variables

list of str

The names of the informative variables.

property input_variables

list of str

The names of the variables.

property noninformative_input_variables

list of str

The names of the non-informative variables.

property target_modalities

list

All the different targets from the dataset.

property target_name

str

The name of the target column.

property target_treatment_pairs

list of TargetTreatmentPair

All (target, treatment) pairs as “target|treatment”-formatted strings.

transform(data)

Apply the discretisation model learned by the fit() method.

Parameters:
data: pd.DataFrame

Dataframe containing feature variables.

Returns:
pd.DataFrame

Pandas Dataframe that contains encoded data.

property treatment_modalities

list

All the different treatments from the dataset.

property treatment_name

str

The name of the treatment column.

kuplift.mt_univariate_encoding module

Multi-treatment Univariate Encoding

This module contains everything needed to perform a univariate variable transformation that can merge treatments giving similar outcomes.

class kuplift.mt_univariate_encoding.FileOutput(outputdir: Path, is_persistent: bool)

Bases: object

Compute paths to files and directories to be created.

property analysisresultdir: Path

property datasetfile: Path

property dictfile: Path

is_persistent: bool

property is_temporary: bool

outputdir: Path

property predictor_analysisresultfile: Path

xi_analysisresultfile(xname: str, iname: str) -> Path

class kuplift.mt_univariate_encoding.MultiTreatmentUnivariateEncoding

Bases: UnivariateEncodingWithGroupsBase

The MultiTreatmentUnivariateEncoding class makes use of the khiops Python wrapper. It fits and transforms data while grouping treatments that give similar outcomes.

fit(data: DataFrame, treatment_col: Series, target_col: Series, maxparts: int | None = None, maxtreatmentgroups: int | None = None, outputdir: Path | str | None = None, max_cores=None, memory_limit_mb=None) -> None

Learn a discretization model using Khiops.

Parameters:
data: pandas.DataFrame

Dataframe containing feature variables. Categorical variables must have a string, categorical or object dtype to avoid being processed as numerical variables.

treatment_col: pandas.Series

Treatment column.

target_col: pandas.Series

Outcome column.

maxparts: int, default=None

The maximal number of intervals or groups. None means that the ‘khiops’ program default is used.

maxtreatmentgroups: int, default=None

The maximal number of groups to define when grouping treatments together. None means automatic.

outputdir: Path-like

Set this to keep the Khiops work files in a specific directory. If None, the default behaviour applies: Khiops writes its files into a temporary directory that is deleted when the work is done.
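The temporary-versus-persistent behaviour of outputdir can be pictured with the standard library alone. The context manager below is a generic sketch of that pattern, not the code used by kuplift; the name working_directory is hypothetical.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def working_directory(outputdir=None):
    """Yield a directory for work files.

    With outputdir=None a temporary directory is created and deleted on exit;
    otherwise the given directory is created if needed and left in place.
    """
    if outputdir is None:
        with tempfile.TemporaryDirectory() as tmp:
            yield Path(tmp)
    else:
        path = Path(outputdir)
        path.mkdir(parents=True, exist_ok=True)
        yield path


# Default case: the directory exists only inside the with-block.
with working_directory() as workdir:
    temporary_path = workdir
    existed_inside = workdir.is_dir()
```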

kuplift.mt_univariate_encoding.add_jtvar_to_khiops_dict(dictionary: Dictionary, datasetinfo: DatasetInfo) -> str

kuplift.mt_univariate_encoding.add_selectionvar_to_khiops_dict(dictionary: Dictionary, xname: str, xparts: list) -> str

kuplift.mt_univariate_encoding.build_khiops_dict_from_dataset_file(dictfilepath: Path | str, datasetfilepath: Path | str, datasetinfo: DatasetInfo) -> Dictionary

Build a Khiops dictionary from a dataset file.

  1. Read a dataset file.

  2. Create a dictionary file from the dataset.

  3. Read the dictionary file. This actually returns a dictionary domain.

  4. Get the dictionary from the dictionary domain.

  5. Fix the types of the variables in the dictionary.

  6. Add a calculated j|t variable to the dictionary.

Parameters:
dictfilepath: Path-like

The path of the dictionary file to be created. A file is needed because the dictionary cannot be built in memory; it is also sometimes useful to inspect this file.

datasetfilepath: Path-like

The path to the dataset file.

Returns:
Dictionary

A dictionary built from the dataset file.

kuplift.mt_univariate_encoding.check_vartypes_in_khiops_dict(dictionary: Dictionary, datasetinfo: DatasetInfo) -> None

Check that all input variables are either numerical or categorical.

kuplift.mt_univariate_encoding.compute_stats(dataset: DataFrame, datasetinfo: DatasetInfo, fileoutput: FileOutput, maxparts: int | None = None, max_cores=None, memory_limit_mb=None) -> tuple[Stats, Dictionary]

kuplift.mt_univariate_encoding.fix_vartypes_in_khiops_dict(dictionary: Dictionary, datasetinfo: DatasetInfo) -> None

Set types of treatment and target variables to “Categorical”.

kuplift.mt_univariate_encoding.group_treatments_for_variable(variable: str, datasetinfo: DatasetInfo, stats: Stats, upliftdict: Dictionary, fileoutput: FileOutput, maxtreatmentgroups: int | None = None, max_cores=None, memory_limit_mb=None) -> VarStatsWithGroups

Create groups of treatments for a variable.

Create groups of treatments so that all treatments in each group give similar outcomes given the same values of the specified variable.

Parameters:
variable: str

The variable on which treatment grouping will be based.

datasetinfo: DatasetInfo

Information about the dataset.

stats: Stats

The statistics computed with compute_stats.

upliftdict: Dictionary

The dictionary created with compute_stats.

fileoutput: FileOutput

Paths to output files.

maxtreatmentgroups: int or None

Maximal number of treatment groups, with None indicating the default of Khiops.

Returns:
VarStatsWithGroups

Variable statistics augmented with treatment groups.

kuplift.mt_decision_tree module

class kuplift.mt_decision_tree.MultiTreatmentDecisionTree(max_depth: int = 15, min_samples_leaf: int = 20, leaf_selection: str = 'best_leaf', random_state: int | None = None, cost_model=None, control_name=None, maxparts: int = 2, maxtreatmentgroups: int | None = None, local_fit_mode: str = 'per_leaf', split_max_features: int | None = None, max_cores=None, memory_limit_mb=None)

Bases: object

Multi-treatment decision tree with local univariate partition fitting.

This implementation grows a binary tree by evaluating, at each candidate leaf, local split candidates produced by a univariate encoder chosen according to the number of treatment modalities:

  • OUE when there are exactly 2 treatment modalities

  • MTUE when there are 3 or more treatment modalities
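The selection rule above can be sketched as a small function. It only returns the abbreviation used in the list; reading OUE as OptimizedUnivariateEncoding and MTUE as MultiTreatmentUnivariateEncoding is an assumption, as is raising an error when fewer than two modalities are present.

```python
def select_encoder_name(treatment_values):
    """Pick the encoder abbreviation from the number of distinct treatments."""
    n_modalities = len(set(treatment_values))
    if n_modalities == 2:
        return "OUE"
    if n_modalities >= 3:
        return "MTUE"
    # Behaviour for 0 or 1 modality is not documented; an error is assumed here.
    raise ValueError("at least two treatment modalities are required")


binary_case = select_encoder_name(["control", "treated", "control"])
multi_case = select_encoder_name(["control", "A", "B", "A"])
```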

Notes

  • Raw node datasets are preserved (no global transformed dataset is stored in the tree).

  • Candidate split evaluation is cost-driven through cost_model.

  • A strict pass-through is implemented for KhiopsEnvironmentError so missing Khiops setup is not silently converted into a generic local-fit failure.

fit(data: DataFrame, treatment_col, y_col, positive_target=None) -> MultiTreatmentDecisionTree

Fit the decision tree on raw features, treatment and target.

Parameters:
data: pandas.DataFrame

Feature matrix.

treatment_col: array-like / pandas.Series

Treatment column aligned with data.

y_col: array-like / pandas.Series

Target column aligned with data.

positive_target: Any, default=None

Positive target modality. If None, auto-detected.

Returns:
MultiTreatmentDecisionTree

The fitted estimator.

Raises:
ValueError

If input data is empty or lengths are inconsistent.

get_leaf_paths(sort=None) -> Series

Return path string for each leaf.

Parameters:
sort: Any, default=None

Optional sorting mode forwarded to tree helper.

Returns:
pandas.Series

Leaf path strings.

get_node_by_id(node_id: int)

Return node object by id.

Parameters:
node_id: int

Node identifier.

Returns:
Node | None

Matching node or None.

get_node_path_str(node_id: int, separator: str = ' AND ') -> str

Return human-readable path string for one node id.

Parameters:
node_id: int

Node identifier.

separator: str, default=" AND "

Rule separator.

Returns:
str

Path string.
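How a path string might be assembled from the rules along the way from the root to a node is shown below. The node table, the rule texts and the function node_path_str are all hypothetical; the actual rule format produced by the tree is not described here.

```python
def node_path_str(nodes, node_id, separator=" AND "):
    """Join the rules met on the way from the root down to node_id.

    nodes: {node_id: (parent_id, rule)}; the root has parent_id None.
    """
    rules = []
    current = node_id
    while nodes[current][0] is not None:
        parent_id, rule = nodes[current]
        rules.append(rule)
        current = parent_id
    # Rules were collected bottom-up, so reverse them to read root-first.
    return separator.join(reversed(rules))


# A tiny made-up tree: node 0 is the root, node 3 is a grandchild.
nodes = {
    0: (None, None),
    1: (0, "age <= 40"),
    2: (0, "age > 40"),
    3: (1, "income <= 2000"),
}
leaf_path = node_path_str(nodes, 3)
root_path = node_path_str(nodes, 0)
```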

get_target_frequencies(sort=None) -> DataFrame

Return leaf-level target-treatment frequency table.

Parameters:
sort: Any, default=None

Optional sorting mode forwarded to tree helper.

Returns:
pandas.DataFrame

Frequency table.

get_target_probabilities(sort=None) -> DataFrame

Return leaf-level target-treatment probability table.

Parameters:
sort: Any, default=None

Optional sorting mode forwarded to tree helper.

Returns:
pandas.DataFrame

Probability table.

get_treatment_groups_of_leaves(sort=None) -> DataFrame

Return treatment grouping metadata for current leaves.

Parameters:
sort: Any, default=None

Optional sorting mode forwarded to tree helper.

Returns:
pandas.DataFrame

Leaf-level treatment groups.

get_uplift(sort=None) -> DataFrame

Return leaf-level uplift table.

Parameters:
sort: Any, default=None

Optional sorting mode forwarded to tree helper.

Returns:
pandas.DataFrame

Uplift table.

property internal_nodes

List of internal nodes of the fitted tree.

leaf_ids_sorted_dfs() -> Index

Return leaf ids in DFS order.

Returns:
pandas.Index

Leaf ids.

property leaf_nodes

List of leaf nodes of the fitted tree.

node_ids_sorted_dfs() -> Index

Return node ids in DFS order.

Returns:
pandas.Index

Node ids.

node_ids_sorted_dfs_from(node_ids: Index) -> Index

Filter provided node ids according to DFS order.

Parameters:
node_ids: pandas.Index

Candidate node ids.

Returns:
pandas.Index

Node ids ordered by DFS.
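The DFS ordering used by these helpers can be illustrated on a toy binary tree. The sketch assumes a pre-order traversal that visits the left child first, which this documentation does not confirm; dfs_order, sort_by_dfs and the child table are hypothetical.

```python
def dfs_order(children, root):
    """Pre-order depth-first traversal.

    children: {node_id: (left_id, right_id)}; leaves do not appear as keys.
    """
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        pair = children.get(node)
        if pair:
            stack.extend(reversed(pair))  # push right first so left is visited first
    return order


def sort_by_dfs(node_ids, children, root):
    """Keep only known node ids and order them by their DFS rank."""
    rank = {node: i for i, node in enumerate(dfs_order(children, root))}
    return sorted((n for n in node_ids if n in rank), key=rank.__getitem__)


children = {0: (1, 2), 1: (3, 4)}  # node 0 is the root; 2, 3 and 4 are leaves
all_nodes = dfs_order(children, 0)
leaves = [n for n in all_nodes if n not in children]
subset = sort_by_dfs([2, 3, 0], children, 0)
```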

predict_best_treatment(X: DataFrame) -> Series

Predict best treatment per sample based on highest positive-class rate in reached leaf.

Parameters:
X: pandas.DataFrame

Input features.

Returns:
pandas.Series

Predicted best treatment per sample.
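The selection rule amounts to an argmax over the per-treatment positive-class rates of the leaf a sample falls into. The sketch below uses made-up leaf statistics. It treats the control as one candidate among the others and breaks ties by dictionary order; both choices are assumptions.

```python
def best_treatment(positive_rates):
    """Return the treatment with the highest positive-class rate.

    positive_rates: {treatment: positive-class rate in the reached leaf}.
    """
    return max(positive_rates, key=positive_rates.get)


# Positive-class rates per leaf (made-up values).
leaf_rates = {
    3: {"control": 0.10, "A": 0.25, "B": 0.20},
    4: {"control": 0.30, "A": 0.15, "B": 0.45},
}
reached_leaves = [3, 4, 3]  # leaf id reached by each of three samples
predictions = [best_treatment(leaf_rates[leaf]) for leaf in reached_leaves]
```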

predict_leaf_id(X: DataFrame) -> ndarray

Predict leaf id for each sample in X.

Parameters:
X: pandas.DataFrame

Input features.

Returns:
numpy.ndarray

Leaf ids, one per sample.

predict_probabilities(X: DataFrame, result_type: Literal['df', 'ndarray', 'lists'] = 'ndarray') -> DataFrame

Predict positive-target probabilities per treatment for each sample.

Parameters:
X: pandas.DataFrame or array-like

Input features.

result_type: {"df", "ndarray", "lists"}, default="ndarray"

Output format.

Returns:
pandas.DataFrame | numpy.ndarray | list

Predicted probabilities in requested format.

print_tree(show_path: bool = False, max_depth: int | None = None) -> None

Print textual tree representation to stdout.

Parameters:
show_path: bool, default=False

Whether to append full path for each displayed node.

max_depth: int | None, default=None

Optional max rendering depth.

property root_node

Root node of the fitted tree, or None if unfitted.

property target_modalities

Sorted target modalities observed in training data.

property treatment_modalities

Sorted treatment modalities observed in training data.

property treatment_modality_count: int

Number of treatment modalities observed at training.

tree_to_dot(max_depth: int | None = None, show_node_stats: bool = True) -> str

Export tree to Graphviz DOT format.

Parameters:
max_depth: int | None, default=None

Optional max rendering depth.

show_node_stats: bool, default=True

Whether to include detailed labels.

Returns:
str

DOT graph source.

tree_to_image(dest: Path | str | None = None, img_format: str = 'png', *args, **kwargs) -> str

Render tree to image using Graphviz.

Parameters:
dest: Path | str | None, default=None

Output destination file path. If None, auto-generated in current directory.

img_format: str, default="png"

Image format passed to Graphviz.

*args, **kwargs

Forwarded to tree_to_dot().

Returns:
str

Path to rendered image file.

Raises:
RuntimeError

If graphviz Python package is missing.

tree_to_string(show_path: bool = False, max_depth: int | None = None) -> str

Export tree as a unicode text structure.

Parameters:
show_path: bool, default=False

Whether to append full path for each displayed node.

max_depth: int | None, default=None

Optional max rendering depth.

Returns:
str

Pretty-printed tree.

property used_variable_count: int

Number of distinct variables effectively used in applied tree splits.

kuplift.mt_random_forest module

class kuplift.mt_random_forest.MultiTreatmentRandomForest(n_trees: int = 30, max_features: int = 20, random_state: int | None = None, max_depth: int = 15, min_samples_leaf: int = 20, cost_model=None, control_name=None, maxparts: int = 2, maxtreatmentgroups: int | None = None, local_fit_mode: str = 'per_leaf', split_max_features: int | None = None, max_cores=None, memory_limit_mb=None)

Bases: object

MultiTreatmentRandomForest for uplift-style multi-treatment probabilities.

  • Each tree is a MultiTreatmentDecisionTree with leaf_selection="random"

  • Each tree is trained on all rows, but only a random subset of features (max_features=20 by default, or all if fewer are available)

  • At each split inside each tree, variables can also be sub-sampled via split_max_features (forwarded to MultiTreatmentDecisionTree)

  • predict() averages per-tree positive-class probabilities per treatment

Notes

  • This class does not perform row bootstrap sampling by default. Diversity is induced through random feature subspaces and per-tree seeds.

  • Uplift output requires control_name to be set and present in treatment modalities.
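The feature-subspace idea described above (all rows, a random subset of columns, one seed per tree) can be sketched with the standard random module. The seeding scheme and the function feature_subsets are hypothetical and only meant to show how per-tree seeds make the subsets reproducible.

```python
import random


def feature_subsets(features, n_trees, max_features=20, random_state=None):
    """Draw one random feature subset per tree, each with its own seed.

    Returns a list of (tree_seed, feature_names) pairs. Rows are not sampled:
    every tree is meant to see all rows, only the columns differ.
    """
    master = random.Random(random_state)
    k = min(max_features, len(features))  # use all features if fewer are available
    subsets = []
    for _ in range(n_trees):
        tree_seed = master.randrange(2**32)
        tree_rng = random.Random(tree_seed)
        subsets.append((tree_seed, sorted(tree_rng.sample(features, k))))
    return subsets


names = ["age", "income", "region", "tenure"]  # made-up feature names
subsets_a = feature_subsets(names, n_trees=3, max_features=2, random_state=0)
subsets_b = feature_subsets(names, n_trees=3, max_features=2, random_state=0)
all_features = feature_subsets(names, n_trees=1, max_features=20, random_state=0)
```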

fit(data: DataFrame, treatment_col, y_col, positive_target=None) -> MultiTreatmentRandomForest

Fit all trees of the forest.

Parameters:
data: pandas.DataFrame

Feature matrix.

treatment_col: array-like / pandas.Series

Treatment column aligned with data.

y_col: array-like / pandas.Series

Target column aligned with data.

positive_target: Any, default=None

Positive target modality forwarded to each tree fit.

Returns:
MultiTreatmentRandomForest

Fitted estimator.

Raises:
ValueError

If data is empty, has no feature column, or lengths are inconsistent.

TypeError

If data is not a pandas DataFrame.

predict(X: DataFrame, predict_probabilities: bool = True, predict_best_treatment: bool = True, predict_uplift: bool = True) -> DataFrame

Predict requested outputs for each sample.

Parameters:
X: pandas.DataFrame or array-like

Input feature matrix.

predict_probabilities: bool, default=True

Include class probabilities (negative then positive per treatment).

predict_best_treatment: bool, default=True

Include best treatment column according to maximal positive probability.

predict_uplift: bool, default=True

Include uplift column as max_t P(Y=positive|t) - P(Y=positive|control).

Returns:
pandas.DataFrame

Concatenated prediction blocks according to requested outputs.
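The uplift column defined above, max_t P(Y=positive|t) - P(Y=positive|control), can be written out for a single sample as follows. The sketch takes the maximum over the non-control treatments only, which is an assumption about how t ranges; the function name and the probabilities are made up.

```python
def uplift_vs_control(positive_probs, control_name):
    """Best treated positive probability minus the control's.

    positive_probs: {treatment: P(Y=positive|treatment)} for one sample.
    """
    if control_name not in positive_probs:
        raise ValueError("control_name must be one of the treatment modalities")
    treated = [p for t, p in positive_probs.items() if t != control_name]
    return max(treated) - positive_probs[control_name]


sample = {"control": 0.25, "A": 0.5, "B": 0.75}  # made-up probabilities
sample_uplift = uplift_vs_control(sample, "control")
```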

Raises:
RuntimeError

If model is not fitted.

ValueError

If no output is requested, features are missing, or uplift cannot be computed.

predict_probabilities(X: DataFrame, result_type: Literal['df', 'ndarray', 'lists'] = 'ndarray', include_negative_probabilities: bool = False)

Predict treatment-wise probabilities by averaging tree outputs.

Parameters:
X: pandas.DataFrame or array-like

Input feature matrix.

result_type: {"df", "ndarray", "lists"}, default="ndarray"

Output format.

include_negative_probabilities: bool, default=False

If True, prepend negative probabilities as 1 - positive.

Returns:
pandas.DataFrame | numpy.ndarray | list

Predicted probabilities in requested format.
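Averaging tree outputs and optionally adding the negative probabilities could look like the sketch below for a single sample. The output layout (all negative probabilities first, then all positive ones) is one possible reading of "prepend" and is an assumption, as are the function name and the values.

```python
def average_probabilities(per_tree, include_negative_probabilities=False):
    """Average per-tree positive probabilities for ONE sample.

    per_tree: one {treatment: P(positive|treatment)} dict per tree.
    """
    treatments = list(per_tree[0])
    mean = {t: sum(tree[t] for tree in per_tree) / len(per_tree) for t in treatments}
    positive = [mean[t] for t in treatments]
    if not include_negative_probabilities:
        return positive
    negative = [1 - p for p in positive]  # negative probability as 1 - positive
    return negative + positive


# Two trees, two treatments (made-up values).
tree_outputs = [{"control": 0.0, "A": 0.5}, {"control": 0.5, "A": 1.0}]
positive_only = average_probabilities(tree_outputs)
with_negative = average_probabilities(tree_outputs, include_negative_probabilities=True)
```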

Raises:
RuntimeError

If model is not fitted.

ValueError

If input features are incomplete or result_type is invalid.