Documentation¶
kuplift.bayesian_decision_tree module¶
- class kuplift.bayesian_decision_tree.BayesianDecisionTree(control_name=None)¶
Bases: _Tree
The BayesianDecisionTree class implements the UB-DT algorithm described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, May). A Non-Parametric Bayesian Decision Trees for Uplift modelling. In PAKDD.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables.
- treatment_col: pd.Series
Treatment column.
- y_col: pd.Series
Outcome column.
- control_name: int or str
The name of the control value in the treatment column.
- fit(data, treatment_col, y_col)¶
Fit an uplift decision tree model using UB-DT.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables.
- treatment_col: pd.Series
Treatment column.
- y_col: pd.Series
Outcome column.
kuplift.bayesian_random_forest module¶
- class kuplift.bayesian_random_forest.BayesianRandomForest(n_trees=10, vars_subset=False, random_state=10)¶
Bases: object
The BayesianRandomForest class implements the UB-RF algorithm described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, May). A Non-Parametric Bayesian Decision Trees for Uplift modelling. In PAKDD.
- Parameters:
- data: pd.DataFrame
Dataframe containing data.
- treatment_col: pd.Series
Treatment column.
- outcome_col: pd.Series
Outcome column.
- n_trees: int, default 10
Number of trees in a forest.
- vars_subset: bool, default False
Use a random subset of the variables for each tree in the forest.
- random_state: int, default 10
Seed used by the random number generator.
- fit(data, treatment_col, y_col)¶
Fit an uplift random forest model using UB-RF.
- predict(X_test, weighted_average=False)¶
Predict the uplift value for each example in X_test.
- Parameters:
- X_test: pd.DataFrame
Dataframe containing test data.
- weighted_average: bool, default False
Weight the predictions of each tree according to its cost.
- Returns:
- y_pred_list: ndarray, shape=(num_samples, 1)
An array containing the predicted uplift for each sample.
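The effect of weighted_average can be pictured with a small sketch. This is not the library's code: the per-tree predictions and the weights below are made-up numbers, and how kuplift turns a tree's cost into a weight is not described here, so the weights are simply passed in.

```python
def combine_tree_predictions(per_tree_preds, weights=None):
    """Average per-tree uplift predictions, optionally with one weight per tree.

    per_tree_preds: one list of predictions per tree (same length each).
    weights: hypothetical per-tree weights; None means a plain average.
    """
    n_trees = len(per_tree_preds)
    if weights is None:
        weights = [1.0] * n_trees
    total = sum(weights)
    n_samples = len(per_tree_preds[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, per_tree_preds)) / total
        for i in range(n_samples)
    ]


# Two trees, two samples (illustrative values only).
per_tree_preds = [[0.10, 0.30], [0.20, 0.10]]
plain = combine_tree_predictions(per_tree_preds)
weighted = combine_tree_predictions(per_tree_preds, weights=[3.0, 1.0])
```

With a plain average every tree counts equally; with weights, the first tree dominates the result in this example.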
kuplift.feature_selection module¶
- class kuplift.feature_selection.FeatureSelection(control_name=None)¶
Bases: object
The FeatureSelection class implements the feature selection algorithm ‘UMODL-FS’ described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, March). A non-parametric bayesian approach for uplift discretization and feature selection. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part V (pp. 239-254). Cham: Springer Nature Switzerland.
- filter(data, treatment_col, y_col, parallelized=False, num_processes=5)¶
This function runs the feature selection algorithm ‘UMODL-FS’, ranking variables based on their importance in the given data.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables.
- treatment_col: pd.Series
Treatment column.
- y_col: pd.Series
Outcome column.
- parallelized: bool, default False
Whether to run the code on several processes.
- num_processes: int, default 5
Number of processes to use in parallel. Requires the ‘parallelized’ argument to be True.
- Returns:
- dict
Variable names and their corresponding importance values (sorted).
- get_features_importance_details()¶
After running the feature selection, this function gives the details of each feature: how it was discretized, the intervals, and the outcome densities in each interval.
kuplift.univariate_encoding module¶
- class kuplift.univariate_encoding.UnivariateEncoding(control_name=None)¶
Bases: object
The UnivariateEncoding class implements the UMODL algorithm for uplift data encoding described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, March). A non-parametric bayesian approach for uplift discretization and feature selection. ECML PKDD.
- fit(data, treatment_col, y_col, parallelized=False, num_processes=5)¶
fit() learns a discretisation model using the UMODL approach.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables.
- treatment_col: pd.Series
Treatment column.
- y_col: pd.Series
Outcome column.
- parallelized: bool, default False
Whether to run the code on several processes.
- num_processes: int, default 5
Number of processes to use in parallel.
- fit_transform(data, treatment_col, y_col, parallelized=False, num_processes=5)¶
fit_transform() learns a discretisation model using UMODL and transforms the data.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables.
- treatment_col: pd.Series
Treatment column.
- y_col: pd.Series
Outcome column.
- parallelized: bool, default False
Whether to run the code on several processes.
- num_processes: int, default 5
Number of processes to use in parallel.
- Returns:
- pd.DataFrame
Pandas DataFrame that contains encoded data.
- get_features_importance_details()¶
- transform(data)¶
transform() applies the discretisation model learned by the fit() method.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables.
- Returns:
- pd.DataFrame
Pandas DataFrame that contains encoded data.
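As a rough picture of what applying a learned discretisation means for one numerical variable, the sketch below maps values to interval indices. The cut points are hypothetical, the helper name encode_with_bounds is not part of kuplift, and the exact encoded representation that transform() returns is not described here.

```python
from bisect import bisect_right


def encode_with_bounds(values, bounds):
    """Map each numeric value to the index of the interval it falls into.

    bounds: sorted inner cut points (hypothetical, as if learned by fit()).
    A value equal to a cut point is placed in the interval to its right.
    """
    return [bisect_right(bounds, value) for value in values]


# Two cut points define three intervals: below 30, from 30 to below 50, 50 and up.
bounds = [30.0, 50.0]
encoded = encode_with_bounds([18, 30, 42, 77], bounds)
```

The point of the sketch is only that transform() reuses partitions fixed at fit time rather than re-learning them on the new data.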
kuplift.optimized_univariate_encoding module¶
Optimized Univariate Encoding
This module contains everything needed to perform univariate variable transformation, optimized through the use of the C++ implementation of ‘umodl’. It calls the ‘umodl’ executable as a subprocess, indirectly, through the ‘umodl’ library.
The main class of this module is ‘OptimizedUnivariateEncoding’.
Example code is in examples/optimized_univariate_encoding.py.
- class kuplift.optimized_univariate_encoding.OptimizedUnivariateEncoding¶
Bases: object
The OptimizedUnivariateEncoding class makes use of the external umodl tool hosted at https://github.com/UData-Orange/umodl.
- Attributes:
- model: dict mapping str to Part
The model generated by the ‘umodl’ executable. It describes the partitioning of values of informative variables into groups or intervals. It maps the informative variable names to value partitions.
- levels: list of (str, float) pairs
(variable-name, variable-level) pairs in decreasing level order.
- variable_cols: DataFrame
The data columns of all variables. This means all the data from the dataset but the treatment and target columns.
- treatment_col: Series
The treatment column from the dataset.
- target_col: Series
The target column from the dataset.
- treatment_groups: dict mapping str to dict mapping Part to int
The keys are the variable names. The values are themselves dictionaries, whose keys are groups or intervals and whose values are numbers.
- fit(data, treatment_col, target_col, maxparts=None)¶
Learn a discretisation model using UMODL.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables. Categorical variables should have the object dtype, otherwise they are processed as numerical variables.
- treatment_col: pd.Series
Treatment column.
- target_col: pd.Series
Outcome column.
- maxparts: int, default=None
The maximal number of intervals or groups. None means that the ‘umodl’ program default is used.
- fit_transform(data, treatment_col, target_col, maxparts=None)¶
Learn a discretisation model using UMODL and transform the data.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables. Categorical variables should have the object dtype, otherwise they are processed as numerical variables.
- treatment_col: pd.Series
Treatment column.
- target_col: pd.Series
Outcome column.
- maxparts: int, default=None
The maximal number of intervals or groups. None means that the ‘umodl’ program default is used.
- Returns:
- pd.DataFrame
Pandas DataFrame that contains encoded data.
- get_level(variable)¶
Get the level of a single variable.
- Parameters:
- variable: str
The variable to get the level from.
- Returns:
- float
The level of the specified variable.
- get_levels()¶
Get the level of all variables.
- Returns:
- list[tuple[str, float]]
(variable-name, variable-level) pairs in decreasing level order.
- get_partition(variable)¶
Get the partition corresponding to a single variable of the model.
- Parameters:
- variable: str
The variable name.
- Returns:
- Partition
The partition corresponding to a single variable of the model.
- get_partitions()¶
Get the partitions of all informative input variables in the model.
- Returns:
- dict[str, Partition]
A dictionary mapping the informative input variable names to the partitions.
- get_target_frequencies(variable)¶
Get the frequencies for each (target, treatment) pair.
The frequencies are computed for a single variable.
- Parameters:
- variable: str
The variable name.
- Returns:
- pd.DataFrame
The frequencies as a DataFrame containing:
- A column named ‘Part’ listing all the parts of the variable.
- One column per (target, treatment) pair.
- get_target_probabilities(variable)¶
Get the probabilities P(target|treatment) for each (target, treatment) pair.
The probabilities are computed for a single variable.
- Parameters:
- variable: str
The variable name.
- Returns:
- pd.DataFrame
The probabilities as a DataFrame containing:
- A column named ‘Part’ listing all the parts of the variable.
- One column per (target, treatment) pair.
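The relation between the frequency table and the probability table can be sketched in plain Python. This is an illustration under assumptions, not the library's implementation: the part labels and counts are invented, plain dictionaries stand in for the DataFrame, and the probabilities are assumed to be normalized within each part and treatment.

```python
def target_probabilities(frequencies):
    """Turn per-part (target, treatment) counts into P(target | treatment).

    frequencies maps a part label to {(target, treatment): count}.
    Columns are named "target|treatment", mirroring target_treatment_pairs.
    """
    result = {}
    for part, counts in frequencies.items():
        # Total number of rows per treatment within this part.
        per_treatment = {}
        for (target, treatment), count in counts.items():
            per_treatment[treatment] = per_treatment.get(treatment, 0) + count
        result[part] = {
            f"{target}|{treatment}": count / per_treatment[treatment]
            for (target, treatment), count in counts.items()
            if per_treatment[treatment] > 0
        }
    return result


# Invented counts for one variable split into two parts.
frequencies = {
    "]-inf;40]": {(0, "A"): 30, (1, "A"): 10, (0, "B"): 20, (1, "B"): 20},
    "]40;+inf[": {(0, "A"): 15, (1, "A"): 5, (0, "B"): 5, (1, "B"): 15},
}
probs = target_probabilities(frequencies)
```

For each part and treatment, the probabilities over the targets sum to 1.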
- get_treatment_groups(variable: str | None = None) → dict[PartInterval | PartValue | PartValueGroup, tuple[tuple[str]]] | dict[str, dict[PartInterval | PartValue | PartValueGroup, tuple[tuple[str]]]]¶
Get the groups of treatments for one or all variables.
- Parameters:
- variable: str | None
If set to None, get groups of all variables, otherwise get groups of specified variable.
- Returns:
- If variable is None, returns a dict mapping variable names to dictionaries mapping parts to treatment groups.
- If variable is not None, returns a dict mapping parts to treatment groups.
- Treatment groups are in a tuple containing tuples of strings, which are the treatment names.
- get_uplift(reftarget, reftreatment, variable)¶
Get the uplift for a single variable.
See explanations of the computations in the ‘Returns’ section below.
- Parameters:
- reftarget
The reference target.
- reftreatment
The reference treatment to which all the other treatments are compared.
- variable: str
The name of the variable.
- Returns:
- pd.DataFrame
A DataFrame containing:
- A column named ‘Part’ listing all the parts of the variable.
- One column per treatment other than the reference treatment. A column gives the difference P(reftarget|treatment) - P(reftarget|reftreatment), that is, the gain (or loss) in the probability of having ‘reftarget’ as the outcome with the column’s treatment compared to the reference treatment.
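The difference P(reftarget|treatment) - P(reftarget|reftreatment) can be worked through on a toy table. The sketch below uses plain dictionaries instead of a DataFrame, and the part names, treatment names and probabilities are invented for illustration; uplift_table is a hypothetical helper, not a kuplift function.

```python
def uplift_table(probabilities, reftarget, reftreatment):
    """Compute per-part uplift of each treatment against a reference treatment.

    probabilities maps a part label to {(target, treatment): P(target | treatment)}.
    """
    table = {}
    for part, probs in probabilities.items():
        reference = probs[(reftarget, reftreatment)]
        table[part] = {
            treatment: p - reference
            for (target, treatment), p in probs.items()
            if target == reftarget and treatment != reftreatment
        }
    return table


# Invented probabilities of target 1 for three treatments in two parts.
probabilities = {
    "part_1": {(1, "control"): 0.25, (1, "A"): 0.5, (1, "B"): 0.125},
    "part_2": {(1, "control"): 0.5, (1, "A"): 0.5, (1, "B"): 0.75},
}
uplift = uplift_table(probabilities, reftarget=1, reftreatment="control")
```

A positive value means the treatment makes the reference target more likely than the reference treatment does in that part; a negative value means it makes it less likely.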
- get_variable_type(variable)¶
Get the type of an input variable.
- get_variable_types()¶
Get the types of all input variables as a mapping from variable names to variable types.
- property informative_input_variables¶
list of str
The names of the informative variables.
- property input_variables¶
list of str
The names of the variables.
- property noninformative_input_variables¶
list of str
The names of the non-informative variables.
- property target_modalities¶
list
All the different targets from the dataset.
- property target_name¶
str
The name of the target column.
- property target_treatment_pairs¶
list of TargetTreatmentPair
All (target, treatment) pairs as “target|treatment”-formatted strings.
- transform(data)¶
Apply the discretisation model learned by the fit() method.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables.
- Returns:
- pd.DataFrame
Pandas Dataframe that contains encoded data.
- property treatment_modalities¶
list
All the different treatments from the dataset.
- property treatment_name¶
str
The name of the treatment column.
kuplift.mt_univariate_encoding module¶
Multi-treatment Univariate Encoding
This module contains everything needed to perform univariate variable transformation capable of merging treatments that give similar outcomes.
- class kuplift.mt_univariate_encoding.FileOutput(outputdir: Path, is_persistent: bool)¶
Bases: object
Compute paths to files and directories to be created.
- class kuplift.mt_univariate_encoding.MultiTreatmentUnivariateEncoding¶
Bases: UnivariateEncodingWithGroupsBase
The MultiTreatmentUnivariateEncoding class makes use of the khiops Python wrapper and enables one to fit and transform data while grouping treatments that give similar outcomes.
- fit(data: DataFrame, treatment_col: Series, target_col: Series, maxparts: int | None = None, maxtreatmentgroups: int | None = None, outputdir: Path | str | None = None, max_cores=None, memory_limit_mb=None) → None¶
Learn a discretization model using Khiops.
- Parameters:
- data: pandas.DataFrame
Dataframe containing feature variables. Categorical variables must have a string, categorical or object dtype to avoid being processed as numerical variables.
- treatment_col: pandas.Series
Treatment column.
- target_col: pandas.Series
Outcome column.
- maxparts: int, default=None
The maximal number of intervals or groups. None means that the ‘khiops’ program default is used.
- maxtreatmentgroups: int, default=None
The maximal number of groups to define when grouping treatments together. None means automatic.
- outputdir: Path-like
Set this if you want the Khiops work files to be kept in a specific directory. If None, Khiops falls back to its default behaviour, which is to write its files into a temporary directory that is deleted when the work is done.
- kuplift.mt_univariate_encoding.add_jtvar_to_khiops_dict(dictionary: Dictionary, datasetinfo: DatasetInfo) → str¶
- kuplift.mt_univariate_encoding.add_selectionvar_to_khiops_dict(dictionary: Dictionary, xname: str, xparts: list) → str¶
- kuplift.mt_univariate_encoding.build_khiops_dict_from_dataset_file(dictfilepath: Path | str, datasetfilepath: Path | str, datasetinfo: DatasetInfo) → Dictionary¶
Build a Khiops dictionary from a dataset file.
Read a dataset file.
Create a dictionary file from the dataset.
Read the dictionary file. This actually returns a dictionary domain.
Get the dictionary from the dictionary domain.
Fix the types of the variables in the dictionary.
Add a j|t calculated variable in the dictionary.
- Parameters:
- dictfilepath: Path-like
The path to the dictionary file to be created. A file is needed because the dictionary cannot be built in memory; it can also be inspected afterwards.
- datasetfilepath: Path-like
The path to the dataset file.
- Returns:
- Dictionary
A dictionary built from the dataset file.
- kuplift.mt_univariate_encoding.check_vartypes_in_khiops_dict(dictionary: Dictionary, datasetinfo: DatasetInfo) → None¶
Check that all input variables are either numerical or categorical.
- kuplift.mt_univariate_encoding.compute_stats(dataset: DataFrame, datasetinfo: DatasetInfo, fileoutput: FileOutput, maxparts: int | None = None, max_cores=None, memory_limit_mb=None) → tuple[Stats, Dictionary]¶
- kuplift.mt_univariate_encoding.fix_vartypes_in_khiops_dict(dictionary: Dictionary, datasetinfo: DatasetInfo) → None¶
Set types of treatment and target variables to “Categorical”.
- kuplift.mt_univariate_encoding.group_treatments_for_variable(variable: str, datasetinfo: DatasetInfo, stats: Stats, upliftdict: Dictionary, fileoutput: FileOutput, maxtreatmentgroups: int | None = None, max_cores=None, memory_limit_mb=None) → VarStatsWithGroups¶
Create groups of treatments for a variable.
Create groups of treatments so that all treatments in each group give similar outcomes given the same values of the specified variable.
- Parameters:
- variable: str
The variable on which treatment grouping will be based.
- datasetinfo: DatasetInfo
Information about the dataset.
- stats: Stats
The statistics computed with compute_stats.
- upliftdict: Dictionary
The dictionary created with compute_stats.
- fileoutput: FileOutput
Paths to output files.
- maxtreatmentgroups: int or None
Maximal number of treatment groups, with None indicating the default of Khiops.
- Returns:
- VarStatsWithGroups
Variable statistics augmented with treatment groups.
kuplift.mt_decision_tree module¶
- class kuplift.mt_decision_tree.MultiTreatmentDecisionTree(max_depth: int = 15, min_samples_leaf: int = 20, leaf_selection: str = 'best_leaf', random_state: int | None = None, cost_model=None, control_name=None, maxparts: int = 2, maxtreatmentgroups: int | None = None, local_fit_mode: str = 'per_leaf', split_max_features: int | None = None, max_cores=None, memory_limit_mb=None)¶
Bases: object
Multi-treatment decision tree with local univariate partition fitting.
This implementation grows a binary tree by evaluating, at each candidate leaf, local split candidates based on univariate encoders selected from treatment cardinality:
- OUE when there are exactly 2 treatment modalities
- MTUE when there are 3 or more treatment modalities
Notes
- Raw node datasets are preserved (no global transformed dataset is stored in the tree).
- Candidate split evaluation is cost-driven through cost_model.
- A strict pass-through is implemented for KhiopsEnvironmentError so missing Khiops setup is not silently converted into a generic local-fit failure.
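The encoder choice by treatment cardinality amounts to a simple rule, sketched below. The function name select_encoder_name is hypothetical, it returns the acronyms used above rather than actual encoder objects, and raising ValueError for fewer than two modalities is an assumption made for the sketch.

```python
def select_encoder_name(treatments):
    """Pick the local encoder from the number of distinct treatment modalities."""
    n_modalities = len(set(treatments))
    if n_modalities == 2:
        return "OUE"
    if n_modalities >= 3:
        return "MTUE"
    # Assumption for this sketch: fewer than two modalities is an error.
    raise ValueError("at least two treatment modalities are needed")


two_arm = select_encoder_name(["control", "A", "A", "control"])
three_arm = select_encoder_name(["control", "A", "B"])
```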
- fit(data: DataFrame, treatment_col, y_col, positive_target=None) → MultiTreatmentDecisionTree¶
Fit the decision tree on raw features, treatment and target.
- Parameters:
- data: pandas.DataFrame
Feature matrix.
- treatment_col: array-like / pandas.Series
Treatment column aligned with data.
- y_col: array-like / pandas.Series
Target column aligned with data.
- positive_target: Any, default=None
Positive target modality. If None, auto-detected.
- Returns:
- MultiTreatmentDecisionTree
The fitted estimator.
- Raises:
- ValueError
If input data is empty or lengths are inconsistent.
- get_leaf_paths(sort=None) → Series¶
Return path string for each leaf.
- Parameters:
- sort: Any, default=None
Optional sorting mode forwarded to tree helper.
- Returns:
- pandas.Series
Leaf path strings.
- get_node_by_id(node_id: int)¶
Return node object by id.
- Parameters:
- node_id: int
Node identifier.
- Returns:
- Node | None
Matching node or None.
- get_node_path_str(node_id: int, separator: str = ' AND ') → str¶
Return human-readable path string for one node id.
- Parameters:
- node_id: int
Node identifier.
- separator: str, default=" AND "
Rule separator.
- Returns:
- str
Path string.
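How a path string can be assembled from split rules is sketched below. The tree layout (a parents mapping), the rule texts and the helper name node_path_str are all hypothetical; the actual rule format produced by kuplift is not shown here.

```python
def node_path_str(node_id, parents, rules, separator=" AND "):
    """Join the split rules met on the way from the root down to node_id.

    parents maps a node id to its parent id (the root has no entry).
    rules maps a node id to the rule on the edge leading into that node.
    """
    path = []
    current = node_id
    while current in parents:
        path.append(rules[current])
        current = parents[current]
    # Rules were collected bottom-up, so reverse them to read root-first.
    return separator.join(reversed(path))


# A tiny hypothetical tree: 0 is the root, 1 and 2 its children, 3 and 4 under 1.
parents = {1: 0, 2: 0, 3: 1, 4: 1}
rules = {
    1: "age <= 40",
    2: "age > 40",
    3: "city in {Paris}",
    4: "city not in {Paris}",
}
leaf_path = node_path_str(3, parents, rules)
piped_path = node_path_str(4, parents, rules, separator=" | ")
```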
- get_target_frequencies(sort=None) → DataFrame¶
Return leaf-level target-treatment frequency table.
- Parameters:
- sort: Any, default=None
Optional sorting mode forwarded to tree helper.
- Returns:
- pandas.DataFrame
Frequency table.
- get_target_probabilities(sort=None) → DataFrame¶
Return leaf-level target-treatment probability table.
- Parameters:
- sort: Any, default=None
Optional sorting mode forwarded to tree helper.
- Returns:
- pandas.DataFrame
Probability table.
- get_treatment_groups_of_leaves(sort=None) → DataFrame¶
Return treatment grouping metadata for current leaves.
- Parameters:
- sort: Any, default=None
Optional sorting mode forwarded to tree helper.
- Returns:
- pandas.DataFrame
Leaf-level treatment groups.
- get_uplift(sort=None) → DataFrame¶
Return leaf-level uplift table.
- Parameters:
- sort: Any, default=None
Optional sorting mode forwarded to tree helper.
- Returns:
- pandas.DataFrame
Uplift table.
- property internal_nodes¶
List of internal nodes of the fitted tree.
- leaf_ids_sorted_dfs() → Index¶
Return leaf ids in DFS order.
- Returns:
- pandas.Index
Leaf ids.
- property leaf_nodes¶
List of leaf nodes of the fitted tree.
- node_ids_sorted_dfs() → Index¶
Return node ids in DFS order.
- Returns:
- pandas.Index
Node ids.
- node_ids_sorted_dfs_from(node_ids: Index) → Index¶
Filter provided node ids according to DFS order.
- Parameters:
- node_ids: pandas.Index
Candidate node ids.
- Returns:
- pandas.Index
Node ids ordered by DFS.
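Ordering node ids by depth-first traversal can be sketched as follows. The sketch assumes a pre-order traversal that visits the left child first and uses a hypothetical children mapping with plain lists in place of pandas.Index; the traversal details used by kuplift may differ.

```python
def dfs_order(children, root):
    """Return node ids in pre-order depth-first order."""
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the first (left) child is visited first.
        stack.extend(reversed(children.get(node, [])))
    return order


def sort_ids_by_dfs(node_ids, children, root):
    """Keep only ids present in the tree and order them by DFS position."""
    rank = {node: pos for pos, node in enumerate(dfs_order(children, root))}
    return sorted((n for n in node_ids if n in rank), key=rank.__getitem__)


# Hypothetical binary tree: 0 is the root, 3 to 6 are leaves.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
all_ids = dfs_order(children, 0)
some_ids = sort_ids_by_dfs([6, 3, 2], children, 0)
```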
- predict_best_treatment(X: DataFrame) → Series¶
Predict best treatment per sample based on highest positive-class rate in reached leaf.
- Parameters:
- X: pandas.DataFrame
Input features.
- Returns:
- pandas.Series
Predicted best treatment per sample.
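The selection rule, highest positive-class rate in the reached leaf, can be illustrated on the counts of a single leaf. The counts and the helper name best_treatment are invented for the sketch, and ties are resolved by dictionary order here, which is an assumption rather than documented kuplift behaviour.

```python
def best_treatment(leaf_counts, positive_target):
    """Return the treatment with the highest positive-class rate in one leaf.

    leaf_counts maps a treatment to {target: count} for the samples of the leaf.
    """
    rates = {}
    for treatment, counts in leaf_counts.items():
        total = sum(counts.values())
        if total:  # skip treatments with no sample in this leaf
            rates[treatment] = counts.get(positive_target, 0) / total
    return max(rates, key=rates.get)


# Invented counts: positive rates are 0.2 (control), 0.4 (A) and 0.3 (B).
leaf_counts = {
    "control": {0: 80, 1: 20},
    "A": {0: 60, 1: 40},
    "B": {0: 70, 1: 30},
}
winner = best_treatment(leaf_counts, positive_target=1)
```

Every sample routed to this leaf would receive the same predicted treatment.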
- predict_leaf_id(X: DataFrame) → ndarray¶
Predict leaf id for each sample in X.
- Parameters:
- X: pandas.DataFrame
Input features.
- Returns:
- numpy.ndarray
Leaf ids, one per sample.
- predict_probabilities(X: DataFrame, result_type: Literal['df', 'ndarray', 'lists'] = 'ndarray') → DataFrame¶
Predict positive-target probabilities per treatment for each sample.
- Parameters:
- X: pandas.DataFrame or array-like
Input features.
- result_type: {"df", "ndarray", "lists"}, default="ndarray"
Output format.
- Returns:
- pandas.DataFrame | numpy.ndarray | list
Predicted probabilities in requested format.
- print_tree(show_path: bool = False, max_depth: int | None = None) → None¶
Print textual tree representation to stdout.
- Parameters:
- show_path: bool, default=False
Whether to append full path for each displayed node.
- max_depth: int | None, default=None
Optional max rendering depth.
- property root_node¶
Root node of the fitted tree, or None if unfitted.
- property target_modalities¶
Sorted target modalities observed in training data.
- property treatment_modalities¶
Sorted treatment modalities observed in training data.
- tree_to_dot(max_depth: int | None = None, show_node_stats: bool = True) → str¶
Export tree to Graphviz DOT format.
- Parameters:
- max_depth: int | None, default=None
Optional max rendering depth.
- show_node_stats: bool, default=True
Whether to include detailed labels.
- Returns:
- str
DOT graph source.
- tree_to_image(dest: Path | str | None = None, img_format: str = 'png', *args, **kwargs) → str¶
Render tree to image using Graphviz.
- Parameters:
- dest: Path | str | None, default=None
Output destination file path. If None, auto-generated in current directory.
- img_format: str, default="png"
Image format passed to Graphviz.
- *args, **kwargs
Forwarded to tree_to_dot().
- Returns:
- str
Path to rendered image file.
- Raises:
- RuntimeError
If the graphviz Python package is missing.
- tree_to_string(show_path: bool = False, max_depth: int | None = None) → str¶
Export tree as a unicode text structure.
- Parameters:
- show_path: bool, default=False
Whether to append full path for each displayed node.
- max_depth: int | None, default=None
Optional max rendering depth.
- Returns:
- str
Pretty-printed tree.
kuplift.mt_random_forest module¶
- class kuplift.mt_random_forest.MultiTreatmentRandomForest(n_trees: int = 30, max_features: int = 20, random_state: int | None = None, max_depth: int = 15, min_samples_leaf: int = 20, cost_model=None, control_name=None, maxparts: int = 2, maxtreatmentgroups: int | None = None, local_fit_mode: str = 'per_leaf', split_max_features: int | None = None, max_cores=None, memory_limit_mb=None)¶
Bases: object
MultiTreatmentRandomForest for uplift-style multi-treatment probabilities.
- Each tree is a MultiTreatmentDecisionTree with leaf_selection="random".
- Each tree is trained on all rows, but only on a random subset of features (max_features=20 by default, or all features if fewer are available).
- At each split inside each tree, variables can also be sub-sampled via split_max_features (forwarded to MultiTreatmentDecisionTree).
- predict() averages per-tree positive-class probabilities per treatment.
Notes
- This class does not perform row bootstrap sampling by default. Diversity is induced through random feature subspaces and per-tree seeds.
- Uplift output requires control_name to be set and present in treatment modalities.
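Random feature subspaces with per-tree seeds can be sketched with the standard library. This is an illustration under assumptions: draw_feature_subsets is a hypothetical helper, the feature names are invented, and the way kuplift actually derives per-tree seeds from random_state is not described here.

```python
import random


def draw_feature_subsets(features, n_trees, max_features, random_state=None):
    """Draw one feature subset and one seed per tree, without row sampling."""
    master = random.Random(random_state)
    subsets = []
    for _ in range(n_trees):
        # Each tree gets its own seed derived from the master generator.
        tree_seed = master.randrange(2**32)
        rng = random.Random(tree_seed)
        # Use all features when fewer than max_features are available.
        k = min(max_features, len(features))
        subsets.append((tree_seed, sorted(rng.sample(features, k))))
    return subsets


features = ["age", "city", "income", "tenure"]
subsets = draw_feature_subsets(features, n_trees=3, max_features=2, random_state=0)
```

Fixing random_state makes the drawn subsets reproducible, while different trees still see different feature subsets in general.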
- fit(data: DataFrame, treatment_col, y_col, positive_target=None) → MultiTreatmentRandomForest¶
Fit all trees of the forest.
- Parameters:
- data: pandas.DataFrame
Feature matrix.
- treatment_col: array-like / pandas.Series
Treatment column aligned with data.
- y_col: array-like / pandas.Series
Target column aligned with data.
- positive_target: Any, default=None
Positive target modality forwarded to each tree fit.
- Returns:
- MultiTreatmentRandomForest
Fitted estimator.
- Raises:
- ValueError
If data is empty, has no feature column, or lengths are inconsistent.
- TypeError
If data is not a pandas DataFrame.
- predict(X: DataFrame, predict_probabilities: bool = True, predict_best_treatment: bool = True, predict_uplift: bool = True) → DataFrame¶
Predict requested outputs for each sample.
- Parameters:
- X: pandas.DataFrame or array-like
Input feature matrix.
- predict_probabilities: bool, default=True
Include class probabilities (negative then positive per treatment).
- predict_best_treatment: bool, default=True
Include best treatment column according to maximal positive probability.
- predict_uplift: bool, default=True
Include uplift column as max_t P(Y=positive|t) - P(Y=positive|control).
- Returns:
- pandas.DataFrame
Concatenated prediction blocks according to requested outputs.
- Raises:
- RuntimeError
If model is not fitted.
- ValueError
If no output is requested, features are missing, or uplift cannot be computed.
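The best-treatment and uplift outputs can be worked through for a single sample. The sketch below assumes that the maximum in max_t P(Y=positive|t) runs over all treatments, control included; the probabilities and the helper name summarize_sample are invented for illustration.

```python
def summarize_sample(positive_probs, control_name):
    """Return (best treatment, uplift) for one sample.

    positive_probs maps a treatment to P(Y=positive | treatment).
    Uplift is max_t P(Y=positive|t) - P(Y=positive|control).
    """
    if control_name not in positive_probs:
        raise ValueError("control_name must be one of the treatment modalities")
    best = max(positive_probs, key=positive_probs.get)
    uplift = max(positive_probs.values()) - positive_probs[control_name]
    return best, uplift


best, uplift_value = summarize_sample(
    {"control": 0.25, "A": 0.5, "B": 0.375}, control_name="control"
)
```

Under this assumption the uplift is never negative: when control itself has the highest positive probability, the uplift is 0.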
- predict_probabilities(X: DataFrame, result_type: Literal['df', 'ndarray', 'lists'] = 'ndarray', include_negative_probabilities: bool = False)¶
Predict treatment-wise probabilities by averaging tree outputs.
- Parameters:
- X: pandas.DataFrame or array-like
Input feature matrix.
- result_type: {"df", "ndarray", "lists"}, default="ndarray"
Output format.
- include_negative_probabilities: bool, default=False
If True, prepend negative probabilities as 1 - positive.
- Returns:
- pandas.DataFrame | numpy.ndarray | list
Predicted probabilities in requested format.
- Raises:
- RuntimeError
If model is not fitted.
- ValueError
If input features are incomplete or result_type is invalid.
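Averaging tree outputs and prepending negative probabilities can be sketched for one sample. The numbers are invented, average_positive_probabilities is a hypothetical helper, and the layout with all negatives first followed by all positives is an assumption about the column order.

```python
def average_positive_probabilities(per_tree, include_negative_probabilities=False):
    """Average positive-class probabilities over trees for one sample.

    per_tree holds one list per tree, with one probability per treatment.
    """
    n_trees = len(per_tree)
    # zip(*per_tree) groups the values of all trees treatment by treatment.
    positive = [sum(column) / n_trees for column in zip(*per_tree)]
    if include_negative_probabilities:
        negative = [1.0 - p for p in positive]
        return negative + positive
    return positive


# Two trees, two treatments (illustrative values only).
per_tree = [[0.25, 0.5], [0.25, 1.0]]
averaged = average_positive_probabilities(per_tree)
with_negative = average_positive_probabilities(
    per_tree, include_negative_probabilities=True
)
```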