Documentation¶
kuplift.bayesian_decision_tree module¶
- class kuplift.bayesian_decision_tree.BayesianDecisionTree(control_name=None)¶
Bases: _Tree
The BayesianDecisionTree class implements the UB-DT algorithm described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, May). A Non-Parametric Bayesian Decision Trees for Uplift modelling. In PAKDD.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables.
- treatment_col: pd.Series
Treatment column.
- y_col: pd.Series
Outcome column.
- control_name: int or str
The name of the control value in the treatment column.
- fit(data, treatment_col, y_col)¶
Fit an uplift decision tree model using UB-DT.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables.
- treatment_col: pd.Series
Treatment column.
- y_col: pd.Series
Outcome column.
kuplift.bayesian_random_forest module¶
- class kuplift.bayesian_random_forest.BayesianRandomForest(n_trees=10, vars_subset=False, random_state=10)¶
Bases: object
The BayesianRandomForest class implements the UB-RF algorithm described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, May). A Non-Parametric Bayesian Decision Trees for Uplift modelling. In PAKDD.
- Parameters:
- data: pd.DataFrame
Dataframe containing data.
- treatment_col: pd.Series
Treatment column.
- outcome_col: pd.Series
Outcome column.
- n_trees: int, default 10
Number of trees in the forest.
- vars_subset: bool, default False
Use a random subset of the variables for each tree in the forest.
- random_state: int, default 10
Seed used by the random number generator.
- fit(data, treatment_col, y_col)¶
Fit an uplift random forest model using UB-RF.
- predict(X_test, weighted_average=False)¶
Predict the uplift value for each example in X_test.
- Parameters:
- X_test: pd.DataFrame
Dataframe containing test data.
- weighted_average: bool, default False
Weight the predictions of each tree according to its cost.
- Returns:
- y_pred_list: ndarray, shape=(num_samples, 1)
An array containing the predicted uplift for each sample.
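The difference between a plain and a weighted average of per-tree predictions can be sketched as follows. This is a minimal illustration, not kuplift's implementation: the function name is hypothetical, and the weights are made-up numbers, since how a weight is derived from a tree's cost is not described here.

```python
def combine_tree_predictions(per_tree_preds, weights=None):
    """Average per-tree uplift predictions, optionally weighted.

    per_tree_preds: one list of per-sample predictions for each tree.
    weights: hypothetical per-tree weights (NOT kuplift's cost-based weights).
    """
    n_trees = len(per_tree_preds)
    if weights is None:
        weights = [1.0] * n_trees  # plain average: every tree counts equally
    total = sum(weights)
    n_samples = len(per_tree_preds[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, per_tree_preds)) / total
        for i in range(n_samples)
    ]


# Two trees, two samples (illustrative numbers only).
preds = [[0.10, 0.30], [0.20, 0.10]]
plain = combine_tree_predictions(preds)
weighted = combine_tree_predictions(preds, weights=[3.0, 1.0])
```

With the weights above, the first tree contributes three times as much as the second to each sample's predicted uplift.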
kuplift.feature_selection module¶
- class kuplift.feature_selection.FeatureSelection(control_name=None)¶
Bases: object
The FeatureSelection class implements the feature selection algorithm ‘UMODL-FS’ described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, March). A non-parametric bayesian approach for uplift discretization and feature selection. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part V (pp. 239-254). Cham: Springer Nature Switzerland.
- filter(data, treatment_col, y_col, parallelized=False, num_processes=5)¶
This function runs the feature selection algorithm ‘UMODL-FS’, ranking variables based on their importance in the given data.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables.
- treatment_col: pd.Series
Treatment column.
- y_col: pd.Series
Outcome column.
- parallelized: bool, default False
Whether to run the code on several processes.
- num_processes: int, default 5
Number of processes to use in parallel. Requires ‘parallelized’ to be True.
- Returns:
- dict
Variable names and their corresponding importance values, sorted by importance.
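A common follow-up is to keep only the highest-ranked variables. The sketch below works on a plain dictionary of the shape described above. The helper name and the importance values are hypothetical, and the helper re-sorts explicitly so that it does not rely on the order of the input.

```python
def top_k_features(importances, k):
    """Return the k variable names with the highest importance values."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical importance values, for illustration only.
importances = {"age": 0.12, "income": 0.30, "region": 0.0, "visits": 0.07}
selected = top_k_features(importances, k=2)
```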
- get_features_importance_details()¶
After running the feature selection, this function returns the details of each feature: how it was discretized, the intervals, and the outcome densities in each interval.
kuplift.univariate_encoding module¶
- class kuplift.univariate_encoding.UnivariateEncoding(control_name=None)¶
Bases: object
The UnivariateEncoding class implements the UMODL algorithm for uplift data encoding described in: Rafla, M., Voisine, N., Crémilleux, B., & Boullé, M. (2023, March). A non-parametric bayesian approach for uplift discretization and feature selection. ECML PKDD
- fit(data, treatment_col, y_col, parallelized=False, num_processes=5)¶
fit() learns a discretisation model using the UMODL approach.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables.
- treatment_col: pd.Series
Treatment column.
- y_col: pd.Series
Outcome column.
- parallelized: bool, default False
Whether to run the code on several processes.
- num_processes: int, default 5
Number of processes to use in parallel.
- fit_transform(data, treatment_col, y_col, parallelized=False, num_processes=5)¶
fit_transform() learns a discretisation model using UMODL and transforms the data.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables.
- treatment_col: pd.Series
Treatment column.
- y_col: pd.Series
Outcome column.
- parallelized: bool, default False
Whether to run the code on several processes.
- num_processes: int, default 5
Number of processes to use in parallel.
- Returns:
- pd.DataFrame
Pandas Dataframe that contains encoded data.
- get_features_importance_details()¶
- transform(data)¶
transform() applies the discretisation model learned by the fit() method.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables.
- Returns:
- pd.DataFrame
Pandas Dataframe that contains encoded data.
kuplift.optimized_univariate_encoding module¶
Optimized Univariate Encoding
This module contains everything needed to perform optimized univariate variable transformation using the C++ implementation of ‘umodl’. It calls the ‘umodl’ executable as a subprocess, indirectly, through the ‘umodl’ library.
The main class of this module is ‘OptimizedUnivariateEncoding’.
Example code is in examples/optimized_univariate_encoding.py.
- class kuplift.optimized_univariate_encoding.Interval(lower: float | None = None, upper: float | None = None)¶
Bases: object
- property catches_missing¶
- class kuplift.optimized_univariate_encoding.IntervalPartition(intervals: Sequence[Interval])¶
Bases: Partition
Partition of type ‘intervals’.
- Attributes:
- intervals: Sequence[Interval]
The intervals. Each interval is a pair defining its lower bound and its upper bound (in that order). The exception to this rule is the empty interval representing ‘MISSING’ values. If present, it must be the first interval of the sequence.
- property parts¶
- transform(col)¶
- transform_elem(elem)¶
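The idea of mapping one element to its interval can be sketched in plain Python. This is not the class's code: the function name, the tuple representation, and the `MISSING` stand-in are hypothetical. The sketch also assumes that each interval includes its lower bound and excludes its upper bound, which this reference does not state.

```python
import math

MISSING = ()  # stand-in for the empty interval that catches missing values


def interval_index(intervals, value):
    """Return the index of the interval that contains value.

    intervals: sequence of (lower, upper) pairs, optionally preceded by MISSING.
    Assumption for this sketch: lower bound inclusive, upper bound exclusive.
    """
    is_missing = value is None or (isinstance(value, float) and math.isnan(value))
    for index, interval in enumerate(intervals):
        if interval == MISSING:
            if is_missing:
                return index
            continue
        lower, upper = interval
        if not is_missing and lower <= value < upper:
            return index
    raise ValueError(f"no interval contains {value!r}")


# The MISSING interval, when present, comes first.
intervals = [MISSING, (-math.inf, 18.0), (18.0, 65.0), (65.0, math.inf)]
```

Transforming a whole column then amounts to applying this lookup to every element.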
- class kuplift.optimized_univariate_encoding.OptimizedUnivariateEncoding¶
Bases: object
The OptimizedUnivariateEncoding class makes use of the external umodl tool hosted at https://github.com/UData-Orange/umodl.
- Attributes:
- model: dict mapping str to ValGrpPartition or IntervalPartition
The model generated by the ‘umodl’ executable. It describes the partitioning of values of informative variables into groups or intervals. It maps the informative variable names to value partitions.
- levels: list of (str, float) pairs
(variable-name, variable-level) pairs in decreasing level order.
- variable_cols: DataFrame
The data columns of all variables. This means all the data from the dataset but the treatment and target columns.
- treatment_col: Series
The treatment column from the dataset.
- target_col: Series
The target column from the dataset.
- fit(data, treatment_col, target_col, maxpartnumber=None)¶
Learn a discretisation model using UMODL.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables. Categorical variables should have the object dtype, otherwise they are processed as numerical variables.
- treatment_col: pd.Series
Treatment column.
- target_col: pd.Series
Outcome column.
- maxpartnumber: int, default=None
The maximum number of intervals or groups. None means the ‘umodl’ program’s default is used.
- fit_transform(data, treatment_col, target_col, maxpartnumber=None)¶
Learn a discretisation model using UMODL and transform the data.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables. Categorical variables should have the object dtype, otherwise they are processed as numerical variables.
- treatment_col: pd.Series
Treatment column.
- target_col: pd.Series
Outcome column.
- maxpartnumber: int, default=None
The maximum number of intervals or groups. None means the ‘umodl’ program’s default is used.
- Returns:
- pd.DataFrame
Pandas Dataframe that contains encoded data.
- get_level(variable)¶
Get the level of a single variable.
- Parameters:
- variable: str
The variable to get the level from.
- Returns:
- float
The level of the specified variable.
- get_levels()¶
Get the level of all variables.
- Returns:
- list[tuple[str, float]]
(variable-name, variable-level) pairs in decreasing level order.
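A list of pairs in this shape is easy to post-process. The sketch below uses made-up levels and builds a name-to-level lookup. It then splits the variables under the assumption, not confirmed by this reference, that a variable is informative when its level is strictly positive.

```python
# Hypothetical (variable-name, variable-level) pairs in decreasing level order.
levels = [("income", 0.042), ("age", 0.013), ("region", 0.0)]

# Name -> level lookup, comparable to asking for one variable's level.
level_of = dict(levels)

# Assumption: a strictly positive level marks an informative variable.
informative = [name for name, level in levels if level > 0]
noninformative = [name for name, level in levels if level <= 0]
```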
- get_partition(variable)¶
Get the partition corresponding to a single variable of the model.
- Parameters:
- variable: str
The variable name.
- Returns:
- ValGrpPartition | IntervalPartition
The partition corresponding to a single variable of the model.
- get_target_frequencies(variable)¶
Get the frequencies for each (target, treatment) pair.
The frequencies are computed for a single variable.
- Parameters:
- variable: str
The variable name.
- Returns:
- pd.DataFrame
- The frequencies as a Dataframe containing:
A column named ‘Part’ listing all the parts of the variable.
One column per (target, treatment) pair.
- get_target_probabilities(variable)¶
Get the probabilities P(target|treatment) for each (target, treatment) pair.
The probabilities are computed for a single variable.
- Parameters:
- variable: str
The variable name.
- Returns:
- pd.DataFrame
- The probabilities as a Dataframe containing:
A column named ‘Part’ listing all the parts of the variable.
One column per (target, treatment) pair.
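The link between frequencies and probabilities can be illustrated for a single part of a variable. In this sketch, P(target|treatment) is the count of the (target, treatment) pair divided by the total count for that treatment. The function name and the counts are hypothetical, and the library's own computation may differ in details.

```python
def target_probabilities(frequencies):
    """Convert (target, treatment) counts into P(target | treatment).

    frequencies: dict mapping (target, treatment) to a count, for one part.
    """
    totals = {}
    for (_, treatment), count in frequencies.items():
        totals[treatment] = totals.get(treatment, 0) + count
    return {
        (target, treatment): count / totals[treatment]
        for (target, treatment), count in frequencies.items()
        if totals[treatment] > 0  # skip treatments with no observations
    }


# Hypothetical counts for one part of one variable.
freqs = {
    (0, "control"): 80, (1, "control"): 20,
    (0, "treated"): 60, (1, "treated"): 40,
}
probs = target_probabilities(freqs)
```

For each treatment, the resulting probabilities sum to 1 across the targets.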
- get_uplift(reftarget, reftreatment, variable)¶
Get the uplift for a single variable.
See explanations of the computations in the ‘Returns’ section below.
- Parameters:
- reftarget
The reference target.
- reftreatment
The reference treatment to which all the other treatments are compared.
- variable: str
The name of the variable.
- Returns:
- pd.DataFrame
- A Dataframe containing:
A column named ‘Part’ listing all the parts of the variable.
One column per treatment other than the reference treatment. Each column gives the difference P(reftarget|treatment) - P(reftarget|reftreatment), that is, the gain (or loss) in the probability of obtaining ‘reftarget’ as the outcome under the column’s treatment compared to the reference treatment.
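The difference described above can be worked through for one part of a variable. The sketch below is a plain-Python illustration with hypothetical probabilities and a hypothetical helper name. It does not call the library.

```python
def uplift_per_treatment(probs, reftarget, reftreatment):
    """Return P(reftarget | t) - P(reftarget | reftreatment) for every other t.

    probs: dict mapping (target, treatment) to P(target | treatment), one part.
    """
    baseline = probs[(reftarget, reftreatment)]
    return {
        treatment: p - baseline
        for (target, treatment), p in probs.items()
        if target == reftarget and treatment != reftreatment
    }


# Hypothetical probabilities for one part, with three treatments.
probs = {
    (1, "control"): 0.20, (1, "email"): 0.35, (1, "call"): 0.15,
    (0, "control"): 0.80, (0, "email"): 0.65, (0, "call"): 0.85,
}
uplift = uplift_per_treatment(probs, reftarget=1, reftreatment="control")
```

Here "email" has a positive uplift relative to "control" and "call" has a negative one.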
- property informative_input_variables¶
list of str
The names of the informative variables.
- property input_variables¶
list of str
The names of the variables.
- model: dict[str, ValGrpPartition | IntervalPartition]¶
- property noninformative_input_variables¶
list of str
The names of the non-informative variables.
- property target_modalities¶
list
All the different targets from the dataset.
- property target_name¶
str
The name of the target column.
- property target_treatment_pairs¶
list of TargetTreatmentPair
All (target, treatment) pairs.
- transform(data)¶
Apply the discretisation model learned by the fit() method.
- Parameters:
- data: pd.DataFrame
Dataframe containing feature variables.
- Returns:
- pd.DataFrame
Pandas Dataframe that contains encoded data.
- property treatment_modalities¶
list
All the different treatments from the dataset.
- property treatment_name¶
str
The name of the treatment column.
- class kuplift.optimized_univariate_encoding.TargetTreatmentPair(target: object, treatment: object)¶
Bases: object
Target-treatment pair.
Used to identify both a target and a treatment. This class only exists for the purpose of formatting.
- class kuplift.optimized_univariate_encoding.ValGrpPartition(groups: Sequence[ValGrp], defaultgroupindex: int)¶
Bases: Partition
Partition of type ‘value groups’.
- Attributes:
- groups: Sequence[ValGrp]
The groups. Each group is an iterable of its values.
- defaultgroupindex: int
The group index assigned to transformed elements that do not explicitly appear in any group.
- property parts¶
- transform(col)¶
- transform_elem(elem)¶
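The role of defaultgroupindex can be shown with a small sketch. The function name and the groups are hypothetical, and this is an illustration of the lookup rule described above rather than the class's code.

```python
def group_index(groups, defaultgroupindex, value):
    """Return the index of the group listing value, else the default index."""
    for index, group in enumerate(groups):
        if value in group:
            return index
    return defaultgroupindex


# Hypothetical value groups for a categorical variable.
groups = [["red", "orange"], ["blue"], ["green"]]
known = group_index(groups, 2, "blue")      # listed in the second group
unknown = group_index(groups, 2, "purple")  # not listed: default group
```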