pyrolite.util.skl

Utilities for use with sklearn.

pyrolite.util.skl.vis

pyrolite.util.skl.vis.plot_confusion_matrix(*args, ax=None, classes=[], class_order=None, normalize=False, title='Confusion Matrix', cmap=<matplotlib.colors.LinearSegmentedColormap object>, norm=None, xlabelrotation=None)[source]

This function prints and plots the confusion matrix. Normalization can be applied by setting normalize=True.

Parameters

args (tuple) –

Data to evaluate and visualise a confusion matrix:

A single confusion matrix (n x n)

A tuple of (y_test, y_predict)

A tuple of (classifier_model, X_test, y_test)

ax (matplotlib.axes.Axes) – Axis to plot on, if one exists.

classes (list) – List of class names to use as labels, and for ordering (see below). This should match the order contained within the model. Where a classifier model is passed, the classes will be directly extracted.

class_order (list) – List of classes in the desired order along the axes. Should match the supplied classes where classes are given, or be integer indices where no named classes are given.

normalize (bool) – Whether to normalize the counts for the confusion matrix to the sum of all cases (i.e. be between 0 and 1).

title (str) – Title for the axes.

cmap (str | matplotlib.colors.Colormap) – Colormap for the visualisation of the confusion matrix.

norm (bool) – Normalization for the colormap visualisation across the confusion matrix.

xlabelrotation (float) – Rotation in degrees for the xaxis labels.

Returns

ax

Return type

matplotlib.axes.Axes
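The normalize and class_order options can be illustrated without any plotting. The following is a minimal sketch using NumPy only; it is not pyrolite's implementation. The helper confusion_counts, the class names and the label lists are hypothetical, and the row = true class, column = predicted class convention is an assumption borrowed from scikit-learn.

```python
import numpy as np


def confusion_counts(y_true, y_pred, classes):
    """Count (true, predicted) pairs into an n x n matrix (rows: true class)."""
    index = {c: i for i, c in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm


# Hypothetical class names and labels, for illustration only.
classes = ["basalt", "andesite", "rhyolite"]
y_test = ["basalt", "basalt", "andesite", "rhyolite", "rhyolite", "andesite"]
y_predict = ["basalt", "andesite", "andesite", "rhyolite", "basalt", "andesite"]

cm = confusion_counts(y_test, y_predict, classes)

# Equivalent of normalize=True as documented: divide by the sum of all cases,
# so every entry lies between 0 and 1 and the whole matrix sums to 1.
cm_norm = cm / cm.sum()

# Equivalent of class_order: reorder both axes to the desired class sequence.
class_order = ["rhyolite", "andesite", "basalt"]
order = [classes.index(c) for c in class_order]
cm_ordered = cm[np.ix_(order, order)]
```

Note that this normalisation is over the whole matrix, not per row, so the diagonal entries are fractions of all samples rather than per-class recall values.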

pyrolite.util.skl.vis.plot_gs_results(gs, xvar=None, yvar=None)[source]

Plots the results from a GridSearch showing location of optimum in 2D.

pyrolite.util.skl.vis.alphas_from_multiclass_prob(probs, method='entropy', alpha=1.0)[source]

Take an array of multiclass probabilities and map to an alpha variable.

Parameters

probs (numpy.ndarray) – Multiclass probabilities with shape (nsamples, nclasses).

method (str, entropy | kl_div) – Method for mapping probabilities to alphas.

alpha (float) – Optional specification of overall maximum alpha value.

Returns

a – Alpha values for each sample with shape (nsamples, 1).

Return type

numpy.ndarray
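One way such a mapping could work is sketched below, assuming that alpha should be 0 for a uniform (uninformative) probability row and equal to the maximum alpha for a fully confident row. The function alphas_sketch and its scaling are assumptions for illustration and may differ from what pyrolite actually computes.

```python
import numpy as np


def alphas_sketch(probs, method="entropy", alpha=1.0):
    """Map rows of class probabilities to alpha values in [0, alpha]."""
    probs = np.asarray(probs, dtype=float)
    n_classes = probs.shape[1]
    # Replace zeros before taking logs so that 0 * log(0) is treated as 0.
    safe = np.where(probs > 0, probs, 1.0)
    max_H = np.log(n_classes)  # entropy of the uniform null scenario
    if method == "entropy":
        H = -(probs * np.log(safe)).sum(axis=1)
        a = 1.0 - H / max_H
    elif method == "kl_div":
        # Information gain relative to uniform: KL(p || u) = sum p * log(p * n)
        kl = (probs * np.log(safe * n_classes)).sum(axis=1)
        a = kl / max_H
    else:
        raise ValueError(f"Unknown method: {method}")
    # Shape (nsamples, 1), following the documented return shape.
    return alpha * a.reshape(-1, 1)


probs = np.array([[1.0, 0.0, 0.0], [1 / 3, 1 / 3, 1 / 3], [0.5, 0.5, 0.0]])
a_entropy = alphas_sketch(probs, method="entropy")
a_kl = alphas_sketch(probs, method="kl_div")
```

With the scaling chosen here the two methods coincide, because KL(p || uniform) = log(n) - H(p); an implementation that scales the two measures differently would give different alphas.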

pyrolite.util.skl.vis.plot_mapping(X, Y, mapping=None, ax=None, cmap=None, alpha=1.0, s=10, alpha_method='entropy', **kwargs)[source]

Parameters

X (numpy.ndarray) – Coordinates in multidimensional space.

Y (numpy.ndarray | sklearn.base.BaseEstimator) – An array of targets, or a method to obtain such an array of targets via Y.predict(). Transformers with probabilistic output (via Y.predict_proba()) will have these probability estimates accounted for via the alpha channel.

mapping (numpy.ndarray | TransformerMixin) – Mapped points or transformer to create mapped points.

ax (matplotlib.axes.Axes) – Axes to plot on.

cmap (matplotlib.colors.ListedColormap) – Colormap to use for the classification visualisation (ideally this should be a discrete colormap unless the classes are organised).

alpha (float) – Coefficient for alpha.

alpha_method ('entropy' or 'kl_div') – Method to map class probabilities to alpha. 'entropy' uses a measure of entropy relative to null-scenario of equal distribution across classes, while 'kl_div' calculates the information gain relative to the same null-scenario.

Returns

ax (Axes) – Axes on which the mapping is plotted.

tfm (BaseEstimator) – Fitted mapping transform.

Todo

Option to generate colors for individual classes

This could be based on the distances between their centres in multidimensional space (or low dimensional mapping of this space), enabling a continuous (n-dimensional) colormap to be used to show similar classes, in addition to classification confidence.

pyrolite.util.skl.pipeline

pyrolite.util.skl.pipeline.fit_save_classifier(clf, X_train, y_train, directory='.', name='clf', extension='.joblib')[source]

Fit and save a classifier model. Also save relevant metadata where possible.

Parameters

clf (sklearn.base.BaseEstimator) – Classifier or gridsearch.

X_train (numpy.ndarray | pandas.DataFrame) – Training data.

y_train (numpy.ndarray | pandas.Series) – Training true classes.

directory (str | pathlib.Path) – Path to the save directory.

name (str) – Name of the classifier.

extension (str) – Extension to give the saved classifier, which is pickled with joblib.

Returns

clf – Fitted classifier.

Return type

sklearn.base.BaseEstimator
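The fit-then-persist pattern described above can be sketched with scikit-learn and joblib directly. This is an assumption about the general approach, not pyrolite's code: the helper fit_and_save is hypothetical, the data are synthetic, and the metadata saving mentioned above is omitted.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.svm import SVC


def fit_and_save(clf, X_train, y_train, directory=".", name="clf", extension=".joblib"):
    """Fit an estimator, then pickle it to <directory>/<name><extension>."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    clf = clf.fit(X_train, y_train)
    joblib.dump(clf, directory / (name + extension))
    return clf


# Two well-separated synthetic clusters, for illustration only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

with tempfile.TemporaryDirectory() as tmp:
    clf = fit_and_save(SVC(), X, y, directory=tmp, name="svc_demo")
    # Reload the pickled model and check it predicts identically.
    reloaded = joblib.load(Path(tmp) / "svc_demo.joblib")
    same = np.array_equal(clf.predict(X), reloaded.predict(X))
```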

pyrolite.util.skl.pipeline.classifier_performance_report(clf, X_test, y_test, classes=[], directory='.', name='clf')[source]

Output a performance report for a classifier. Currently outputs the overall classification score, a confusion matrix and where relevant an indication of variation seen across the gridsearch (currently only possible for 2D searches).

Parameters

clf (sklearn.base.BaseEstimator | sklearn.model_selection.GridSearchCV) – Classifier or gridsearch.

X_test (numpy.ndarray | pandas.DataFrame) – Input data for testing.

y_test (numpy.ndarray | pandas.Series) – Labelled/target data for testing.

classes (list) – Names of classes.

directory (str | pathlib.Path) – Path to the save directory.

name (str) – Name of the classifier.

Returns

clf – Fitted classifier.

Return type

sklearn.base.BaseEstimator

pyrolite.util.skl.pipeline.SVC_pipeline(sampler=None, balance=True, transform=None, scaler=None, kernel='rbf', decision_function_shape='ovo', probability=False, cv=StratifiedKFold(n_splits=10, random_state=None, shuffle=True), param_grid={}, n_jobs=4, verbose=10, cache_size=500, **kwargs)[source]

A convenience function for constructing a Support Vector Classifier pipeline.

Parameters

sampler (sklearn.base.TransformerMixin) – Resampling transformer.

balance (bool) – Whether to balance the class weights for the classifier.

transform (sklearn.base.TransformerMixin) – Preprocessing transformer.

scaler (sklearn.base.TransformerMixin) – Scale transformer.

kernel (str | callable) – Name of kernel to use for the support vector classifier ('linear'|'rbf'|'poly'|'sigmoid'). Optionally, a custom kernel function can be supplied (see sklearn docs for more info).

decision_function_shape (str, 'ovo' or 'ovr') – Shape of the decision function surface. 'ovo' is the one-vs-one classifier of libsvm (returning classification of shape (samples, classes*(classes-1)/2)), while 'ovr' is the one-vs-rest classifier (the scikit-learn default), which returns classification of shape (samples, classes).

probability (bool) – Whether to implement Platt-scaling to enable probability estimates. This must be enabled prior to calling fit, and will slow down that method.

cv (int | sklearn.model_selection.BaseSearchCV) – Cross validation search. If an integer k is provided, results in default k-fold cross validation. Optionally, if a sklearn.model_selection.BaseSearchCV instance is provided, it will be used directly (enabling finer control, e.g. over sorting/shuffling etc).

param_grid (dict) – Dictionary representing a parameter grid for the support vector classifier. Typically contains 1D arrays of grid indices for SVC() parameters, each prefixed with svc__ (e.g. dict(svc__gamma=np.logspace(-1, 3, 5), svc__C=np.logspace(-0.5, 2, 5))).

n_jobs (int) – Number of processors to use for the SVC construction. Note that providing n_jobs = -1 will use all available processors.

verbose (int) – Level of verbosity for the pipeline logging output.

cache_size (float) – Specify the size of the kernel cache (in MB).

Note

The following additional parameters are from sklearn.svm._classes.SVC().

Other Parameters

C (float, default=1.0) – Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.

degree (int, default=3) – Degree of the polynomial kernel function (‘poly’). Must be non-negative. Ignored by all other kernels.

gamma ({‘scale’, ‘auto’} or float, default=‘scale’) – Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.

if gamma='scale' (default) is passed then it uses 1 / (n_features * X.var()) as value of gamma,

if ‘auto’, uses 1 / n_features

if float, must be non-negative.

Changed in version 0.22: The default value of gamma changed from ‘auto’ to ‘scale’.

coef0 (float, default=0.0) – Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.

shrinking (bool, default=True) – Whether to use the shrinking heuristic. See the User Guide.

tol (float, default=1e-3) – Tolerance for stopping criterion.

cache_size (float, default=200) – Specify the size of the kernel cache (in MB).

class_weight (dict or ‘balanced’, default=None) – Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).

max_iter (int, default=-1) – Hard limit on iterations within solver, or -1 for no limit.

break_ties (bool, default=False) – If true, decision_function_shape='ovr', and number of classes > 2, predict will break ties according to the confidence values of decision_function; otherwise the first class among the tied classes is returned. Please note that breaking ties comes at a relatively high computational cost compared to a simple predict.

New in version 0.22.

random_state (int, RandomState instance or None, default=None) – Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls. See Glossary.
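Two of the formulas quoted in the parameter descriptions above, the gamma='scale' value and the 'balanced' class weights, can be evaluated by hand. The short NumPy sketch below does so on synthetic data; the array shapes and class counts are arbitrary choices for illustration.

```python
import numpy as np

# Synthetic data: 12 samples, 3 features, three unevenly sized classes.
rng = np.random.default_rng(42)
X = rng.normal(size=(12, 3))
y = np.array([0] * 6 + [1] * 4 + [2] * 2)

n_samples, n_features = X.shape

# gamma='scale': 1 / (n_features * X.var())
gamma_scale = 1.0 / (n_features * X.var())

# gamma='auto': 1 / n_features
gamma_auto = 1.0 / n_features

# class_weight='balanced': n_samples / (n_classes * np.bincount(y))
n_classes = len(np.unique(y))
balanced_weights = n_samples / (n_classes * np.bincount(y))
```

Here the rarest class (2 of 12 samples) receives three times the weight of the most common class (6 of 12), which is what "inversely proportional to class frequencies" amounts to.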

Returns

gs – Gridsearch object containing the results of the SVC training across the parameter grid. Access the best estimator with gs.best_estimator_ and its parameters with gs.best_params_.

Return type

sklearn.model_selection.GridSearchCV
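For orientation, a comparable grid search can be assembled from plain scikit-learn components. This is a sketch of the general pattern under stated assumptions, not the body of SVC_pipeline: the scaler choice, fold count, n_jobs=1 and the synthetic four-class dataset are all illustrative. It also shows why the grid keys carry the svc__ prefix: make_pipeline names each step after its lowercased class name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic four-class problem, for illustration only.
X, y = make_classification(
    n_samples=120,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

# Scaler followed by the classifier; the SVC step is named "svc".
pipe = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", decision_function_shape="ovo", class_weight="balanced"),
)

# Grid keys are "<step name>__<parameter name>".
param_grid = dict(svc__gamma=np.logspace(-1, 3, 5), svc__C=np.logspace(-0.5, 2, 5))

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
gs = GridSearchCV(pipe, param_grid=param_grid, cv=cv, n_jobs=1)
gs.fit(X, y)

best = gs.best_estimator_  # refitted pipeline with the best parameters
params = gs.best_params_   # dict with keys svc__gamma and svc__C

# With 4 classes, 'ovo' gives 4 * 3 / 2 = 6 pairwise decision columns.
ovo_shape = best.decision_function(X).shape
```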

class pyrolite.util.skl.pipeline.PdUnion(estimators: list = [])[source]

fit(X, y=None)[source]

transform(X)[source]

pyrolite.util.skl.select

class pyrolite.util.skl.select.TypeSelector(dtype)[source]

fit(X, y=None)[source]

transform(X)[source]

class pyrolite.util.skl.select.ColumnSelector(columns)[source]

fit(X, y=None)[source]

transform(X)[source]

class pyrolite.util.skl.select.CompositionalSelector(columns=None, inverse=False)[source]

fit(X, y=None)[source]

transform(X)[source]

class pyrolite.util.skl.select.MajorsSelector(components=None)[source]

fit(X, y=None)[source]

transform(X)[source]

class pyrolite.util.skl.select.ElementSelector(components=None)[source]

fit(X, y=None)[source]

transform(X)[source]

class pyrolite.util.skl.select.REESelector(components=None)[source]

fit(X, y=None)[source]

transform(X)[source]

pyrolite.util.skl.transform

class pyrolite.util.skl.transform.DropBelowZero(**kwargs)[source]

transform(X, *args, **kwargs)[source]

fit(X, *args)[source]

class pyrolite.util.skl.transform.LinearTransform(**kwargs)[source]

transform(X, *args, **kwargs)[source]

inverse_transform(Y, *args, **kwargs)[source]

fit(X, *args)[source]

class pyrolite.util.skl.transform.ExpTransform(**kwargs)[source]

transform(X, *args, **kwargs)[source]

inverse_transform(Y, *args, **kwargs)[source]

fit(X, *args)[source]

class pyrolite.util.skl.transform.LogTransform(fmt_string='Ln({})', **kwargs)[source]

transform(X, *args, **kwargs)[source]

inverse_transform(Y, *args, **kwargs)[source]

fit(X, *args)[source]

class pyrolite.util.skl.transform.ALRTransform(label_mode='numeric', **kwargs)[source]

transform(X, *args, **kwargs)[source]

inverse_transform(Y, *args, **kwargs)[source]

fit(X, *args, **kwargs)[source]

class pyrolite.util.skl.transform.CLRTransform(label_mode='numeric', **kwargs)[source]

transform(X, *args, **kwargs)[source]

inverse_transform(Y, *args, **kwargs)[source]

fit(X, *args, **kwargs)[source]

class pyrolite.util.skl.transform.ILRTransform(label_mode='numeric', **kwargs)[source]

transform(X, *args, **kwargs)[source]

inverse_transform(Y, *args, **kwargs)[source]

fit(X, *args, **kwargs)[source]

class pyrolite.util.skl.transform.SphericalCoordTransform(**kwargs)[source]

transform(X, *args, **kwargs)[source]

inverse_transform(Y, *args, **kwargs)[source]

fit(X, *args, **kwargs)[source]

class pyrolite.util.skl.transform.BoxCoxTransform(**kwargs)[source]

transform(X, *args, **kwargs)[source]

inverse_transform(Y, *args, **kwargs)[source]

fit(X, *args, **kwargs)[source]

class pyrolite.util.skl.transform.Devolatilizer(exclude=['H2O', 'H2O_PLUS', 'H2O_MINUS', 'CO2', 'LOI'], renorm=True)[source]

fit(X, y=None)[source]

transform(X)[source]

class pyrolite.util.skl.transform.ElementAggregator(renorm=True, form='oxide')[source]

fit(X, y=None)[source]

transform(X)[source]

class pyrolite.util.skl.transform.LambdaTransformer(norm_to='Chondrite_PON', exclude=['Pm', 'Eu', 'Ce'], params=None, degree=5)[source]

fit(X, y=None)[source]

transform(X)[source]

pyrolite.util.skl.impute

class pyrolite.util.skl.impute.MultipleImputer(multiple=5, max_iter=10, groupby=None, *args, **kwargs)[source]

transform(X, *args, **kwargs)[source]

fit(X, y=None)[source]

Fit the imputers.

Parameters

X (pandas.DataFrame) – Data to use to fit the imputations.

y (pandas.Series) – Target class; optionally specified, and used similarly to groupby.