pyrolite.util.resampling

Utilities for (weighted) bootstrap resampling applied to geoscientific point-data.

pyrolite.util.resampling.univariate_distance_matrix(a, b=None, distance_metric=None)

Get a distance matrix for a single column or array of values (here used for ages).

Parameters

a, b (numpy.ndarray) – Points or arrays to calculate distance between. If only one array is specified, a full distance matrix (i.e. calculate a point-to-point distance for every combination of points) will be returned.

distance_metric – Callable function f(a, b) from which to derive a distance metric.

Returns

2D distance matrix.

Return type

numpy.ndarray
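The behaviour described above can be sketched without any dependencies. This is a minimal illustration, not the library's implementation: the helper names are hypothetical, plain lists stand in for arrays, and the absolute-difference default metric is an assumption.

```python
def absolute_difference(x, y):
    """Assumed default metric: absolute difference between two values."""
    return abs(x - y)


def pairwise_distance_matrix(a, b=None, distance_metric=None):
    """Return a nested list D where D[i][j] = distance_metric(a[i], b[j])."""
    if b is None:
        # Only one input: compare every point with every other point.
        b = a
    if distance_metric is None:
        distance_metric = absolute_difference
    return [[distance_metric(x, y) for y in b] for x in a]


ages = [10.0, 50.0, 120.0]
full_matrix = pairwise_distance_matrix(ages)  # 3 x 3, zeros on the diagonal
cross_matrix = pairwise_distance_matrix(ages, [0.0, 100.0])  # 3 x 2
```

With a single input the result is square and symmetric; with two inputs the shape is (len(a), len(b)).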

pyrolite.util.resampling.get_spatiotemporal_resampling_weights(df, spatial_norm=1.8, temporal_norm=38, latlong_names=['Latitude', 'Longitude'], age_name='Age', max_memory_fraction=0.25, normalized_weights=True, **kwargs)

Takes a dataframe with latitude, longitude and age columns and returns a sampling weight for each sample, which is essentially the inverse of the mean distance to other samples.

Parameters

df (pandas.DataFrame) – Dataframe to calculate weights for.

spatial_norm (float) – Normalising constant for spatial measures (1.8 arc degrees).

temporal_norm (float) – Normalising constant for temporal measures (38 Myr).

latlong_names (list) – List of column names referring to latitude and longitude.

age_name (str) – Column name corresponding to geological age or time.

max_memory_fraction (float) – Constraint to switch to calculating mean distances where matrix=True and the distance matrix would require more than the specified fraction of total available physical memory. This is passed on to great_circle_distance().

normalized_weights (bool) – Whether to renormalise weights to unity.

Returns

weights – Sampling weights.

Return type

numpy.ndarray

Notes

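This function is equivalent to Eq. (1) of Keller and Schoene, where z and t are the spatial position and age of each sample, and a and b correspond to spatial_norm and temporal_norm: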

\[W_i \propto 1 \Big / \sum_{j=1}^{n} \Big ( \frac{1}{((z_i - z_j)/a)^2 + 1} + \frac{1}{((t_i - t_j)/b)^2 + 1} \Big )\]
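As a rough illustration of the equation above, the sketch below evaluates it directly in plain Python. Several things here are assumptions rather than the library's behaviour: the function name is hypothetical, spatial separation is reduced to a difference along a single coordinate instead of a great-circle distance, and normalising "to unity" is taken to mean that the weights sum to one.

```python
def inverse_density_weights(positions, ages, a=1.8, b=38.0, normalized=True):
    """Return one weight per sample, following the form of the equation above."""
    n = len(positions)
    weights = []
    for i in range(n):
        # Sum of spatial and temporal proximity terms over all samples j
        # (the j == i term contributes a constant 2).
        density = sum(
            1.0 / (((positions[i] - positions[j]) / a) ** 2 + 1.0)
            + 1.0 / (((ages[i] - ages[j]) / b) ** 2 + 1.0)
            for j in range(n)
        )
        weights.append(1.0 / density)
    if normalized:
        # Assumed meaning of normalisation: weights sum to one.
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights


# Two samples close together in space and time, and one isolated sample.
positions = [0.0, 0.1, 40.0]
ages = [100.0, 105.0, 2000.0]
weights = inverse_density_weights(positions, ages)
```

The isolated third sample receives the largest weight, which is the intended effect: sparsely sampled regions and ages are drawn more often during resampling.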

pyrolite.util.resampling.add_age_noise(df, min_sigma=50, noise_level=1.0, age_name='Age', age_uncertainty_name='AgeUncertainty', min_age_name='MinAge', max_age_name='MaxAge')

Add Gaussian noise to a series of geological ages based on specified uncertainties or age ranges.

Parameters

df (pandas.DataFrame) – Dataframe with age data within which to look up the age name and add noise.

min_sigma (float) – Minimum uncertainty to be considered for adding age noise.

noise_level (float) – Scaling of the noise added to the ages. By default the uncertainties are unscaled, but where age uncertainties are specified at the one standard deviation level this can be used to expand the range of noise added (e.g. to 2SD).

age_name (str) – Column name for absolute ages.

age_uncertainty_name (str) – Name of the column specifying absolute age uncertainties.

min_age_name (str) – Name of the column specifying minimum absolute ages (used where uncertainties are otherwise unspecified).

max_age_name (str) – Name of the column specifying maximum absolute ages (used where uncertainties are otherwise unspecified).

Returns

df – Dataframe with noise-modified ages.

Return type

pandas.DataFrame

Notes

This function modifies the input dataframe. Be aware of this if using it outside of the bootstrap resampling for which it was designed.
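The noise model can be sketched on plain dictionaries, one per sample. This is an illustration under stated assumptions, not the library's code: the function name is hypothetical, min_sigma is assumed to act as a floor on the uncertainty, and half of the MinAge to MaxAge range is assumed as the fallback uncertainty. Unlike the function documented here, the sketch returns copies and leaves its input untouched.

```python
import random


def add_noise_to_ages(rows, min_sigma=50.0, noise_level=1.0, seed=None):
    """Return copies of rows with Gaussian noise added to each 'Age'."""
    rng = random.Random(seed)
    noisy_rows = []
    for row in rows:
        sigma = row.get("AgeUncertainty")
        if sigma is None:
            # Assumed fallback: half of the MinAge to MaxAge range.
            sigma = abs(row["MaxAge"] - row["MinAge"]) / 2.0
        # Assumed behaviour: min_sigma is a floor on the uncertainty.
        sigma = max(sigma, min_sigma)
        noisy_age = row["Age"] + rng.gauss(0.0, noise_level * sigma)
        noisy_rows.append({**row, "Age": noisy_age})
    return noisy_rows


samples = [
    {"Age": 1000.0, "AgeUncertainty": 20.0},
    {"Age": 2500.0, "MinAge": 2300.0, "MaxAge": 2700.0},
]
noisy_samples = add_noise_to_ages(samples, seed=42)
```

Setting noise_level=2.0 in this sketch would draw noise at twice the stated uncertainty, matching the 2SD case mentioned in the parameter description.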

pyrolite.util.resampling.spatiotemporal_bootstrap_resample(df, columns=None, uncert=None, weights=None, niter=100, categories=None, transform=None, bootstrap_method='smooth', add_gaussian_age_noise=True, metrics=['mean', 'var'], default_uncertainty=0.02, relative_uncertainties=True, noise_level=1, age_name='Age', latlong_names=['Latitude', 'Longitude'], **kwargs)

Resample and aggregate metrics from a dataframe, optionally aggregating by a given set of categories. Formulated specifically for dealing with resampling to address uneven sampling density in space and particularly geological time.

Parameters

df (pandas.DataFrame) – Dataframe to resample.

columns (list) – Columns to provide bootstrap resampled estimates for.

uncert (float | numpy.ndarray | pandas.Series | pandas.DataFrame) – Fractional uncertainties for the dataset.

weights (numpy.ndarray | pandas.Series) – Array of weights for resampling, if precomputed.

niter (int) – Number of resampling iterations. This will be the minimum index size of the output metric dataframes.

categories (list | numpy.ndarray | pandas.Series) – List of sample categories to group the outputs by, which has the same size as the dataframe index.

transform – Callable function to transform input data prior to aggregation functions. Note that the outputs will need to be inverse-transformed.

bootstrap_method (str) – Which method to use to add Gaussian noise to the input dataset parameters.

add_gaussian_age_noise (bool) – Whether to add Gaussian noise to the input dataset ages, where present.

metrics (list) – List of metrics to use for dataframe aggregation.

default_uncertainty (float) – Default (fractional) uncertainty where uncertainties are not given.

relative_uncertainties (bool) – Whether uncertainties are relative (True, i.e. fractional proportions of parameter values) or absolute (False).

noise_level (float) – Multiplier for the random Gaussian noise added to the dataset and ages.

age_name (str) – Column name for geological age.

latlong_names (list) – Column names for latitude and longitude, or equivalent orthogonal spherical spatial measures.

Returns

Dictionary of aggregated dataframe(s) keyed by statistical metric. If categories are specified, the dataframe(s) will have a hierarchical index of (category, iteration).

Return type

dict
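The overall procedure (weighted resampling with replacement, noise added to the drawn values, then aggregation per iteration) can be sketched for a single variable using only the standard library. The function name, the example values, and the reading of the "smooth" method as relative Gaussian perturbation are all assumptions made for illustration; this is not the library's implementation and it omits age noise, categories and transforms.

```python
import random
import statistics


def weighted_smooth_bootstrap(values, weights, uncert=0.02, niter=100, seed=None):
    """Return {'mean': [...], 'var': [...]} with one entry per iteration."""
    rng = random.Random(seed)
    indices = range(len(values))
    results = {"mean": [], "var": []}
    for _ in range(niter):
        # Weighted draw with replacement, same size as the input.
        drawn = rng.choices(indices, weights=weights, k=len(values))
        # Assumed "smooth" step: perturb each value by relative Gaussian noise.
        resampled = [rng.gauss(values[i], abs(values[i]) * uncert) for i in drawn]
        results["mean"].append(statistics.fmean(resampled))
        results["var"].append(statistics.variance(resampled))
    return results


sio2 = [50.0, 52.0, 70.0]  # hypothetical measurements
weights = [0.2, 0.2, 0.6]  # e.g. from an inverse-density weighting
estimates = weighted_smooth_bootstrap(sio2, weights, niter=20, seed=0)
```

Each metric holds one value per iteration, so the spread of estimates["mean"] gives a sense of the uncertainty on the weighted mean.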