pyrolite.util.synthetic
Utility functions for creating synthetic (geochemical) data.
- pyrolite.util.synthetic.random_cov_matrix(dim, sigmas=None, validate=False, seed=None)
Generate a random covariance matrix which is symmetric positive-semidefinite.
- Parameters
dim (int) – Dimensionality of the covariance matrix.
sigmas (numpy.ndarray) – Optionally specified sigmas for the variables.
validate (bool) – Whether to validate output.
- Returns
Covariance matrix of shape (dim, dim).
- Return type
numpy.ndarray
Todo
Implement a characteristic scale for the covariance matrix.
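The following is a minimal sketch of one way to build a random symmetric positive-semidefinite matrix and to check the result. It is not pyrolite's implementation: the helper names `sketch_random_cov` and `is_valid_cov` are hypothetical, and the construction (a random matrix times its transpose, rescaled by per-variable sigmas) is an assumption chosen for illustration.

```python
import numpy as np


def sketch_random_cov(dim, sigmas=None, seed=None):
    """Hypothetical helper: build a random symmetric PSD covariance matrix."""
    rng = np.random.default_rng(seed)
    # Any matrix of the form A @ A.T is symmetric positive-semidefinite.
    a = rng.standard_normal((dim, dim))
    cov = a @ a.T
    # Rescale to a correlation matrix (unit diagonal).
    d = np.sqrt(np.diag(cov))
    corr = cov / np.outer(d, d)
    if sigmas is None:
        sigmas = np.ones(dim)
    sigmas = np.asarray(sigmas, dtype=float)
    # Apply the per-variable standard deviations: cov_ij = s_i * corr_ij * s_j.
    return corr * np.outer(sigmas, sigmas)


def is_valid_cov(cov, tol=1e-8):
    """Check symmetry and non-negative eigenvalues (within a tolerance)."""
    symmetric = np.allclose(cov, cov.T)
    eigenvalues = np.linalg.eigvalsh(cov)
    return bool(symmetric and eigenvalues.min() >= -tol)


cov = sketch_random_cov(4, sigmas=[0.1, 0.2, 0.3, 0.4], seed=0)
print(cov.shape, is_valid_cov(cov))
```

The check in `is_valid_cov` illustrates the kind of test a `validate` option could perform; whether pyrolite checks exactly these properties is not stated here.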
- pyrolite.util.synthetic.random_composition(size=1000, D=4, mean=None, cov=None, propnan=0.1, missing_columns=None, missing=None, seed=None)
Generate a simulated random unimodal compositional dataset, optionally with missing data.
- Parameters
size (int) – Size of the dataset.
D (int) – Dimensionality of the dataset.
mean (numpy.ndarray, None) – Optional specification of mean composition.
cov (numpy.ndarray, None) – Optional specification of covariance matrix (in log space).
propnan (float, [0, 1)) – Proportion of missing values in the output dataset.
missing_columns (int | tuple) – Specification of columns to be missing. If an integer is specified, it is interpreted as the number of columns containing missing data (at a proportion defined by propnan). If a tuple or list, the specific columns to contain missing data.
missing (str, None) – Missingness pattern. If not None, one of "MCAR", "MAR", "MNAR".
If missing = "MCAR", data will be missing completely at random.
If missing = "MAR", data will be missing with some relationship to other parameters.
If missing = "MNAR", data will be thresholded at some lower bound.
seed (int, None) – Random seed to use, optionally specified.
- Returns
Simulated dataset with missing values.
- Return type
numpy.ndarray
Todo
Add a feature to translate rough covariance in D to log-covariance in D-1.
Update the missing = "MAR" example to be more realistic/variable.
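The three missingness patterns described for the missing parameter can be sketched with plain NumPy. This is an illustration of the concepts only, not necessarily how pyrolite implements them: the choice of column, the use of quantiles to hit the target proportion, and the variable names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
size, D, propnan = 1000, 4, 0.1

# Draw in log space, then exponentiate and close each row to sum to 1.
logdata = rng.multivariate_normal(np.zeros(D), 0.1 * np.eye(D), size=size)
data = np.exp(logdata)
data /= data.sum(axis=1, keepdims=True)

col = 0  # column that will receive missing values (arbitrary choice)

# MCAR: every value in the column has the same chance of being missing.
mcar = data.copy()
mcar[rng.random(size) < propnan, col] = np.nan

# MAR: missingness in `col` depends on another, fully observed column.
mar = data.copy()
other = data[:, 1]
mar[other > np.quantile(other, 1 - propnan), col] = np.nan

# MNAR: values below a lower bound (e.g. a detection limit) are missing.
mnar = data.copy()
threshold = np.quantile(data[:, col], propnan)
mnar[data[:, col] < threshold, col] = np.nan

for name, arr in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, np.isnan(arr[:, col]).mean())
```

In the MNAR case the missing values are exactly the smallest ones in the column, which is why this pattern behaves like thresholding at a lower bound.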
- pyrolite.util.synthetic.normal_frame(columns=['SiO2', 'CaO', 'MgO', 'FeO', 'TiO2'], size=10, mean=None, **kwargs)
Creates a pandas.DataFrame with samples from a single multivariate-normal distributed composition.
- Parameters
columns (list) – List of columns to use for the dataframe. These won't have any direct impact on the data returned, and are only for labelling.
size (int) – Index length for the dataframe.
mean (numpy.ndarray, None) – Optional specification of mean composition.
- Return type
pandas.DataFrame
- pyrolite.util.synthetic.normal_series(index=['SiO2', 'CaO', 'MgO', 'FeO', 'TiO2'], mean=None, **kwargs)
Creates a pandas.Series with a single sample from a single multivariate-normal distributed composition.
- Parameters
index (list) – List of indexes for the series. These won't have any direct impact on the data returned, and are only for labelling.
mean (numpy.ndarray, None) – Optional specification of mean composition.
- Return type
pandas.Series
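As a rough sketch of the relationship between the labelled frame and series forms, the snippet below builds a small compositional table with NumPy and pandas. It does not call pyrolite; the mean composition is a placeholder, and independent noise in log space is used for brevity in place of a full multivariate-normal draw.

```python
import numpy as np
import pandas as pd

columns = ["SiO2", "CaO", "MgO", "FeO", "TiO2"]
rng = np.random.default_rng(0)

# Assumed mean composition (placeholder values that sum to 1).
mean = np.array([0.5, 0.2, 0.15, 0.1, 0.05])

# Sample around the mean in log space, then close each row to sum to 1.
logsamples = np.log(mean) + rng.normal(scale=0.1, size=(10, len(columns)))
samples = np.exp(logsamples)
samples /= samples.sum(axis=1, keepdims=True)

# The column names only label the data; they do not change the values drawn.
frame = pd.DataFrame(samples, columns=columns)

# A single sample is a Series indexed by the same labels.
series = frame.iloc[0]
print(frame.shape, list(series.index))
```

Swapping the column labels for any other list of the same length would leave the numeric values unchanged, which is the sense in which the labels have no direct impact on the data returned.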
- pyrolite.util.synthetic.example_spider_data(start='EMORB_SM89', norm_to='PM_PON', size=120, noise_level=0.5, offsets=None, units='ppm')
Generate some random data for demonstrating spider plots.
By default, this generates a composition based around EMORB, normalised to Primitive Mantle.
- Parameters
start (str) – Composition to start with.
norm_to (str) – Composition to normalise to. Can optionally specify None.
size (int) – Number of observations to include (index length).
noise_level (float) – Log-units of noise (1 sigma).
offsets (dict) – Dictionary of offsets in log-units.
units (str) – Units to use before conversion. Should have no effect other than reducing calculation times if norm_to is None.
- Returns
df – Dataframe of example synthetic data.
- Return type
pandas.DataFrame
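The roles of noise_level, offsets and norm_to can be illustrated with a small stand-alone sketch. It does not use pyrolite or its reference compositions: the abundances below are round placeholder numbers, not real EMORB or Primitive Mantle values, and natural-log units are assumed for the noise and offsets.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Placeholder abundances (ppm); NOT real EMORB or Primitive Mantle values.
start = pd.Series({"La": 10.0, "Ce": 20.0, "Nd": 10.0, "Yb": 2.0})
reference = pd.Series({"La": 1.0, "Ce": 2.0, "Nd": 1.0, "Yb": 0.5})

size, noise_level = 120, 0.5
offsets = {"Ce": 1.0}  # shift Ce up by one (natural) log unit

# Work in log space: repeat the starting composition and add Gaussian noise.
logdata = np.log(start.to_numpy()) + rng.normal(
    scale=noise_level, size=(size, start.size)
)
df = pd.DataFrame(logdata, columns=start.index)

# Apply per-element offsets in log units.
for element, shift in offsets.items():
    df[element] += shift

# Back-transform and normalise to the reference composition.
df = np.exp(df) / reference
print(df.shape)
```

Because the noise and offsets are applied before back-transforming, they act multiplicatively on the abundances, which is what produces parallel-looking scatter on a log-scaled spider plot.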