aepsych.benchmark

Submodules

aepsych.benchmark.benchmark module

class aepsych.benchmark.benchmark.Benchmark(problems, configs, seed=None, n_reps=1, log_every=10)[source]

Bases: object

Benchmark base class.

This class wraps standard functionality for benchmarking models including generating cartesian products of run configurations, running the simulated experiment loop, and logging results.

TODO make a benchmarking tutorial and link/refer to it here.

Initialize benchmark.

Parameters:
  • problems (List[Problem]) – Problem objects containing the test function to evaluate.

  • configs (Mapping[str, Union[str, list]]) – Dictionary of configs to run. Lists at leaves are used to construct a cartesian product of configurations.

  • seed (int, optional) – Random seed to use for reproducible benchmarks. Defaults to randomized seeds.

  • n_reps (int, optional) – Number of repetitions to run of each configuration. Defaults to 1.

  • log_every (int, optional) – Logging interval during an experiment. Defaults to logging every 10 trials.

make_benchmark_list(**bench_config)[source]

Generate a list of benchmarks to run from configuration.

This constructs a cartesian product of config dicts using lists at the leaves of the base config.

Returns:

List of dictionaries, each of which can be passed to aepsych.config.Config.

Return type:

List[dict[str, float]]
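The cartesian-product expansion described above can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: the helper name expand_config is hypothetical, and it assumes a two-level dict (section, then key) in which list-valued leaves are the axes of the product.

```python
from itertools import product


def expand_config(bench_config):
    """Expand list-valued leaves of a two-level dict into a list of plain dicts."""
    # Collect every (section, key) leaf; wrap scalars so they act as one-element axes.
    leaves = [
        ((section, key), value if isinstance(value, list) else [value])
        for section, options in bench_config.items()
        for key, value in options.items()
    ]
    paths = [path for path, _ in leaves]
    axes = [axis for _, axis in leaves]

    configs = []
    for combo in product(*axes):
        config = {section: {} for section in bench_config}
        for (section, key), value in zip(paths, combo):
            config[section][key] = value
        configs.append(config)
    return configs


example = {
    "common": {"model": ["GPClassificationModel", "FancyNewModelToBenchmark"]},
    "init_strat": {"min_asks": [10, 20], "generator": "SobolGenerator"},
}
expanded = expand_config(example)
print(len(expanded))  # 2 models x 2 min_asks values = 4 configurations
```

Scalar leaves such as "generator" appear unchanged in every expanded configuration; only the list-valued leaves multiply the number of runs.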

materialize_config(config_dict)[source]
property num_benchmarks: int

Return the total number of runs in this benchmark.

Returns:

Total number of runs in this benchmark.

Return type:

int
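As a rough sketch of how the total could be counted, assume that every problem is run with every config combination and every repetition. That counting rule and the helper name count_runs are assumptions made for illustration, not taken from the library.

```python
from math import prod


def count_runs(n_problems, configs, n_reps):
    """Count runs as problems x config combinations x repetitions (assumed rule)."""
    n_combinations = prod(
        len(value) if isinstance(value, list) else 1
        for options in configs.values()
        for value in options.values()
    )
    return n_problems * n_combinations * n_reps


configs = {
    "common": {"model": ["A", "B"]},
    "init_strat": {"min_asks": [10, 20], "generator": "SobolGenerator"},
}
total = count_runs(n_problems=3, configs=configs, n_reps=5)
print(total)  # 3 problems x 4 combinations x 5 reps = 60
```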

make_strat_and_flatconfig(config_dict)[source]
From a config dict, generate a strategy (for running) and a flattened config (for logging).

Parameters:

config_dict (Mapping[str, str]) – A run configuration dictionary.

Returns:

A tuple containing a strategy object and a flat config.

Return type:

Tuple[SequentialStrategy, Dict[str,str]]

run_experiment(problem, config_dict, seed, rep)[source]

Run one simulated experiment.

Parameters:
  • config_dict (Dict[str, str]) – AEPsych configuration to use.

  • seed (int) – Random seed for this run.

  • rep (int) – Index of this repetition.

  • problem (Problem) – Problem containing the test function to evaluate.

Returns:

A tuple containing a log of the results and the strategy as of the end of the simulated experiment. This is ignored in large-scale benchmarks but useful for one-off visualization.

Return type:

Tuple[List[Dict[str, object]], SequentialStrategy]
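The shape of one simulated run (seeded sampling, a trial loop, periodic logging) can be sketched without the library. Everything here is a stand-in: run_simulated_experiment, the uniform point generator, and the logged fields are hypothetical and only mirror the parameters named above.

```python
import random


def run_simulated_experiment(p, n_trials, seed, rep, log_every=10):
    """Simulate binary responses from probability function p and log periodically."""
    rng = random.Random(seed)  # a per-run seed makes the run reproducible
    data, log = [], []
    for i in range(n_trials):
        x = rng.uniform(-1.0, 1.0)    # stand-in for asking the strategy for a point
        y = int(rng.random() < p(x))  # simulated Bernoulli response at x
        data.append((x, y))           # stand-in for telling the strategy the outcome
        if i % log_every == 0:        # assumed rule: log on multiples of log_every
            log.append({"trial_id": i, "rep": rep, "seed": seed, "n_obs": len(data)})
    return log, data


log, data = run_simulated_experiment(lambda x: 0.5 + 0.4 * x, n_trials=25, seed=1, rep=0)
print([row["trial_id"] for row in log])  # [0, 10, 20]
```

Returning both the log and the collected data mirrors the documented return value: the log feeds the results table, while the final state is mainly useful for inspecting a single run.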

run_benchmarks()[source]

Run all the benchmarks, sequentially.

flatten_config(config)[source]

Flatten a config object for logging.

Parameters:

config (Config) – AEPsych config object.

Returns:

A flat dictionary (that can be used to build a flat pandas data frame).

Return type:

Dict[str,str]
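Flattening for logging can be pictured as joining section and key names into single column names. The sketch below is an assumption about the general idea only: the function name flatten and the underscore separator are hypothetical and may not match the library's naming.

```python
def flatten(config, sep="_"):
    """Flatten a two-level dict into one level with 'section<sep>key' names."""
    return {
        f"{section}{sep}{key}": str(value)  # string values, matching Dict[str, str]
        for section, options in config.items()
        for key, value in options.items()
    }


flat = flatten({"common": {"model": "GPClassificationModel"}, "init_strat": {"min_asks": 10}})
print(flat)
```

Each key of the flat dictionary can then serve as one column of a results table, with one row per logged trial.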

log_at(i)[source]

Check if we should log on this trial index.

Parameters:

i (int) – Trial index to (maybe) log at.

Returns:

True if this trial should be logged.

Return type:

bool

pandas()[source]
Return type:

DataFrame

class aepsych.benchmark.benchmark.DerivedValue(args, func)[source]

Bases: object

A class for dynamically generating config values from other config values during benchmarking.

Initialize DerivedValue.

Parameters:
  • args (List[Tuple[str]]) – Each tuple in this list is a pair of strings that refer to keys in a nested dictionary.

  • func (Callable) – A function that accepts args as input.

For example, consider the following:

    from aepsych.benchmark import DerivedValue

    benchmark_config = {
        "common": {
            "model": ["GPClassificationModel", "FancyNewModelToBenchmark"],
            "acqf": "MCLevelSetEstimation",
        },
        "init_strat": {
            "min_asks": [10, 20],
            "generator": "SobolGenerator",
        },
        "opt_strat": {
            "generator": "OptimizeAcqfGenerator",
            "min_asks": DerivedValue(
                [("init_strat", "min_asks"), ("common", "model")],
                lambda x, y: 100 - x if y == "GPClassificationModel" else 50 - x,
            ),
        },
    }

Four separate benchmarks would be generated from benchmark_config:
  1. model = GPClassificationModel; init trials = 10; opt trials = 90

  2. model = GPClassificationModel; init trials = 20; opt trials = 80

  3. model = FancyNewModelToBenchmark; init trials = 10; opt trials = 40

  4. model = FancyNewModelToBenchmark; init trials = 20; opt trials = 30

Note that you can also pass problem names into func by including (“problem”, “name”) in args.
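How a derived value might be resolved once the other config values are fixed can be sketched as follows. DerivedValueSketch and resolve_derived are hypothetical stand-ins written for this example; they show the lookup-then-call idea and are not the library's code.

```python
class DerivedValueSketch:
    """Hypothetical stand-in: holds key paths and a function of the values found there."""

    def __init__(self, args, func):
        self.args = args
        self.func = func


def resolve_derived(config):
    """Replace each DerivedValueSketch leaf with func(*values looked up via args)."""
    resolved = {section: dict(options) for section, options in config.items()}
    for section, options in config.items():
        for key, value in options.items():
            if isinstance(value, DerivedValueSketch):
                inputs = [config[s][k] for s, k in value.args]
                resolved[section][key] = value.func(*inputs)
    return resolved


derived = DerivedValueSketch(
    [("init_strat", "min_asks"), ("common", "model")],
    lambda x, y: 100 - x if y == "GPClassificationModel" else 50 - x,
)
run = {
    "common": {"model": "FancyNewModelToBenchmark"},
    "init_strat": {"min_asks": 20},
    "opt_strat": {"min_asks": derived},
}
print(resolve_derived(run)["opt_strat"]["min_asks"])  # 50 - 20 = 30
```

This reproduces the fourth benchmark in the list above: the non-GP model with 20 initialization trials gets 30 optimization trials.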

aepsych.benchmark.pathos_benchmark module

class aepsych.benchmark.pathos_benchmark.PathosBenchmark(nproc=1, *args, **kwargs)[source]

Bases: Benchmark

Benchmarking class for parallelized benchmarks using pathos.

Initialize pathos benchmark.

Parameters:

nproc (int, optional) – Number of cores to use. Defaults to 1.

run_experiment(problem, config_dict, seed, rep)[source]

Run one simulated experiment.

Parameters:
  • config_dict (Dict[str, Any]) – AEPsych configuration to use.

  • seed (int) – Random seed for this run.

  • rep (int) – Index of this repetition.

  • problem (Problem) – Problem containing the test function to evaluate.

Returns:

A tuple containing a log of the results and the strategy as of the end of the simulated experiment. This is ignored in large-scale benchmarks but useful for one-off visualization.

Return type:

Tuple[List[Dict[str, Any]], SequentialStrategy]

run_benchmarks()[source]

Run all the benchmarks.

Note that this blocks while waiting for benchmarks to complete. If you would like to start benchmarks and periodically collect partial results, use start_benchmarks and then call collate_benchmarks(wait=False) on some interval.

start_benchmarks()[source]

Start benchmark run.

This does not block: after running it, self.futures holds the status of benchmarks running in parallel.

property is_done: bool

Check if the benchmark is done.

Returns:

True if all futures are cleared and benchmark is done.

Return type:

bool

collate_benchmarks(wait=False)[source]

Collect benchmark results from completed futures.

Parameters:
  • wait (bool, optional) – If True, this method blocks and waits on all futures to complete. Defaults to False.

Return type:

None
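The start-then-poll pattern described for start_benchmarks and collate_benchmarks(wait=False) can be illustrated with the standard library. This sketch uses concurrent.futures as a stand-in for pathos, and fake_benchmark and collate are hypothetical helpers, so it shows the control flow only.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_benchmark(run_id):
    """Stand-in for one benchmark run; sleeps briefly and returns a result row."""
    time.sleep(0.01 * run_id)
    return {"run_id": run_id}


def collate(futures, results, wait=False):
    """Move results of finished futures into results; optionally block on the rest."""
    pending = []
    for future in futures:
        if wait or future.done():
            results.append(future.result())  # result() blocks until the future finishes
        else:
            pending.append(future)
    return pending


with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(fake_benchmark, i) for i in range(4)]  # non-blocking start
    results = []
    while futures:  # done once no futures remain
        futures = collate(futures, results, wait=False)
        time.sleep(0.005)  # polling interval

print(sorted(row["run_id"] for row in results))  # [0, 1, 2, 3]
```

Partial results are available in results after every polling pass, which is what makes periodic checkpointing possible while runs are still in flight.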

aepsych.benchmark.pathos_benchmark.run_benchmarks_with_checkpoints(out_path, benchmark_name, problems, configs, global_seed=None, start_idx=0, n_chunks=1, n_reps_per_chunk=1, log_every=None, checkpoint_every=60, n_proc=1, serial_debug=False)[source]

Runs a series of benchmarks, saving both final and intermediate results to .csv files. Benchmarks are run in sequential chunks, each of which runs all combinations of problems/configs/reps in parallel. This function should always be called from within the “if __name__ == '__main__': …” idiom.

Parameters:
  • out_path (str) – The path to save the results to.

  • benchmark_name (str) – A name given to this set of benchmarks. Results will be saved in files named like “out_path/benchmark_name_chunk{chunk_number}_out.csv”

  • problems (List[Problem]) – Problem objects containing the test function to evaluate.

  • configs (Mapping[str, Union[str, list]]) – Dictionary of configs to run. Lists at leaves are used to construct a cartesian product of configurations.

  • global_seed (int, optional) – Global seed to use for reproducible benchmarks. Defaults to randomized seeds.

  • start_idx (int) – The chunk number to start from after the last checkpoint. Defaults to 0.

  • n_chunks (int) – The number of chunks to break the results into. Each chunk will contain at least 1 run of every combination of problem and config.

  • n_reps_per_chunk (int, optional) – Number of repetitions to run each problem/config in each chunk.

  • log_every (int, optional) – Logging interval during an experiment. Defaults to only logging at the end.

  • checkpoint_every (int) – Save intermediate results every checkpoint_every seconds.

  • n_proc (int) – Number of processors to use.

  • serial_debug (bool) – If True, run the benchmarks serially rather than in parallel, which is useful for debugging. Defaults to False.

Return type:

None
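The file naming and the entry-point guard can be sketched together. The helper chunk_output_paths is hypothetical, and the assumption that chunks are numbered consecutively from start_idx is ours; only the file-name pattern comes from the parameter description above.

```python
from pathlib import Path


def chunk_output_paths(out_path, benchmark_name, start_idx=0, n_chunks=1):
    """List the per-chunk result files, following the naming pattern quoted above."""
    # Assumption: chunks are numbered start_idx, start_idx + 1, ..., one file each.
    return [
        Path(out_path) / f"{benchmark_name}_chunk{chunk}_out.csv"
        for chunk in range(start_idx, start_idx + n_chunks)
    ]


def main():
    paths = chunk_output_paths("results", "demo", start_idx=2, n_chunks=3)
    for path in paths:
        print(path.as_posix())
    return paths


if __name__ == "__main__":
    # Guard the entry point so that worker processes which re-import this module
    # do not start the benchmarks a second time.
    main()
```

The guard matters for any multiprocessing-based runner: on platforms that spawn rather than fork, each worker imports the main module, and unguarded top-level code would run again in every worker.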

aepsych.benchmark.problem module

class aepsych.benchmark.problem.Problem[source]

Bases: object

Wrapper for a problem or test function. Subclass from this and override f() to define your test function.

n_eval_points = 1000
property eval_grid
property name: str
f(x)[source]
property lb
property ub
property bounds
property metadata: Dict[str, Any]

A dictionary of metadata passed to the Benchmark to be logged. Each key will become a column in the Benchmark’s output dataframe, with its associated value stored in each row.

p(x)[source]

Evaluate response probability from test function.

Parameters:

x (torch.Tensor) – Points at which to evaluate.

Returns:

Response probability at queried points.

Return type:

torch.Tensor

sample_y(x)[source]

Sample a response from test function.

Parameters:

x (torch.Tensor) – Points at which to sample.

Returns:

A single (bernoulli) sample at points.

Return type:

np.ndarray
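The relationship between f(), p() and sample_y() can be sketched with a small stand-in class. ProblemSketch and LinearProblem are hypothetical, scalar inputs replace tensors, and the probit link (standard normal CDF of f) is an assumption made for this illustration.

```python
import math
import random


class ProblemSketch:
    """Hypothetical stand-in for a problem: subclasses override f()."""

    def f(self, x):
        raise NotImplementedError

    def p(self, x):
        # Assumption: probit link, i.e. the standard normal CDF of f(x).
        return 0.5 * (1.0 + math.erf(self.f(x) / math.sqrt(2.0)))

    def sample_y(self, x, rng):
        # One Bernoulli draw with success probability p(x).
        return int(rng.random() < self.p(x))


class LinearProblem(ProblemSketch):
    def f(self, x):
        return 2.0 * x  # latent function: chance (p = 0.5) at x = 0


problem = LinearProblem()
rng = random.Random(0)
print(problem.p(0.0))  # 0.5
samples = [problem.sample_y(1.0, rng) for _ in range(1000)]
print(sum(samples) / len(samples))  # close to p(1.0), which is about 0.977
```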

f_hat(model)[source]

Generate mean predictions from the model over the evaluation grid.

Parameters:

model (aepsych.models.base.ModelProtocol) – Model to evaluate.

Returns:

Posterior mean from underlying model over the evaluation grid.

Return type:

torch.Tensor

property f_true: Tensor

Evaluate true test function over evaluation grid.

Returns:

Values of true test function over evaluation grid.

Return type:

torch.Tensor

property p_true: Tensor

Evaluate true response probability over evaluation grid.

Returns:

Values of true response probability over evaluation grid.

Return type:

torch.Tensor

p_hat(model)[source]

Generate mean predictions from the model over the evaluation grid.

Parameters:

model (aepsych.models.base.ModelProtocol) – Model to evaluate.

Returns:

Posterior mean from underlying model over the evaluation grid.

Return type:

torch.Tensor

evaluate(strat)[source]

Evaluate the strategy with respect to this problem.

Extend this in subclasses to add additional metrics. Metrics include:
  • mae (mean absolute error), mse (mean squared error), max_abs_err (max absolute error), and Pearson correlation. All of these are computed over the latent variable f and the outcome probability p, w.r.t. the posterior mean. Squared and absolute errors (miae, mise) are also computed in expectation over the posterior, by sampling.

  • Brier score, which measures how well-calibrated the outcome probability is, both at the posterior mean (plain brier) and in expectation over the posterior (expected_brier).

Parameters:

strat (aepsych.strategy.Strategy) – Strategy to evaluate.

Returns:

A dictionary containing metrics and their values.

Return type:

Dict[str, float]
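The pointwise error metrics have standard definitions that can be written out directly. This is a sketch using plain lists in place of tensors; the function name pointwise_metrics and the dictionary keys are illustrative and may differ from the names the library logs.

```python
import math


def pointwise_metrics(true, pred):
    """Compute mae, mse, max_abs_err and Pearson correlation for two equal-length lists."""
    n = len(true)
    errors = [p - t for t, p in zip(true, pred)]
    mean_t, mean_p = sum(true) / n, sum(pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(true, pred))
    var_t = sum((t - mean_t) ** 2 for t in true)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "mse": sum(e * e for e in errors) / n,
        "max_abs_err": max(abs(e) for e in errors),
        "corr": cov / math.sqrt(var_t * var_p),
    }


f_true = [0.0, 1.0, 2.0, 3.0]
f_hat = [0.5, 1.0, 2.0, 2.5]
metrics = pointwise_metrics(f_true, f_hat)
print(metrics["mae"], metrics["mse"], metrics["max_abs_err"])  # 0.25 0.125 0.5
```

The same function applies to either the latent values f or the probabilities p, since both are compared against their true values on the same evaluation grid.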

class aepsych.benchmark.problem.LSEProblem(thresholds)[source]

Bases: Problem

Level set estimation problem.

This extends the base problem class to evaluate the LSE/threshold estimate in addition to the function estimate.

Parameters:

thresholds (Union[float, List]) –

property metadata: Dict[str, Any]

A dictionary of metadata passed to the Benchmark to be logged. Each key will become a column in the Benchmark’s output dataframe, with its associated value stored in each row.

f_threshold(model=None)[source]
Return type:

Tensor

property true_below_threshold: Tensor

Evaluate whether the true function is below threshold over the eval grid (used for proper scoring and threshold misclassification metric).

evaluate(strat)[source]

Evaluate the model with respect to this problem.

For level set estimation, we add metrics w.r.t. the true threshold:
  • brier_p_below_{thresh}, the Brier score w.r.t. p(f(x)<thresh), in contrast to the regular Brier score, which is for p(phi(f(x))=1), and the same for misclassification error.

Parameters:

strat (aepsych.strategy.Strategy) – Strategy to evaluate.

Returns:

A dictionary containing metrics and their values, including parent class metrics.

Return type:

Dict[str, float]
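The threshold-based metrics can be sketched as a Brier score and an error rate on the binary event f(x) < thresh. The function name, the misclassification key, and the 0.5 decision cutoff are assumptions for this example; only the brier_p_below_ prefix comes from the description above, and the library may scale the score differently.

```python
def threshold_metrics(f_true, p_below_hat, thresh):
    """Brier score and misclassification rate for the event f(x) < thresh."""
    below = [1.0 if f < thresh else 0.0 for f in f_true]  # true membership
    n = len(below)
    brier = sum((p - y) ** 2 for p, y in zip(p_below_hat, below)) / n
    # Assumption: a point is classified as below threshold when its probability exceeds 0.5.
    misclass = sum((p > 0.5) != (y == 1.0) for p, y in zip(p_below_hat, below)) / n
    return {f"brier_p_below_{thresh}": brier, f"misclass_on_thresh_{thresh}": misclass}


f_true = [-1.0, 0.2, 0.8, 1.5]
p_below_hat = [0.9, 0.6, 0.6, 0.1]  # made-up model probabilities that f(x) < 0.75
scores = threshold_metrics(f_true, p_below_hat, thresh=0.75)
print(scores)
```

Here the third point lies above the threshold but is assigned probability 0.6 of being below it, so it contributes both the largest squared error and the single misclassification.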

class aepsych.benchmark.problem.LSEProblemWithEdgeLogging(thresholds)[source]

Bases: LSEProblem

eps = 0.05
evaluate(strat)[source]

Evaluate the model with respect to this problem.

For level set estimation, we add metrics w.r.t. the true threshold:
  • brier_p_below_{thresh}, the Brier score w.r.t. p(f(x)<thresh), in contrast to the regular Brier score, which is for p(phi(f(x))=1), and the same for misclassification error.

Parameters:

strat (aepsych.strategy.Strategy) – Strategy to evaluate.

Returns:

A dictionary containing metrics and their values, including parent class metrics.

Return type:

Dict[str, float]

aepsych.benchmark.test_functions module

aepsych.benchmark.test_functions.make_songetal_threshfun(x, y)[source]

Generate a synthetic threshold function by interpolation of real data.

Real data is from Dubno et al. 2013, and procedure follows Song et al. 2017, 2018. See make_songetal_testfun for more detail.

Parameters:
  • x (np.ndarray) – Frequency

  • y (np.ndarray) – Threshold

Returns:

Function that interpolates the given frequencies and thresholds and returns threshold as a function of frequency.

Return type:

Callable[[float], float]
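Building a threshold function by interpolation amounts to returning a closure over the data points. The sketch below uses piecewise-linear interpolation with linear extrapolation at the ends, which is an assumption; the library may use a different interpolation scheme, and make_linear_threshfun is a hypothetical name.

```python
from bisect import bisect_right


def make_linear_threshfun(x, y):
    """Return a function that linearly interpolates thresholds y over frequencies x."""
    pairs = sorted(zip(x, y))
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]

    def threshfun(freq):
        # Pick the segment containing freq; the end segments also extrapolate.
        i = min(max(bisect_right(xs, freq), 1), len(xs) - 1)
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (freq - x0) / (x1 - x0)

    return threshfun


# Made-up illustrative values, not the Dubno et al. data.
threshfun = make_linear_threshfun([1.0, 2.0, 4.0], [10.0, 20.0, 30.0])
print(threshfun(3.0))  # 25.0
```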

aepsych.benchmark.test_functions.make_songetal_testfun(phenotype='Metabolic', beta=1)[source]

Make an audiometric test function following Song et al. 2017.

To do so, we first compute a threshold by interpolation/extrapolation from real data, then assume a linear psychometric function in intensity with slope beta.

Parameters:
  • phenotype (str, optional) – Audiometric phenotype from Dubno et al. 2013. Specifically, one of “Metabolic”, “Sensory”, “Metabolic+Sensory”, or “Older-normal”. Defaults to “Metabolic”.

  • beta (float, optional) – Psychometric function slope. Defaults to 1.

Returns:

A test function taking a [b x 2] array of points and returning the psychometric function value at those points.

Return type:

Callable[[np.ndarray, bool], np.ndarray]

Raises:

AssertionError – if an invalid phenotype is passed.

References

Song, X. D., Garnett, R., & Barbour, D. L. (2017). Psychometric function estimation by probabilistic classification. The Journal of the Acoustical Society of America, 141(4), 2513–2525. https://doi.org/10.1121/1.4979594

aepsych.benchmark.test_functions.novel_discrimination_testfun(x)[source]

Evaluate novel discrimination test function from Owen et al.

The threshold is roughly parabolic with context, and the slope varies with the threshold. Adding to the difficulty is the fact that the function is minimized at f=0 (or p=0.5), corresponding to discrimination being at chance at zero stimulus intensity.

Parameters:

x (np.ndarray) – Points at which to evaluate.

Returns:

Value of function at these points.

Return type:

np.ndarray

aepsych.benchmark.test_functions.novel_detection_testfun(x)[source]

Evaluate novel detection test function from Owen et al.

The threshold is roughly parabolic with context, and the slope varies with the threshold.

Parameters:

x (np.ndarray, torch.Tensor) – Points at which to evaluate.

Returns:

Value of function at these points.

Return type:

np.ndarray, torch.Tensor

aepsych.benchmark.test_functions.discrim_highdim(x)[source]
aepsych.benchmark.test_functions.modified_hartmann6(X)[source]

The modified Hartmann6 function used in Lyu et al.

aepsych.benchmark.test_functions.f_1d(x, mu=0)[source]

The latent function is just a Gaussian bump at mu.

aepsych.benchmark.test_functions.f_2d(x)[source]

A Gaussian bump at (0, 0).
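A Gaussian bump of this kind can be written in a few lines. The width and normalization used by the library are not stated here, so this sketch assumes unit width and no normalization; bump_1d and bump_2d are hypothetical names operating on scalars.

```python
import math


def bump_1d(x, mu=0.0):
    """Unnormalized Gaussian bump centered at mu (unit width assumed)."""
    return math.exp(-0.5 * (x - mu) ** 2)


def bump_2d(x0, x1):
    """Unnormalized Gaussian bump centered at the origin (unit width assumed)."""
    return math.exp(-0.5 * (x0 ** 2 + x1 ** 2))


print(bump_1d(0.0), bump_2d(0.0, 0.0))  # 1.0 1.0
```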

aepsych.benchmark.test_functions.new_novel_det_params(freq, scale_factor=1.0)[source]
Get the loc and scale params for 2D synthetic novel_det(frequency) function

Keyword arguments:
  • freq – 1D array of frequencies whose thresholds to return

  • scale_factor – scale for the novel_det function, where higher is steeper/lower SD

  • target – target threshold

aepsych.benchmark.test_functions.target_new_novel_det(freq, scale_factor=1.0, target=0.75)[source]
Get the target (i.e. threshold) for 2D synthetic novel_det(frequency) function

Keyword arguments:
  • freq – 1D array of frequencies whose thresholds to return

  • scale_factor – scale for the novel_det function, where higher is steeper/lower SD

  • target – target threshold

aepsych.benchmark.test_functions.new_novel_det(x, scale_factor=1.0)[source]
Get the cdf for 2D synthetic novel_det(frequency) function

Keyword arguments:
  • x – array of shape (n,2) of locations to sample; x[...,0] is frequency from -1 to 1; x[...,1] is intensity from -1 to 1

  • scale_factor – scale for the novel_det function, where higher is steeper/lower SD

aepsych.benchmark.test_functions.cdf_new_novel_det(x, scale_factor=1.0)[source]
Get the cdf for 2D synthetic novel_det(frequency) function

Keyword arguments:
  • x – array of shape (n,2) of locations to sample; x[...,0] is frequency from -1 to 1; x[...,1] is intensity from -1 to 1

  • scale_factor – scale for the novel_det function, where higher is steeper/lower SD

aepsych.benchmark.test_functions.new_novel_det_channels_params(channel, scale_factor=1.0, wave_freq=1, target=0.75)[source]
Get the target parameters for 2D synthetic novel_det(channel) function

Keyword arguments:
  • channel – 1D array of channel locations whose thresholds to return

  • scale_factor – scale for the novel_det function, where higher is steeper/lower SD

  • wave_freq – frequency of location waveform on [-1,1]

  • target – target threshold

aepsych.benchmark.test_functions.target_new_novel_det_channels(channel, scale_factor=1.0, wave_freq=1, target=0.75)[source]
Get the target (i.e. threshold) for 2D synthetic novel_det(channel) function

Keyword arguments:
  • channel – 1D array of channel locations whose thresholds to return

  • scale_factor – scale for the novel_det function, where higher is steeper/lower SD

  • wave_freq – frequency of location waveform on [-1,1]

  • target – target threshold

aepsych.benchmark.test_functions.new_novel_det_channels(x, channel, scale_factor=1.0, wave_freq=1, target=0.75)[source]
Get the 2D synthetic novel_det(channel) function

Keyword arguments:
  • x – array of shape (n,2) of locations to sample; x[...,0] is channel from -1 to 1; x[...,1] is intensity from -1 to 1

  • scale_factor – scale for the novel_det function, where higher is steeper/lower SD

  • wave_freq – frequency of location waveform on [-1,1]

aepsych.benchmark.test_functions.cdf_new_novel_det_channels(channel, scale_factor=1.0, wave_freq=1, target=0.75)[source]
Get the cdf for 2D synthetic novel_det(channel) function

Keyword arguments:
  • x – array of shape (n,2) of locations to sample; x[...,0] is channel from -1 to 1; x[...,1] is intensity from -1 to 1

  • scale_factor – scale for the novel_det function, where higher is steeper/lower SD

  • wave_freq – frequency of location waveform on [-1,1]

aepsych.benchmark.test_functions.new_novel_det_3D_params(x, scale_factor=1.0)[source]
aepsych.benchmark.test_functions.new_novel_det_3D(x, scale_factor=1.0)[source]

Get the synthetic 3D novel_det function over freqs,channels and amplitudes

aepsych.benchmark.test_functions.cdf_new_novel_det_3D(x, scale_factor=1.0)[source]

Get the cdf for 3D synthetic novel_det function

  • x – array of shape (n,3) of locations to sample; x[...,0] is frequency, x[...,1] is channel, x[...,2] is intensity

  • scale_factor – scale for the novel_det function, where higher is steeper/lower SD

aepsych.benchmark.test_functions.target_new_novel_det_3D(x, scale_factor=1.0, target=0.75)[source]

Get target for 3D synthetic novel_det function at location x

  • x – array of shape (n,2) of locations to sample; x[...,0] is frequency, x[...,1] is channel

  • scale_factor – scale for the novel_det function, where higher is steeper/lower SD

  • target – target threshold

aepsych.benchmark.test_functions.f_pairwise(f, x, noise_scale=1)[source]

Module contents

class aepsych.benchmark.Benchmark(problems, configs, seed=None, n_reps=1, log_every=10)[source]

Bases: object

Benchmark base class.

This class wraps standard functionality for benchmarking models including generating cartesian products of run configurations, running the simulated experiment loop, and logging results.

TODO make a benchmarking tutorial and link/refer to it here.

Initialize benchmark.

Parameters:
  • problems (List[Problem]) – Problem objects containing the test function to evaluate.

  • configs (Mapping[str, Union[str, list]]) – Dictionary of configs to run. Lists at leaves are used to construct a cartesian product of configurations.

  • seed (int, optional) – Random seed to use for reproducible benchmarks. Defaults to randomized seeds.

  • n_reps (int, optional) – Number of repetitions to run of each configuration. Defaults to 1.

  • log_every (int, optional) – Logging interval during an experiment. Defaults to logging every 10 trials.

make_benchmark_list(**bench_config)[source]

Generate a list of benchmarks to run from configuration.

This constructs a cartesian product of config dicts using lists at the leaves of the base config.

Returns:

List of dictionaries, each of which can be passed to aepsych.config.Config.

Return type:

List[dict[str, float]]

materialize_config(config_dict)[source]
property num_benchmarks: int

Return the total number of runs in this benchmark.

Returns:

Total number of runs in this benchmark.

Return type:

int

make_strat_and_flatconfig(config_dict)[source]
From a config dict, generate a strategy (for running) and a flattened config (for logging).

Parameters:

config_dict (Mapping[str, str]) – A run configuration dictionary.

Returns:

A tuple containing a strategy object and a flat config.

Return type:

Tuple[SequentialStrategy, Dict[str,str]]

run_experiment(problem, config_dict, seed, rep)[source]

Run one simulated experiment.

Parameters:
  • config_dict (Dict[str, str]) – AEPsych configuration to use.

  • seed (int) – Random seed for this run.

  • rep (int) – Index of this repetition.

  • problem (Problem) – Problem containing the test function to evaluate.

Returns:

A tuple containing a log of the results and the strategy as of the end of the simulated experiment. This is ignored in large-scale benchmarks but useful for one-off visualization.

Return type:

Tuple[List[Dict[str, object]], SequentialStrategy]

run_benchmarks()[source]

Run all the benchmarks, sequentially.

flatten_config(config)[source]

Flatten a config object for logging.

Parameters:

config (Config) – AEPsych config object.

Returns:

A flat dictionary (that can be used to build a flat pandas data frame).

Return type:

Dict[str,str]

log_at(i)[source]

Check if we should log on this trial index.

Parameters:

i (int) – Trial index to (maybe) log at.

Returns:

True if this trial should be logged.

Return type:

bool

pandas()[source]
Return type:

DataFrame

class aepsych.benchmark.DerivedValue(args, func)[source]

Bases: object

A class for dynamically generating config values from other config values during benchmarking.

Initialize DerivedValue.

Parameters:
  • args (List[Tuple[str]]) – Each tuple in this list is a pair of strings that refer to keys in a nested dictionary.

  • func (Callable) – A function that accepts args as input.

For example, consider the following:

    from aepsych.benchmark import DerivedValue

    benchmark_config = {
        "common": {
            "model": ["GPClassificationModel", "FancyNewModelToBenchmark"],
            "acqf": "MCLevelSetEstimation",
        },
        "init_strat": {
            "min_asks": [10, 20],
            "generator": "SobolGenerator",
        },
        "opt_strat": {
            "generator": "OptimizeAcqfGenerator",
            "min_asks": DerivedValue(
                [("init_strat", "min_asks"), ("common", "model")],
                lambda x, y: 100 - x if y == "GPClassificationModel" else 50 - x,
            ),
        },
    }

Four separate benchmarks would be generated from benchmark_config:
  1. model = GPClassificationModel; init trials = 10; opt trials = 90

  2. model = GPClassificationModel; init trials = 20; opt trials = 80

  3. model = FancyNewModelToBenchmark; init trials = 10; opt trials = 40

  4. model = FancyNewModelToBenchmark; init trials = 20; opt trials = 30

Note that you can also pass problem names into func by including (“problem”, “name”) in args.

class aepsych.benchmark.PathosBenchmark(nproc=1, *args, **kwargs)[source]

Bases: Benchmark

Benchmarking class for parallelized benchmarks using pathos.

Initialize pathos benchmark.

Parameters:

nproc (int, optional) – Number of cores to use. Defaults to 1.

run_experiment(problem, config_dict, seed, rep)[source]

Run one simulated experiment.

Parameters:
  • config_dict (Dict[str, Any]) – AEPsych configuration to use.

  • seed (int) – Random seed for this run.

  • rep (int) – Index of this repetition.

  • problem (Problem) – Problem containing the test function to evaluate.

Returns:

A tuple containing a log of the results and the strategy as of the end of the simulated experiment. This is ignored in large-scale benchmarks but useful for one-off visualization.

Return type:

Tuple[List[Dict[str, Any]], SequentialStrategy]

run_benchmarks()[source]

Run all the benchmarks.

Note that this blocks while waiting for benchmarks to complete. If you would like to start benchmarks and periodically collect partial results, use start_benchmarks and then call collate_benchmarks(wait=False) on some interval.

start_benchmarks()[source]

Start benchmark run.

This does not block: after running it, self.futures holds the status of benchmarks running in parallel.

property is_done: bool

Check if the benchmark is done.

Returns:

True if all futures are cleared and benchmark is done.

Return type:

bool

collate_benchmarks(wait=False)[source]

Collect benchmark results from completed futures.

Parameters:
  • wait (bool, optional) – If True, this method blocks and waits on all futures to complete. Defaults to False.

Return type:

None

class aepsych.benchmark.Problem[source]

Bases: object

Wrapper for a problem or test function. Subclass from this and override f() to define your test function.

n_eval_points = 1000
property eval_grid
property name: str
f(x)[source]
property lb
property ub
property bounds
property metadata: Dict[str, Any]

A dictionary of metadata passed to the Benchmark to be logged. Each key will become a column in the Benchmark’s output dataframe, with its associated value stored in each row.

p(x)[source]

Evaluate response probability from test function.

Parameters:

x (torch.Tensor) – Points at which to evaluate.

Returns:

Response probability at queried points.

Return type:

torch.Tensor

sample_y(x)[source]

Sample a response from test function.

Parameters:

x (torch.Tensor) – Points at which to sample.

Returns:

A single (bernoulli) sample at points.

Return type:

np.ndarray

f_hat(model)[source]

Generate mean predictions from the model over the evaluation grid.

Parameters:

model (aepsych.models.base.ModelProtocol) – Model to evaluate.

Returns:

Posterior mean from underlying model over the evaluation grid.

Return type:

torch.Tensor

property f_true: Tensor

Evaluate true test function over evaluation grid.

Returns:

Values of true test function over evaluation grid.

Return type:

torch.Tensor

property p_true: Tensor

Evaluate true response probability over evaluation grid.

Returns:

Values of true response probability over evaluation grid.

Return type:

torch.Tensor

p_hat(model)[source]

Generate mean predictions from the model over the evaluation grid.

Parameters:

model (aepsych.models.base.ModelProtocol) – Model to evaluate.

Returns:

Posterior mean from underlying model over the evaluation grid.

Return type:

torch.Tensor

evaluate(strat)[source]

Evaluate the strategy with respect to this problem.

Extend this in subclasses to add additional metrics. Metrics include:
  • mae (mean absolute error), mse (mean squared error), max_abs_err (max absolute error), and Pearson correlation. All of these are computed over the latent variable f and the outcome probability p, w.r.t. the posterior mean. Squared and absolute errors (miae, mise) are also computed in expectation over the posterior, by sampling.

  • Brier score, which measures how well-calibrated the outcome probability is, both at the posterior mean (plain brier) and in expectation over the posterior (expected_brier).

Parameters:

strat (aepsych.strategy.Strategy) – Strategy to evaluate.

Returns:

A dictionary containing metrics and their values.

Return type:

Dict[str, float]

class aepsych.benchmark.LSEProblem(thresholds)[source]

Bases: Problem

Level set estimation problem.

This extends the base problem class to evaluate the LSE/threshold estimate in addition to the function estimate.

Parameters:

thresholds (Union[float, List]) –

property metadata: Dict[str, Any]

A dictionary of metadata passed to the Benchmark to be logged. Each key will become a column in the Benchmark’s output dataframe, with its associated value stored in each row.

f_threshold(model=None)[source]
Return type:

Tensor

property true_below_threshold: Tensor

Evaluate whether the true function is below threshold over the eval grid (used for proper scoring and threshold misclassification metric).

evaluate(strat)[source]

Evaluate the model with respect to this problem.

For level set estimation, we add metrics w.r.t. the true threshold:
  • brier_p_below_{thresh}, the Brier score w.r.t. p(f(x)<thresh), in contrast to the regular Brier score, which is for p(phi(f(x))=1), and the same for misclassification error.

Parameters:

strat (aepsych.strategy.Strategy) – Strategy to evaluate.

Returns:

A dictionary containing metrics and their values, including parent class metrics.

Return type:

Dict[str, float]

class aepsych.benchmark.LSEProblemWithEdgeLogging(thresholds)[source]

Bases: LSEProblem

eps = 0.05
evaluate(strat)[source]

Evaluate the model with respect to this problem.

For level set estimation, we add metrics w.r.t. the true threshold:
  • brier_p_below_{thresh}, the Brier score w.r.t. p(f(x)<thresh), in contrast to the regular Brier score, which is for p(phi(f(x))=1), and the same for misclassification error.

Parameters:

strat (aepsych.strategy.Strategy) – Strategy to evaluate.

Returns:

A dictionary containing metrics and their values, including parent class metrics.

Return type:

Dict[str, float]

aepsych.benchmark.make_songetal_testfun(phenotype='Metabolic', beta=1)[source]

Make an audiometric test function following Song et al. 2017.

To do so, we first compute a threshold by interpolation/extrapolation from real data, then assume a linear psychometric function in intensity with slope beta.

Parameters:
  • phenotype (str, optional) – Audiometric phenotype from Dubno et al. 2013. Specifically, one of “Metabolic”, “Sensory”, “Metabolic+Sensory”, or “Older-normal”. Defaults to “Metabolic”.

  • beta (float, optional) – Psychometric function slope. Defaults to 1.

Returns:

A test function taking a [b x 2] array of points and returning the psychometric function value at those points.

Return type:

Callable[[np.ndarray, bool], np.ndarray]

Raises:

AssertionError – if an invalid phenotype is passed.

References

Song, X. D., Garnett, R., & Barbour, D. L. (2017). Psychometric function estimation by probabilistic classification. The Journal of the Acoustical Society of America, 141(4), 2513–2525. https://doi.org/10.1121/1.4979594

aepsych.benchmark.novel_detection_testfun(x)[source]

Evaluate novel detection test function from Owen et al.

The threshold is roughly parabolic with context, and the slope varies with the threshold.

Parameters:

x (np.ndarray, torch.Tensor) – Points at which to evaluate.

Returns:

Value of function at these points.

Return type:

np.ndarray, torch.Tensor

aepsych.benchmark.novel_discrimination_testfun(x)[source]

Evaluate novel discrimination test function from Owen et al.

The threshold is roughly parabolic with context, and the slope varies with the threshold. Adding to the difficulty is the fact that the function is minimized at f=0 (or p=0.5), corresponding to discrimination being at chance at zero stimulus intensity.

Parameters:

x (np.ndarray) – Points at which to evaluate.

Returns:

Value of function at these points.

Return type:

np.ndarray

aepsych.benchmark.modified_hartmann6(X)[source]

The modified Hartmann6 function used in Lyu et al.

aepsych.benchmark.discrim_highdim(x)[source]
aepsych.benchmark.run_benchmarks_with_checkpoints(out_path, benchmark_name, problems, configs, global_seed=None, start_idx=0, n_chunks=1, n_reps_per_chunk=1, log_every=None, checkpoint_every=60, n_proc=1, serial_debug=False)[source]

Runs a series of benchmarks, saving both final and intermediate results to .csv files. Benchmarks are run in sequential chunks, each of which runs all combinations of problems/configs/reps in parallel. This function should always be called from within the “if __name__ == '__main__': …” idiom.

Parameters:
  • out_path (str) – The path to save the results to.

  • benchmark_name (str) – A name given to this set of benchmarks. Results will be saved in files named like “out_path/benchmark_name_chunk{chunk_number}_out.csv”

  • problems (List[Problem]) – Problem objects containing the test function to evaluate.

  • configs (Mapping[str, Union[str, list]]) – Dictionary of configs to run. Lists at leaves are used to construct a cartesian product of configurations.

  • global_seed (int, optional) – Global seed to use for reproducible benchmarks. Defaults to randomized seeds.

  • start_idx (int) – The chunk number to start from after the last checkpoint. Defaults to 0.

  • n_chunks (int) – The number of chunks to break the results into. Each chunk will contain at least 1 run of every combination of problem and config.

  • n_reps_per_chunk (int, optional) – Number of repetitions to run each problem/config in each chunk.

  • log_every (int, optional) – Logging interval during an experiment. Defaults to only logging at the end.

  • checkpoint_every (int) – Save intermediate results every checkpoint_every seconds.

  • n_proc (int) – Number of processors to use.

  • serial_debug (bool) – If True, run the benchmarks serially rather than in parallel, which is useful for debugging. Defaults to False.

Return type:

None

class aepsych.benchmark.DiscrimLowDim(thresholds=None)[source]

Bases: LSEProblemWithEdgeLogging

name = 'discrim_lowdim'
bounds = tensor([[-1., -1.], [ 1.,  1.]])
f(x)[source]
Parameters:

x (Tensor) –

Return type:

Tensor

class aepsych.benchmark.DiscrimHighDim(thresholds=None)[source]

Bases: LSEProblemWithEdgeLogging

name = 'discrim_highdim'
bounds = tensor([[-1.0000, -1.0000,  0.5000,  0.0500,  0.0500,  0.0000,  0.0000,  0.5000], [ 1.0000,  1.0000,  1.5000,  0.1500,  0.2000,  0.9000,  1.5700,  2.0000]])
f(x)[source]
Parameters:

x (Tensor) –

Return type:

Tensor

class aepsych.benchmark.Hartmann6Binary(thresholds=None)[source]

Bases: LSEProblemWithEdgeLogging

name = 'hartmann6_binary'
bounds = tensor([[0., 0., 0., 0., 0., 0.], [1., 1., 1., 1., 1., 1.]])
f(X)[source]
Parameters:

X (Tensor) –

Return type:

Tensor

class aepsych.benchmark.ContrastSensitivity6d(thresholds=None)[source]

Bases: LSEProblemWithEdgeLogging

Uses a surrogate model fit to real data from a contrast sensitivity study.

name = 'contrast_sensitivity_6d'
bounds = tensor([[-1.5000, -1.5000,  0.0000,  0.5000,  1.0000,  0.0000], [ 0.0000,  0.0000, 20.0000,  7.0000, 10.0000, 10.0000]])
f(X)[source]
Parameters:

X (Tensor) –

Return type:

Tensor