aepsych.benchmark¶
Submodules¶
aepsych.benchmark.benchmark module¶
- class aepsych.benchmark.benchmark.Benchmark(problems, configs, seed=None, n_reps=1, log_every=10)[source]¶
Bases:
object
Benchmark base class.
This class wraps standard functionality for benchmarking models including generating cartesian products of run configurations, running the simulated experiment loop, and logging results.
TODO make a benchmarking tutorial and link/refer to it here.
Initialize benchmark.
- Parameters
problems (List[Problem]) – Problem objects containing the test function to evaluate.
configs (Mapping[str, Union[str, list]]) – Dictionary of configs to run. Lists at leaves are used to construct a cartesian product of configurations.
seed (int, optional) – Random seed to use for reproducible benchmarks. Defaults to randomized seeds.
n_reps (int, optional) – Number of repetitions to run of each configuration. Defaults to 1.
log_every (int, optional) – Logging interval during an experiment. Defaults to logging every 10 trials.
- make_benchmark_list(**bench_config)[source]¶
Generate a list of benchmarks to run from configuration.
This constructs a cartesian product of config dicts using lists at the leaves of the base config.
- Returns
- List of dictionaries, each of which can be passed
to aepsych.config.Config.
- Return type
List[dict[str, float]]
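The cartesian-product expansion described above can be sketched in plain Python. This is a minimal illustration, not the AEPsych implementation: the helper name expand_configs is hypothetical, and it assumes a two-level config (section, then key) in which any list-valued leaf lists the options to sweep.

```python
import itertools
from typing import Any, Dict, List


def expand_configs(base: Dict[str, Dict[str, Any]]) -> List[Dict[str, Dict[str, Any]]]:
    """Expand a two-level config into one dict per combination of list-valued leaves."""
    # Collect (section, key, options) for every leaf; scalars become one-option lists.
    leaves = [
        (section, key, value if isinstance(value, list) else [value])
        for section, entries in base.items()
        for key, value in entries.items()
    ]
    expanded = []
    for combo in itertools.product(*(options for _, _, options in leaves)):
        config: Dict[str, Dict[str, Any]] = {section: {} for section in base}
        for (section, key, _), choice in zip(leaves, combo):
            config[section][key] = choice
        expanded.append(config)
    return expanded


base_config = {
    "common": {"model": ["GPClassificationModel", "FancyNewModelToBenchmark"]},
    "init_strat": {"min_asks": [10, 20], "generator": "SobolGenerator"},
}
configs = expand_configs(base_config)
print(len(configs))  # 2 models x 2 min_asks values = 4
```

Scalar leaves such as "generator" appear unchanged in every expanded config, so only list-valued leaves multiply the number of runs.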
- property num_benchmarks: int¶
Return the total number of runs in this benchmark.
- Returns
Total number of runs in this benchmark.
- Return type
int
- make_strat_and_flatconfig(config_dict)[source]¶
From a config dict, generate a strategy (for running) and a flattened config (for logging).
- Parameters
config_dict (Mapping[str, str]) – A run configuration dictionary.
- Returns
- A tuple containing a strategy
object and a flat config.
- Return type
Tuple[SequentialStrategy, Dict[str,str]]
- run_experiment(problem, config_dict, seed, rep)[source]¶
Run one simulated experiment.
- Parameters
config_dict (Dict[str, str]) – AEPsych configuration to use.
seed (int) – Random seed for this run.
rep (int) – Index of this repetition.
problem (Problem) – Problem to run the simulated experiment on.
- Returns
- A tuple containing a log of the results and the strategy as
of the end of the simulated experiment. This is ignored in large-scale benchmarks but useful for one-off visualization.
- Return type
Tuple[List[Dict[str, object]], SequentialStrategy]
- flatten_config(config)[source]¶
Flatten a config object for logging.
- Parameters
config (Config) – AEPsych config object.
- Returns
A flat dictionary (that can be used to build a flat pandas data frame).
- Return type
Dict[str,str]
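As a rough sketch of what flattening for logging can look like, the snippet below collapses a nested dictionary into a single-level one whose keys can serve as data frame columns. The "section_key" naming and the helper name flatten are assumptions made for illustration; the source does not specify how AEPsych names the flattened keys.

```python
from typing import Any, Dict


def flatten(nested: Dict[str, Dict[str, Any]]) -> Dict[str, str]:
    """Flatten {section: {key: value}} into {"section_key": "value"}."""
    # Assumption: keys are joined with an underscore and values are stringified.
    return {
        f"{section}_{key}": str(value)
        for section, entries in nested.items()
        for key, value in entries.items()
    }


flat = flatten({"common": {"acqf": "MCLevelSetEstimation"}, "init_strat": {"min_asks": 10}})
print(flat)
```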
- class aepsych.benchmark.benchmark.DerivedValue(args, func)[source]¶
Bases:
object
A class for dynamically generating config values from other config values during benchmarking.
Initialize DerivedValue.
- Parameters
args (List[Tuple[str]]) – Each tuple in this list is a pair of strings that refer to keys in a nested dictionary.
func (Callable) – A function that accepts args as input.
For example, consider the following:
    benchmark_config = {
        "common": {
            "model": ["GPClassificationModel", "FancyNewModelToBenchmark"],
            "acqf": "MCLevelSetEstimation",
        },
        "init_strat": {
            "min_asks": [10, 20],
            "generator": "SobolGenerator",
        },
        "opt_strat": {
            "generator": "OptimizeAcqfGenerator",
            "min_asks": DerivedValue(
                [("init_strat", "min_asks"), ("common", "model")],
                lambda x, y: 100 - x if y == "GPClassificationModel" else 50 - x,
            ),
        },
    }
Four separate benchmarks would be generated from benchmark_config:
model = GPClassificationModel; init trials = 10; opt trials = 90
model = GPClassificationModel; init trials = 20; opt trials = 80
model = FancyNewModelToBenchmark; init trials = 10; opt trials = 40
model = FancyNewModelToBenchmark; init trials = 20; opt trials = 30
Note that you can also pass problem names to func by including ("problem", "name") in args.
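To make the resolution step concrete, here is a small self-contained sketch of how a derived value might be computed once the other config values are fixed for a run. DerivedValueSketch and resolve are hypothetical stand-ins written for this example; they are not the AEPsych classes and only mirror the documented idea that args names (section, key) pairs whose values are passed to func.

```python
from typing import Any, Callable, Dict, List, Tuple


class DerivedValueSketch:
    """Hypothetical stand-in, not the AEPsych class: stores key paths and a function."""

    def __init__(self, args: List[Tuple[str, str]], func: Callable[..., Any]) -> None:
        self.args = args
        self.func = func


def resolve(config: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Replace each DerivedValueSketch leaf with func applied to the referenced values."""
    resolved = {section: dict(entries) for section, entries in config.items()}
    for section, entries in config.items():
        for key, value in entries.items():
            if isinstance(value, DerivedValueSketch):
                # Look up each (section, key) pair in the already-fixed run config.
                inputs = [config[s][k] for s, k in value.args]
                resolved[section][key] = value.func(*inputs)
    return resolved


opt_trials = DerivedValueSketch(
    [("init_strat", "min_asks"), ("common", "model")],
    lambda x, y: 100 - x if y == "GPClassificationModel" else 50 - x,
)
one_run = {
    "common": {"model": "FancyNewModelToBenchmark"},
    "init_strat": {"min_asks": 20},
    "opt_strat": {"min_asks": opt_trials},
}
print(resolve(one_run)["opt_strat"]["min_asks"])  # 50 - 20 = 30
```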
aepsych.benchmark.pathos_benchmark module¶
- class aepsych.benchmark.pathos_benchmark.PathosBenchmark(nproc=1, *args, **kwargs)[source]¶
Bases:
Benchmark
Benchmarking class for parallelized benchmarks using pathos.
Initialize pathos benchmark.
- Parameters
nproc (int, optional) – Number of cores to use. Defaults to 1.
- run_experiment(problem, config_dict, seed, rep)[source]¶
Run one simulated experiment.
- Parameters
config_dict (Dict[str, Any]) – AEPsych configuration to use.
seed (int) – Random seed for this run.
rep (int) – Index of this repetition.
problem (Problem) – Problem to run the simulated experiment on.
- Returns
- A tuple containing a log of the results and the strategy as
of the end of the simulated experiment. This is ignored in large-scale benchmarks but useful for one-off visualization.
- Return type
Tuple[List[Dict[str, Any]], SequentialStrategy]
- run_benchmarks()[source]¶
Run all the benchmarks.
Note that this blocks while waiting for benchmarks to complete. If you would like to start benchmarks and periodically collect partial results, use start_benchmarks and then call collate_benchmarks(wait=False) on some interval.
- start_benchmarks()[source]¶
Start benchmark run.
This does not block: after running it, self.futures holds the status of benchmarks running in parallel.
- property is_done: bool¶
Check if the benchmark is done.
- Returns
True if all futures are cleared and benchmark is done.
- Return type
bool
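The start-then-poll pattern described for start_benchmarks and collate_benchmarks(wait=False) can be illustrated with the standard library. The sketch below uses concurrent.futures as a stand-in for pathos, and simulated_run is a placeholder for one benchmark run; none of these names come from AEPsych.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List


def simulated_run(rep: int) -> dict:
    """Placeholder for one benchmark run."""
    time.sleep(0.01)
    return {"rep": rep}


executor = ThreadPoolExecutor(max_workers=2)
# Non-blocking start: submit returns immediately with one future per run.
futures: List[Future] = [executor.submit(simulated_run, rep) for rep in range(4)]

results = []
while futures:  # analogous to looping until the benchmark reports it is done
    pending = []
    for future in futures:
        # Check each future exactly once per pass so no result is dropped.
        if future.done():
            results.append(future.result())
        else:
            pending.append(future)
    futures = pending
    time.sleep(0.005)  # polling interval
executor.shutdown()
print(sorted(r["rep"] for r in results))
```

Partial results are available in results after every pass, which is the point of polling instead of blocking until all runs finish.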
- aepsych.benchmark.pathos_benchmark.run_benchmarks_with_checkpoints(out_path, benchmark_name, problems, configs, global_seed=None, n_chunks=1, n_reps_per_chunk=1, log_every=None, checkpoint_every=60, n_proc=1, serial_debug=False)[source]¶
Runs a series of benchmarks, saving both final and intermediate results to .csv files. Benchmarks are run in sequential chunks, each of which runs all combinations of problems/configs/reps in parallel. This function should always be called under the "if __name__ == '__main__': ..." idiom.
- Parameters
out_path (str) – The path to save the results to.
benchmark_name (str) – A name given to this set of benchmarks. Results will be saved in files named like "out_path/benchmark_name_chunk{chunk_number}_out.csv".
problems (List[Problem]) – Problem objects containing the test function to evaluate.
configs (Mapping[str, Union[str, list]]) – Dictionary of configs to run. Lists at leaves are used to construct a cartesian product of configurations.
global_seed (int, optional) – Global seed to use for reproducible benchmarks. Defaults to randomized seeds.
n_chunks (int) – The number of chunks to break the results into. Each chunk will contain at least 1 run of every combination of problem and config.
n_reps_per_chunk (int, optional) – Number of repetitions to run each problem/config in each chunk.
log_every (int, optional) – Logging interval during an experiment. Defaults to only logging at the end.
checkpoint_every (int) – Save intermediate results every checkpoint_every seconds.
n_proc (int) – Number of processors to use.
serial_debug (bool) – Whether to run the benchmarks serially for debugging.
- Return type
None
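The sketch below shows the shape of a script that follows the main-guard idiom and the documented output naming pattern. It does not call AEPsych: chunk_output_paths and main are hypothetical helpers, and numbering chunks from 0 is an assumption, since the source does not state the starting index.

```python
import posixpath
from typing import List


def chunk_output_paths(out_path: str, benchmark_name: str, n_chunks: int) -> List[str]:
    """List one results file per chunk, following the documented naming pattern."""
    # Assumption: chunks are numbered from 0; the source does not state the start index.
    return [
        posixpath.join(out_path, f"{benchmark_name}_chunk{chunk}_out.csv")
        for chunk in range(n_chunks)
    ]


def main() -> None:
    # In a real script, the call to run_benchmarks_with_checkpoints would go here.
    for path in chunk_output_paths("results", "demo", n_chunks=2):
        print(path)


# The guard keeps worker processes that re-import this module from re-running main().
if __name__ == "__main__":
    main()
```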
aepsych.benchmark.problem module¶
- class aepsych.benchmark.problem.Problem[source]¶
Bases:
object
Wrapper for a problem or test function. Subclass from this and override f() to define your test function.
- n_eval_points = 1000¶
- property eval_grid¶
- property name: str¶
- property lb¶
- property ub¶
- property bounds¶
- property metadata: Dict[str, Any]¶
A dictionary of metadata passed to the Benchmark to be logged. Each key will become a column in the Benchmark’s output dataframe, with its associated value stored in each row.
- p(x)[source]¶
Evaluate response probability from test function.
- Parameters
x (np.ndarray) – Points at which to evaluate.
- Returns
Response probability at queried points.
- Return type
np.ndarray
- sample_y(x)[source]¶
Sample a response from test function.
- Parameters
x (np.ndarray) – Points at which to sample.
- Returns
A single (bernoulli) sample at points.
- Return type
np.ndarray
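The relationship between f, p, and sample_y can be sketched without AEPsych as follows. The latent function here is made up for illustration, and the probit link (standard normal CDF) is an assumption; a real Problem subclass would override f() and inherit p and sample_y.

```python
import math
import random
from typing import List


def f(x: float) -> float:
    """Hypothetical latent test function; a real Problem subclass would override f()."""
    return 2.0 * x - 1.0


def p(x: float) -> float:
    """Response probability via a probit link (standard normal CDF), an assumption here."""
    return 0.5 * (1.0 + math.erf(f(x) / math.sqrt(2.0)))


def sample_y(xs: List[float], rng: random.Random) -> List[int]:
    """Draw one Bernoulli response per point with success probability p(x)."""
    return [1 if rng.random() < p(x) else 0 for x in xs]


rng = random.Random(0)
print(p(0.5))  # f(0.5) = 0, so the probability is 0.5
print(sample_y([0.0, 0.5, 1.0], rng))
```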
- f_hat(model)[source]¶
Generate mean predictions from the model over the evaluation grid.
- Parameters
model (aepsych.models.base.ModelProtocol) – Model to evaluate.
- Returns
Posterior mean from underlying model over the evaluation grid.
- Return type
torch.Tensor
- property f_true: ndarray¶
Evaluate true test function over evaluation grid.
- Returns
Values of true test function over evaluation grid.
- Return type
torch.Tensor
- property p_true: Tensor¶
Evaluate true response probability over evaluation grid.
- Returns
Values of true response probability over evaluation grid.
- Return type
torch.Tensor
- p_hat(model)[source]¶
Generate mean predictions from the model over the evaluation grid.
- Parameters
model (aepsych.models.base.ModelProtocol) – Model to evaluate.
- Returns
Posterior mean from underlying model over the evaluation grid.
- Return type
torch.Tensor
- evaluate(strat)[source]¶
Evaluate the strategy with respect to this problem.
Extend this in subclasses to add additional metrics. Metrics include:
- mae (mean absolute error), mse (mean squared error), max_abs_err (max absolute error), and pearson correlation. All of these are computed over the latent variable f and the outcome probability p, w.r.t. the posterior mean. Absolute and squared errors (miae, mise) are also computed in expectation over the posterior, by sampling.
- Brier score, which measures how well-calibrated the outcome probability is, both at the posterior mean (plain brier) and in expectation over the posterior (expected_brier).
- Parameters
strat (aepsych.strategy.Strategy) – Strategy to evaluate.
- Returns
A dictionary containing metrics and their values.
- Return type
Dict[str, float]
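For readers unfamiliar with these metrics, the following sketch computes the point-estimate errors and a Brier score over a small grid. It uses textbook definitions and hypothetical helper names; the exact formulas and dictionary keys used by AEPsych may differ, and the numbers are illustrative only.

```python
from typing import Dict, List


def point_metrics(truth: List[float], estimate: List[float]) -> Dict[str, float]:
    """Mean absolute, mean squared, and max absolute error between two grids."""
    errors = [e - t for t, e in zip(truth, estimate)]
    n = len(errors)
    return {
        "mae": sum(abs(err) for err in errors) / n,
        "mse": sum(err * err for err in errors) / n,
        "max_abs_err": max(abs(err) for err in errors),
    }


def brier(p_hat: List[float], outcomes: List[int]) -> float:
    """Brier score: mean squared gap between predicted probability and 0/1 outcome."""
    return sum((ph - y) ** 2 for ph, y in zip(p_hat, outcomes)) / len(outcomes)


p_true = [0.2, 0.5, 0.8]  # illustrative true probabilities on a 3-point grid
p_est = [0.3, 0.5, 0.6]   # illustrative posterior-mean estimates
print(point_metrics(p_true, p_est))
print(brier(p_est, [0, 1, 1]))
```

A subclass that extends evaluate would typically compute the parent metrics first and then add its own entries to the same dictionary.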
- class aepsych.benchmark.problem.LSEProblem[source]¶
Bases:
Problem
Level set estimation problem.
This extends the base problem class to evaluate the LSE/threshold estimate in addition to the function estimate.
- threshold = 0.75¶
- property metadata: Dict[str, Any]¶
A dictionary of metadata passed to the Benchmark to be logged. Each key will become a column in the Benchmark’s output dataframe, with its associated value stored in each row.
- property true_below_threshold: ndarray¶
Evaluate whether the true function is below threshold over the eval grid (used for proper scoring and threshold misclassification metric).
- evaluate(strat)[source]¶
Evaluate the model with respect to this problem.
For level set estimation, we add metrics w.r.t. the true threshold:
- brier_p_below_{thresh}, the brier score w.r.t. p(f(x)<thresh), in contrast to regular brier, which is the brier score for p(phi(f(x))=1), and the same for misclassification error.
- Parameters
strat (aepsych.strategy.Strategy) – Strategy to evaluate.
- Returns
A dictionary containing metrics and their values, including parent class metrics.
- Return type
Dict[str, float]
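The threshold-specific metrics can be sketched as below. This is an illustration under stated assumptions, not the AEPsych code: the helper names are hypothetical, the comparison is strict, the link is assumed to be probit when converting the threshold to latent space, and the grid values are made up.

```python
from statistics import NormalDist
from typing import List

THRESHOLD = 0.75  # matches the documented class attribute

# Assumption: with a probit link, the same threshold in latent (f) space is the
# inverse standard normal CDF of the probability threshold.
F_THRESHOLD = NormalDist().inv_cdf(THRESHOLD)


def below_threshold(p_values: List[float], threshold: float = THRESHOLD) -> List[bool]:
    """Mark grid points whose response probability is below the threshold."""
    return [p < threshold for p in p_values]


def misclassification_rate(p_true: List[float], p_hat: List[float]) -> float:
    """Fraction of grid points placed on the wrong side of the threshold."""
    truth = below_threshold(p_true)
    guess = below_threshold(p_hat)
    return sum(t != g for t, g in zip(truth, guess)) / len(truth)


def brier_p_below(prob_below: List[float], truth_below: List[bool]) -> float:
    """Brier score of the model's probability that f(x) is below threshold."""
    return sum((q - float(t)) ** 2 for q, t in zip(prob_below, truth_below)) / len(truth_below)


p_true = [0.6, 0.7, 0.8, 0.9]
p_hat = [0.65, 0.8, 0.85, 0.7]
print(misclassification_rate(p_true, p_hat))
print(brier_p_below([0.9, 0.6, 0.2, 0.4], below_threshold(p_true)))
```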
aepsych.benchmark.test_functions module¶
- aepsych.benchmark.test_functions.make_songetal_threshfun(x, y)[source]¶
Generate a synthetic threshold function by interpolation of real data.
Real data is from Dubno et al. 2013, and procedure follows Song et al. 2017, 2018. See make_songetal_testfun for more detail.
- Parameters
x (np.ndarray) – Frequency
y (np.ndarray) – Threshold
- Returns
- Function that interpolates the given
frequencies and thresholds and returns threshold as a function of frequency.
- Return type
Callable[[float], float]
- aepsych.benchmark.test_functions.make_songetal_testfun(phenotype='Metabolic', beta=1)[source]¶
Make an audiometric test function following Song et al. 2017.
To do so, we first compute a threshold by interpolation/extrapolation from real data, then assume a linear psychometric function in intensity with slope beta.
- Parameters
phenotype (str, optional) – Audiometric phenotype from Dubno et al. 2013. Specifically, one of “Metabolic”, “Sensory”, “Metabolic+Sensory”, or “Older-normal”. Defaults to “Metabolic”.
beta (float, optional) – Psychometric function slope. Defaults to 1.
- Returns
A test function taking a [b x 2] array of points and returning the psychometric function value at those points.
- Return type
Callable[[np.ndarray, bool], np.ndarray]
- Raises
AssertionError – if an invalid phenotype is passed.
References
- Song, X. D., Garnett, R., & Barbour, D. L. (2017).
Psychometric function estimation by probabilistic classification. The Journal of the Acoustical Society of America, 141(4), 2513–2525. https://doi.org/10.1121/1.4979594
- aepsych.benchmark.test_functions.novel_discrimination_testfun(x)[source]¶
Evaluate novel discrimination test function from Owen et al.
The threshold is roughly parabolic with context, and the slope varies with the threshold. Adding to the difficulty is the fact that the function is minimized at f=0 (or p=0.5), corresponding to discrimination being at chance at zero stimulus intensity.
- Parameters
x (np.ndarray) – Points at which to evaluate.
- Returns
Value of function at these points.
- Return type
np.ndarray
- aepsych.benchmark.test_functions.novel_detection_testfun(x)[source]¶
Evaluate novel detection test function from Owen et al.
The threshold is roughly parabolic with context, and the slope varies with the threshold.
- Parameters
x (np.ndarray) – Points at which to evaluate.
- Returns
Value of function at these points.
- Return type
np.ndarray
Module contents¶
- class aepsych.benchmark.Benchmark(problems, configs, seed=None, n_reps=1, log_every=10)[source]¶
Bases:
object
Benchmark base class.
This class wraps standard functionality for benchmarking models including generating cartesian products of run configurations, running the simulated experiment loop, and logging results.
TODO make a benchmarking tutorial and link/refer to it here.
Initialize benchmark.
- Parameters
problems (List[Problem]) – Problem objects containing the test function to evaluate.
configs (Mapping[str, Union[str, list]]) – Dictionary of configs to run. Lists at leaves are used to construct a cartesian product of configurations.
seed (int, optional) – Random seed to use for reproducible benchmarks. Defaults to randomized seeds.
n_reps (int, optional) – Number of repetitions to run of each configuration. Defaults to 1.
log_every (int, optional) – Logging interval during an experiment. Defaults to logging every 10 trials.
- make_benchmark_list(**bench_config)[source]¶
Generate a list of benchmarks to run from configuration.
This constructs a cartesian product of config dicts using lists at the leaves of the base config.
- Returns
- List of dictionaries, each of which can be passed
to aepsych.config.Config.
- Return type
List[dict[str, float]]
- property num_benchmarks: int¶
Return the total number of runs in this benchmark.
- Returns
Total number of runs in this benchmark.
- Return type
int
- make_strat_and_flatconfig(config_dict)[source]¶
From a config dict, generate a strategy (for running) and a flattened config (for logging).
- Parameters
config_dict (Mapping[str, str]) – A run configuration dictionary.
- Returns
- A tuple containing a strategy
object and a flat config.
- Return type
Tuple[SequentialStrategy, Dict[str,str]]
- run_experiment(problem, config_dict, seed, rep)[source]¶
Run one simulated experiment.
- Parameters
config_dict (Dict[str, str]) – AEPsych configuration to use.
seed (int) – Random seed for this run.
rep (int) – Index of this repetition.
problem (Problem) – Problem to run the simulated experiment on.
- Returns
- A tuple containing a log of the results and the strategy as
of the end of the simulated experiment. This is ignored in large-scale benchmarks but useful for one-off visualization.
- Return type
Tuple[List[Dict[str, object]], SequentialStrategy]
- flatten_config(config)[source]¶
Flatten a config object for logging.
- Parameters
config (Config) – AEPsych config object.
- Returns
A flat dictionary (that can be used to build a flat pandas data frame).
- Return type
Dict[str,str]
- class aepsych.benchmark.DerivedValue(args, func)[source]¶
Bases:
object
A class for dynamically generating config values from other config values during benchmarking.
Initialize DerivedValue.
- Parameters
args (List[Tuple[str]]) – Each tuple in this list is a pair of strings that refer to keys in a nested dictionary.
func (Callable) – A function that accepts args as input.
For example, consider the following:
    benchmark_config = {
        "common": {
            "model": ["GPClassificationModel", "FancyNewModelToBenchmark"],
            "acqf": "MCLevelSetEstimation",
        },
        "init_strat": {
            "min_asks": [10, 20],
            "generator": "SobolGenerator",
        },
        "opt_strat": {
            "generator": "OptimizeAcqfGenerator",
            "min_asks": DerivedValue(
                [("init_strat", "min_asks"), ("common", "model")],
                lambda x, y: 100 - x if y == "GPClassificationModel" else 50 - x,
            ),
        },
    }
Four separate benchmarks would be generated from benchmark_config:
model = GPClassificationModel; init trials = 10; opt trials = 90
model = GPClassificationModel; init trials = 20; opt trials = 80
model = FancyNewModelToBenchmark; init trials = 10; opt trials = 40
model = FancyNewModelToBenchmark; init trials = 20; opt trials = 30
Note that you can also pass problem names to func by including ("problem", "name") in args.
- class aepsych.benchmark.PathosBenchmark(nproc=1, *args, **kwargs)[source]¶
Bases:
Benchmark
Benchmarking class for parallelized benchmarks using pathos.
Initialize pathos benchmark.
- Parameters
nproc (int, optional) – Number of cores to use. Defaults to 1.
- run_experiment(problem, config_dict, seed, rep)[source]¶
Run one simulated experiment.
- Parameters
config_dict (Dict[str, Any]) – AEPsych configuration to use.
seed (int) – Random seed for this run.
rep (int) – Index of this repetition.
problem (Problem) – Problem to run the simulated experiment on.
- Returns
- A tuple containing a log of the results and the strategy as
of the end of the simulated experiment. This is ignored in large-scale benchmarks but useful for one-off visualization.
- Return type
Tuple[List[Dict[str, Any]], SequentialStrategy]
- run_benchmarks()[source]¶
Run all the benchmarks.
Note that this blocks while waiting for benchmarks to complete. If you would like to start benchmarks and periodically collect partial results, use start_benchmarks and then call collate_benchmarks(wait=False) on some interval.
- start_benchmarks()[source]¶
Start benchmark run.
This does not block: after running it, self.futures holds the status of benchmarks running in parallel.
- property is_done: bool¶
Check if the benchmark is done.
- Returns
True if all futures are cleared and benchmark is done.
- Return type
bool
- class aepsych.benchmark.Problem[source]¶
Bases:
object
Wrapper for a problem or test function. Subclass from this and override f() to define your test function.
- n_eval_points = 1000¶
- property eval_grid¶
- property name: str¶
- property lb¶
- property ub¶
- property bounds¶
- property metadata: Dict[str, Any]¶
A dictionary of metadata passed to the Benchmark to be logged. Each key will become a column in the Benchmark’s output dataframe, with its associated value stored in each row.
- p(x)[source]¶
Evaluate response probability from test function.
- Parameters
x (np.ndarray) – Points at which to evaluate.
- Returns
Response probability at queried points.
- Return type
np.ndarray
- sample_y(x)[source]¶
Sample a response from test function.
- Parameters
x (np.ndarray) – Points at which to sample.
- Returns
A single (bernoulli) sample at points.
- Return type
np.ndarray
- f_hat(model)[source]¶
Generate mean predictions from the model over the evaluation grid.
- Parameters
model (aepsych.models.base.ModelProtocol) – Model to evaluate.
- Returns
Posterior mean from underlying model over the evaluation grid.
- Return type
torch.Tensor
- property f_true: ndarray¶
Evaluate true test function over evaluation grid.
- Returns
Values of true test function over evaluation grid.
- Return type
torch.Tensor
- property p_true: Tensor¶
Evaluate true response probability over evaluation grid.
- Returns
Values of true response probability over evaluation grid.
- Return type
torch.Tensor
- p_hat(model)[source]¶
Generate mean predictions from the model over the evaluation grid.
- Parameters
model (aepsych.models.base.ModelProtocol) – Model to evaluate.
- Returns
Posterior mean from underlying model over the evaluation grid.
- Return type
torch.Tensor
- evaluate(strat)[source]¶
Evaluate the strategy with respect to this problem.
Extend this in subclasses to add additional metrics. Metrics include:
- mae (mean absolute error), mse (mean squared error), max_abs_err (max absolute error), and pearson correlation. All of these are computed over the latent variable f and the outcome probability p, w.r.t. the posterior mean. Absolute and squared errors (miae, mise) are also computed in expectation over the posterior, by sampling.
- Brier score, which measures how well-calibrated the outcome probability is, both at the posterior mean (plain brier) and in expectation over the posterior (expected_brier).
- Parameters
strat (aepsych.strategy.Strategy) – Strategy to evaluate.
- Returns
A dictionary containing metrics and their values.
- Return type
Dict[str, float]
- class aepsych.benchmark.LSEProblem[source]¶
Bases:
Problem
Level set estimation problem.
This extends the base problem class to evaluate the LSE/threshold estimate in addition to the function estimate.
- threshold = 0.75¶
- property metadata: Dict[str, Any]¶
A dictionary of metadata passed to the Benchmark to be logged. Each key will become a column in the Benchmark’s output dataframe, with its associated value stored in each row.
- property true_below_threshold: ndarray¶
Evaluate whether the true function is below threshold over the eval grid (used for proper scoring and threshold misclassification metric).
- evaluate(strat)[source]¶
Evaluate the model with respect to this problem.
For level set estimation, we add metrics w.r.t. the true threshold:
- brier_p_below_{thresh}, the brier score w.r.t. p(f(x)<thresh), in contrast to regular brier, which is the brier score for p(phi(f(x))=1), and the same for misclassification error.
- Parameters
strat (aepsych.strategy.Strategy) – Strategy to evaluate.
- Returns
A dictionary containing metrics and their values, including parent class metrics.
- Return type
Dict[str, float]
- aepsych.benchmark.make_songetal_testfun(phenotype='Metabolic', beta=1)[source]¶
Make an audiometric test function following Song et al. 2017.
To do so, we first compute a threshold by interpolation/extrapolation from real data, then assume a linear psychometric function in intensity with slope beta.
- Parameters
phenotype (str, optional) – Audiometric phenotype from Dubno et al. 2013. Specifically, one of “Metabolic”, “Sensory”, “Metabolic+Sensory”, or “Older-normal”. Defaults to “Metabolic”.
beta (float, optional) – Psychometric function slope. Defaults to 1.
- Returns
A test function taking a [b x 2] array of points and returning the psychometric function value at those points.
- Return type
Callable[[np.ndarray, bool], np.ndarray]
- Raises
AssertionError – if an invalid phenotype is passed.
References
- Song, X. D., Garnett, R., & Barbour, D. L. (2017).
Psychometric function estimation by probabilistic classification. The Journal of the Acoustical Society of America, 141(4), 2513–2525. https://doi.org/10.1121/1.4979594
- aepsych.benchmark.novel_detection_testfun(x)[source]¶
Evaluate novel detection test function from Owen et al.
The threshold is roughly parabolic with context, and the slope varies with the threshold.
- Parameters
x (np.ndarray) – Points at which to evaluate.
- Returns
Value of function at these points.
- Return type
np.ndarray
- aepsych.benchmark.novel_discrimination_testfun(x)[source]¶
Evaluate novel discrimination test function from Owen et al.
The threshold is roughly parabolic with context, and the slope varies with the threshold. Adding to the difficulty is the fact that the function is minimized at f=0 (or p=0.5), corresponding to discrimination being at chance at zero stimulus intensity.
- Parameters
x (np.ndarray) – Points at which to evaluate.
- Returns
Value of function at these points.
- Return type
np.ndarray
- aepsych.benchmark.run_benchmarks_with_checkpoints(out_path, benchmark_name, problems, configs, global_seed=None, n_chunks=1, n_reps_per_chunk=1, log_every=None, checkpoint_every=60, n_proc=1, serial_debug=False)[source]¶
Runs a series of benchmarks, saving both final and intermediate results to .csv files. Benchmarks are run in sequential chunks, each of which runs all combinations of problems/configs/reps in parallel. This function should always be called under the "if __name__ == '__main__': ..." idiom.
- Parameters
out_path (str) – The path to save the results to.
benchmark_name (str) – A name given to this set of benchmarks. Results will be saved in files named like "out_path/benchmark_name_chunk{chunk_number}_out.csv".
problems (List[Problem]) – Problem objects containing the test function to evaluate.
configs (Mapping[str, Union[str, list]]) – Dictionary of configs to run. Lists at leaves are used to construct a cartesian product of configurations.
global_seed (int, optional) – Global seed to use for reproducible benchmarks. Defaults to randomized seeds.
n_chunks (int) – The number of chunks to break the results into. Each chunk will contain at least 1 run of every combination of problem and config.
n_reps_per_chunk (int, optional) – Number of repetitions to run each problem/config in each chunk.
log_every (int, optional) – Logging interval during an experiment. Defaults to only logging at the end.
checkpoint_every (int) – Save intermediate results every checkpoint_every seconds.
n_proc (int) – Number of processors to use.
serial_debug (bool) – Whether to run the benchmarks serially for debugging.
- Return type
None