aepsych.benchmark¶
Submodules¶
aepsych.benchmark.benchmark module¶
- class aepsych.benchmark.benchmark.Benchmark(problems, configs, seed=None, n_reps=1, log_every=10)[source]¶
Bases:
object
Benchmark base class.
This class wraps standard functionality for benchmarking models including generating cartesian products of run configurations, running the simulated experiment loop, and logging results.
TODO make a benchmarking tutorial and link/refer to it here.
Initialize benchmark.
- Parameters:
problems (List[Problem]) – Problem objects containing the test function to evaluate.
configs (Mapping[str, Union[str, list]]) – Dictionary of configs to run. Lists at leaves are used to construct a cartesian product of configurations.
seed (int, optional) – Random seed to use for reproducible benchmarks. Defaults to randomized seeds.
n_reps (int, optional) – Number of repetitions to run of each configuration. Defaults to 1.
log_every (int, optional) – Logging interval during an experiment. Defaults to logging every 10 trials.
- make_benchmark_list(**bench_config)[source]¶
Generate a list of benchmarks to run from configuration.
This constructs a cartesian product of config dicts using lists at the leaves of the base config.
- Returns:
List of dictionaries, each of which can be passed to aepsych.config.Config.
- Return type:
List[dict[str, float]]
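The expansion described above can be illustrated with a minimal, standard-library-only sketch. It assumes a two-level config (section, then key) and treats any list at a leaf as a set of alternatives. The helper name expand_config is hypothetical and is not part of AEPsych; the library's own implementation may differ in ordering and in how it handles deeper nesting.

```python
from itertools import product


def expand_config(base):
    """Expand a two-level config dict into one dict per combination of list-valued leaves."""
    # Collect every (section, key) leaf; wrap scalars so they act as one-element lists.
    leaves = [
        ((section, key), value if isinstance(value, list) else [value])
        for section, options in base.items()
        for key, value in options.items()
    ]
    paths = [path for path, _ in leaves]
    choices = [values for _, values in leaves]

    expanded = []
    for combo in product(*choices):
        # Rebuild a nested dict holding one concrete value per leaf.
        config = {section: {} for section in base}
        for (section, key), value in zip(paths, combo):
            config[section][key] = value
        expanded.append(config)
    return expanded


base = {
    "common": {"model": ["GPClassificationModel", "FancyNewModelToBenchmark"]},
    "init_strat": {"min_asks": [10, 20], "generator": "SobolGenerator"},
}
runs = expand_config(base)
print(len(runs))  # 2 models x 2 min_asks values = 4 configurations
```

Scalar leaves such as "generator" are carried unchanged into every expanded configuration.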
- property num_benchmarks: int¶
Return the total number of runs in this benchmark.
- Returns:
Total number of runs in this benchmark.
- Return type:
int
- make_strat_and_flatconfig(config_dict)[source]¶
From a config dict, generate a strategy (for running) and a flattened config (for logging).
- Parameters:
config_dict (Mapping[str, str]) – A run configuration dictionary.
- Returns:
A tuple containing a strategy object and a flat config.
- Return type:
Tuple[SequentialStrategy, Dict[str,str]]
- run_experiment(problem, config_dict, seed, rep)[source]¶
Run one simulated experiment.
- Parameters:
config_dict (Dict[str, str]) – AEPsych configuration to use.
seed (int) – Random seed for this run.
rep (int) – Index of this repetition.
problem (Problem) –
- Returns:
A tuple containing a log of the results and the strategy as of the end of the simulated experiment. This is ignored in large-scale benchmarks but useful for one-off visualization.
- Return type:
Tuple[List[Dict[str, object]], SequentialStrategy]
- flatten_config(config)[source]¶
Flatten a config object for logging.
- Parameters:
config (Config) – AEPsych config object.
- Returns:
A flat dictionary (that can be used to build a flat pandas data frame).
- Return type:
Dict[str,str]
- class aepsych.benchmark.benchmark.DerivedValue(args, func)[source]¶
Bases:
object
A class for dynamically generating config values from other config values during benchmarking.
Initialize DerivedValue.
- Parameters:
args (List[Tuple[str]]) – Each tuple in this list is a pair of strings that refer to keys in a nested dictionary.
func (Callable) – A function that accepts args as input.
For example, consider the following:
benchmark_config = {
    "common": {
        "model": ["GPClassificationModel", "FancyNewModelToBenchmark"],
        "acqf": "MCLevelSetEstimation",
    },
    "init_strat": {
        "min_asks": [10, 20],
        "generator": "SobolGenerator",
    },
    "opt_strat": {
        "generator": "OptimizeAcqfGenerator",
        "min_asks": DerivedValue(
            [("init_strat", "min_asks"), ("common", "model")],
            lambda x, y: 100 - x if y == "GPClassificationModel" else 50 - x,
        ),
    },
}
- Four separate benchmarks would be generated from benchmark_config:
model = GPClassificationModel; init trials = 10; opt trials = 90
model = GPClassificationModel; init trials = 20; opt trials = 80
model = FancyNewModelToBenchmark; init trials = 10; opt trials = 40
model = FancyNewModelToBenchmark; init trials = 20; opt trials = 30
Note that you can also pass problem names into func by including ("problem", "name") in args.
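The four results listed above can be reproduced by hand with a short sketch that does not use AEPsych. It only applies the example's function to each combination of the two list-valued settings; the variable names are illustrative.

```python
from itertools import product

models = ["GPClassificationModel", "FancyNewModelToBenchmark"]
init_trials = [10, 20]


# Same logic as the lambda in the example: arguments arrive in the order of `args`,
# so x is init_strat/min_asks and y is common/model.
def opt_trials(x, y):
    return 100 - x if y == "GPClassificationModel" else 50 - x


rows = [
    (model, n_init, opt_trials(n_init, model))
    for model, n_init in product(models, init_trials)
]
for model, n_init, n_opt in rows:
    print(f"model = {model}; init trials = {n_init}; opt trials = {n_opt}")
```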
aepsych.benchmark.pathos_benchmark module¶
- class aepsych.benchmark.pathos_benchmark.PathosBenchmark(nproc=1, *args, **kwargs)[source]¶
Bases:
Benchmark
Benchmarking class for parallelized benchmarks using pathos.
Initialize pathos benchmark.
- Parameters:
nproc (int, optional) – Number of cores to use. Defaults to 1.
- run_experiment(problem, config_dict, seed, rep)[source]¶
Run one simulated experiment.
- Parameters:
config_dict (Dict[str, Any]) – AEPsych configuration to use.
seed (int) – Random seed for this run.
rep (int) – Index of this repetition.
problem (Problem) –
- Returns:
A tuple containing a log of the results and the strategy as of the end of the simulated experiment. This is ignored in large-scale benchmarks but useful for one-off visualization.
- Return type:
Tuple[List[Dict[str, Any]], SequentialStrategy]
- run_benchmarks()[source]¶
Run all the benchmarks.
Note that this blocks while waiting for benchmarks to complete. If you would like to start benchmarks and periodically collect partial results, use start_benchmarks and then call collate_benchmarks(wait=False) on some interval.
- start_benchmarks()[source]¶
Start benchmark run.
This does not block: after running it, self.futures holds the status of benchmarks running in parallel.
- property is_done: bool¶
Check if the benchmark is done.
- Returns:
True if all futures are cleared and benchmark is done.
- Return type:
bool
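The start-then-poll pattern described for start_benchmarks and is_done can be sketched with the standard library. This is only an analogy: it uses concurrent.futures threads as a stand-in for pathos, and fake_experiment is a hypothetical placeholder rather than an AEPsych function.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_experiment(rep):
    """Hypothetical stand-in for one simulated experiment."""
    time.sleep(0.01 * (rep + 1))
    return {"rep": rep, "done": True}


results = []
with ThreadPoolExecutor(max_workers=2) as pool:
    # Non-blocking start: submit() returns immediately with one future per run.
    futures = [pool.submit(fake_experiment, rep) for rep in range(4)]

    # Poll on an interval, collecting whatever has finished so far.
    while futures:
        finished = [f for f in futures if f.done()]
        results.extend(f.result() for f in finished)
        futures = [f for f in futures if f not in finished]
        time.sleep(0.005)

print(sorted(r["rep"] for r in results))
```

In this sketch, "done" corresponds to the list of pending futures becoming empty, which mirrors the description of is_done above.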
- aepsych.benchmark.pathos_benchmark.run_benchmarks_with_checkpoints(out_path, benchmark_name, problems, configs, global_seed=None, n_chunks=1, n_reps_per_chunk=1, log_every=None, checkpoint_every=60, n_proc=1, serial_debug=False)[source]¶
Runs a series of benchmarks, saving both final and intermediate results to .csv files. Benchmarks are run in sequential chunks, each of which runs all combinations of problems/configs/reps in parallel. Always call this function under the if __name__ == "__main__": idiom.
- Parameters:
out_path (str) – The path to save the results to.
benchmark_name (str) – A name given to this set of benchmarks. Results will be saved in files named like “out_path/benchmark_name_chunk{chunk_number}_out.csv”.
problems (List[Problem]) – Problem objects containing the test function to evaluate.
configs (Mapping[str, Union[str, list]]) – Dictionary of configs to run. Lists at leaves are used to construct a cartesian product of configurations.
global_seed (int, optional) – Global seed to use for reproducible benchmarks. Defaults to randomized seeds.
n_chunks (int) – The number of chunks to break the results into. Each chunk will contain at least 1 run of every combination of problem and config.
n_reps_per_chunk (int, optional) – Number of repetitions to run each problem/config in each chunk.
log_every (int, optional) – Logging interval during an experiment. Defaults to only logging at the end.
checkpoint_every (int) – Save intermediate results every checkpoint_every seconds.
n_proc (int) – Number of processors to use.
serial_debug (bool) – Whether to run serially for debugging.
- Return type:
None
aepsych.benchmark.problem module¶
- class aepsych.benchmark.problem.Problem[source]¶
Bases:
object
Wrapper for a problem or test function. Subclass from this and override f() to define your test function.
- n_eval_points = 1000¶
- property eval_grid¶
- property name: str¶
- property lb¶
- property ub¶
- property bounds¶
- property metadata: Dict[str, Any]¶
A dictionary of metadata passed to the Benchmark to be logged. Each key will become a column in the Benchmark’s output dataframe, with its associated value stored in each row.
- p(x)[source]¶
Evaluate response probability from test function.
- Parameters:
x (np.ndarray) – Points at which to evaluate.
- Returns:
Response probability at queried points.
- Return type:
np.ndarray
- sample_y(x)[source]¶
Sample a response from test function.
- Parameters:
x (np.ndarray) – Points at which to sample.
- Returns:
A single (Bernoulli) sample at points.
- Return type:
np.ndarray
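The relationship between f(), p() and sample_y() can be sketched with a toy class. This is not the AEPsych Problem class: ToyProblem is hypothetical, the probit link (standard normal CDF) is an assumption suggested by the f=0 / p=0.5 correspondence mentioned elsewhere in this page, and scalar inputs are used instead of arrays to keep the sketch dependency-free.

```python
import math
import random


class ToyProblem:
    """Hypothetical stand-in, not the AEPsych Problem class."""

    def f(self, x):
        # Latent test function: a simple linear function of the stimulus.
        return 2.0 * x - 1.0

    def p(self, x):
        # Assumed probit link: standard normal CDF applied to the latent value.
        return 0.5 * (1.0 + math.erf(self.f(x) / math.sqrt(2.0)))

    def sample_y(self, x, rng):
        # One Bernoulli draw with success probability p(x).
        return 1 if rng.random() < self.p(x) else 0


problem = ToyProblem()
rng = random.Random(0)
responses = [problem.sample_y(0.5, rng) for _ in range(1000)]
print(problem.p(0.5))  # f(0.5) = 0, so the response probability is 0.5
print(sum(responses) / len(responses))
```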
- f_hat(model)[source]¶
Generate mean predictions from the model over the evaluation grid.
- Parameters:
model (aepsych.models.base.ModelProtocol) – Model to evaluate.
- Returns:
Posterior mean from underlying model over the evaluation grid.
- Return type:
torch.Tensor
- property f_true: ndarray¶
Evaluate true test function over evaluation grid.
- Returns:
Values of true test function over evaluation grid.
- Return type:
torch.Tensor
- property p_true: Tensor¶
Evaluate true response probability over evaluation grid.
- Returns:
Values of true response probability over evaluation grid.
- Return type:
torch.Tensor
- p_hat(model)[source]¶
Generate mean predictions from the model over the evaluation grid.
- Parameters:
model (aepsych.models.base.ModelProtocol) – Model to evaluate.
- Returns:
Posterior mean from underlying model over the evaluation grid.
- Return type:
torch.Tensor
- evaluate(strat)[source]¶
Evaluate the strategy with respect to this problem.
Extend this in subclasses to add additional metrics. Metrics include:
- mae (mean absolute error), mse (mean squared error), max_abs_err (max absolute error), and Pearson correlation. All of these are computed over the latent variable f and the outcome probability p, w.r.t. the posterior mean. Absolute and squared errors (miae, mise) are also computed in expectation over the posterior, by sampling.
- Brier score, which measures how well-calibrated the outcome probability is, both at the posterior mean (plain brier) and in expectation over the posterior (expected_brier).
- Parameters:
strat (aepsych.strategy.Strategy) – Strategy to evaluate.
- Returns:
A dictionary containing metrics and their values.
- Return type:
Dict[str, float]
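For orientation, the point-estimate metrics named above can be written out in a few lines. This sketch uses the textbook definitions of mean absolute error, mean squared error, maximum absolute error, and the Brier score against binary outcomes. The function names are illustrative, and the library's exact normalisation and key names may differ.

```python
def point_metrics(truth, estimate):
    """Error metrics between true values and a point estimate on the same grid."""
    errors = [e - t for t, e in zip(truth, estimate)]
    n = len(errors)
    return {
        "mae": sum(abs(err) for err in errors) / n,
        "mse": sum(err * err for err in errors) / n,
        "max_abs_err": max(abs(err) for err in errors),
    }


def brier_score(predicted_prob, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(predicted_prob, outcomes)) / len(outcomes)


p_true = [0.5, 0.75, 1.0]
p_hat = [0.5, 0.5, 0.75]
metrics = point_metrics(p_true, p_hat)
print(metrics)
print(brier_score([0.75, 0.25], [1, 0]))
```

Lower values are better for all four quantities; a Brier score of 0 means every predicted probability matched its outcome exactly.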
- class aepsych.benchmark.problem.LSEProblem[source]¶
Bases:
Problem
Level set estimation problem.
This extends the base problem class to evaluate the LSE/threshold estimate in addition to the function estimate.
- threshold = 0.75¶
- property metadata: Dict[str, Any]¶
A dictionary of metadata passed to the Benchmark to be logged. Each key will become a column in the Benchmark’s output dataframe, with its associated value stored in each row.
- property true_below_threshold: ndarray¶
Evaluate whether the true function is below threshold over the eval grid (used for proper scoring and threshold misclassification metric).
- evaluate(strat)[source]¶
Evaluate the model with respect to this problem.
For level set estimation, we add metrics w.r.t. the true threshold:
- brier_p_below_{thresh}, the Brier score w.r.t. p(f(x)<thresh), in contrast to regular brier, which is the Brier score for p(phi(f(x))=1), and the same for misclassification error.
- Parameters:
strat (aepsych.strategy.Strategy) – Strategy to evaluate.
- Returns:
A dictionary containing metrics and their values, including parent class metrics.
- Return type:
Dict[str, float]
aepsych.benchmark.test_functions module¶
- aepsych.benchmark.test_functions.make_songetal_threshfun(x, y)[source]¶
Generate a synthetic threshold function by interpolation of real data.
Real data is from Dubno et al. 2013, and procedure follows Song et al. 2017, 2018. See make_songetal_testfun for more detail.
- Parameters:
x (np.ndarray) – Frequency
y (np.ndarray) – Threshold
- Returns:
Function that interpolates the given frequencies and thresholds and returns threshold as a function of frequency.
- Return type:
Callable[[float], float]
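The idea of returning a threshold function built from tabulated (frequency, threshold) pairs can be sketched as a closure. This sketch assumes piecewise-linear interpolation with linear extrapolation beyond the end points; the interpolation scheme actually used by the library is not stated here and may differ. The helper name and the data values are made up for illustration.

```python
from bisect import bisect_right


def make_threshold_function(freqs, thresholds):
    """Return a function mapping frequency to a linearly interpolated threshold."""
    points = sorted(zip(freqs, thresholds))
    xs = [x for x, _ in points]
    ys = [y for _, y in points]

    def threshold_at(freq):
        # Pick the segment containing freq; the end segments are reused
        # outside the data range, which gives linear extrapolation.
        i = min(max(bisect_right(xs, freq) - 1, 0), len(xs) - 2)
        slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        return ys[i] + slope * (freq - xs[i])

    return threshold_at


# Made-up illustrative values, not the Dubno et al. data.
threshfun = make_threshold_function([1.0, 2.0, 4.0], [10.0, 20.0, 30.0])
print(threshfun(3.0))
```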
- aepsych.benchmark.test_functions.make_songetal_testfun(phenotype='Metabolic', beta=1)[source]¶
Make an audiometric test function following Song et al. 2017.
To do so, we first compute a threshold by interpolation/extrapolation from real data, then assume a linear psychometric function in intensity with slope beta.
- Parameters:
phenotype (str, optional) – Audiometric phenotype from Dubno et al. 2013. Specifically, one of “Metabolic”, “Sensory”, “Metabolic+Sensory”, or “Older-normal”. Defaults to “Metabolic”.
beta (float, optional) – Psychometric function slope. Defaults to 1.
- Returns:
A test function taking a [b x 2] array of points and returning the psychometric function value at those points.
- Return type:
Callable[[np.ndarray, bool], np.ndarray]
- Raises:
AssertionError – if an invalid phenotype is passed.
References
- Song, X. D., Garnett, R., & Barbour, D. L. (2017). Psychometric function estimation by probabilistic classification. The Journal of the Acoustical Society of America, 141(4), 2513–2525. https://doi.org/10.1121/1.4979594
- aepsych.benchmark.test_functions.novel_discrimination_testfun(x)[source]¶
Evaluate novel discrimination test function from Owen et al.
The threshold is roughly parabolic with context, and the slope varies with the threshold. Adding to the difficulty is the fact that the function is minimized at f=0 (or p=0.5), corresponding to discrimination being at chance at zero stimulus intensity.
- Parameters:
x (np.ndarray) – Points at which to evaluate.
- Returns:
Value of function at these points.
- Return type:
np.ndarray
- aepsych.benchmark.test_functions.novel_detection_testfun(x)[source]¶
Evaluate novel detection test function from Owen et al.
The threshold is roughly parabolic with context, and the slope varies with the threshold.
- Parameters:
x (np.ndarray) – Points at which to evaluate.
- Returns:
Value of function at these points.
- Return type:
np.ndarray
Module contents¶
- class aepsych.benchmark.Benchmark(problems, configs, seed=None, n_reps=1, log_every=10)[source]¶
Bases:
object
Benchmark base class.
This class wraps standard functionality for benchmarking models including generating cartesian products of run configurations, running the simulated experiment loop, and logging results.
TODO make a benchmarking tutorial and link/refer to it here.
Initialize benchmark.
- Parameters:
problems (List[Problem]) – Problem objects containing the test function to evaluate.
configs (Mapping[str, Union[str, list]]) – Dictionary of configs to run. Lists at leaves are used to construct a cartesian product of configurations.
seed (int, optional) – Random seed to use for reproducible benchmarks. Defaults to randomized seeds.
n_reps (int, optional) – Number of repetitions to run of each configuration. Defaults to 1.
log_every (int, optional) – Logging interval during an experiment. Defaults to logging every 10 trials.
- make_benchmark_list(**bench_config)[source]¶
Generate a list of benchmarks to run from configuration.
This constructs a cartesian product of config dicts using lists at the leaves of the base config.
- Returns:
List of dictionaries, each of which can be passed to aepsych.config.Config.
- Return type:
List[dict[str, float]]
- property num_benchmarks: int¶
Return the total number of runs in this benchmark.
- Returns:
Total number of runs in this benchmark.
- Return type:
int
- make_strat_and_flatconfig(config_dict)[source]¶
From a config dict, generate a strategy (for running) and a flattened config (for logging).
- Parameters:
config_dict (Mapping[str, str]) – A run configuration dictionary.
- Returns:
A tuple containing a strategy object and a flat config.
- Return type:
Tuple[SequentialStrategy, Dict[str,str]]
- run_experiment(problem, config_dict, seed, rep)[source]¶
Run one simulated experiment.
- Parameters:
config_dict (Dict[str, str]) – AEPsych configuration to use.
seed (int) – Random seed for this run.
rep (int) – Index of this repetition.
problem (Problem) –
- Returns:
A tuple containing a log of the results and the strategy as of the end of the simulated experiment. This is ignored in large-scale benchmarks but useful for one-off visualization.
- Return type:
Tuple[List[Dict[str, object]], SequentialStrategy]
- flatten_config(config)[source]¶
Flatten a config object for logging.
- Parameters:
config (Config) – AEPsych config object.
- Returns:
A flat dictionary (that can be used to build a flat pandas data frame).
- Return type:
Dict[str,str]
- class aepsych.benchmark.DerivedValue(args, func)[source]¶
Bases:
object
A class for dynamically generating config values from other config values during benchmarking.
Initialize DerivedValue.
- Parameters:
args (List[Tuple[str]]) – Each tuple in this list is a pair of strings that refer to keys in a nested dictionary.
func (Callable) – A function that accepts args as input.
For example, consider the following:
benchmark_config = {
    "common": {
        "model": ["GPClassificationModel", "FancyNewModelToBenchmark"],
        "acqf": "MCLevelSetEstimation",
    },
    "init_strat": {
        "min_asks": [10, 20],
        "generator": "SobolGenerator",
    },
    "opt_strat": {
        "generator": "OptimizeAcqfGenerator",
        "min_asks": DerivedValue(
            [("init_strat", "min_asks"), ("common", "model")],
            lambda x, y: 100 - x if y == "GPClassificationModel" else 50 - x,
        ),
    },
}
- Four separate benchmarks would be generated from benchmark_config:
model = GPClassificationModel; init trials = 10; opt trials = 90
model = GPClassificationModel; init trials = 20; opt trials = 80
model = FancyNewModelToBenchmark; init trials = 10; opt trials = 40
model = FancyNewModelToBenchmark; init trials = 20; opt trials = 30
Note that you can also pass problem names into func by including ("problem", "name") in args.
- class aepsych.benchmark.PathosBenchmark(nproc=1, *args, **kwargs)[source]¶
Bases:
Benchmark
Benchmarking class for parallelized benchmarks using pathos.
Initialize pathos benchmark.
- Parameters:
nproc (int, optional) – Number of cores to use. Defaults to 1.
- run_experiment(problem, config_dict, seed, rep)[source]¶
Run one simulated experiment.
- Parameters:
config_dict (Dict[str, Any]) – AEPsych configuration to use.
seed (int) – Random seed for this run.
rep (int) – Index of this repetition.
problem (Problem) –
- Returns:
A tuple containing a log of the results and the strategy as of the end of the simulated experiment. This is ignored in large-scale benchmarks but useful for one-off visualization.
- Return type:
Tuple[List[Dict[str, Any]], SequentialStrategy]
- run_benchmarks()[source]¶
Run all the benchmarks.
Note that this blocks while waiting for benchmarks to complete. If you would like to start benchmarks and periodically collect partial results, use start_benchmarks and then call collate_benchmarks(wait=False) on some interval.
- start_benchmarks()[source]¶
Start benchmark run.
This does not block: after running it, self.futures holds the status of benchmarks running in parallel.
- property is_done: bool¶
Check if the benchmark is done.
- Returns:
True if all futures are cleared and benchmark is done.
- Return type:
bool
- class aepsych.benchmark.Problem[source]¶
Bases:
object
Wrapper for a problem or test function. Subclass from this and override f() to define your test function.
- n_eval_points = 1000¶
- property eval_grid¶
- property name: str¶
- property lb¶
- property ub¶
- property bounds¶
- property metadata: Dict[str, Any]¶
A dictionary of metadata passed to the Benchmark to be logged. Each key will become a column in the Benchmark’s output dataframe, with its associated value stored in each row.
- p(x)[source]¶
Evaluate response probability from test function.
- Parameters:
x (np.ndarray) – Points at which to evaluate.
- Returns:
Response probability at queried points.
- Return type:
np.ndarray
- sample_y(x)[source]¶
Sample a response from test function.
- Parameters:
x (np.ndarray) – Points at which to sample.
- Returns:
A single (Bernoulli) sample at points.
- Return type:
np.ndarray
- f_hat(model)[source]¶
Generate mean predictions from the model over the evaluation grid.
- Parameters:
model (aepsych.models.base.ModelProtocol) – Model to evaluate.
- Returns:
Posterior mean from underlying model over the evaluation grid.
- Return type:
torch.Tensor
- property f_true: ndarray¶
Evaluate true test function over evaluation grid.
- Returns:
Values of true test function over evaluation grid.
- Return type:
torch.Tensor
- property p_true: Tensor¶
Evaluate true response probability over evaluation grid.
- Returns:
Values of true response probability over evaluation grid.
- Return type:
torch.Tensor
- p_hat(model)[source]¶
Generate mean predictions from the model over the evaluation grid.
- Parameters:
model (aepsych.models.base.ModelProtocol) – Model to evaluate.
- Returns:
Posterior mean from underlying model over the evaluation grid.
- Return type:
torch.Tensor
- evaluate(strat)[source]¶
Evaluate the strategy with respect to this problem.
Extend this in subclasses to add additional metrics. Metrics include:
- mae (mean absolute error), mse (mean squared error), max_abs_err (max absolute error), and Pearson correlation. All of these are computed over the latent variable f and the outcome probability p, w.r.t. the posterior mean. Absolute and squared errors (miae, mise) are also computed in expectation over the posterior, by sampling.
- Brier score, which measures how well-calibrated the outcome probability is, both at the posterior mean (plain brier) and in expectation over the posterior (expected_brier).
- Parameters:
strat (aepsych.strategy.Strategy) – Strategy to evaluate.
- Returns:
A dictionary containing metrics and their values.
- Return type:
Dict[str, float]
- class aepsych.benchmark.LSEProblem[source]¶
Bases:
Problem
Level set estimation problem.
This extends the base problem class to evaluate the LSE/threshold estimate in addition to the function estimate.
- threshold = 0.75¶
- property metadata: Dict[str, Any]¶
A dictionary of metadata passed to the Benchmark to be logged. Each key will become a column in the Benchmark’s output dataframe, with its associated value stored in each row.
- property true_below_threshold: ndarray¶
Evaluate whether the true function is below threshold over the eval grid (used for proper scoring and threshold misclassification metric).
- evaluate(strat)[source]¶
Evaluate the model with respect to this problem.
For level set estimation, we add metrics w.r.t. the true threshold:
- brier_p_below_{thresh}, the Brier score w.r.t. p(f(x)<thresh), in contrast to regular brier, which is the Brier score for p(phi(f(x))=1), and the same for misclassification error.
- Parameters:
strat (aepsych.strategy.Strategy) – Strategy to evaluate.
- Returns:
A dictionary containing metrics and their values, including parent class metrics.
- Return type:
Dict[str, float]
- aepsych.benchmark.make_songetal_testfun(phenotype='Metabolic', beta=1)[source]¶
Make an audiometric test function following Song et al. 2017.
To do so, we first compute a threshold by interpolation/extrapolation from real data, then assume a linear psychometric function in intensity with slope beta.
- Parameters:
phenotype (str, optional) – Audiometric phenotype from Dubno et al. 2013. Specifically, one of “Metabolic”, “Sensory”, “Metabolic+Sensory”, or “Older-normal”. Defaults to “Metabolic”.
beta (float, optional) – Psychometric function slope. Defaults to 1.
- Returns:
A test function taking a [b x 2] array of points and returning the psychometric function value at those points.
- Return type:
Callable[[np.ndarray, bool], np.ndarray]
- Raises:
AssertionError – if an invalid phenotype is passed.
References
- Song, X. D., Garnett, R., & Barbour, D. L. (2017). Psychometric function estimation by probabilistic classification. The Journal of the Acoustical Society of America, 141(4), 2513–2525. https://doi.org/10.1121/1.4979594
- aepsych.benchmark.novel_detection_testfun(x)[source]¶
Evaluate novel detection test function from Owen et al.
The threshold is roughly parabolic with context, and the slope varies with the threshold.
- Parameters:
x (np.ndarray) – Points at which to evaluate.
- Returns:
Value of function at these points.
- Return type:
np.ndarray
- aepsych.benchmark.novel_discrimination_testfun(x)[source]¶
Evaluate novel discrimination test function from Owen et al.
The threshold is roughly parabolic with context, and the slope varies with the threshold. Adding to the difficulty is the fact that the function is minimized at f=0 (or p=0.5), corresponding to discrimination being at chance at zero stimulus intensity.
- Parameters:
x (np.ndarray) – Points at which to evaluate.
- Returns:
Value of function at these points.
- Return type:
np.ndarray
- aepsych.benchmark.run_benchmarks_with_checkpoints(out_path, benchmark_name, problems, configs, global_seed=None, n_chunks=1, n_reps_per_chunk=1, log_every=None, checkpoint_every=60, n_proc=1, serial_debug=False)[source]¶
Runs a series of benchmarks, saving both final and intermediate results to .csv files. Benchmarks are run in sequential chunks, each of which runs all combinations of problems/configs/reps in parallel. Always call this function under the if __name__ == "__main__": idiom.
- Parameters:
out_path (str) – The path to save the results to.
benchmark_name (str) – A name given to this set of benchmarks. Results will be saved in files named like “out_path/benchmark_name_chunk{chunk_number}_out.csv”.
problems (List[Problem]) – Problem objects containing the test function to evaluate.
configs (Mapping[str, Union[str, list]]) – Dictionary of configs to run. Lists at leaves are used to construct a cartesian product of configurations.
global_seed (int, optional) – Global seed to use for reproducible benchmarks. Defaults to randomized seeds.
n_chunks (int) – The number of chunks to break the results into. Each chunk will contain at least 1 run of every combination of problem and config.
n_reps_per_chunk (int, optional) – Number of repetitions to run each problem/config in each chunk.
log_every (int, optional) – Logging interval during an experiment. Defaults to only logging at the end.
checkpoint_every (int) – Save intermediate results every checkpoint_every seconds.
n_proc (int) – Number of processors to use.
serial_debug (bool) – Whether to run serially for debugging.
- Return type:
None