Workflow overview
HydroBOT can be used stepwise, calling the Controller, Aggregator, and Comparer in separate scripts or notebooks. This is often the best approach for large jobs, especially where the Aggregator and Comparer steps may need to be re-run without re-running the response modules. Alternatively, it can be called in one go, feeding all necessary parameters in at once; in this case, we can think of the Controller as simply having larger scope, passing arguments all the way through instead of just to the modules. This can be done in-memory or saving outputs at each step. In either case, it can be controlled interactively in notebooks, or with parameters, which might come from a `params.yml` file, parameters in a notebook, or arguments passed to `Rscript` at the command line.
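To make the stepwise pattern concrete, here is a minimal sketch. The function names appear on this page, but the paths and the particular arguments shown are illustrative assumptions rather than complete signatures; see the function help pages for the actual interfaces.

```r
# Stepwise sketch: Controller, then Aggregator, in separate calls.
# Paths and argument sets are assumptions for illustration.
library(HydroBOT)

# Controller: run the response modules (EWR tool) over the hydrographs
# and save their outputs (arguments assumed; see ?prep_run_save_ewrs)
ewr_out <- prep_run_save_ewrs(
  hydro_dir = "path/to/hydrographs",
  output_parent_dir = "path/to/outputs"
)

# Aggregator: read the saved module outputs and aggregate
# (arguments assumed; see ?read_and_agg)
agg_out <- read_and_agg(
  datpath = "path/to/outputs/module_output"
)

# Comparer: typically run interactively in a notebook on `agg_out`
```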
The workflows section here provides three examples. Similar but not identical examples are available in the template repository, which contains helper scripts for setup on several specific systems as well as skeleton scripts for various workflows. If setup help is needed for a particular system (e.g. Azure at MDBA, HPC systems), contact the authors.
HydroBOT auto-documents itself, saving the settings passed to `prep_run_save_ewrs()` and `read_and_agg()` into `*.yml` files. These files also attempt to capture the metadata for the scenarios, if it exists. The yaml files are fully-specified parameter files for running HydroBOT, along with some additional run information such as the time of the run and the software versions of the EWR tool and HydroBOT. As such, runs can be replicated by re-running HydroBOT with `run_hydrobot_params(yamlpath = 'path/to/generated/metadata.yml')`.
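For example, a run can be replicated from its generated yaml with a one-line script, either interactively or via `Rscript` at the command line; the path below is a placeholder for wherever the metadata file was written.

```r
# Re-run HydroBOT exactly as documented in the auto-generated metadata yaml
library(HydroBOT)
run_hydrobot_params(yamlpath = "path/to/generated/metadata.yml")
```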
In practice, with large jobs, the typical approach is to run the response modules (EWR tool) and a default Aggregator as a large parallel job over scenarios, whether that parallelisation happens locally, on an HPC, on Azure, or on Databricks. By saving the results of each step, additional aggregations can then be re-run in parallel with dedicated scripts as adjustments are needed to address the question of interest, without re-running the EWR tool. The Comparer is almost always run interactively in a notebook or notebooks, for three reasons: first, comparisons across scenarios cannot be parallelised; second, comparisons tend to be fast relative to the processing steps in the Controller and Aggregator; and third, the comparison step is often quite interactive and iterative, as relevant outputs are developed and adjusted to target the questions of interest. If a final, large set of outputs is needed, such notebooks would still be used to settle on those outputs, and then re-run if needed, e.g. with updated input data.
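As a rough sketch of that pattern (not the package's own parallel machinery), the Controller step can be parallelised over scenario directories with, for example, the `furrr` package. The directory layout and the arguments to `prep_run_save_ewrs()` are assumptions for illustration.

```r
# Hedged sketch: run the Controller independently per scenario in parallel.
# Scenario layout and prep_run_save_ewrs() arguments are assumptions.
library(HydroBOT)
library(furrr)

plan(multisession)  # local parallelism; swap for a cluster plan on HPC

# One hydrograph directory per scenario (hypothetical layout)
scenario_dirs <- list.dirs("path/to/hydrographs", recursive = FALSE)

# Run the EWR tool for each scenario; outputs are saved, not returned
future_walk(scenario_dirs, function(sd) {
  prep_run_save_ewrs(
    hydro_dir = sd,
    output_parent_dir = "path/to/outputs"
  )
})
```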
If you want to supply your own spatial data, causal networks, or aggregation functions, see the examples in the aggregator section for detail; in the full workflow, these are typically supplied through `read_and_agg()`, as sketched below.
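A loosely-hedged sketch of the idea follows, assuming `read_and_agg()` accepts custom spatial data and aggregation functions via sequence arguments as described in the aggregator section; all argument names and paths here are assumptions, not the confirmed interface.

```r
# Sketch of custom inputs to read_and_agg(); argument names are assumed.
library(HydroBOT)
library(sf)

# Custom aggregation function (any function of a numeric vector)
mean_na <- function(x) mean(x, na.rm = TRUE)

# Custom spatial units (hypothetical polygon layer)
my_polys <- read_sf("path/to/my_polygons.shp")

agg_out <- read_and_agg(
  datpath = "path/to/outputs/module_output",  # assumed module output path
  aggsequence = list(my_polys = my_polys),    # aggregate into custom polygons
  funsequence = list(my_polys = "mean_na")    # apply the custom function
)
```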