Parameterised workflow

This document provides a template for running HydroBOT from a parameters file, as we might do when batch processing. As such, it typically would not be run through a notebook, but called with Rscript. That sort of setup can take many forms depending on the use case, and we assume the user is familiar with shell scripting and the particular idiosyncrasies of their batching system. Here, we demonstrate how to set up the parameter file and use it, and leave it to the user to build the script that gets called with Rscript from the command line or as part of an external process.

HydroBOT supports a default params file, a second params file that tweaks those defaults, and params included in the Quarto header yaml. All of these options are used here.

Load the package

library(HydroBOT)

Structure of params files and arguments

The run_hydrobot_params function takes four arguments: yamlpath, a path to a yaml params file; passed_args, yaml text that can come from the command line; list_args, an R list; and defaults, a path to another yaml file. This two-yaml approach lets us set most of the params in common across all runs in defaults, and modify only a subset with the yamlpath file, passed_args, or list_args.

In all cases, the arguments end up in a list with two top-level items: ewr and aggregation, within which items can be added with names matching the arguments to [prep_run_save_ewrs()] and [read_and_agg()], respectively. This gives full control over those functions.

The package comes with a set of default parameters in system.file('yml/default_params.yml', package = 'HydroBOT'). Users can, and generally should, create their own default yaml params file to set a standard set of defaults for a given project. See the packaged file for the basic structure.

The params.yml file (or a file with any other name, passed to yamlpath), passed_args, and list_args can then be used to modify the default values. The idea is that only a small subset of those defaults would be modified for a particular run.

In general, it is best to specify everything in terms of characters, logicals, or NULL. If there is a situation where that isn’t possible (bespoke spatial data, for example), it is possible to specify the aggsequence and funsequence with an R script. To do that, change the aggregation_def entry of the aggregation list to the path to that R script. For an example, see system.file('yml/params.R', package = 'HydroBOT').

Finally, [run_hydrobot_params()] ingests paths to these files (or arguments passed from the command line or as lists), turns their params into R arguments, and runs HydroBOT.

Important

The arguments overwrite each other, so list_args has highest precedence, followed by passed_args, yamlpath, and finally defaults.
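
The precedence order can be illustrated with plain R lists. This is a minimal sketch of the layering idea only, using utils::modifyList(); it is an assumption for illustration, not necessarily how run_hydrobot_params() merges its arguments internally, and all values are made up.

```r
# Illustration only: four hypothetical argument sources, lowest precedence first.
defaults <- list(
  ewr = list(output_parent_dir = "toolkit_project", returnType = "none"),
  aggregation = list(namehistory = FALSE, keepAllPolys = TRUE)
)
yaml_args <- list(aggregation = list(keepAllPolys = FALSE))
passed_args <- list(
  ewr = list(output_parent_dir = "hydrobot_scenarios"),
  aggregation = list(namehistory = TRUE)
)
list_args <- list(ewr = list(output_parent_dir = "hydrobot_scenarios/base"))

# modifyList() merges nested lists recursively; each later source
# overwrites matching entries from the earlier ones.
merged <- Reduce(
  utils::modifyList,
  list(defaults, yaml_args, passed_args, list_args)
)

str(merged)
```

Entries that appear only in the lower-precedence sources (returnType here) survive untouched, which is why a run-specific file only needs to contain the handful of values that differ from the defaults.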

Note

At present we do not provide yaml param options for the comparer. This is possible, but the possibilities are a bit too wide open. It is likely the user will want to explore the output, rather than generate parameterised output, though that may change in future.

Parameters

This section provides a look at the parameters being set in the various params files or passed in.

There are a number of parameters to set, mirroring those set in the notebook-driven runs of HydroBOT, e.g. running while saving steps.

Here, we provide example yaml that may appear in the files at defaults or yamlpath.

Additional parameters

Specify the aggregation sequence in R and pass the path to that file.

aggregation:
  # aggregation sequences (need to be defined in R)
  aggregation_def: 'toolkit_project/agg_params.R'

Directories

Input and output directories

ewr:
  # Outer directory for scenario
  output_parent_dir: 'toolkit_project'

  # Preexisting data
  # Hydrographs (expected to exist already)
  hydro_dir: NULL
  

aggregation:
  # outputs of aggregator
  savepath: 'path/to/aggs'

Normally output_parent_dir should point somewhere external, though keeping it inside or alongside the hydrograph data is a good idea.

Setting hydro_dir or savepath to NULL tells HydroBOT to expect (for hydro_dir) or build (for savepath) a standard toolkit directory structure, with output_parent_dir as the outer directory, holding hydrographs, aggregator_output, and module_output subdirectories.
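
The standard layout can be sketched with base R for orientation. The subdirectory names come from the description above; the project_dir variable and the use of a temporary directory are assumptions made to keep the sketch self-contained, and in a real run HydroBOT builds its own output directories.

```r
# Hypothetical outer directory, standing in for output_parent_dir
project_dir <- file.path(tempdir(), "toolkit_project")

# Subdirectories named in the text
subdirs <- file.path(
  project_dir,
  c("hydrographs", "module_output", "aggregator_output")
)

# recursive = TRUE also creates project_dir itself if needed
for (d in subdirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

list.files(project_dir)
```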

Module arguments

Currently, the only module is the EWR tool. Any argument in [prep_run_save_ewrs()] can be passed. Some examples are:

ewr:
  # Model type
  model_format: 'IQQM - netcdf'
  
  # output and return
  outputType:
    - summary
  
  returnType: none

Aggregation settings

Any argument to [read_and_agg()] can be passed. Some examples are:

aggregation:
  # What to aggregate
  type: achievement
  
  # Aggregation settings
  groupers: scenario
  aggCols: ewr_achieved
  namehistory: FALSE
  keepAllPolys: TRUE

Run HydroBOT

These examples are set not to evaluate in normal use, but show different ways of running HydroBOT from parameters.

This runs the toolkit using a yaml parameter file that modifies the default provided with HydroBOT.

run_hydrobot_params(yamlpath = file.path("workflows", "params.yml"))
! Unmatched links in causal network
• 11 from env_obj to Specific_goal
! Unmatched links in causal network
• 7 from Objective to target_5_year_2024

When passing arguments as text, it is tricky to get the yaml right for more than one argument, but this can be useful for command-line use, for example. Here, we demonstrate changing output_parent_dir and namehistory, noting that the number of spaces and newlines is critical for this to work. In practice, this would need some tweaking to use with Rscript and extract the string from commandArgs().

run_hydrobot_params(
  yamlpath = file.path("workflows", "params.yml"),
  passed_args = "ewr:\n output_parent_dir: 'hydrobot_scenarios'\naggregation:\n namehistory: TRUE"
)
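
One way the command-line case might look is sketched below. It assumes a hypothetical script named run_params.R that receives the yaml text as its first trailing argument, with newlines typed as the two characters \n. The helper unescape_newlines() is made up for this example, and the call to run_hydrobot_params() only happens if an argument was supplied and HydroBOT is installed.

```r
# run_params.R (hypothetical script name)
# Intended call (assumption):
#   Rscript run_params.R "ewr:\n output_parent_dir: 'hydrobot_scenarios'"

# Shells usually pass "\n" through as a literal backslash followed by n,
# so convert those two characters into real newlines before the text is
# treated as yaml. fixed = TRUE makes the match literal, not a regex.
unescape_newlines <- function(x) gsub("\\n", "\n", x, fixed = TRUE)

# Only the arguments after the script name
cli_args <- commandArgs(trailingOnly = TRUE)

if (length(cli_args) > 0 && requireNamespace("HydroBOT", quietly = TRUE)) {
  HydroBOT::run_hydrobot_params(
    yamlpath = file.path("workflows", "params.yml"),
    passed_args = unescape_newlines(cli_args[1])
  )
}
```

Quoting rules differ between shells and batch schedulers, so the unescaping step may need adjusting for a given system.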

We can also pass arguments in a list from R, which has somewhat easier syntax. Here, we use it to run only a subset of the scenarios and put the outputs in the same directory as the hydrographs, as well as to specify aggregation with an R file.

run_hydrobot_params(list_args = list(
  ewr = list(
    output_parent_dir = "hydrobot_scenarios/hydrographs/base",
    hydro_dir = "hydrobot_scenarios/hydrographs/base"
  ),
  aggregation = list(
    aggregation_def = "workflows/params.R",
    auto_ewr_PU = TRUE
  )
))

Finally, params included in the parameters section of a Quarto notebook should in principle be parsed. Quarto with R puts these in a list called params, so we could just pass that. Unfortunately, the params list in Quarto is not full-featured yaml and cannot hold nested lists, so this does not currently work, though it may work with the !r syntax in Rmarkdown. This approach is unlikely to be very useful.

run_hydrobot_params(list_args = params)

Replication

The prep_run_save_ewrs and read_and_agg functions save metadata yaml files that are fully specified parameters files. Thus, to replicate a run, we can run from the final yaml (saved after the aggregator), as it records all preceding steps.

run_hydrobot_params(yamlpath = "hydrobot_scenarios/aggregator_output/agg_metadata.yml")