Data preparation

library(HydroBOT)
library(dplyr)
library(ggplot2)

In some cases, data coming in from the response modules may need to be prepared before use by the aggregator. A common situation is the calculation of ‘synthetic’ variables from those in the module output. For example, HydroBOT contains functions to assess the EWR tool outputs in terms of whether they meet frequency and interevent requirements, as well as characteristics of those interevent periods. See prep_ewr_output() and specifically assess_ewr_achievement() and assess_ewr_interevent().

In the language of dplyr, these are mutate-type operations, rather than the summarise-type operations of the Aggregator proper. They prepare new columns to be aggregated. Such functions might range from fairly complex (e.g. prep_ewr_output()) to not needed at all (see the user-provided module example).

HydroBOT provides a default function, prep_ewr_output(), for EWR achievement (assess_ewr_achievement()) and interevent (assess_ewr_interevent()) assessment. However, it is possible to pass in other data preparation functions using the prepfun and prepargs arguments to read_and_agg(). If no preparation is needed, users should use prepfun = "identity" to leave the data unchanged.
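The general pattern of a function name plus a list of extra arguments can be sketched in base R with do.call(). This is only an illustration of the idea, not HydroBOT's internal implementation; apply_prep() and add_scaled() are hypothetical names introduced here, and the toy data frame is made up.

```r
# Hypothetical helper, NOT part of HydroBOT: apply a named prep function
# to a data frame, passing any extra arguments from a list.
apply_prep <- function(dat, prepfun = "identity", prepargs = list()) {
  do.call(prepfun, c(list(dat), prepargs))
}

# Made-up data for illustration only
toy <- data.frame(year = c(2020, 2021), event_days = c(10, 4))

# prepfun = "identity" returns the data unchanged
unchanged <- apply_prep(toy, prepfun = "identity")

# A toy prep function with one extra argument
add_scaled <- function(dat, multiplier = 1) {
  dat$scaled_days <- dat$event_days * multiplier
  dat
}

# Extra arguments are supplied as a named list
scaled <- apply_prep(toy, prepfun = "add_scaled", prepargs = list(multiplier = 2))
```

The point of the sketch is that the data frame is always the first argument and everything else arrives by name, which is why a user-supplied prep function should take the data first.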

Tip

It would be natural for the default to be prepfun = "identity", but due to many existing analyses expecting automatic preparation of EWR outputs, the default is prepfun = "prep_ewr_output", and read_and_agg() automatically passes the achievement or interevents arguments to prep_ewr_output() depending on the value of type.

This will hopefully change in future to be more general, so it is best practice for new code to use a type argument matching the filenames, along with appropriate prepfun and prepargs. E.g. for EWR achievement, use type = 'yearly', prepfun = 'prep_ewr_output', prepargs = list(type = 'achievement').

User-provided dataprep

As an example, let’s say that instead of calculating the EWR achievement as is done by default, we want to calculate the difference between totalEventDays and eventLength, and then aggregate that new value. We assume we have run the module and so have data at these paths:

project_dir <- "hydrobot_scenarios"
hydro_dir <- file.path(project_dir, "hydrographs")
ewr_results <- file.path(project_dir, "module_output", "EWR")
agg_results <- file.path(project_dir, "aggregator_output", "demo")

Define dataprep function

First, we define a new function that takes the read-in dataframe as its first argument. It can have other arguments as well, which can be passed via prepargs. Here, we’ll add a contrived argument that switches between calculating the difference and the ratio.

daydiff <- function(dat, dr = "difference") {
  # This bit isn't necessary, it just keeps some processing similar with the built-in EWR prep functions.
  names(dat) <- HydroBOT:::nameclean(names(dat))
  dat <- HydroBOT::cleanewrs(dat)
  dat$date <- as.Date(paste0(as.character(dat$year), "-07-01"))


  if (dr == "difference") {
    dat <- dat |>
      dplyr::mutate(synthvals = total_event_days - event_length)
  } else if (dr == "ratio") {
    dat <- dat |>
      dplyr::mutate(synthvals = total_event_days / event_length)
  } else {
    rlang::abort("Bad dr argument")
  }

  return(dat)
}
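Before pointing read_and_agg() at a new prep function, it can help to check its core logic on a small hand-made data frame. The sketch below uses a stripped-down, hypothetical variant (daydiff_core()) that omits the HydroBOT cleaning steps and uses base R rather than dplyr. It assumes the columns are already named total_event_days and event_length, and the values are made up.

```r
# Hypothetical stripped-down variant of daydiff(), for checking logic only.
# It skips the HydroBOT name cleaning and date handling.
daydiff_core <- function(dat, dr = "difference") {
  if (dr == "difference") {
    dat$synthvals <- dat$total_event_days - dat$event_length
  } else if (dr == "ratio") {
    dat$synthvals <- dat$total_event_days / dat$event_length
  } else {
    stop("Bad dr argument")
  }
  dat
}

# Made-up values, assuming already-cleaned column names
toy <- data.frame(
  year = c(2020, 2021),
  total_event_days = c(12, 30),
  event_length = c(4, 10)
)

toy_diff <- daydiff_core(toy)                # default: difference
toy_ratio <- daydiff_core(toy, dr = "ratio") # as with prepargs = list(dr = "ratio")

# An unsupported dr value raises an error; capture its message
bad_dr <- tryCatch(
  daydiff_core(toy, dr = "product"),
  error = function(e) conditionMessage(e)
)
```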

Aggregate (difference)

Now specify aggregation functions for the difference. Differences often make sense as sums in aggregation, so here we’ll use a simple set.

aggseq_d <- list(
  all_time = "all_time",
  sdl_units = sdl_units,
  Target = c("ewr_code_timing", "Target")
)

funseq_d <- list(
  all_time = "Sum",
  sdl_units = "Sum",
  Target = "Sum"
)

We use type = 'yearly' to get the yearly output files and prepfun = 'daydiff' to specify our new function. Here, prepargs is simply the default empty list, since we’re using the function defaults.

diffagg <- read_and_agg(
  datpath = ewr_results,
  type = "yearly",
  geopath = bom_basin_gauges,
  causalpath = causal_ewr,
  groupers = "scenario",
  prepfun = "daydiff",
  prepargs = list(),
  aggCols = "synthvals",
  aggsequence = aggseq_d,
  funsequence = funseq_d,
  keepAllPolys = FALSE,
  group_until = list(
    planning_unit_name = "sdl_units",
    gauge = is_notpoint,
    SWSDLName = "sdl_units"
  ),
  pseudo_spatial = "sdl_units",
  saveintermediate = TRUE,
  namehistory = FALSE
)

We can then plot that, noting that these values are contrived.

plot_outcomes(diffagg$Target,
  outcome_col = "synthvals",
  plot_type = "map",
  colorset = "synthvals",
  facet_row = "scenario",
  facet_col = "Target"
)

Aggregate (passing an argument)

Now, we can use prepargs to set the arguments available in daydiff().

We’d likely specify a different set of aggregation functions, though ratios are inherently tricky to aggregate. Here, we just use means, though a more meaningful aggregation should be chosen if the prepared values are meant to be interpreted.
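One reason ratios are tricky is that the mean of ratios generally differs from the ratio of the summed components, so the choice of aggregation function changes the answer. A small arithmetic illustration in base R, using made-up numbers rather than real EWR output:

```r
# Made-up values for two records
total_event_days <- c(2, 90)
event_length <- c(1, 100)

# Averaging the per-record ratios: (2 / 1 + 90 / 100) / 2
mean_of_ratios <- mean(total_event_days / event_length)

# Ratio of the totals: (2 + 90) / (1 + 100)
ratio_of_sums <- sum(total_event_days) / sum(event_length)
```

Here the first record has a short event length but dominates the mean of ratios, while the ratio of sums is driven by the longer record.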

aggseq_r <- list(
  all_time = "all_time",
  sdl_units = sdl_units,
  Target = c("ewr_code_timing", "Target")
)

funseq_r <- list(
  all_time = "ArithmeticMean",
  sdl_units = "ArithmeticMean",
  Target = "ArithmeticMean"
)

Now, all we change is to use those aggregation and function sequences, and set prepargs = list(dr = 'ratio').

ratioagg <- read_and_agg(
  datpath = ewr_results,
  type = "yearly",
  geopath = bom_basin_gauges,
  causalpath = causal_ewr,
  groupers = "scenario",
  prepfun = "daydiff",
  prepargs = list(dr = "ratio"),
  aggCols = "synthvals",
  aggsequence = aggseq_r,
  funsequence = funseq_r,
  keepAllPolys = FALSE,
  group_until = list(
    planning_unit_name = "sdl_units",
    gauge = is_notpoint,
    SWSDLName = "sdl_units"
  ),
  pseudo_spatial = "sdl_units",
  saveintermediate = TRUE,
  namehistory = FALSE
)

And plot again, noting these are just examples of functionality and not well-supported aggregations.

plot_outcomes(ratioagg$Target,
  outcome_col = "synthvals",
  plot_type = "map",
  colorset = "synthvals",
  facet_row = "scenario",
  facet_col = "Target"
)