Aggregation functions and grouping

library(HydroBOT)

To prepare, we generate some EWR outputs to use as example data.

project_dir <- "hydrobot_scenarios"
hydro_dir <- file.path(project_dir, "hydrographs")
ewr_results <- file.path(project_dir, "module_output", "EWR")

ewr_out <- prep_run_save_ewrs(
  hydro_dir = hydro_dir,
  output_parent_dir = project_dir,
  outputType = list("none"),
  returnType = list("yearly")
)


# This is just a simple prep step that is usually done internally to put the geographic coordinates on input data
sumdat <- prep_ewr_output(ewr_out$yearly, type = "achievement")

Multiple aggregation columns

HydroBOT can aggregate more than one column at a time if they use the same aggregation sequence. Here, we use aggCols = c("frequency_achieved", "ewr_achieved") to perform the same aggregation steps on frequency achievement and achievement considering both frequency and interevent conditions.

aggseq_1 <- list(
  all_time = "all_time",
  ewr_code = c("ewr_code_timing", "ewr_code"),
  env_obj = c("ewr_code", "env_obj"),
  sdl_units = sdl_units,
  Target = c("env_obj", "Target")
)


funseq_1 <- list(
  all_time = "ArithmeticMean",
  ewr_code = "ArithmeticMean",
  env_obj = "ArithmeticMean",
  sdl_units = "ArithmeticMean",
  Target = "ArithmeticMean"
)

And we see that the output has an ewr_achieved and a frequency_achieved column that have both carried through the sequence.

agged_multi <- multi_aggregate(
  dat = sumdat,
  causal_edges = causal_ewr,
  groupers = c("scenario"),
  aggCols = c("ewr_achieved", "frequency_achieved"),
  aggsequence = aggseq_1,
  funsequence = funseq_1,
  namehistory = FALSE
)

agged_multi |> dplyr::filter(scenario != "MAX")

User-set functions

We have established a simple set of default aggregation functions (ArithmeticMean(), GeometricMean(), LimitingFactor(), CompensatingFactor(), and SpatialWeightedMean()), available in default_agg_functions.R. It is also possible to supply user-defined functions to include in funsequence. We can do this by specifying a new function and including it in the list given to funsequence. Some different syntax is provided below in Section 3.2, but the best way is shown here, involving defining a function or functions and using it just as we would built-ins.

We demonstrate here with a threshold function and a median. For example, we might want to know the mean of all values greater than 0 for some stages and use the median at others.

mean_given_occurred <- function(x) {
  mean(ifelse(x > 0, x, NA), na.rm = TRUE)
}

medna <- function(x) {
  median(x, na.rm = TRUE)
}

aggseq <- list(
  all_time = "all_time",
  ewr_code = c("ewr_code_timing", "ewr_code"),
  env_obj = c("ewr_code", "env_obj"),
  sdl_units = sdl_units,
  Target = c("env_obj", "Target")
)


funseq <- list(
  all_time = "ArithmeticMean",
  ewr_code = "ArithmeticMean",
  env_obj = "mean_given_occurred",
  sdl_units = "medna",
  Target = "mean_given_occurred"
)

Those changes are then reflected in the aggregation history and determine the aggregated values.

agged_custom_funs <- multi_aggregate(
  dat = sumdat,
  causal_edges = causal_ewr,
  groupers = c("scenario"),
  aggCols = "ewr_achieved",
  aggsequence = aggseq,
  funsequence = funseq,
  namehistory = FALSE
)

agged_custom_funs |> dplyr::filter(scenario != "MAX")

We can plot them to double check it worked.

agged_custom_funs |>
  dplyr::filter(scenario != "MAX") |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    y_lab = "Arithmetic Mean",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "ewr_achieved",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "Target",
    facet_row = "scenario",
    sceneorder = c("down4", "base", "up4")
  )

Nonstandard syntax

The aggregation functions have flexible syntax in some of their arguments, particularly for selecting grouping columns, the column(s) of values to aggregate, specification of aggregation functions, and the specification of spatial and temporal aggregation units. In general, these apply across the functions, though multi_aggregate has a bit less flexibility than the internal spatial_aggregate, theme_aggregate, temporal_aggregate, and general_aggregate, largely as a consequence of passing through the call stack. Here, we use primarily the example of spatial_aggregate to illustrate the different arguments, their syntax, and why we might use the different options.

First, create test data as in the spatial notebook. This does some simple pre-aggregation in time and theme dimensions to yield something easier to visualise.

# This gets us to env_obj at the gauge over all time
preseq <- list(
  all_time = "all_time",
  ewr_code = c("ewr_code_timing", "ewr_code"),
  env_obj = c("ewr_code", "env_obj")
)

funseq <- list(
  "ArithmeticMean",
  "CompensatingFactor",
  "ArithmeticMean"
)

# Do the aggregation to get env_obj at each gauge
simpleThemeAgg <- multi_aggregate(
  dat = sumdat,
  causal_edges = causal_ewr,
  groupers = c("scenario", "gauge"),
  aggCols = "ewr_achieved",
  aggsequence = preseq,
  funsequence = funseq
)

simpleThemeAgg

Now that we have a dataframe that’s easier to look at, we use it to examine the syntax.

Selecting grouping and data columns

Both aggCols and groupers can be character vectors, bare data-variable names, or we might want to use tidyselect syntax. For example, maybe we want to use ends_with('ewr_achieved') to grab pre-aggregated columns with long name histories or starts_with('scenar') instead of specifying 'scenario' as a grouper. This is handled under the hood by selectcreator and careful parsing in the function stack. In general, the safest thing to use is characters.

Here, we use a combination of tidyselect and bare column names for groupers, a character for the aggCols, and bare function name for the function to apply.

Tip

multi_aggregate() takes advantage of this tidyselect ability under the hood to deal with the ever-lengthening column names (and sometimes expanding number of value columns if we have multiple aggregation functions at a step). This means, though, that using tidyselect to specify aggCols in multi_aggregate() or read_and_agg() is fragile. We try to handle it by changing to character, but it can collide with the internal tidyselect, and so characters are a safer option in those outer functions.

Similarly, multi_aggregate() and read_and_agg() have to attempt to extract the character name from bare functions. If this extraction fails, they get lost for the purposes of naming the history by the time they get used in the call stack. So again, characters are safer.

Note

We need to specify env_obj as a grouper here because we are using spatial_aggregate() directly; in normal use with multi_aggregate() or read_and_agg() the dimensional safety will enforce that grouping in the theme dimension while doing spatial aggregation.

This throws warnings about its meaning for the EWRs; since the point here is to demonstrate functionality, we will not print them.

obj2poly2 <- spatial_aggregate(
  dat = simpleThemeAgg,
  to_geo = sdl_units,
  groupers = c(starts_with("sce"), env_obj),
  aggCols = "env_obj_ArithmeticMean_ewr_code_CompensatingFactor_all_time_ArithmeticMean_ewr_achieved",
  funlist = ArithmeticMean,
  keepAllPolys = TRUE
)

obj2poly2

We can see we get the same result as obj2poly with different ways of specifying aggCols and groupers.

There are times when we might want to send a vector of names, but ignore those not in the data. This typically occurs when there is a set of possible grouping variables but only some exist in a particular dataset, so they should be used if present and ignored if not. It fails by default, because in normal use datasets would match, but setting failmissing = FALSE allows it to pass. Here, the ‘extra_grouper’ column doesn’t exist in the data and so is ignored. The failmissing argument also applies to aggCols; if it is TRUE, aggCols can have values that are not column names in the data.

We also now use only characters for groupers, but switch to tidyselect to specify aggCols.

obj2polyF <- spatial_aggregate(
  dat = simpleThemeAgg,
  to_geo = sdl_units,
  groupers = c("scenario", "env_obj", "extra_grouper"),
  aggCols = ends_with("ewr_achieved"),
  funlist = ArithmeticMean,
  keepAllPolys = TRUE,
  failmissing = FALSE
)
obj2polyF

Function specifications

We can pass single bare aggregation function names, characters, or named lists defining functions with arguments (though this is fragile). We can also pass multiple functions to apply; for example we might want to calculate the mean, max, and min all at the same time.

Above, we have been specifying the function to apply as just a single bare function name. Now, we explore some other possibilities and capabilities of the aggregator.

Most simply, we can pass character names of functions instead of bare

doublesimplechar <- spatial_aggregate(
  dat = simpleThemeAgg,
  to_geo = sdl_units,
  groupers = c("scenario", "env_obj"),
  aggCols = ends_with("ewr_achieved"),
  funlist = "ArithmeticMean",
  keepAllPolys = TRUE,
  failmissing = FALSE
)
doublesimplechar

If we want to do two different aggregations on the same data, we can pass a vector of names. Now we have two output columns, starting with spatial_ArithmeticMean_... and spatial_GeometricMean... .

#|
simplefuns <- c("ArithmeticMean", "GeometricMean")

doublesimplec <- spatial_aggregate(
  dat = simpleThemeAgg,
  to_geo = sdl_units,
  groupers = c("scenario", "env_obj"),
  aggCols = ends_with("ewr_achieved"),
  funlist = simplefuns,
  keepAllPolys = TRUE,
  failmissing = FALSE
)
doublesimplec

Note

If passing multiple functions, it is unlikely to work to pass bare names. It just gets too complex to handle, and the bare names is really just a convenience shorthand.

We can also use anonymous (lambda) functions in a list. This example is trivial, but it becomes useful (though still rarely a good idea in production) in the next section. It does not work with more modern \(x) lambda functions. Given the fragility of this approach, it has not been a high priority to get working.

lambda1 <- spatial_aggregate(
  dat = simpleThemeAgg,
  to_geo = sdl_units,
  groupers = c("scenario", "env_obj"),
  aggCols = ends_with("ewr_achieved"),
  funlist = list(mean = ~ mean(.)),
  keepAllPolys = TRUE,
  failmissing = FALSE
)
lambda1

Arguments to aggregation functions

There are three primary ways to specify function arguments- using ... in spatial_aggregate, writing a wrapper function with the arguments specified (e.g. see ArithmeticMean, which is just mean(x, na.rm = TRUE)), or using anonymous functions with ~ syntax in a named list. The simplest version is to use …, but this really only works in simple cases, like passing na.rm = TRUE. It does work for multiple functions, but starts getting convoluted and unclear if they don’t share arguments or there are many arguments. Thus, the best option is to specify a custom function, and pass it.

Here, we pass na.rm = TRUE to both mean and sd using ....

singlearg <- spatial_aggregate(
  dat = simpleThemeAgg,
  to_geo = sdl_units,
  groupers = "scenario",
  aggCols = ends_with("ewr_achieved"),
  funlist = c(mean, sd),
  na.rm = TRUE,
  keepAllPolys = TRUE,
  failmissing = FALSE
)
singlearg

We can also pass arguments by sending a list of lambda functions with their arguments. This is far more flexible than the … approach, as we can send any arguments to any functions this way. For clarity, we demonstrate it here for the same situation- passing the na.rm argument to mean and sd. This also lets us control the function names, because the list-names do not need to match the function names. The list-names are what get used in history-tracking (see the column names).

simplelamfuns <- list(
  meanna = ~ mean(., na.rm = TRUE),
  sdna = ~ sd(., na.rm = TRUE)
)

doublelam <- spatial_aggregate(
  dat = simpleThemeAgg,
  to_geo = sdl_units,
  groupers = c("scenario", "env_obj"),
  aggCols = ends_with("ewr_achieved"),
  funlist = simplelamfuns,
  keepAllPolys = TRUE,
  failmissing = FALSE
)
doublelam

Note: if using anonymous functions in a list this way, they need to use rlang ~ syntax, not base \(x) or function(x){}. Given the fragility of this approach, that hasn’t been a priority for implementation. It’s hopefully not much of a constraint, and anything complex can be written as a standard function and called that way.

Vector arguments

See Section 3.2.4 for the only real way to handle vector arguments in HydroBOT workflows. The first part of this deals with how to use vector arguments with [spatial_aggregate()], [theme_aggregate()], and [temporal_aggregate()] directly. If vector arguments other than area are needed in future, HydroBOT will need to accept the vectors themselves as arguments.

Using aggregation functions directly

It’s fairly common that we’ll have vector arguments, especially for the spatial aggregations. One primary example is weightings. The most flexible approach requires these vectors to be attached to the data before it enters the function (vs. creating them automatically in-function). That works well for one-off use of e.g. spatial_aggregate, but is tricky to make happen in an automated workflow though.

First, we illustrate how it works as a one-off, assuming that the vectors are columns in the dataset, demonstrating with weighted means on dummy weights.

Caution

As of {dplyr} 1.1, if we pass a function with a data-variable argument (e.g. the name of a column in the dataframe) we have to wrap the list in rlang::quo. Otherwise it looks for an object with that name instead of a column. The better solution is to use custom function, not a lambda (see below in Section 3.2.4. If we have several layers of aggregation, the inside level where the function is defined needs to be wrapped. E.g.

funlist <- list(c('ArithmeticMean', 'LimitingFactor'),
                rlang::quo(list(wm = ~weighted.mean(., area, na.rm = TRUE))),
                rlang::quo(list(wm = ~weighted.mean(., area, na.rm = TRUE))))

veclamfuns <- rlang::quo(list(
  meanna = ~ mean(., na.rm = TRUE),
  sdna = ~ sd(., na.rm = TRUE),
  wmna = ~ weighted.mean(., wt, na.rm = TRUE)
))

# Not really meaningful, but weight by the number of gauges.
wtgauge <- simpleThemeAgg |>
  dplyr::group_by(scenario, gauge) |>
  dplyr::mutate(wt = dplyr::n()) |>
  dplyr::ungroup()

triplevec <- spatial_aggregate(
  dat = wtgauge,
  to_geo = sdl_units,
  groupers = c("scenario", "env_obj"),
  aggCols = ends_with("ewr_achieved"),
  funlist = veclamfuns,
  keepAllPolys = TRUE,
  failmissing = FALSE
)

triplevec

If we want to have custom functions with vector data arguments (i.e. full named functions, not anonymous functions specified in a list), we still need to use the tilde notation to point to those arguments. Making a dummy function that just adds two to the weighted mean, the wt argument doesn’t get seen if we just say funlist = wt2. Instead, we need to use a list.

wt2 <- function(x, wt) {
  2 + weighted.mean(x, w = wt, na.rm = TRUE)
}

wt2list <- rlang::quo(list(wt2 = ~ wt2(., wt)))


vecnamedfun <- spatial_aggregate(
  dat = wtgauge,
  to_geo = sdl_units,
  groupers = c("scenario", "env_obj"),
  aggCols = ends_with("ewr_achieved"),
  funlist = wt2list,
  keepAllPolys = TRUE,
  failmissing = FALSE
)

vecnamedfun

The ‘area’ exception and HydroBOT

The only exception to attaching vector arguments are situations where the needed vector arguments depend on both sets of from and to data/polygons, and so can’t be pre-attached. Moreover, in a HydroBOT workflow, there is limited ability to pre-attach columns (though it is possible with additional development).

However, area-weighting is almost always needed, and so spatial_joiner() when polygons are present. This means HydroBOT data at polygon scales should always have an area column available for weighting.

If specifying a custom function to use this area column (which will often be the case), the trick is to use .data$area in the function definition. This bypasses the need to use rlang::quo, as it makes the argument explicitly a data-variable. See HydroBOT::SpatialWeightedMean, which is just

SpatialWeightedMean <- function(x, na.rm = TRUE) {
  y <- stats::weighted.mean(x = x, w = .data$area, na.rm = na.rm)
}

Tip

If using this method, the safest is to define the function, and then pass it to the aggregation list using its character name.

In summary, we can pass single functions and their arguments in ellipses, complex lists of multiple functions using tilde-style anonymous functions, which can have vector arguments (as long as the vector is attached to the data), and lists of multiple function names. In general, the most robust method is to pre-define functions and pass them as character names.

Spatial information

Spatial aggregation steps are specified with an sf object (polygons) or the name of the sf object, e.g. sdl_units or "sdl_units". Allowing specification of spatial steps by character instead of object is a bit more fragile because it relies on get("name_of_object"), but allows the list to be wholly specified with characters. This both saves memory (potentially massively), and makes yaml parameter files (both input and output) much cleaner.