Aggregating over multiple dimensions

library(HydroBOT)
library(dplyr)
library(ggplot2)

Overview

We have theme aggregation, temporal_aggregation, and spatial aggregation shown separately for in-depth looks at their meaning and capability. Here, we focus on the typical use-case of interleaved aggregation along multiple dimensions. We do not get into all the different options and syntax as they are covered in those other documents. In use, multi_aggregate() is typically not called directly, instead, [read_and_agg](read_and_agg.qmd) while providing data read-in to allow just passing paths, allows for parallelisation, and maintains data provenance and metadata. Because multi_aggregate() provides the core aggregation functionality, we discuss it here.

All multi-step aggregation in {HydroBOT} operates on the same core function and use similar principles- take a list of aggregation sequences, and aggregate each step according to a list of aggregation functions. Here, we show how multi_aggregate allows us to interleave the dimensions along which we aggregate, including auto-detecting which dimension we’re operating on.

Fundamentally, multi_aggregate wraps theme_aggregate(), temporal_aggregate(), and spatial_aggregate() with data organisation and tracking of what the previous level of aggregation was to maintain proper grouping as they alternate. This provides critical functionality to prevent accidental collapse along multiple dimensions simultaneously, and also allows grouping through part of the aggregation sequence (see here). These internal functions wrap general_aggregate with some data arrangement specific to the dimension they aggregate along, such as stripping and re-adding geometry.

The multi_aggregate() function expects the incoming data to be in memory and geographic, and prefers (but does not require) the edges defining theme relationships to already be calculated (though in practice they typically are calculated on the fly by read_and_agg().

Note

Any step in the aggregation sequence can have multiple aggregations (e.g. calculating min and max). These single-step multiples are factorial with the steps- if we aggregate with min and max at step 2 (generating two columns of aggregated data), then each of those columns is aggregated according to the step-3 aggregation, and so on. This can be quite useful, but can also yield complex outputs with many unneeded sequences. It is worth considering running separate aggregation sequences if there are many places with multiple aggregations, particularly if they occur early in the sequence.

Demonstration setup

First, we need to provide a set of paths to point to the input data, in this case the outputs from the EWR tool for the small demonstration, created by a controller notebook. Spatial units could be any arbitrary polygons, but we use those provided by {HydroBOT} for consistency, which also provides the spatial locations of the gauges in bom_basin_gauges.

project_dir <- "hydrobot_scenarios"
hydro_dir <- file.path(project_dir, "hydrographs")
ewr_results <- file.path(project_dir, "module_output", "EWR")

Scenario information

This will be attached to metadata, typically. For this demonstration, we just use it for plot clarity and the data is simple.

multipliers <- c(4)

scenemults <- c(1 / rev(multipliers), 1, multipliers)

scenenames <- c(
  paste0("down", as.character(rev(multipliers))),
  "base",
  paste0("up", as.character(multipliers))
) |>
  stringr::str_replace("\\.", "_")


scenarios <- tibble::tibble(scenario = scenenames, delta = scenemults)

scene_pal <- make_pal(unique(scenarios$scenario), palette = "ggsci::nrc_npg", refvals = "base", refcols = "black")

Data prep

To make the actual multi-aggregate loop general, dataprep needs to happen first (e.g. we don’t want to do EWR-specific dataprep on econ data). That said, we can use read_and_agg, which takes paths and the aggregation lists and runs the dataprep and aggregation functions. At present, the EWR tool is the only module, so we run it and do minor data prep that would usually be handled by read_and_agg(), e.g. calculating EWR achievement and making it geographic with gauge locations.

ewr_out <- prep_run_save_ewrs(
  hydro_dir = hydro_dir,
  output_parent_dir = project_dir,
  outputType = list("none"),
  returnType = list("yearly")
)


# This is just a simple prep step that is usually done internally to put the geographic coordinates on input data
ewrdata <- prep_ewr_output(ewr_out$yearly, type = "achievement", add_max = FALSE)

Initially, we use causal_ewrs for causal relationships and spatial layers provided by {HydroBOT} to define the aggregation units, and defined functions for the aggregations, though each of these can be manually specified (see Section 1.3, Section 1.4, Section 1.5).

Setup

First, we specify a simple interleaved aggregation sequence with only one aggregation function applied per step for simplicity. Note that theme-axis aggregation steps are specified with a character vector c('level_from', 'level_to'), while spatial aggregation steps are specified with an sf object (polygons) or the name of the sf object, e.g. sdl_units or "sdl_units". Allowing specification of spatial steps by character instead of object is a bit more fragile because it relies on get("name_of_object"), but allows the list to be wholly specified with characters.

Here, we specify the aggregation function sequence with character names for functions, though other specifications are possible as discussed in the syntax notebook.

Tip

Naming the aggsequence and funsequence lists by the level aggregated into makes tracking and interpretation much easier, and is highly recommended.

Spatial aggregation should almost always be area-weighted after the data is in polygons (see spatial notebook and aggregation syntax), though there are some aggregation functions where it doesn’t matter (e.g. max). All polygon data has an area column calculated automatically for this reason. The first aggregation into polygons typically is not area-weighted, because the thing being aggregated (typically at the gauge scale) usually doesn’t have area. After that, all data is in polygons and so has area.

We consider a set of aggregations covering all three dimensions. It begins temporal (all_time), then has two theme aggregations (ewr_code and env_obj), then spatial to sdl_units, two more theme-dimension (Specific_goal, Objective), a spatially-weighted aggregation to the basin, and finally to the theme level of 5-year management targets.

aggseq <- list(
  all_time = "all_time",
  ewr_code = c("ewr_code_timing", "ewr_code"),
  env_obj = c("ewr_code", "env_obj"),
  sdl_units = sdl_units,
  Specific_goal = c("env_obj", "Specific_goal"),
  Objective = c("Specific_goal", "Objective"),
  basin = basin,
  target_5_year_2024 = c("Objective", "target_5_year_2024")
)

funseq <- list(
  all_time = "ArithmeticMean",
  ewr_code = "CompensatingFactor",
  env_obj = "ArithmeticMean",
  sdl_units = "ArithmeticMean",
  Specific_goal = "ArithmeticMean",
  Objective = "ArithmeticMean",
  basin = "SpatialWeightedMean",
  target_5_year_2024 = "ArithmeticMean"
)

Aggregate

Now we do the aggregation. Note that we have been very aggressive in handling spatial processing and so while spatial processing is slow, we minimise it as much as possible internally.

Because we’re aggregating EWR outputs, multi_aggregate() will catch some common pitfalls. The best solution is to use pseudo_spatial and group_until, but here for simplicity we use auto_ewr_PU = TRUE , which does that automatically.

Tip

By default the data column has a very long name, which provides a record of its full provenance. We can turn this into columns, which is usually clearer. The simplest way to do this is to let multi_aggregate() do it internally with namehistory = FALSE. For more detail, see Section 1.2.

Return only final

The default option is to return only the final outcome, which is far cheaper for memory, but doesn’t say how we got that answer.

Tip

HydroBOT can aggregate more than one column at a time if they use the same aggregation sequence. We demonstrate that here, with aggCols = c("frequency_achieved", "ewr_achieved") . We could also include any other outcome column if desired.

tsagg <- multi_aggregate(
  dat = ewrdata,
  causal_edges = causal_ewr,
  groupers = "scenario",
  aggCols = c("ewr_achieved", "frequency_achieved"),
  aggsequence = aggseq,
  funsequence = funseq,
  auto_ewr_PU = TRUE,
  namehistory = FALSE
)

That saves only the final outcome, which is far cheaper for memory, but doesn’t let us step through the intermediate steps.

tsagg

We can do the exact same thing with characters instead of objects for the spatial units:

aggseq_c <- list(
  all_time = "all_time",
  ewr_code = c("ewr_code_timing", "ewr_code"),
  env_obj = c("ewr_code", "env_obj"),
  sdl_units = "sdl_units",
  Specific_goal = c("env_obj", "Specific_goal"),
  Objective = c("Specific_goal", "Objective"),
  basin = "basin",
  target_5_year_2024 = c("Objective", "target_5_year_2024")
)

tsagg_c <- multi_aggregate(
  dat = ewrdata,
  causal_edges = causal_ewr,
  groupers = "scenario",
  aggCols = "ewr_achieved",
  aggsequence = aggseq_c,
  funsequence = funseq,
  auto_ewr_PU = TRUE,
  namehistory = FALSE
)

ℹ EWR outputs auto-grouped
• Done automatically because `auto_ewr_PU = TRUE`
• EWRs should be grouped by `SWSDLName`, `planning_unit_name`, and `gauge` until aggregated to larger spatial areas.
• Rows will collapse otherwise, silently aggregating over the wrong dimension
• Best to explicitly use `group_until` in `multi_aggregate()` or `read_and_agg()`.

ℹ EWR outputs auto-grouped
• Done automatically because `auto_ewr_PU = TRUE`
• EWRs should be grouped by `SWSDLName`, `planning_unit_name`, and `gauge` until aggregated to larger spatial areas.
• Rows will collapse otherwise, silently aggregating over the wrong dimension
• Best to explicitly use `group_until` in `multi_aggregate()` or `read_and_agg()`
.
ℹ EWR outputs auto-grouped
• Done automatically because `auto_ewr_PU = TRUE`
• EWRs should be grouped by `SWSDLName`, `planning_unit_name`, and `gauge` until aggregated to larger spatial areas.
• Rows will collapse otherwise, silently aggregating over the wrong dimension
• Best to explicitly use `group_until` in `multi_aggregate()` or `read_and_agg()`
.
ℹ EWR gauges joined to larger units pseudo-spatially.
• Done automatically because `auto_ewr_PU = TRUE`
• Non-spatial join needed because gauges may inform areas they are not within
• Best to explicitly use `pseudo_spatial = 'sdl_units'` in `multi_aggregate()` or `read_and_agg()`.

! Unmatched links in causal network
• 11 from env_obj to Specific_goal
! Unmatched links in causal network
• 7 from Objective to target_5_year_2024

tsagg_c

Caution

The ability to pass characters is only available in multi_aggregate() and read_and_agg(), not the internal spatial_aggregate() function.

All steps

Most often, we’ll want to save the list of outcomes at each step- it allows us to see how we got the final outcome, and it’s likely we’re interested in the outcomes at more than one step anyway. As in the theme notebook, we do this with saveintermediate = TRUE. We also set namehistory = FALSE to put the aggregation tracking in columns instead of names for ease of handling the output.

allagg <- multi_aggregate(
  dat = ewrdata,
  causal_edges = causal_ewr,
  groupers = "scenario",
  aggCols = "ewr_achieved",
  aggsequence = aggseq,
  funsequence = funseq,
  auto_ewr_PU = TRUE,
  namehistory = FALSE,
  saveintermediate = TRUE
)

Now, we’ll inspect each step, both as dataframes and maps. There are many other ways of plotting the outcome data available in the comparer. The goal here is simply to visualise what happens at each step along the way, so we make some quick maps.

Sheet 1- raw data from ewr

This is just the input data, so we don’t bother plotting it.

allagg$agg_input

Sheet 2- aggregated over the time period

The first aggregated level is in sheet 2, where the yearly data has been aggregated to the full time period. It is still quite complex, so we again don’t plot it.

allagg$all_time

Sheet 3- ewr_code

Sheet 3 has the ewr_code_timings aggregated to ewr_code.

allagg$ewr_code

There are many EWR codes, so just pick three haphazardly (LF1, BF1, and CF) and plot to see that this data is at the gauge scale.

allagg$ewr_code |>
  dplyr::filter(ewr_code %in% c("LF1", "BF1", "CF")) |>
  dplyr::left_join(scenarios) |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "ewr_achieved",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "scenario",
    facet_row = "ewr_code",
    sceneorder = c("down4", "base", "up4"),
    underlay_list = list(
      underlay = sdl_units,
      underlay_pal = "cornsilk"
    )
  )

Sheet 4- env_obj

Sheet 4 has now been aggregated to the env_obj on the theme scale, still gauges spatially.

allagg$env_obj

Again choosing three of the first codes, we see this is still gauged.

allagg$env_obj |>
  dplyr::filter(env_obj %in% c("EF1", "WB1", "NF1")) |>
  # Some gauges have > 1 value because they inform multiple planning units. For this, just look at the first one
  dplyr::group_by(scenario, polyID, gauge, SWSDLName, env_obj) |>
  dplyr::summarise(ewr_achieved = first(ewr_achieved)) |>
  dplyr::ungroup() |>
  dplyr::left_join(scenarios) |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "ewr_achieved",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "scenario",
    facet_row = "env_obj",
    sceneorder = c("down4", "base", "up4"),
    underlay_list = list(
      underlay = sdl_units,
      underlay_pal = "cornsilk"
    )
  )

Sheet 5- sdl_units

The fifth step is a spatial aggregation of env_obj theme-level data into sdl_units. This stays at the env_obj theme scale but aggregates the gauges into sdl units.

allagg$sdl_units

Now we have aggregated the data above into sdl polygons.

allagg$sdl_units |>
  dplyr::filter(env_obj %in% c("EF1", "WB1", "NF1")) |>
  dplyr::left_join(scenarios) |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "ewr_achieved",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "scenario",
    facet_row = "env_obj",
    sceneorder = c("down4", "base", "up4"),
    underlay_list = list(
      underlay = sdl_units,
      underlay_pal = "grey90"
    )
  )

Sheet 6- Specific goal

Sheet 6 is back to the theme axis, aggregating env_obj to Specific goal, remaining in SDL units.

allagg$Specific_goal

Using fct_reorder. this is where info about the scenarios would come in handy as reorder cols.

allagg$Specific_goal |>
  dplyr::filter(Specific_goal %in% c(
    "All recorded fish species",
    "Spoonbills",
    "Decompsition"
  )) |>
  dplyr::left_join(scenarios) |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "ewr_achieved",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "scenario",
    facet_row = "Specific_goal",
    sceneorder = c("down4", "base", "up4"),
    underlay_list = list(
      underlay = sdl_units,
      underlay_pal = "grey90"
    )
  )

Sheet 7- Objective

We are now back to aggregation along the theme axis (from Specific goal to Objective), remaining in cewo_valleys.

allagg$Objective

We see that these are still in the catchments, but now the values are different Objectives.

allagg$Objective |>
  dplyr::filter(Objective %in% c(
    "No loss of native fish species",
    "Increase total waterbird abundance across all functional groups",
    "Support instream & floodplain productivity"
  )) |>
  dplyr::left_join(scenarios) |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "ewr_achieved",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "scenario",
    facet_row = "Objective",
    sceneorder = c("down4", "base", "up4"),
    underlay_list = list(
      underlay = sdl_units,
      underlay_pal = "grey90"
    )
  )

Sheet 8- Basin

This step is a spatial aggregation to the basin scale, with theme remaining at the Objective level. The scaling to the basin is area-weighted, so larger catchments count more toward the basin-scale outcome. Recognise that for this situation with data in only a subset of the basin, aggregation to the whole basin is fraught and is likely biased by missing data.

allagg$basin

We drop the underlay on the plots since we’re now plotting the whole basin

allagg$basin |>
  dplyr::filter(Objective %in% c(
    "No loss of native fish species",
    "Increase total waterbird abundance across all functional groups",
    "Support instream & floodplain productivity"
  )) |>
  dplyr::left_join(scenarios) |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "ewr_achieved",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "scenario",
    facet_row = "Objective",
    sceneorder = c("down4", "base", "up4")
  )

Sheet 9- 5-year targets

Finally, we aggregate along the theme axis to 5-year targets, remaining at the basin-scale spatialy

allagg$target_5_year_2024

And we’re still at the basin, just plotting different outcomes.

allagg$target_5_year_2024 |>
  dplyr::filter(target_5_year_2024 %in% c(
    "All known species detected annually",
    "Establish baseline data on the number and distribution of wetlands with breeding activity of flow-dependant frog species",
    "Rates of fall does not exceed the 5th percentile of modelled natural rates during regulated water deliveries"
  )) |>
  dplyr::left_join(scenarios) |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "ewr_achieved",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "scenario",
    facet_row = "target_5_year_2024",
    sceneorder = c("down4", "base", "up4")
  )

More detail

Multiple aggregation functions

For EWR outputs, the aggsequence list will typically need to start with a time aggregation, followed by ewr_code_timing and aggregate from there into ewr_code and env_obj as everything else flows from that.

The funsequence is a list (instead of a simple vector) because multiple functions can be used at each step. When multiple functions are passed, they are factorial (each function is calculated on the results of all previous aggregations). This keeps the history clean, and allows us to easily unpick the meaning of each value in the output.

As an example, we set a range of theme levels that hit all the theme relationship dataframes from the causal networks defined in HydroBOT::causal_ewr, and set the aggregation functions fairly simply, but with two multi-aggregation steps to illustrate how that works. For more complexity in these aggregation functions, see the spatial notebook and aggregation syntax.

We use CompensatingFactor as the aggregation function for the ewr_code_timing to ewr_code step here, assuming that passing either timing sub-code means the main code passes. A similar approach could be done if we want to lump the ewr_codes themselves, e.g. put EF4a,b,c,d into EF4. To demonstrate multiple aggregations, we use both ArithmeticMean and LimitingFactor for the 2nd and 3rd levels, showing also how the outputs from those steps get carried through subsequent steps.

aggseq_multi <- list(
  all_time = "all_time",
  ewr_code = c("ewr_code_timing", "ewr_code"),
  env_obj = c("ewr_code", "env_obj"),
  Specific_goal = c("env_obj", "Specific_goal"),
  Objective = c("Specific_goal", "Objective"),
  target_5_year_2024 = c("Objective", "target_5_year_2024")
)

funseq_multi <- list(
  all_time = "ArithmeticMean",
  ewr_code = c("CompensatingFactor"),
  env_obj = c("ArithmeticMean", "LimitingFactor"),
  Specific_goal = c("ArithmeticMean", "LimitingFactor"),
  Objective = c("ArithmeticMean"),
  target_5_year_2024 = c("ArithmeticMean")
)

multi_functions <- multi_aggregate(
  dat = ewrdata,
  causal_edges = causal_ewr,
  groupers = "scenario",
  aggCols = "ewr_achieved",
  aggsequence = aggseq_multi,
  funsequence = funseq_multi,
  auto_ewr_PU = TRUE
)

multi_functions

That output has 4 columns of output values because aggregation steps are factorial in the number of aggregation functions applied. The second step found the ArithmeticMean and LimitingFactor for ewr_achieved into env_obj and then the third step found the ArithmeticMean and LimitingFactor for each of those outcomes into Specific_goal. Each subsequent step only found the ArithmeticMean for each, and so the number of output columns stopped growing.

Tracking aggregation steps

Tracking aggregation steps is critical for knowing the meaning of the numbers produced. We can do that in two different ways- in column headers (names) or in columns themselves.

Tracking history in column names is unweildy, but describes exactly what the numbers are and is smaller in memory. For example, the last data column is

names(multi_functions)[ncol(multi_functions) - 1]

[1] "target_5_year_2024_ArithmeticMean_Objective_ArithmeticMean_Specific_goal_LimitingFactor_env_obj_LimitingFactor_ewr_code_CompensatingFactor_all_time_ArithmeticMean_ewr_achieved"

This says the values in this column are the 5-year targets, calculated as the arithmetic mean of Objectives, which were the arithmetic mean of Specific goal, which were calculated from env_obj as limiting factors, which were obtained from the ewr_code as limiting factors, with those were calculated from the ewr_code_timing as compensating factors, which were calculated from the time-average of the original data.

It may be easier to think about the meaning of the names from the other direction- ewr_achieved were aggregated from the yearly to the full timeseries as an average, and then ewr_code_timing into ewr_code as Compensating Factors, then into env_obj as limiting factors- for the env_obj to pass, all ewr_codes contributing to it must pass. Then the env_objs were aggregated into Specific_goal, again as limiting factors, so to meet a goal, all contributing env_obj must pass. Those Specific_goals were then aggregated into Objectives with the arithmetic mean, so the value for an Objective is then the average of the contributing Specific_goals. Similarly, the 5-year targets were obtained by averaging the Objectives contributing to them.

A different way to track the aggregations is possible by including them in columns instead of the names. This takes more memory, but can be clearer and makes subsequent uses easier in many cases. In the example above, we can feed the output data to agg_names_to_cols to put the history in columns instead of names.

agg_names_to_cols(multi_functions,
  aggsequence = aggseq_multi,
  funsequence = funseq_multi,
  aggCols = "ewr_achieved"
)

In practice, what makes most sense is to use a switch (namehistory = FALSE) inside multi_aggregate to return this format.

multi_functions_history <- multi_aggregate(
  dat = ewrdata,
  causal_edges = causal_ewr,
  groupers = "scenario",
  aggCols = "ewr_achieved",
  aggsequence = aggseq_multi,
  funsequence = funseq_multi,
  auto_ewr_PU = TRUE,
  namehistory = FALSE
)

multi_functions_history

User-passed causal networks

In most of the demonstrations, we have used the HydroBOT::causal_ewr causal networks. However, the causal network is an argument to multi_aggregate() and read_and_agg(), and so it is possible for the user to pass arbitrary networks. One use that is likely to be useful is to extract the (sometimes newer, but less tested) causal networks from the EWR tool with get_causal_ewr().

new_ewr_causal <- multi_aggregate(
  dat = ewrdata,
  causal_edges = get_causal_ewr(),
  groupers = "scenario",
  aggCols = "ewr_achieved",
  aggsequence = aggseq,
  funsequence = funseq,
  auto_ewr_PU = TRUE,
  namehistory = FALSE
)

ℹ EWR outputs auto-grouped
• Done automatically because `auto_ewr_PU = TRUE`
• EWRs should be grouped by `SWSDLName`, `planning_unit_name`, and `gauge` until aggregated to larger spatial areas.
• Rows will collapse otherwise, silently aggregating over the wrong dimension
• Best to explicitly use `group_until` in `multi_aggregate()` or `read_and_agg()`.

ℹ EWR outputs auto-grouped
• Done automatically because `auto_ewr_PU = TRUE`
• EWRs should be grouped by `SWSDLName`, `planning_unit_name`, and `gauge` until aggregated to larger spatial areas.
• Rows will collapse otherwise, silently aggregating over the wrong dimension
• Best to explicitly use `group_until` in `multi_aggregate()` or `read_and_agg()`
.
ℹ EWR outputs auto-grouped
• Done automatically because `auto_ewr_PU = TRUE`
• EWRs should be grouped by `SWSDLName`, `planning_unit_name`, and `gauge` until aggregated to larger spatial areas.
• Rows will collapse otherwise, silently aggregating over the wrong dimension
• Best to explicitly use `group_until` in `multi_aggregate()` or `read_and_agg()`
.
ℹ EWR gauges joined to larger units pseudo-spatially.
• Done automatically because `auto_ewr_PU = TRUE`
• Non-spatial join needed because gauges may inform areas they are not within
• Best to explicitly use `pseudo_spatial = 'sdl_units'` in `multi_aggregate()` or `read_and_agg()`.

! Unmatched links in causal network
• 11 from env_obj to Specific_goal
! Unmatched links in causal network
• 7 from Objective to target_5_year_2024

new_ewr_causal

It is also possible to use any arbitrary network with the needed links (columns). Here, we make up a very simple one. See causal_ewr for needed structure; the main key is it needs to be a list of dataframe(s).

fakegroups <- c("a", "b", "c")
fake_causal <- tibble::tibble(
  ewr_code_timing = unique(ewrdata$ewr_code_timing),
  fake_group = sample(fakegroups,
    length(unique(ewrdata$ewr_code_timing)),
    replace = TRUE
  )
)

aggseq_fakecausal <- list(
  all_time = "all_time",
  fake_group = c("ewr_code_timing", "fake_group"),
  sdl_units = sdl_units
)

funseq_fakecausal <- list(
  all_time = "ArithmeticMean",
  fake_group = "CompensatingFactor",
  sdl_units = "ArithmeticMean"
)

fake_causal_agg <- multi_aggregate(
  dat = ewrdata,
  causal_edges = list(fake_causal),
  groupers = "scenario",
  aggCols = "ewr_achieved",
  aggsequence = aggseq_fakecausal,
  funsequence = funseq_fakecausal,
  auto_ewr_PU = TRUE,
  namehistory = FALSE
)

ℹ EWR outputs auto-grouped
• Done automatically because `auto_ewr_PU = TRUE`
• EWRs should be grouped by `SWSDLName`, `planning_unit_name`, and `gauge` until aggregated to larger spatial areas.
• Rows will collapse otherwise, silently aggregating over the wrong dimension
• Best to explicitly use `group_until` in `multi_aggregate()` or `read_and_agg()`.

Warning in filtergroups(thisdf, fromcol = p[1], tocol = p[2], fromfilter =
fromfilter, : Unable to cross-check gauges and planning units, trusting the
user they work together

fake_causal_agg

And a quick plot of the random groupings implied there.

fake_causal_agg |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "ewr_achieved",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "scenario",
    facet_row = "fake_group",
    sceneorder = c("down4", "base", "up4")
  )

User-passed spatial data

Like the causal networks, users can pass arbitrary spatial units to multi_aggregate() and read_and_agg(), and are not limited to those provided by HydroBOT. The only requirement is that they are sf objects with a geometry column. To demonstrate, we’ll download Australian states and aggregate into those (and the fake causal network above).

austates <- rnaturalearth::ne_states(country = "australia") |>
  dplyr::select(state = name, geometry)

aggseq_states <- list(
  all_time = "all_time",
  fake_group = c("ewr_code_timing", "fake_group"),
  state = austates
)

state_agg <- multi_aggregate(
  dat = ewrdata,
  causal_edges = list(fake_causal),
  groupers = "scenario",
  aggCols = "ewr_achieved",
  aggsequence = aggseq_states,
  funsequence = funseq_fakecausal,
  namehistory = FALSE
)

Warning: ! EWR outputs detected without `group_until`!
ℹ EWR outputs should be grouped by `SWSDLName`, `planning_unit_name`, and `gauge` until aggregated to larger spatial areas.
ℹ Best to explicitly use `group_until` in `multi_aggregate()` or `read_and_agg()`.
ℹ Lower-level processing should include as `grouper` in `temporal_aggregate()`

Warning in filtergroups(thisdf, fromcol = p[1], tocol = p[2], fromfilter =
fromfilter, : Unable to cross-check gauges and planning units, trusting the
user they work together

state_agg

And a quick plot of that to see the different spatial units (though this example is all in New South Wales).

state_agg |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "ewr_achieved",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "scenario",
    facet_row = "fake_group",
    sceneorder = c("down4", "base", "up4")
  )

User-passed functions

See the aggregation syntax for a full treatment of the way functions can be specified. The most stable and best practice for specifying aggregation functions is to specify them as a named function, and supply that to the funsequence. Copying what’s at that page,

We demonstrate here with a threshold function and a median. For example, we might want to know the mean of all values greater than 0 for some stages and use the median at others.

mean_given_occurred <- function(x) {
  mean(ifelse(x > 0, x, NA), na.rm = TRUE)
}

medna <- function(x) {
  median(x, na.rm = TRUE)
}

aggseq_funs <- list(
  all_time = "all_time",
  ewr_code = c("ewr_code_timing", "ewr_code"),
  env_obj = c("ewr_code", "env_obj"),
  sdl_units = sdl_units,
  Target = c("env_obj", "Target")
)


funseq_funs <- list(
  all_time = "ArithmeticMean",
  ewr_code = "ArithmeticMean",
  env_obj = "mean_given_occurred",
  sdl_units = "medna",
  Target = "mean_given_occurred"
)

Those changes are then reflected in the aggregation history and determine the aggregated values.

agged_custom_funs <- multi_aggregate(
  dat = ewrdata,
  causal_edges = causal_ewr,
  groupers = c("scenario"),
  aggCols = "ewr_achieved",
  aggsequence = aggseq_funs,
  funsequence = funseq_funs,
  namehistory = FALSE
)

We can plot them to double check it worked.

agged_custom_funs |>
  dplyr::filter(scenario != "MAX") |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    y_lab = "Arithmetic Mean",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "ewr_achieved",
    pal_list = list("scico::lapaz"),
    facet_col = "Target",
    facet_row = "scenario",
    sceneorder = c("down4", "base", "up4")
  )

Dimensional information

The multi_aggregate function handles dimensional information about space, time, and theme. It does this dependent on knowing theme levels from a causal network, having a time column, and having geographic information. Thus, it is general, and will run fine without this information, e..g a dataframe without a geometry column will not trigger spatial checks, and one without a column in a time format will not trigger temporal checks. That raises some issues, however, in that if a dataframe should have those special columns, the dimensional safety will be lost. Thus, multi_aggregate doesn’t care what theme, temporal, or spatial scale its input data is- e.g. we could give it Objectives already at the Catchment scale, and then use it to move up.

Syntax for arguments

The groupers and aggCols arguments can take a number of different formats- character vectors, bare column names and sometimes tidyselect, though this is more limited for multi_aggregate than the dimension-specific aggregations. Moreover, functions can be passed as characters, bare names, and anonymous functions in lists. For more detail, see the syntax documentation.

Using `multi_aggregate` for one aggregation

If we’re only using one level of aggregation and nothing else, there may not be need for the multi_aggregate() wrapper, though in a formal analysis the safety provided by multi_aggregate() and even read_and_agg() are likely worth it. These wrappers do work even for single steps though. We do have a bit less flexibility with how we specify arguments, see syntax documentation.

Typically we could use namehistory = FALSE to avoid the horrible long name with all the transforms in it, but there’s no way for it to know the previous aggregation history when it’s been done in pieces (as we do in each of the theme, spatial, and temporal examples. As we do in those examples, it is possible to use agg_names_to_cols() with the full sequence including the earlier steps to extract them.

Value to aggregate

We have demonstrated everything here by aggregating EWR data with the ewr_achieved column. However, any numeric column can be aggregated, as we show in Section 1.10 for made up module outputs. Here, we also show aggregation on a different EWR metric, the achievement of interevent requirements:

allagg_interevent <- multi_aggregate(
  dat = ewrdata,
  causal_edges = causal_ewr,
  groupers = "scenario",
  aggCols = "interevent_achieved",
  aggsequence = aggseq,
  funsequence = funseq,
  auto_ewr_PU = TRUE,
  namehistory = FALSE,
  saveintermediate = TRUE
)

We’ll show that worked for just one sheet (the sdl units).

allagg_interevent$sdl_units |>
  dplyr::filter(env_obj %in% c("EF1", "WB1", "NF1")) |>
  dplyr::left_join(scenarios) |>
  plot_outcomes(
    outcome_col = "interevent_achieved",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "interevent_achieved",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "scenario",
    facet_row = "env_obj",
    sceneorder = c("down4", "base", "up4"),
    underlay_list = list(
      underlay = sdl_units,
      underlay_pal = "grey90"
    )
  )

Joining with `by = join_by(scenario)`

Using un-integrated response models

In some cases, response models may not yet be integrated into HydroBOT, or may not be able to be integrated (e.g. if they are proprietary or unscriptable). In these cases, we can still use multi_aggregate() for processing with the Aggregator and then Comparer. All we need is a dataframe. For aggregation along the theme, space, and time dimensions, it needs to have information about those dimensions (i.e. a time column, a geometry column, and a column matching something in its relevant causal network). It will typically have a ‘scenario’ column as well.

Setting up non-module data and relationships

To demonstrate, we will create some dummy data in the Australian states.

We need some values in different years.

# add a date column
state_inputs <- austates |>
  dplyr::mutate(date = lubridate::ymd("20000101"))

# add some values
withr::with_seed(
  17,
  state_inputs <- state_inputs |>
    dplyr::mutate(value = runif(nrow(state_inputs)))
)

withr::with_seed(
  17,
  # add some more days, each with different values
  state_inputs <- purrr::map(
    0:10,
    \(x) dplyr::mutate(state_inputs,
      date = date + x,
      value = value * rnorm(nrow(state_inputs),
        mean = x, sd = x / 2
      )
    )
  ) |>
    dplyr::bind_rows()
)

We need some scenarios too.

# add a scenario column, each with different values
state_inputs <- purrr::imap(
  letters[1:4],
  \(x, y) dplyr::mutate(state_inputs,
    scenario = x,
    value = value + y
  )
) |>
  dplyr::bind_rows()

We need some dummy ‘theme’ groupings and a network.

# add a theme-relevant column, each with different values
state_inputs <- purrr::imap(
  c("E", "F", "G", "H", "I", "J"),
  \(x, y) dplyr::mutate(state_inputs,
    theme1 = x,
    value = value + y
  )
) |>
  dplyr::bind_rows()

# make a simple 'causal' network
state_theme <- tibble::tibble(
  theme1 = c("E", "F", "G", "H", "I", "J"),
  theme2 = c(
    "vowel", "consonant", "consonant",
    "consonant", "vowel", "consonant"
  )
) |>
  list()

And finally, we will provide a larger spatial unit to aggregate into, all of Australia (though this loses the islands).

all_aus <- rnaturalearth::ne_countries(country = "australia") |>
  dplyr::select(geounit)

Using the Aggregator

First, we set up some aggregation steps. Will just use means throughout.

# This will aggregate into weeks, then to type, and then to the country.
ausseq <- list(
  week = "week",
  theme2 = c("theme1", "theme2"),
  all_aus = all_aus
)

# just use mean, since there are no NA in the data.
ausfuns <- list(
  week = "mean",
  type = "mean",
  all_aus = "mean"
)

Do the aggregation

# Do the aggregation
ausagg <- multi_aggregate(
  dat = state_inputs,
  causal_edges = state_theme,
  groupers = "scenario",
  aggCols = "value",
  aggsequence = ausseq,
  funsequence = ausfuns,
  namehistory = FALSE,
  saveintermediate = TRUE
)

Warning in filtergroups(thisdf, fromcol = p[1], tocol = p[2], fromfilter =
fromfilter, : Unable to cross-check gauges and planning units, trusting the
user they work together

Quick plots of a couple levels

ausagg$week |>
  filter(theme1 == "E") |>
  plot_outcomes(
    outcome_col = "value",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "value",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "scenario",
    facet_row = "date"
  )

ausagg$theme2 |>
  filter(date == lubridate::ymd("2000-01-03")) |>
  plot_outcomes(
    outcome_col = "value",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "value",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "scenario",
    facet_row = "theme2"
  )

ausagg$all_aus |>
  filter(date == lubridate::ymd("2000-01-03")) |>
  plot_outcomes(
    outcome_col = "value",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "value",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "scenario",
    facet_row = "theme2"
  )