Theme aggregation

library(HydroBOT)
library(dplyr)

Overview

Aggregating outcomes along the theme dimension is a key requirement of HydroBOT. For example, we might want to combine EWR pass/fails into the proportion of EWRs contributing to a proximate objective (‘environmental objective’ env_obj) that passed, and then translate that into outcomes for 5-year targets or Waterbirds, etc.

The input data is thus the data coming out of the response modules (e.g. EWR tool), which is then aggregated. This data can be the immediate product of the response models or any subsequent aggregations (e.g. following spatial or temporal aggregations), provided it includes a column defining the theme levels each row applies to. The demonstrations here are all about the EWR outputs, but the aggregator is agnostic to the input data, provided we have the causal relationships to define the aggregation groupings- we specify the columns to aggregate and any additional grouping variables, which can be anything.

Important

The relationships that define the aggregations are the same as those defining the causal networks- these map finer-scale groups to coarser. Thus, we need to access those relationships (and the make_edges function that builds the links).

The causal relationships for the EWR tool are provided in the {HydroBOT} package as causal_ewr, a list of all the mappings. More recently, the EWR tool itself provides its own causal networks. The version in HydroBOT is updated from these frequently and is tested, but to ensure the absolute most current causal networks, a user can use the get_causal_ewr() function to extract them from the EWR tool.

The {HydroBOT} package also provides all necessary aggregation functions and handling functions for the causal relationships, though like the polygons in the spatial aggregation it is possible to use relationships other than those provided by {HydroBOT} (see more detail). In particular, while the EWR tool provides these relationships externally to the response model itself, they may be embedded in other modules, particularly if the responses at different theme levels are modelled mechanistically.

We begin this demonstration with theme_aggregate() itself. Because of the centrality of the causal network, and so multiple dependent levels of theme aggregation, we then move on to examples of multi-step theme aggregation using multi_aggregate().

In practice, we expect to interleave spatial, temporal, and thematic aggregation steps- perhaps it makes sense to aggregate along the theme axis to the env_obj scale at a gauge, then scale to the SDL unit, then aggregate to the Objective, scale, and then scale to the basin and long-term targets. We demonstrate such interleaved aggregation elsewhere, and here focus on demonstrating and understanding the meaning of aggregation along the theme axis and how to do it. Similar notebooks for spatial aggregation and temporal aggregation go into the detail along the spatial and temporal dimension.

Demonstration

The theme relationships in the causal network causal_ewr provide the links and structure of the theme scaling, while the quantities to be scaled come out of the modules. For this demonstration, we provide a set of paths to point to the input data, in this case the outputs from the EWR tool, created by a controller notebook.

project_dir <- "hydrobot_scenarios"
hydro_dir <- file.path(project_dir, "hydrographs")
ewr_results <- file.path(project_dir, "module_output", "EWR")

Data

Input data to should be a dataframe (e.g. a dataframe of EWR outputs, sf object if they are spatial outcomes). If we want to pass a path instead of a dataframe (as we might for large runs), we would use read_and_agg, which wraps multi_aggregate, demonstrated in its own notebook. Thus, for the demonstration, we pull in the the EWR output produced from the HydroBOT-provided hydrographs (system.file('extdata/testsmall/hydrographs', package = 'HydroBOT'), which we have processed already here and are at the paths above.

We’ll pull in the data to use for demonstration so we can use theme_aggregate() and multi_aggregate() directly. If we want to feed a path instead of a dataframe, we would use read_and_agg().

The data comes in as a timeseries, so we do one initial level of temporal aggregation (the mean over the series) to make visualisation easier.

ewr_out <- prep_run_save_ewrs(
  hydro_dir = hydro_dir,
  output_parent_dir = project_dir,
  outputType = list("none"),
  returnType = list("yearly")
)


# This is just a simple prep step that is usually done internally to put the geographic coordinates on input data
ewrdata <- prep_ewr_output(ewr_out$yearly, type = "achievement", add_max = FALSE)

# This gets us to env_obj at the gauge over all time
preseq <- list(
  all_time = "all_time"
)

funseq <- list(
  "ArithmeticMean"
)

# Do the aggregation to get output at each gauge averaged over time
simpleAgg <- multi_aggregate(
  dat = ewrdata,
  causal_edges = causal_ewr,
  groupers = c("scenario", "gauge", "ewr_code_timing"),
  aggCols = "ewr_achieved",
  aggsequence = preseq,
  funsequence = funseq
)

Warning: ! EWR outputs detected without `group_until`!
ℹ EWR outputs should be grouped by `SWSDLName`, `planning_unit_name`, and `gauge` until aggregated to larger spatial areas.
ℹ Best to explicitly use `group_until` in `multi_aggregate()` or `read_and_agg()`.
ℹ Lower-level processing should include as `grouper` in `temporal_aggregate()`

simpleAgg

This provides a spatially-referenced (to gauge) temporally-aggregated tibble to use to demonstrate theme aggregation. Note that this has the initial theme level of the EWR outputs (ewr_code_timing), but also two groupings that we want to preserve when we aggregate along the theme dimension- scenario and the current level of spatial grouping, the gauge locations. We have dropped the time by taking the temporal average in step one, but that would be preserved as well if present. In typical use with the EWR tool, the planning_unit_name and SWSDLName columns should be preserved for pseudo-spatial aggregation.

We’ll choose an example gauge to make it easier to visualise the data.

# Dubbo is '421001', has 24 EWRs
# Warren Weir is '421004', has 30 EWRs.
example_gauge <- "421001"

Causal network

Theme dimension aggregation must use links defined in causal_ewrs (or other causal mappings) from the ‘from’ and ‘to’ levels at each step (the from_theme and to_theme arguments, which are characters for the column names in the causal network). In other words, we can’t scale between levels with no defined relationship. However, this does not mean every level must be included, and indeed they usually are not. If a relationship exists in the causal network dataframe(s), levels can be jumped, e.g. we could go straight from env_obj to target_20_year_2039 using the links defined in causal_ewr$obj2yrtarget without connecting to Target or Objective as intermediate steps. There may not be a defined ordering of some levels, and so it is perfectly reasonable to go from env_obj to both Objective and Target, depending on the question. Here, we use the causal_ewr list of relationships provided with HydroBOT, but other lists could be supplied.

Examples

We’ll now use that input data to demonstrate how to do theme aggregation singly or as a multi-step process to build outcomes on the causal network.

Single aggregation

We might just want to perform theme aggregation once. This is rare in practice, but is what multi_aggregate() uses internally for the theme steps, so is worth understanding. We can do this simply by passing the input data (in this case simpleAgg), a causal network, and providing a funlist. In this simple case, we just use a single character function name, here the custom 'ArithmeticMean', which is just a simple wrapper of mean with na.rm = TRUE. Any function can be passed this way, custom or in-built, provided it has a single argument. More complex situations are given below, and different syntax is possible.

Note

The funlist argument here specifies the function(s) to use at a single step. It is thus not the same as the funsequence list of multi_aggregate(); instead being a single item in that list, though it may include multiple functions (e.g. the mean and max).

Note

The aggCols argument is tidyselect::ends_with(original_name) to reference the original name of the column of values- it may have a long name tracking its aggregation history, so we give it the tidyselect::ends_with() to find the column. More generally, both aggCols and groupers can take any tidyselect syntax or bare names or characters, see here.

We demonstrate here by aggregating from the output of the EWR tool (at the ‘ewr_code_timing’ level) to the ‘env_obj’ level, which defines parts of lifecycles or components of larger Targets.

The theme_aggregate() function handles a single aggregation step along the theme dimension. For each step in the aggregation, we need to specify what levels we are aggregating from and to, the function to use to aggregate, and the mapping between the ‘from’ and ‘to’ levels.

The funlist argument works as in the spatial examples, and can be characters, bare names, or anonymous functions, with some limits. Perhaps most importantly, it can be a list of more than one function if we want to, for example, calculate the mean and maximum. Note that in theme_aggregate() itself, this list is for a single step, and would be one item in the funsequence argument to multi_aggregate().

timing_obj <- theme_aggregate(
  dat = simpleAgg,
  from_theme = "ewr_code_timing",
  to_theme = "env_obj",
  # 'gauge' is not strictly necessary because this is an sf, but including it retains that column
  groupers = c("scenario", "gauge"),
  aggCols = tidyselect::ends_with("ewr_achieved"),
  funlist = "ArithmeticMean",
  causal_edges = causal_ewr
)

Warning: ! EWR outputs detected without `group_until`!
ℹ EWR outputs should be grouped by `SWSDLName`, `planning_unit_name` and `gauge` until aggregated to larger spatial areas.
ℹ Best to explicitly use `group_until` in `multi_aggregate()` or `read_and_agg()`.
ℹ Lower-level processing should include as `grouper` in `theme_aggregate()`

timing_obj

Because we are using theme_aggregate directly, the dimensional safety provided by multi_aggregate() is not present. However, because this is an sf object, the geometry column retains the spatial information. This should not be relied on, and would not happen if theme-aggregating a nonspatial dataset. The multi_aggregate function automatically handles this preservation, but theme_aggregate is more general, and does not make any assumptions about the grouping structure of the data. Thus, to keep spatial or temporal groupings (as we should, otherwise we can inadvertently aggregate over all of them simultaneously), we should add polyID or geometry, along with any time columns if present, to the groupers argument.

The resulting column name is cumbersome, but provide a record of exactly what the aggregation sequence was.

names(timing_obj)

[1] "scenario"                                                   
[2] "gauge"                                                      
[3] "polyID"                                                     
[4] "env_obj"                                                    
[5] "env_obj_ArithmeticMean_all_time_ArithmeticMean_ewr_achieved"
[6] "geometry"

We can clean those up into columns with agg_names_to_cols() (which happens internally in multi_aggregate() and read_and_agg() with namehistory = FALSE).

to_rename <- agg_names_to_cols(timing_obj,
  aggsequence = c(names(preseq), "env_obj"),
  funsequence = c(funseq, "ArithmeticMean"),
  aggCols = "ewr_achieved"
)

to_rename

A quick plot shows the outcome. For more plotting details, see the plotting section. We’ll simplify the names and choose a subset of the environmental objectives.

env_pals <- list(
  EB = "grDevices::Grays",
  EF = "grDevices::Purp",
  NF = "grDevices::Mint",
  NV = "grDevices::Burg",
  OS = "grDevices::Blues 2",
  WB = "grDevices::Peach"
)


to_rename |>
  dplyr::mutate(env_group = stringr::str_extract(env_obj, "^[A-Z]+")) |>
  dplyr::filter(env_group != "EB") |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    x_col = "scenario",
    y_lab = "Arithmetic Mean",
    colorgroups = "env_group",
    colorset = "env_obj",
    pal_list = env_pals,
    # facet_col = "env_obj",
    facet_row = "gauge",
    sceneorder = c("down4", "base", "up4")
  )

Multiple aggregation functions

If we give funlist more than one aggregation function, it calculates both. Here, we use the mean and minimum.

timing_obj_2 <- theme_aggregate(
  dat = simpleAgg,
  from_theme = "ewr_code_timing",
  to_theme = "env_obj",
  groupers = c("scenario", "gauge"),
  aggCols = tidyselect::ends_with("ewr_achieved"),
  funlist = c("ArithmeticMean", "Min"),
  causal_edges = causal_ewr
)

Warning: ! EWR outputs detected without `group_until`!
ℹ EWR outputs should be grouped by `SWSDLName`, `planning_unit_name` and `gauge` until aggregated to larger spatial areas.
ℹ Best to explicitly use `group_until` in `multi_aggregate()` or `read_and_agg()`.
ℹ Lower-level processing should include as `grouper` in `theme_aggregate()`

timing_obj_2 <- agg_names_to_cols(timing_obj_2,
  aggsequence = c(names(preseq), "env_obj"),
  funsequence = list(
    all_time = funseq,
    env_obj = list("ArithmeticMean", "Min")
  ),
  aggCols = "ewr_achieved"
)

Warning in funsequence[!fs] <- names(unlist(funsequence[!fs])): number of items
to replace is not a multiple of replacement length

And now we can compare what we get out of the different functions

timing_obj_2 |>
  dplyr::mutate(env_group = stringr::str_extract(env_obj, "^[A-Z]+")) |>
  filter(polyID != "r1zp2f5py7d") |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    x_col = "aggfun_2",
    y_lab = "Arithmetic Mean",
    colorgroups = "env_group",
    colorset = "env_obj",
    pal_list = env_pals,
    facet_col = "scenario",
    facet_row = "gauge",
    sceneorder = c("down4", "base", "up4")
  )

Multiple theme levels

In general, multi_aggregate() can be used across theme, space, and temporal dimensions. But here, it is particularly helpful to allow multi-step aggregation along the theme dimension given the importance of the causal network for theme aggregation. To do that, we use the aggregations above with savehistory = TRUE and namehistory = FALSE.

To create the aggregation, we provide the sequence lists created above, along with the causal links, defined by the causal_edges argument. Because the make_edges() function also takes a sequence of node types, we can usually just call make_edges() on the list of relationships and the desired set of theme levels. We can also just pass in causal_edges = causal_ewr (the list with all possible links), and theme_aggregate() will auto-generate the edges it needs. That’s just a bit less efficient (the edges get generated each step instead of once).

aggseq <- list(
  all_time = "all_time",
  ewr_code = c("ewr_code_timing", "ewr_code"),
  env_obj = c("ewr_code", "env_obj"),
  Specific_goal = c("env_obj", "Specific_goal"),
  Objective = c("Specific_goal", "Objective"),
  target_5_year_2024 = c("Objective", "target_5_year_2024")
)

funseq <- list(
  all_time = "ArithmeticMean",
  ewr_code = c("CompensatingFactor"),
  env_obj = c("ArithmeticMean", "LimitingFactor"),
  Specific_goal = c("ArithmeticMean", "LimitingFactor"),
  Objective = c("ArithmeticMean"),
  target_5_year_2024 = c("ArithmeticMean")
)

theme_steps <- multi_aggregate(
  dat = ewrdata,
  causal_edges = causal_ewr,
  groupers = c("scenario", "gauge"),
  aggCols = "ewr_achieved",
  aggsequence = aggseq,
  funsequence = funseq,
  namehistory = FALSE,
  saveintermediate = TRUE,
  auto_ewr_PU = TRUE # avoid warnings and errors
)

Warning: Causal network does not have all groupers.
• Joining env_obj to Specific_goal
• Groupers are scenario, gauge, polyID, planning_unit_name, SWSDLName.
• expect causal network to have gauge, planning_unit_name, SWSDLName; it has planning_unit_name, SWSDLName, env_obj, Specific_goal, fromtype, totype, edgeorder.
• Do you need to use `group_until`? Or is your network missing columns?

Warning: Causal network does not have all groupers.
• Joining Specific_goal to Objective
• Groupers are scenario, gauge, polyID, planning_unit_name, SWSDLName.
• expect causal network to have gauge, planning_unit_name, SWSDLName; it has planning_unit_name, SWSDLName, Specific_goal, Objective, fromtype, totype, edgeorder.
• Do you need to use `group_until`? Or is your network missing columns?

Warning: Causal network does not have all groupers.
• Joining Objective to target_5_year_2024
• Groupers are scenario, gauge, polyID.
• expect causal network to have gauge; it has Objective, target_5_year_2024, fromtype, totype, edgeorder.
• Do you need to use `group_until`? Or is your network missing columns?

Causal plot

By returning values at each stage, we can map those to colour in a causal network. Here, we map the values of the aggregation to node colour. To do this, we follow the make_causal_plot() approach of making edges and nodes, and then use a join to attach the value to each node.

To keep this demonstration from becoming too unwieldy, we limit the edge creation to a single gauge, and so filter the theme aggregations accordingly (or just rely on the join to drop).

The first step is to generate the edges and nodes for the network we want to look at.

edges <- make_edges(causal_ewr,
  fromtos = aggseq[2:length(aggseq)],
  gaugefilter = example_gauge
)

nodes <- make_nodes(edges)

Now, extract the values we want from the aggregation and join them to the nodes.

# need to grab the right set of aggregations if there are multiple at some stages
whichaggs <- c(
  "ArithmeticMean",
  "CompensatingFactor",
  "ArithmeticMean",
  "ArithmeticMean",
  "ArithmeticMean",
  "ArithmeticMean"
)

# What is the column that defines the value?
valcol <- "ewr_achieved"

# Get the values for each node
aggvals <- extract_vals_causal(theme_steps,
  whichaggs = whichaggs,
  valcol = "ewr_achieved",
  targetlevels = names(aggseq)[2:6]
) # don't use the first one, it's time at _timing

aggvals <- aggvals |>
  # the NA gauges are for levels past gauge definitions
  filter(gauge == example_gauge | is.na(gauge)) |>
  st_drop_geometry()

# join to the nodes
nodes_with_vals <- dplyr::left_join(nodes, aggvals)

Now we can make the causal network plot with the nodes we chose and colour them by the values we’ve just attached to them from the aggregation. At present, it is easiest to make separate plots per scenario or other grouping ( Figure 1 , Figure 2 ). For example, in the increased watering scenario, we see more light colours, and so better performance across the range of outcomes. Further network outputs are provided in the Comparer.

aggNetwork_base <- make_causal_plot(
  nodes = dplyr::filter(
    nodes_with_vals,
    scenario == "base"
  ),
  edges = edges,
  edge_pal = "black",
  node_pal = list(value = "scico::tokyo"),
  node_colorset = "ewr_achieved",
  render = FALSE
)

DiagrammeR::render_graph(aggNetwork_base)

Figure 1: Causal network for baseline scenario at example gauge, coloured by proportion passing at each node, e.g. Arithmetic Means at every step. Light yellow is 1, dark purple is 0.

aggNetwork_4 <- make_causal_plot(
  nodes = dplyr::filter(
    nodes_with_vals,
    scenario == "up4"
  ),
  edges = edges,
  edge_pal = "black",
  node_pal = list(value = "scico::tokyo"),
  node_colorset = "ewr_achieved",
  render = FALSE
)

DiagrammeR::render_graph(aggNetwork_4)

Figure 2: Causal network for 4x scenario at example gauge, coloured by proportion passing at each node, e.g. Arithmetic Means at every step. Light yellow is 1, dark purple is 0.