Spatial aggregation

library(HydroBOT)
library(dplyr)
library(ggplot2)
library(patchwork)

Aggregating data to larger spatial scales is a key requirement of the aggregator component of HydroBOT. To do this spatially, we need to specify the larger units into which input data (typically in smaller spatial units) gets aggregated and the functions to use to do that aggregation. This aggregation needs to respect grouping, most obviously for scenarios, but we also need to keep theme groupings separated during spatial aggregation steps to ensure dimensional safety. Moroever, there is the special case of nonspatial aggregation of spatial data where units at one scale may be relevant to other units they are not geographically related to (e.g. gauges providing information for distant SDL units). Finally, the spatial aggregator enforces recording of the aggregations to ensure the meaning of the data isn’t lost. This results in column names like spatial_ArithmeticMean_ewr_achieved, which are cumbersome but ensure the data provenance is not lost. They can be dealt with more cleanly in multi_aggregate and with agg_names_to_cols(), which turn them into columns.

Here, we explore the spatial aggregator, spatial_aggregate, in isolation to discuss its specific characteristics, though in practice it is typically used within multi_aggregate and read_and_agg which wrap spatial_aggregate, temporal_aggregate, and theme_aggregate to allow interleaved aggregation steps in a standardised format. See the page for multi_aggregate and read_and_agg for more details.

We often will want to only perform a single spatial aggregation (e.g. from gauges to sdl units), but there are instances where that isn’t true- perhaps we want to aggregate from sdl units to states or the basin. Thus, multi-step spatial aggregation is possible, including the situation where aggregation units (polygons) are not nested, as would be the case for sdl units and states, for example.

This document delves fairly in-depth into capabilities, including things like argument types and how they relate to other functions and permit certain tricks. Not all of these will be used or needed to understand by most users- typically there will be a set of aggregation steps fed to read_and_agg and that will be that. This sort of simpler setup is shown in the combined aggregation notebook, read-in and aggregate, and in a combined workflow. See the syntax notebook for a detailed look at argument construction for various purposes. Here, we use that syntax to demonstrate how the spatial aggregation works and the different ways it can be done.

As with all of the internal HydroBOT aggregations, the spatial_aggregate() function is exposed and can be used directly, but can be very ad-hoc and easy to forget theme- and temporal-axis grouping. In general, best practice is to pass a list giving the aggregation sequence to multi_aggregate.

Inputs

spatial_aggregate accept inputs at arbitrary aggregation levels (theme, spatial, or temporal), provided that data is spatial. In other words, the spatial aggregation should aggregate any input spatial data into any set of spatial units, whatever that input data represents. Though we demonstrate here with the polygons provided by {HydroBOT}, the user can supply any arbitrary spatial units in sf format.

EWR-specific note: in some cases, the definitions for outcomes along the theme axis are defined spatially; for example, the definition of Specific_objectives might vary between planning units. However, these are the scale at which definitions of outcomes change, not the scale at which those outcomes must be assessed. For example, just because Specific_objectives are differently defined between planning units, we can still scale them up in space from gauge to basin, with no reference to planning unit.

Demonstration setup

For this demonstration, we start with gauge-referenced data at the env_obj theme scale, and so do a bit of data pre-processing to get there.

project_dir <- "hydrobot_scenarios"
hydro_dir <- file.path(project_dir, "hydrographs")
ewr_results <- file.path(project_dir, "module_output", "EWR")

ewr_out <- prep_run_save_ewrs(
  hydro_dir = hydro_dir,
  output_parent_dir = project_dir,
  outputType = list("none"),
  returnType = list("yearly")
)


# This is just a simple prep step that is usually done internally to put the geographic coordinates on input data
ewrdata <- prep_ewr_output(ewr_out$yearly, type = "achievement")

# This gets us to env_obj at the gauge over all time
preseq <- list(
  all_time = "all_time",
  ewr_code = c("ewr_code_timing", "ewr_code"),
  env_obj = c("ewr_code", "env_obj")
)

funseq <- list(
  all_time = "ArithmeticMean",
  ewr_code = "CompensatingFactor",
  env_obj = "ArithmeticMean"
)

# Do the aggregation to get env_obj at each gauge
simpleThemeAgg <- multi_aggregate(
  dat = ewrdata,
  causal_edges = causal_ewr,
  groupers = c("scenario", "gauge"),
  aggCols = "ewr_achieved",
  aggsequence = preseq,
  funsequence = funseq
)

simpleThemeAgg

This provides a spatially-referenced (to gauge) theme-aggregated tibble to use to demonstrate spatial aggregation. Note that this has the gauge (spatial unit), but also two groupings that we want to preserve when we spatially aggregate- scenario and the current level of theme grouping, env_obj. We have dropped the time by taking the temporal average in step one, but that would be preserved as well if present.

Spatial inputs (polygons)

Spatial aggregation requires polygons to aggregate into, and we want the capability to do that several times. The user can read in any desired polygons with sf::read_sf(path/to/polygon.shp), but here we use those provided in the standard set with {HydroBOT}. We’ll use SDL units, catchments, and the basin to show how the aggregation can have multiple steps with polygons that may not be nested (though care should be taken when that is the case).

Examples

We’ll now use that input data to demonstrate how to do the spatial aggregation, demonstrate capabilities and options provided by the function, and provide additional information useful to the user.

Single aggregation

We might just want to aggregate spatially once. We can do this simply by passing the input data (anything spatial, in this case simpleThemeAgg), a set of polygons, and providing a funlist. In this simple case, we just use a bare function name, here the custom ArithmeticMean which is just a simple wrapper of mean with na.rm = TRUE. Any function can be passed this way, custom or in-built, provided it has a single argument. More complex situations are given below, and different syntax is possible.

Note

The funlist argument here specifies the function(s) to use at a single step. It is thus not the same as the funsequence list of multi_aggregate(); instead being a single item in that list, though it may include multiple functions (e.g. the mean and max).

Note

The aggCols argument is ends_with(original_name) to reference the original name of the column of values- it may have a long name tracking its aggregation history, so we give it the tidyselect::ends_with() to find the column. More generally, both aggCols and groupers can take any tidyselect syntax or bare names or characters, see here.

obj2poly_lost <- spatial_aggregate(
  dat = simpleThemeAgg,
  to_geo = sdl_units,
  groupers = "scenario",
  aggCols = ends_with("ewr_achieved"),
  funlist = ArithmeticMean
)

Warning: !
EWR gauge to sdl units or planning units detected without `pseudo_spatial`!
ℹ Gauges inform multiple SDLs and PUs; this will be lost.
ℹ EWR outputs should be joined to `sdl_units` (or `planning_units`) pseudo-spatially (by column names), not with a spatial join
ℹ Best to explicitly use `pseudo_spatial = 'sdl_units'` in `multi_aggregate()` or `read_and_agg()`.
ℹ Lower-level processing should include as `joinby = 'nonspatial'` in `spatial_aggregate()`

obj2poly_lost

Because we are using spatial_aggregate directly, the dimensional safety provided by multi_aggregate() is not present, and this example has lost the theme levels- e.g. the different env_objs are no longer there and were all averaged together. The multi_aggregate function automatically handles this preservation, but spatial_aggregate is more general, and does not make any assumptions about the grouping structure of the data. Thus, to keep the env_obj groupings (as we should, otherwise we’re inadvertently theme-aggregating over all of them), we need to add env_obj to the groupers argument.

Tip

We use the prefix = 'sdl_units_' argument (which tracks aggregation steps) to mimic what multi_aggregate does internally. The default is spatial_, but providing the units is more informative.

obj2poly <- spatial_aggregate(
  dat = simpleThemeAgg,
  to_geo = sdl_units,
  groupers = c("scenario", "env_obj"),
  aggCols = ends_with("ewr_achieved"),
  funlist = ArithmeticMean,
  prefix = "sdl_units_"
)

Warning: !
EWR gauge to sdl units or planning units detected without `pseudo_spatial`!
ℹ Gauges inform multiple SDLs and PUs; this will be lost.
ℹ EWR outputs should be joined to `sdl_units` (or `planning_units`) pseudo-spatially (by column names), not with a spatial join
ℹ Best to explicitly use `pseudo_spatial = 'sdl_units'` in `multi_aggregate()` or `read_and_agg()`.
ℹ Lower-level processing should include as `joinby = 'nonspatial'` in `spatial_aggregate()`

obj2poly

The resulting column name is cumbersome, but provide a record of exactly what the aggregation sequence was.

names(obj2poly)

[1] "scenario"                                                                                                        
[2] "env_obj"                                                                                                         
[3] "polyID"                                                                                                          
[4] "sdl_units_ArithmeticMean_env_obj_ArithmeticMean_ewr_code_CompensatingFactor_all_time_ArithmeticMean_ewr_achieved"
[5] "SWSDLID"                                                                                                         
[6] "StateID"                                                                                                         
[7] "SWSDLName"                                                                                                       
[8] "geometry"

We can clean those up into columns with agg_names_to_cols() (which happens internally in multi_aggregate() and read_and_agg() with namehistory = FALSE).

obj2poly_rename <- agg_names_to_cols(obj2poly,
  aggsequence = c(names(preseq), "sdl_units"),
  funsequence = c(funseq, "ArithmeticMean"),
  aggCols = "ewr_achieved"
)
obj2poly_rename

A quick plot shows the outcome. Because we have used the default keepAllPolys = FALSE, only the sdl units with data are here. See Section 2.4 for retaining them. For more plotting details, see the plotting section. We’ll simplify the names and choose a subset of the environmental objectives.

obj2poly_rename |>
  filter(grepl("^EF[1-3]", env_obj)) |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    y_lab = "Arithmetic Mean",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "ewr_achieved",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "env_obj",
    facet_row = "scenario",
    sceneorder = c("down4", "base", "up4")
  )

Multiple aggregations at a step

If we give funlist more than one aggregation function, it calculates both. Here, we use the mean and median.

medna <- function(x) {
  median(x, na.rm = TRUE)
}


obj2poly_2 <- spatial_aggregate(
  dat = simpleThemeAgg,
  to_geo = sdl_units,
  groupers = c("scenario", "env_obj"),
  aggCols = ends_with("ewr_achieved"),
  funlist = c("ArithmeticMean", "medna"),
  prefix = "sdl_units_"
)

Warning: !
EWR gauge to sdl units or planning units detected without `pseudo_spatial`!
ℹ Gauges inform multiple SDLs and PUs; this will be lost.
ℹ EWR outputs should be joined to `sdl_units` (or `planning_units`) pseudo-spatially (by column names), not with a spatial join
ℹ Best to explicitly use `pseudo_spatial = 'sdl_units'` in `multi_aggregate()` or `read_and_agg()`.
ℹ Lower-level processing should include as `joinby = 'nonspatial'` in `spatial_aggregate()`

# we need a full sequence
funseq2 <- funseq
funseq2$sdl_units <- list("ArithmeticMean", "medna")

obj2poly_2 <- agg_names_to_cols(obj2poly_2,
  aggsequence = c(names(preseq), "sdl_units"),
  funsequence = funseq2,
  aggCols = "ewr_achieved"
)

Attempting to deparse lambda function in funsequence. 
ℹ It is better to use named functions than lambdas in the funsequence, because they are more clearly defined and can be more easily known later.

Warning in funsequence[!fs] <- names(unlist(funsequence[!fs])): number of items
to replace is not a multiple of replacement length

And now we can compare what we get out of the different functions (just for the baseline scenario or we run out of dimensions).

obj2poly_2 |>
  filter(scenario == "base") |>
  filter(grepl("^EF[1-3]", env_obj)) |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "ewr_achieved",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "env_obj",
    facet_row = "aggfun_4"
  )

Multiple spatial levels

There are a number of polygon layers we might want to aggregate into in addition to SDL units, e.g. resource plan areas, hydrological catchments, or the whole basin. We can aggregate directly into them just as we have above for SDL units. However, we might also want to have several levels of spatial aggregation, which may be nested or nearly so, e.g. from SDL units to the basin, or may be nonnested, e.g. from SDL units to catchments. Typically, this would happen with intervening theme aggregations, as in the interleaved example.

Important

An area column is always created for polygons, making it available for things like area-weighted means.

Nested

A straightforward example of aggregating one set of polygons into a larger set is to move from the sdl units we’ve just used to the larger Murray-Darling Basin. Note the use of tidyselect::ends_with() to get the right column to aggregate and the weighted mean, see syntax for more detail.

simplepolypoly <- spatial_aggregate(
  dat = obj2poly,
  to_geo = basin,
  groupers = c("scenario", "env_obj"),
  aggCols = ends_with("ewr_achieved"),
  funlist = rlang::quo(list(wmna = ~ weighted.mean(., .data$area, na.rm = TRUE))),
  na.rm = TRUE,
  failmissing = FALSE,
  prefix = "basin_"
)
simplepolypoly

simplepolypoly |>
  rename("ewr_achieved" = ends_with("ewr_achieved")) |>
  filter(grepl("^EF[1-3]", env_obj)) |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    y_lab = "Arithmetic Mean",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "ewr_achieved",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "env_obj",
    facet_row = "scenario",
    sceneorder = c("down4", "base", "up4")
  )

Non-nested

Aggregating from SDL units to catchments (cewo_valleys) requires addressing issues of overlaps among the various polygons. As such, it makes a good test case that catches issues with the intersection of polygons that might not happen with a simpler set of polygons Figure 1 .

Note

The name cewo_valleys reflects this dataset’s original name and provenance from the Commonwealth Environmental Water Office.

Caution

The aggregation process for multiple spatial levels works the same whether or not the smaller levels nest into the larger, but more care should be taken (and more explanation is needed) in the nonnested case.

#|
overlay_cewo_sdl <- ggplot() +
  geom_sf(data = sdl_units, aes(fill = SWSDLName), color = NA, alpha = 0.5) +
  geom_sf(data = cewo_valleys, aes(color = ValleyName), fill = NA) +
  theme(legend.position = "none") +
  ggtitle("both")

valleys <- ggplot() +
  geom_sf(data = cewo_valleys, aes(color = ValleyName), fill = "white") +
  theme(legend.position = "none") +
  ggtitle("catchments")

sdls <- ggplot() +
  geom_sf(data = sdl_units, aes(fill = SWSDLName), alpha = 0.5) +
  theme(legend.position = "none") +
  ggtitle("sdl_units")

valleys + sdls + overlay_cewo_sdl

Figure 1: Catchments (coloured lines; `cewo_valleys`) and SDL units (coloured fills) alone (a & b) and overlain (c), showing these are not nested, but instead are intersecting polygons.

To aggregate from one set of polygons into the other, spatial_aggregate() splits them up and aggregate in a way that respects area and borders. In other words, if a polygon lays across two of the next level up, only the bits that overlap a given polygon in that next level up are included. Under the hood, sf::st_intersection() splits the polygons to make a new set of non-overlapping polygons. Then these pieces are aggregated into the higher level. This aggregation should carefully consider area- things like means should be area-weighted, and things like sums, minima, and maxima should be thought about carefully- if the lower ‘from’ data are already sums, for example, an area weighting might make sense to get a proportion of the sum, but this is highly dependent on the particular sequence of aggregation. For other functions like minima and maxima, area-weighting may or may not be appropriate, and so careful attention should be paid to constructing the aggregation sequence and custom functions may be involved.

To illustrate what happens internally, the intersection of sdl_units and cewo_valleys chops up sdl_units so there are unique polygons for each sdl unit - valley combination.

joinpolys <- st_intersection(sdl_units, cewo_valleys)
joinpolys

To better see the many-to-many chopping we get with this particular pair of intersecting shapefiles, we can isolate an SDL unit (Victorian Murray) and see that it contains bits of 8 catchments. Likewise, the Loddon catchment contains bits of 4 SDL units Figure 2 .

nvorig <- ggplot() +
  geom_sf(data = dplyr::filter(sdl_units, SWSDLName == "Victorian Murray"))

nvpostjoin <- ggplot() +
  geom_sf(
    data = dplyr::filter(
      joinpolys,
      SWSDLName == "Victorian Murray"
    ),
    aes(fill = ValleyName)
  )

avorig <- ggplot() +
  geom_sf(data = dplyr::filter(cewo_valleys, ValleyName == "Loddon"))

avpostjoin <- ggplot() +
  geom_sf(
    data = dplyr::filter(
      joinpolys,
      ValleyName == "Loddon"
    ),
    aes(fill = SWSDLName)
  )

(nvorig + nvpostjoin) / (avorig + avpostjoin)

Figure 2: SDL units intersected with catchment (`cewo_valleys`) and split to allow aggregation

In practice, non-nested aggregation works just like nested; a one-off aggregation from polygons to another set of polygons uses spatial_aggregate() as before. Here, we could use the obj2poly aggregation into SDL units created above as the starting point. Here, we use the pre-defined 'SpatialWeightedMean', which is the same as the anonymous function passed to the basin aggregation above.

nestedpolypoly <- spatial_aggregate(
  dat = obj2poly,
  to_geo = cewo_valleys,
  groupers = c("scenario", "env_obj"),
  aggCols = ends_with("ewr_achieved"),
  funlist = "SpatialWeightedMean",
  na.rm = TRUE,
  failmissing = FALSE,
  prefix = "cewo_valleys_"
)
nestedpolypoly

Those are now in the catchments (cewo_valleys), with values from the intersecting parts of the sdl_units.

Warning

Even catchments with only a small part of an sdl_unit have values here, illustrating one of the dangers of such non-nested aggregations.

nestedpolypoly |>
  rename("ewr_achieved" = ends_with("ewr_achieved")) |>
  filter(grepl("^EF[1-3]", env_obj)) |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    y_lab = "Arithmetic Mean",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "ewr_achieved",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "env_obj",
    facet_row = "scenario",
    sceneorder = c("down4", "base", "up4")
  )

Polygon retention with `keepAllPolys`

In some cases, we might want to keep or drop polygons that don’t have data to have better-looking maps. By default, polygons without data are dropped, both for memory management and to avoid clutter. But if we set keepAllPolys = TRUE, we can keep them.

obj2poly_keep <- spatial_aggregate(
  dat = simpleThemeAgg,
  to_geo = sdl_units,
  groupers = c("scenario", "env_obj"),
  aggCols = ends_with("ewr_achieved"),
  funlist = ArithmeticMean,
  prefix = "sdl_units_"
)
obj2poly_keep

obj2poly_rename |>
  rename("ewr_achieved" = ends_with("ewr_achieved")) |>
  filter(grepl("^EF[1-3]", env_obj)) |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    y_lab = "Arithmetic Mean",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "ewr_achieved",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    facet_col = "env_obj",
    facet_row = "scenario",
    sceneorder = c("down4", "base", "up4")
  )

Now we have retained polygons without data. The default is keepAllPolys = FALSE. Another option for plotting though would be to use FALSE as earlier, and include an underlay. For more detail, see the plotting.

obj2poly_rename |>
  rename("ewr_achieved" = ends_with("ewr_achieved")) |>
  filter(grepl("^EF[1-3]", env_obj)) |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    y_lab = "Arithmetic Mean",
    plot_type = "map",
    colorgroups = NULL,
    colorset = "ewr_achieved",
    pal_list = list("scico::lapaz"),
    pal_direction = -1,
    underlay_list = list(underlay = "sdl_units", pal_list = "azure"),
    facet_col = "env_obj",
    facet_row = "scenario",
    sceneorder = c("down4", "base", "up4")
  )

Next steps

The examples here are designed to dig into capability of the spatial aggregator in fairly high detail. In typical use, the capabilities here would be studied to develop an aggregation sequence that is best tailored to the study, and then use the automated and safer wrapper functions, as in the complete workflows, with more detail around those wrappers here.This document provides valuable demonstrations of capability and potential for how each spatial step in that sequence might work and could be set up.

Inputs

Demonstration setup

Spatial inputs (polygons)

Examples

Single aggregation

Multiple aggregations at a step

Multiple spatial levels

Nested

Non-nested

Polygon retention with keepAllPolys

Next steps

Polygon retention with `keepAllPolys`