library(werptoolkitr)
Theme aggregation
Overview
We need to aggregate outcomes along the theme dimension. For example, we might want to combine EWR pass/fails into the proportion of EWRs contributing to a proximate objective (‘environmental objective’, env_obj) that passed, and then translate that into outcomes for 5-year targets or Waterbirds, etc.
The input data is thus the data coming out of the theme modules (e.g. the EWR tool), which is then aggregated. The relationships that define the aggregations are the same as those defining the causal networks- these map finer-scale groups to coarser. Thus, we need to access those relationships (and the make_edges function that builds the links). The demonstrations here are all about the EWR outputs, but the aggregator is agnostic to the input data, provided we have the causal relationships to define the aggregation groupings- we specify the columns to aggregate and any additional grouping variables, which can be anything.
The causal relationships for the EWR tool are provided in the {werptoolkitr} package as causal_ewr, a list of all the mappings. The {werptoolkitr} package also provides all necessary aggregation functions and handling functions for the causal relationships, though, as with the polygons in the spatial aggregation, it is possible to use relationships other than those provided by {werptoolkitr}. In particular, while the EWR tool provides these relationships externally to the module, they may be embedded in other modules, particularly if the responses at different theme levels are modelled mechanistically.
In practice, we expect to interleave spatial, temporal, and thematic aggregation steps- perhaps it makes sense to aggregate along the theme axis to the env_obj scale at a gauge, then scale to the SDL unit, then aggregate to the Objective scale, and then scale to the basin and long-term targets. We demonstrate such interleaved aggregation elsewhere; here we focus on demonstrating and understanding the meaning of aggregation along the theme axis and how to do it. A similar notebook goes into the detail along the spatial dimension.
Inputs
The theme relationships in the causal network causal_ewr provide the links and structure of the theme scaling, while the values to be scaled come out of the modules. For this demonstration, we provide a set of paths to point to the input data, in this case the outputs from the EWR tool, created by a controller notebook. Spatial units could be any arbitrary polygons, but we use those provided by {werptoolkitr}.
project_dir <- file.path('more_scenarios')
ewr_results <- file.path(project_dir, 'module_output', 'EWR')
Input data to multi_aggregate should be a dataframe (e.g. a dataframe of EWR outputs, or an sf of spatial outcomes). If we want to pass a path instead of a dataframe (as we might for large runs), we would use read_and_agg, which wraps multi_aggregate, as demonstrated in the interleaved notebook. Thus, for the demonstration, we pull in the data produced by the demonstration controller and contained in-repo at the ewr_results path using get_ewr_output below.
Setting aggregation parameters
For each step in the aggregation, we need to specify what levels we are aggregating from and to, the function to use to aggregate, and the mapping between the ‘from’ and ‘to’ levels.
The aggsequence list in multi_aggregate (and theme_aggregate) needs to be nested- links must be defined in causal_ewr (or other causal mappings) between the ‘from’ and ‘to’ levels at each step. In other words, we can’t go backwards, and we can’t scale between levels with no defined relationship. However, this does not mean we always have to include every level. If a relationship exists, levels can be jumped (e.g. we could go straight from env_obj to target_20_year_2039), and indeed there may not be a defined ordering of some levels, so it is perfectly reasonable to go from env_obj to both Objective and Target. For EWR outputs, the aggsequence list will typically need to start with ewr_code_timing and aggregate from there into ewr_code and env_obj, as everything else flows from that.
The funsequence is a list instead of a simple vector because multiple functions can be used at each step. When multiple functions are passed, they are factorial (each function is calculated on the results of all previous aggregations). This keeps the history clean, and allows us to easily unpick the meaning of each value in the output.
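A rough sketch of this factorial behaviour in plain base R (toy data and stand-in functions, not the toolkit’s- here ArithmeticMean and LimitingFactor are assumed to behave like mean and min on pass/fail values):

```r
# Toy pass/fail data: four EWRs feeding two env_objs
dat <- data.frame(env_obj = c('EB1', 'EB1', 'EB2', 'EB2'),
                  passed  = c(1, 0, 1, 1))

# Step 1: two functions from EWR to env_obj
step1 <- aggregate(passed ~ env_obj, dat, FUN = mean)        # ArithmeticMean analogue
step1$passed_min <- aggregate(passed ~ env_obj, dat,
                              FUN = min)$passed              # LimitingFactor analogue

# Step 2: both functions again, applied to *both* step-1 columns,
# giving 2 x 2 = 4 output values at the coarser level
step2 <- c(mean_of_mean = mean(step1$passed),
           min_of_mean  = min(step1$passed),
           mean_of_min  = mean(step1$passed_min),
           min_of_min   = min(step1$passed_min))
step2
```

A second single-function step applied to these four values would keep four output columns, which is why the column count in the outputs below stops growing once only one function is used per step.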
These aggregation sequences and their associated functions are arguments to multi_aggregate. In practice, they will typically be set to a default value in a parameter file. There is a theme_aggregate function that performs a single level of theme aggregation, much like spatial_aggregate does for space. I do not focus on that here, because single theme aggregations tend to be less complex than single spatial aggregations, and the capability to pass different sorts of arguments is discussed in the spatial and syntax notebooks. Instead, here we focus on the sequential aggregation provided by multi_aggregate to understand theme aggregation, which is most interesting as a multi-step process along the causal network.
Data overview
The EWR results come in three main flavours- summary over the span of the run, annual, and all, a continuous set. At the time this was written, the all didn’t exist and annual was broken, so we focus here on the summary data; updating to handle the others is a very high priority. We read them in with get_ewr_output, which also provides some cleanup. The wrapper function read_and_agg reads them in internally and then calls multi_aggregate to avoid having all objects for each scenario in memory. That’s what we’d typically do in production, but here the goal is to see how the theme aggregation works. The get_ewr_output function has additional arguments to filter by gauge and scenario on read-in, allowing parallelisation without overloading memory.
We typically would use prep_ewr_agg (and this is wrapped by multi_aggregate), but that not only calls get_ewr_output, it then makes the data geographic. The geography doesn’t need to happen here, since we’re only aggregating along the theme dimension.
Demonstration
Data
We’ll pull in the summary data to use for demonstration so we can use multi_aggregate directly. If we wanted to feed a path instead of a dataframe, we would need to use read_and_agg.
sumdat <- get_ewr_output(ewr_results, type = 'summary')
# Would make it geographic:
# sumdat <- prep_ewr_agg(ewr_results, type = 'summary', geopath = gpath)
sumdat
There’s an issue with outputType = 'annual' in the version of the EWR tool this was built with. Until I update and test the new EWR tool, skip the annual data.
anndat <- get_ewr_output(ewr_results, type = 'annual')
anndat
Note: I originally wrote get_ewr_output to automatically get both the annual and summary EWR output files, but given a typical workflow I think it will make more sense to just wrap it if we want more than one type of output. Even if we want to use multiple output types (summary, annual, all, etc.), they won’t talk to each other for any of the processing steps, so we might as well run them in parallel.
We’ll choose an example gauge to make it easier to visualise the data.
# Dubbo is '421001', has 24 EWRs
# Warren Weir is '421004', has 30 EWRs.
example_gauge <- '421001'
Aggregation
As a demonstration, I’ve set a range of theme levels that hit all the theme relationship dataframes from the causal networks defined in causal_ewr, and set the aggregation functions fairly simply, but with two multi-aggregation steps to illustrate how that works. For more complexity in these aggregation functions, see the spatial notebook and aggregation syntax.
I’m using CompensatingFactor as the aggregation function for the ewr_code_timing to ewr_code step here, assuming that passing either timing sub-code means the main code passes. A similar approach could be used if we want to lump the ewr_codes themselves, e.g. put EF4a,b,c,d into EF4. I use both ArithmeticMean and LimitingFactor for the second and third levels to demonstrate multiple aggregations and how the outputs from those steps get carried through subsequent steps.
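The pass/fail logic can be sketched with stand-in functions (an assumption based on the descriptions in this notebook- the toolkit’s actual definitions in default_agg_functions.R may handle more cases):

```r
# Stand-ins, assuming CompensatingFactor takes the best outcome in a group
# (any member passing means the group passes) and LimitingFactor the worst
# (any member failing means the group fails). Not the toolkit implementations.
CompensatingFactor_sketch <- function(x) max(x, na.rm = TRUE)
LimitingFactor_sketch     <- function(x) min(x, na.rm = TRUE)

timing_results <- c(0, 1)  # two timing sub-codes for one ewr_code
CompensatingFactor_sketch(timing_results)  # either sub-code passing -> passes
LimitingFactor_sketch(timing_results)      # any sub-code failing -> fails
```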
aggseq <- list(c('ewr_code_timing', 'ewr_code'),
               c('ewr_code', 'env_obj'),
               c('env_obj', 'Specific_goal'),
               c('Specific_goal', 'Objective'),
               c('Objective', 'target_5_year_2024'))

funseq <- list(c('CompensatingFactor'),
               c('ArithmeticMean', 'LimitingFactor'),
               c('ArithmeticMean', 'LimitingFactor'),
               c('ArithmeticMean'),
               c('ArithmeticMean'))
The groupers and aggCols arguments can take a number of different formats- character vectors, bare column names, and sometimes tidyselect, though this is more true in theme_aggregate and limited for multi_aggregate, as discussed in the syntax documentation. This capability gives the user quite a few options for specifying the columns to use. The use of selectcreator makes it robust to nonexistent columns with failmissing = FALSE.
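The failmissing idea can be illustrated with a minimal stand-in (select_existing here is hypothetical; selectcreator itself is more general):

```r
# Tolerate requested columns that don't exist, in the spirit of
# selectcreator with failmissing = FALSE (not the toolkit implementation).
select_existing <- function(wanted, dat, failmissing = TRUE) {
  found <- intersect(wanted, names(dat))
  if (failmissing && length(found) < length(wanted)) {
    stop('missing columns: ',
         paste(setdiff(wanted, names(dat)), collapse = ', '))
  }
  found
}

df <- data.frame(scenario = 'base', gauge = '421001')
select_existing(c('scenario', 'year'), df, failmissing = FALSE)  # drops 'year' silently
```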
To create the aggregation, we provide the sequence lists created above, along with the causal links, defined by the causal_edges argument. Because the make_edges function also takes a sequence of node types, we can usually just call make_edges on the list of relationships and the desired set of theme levels. We can also just pass in causal_edges = causal_ewr (the list with all possible links), and theme_aggregate will auto-generate the edges it needs; that’s just a bit less efficient.
simpleThemeAgg <- multi_aggregate(dat = sumdat,
                                  causal_edges = make_edges(causal_ewr, aggseq),
                                  groupers = c('scenario', 'gauge'),
                                  aggCols = 'ewr_achieved',
                                  aggsequence = aggseq,
                                  funsequence = funseq)
simpleThemeAgg
That output has 4 columns of output values because aggregation steps are factorial in the number of aggregation functions applied. The second step found the ArithmeticMean and LimitingFactor of ewr_achieved into env_obj, and then the third step found the ArithmeticMean and LimitingFactor of each of those outcomes into Specific_goal. Each subsequent step only found the ArithmeticMean of each, and so the number of output columns stopped growing.
Tracking aggregation steps
Tracking aggregation steps is critical for knowing the meaning of the numbers produced. We can do that in two different ways- in column headers (names) or in columns themselves.
Tracking history in column names is unwieldy, but describes exactly what the numbers are and is smaller in memory. For example, the last column is
names(simpleThemeAgg)[ncol(simpleThemeAgg)]
[1] "target_5_year_2024_ArithmeticMean_Objective_ArithmeticMean_Specific_goal_LimitingFactor_env_obj_LimitingFactor_ewr_code_CompensatingFactor_ewr_achieved"
This says the values in this column are the 5-year targets, calculated as the arithmetic mean of Objectives, which were the arithmetic mean of Specific_goals, which were calculated from env_obj as limiting factors, which were obtained from the ewr_code as limiting factors, and those were calculated from the ewr_code_timing as compensating factors.
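Because the name is just the aggregation history pasted together, it can be unpicked with ordinary string tools- a sketch of the idea (agg_names_to_cols does this properly within the toolkit):

```r
lastname <- "target_5_year_2024_ArithmeticMean_Objective_ArithmeticMean_Specific_goal_LimitingFactor_env_obj_LimitingFactor_ewr_code_CompensatingFactor_ewr_achieved"

# Split on the known aggregation function names to recover the level at each step
funs <- c('ArithmeticMean', 'LimitingFactor', 'CompensatingFactor', 'GeometricMean')
pattern <- paste0('_(', paste(funs, collapse = '|'), ')_')
levels_in_name <- strsplit(lastname, pattern)[[1]]
levels_in_name
#> [1] "target_5_year_2024" "Objective"          "Specific_goal"
#> [4] "env_obj"            "ewr_code"           "ewr_achieved"
```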
It may be easier to think about the meaning of the names from the other direction- ewr_achieved values were aggregated from ewr_code_timing into ewr_code as compensating factors, then into env_obj as limiting factors- for the env_obj to pass, all ewr_codes contributing to it must pass. Then the env_objs were aggregated into Specific_goal, again as limiting factors, so to meet a goal, all contributing env_objs must pass. Those Specific_goals were then aggregated into Objectives with the arithmetic mean, so the value for an Objective is the average of the contributing Specific_goals. Since in this example the Specific_goals will be either 1 or 0, this average gives the proportion of Specific_goals that are met for each Objective. Similarly, the 5-year targets were obtained by averaging the Objectives contributing to them.
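That proportion logic is just the arithmetic mean of binary outcomes:

```r
specific_goals <- c(1, 0, 1, 1)  # pass/fail for four Specific_goals under one Objective
mean(specific_goals)             # proportion met
#> [1] 0.75
```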
A different way to track the aggregations is possible by including them in columns instead of the names. This takes more memory, but can be clearer and makes subsequent uses easier in many cases. For memory purposes, I currently parse the names into columns post-hoc. When we develop the ability to do different aggregations on different groups within a node type, we will need to make the columns as we go (and we will have to use the column method to track, since different things will happen to different values in a column). This is very high priority.
In the example above, we can feed the output data to agg_names_to_cols to put the history in columns instead of names.
agg_names_to_cols(simpleThemeAgg, aggsequence = aggseq, funsequence = funseq, aggCols = 'ewr_achieved')
In practice, what makes most sense is to use a switch (namehistory = FALSE) inside multi_aggregate to return this format.
simpleColHistory <- multi_aggregate(dat = sumdat,
                                    causal_edges = make_edges(causal_ewr, aggseq),
                                    groupers = c('scenario', 'gauge'),
                                    aggCols = 'ewr_achieved',
                                    aggsequence = aggseq,
                                    funsequence = funseq,
                                    namehistory = FALSE)
simpleColHistory
Additional demonstrations
The above is a simple example of how to use the aggregation function along the theme dimension, but there is more we can do with it.
Arbitrary input data
There are three main types of input EWR data (summary, annual, and all), but we expect there will be any number of input datasets once other modules exist. Provided the causal relationships are defined for the input sets, the specific input datasets and columns of data to aggregate are general. For example, we can feed it the annual data (type = 'annual') and aggregate two different columns (feed aggCols a vector of column names). In the case of the annual data, we might also want to group by year in addition to gauge and scenario, so we add year to the groupers vector. We’ll keep the same sequences of aggregations and functions, noting that they are now calculated for both aggCols.
TODO: turn back on once we update to the new EWR tool without the annual bug.
annualColHistory <- multi_aggregate(dat = anndat,
                                    causal_edges = make_edges(causal_ewr, aggseq),
                                    groupers = c('scenario', 'gauge', 'year'),
                                    aggCols = c('num_events', 'event_length'),
                                    aggsequence = aggseq,
                                    funsequence = funseq,
                                    namehistory = FALSE)
annualColHistory
Returning every stage
By default, we return only the final aggregation after stepping up the full sequence set by aggsequence. But in some cases, we might want to return all of the intermediate aggregations (i.e. at each step of aggsequence). This full aggregation sequence can be useful for testing and checking but, probably more importantly, allows a more complete understanding of the results. For visualisation, this allows each step to be fed as colour or other attributes to the causal network. To save all steps in the aggregation sequence, we pass saveintermediate = TRUE to multi_aggregate, and it returns a list of tibbles named by the aggregation level instead of a single final tibble. We cannot just attach the stage results as columns to a flat dataframe because the aggregation is many-to-many, so the rows do not match and are not strictly nested. Thus, each stage needs its own dataframe to avoid duplicating or deleting data. Moreover, the additional flexibility of a list of dataframes is necessary for interleaved aggregations across the theme, space, and time axes.
allsteps <- multi_aggregate(dat = sumdat,
                            causal_edges = make_edges(causal_ewr, aggseq),
                            groupers = c('scenario', 'gauge'),
                            aggCols = 'ewr_achieved',
                            aggsequence = aggseq,
                            funsequence = funseq,
                            saveintermediate = TRUE,
                            namehistory = FALSE)
names(allsteps)
[1] "ewr_code_timing" "ewr_code" "env_obj"
[4] "Specific_goal" "Objective" "target_5_year_2024"
Causal plot
By returning values at each stage, we can map those to colour (and later size) in a causal network. In practice, this will happen in the Comparer (and the initial setup data arrangement will be made into a function there), but we can demonstrate it quickly here by mapping the values of the aggregation to node colour. To do this, I’ll follow the usual causal_plots approach of making edges and nodes, and then use a join to attach the value to each node.
To keep this demonstration from becoming too unwieldy, we limit the edge creation to a single gauge, and so will filter the theme aggregations accordingly (or just rely on the join to drop unmatched rows). The ewr_code_timing outcomes are likely just confusing to include here, so we cut them off.
The first step is to generate the edges and nodes for the network we want to look at.
edges <- make_edges(causal_ewr,
                    fromtos = aggseq[2:length(aggseq)],
                    gaugefilter = example_gauge)
nodes <- make_nodes(edges)
Now, extract the values we want from the aggregation and join them to the nodes.
TODO this will all be done in a data prep function in the comparer that processes multi-step aggregation lists. This is high priority, but needs thought for how to handle interleaved aggregation along different axes.
# need to grab the right set of aggregations if there are multiple at some stages
whichaggs <- c('CompensatingFactor',
               'ArithmeticMean',
               'ArithmeticMean',
               'ArithmeticMean',
               'ArithmeticMean')

# What is the column that defines the value?
valcol <- 'ewr_achieved'

# Get the values for each node
targetlevels <- names(allsteps)
targetlevels[1] <- 'ewr_code_timing'
aggvals <- extract_vals_causal(allsteps, whichaggs, valcol,
                               targetlevels = targetlevels)

# Cut off the ewr_code_timing- should really have a drop_step argument in
# extract_vals_causal to just not return it?
aggvals <- aggvals %>% dplyr::filter(NodeType != 'ewr_code_timing')

# cut to relevant gauge, then remove- causes problems since node levels above
# env_obj aren't gauge-referenced
aggvals <- aggvals %>%
  dplyr::filter(gauge == example_gauge) %>%
  dplyr::select(-gauge)

# join to the nodes
nodes_with_vals <- dplyr::left_join(nodes, aggvals)
Make the causal network plot with the nodes we chose and colour by the values we’ve just attached to them from the aggregation. At present, it is easiest to make separate plots per scenario or other grouping (Figure 1, Figure 2). For example, in the increased watering scenario, we see more light colours, and so better performance across the range of outcomes. Further network outputs are provided in the Comparer.
aggNetwork <- make_causal_plot(nodes = dplyr::filter(nodes_with_vals,
                                                     scenario == 'base'),
                               edges = edges,
                               edge_pal = 'black',
                               node_pal = list(value = 'scico::tokyo'),
                               node_colorset = 'ewr_achieved',
                               render = FALSE)
DiagrammeR::render_graph(aggNetwork)
aggNetwork <- make_causal_plot(nodes = dplyr::filter(nodes_with_vals,
                                                     scenario == 'up2'),
                               edges = edges,
                               edge_pal = 'black',
                               node_pal = list(value = 'scico::tokyo'),
                               node_colorset = 'ewr_achieved',
                               render = FALSE)
DiagrammeR::render_graph(aggNetwork)
User-set functions
We have established a simple set of default aggregation functions (ArithmeticMean, GeometricMean, LimitingFactor, and CompensatingFactor), available in default_agg_functions.R. I expect that list to grow, but it is also possible to supply user-defined functions to include in funsequence. Previously, we have used this sort of approach for things like threshold functions. For example, we might want to know the mean event length, but only for events longer than 2 days, for the EWR to objective aggregation, and thereafter scale up with ArithmeticMean. We can do this by specifying a new function and including it in the list given to funsequence.
event2 <- function(x) {
  mean(ifelse(x > 2, x, NA), na.rm = TRUE)
}

newfuns <- list(c('CompensatingFactor'),
                c('event2'),
                c('event2'),
                c('ArithmeticMean'),
                c('ArithmeticMean'))
This is currently turned off until we update to the new EWR tool without the bug in the annual results.
annualEv2 <- multi_aggregate(dat = anndat,
                             causal_edges = make_edges(causal_ewr, aggseq),
                             groupers = c('scenario', 'gauge', 'year'),
                             aggCols = 'event_length',
                             aggsequence = aggseq,
                             funsequence = newfuns,
                             namehistory = FALSE)

# lots of NaN because many years and locations didn't have events > 2, so for ease of viewing, filter
annualEv2 %>% dplyr::filter(!is.nan(event_length))
Gauge and scenario filtering
Reading in all of the EWR results across all gauges and scenarios could be massive, depending on the spatial scale and the number of scenarios, so we might want to parallelise over gauges or scenarios. We also might only be interested in some subset for things like plotting. To address this, get_ewr_output has gaugefilter and scenariofilter arguments. This is particularly useful once we have lots of data that doesn’t fit in memory or want to process in parallel- if we have all the data in memory already, we can just pipe it in through a filter (or filter the first argument), but if we read it in in parallel from a path, we can greatly speed up processing.
To only read in the relevant data, we use the read_and_agg wrapper. The gaugefilter argument currently only works if there are separate files for each gauge. Once we settle on a data format, I will re-write gaugefilter to only read the desired gauge from the file, though that won’t be possible with interleaved spatial aggregation.
smallreadagg <- read_and_agg(datpath = ewr_results, type = 'summary',
                             geopath = bom_basin_gauges,
                             causalpath = causal_ewr,
                             groupers = c('scenario', 'gauge'),
                             aggCols = 'ewr_achieved',
                             aggsequence = aggseq,
                             funsequence = funseq,
                             namehistory = FALSE,
                             gaugefilter = NULL,
                             scenariofilter = 'base')

table(smallreadagg$gauge, smallreadagg$scenario)
base
412002 304
412004 304
412005 304
412011 304
412012 304
412016 208
412033 304
412038 304
412039 304
412046 208
412122 76
412124 76
412163 196
412188 200
412189 280
419001 280
419006 280
419007 280
419012 280
419015 280
419016 280
419020 280
419021 280
419022 280
419026 284
419027 280
419028 280
419032 280
419039 280
419045 280
419049 280
419091 284
420020 4
421001 192
421004 288
421011 280
421012 280
421019 208
421022 276
421023 276
421090 276
421146 4
422001 292
422028 292
For a one-off that fits in memory, this is slower than filtering the data after it’s in memory, since the read-in happens first. The advantage comes when we don’t want (or can’t fit) all of the original data in memory, such as when parallelising over scenarios.
The read_and_agg function is also helpful if we just want to use paths as arguments instead of reading the data in and then calling multi_aggregate. In that case, we might not use any *filter arguments, in which case it works just like multi_aggregate, but takes paths instead of objects as arguments. For example, saving all intermediates with no filtering can be done with the path to the data, ewr_results.
readallsteps <- read_and_agg(datpath = ewr_results, type = 'summary',
                             geopath = bom_basin_gauges,
                             causalpath = causal_ewr,
                             groupers = c('scenario', 'gauge'),
                             aggCols = 'ewr_achieved',
                             aggsequence = aggseq,
                             funsequence = funseq,
                             saveintermediate = TRUE,
                             namehistory = FALSE)
names(readallsteps)
[1] "ewr_code_timing" "ewr_code" "env_obj"
[4] "Specific_goal" "Objective" "target_5_year_2024"
Parallelisation
The gauge and scenario filtering gives an easy way to parallelise. I haven’t written this into a function yet, until we settle on how to use Azure batching with the toolkit. It will likely involve a wrapper around read_and_agg, but could be incorporated as parameters fed to read_and_agg itself. We demonstrate how it works here. Parallelisation will not only speed up the processing; because we can do the data reads inside the function, parallelising over scenarios (and gauges, if not spatially aggregating) avoids reading all the data in at once and so reduces unnecessary memory use.
For this demonstration, we are getting the gauge and scenario lists from previously-read data, but in typical use they would be available from scenario metadata.
Note: future is supposed to handle the .export from the calling environment, and seems to do just fine with everything except the aggregation functions. That can happen with nested foreach inside functions, but I think here it might be happening because of the way we’re using {{}} to pass an arbitrary set of functions. It’s easy enough to fix by passing something to .export, but annoying. If we end up not using {future} for parallelisation on Azure, this will be moot.
The example below performs the same processing as above to produce output identical to simpleThemeAgg, but done in parallel. This is slower for this small, simple demonstration because of overhead, but has the potential to be much faster for larger jobs.
We’re not parallelising over gauges here because we’re unlikely to be able to do so with interleaved aggregation steps, but a nested loop would work if we are only aggregating in the time or theme dimensions.
library(foreach)
library(doFuture)
registerDoFuture()
plan(multisession)

allgauges <- 'all' # unique(simpleThemeAgg$gauge)
allscenes <- unique(simpleThemeAgg$scenario)

parThemeAgg <- foreach(s = allscenes,
                       .combine = dplyr::bind_rows) %dopar% {
  # If parallel over gauges
  # foreach(g = allgauges,
  #         .combine = dplyr::bind_rows) %dopar% {
  read_and_agg(datpath = ewr_results, type = 'summary',
               geopath = bom_basin_gauges,
               causalpath = causal_ewr,
               groupers = c('scenario', 'gauge'),
               aggCols = 'ewr_achieved',
               aggsequence = aggseq,
               funsequence = funseq,
               namehistory = TRUE,
               gaugefilter = NULL,
               scenariofilter = s)
}
parThemeAgg
If we’re doing something here that is too big to return the full output (likely in practice), it would also be straightforward for the parallel loop to save the iterations and not return anything. Then we could read the output into the Comparer in pieces.
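A sketch of that save-instead-of-return pattern (the data.frame stands in for the read_and_agg call above, and the output directory is hypothetical):

```r
library(foreach)

# Hypothetical output directory; in practice this would be a project path
out_dir <- file.path(tempdir(), 'themeagg_demo')
dir.create(out_dir, showWarnings = FALSE)

allscenes <- c('base', 'up2')

# Save each scenario's aggregation and return nothing from the loop;
# %do% here for a self-contained demo, %dopar% in real parallel use
invisible(foreach(s = allscenes) %do% {
  result <- data.frame(scenario = s, value = 1)  # stand-in for read_and_agg(...)
  saveRDS(result, file.path(out_dir, paste0('themeagg_', s, '.rds')))
  NULL
})

list.files(out_dir)
#> [1] "themeagg_base.rds" "themeagg_up2.rds"
```

The Comparer can then read those files in pieces rather than holding the combined output in memory.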