Comparer overview

Author

Galen Holt

The Comparer has two components: the underlying functions and structure that perform comparisons and other analyses, and plotting capabilities (with other presentation types to come) that produce a set of standardised plots covering the most important data visualisations.

There is quite a lot of flexibility built into the Comparer, because different uses and different questions will require different outputs to assess, whether that means different scales of analysis, different types of plots, or different numerical comparisons. It is also likely that the desired outputs will change as we work through options.

While this is called the ‘Comparer’, it also contains other functionality related to analysis generally, and can produce plots that do not include comparisons, e.g. hydrographs that simply illustrate historical flows.

Nearly all plots of outcomes (i.e. not hydrographs) are made with plot_outcomes, including bars, lines, and maps. This is because, at their foundation, they are all plotting a quantitative outcome with grouping of some sort. The data preparation is the same across all of them, as are many of the arguments to ggplot. We may, though, eventually separate the data preparation from the plotting to make this function cleaner.

Nearly all plots (with the current exception of the causal networks) are made in ggplot2 and return ggplot2 objects, which can easily be further modified. The plot functions here simply wrap ggplot2 to standardise appearance and data preparation. Though it can be annoying not to use ggplot() directly to make the plots, one major advantage of the plotting functions here is that any changes made to clean the data for a given plot happen inside the function and are not preserved, so it is far easier to keep the data clean, know what the data is, and avoid accidental overwriting or mislabelling of data. If we consistently make the same changes with slight modifications, we can write a function, e.g. plot_prep, avoiding lots of copy-paste and its attendant errors.
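As a minimal sketch of why wrapping ggplot2 keeps the data clean, the example below uses a hypothetical wrapper, plot_bars_sketch, and made-up data; it is not the toolkit's plot_outcomes and does not reflect its arguments. The cleaning steps act only on the function's local copy, and the returned ggplot2 object can still be modified afterwards.

```r
library(ggplot2)

# Hypothetical wrapper for illustration only (not a toolkit function).
plot_bars_sketch <- function(df, scenario_order) {
  # These changes apply to the function's local copy of df;
  # the caller's data frame is left untouched.
  df$scenario <- factor(df$scenario, levels = scenario_order)
  df <- df[!is.na(df$outcome), ]

  ggplot(df, aes(x = scenario, y = outcome)) +
    geom_col()
}

# Made-up data standing in for aggregator output.
agg <- data.frame(
  scenario = c("up4", "base", "down4"),
  outcome = c(0.7, 0.4, NA)
)

p <- plot_bars_sketch(agg, scenario_order = c("down4", "base", "up4"))

# The result is an ordinary ggplot2 object, so it can be modified further.
p2 <- p + labs(y = "Proportion of objectives met") + coord_flip()
```

After the call, agg still has all three rows, including the missing value, so the same clean data can be passed to the next plot.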

There is a clear opportunity for reactivity with nearly all plots, allowing a user to select plot types and any filtering (especially for networks, spatial units, etc.) and then producing the plot.

Comparisons

Nearly all analyses will be comparisons of some kind, and so the Comparer provides the capacity to produce plots comparing scenarios, time, space, and themes. It can also calculate comparison values and plot these. There are a few primary sorts of plots:

- Input descriptions (e.g. hydrographs)
- Scaling descriptions (e.g. tiles showing how the aggregation proceeded)
- Quantitative outcome plots (values on y - 1d)
  - Bar
  - Line
  - Timeseries
    - currently only hydrographs, ongoing development for other variables
- Quantitative outcome plots (values in color - 2d)

Most but not all of these plots are currently implemented. Those that aren’t (timeseries, better heatmaps) are high priority, but are waiting for data formats and inputs to shape up a bit more. They will use the same underlying plot scaffolding and data preparation functions, and so should be quick to implement.

Standardization

The plot_* functions use a set of theming and color controls to maintain a standard look and calculation structure. These are described below. It is perhaps most important to know that these functions are also exposed to the user, and so these themes and standard approaches to appearance and baselining can be used for one-off ad-hoc plots. While the plot_* functions automate much of this, keep the environment clean, and reduce errors in data management and plotting, we will sometimes just want to throw together a quick ggplot call, and these functions will be useful for that as well. An intermediate approach is to call plot_prep, which automates much of the colouring and baselining, and then make ad-hoc plots with the resulting dataframe.

Theme

I have developed a theme_werp_toolkit() ggplot theme that we use to get a consistent look. We can build on this and change it as we go, as it is fairly simple at present. Additional theme arguments can be passed to it, if we want to change any of the other arguments in ggplot2::theme() on a per-plot basis. By default, theme_werp_toolkit is applied when making the plots inside plot_outcomes and plot_hydrographs, though it can be applied to any ggplot object.
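The general pattern of a project theme that accepts extra ggplot2::theme() arguments can be sketched as follows. The function theme_sketch, its default settings, and the data are all assumptions made for illustration; they are not the contents of theme_werp_toolkit().

```r
library(ggplot2)

# Hypothetical stand-in for a project theme (NOT theme_werp_toolkit()).
theme_sketch <- function(...) {
  # Start from a complete built-in theme, set some project-wide defaults,
  # then let the caller override anything with ggplot2::theme() arguments.
  theme_bw() +
    theme(
      strip.background = element_blank(),
      legend.position = "bottom"
    ) +
    theme(...)
}

# Made-up data for illustration.
flows <- data.frame(
  scenario = rep(c("base", "up4"), each = 3),
  year = rep(2001:2003, times = 2),
  outcome = c(0.2, 0.4, 0.3, 0.5, 0.7, 0.6)
)

# Apply the theme to any ggplot object, overriding one default per plot.
p <- ggplot(flows, aes(x = year, y = outcome, colour = scenario)) +
  geom_line() +
  theme_sketch(legend.position = "right")
```

Because later theme() calls override earlier ones, the per-plot arguments take precedence over the defaults set inside the function.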

Colour

At present, I do not enforce a standard set of colors; they’ll change between scenarios/projects and there are too many possibilities of what we might plot. I do provide default palettes for the plotting functions, but we will likely want to change them depending on what we plot. We could enforce palettes within projects, however, once the plots firm up. In general, colors can be specified either manually (usually with the help of make_pal) or with {paletteer} palettes, which offer a wide range of options with a standard interface and the ability to choose palettes by name. A good reference for the available palettes is here, and demonstrations of both ways of specifying colors are throughout the examples, but specifically bar plots.

Though we do not enforce standard colors, we have established the infrastructure to set consistent colors within a project by using named color objects. This is particularly useful for scenarios, but can also be used for other qualitative categories. Coloring of quantitative outcomes (e.g. matching different palettes to outcome x vs y) is not handled automatically at present, but is left to the user. I expect that as a project proceeds and settles on desired outcomes, we will standardize color palettes for different outcome variables, scenarios, etc.
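The idea behind named color objects can be illustrated with plain ggplot2. In this sketch the scenario names, hex codes, and data are assumptions, and the vector is built by hand rather than with make_pal, whose arguments are not shown here.

```r
library(ggplot2)

# A named colour vector: names are scenario labels, values are colours.
# (Scenario names and hex codes are made up for illustration.)
scenario_cols <- c(base = "#1b9e77", down4 = "#d95f02", up4 = "#7570b3")

# Made-up outcomes containing only a subset of the scenarios.
outcomes <- data.frame(
  scenario = c("up4", "base"),
  outcome = c(0.7, 0.4)
)

# Colours are matched to scenarios by name, so "base" gets the same colour
# in every plot regardless of which other scenarios are present or their order.
p <- ggplot(outcomes, aes(x = scenario, y = outcome, fill = scenario)) +
  geom_col() +
  scale_fill_manual(values = scenario_cols)
```

Reusing the same named vector across plots is what keeps scenario colors consistent within a project.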

There is some interesting ability to set colors within a single column based on different palettes, which can be a useful way to indicate grouping variables. This is available everywhere, but is best demonstrated in the bar plots and causal plots.

Internal calculations and structure

While the plots are the typical outputs of the Comparer, it also has a set of useful functions for preparing data, including calculating values relative to a baseline (baseline_compare) using either the default functions difference and relative or any user-supplied function.
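To make the baselining idea concrete, here is a minimal base R sketch. The helper compare_to_baseline, the column names, and the definitions of the two comparison functions (subtraction and division) are assumptions for illustration; they do not reproduce the signature or internals of baseline_compare.

```r
# Assumed comparison functions: value minus baseline, and value over baseline.
difference_fn <- function(x, base) x - base
relative_fn <- function(x, base) x / base

# Hypothetical helper (NOT the toolkit's baseline_compare()).
compare_to_baseline <- function(df, base_scenario, FUN = difference_fn) {
  # Pull out the baseline value for each group...
  base <- df[df$scenario == base_scenario, c("group", "value")]
  names(base)[2] <- "base_value"
  # ...attach it to every row of the same group...
  out <- merge(df, base, by = "group")
  # ...and apply the comparison function.
  out$compared <- FUN(out$value, out$base_value)
  out
}

# Made-up data: two scenarios, two groups.
dat <- data.frame(
  scenario = rep(c("base", "up4"), each = 2),
  group = rep(c("fish", "birds"), times = 2),
  value = c(10, 20, 15, 10)
)

diffs <- compare_to_baseline(dat, "base")
rels <- compare_to_baseline(dat, "base", FUN = relative_fn)

# Any user-supplied function of (value, baseline) can be used instead.
prop_change <- compare_to_baseline(dat, "base",
                                   FUN = function(x, b) (x - b) / b)
```

Passing the comparison as a function argument is what allows the same data preparation to serve differences, ratios, or any other comparison.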

There is an internal function plot_prep that does all the data prep, including applying baseline_compare, finding colors, and setting established scenario orders. This keeps plots and the data processing consistent, and dramatically reduces the error-prone copy-pasting of data processing with minor changes for different plots. Instead, we can almost always feed the plotting functions the same set of clean data straight out of the aggregator, and just change the arguments to the plot functions.

Baselining is available as a standalone function (baseline_compare) and can be done automatically in the plot_* functions. This capacity is demonstrated in all the plot examples, but in most detail in the hydrographs notebook.

One critical issue, particularly with complex data, is being unaware of overplotting values. The plot_* functions have internal checks that the number of rows of data matches the number of axes on which the data is plotted (including facets, colors, linetype, etc). This prevents things like plotting a map of env_obj data facetted only by scenario, where each fill would represent outcomes for all env_obj, which is meaningless but very easy to do. The exception is that points are allowed to overplot, though we can use the position = 'position_jitter' to avoid even that.
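A check of this kind can be sketched in base R by comparing the number of rows with the number of unique combinations of the plotted dimensions. The function check_overplot, the column names, and the data are hypothetical; the toolkit's internal checks may be implemented differently.

```r
# Hypothetical check (NOT the toolkit's internal implementation).
# plot_dims names every column mapped to an axis, facet, colour, etc.
check_overplot <- function(df, plot_dims) {
  n_combos <- nrow(unique(df[, plot_dims, drop = FALSE]))
  if (n_combos != nrow(df)) {
    stop("Data has ", nrow(df), " rows but only ", n_combos,
         " unique combinations of the plotted dimensions; ",
         "values would overplot.")
  }
  invisible(TRUE)
}

# Made-up data: two scenarios, two polygons, two env_obj.
env <- data.frame(
  scenario = rep(c("base", "up4"), each = 4),
  polygon = rep(rep(c("A", "B"), each = 2), times = 2),
  env_obj = rep(c("EF1", "NF1"), times = 4),
  value = c(0.1, 0.9, 0.3, 0.6, 0.2, 0.8, 0.5, 0.4)
)

# A map facetted only by scenario: each polygon would hold two env_obj values.
bad <- tryCatch(check_overplot(env, c("scenario", "polygon")),
                error = function(e) conditionMessage(e))

# Adding env_obj as a plotted dimension gives one row per plotted element.
ok <- check_overplot(env, c("scenario", "polygon", "env_obj"))
```

Here bad holds the error message and ok is TRUE, mirroring the map example above where facetting only by scenario would hide multiple env_obj values behind each fill.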