Line plots (quantitative x)

library(HydroBOT)
library(ggplot2)
library(dplyr)
library(sf)

Overview

This notebook provides examples of creating line plots, e.g. plots with one quantitative y-axis for outcome, and a quantitative x-axis. The x-axis is typically a quantitative representation of the scenario, e.g. a temperature increase, or for the demonstration here, the flow multiplier. This can be a very powerful analysis, as we can actually investigate numerical changes between scenarios, identifying potential nonlinearities, predictions, and interpolation.

These plots also have the ability to use colour and different colour palettes to include additional information, including spatial unit and type of response. These settings are dealt with more completely in the bar_plots, with much of the mechanics the same for lines, e.g. the use of colorgroups and a list for pal_list.

For a qualitative x, we would typically use bar plots.

Demonstration setup

As usual, we need paths to the data. We use the ‘more scenarios’ examples for all the plots, with processing as in the website workflow.

project_dir <- file.path("more_scenarios")
hydro_dir <- file.path(project_dir, "hydrographs")
agg_dir <- file.path(project_dir, "aggregator_output")

Read in the data

We read in the example data we will use for all plots.

agged_data <- readRDS(file.path(agg_dir, "achievement_aggregated.rds"))

That has all the steps in the aggregation, but most of the plots here will only use a subset to demonstrate.

To make visualisation easier, the SDL units data is given a grouping column that puts the many env_obj variables in groups defined by their first two letters, e.g. EF for Ecosystem Function. These correspond to the ‘Target’ level, but it can be useful to have the two groupings together for some examples.

If we had used multiple aggregation functions at any step, we should filter down to the one we want here, but we only used one for this example.

For simplicity here, we will only look at a small selection of the scenarios (multiplicative changes of 0.5,1, and 2). Thus, we make two small dataframes for our primary examples here.

scenarios_to_plot <- c("climatedown2adapt0", "climatebaseadapt0", "climateup2adapt0")

scenarios <- yaml::read_yaml(file.path(hydro_dir, "scenario_metadata.yml")) |>
  tibble::as_tibble()

basin_to_plot <- agged_data$mdb |>
  dplyr::filter(scenario %in% scenarios_to_plot) |>
  dplyr::left_join(scenarios, by = "scenario")

# Create a grouping variable
obj_sdl_to_plot <- agged_data$sdl_units |>
  dplyr::filter(scenario %in% scenarios_to_plot) |>
  dplyr::mutate(env_group = stringr::str_extract(env_obj, "^[A-Z]+")) |>
  dplyr::arrange(env_group, env_obj) |>
  dplyr::left_join(scenarios, by = "scenario")

Standard scenario appearance

We will typically have a consistent look for the scenarios across the project, with a logical ordering and standard colours. Such standard colours are not included in the {HydroBOT} package because they are project/analysis- specific, but they could be set at project-level, e.g. in the .Rprofile, if desired.

Here, we use the special arguments refvals and refcols to make a colour palette from a standard {paletter} option (“ggsci::nrc_npg”) while setting a specific level to a specified value. We will use the codes (see scenario definitions) rather than the names to make plots readable.

There is a sceneorder argument to plot_outcomes() that lets us explicitly set the order of the scenarios. However, it is typically easiest to simply make the scenarios a factor, though we use the sceneorder argument here. It operates only on a column named ‘scenario’, though, so if other columns need to be ordered they should be made factors before feeding to plot_outcomes().

sceneorder <- forcats::fct_reorder(
  scenarios$scenario,
  scenarios$flow_multiplier
)

scene_pal <- make_pal(unique(scenarios$climate_code),
  palette = "ggsci::nrc_npg",
  refvals = "E", refcols = "black"
)

scene_pal

<colors>
black #E64B35FF #4DBBD5FF #00A087FF #3C5488FF #F39B7FFF #8491B4FF #91D1C2FF #DC0000FF #7E6148FF

Make line plots

We make two sorts of line plots- either straight lines through all the data points, which shows exactly what the results are, and ‘smoothed’ lines, which are fit to the data in some way to summarise a group of outputs. These can yield smoothed curves, but can also be linear regressions or other fits available from ggplot2::geom_smooth.

Lines through all data

A simple plot would be to look at all the outcomes, separated by colour (though we ignore the added water scenarios). Even this simple plot is quite informative- we can see that the env_obj outcomes are differently sensitive to both decreases and increases in flow, and that this differs across space.

sdl_line <- obj_sdl_to_plot |>
  filter(adapt_code == 1) |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    x_col = "flow_multiplier",
    colorgroups = NULL,
    colorset = "env_obj",
    pal_list = list("scico::berlin"),
    facet_row = "SWSDLName"
  )

sdl_line

Separate values within groups

We might not care so much about individual outcomes, but about their groupings, and we can plot those in colour for each env_obj, according to the larger grouping it is part of by changing colorset = 'env_group'. We need to use point_group here to separate out the points for each env_obj, since several will belong to each env_group.

This plot also demonstrates the use of some additional arguments. We’re also using transx to log the x-axis, which is particularly appropriate for the multiplicative flow scaling in this demonstration. We also log the y-axis with transoutcome since we’re using a base_list to baseline (with comp_fun = relative to look at the multiplicative shift in each env_obj to baseline). We’re using various *_lab arguments to adjust the labelling.

Tip

When we want to transform the y-axis and the outcome is on y, we need to use transoutcome instead of transoutcome. That is because the consistent processing of the outcome variable needs to be treated differently than y for plots like heatmaps and maps.

Important

The group_cols argument in base_list is needed to specify unique rows for the baselining, and so must be chosen to provide appropriate groupings (unique levels) for that baselining. For example, if we are baselining scenarios, we need to include every column that might vary within scenarios. This might differ from point_group, which only determines the visual grouping with group in ggplot2::aes().

Scientifically, one important thing to note here is that the range on y (0-10) is much greater than the range on x (0.3 - 3), and so (unsurprisingly), some outcomes are disproportionately impacted by flow. Other outcome values are less than the relative shift in flow, and so there are others that are disproportionately insensitive. These disproportionate responses also depend on whether flows decrease or increase- they are not symmetric.

sdl_line_options <- obj_sdl_to_plot |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    x_col = "flow_multiplier",
    y_lab = "Proportion met",
    x_lab = "Change in flow",
    transx = "log10",
    transoutcome = "log10",
    color_lab = "Environmental\ngroup",
    colorset = "env_group",
    pal_list = list("scico::berlin"),
    point_group = "env_obj",
    facet_row = "SWSDLName",
    sceneorder = sceneorder,
    base_list = list(
      base_lev = "climatebaseadapt0",
      comp_fun = "relative",
      group_cols = c("env_obj", "polyID")
    )
  )

sdl_line_options

We can also give the groups different palettes, as demonstrated more completely in the bar plots and causal networks. Now, we don’t need point_group anymore, since the colours are assigned to the unique env_objs.

# Create a palette list
grouplist <- list(
  EF = "grDevices::Purp",
  NF = "grDevices::Mint",
  NV = "grDevices::Burg",
  OS = "grDevices::Blues",
  WB = "grDevices::Peach"
)

sdl_line_groups <- obj_sdl_to_plot |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    x_col = "flow_multiplier",
    y_lab = "Proportion met",
    x_lab = "Change in flow",
    transx = "log10",
    transoutcome = "log10",
    color_lab = "Environmental\ngroup",
    colorgroup = "env_group",
    colorset = "env_obj",
    pal_list = grouplist,
    facet_row = "SWSDLName",
    base_list = list(
      base_lev = "climatebaseadapt0",
      comp_fun = "relative",
      group_cols = c("env_obj", "polyID")
    )
  )

sdl_line_groups

Figure 1: Change in proportion of environmental objectives met in each scenario, relative to the historical baseline, dependent on the shift in flow. Groups of environmental objectives plotted from different colour palettes.

That’s fairly complex, so we can facet it, as we did with the bars to make the individual env_objs easier to see.

sdl_line_groups_facet <- obj_sdl_to_plot |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    x_col = "flow_multiplier",
    y_lab = "Proportion met",
    x_lab = "Change in flow",
    transx = "log10",
    transoutcome = "log10",
    color_lab = "Environmental\ngroup",
    colorgroup = "env_group",
    colorset = "env_obj",
    pal_list = grouplist,
    facet_row = "SWSDLName",
    facet_col = "env_group",
    base_list = list(
      base_lev = "climatebaseadapt0",
      comp_fun = "relative",
      group_cols = c("env_obj", "polyID")
    )
  )

sdl_line_groups_facet

The above is typically how we would go about this facetting, but it is worth reiterating that these are just ggplots, and so we can post-hoc add facetting. Using the version with only spatial facetting (Figure 1), we can add the env_group facet on, matching Figure 2. Note that we re-build all the facets here, due to the specification of ggplot2::facet_grid.

sdl_line_groups + facet_grid(SWSDLName ~ env_group)

As with the bar plots, we can colour by any column we want, and the spatial units is a logical choice. We again use point_group, since multiple env_obj rows are mapped to each colour (spatial unit). The overplotting gets unreadable here and so I’ve retained some facetting, but the best combination of colours, row facets, and column facets will be project dependent and also depend on the number of values being plotted. Another solution is to summarise the data with a smoother- see Section 4.2.

sdl_line_sdl <- obj_sdl_to_plot |>
  # filter(env_group == "EF") |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    x_col = "flow_multiplier",
    y_lab = "Proportion met",
    x_lab = "Change in flow",
    transx = "log10",
    transoutcome = "log10",
    color_lab = "SDL unit",
    colorset = "SWSDLName",
    pal_list = list("ggsci::default_jama"),
    point_group = "env_obj",
    facet_col = "env_group",
    base_list = list(
      base_lev = "climatebaseadapt0",
      comp_fun = "relative",
      group_cols = c("env_obj", "polyID")
    )
  )

sdl_line_sdl

Figure 3: Change in proportion of environmental objectives met in each scenario, relative to the historical baseline, dependent on the shift in flow. colours indicate SDL unit, each line is an `env_obj`.

Smoothing (fit lines)

We can use smoothing to fit lines through multiple points, e.g. if we want to group data in some way- maybe use it to put a line through the colour groups and ignore individual levels. We demonstrate here using them to illustrate unique outcomes, as well as more typical uses as lines of best fit that aggregate over a number of outcomes.

To get smoothed lines, we use smooth_arglist = list(). By default, (just an empty list()), that produces a loess fit (as with ggplot2::geom_smooth, but we can also pass arguments to geom_smooth() in smooth_arglist, and so allow things like lm and glm fits.

Unique points

Fitting lines through unique points at each scenario level is a bit contrived vs. just plotting lines as above, but it can be useful if we want to accentuate nonlinear relationships. Linear fits are possible too, though these are typically less useful.

With unique points, this just fits a single curved line through each env_obj. Recapitulating the above, we colour here from SDL unit.

sdl_smooth_sdl <- obj_sdl_to_plot |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    x_col = "flow_multiplier",
    y_lab = "Proportion met",
    x_lab = "Change in flow",
    transx = "log10",
    color_lab = "Catchment",
    colorgroups = NULL,
    colorset = "SWSDLName",
    point_group = "env_obj",
    pal_list = list("ggsci::default_jama"),
    facet_row = "env_group",
    base_list = list(
      base_lev = "climatebaseadapt0",
      comp_fun = "relative",
      group_cols = c("env_obj", "polyID")
    ),
    smooth_arglist = list()
  )

sdl_smooth_sdl

And we can do the same for environmental groupings. This makes much more sense as an analysis if we fit the line through the group, as we do in Section 4.2.2. We use zero_adjust = 'auto' to adjust values off zero for the log transform of y.

sdl_smooth_groups <- obj_sdl_to_plot |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    x_col = "flow_multiplier",
    y_lab = "Proportion met",
    x_lab = "Change in flow",
    transx = "log10",
    transoutcome = "log10",
    color_lab = "Environmental grouping",
    colorgroups = NULL,
    colorset = "env_group",
    point_group = "env_obj",
    pal_list = list("scico::berlin"),
    facet_row = "env_group",
    facet_col = "SWSDLName",
    base_list = list(
      base_lev = "climatebaseadapt0",
      comp_fun = "relative",
      group_cols = c("env_obj", "polyID")
    ),
    smooth_arglist = list(),
    zero_adjust = "auto"
  )
sdl_smooth_groups

Using method = 'lm' in smooth_arglist yields a linear fit (and any other method accepted by ggplot2::geom_smooth() should work). It does not recapitulate the simple lines above, however, because it fits the line through all the scenario data points, rather than simply joining them together. I have passed se = FALSE to ggplot2::geom_smooth() here because with unique groups the standard errors are enormous. Again, this is a bit contrived unless we want to look for a trend for each outcome at this level. A more typical way of looking at things would be to look at the trend across lots of values within some group, Section 4.2.2.

obj_sdl_to_plot |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    x_col = "flow_multiplier",
    y_lab = "Proportion met",
    x_lab = "Change in flow",
    transx = "log10",
    color_lab = "Environmental grouping",
    colorset = "env_group",
    point_group = "env_obj",
    pal_list = list("scico::berlin"),
    facet_row = "env_group",
    facet_col = "SWSDLName",
    base_list = list(
      base_lev = "climatebaseadapt0",
      comp_fun = "relative",
      group_cols = c("env_obj", "polyID")
    ),
    smooth_arglist = list(method = "lm", se = FALSE)
  )

Fit multiple points

Fitting lines is most often associated with things like regression and loess smoothing, where we use it to aggregate over a number of datapoints to find the line of best fit. We can do that here, simply by not having all points accounted for across the facetting, point_group, and colorset.

Note

group_cols in base_list should still include unique values, because group_cols determines the baselining (e.g. what gets compared), not the plot groupings.

One example would be to perform the same analysis as in Figure 3, but instead of plotting each point, fit a line to show the mean change within each SDL unit. We’ve pulled env_obj out of point_group, but left it in group_cols, because we still want each env_obj baselined with itself, not to the mean of env_group. Now, we can look at all the env_groups, because there are far fewer lines and so the overplotting isn’t an issue.

We use a small add_eps to avoid zeros and allow all data to be relativised and plotted.

sdl_fit_sdl <- obj_sdl_to_plot |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    x_col = "flow_multiplier",
    y_lab = "Proportion met",
    x_lab = "Change in flow",
    transx = "log10",
    transoutcome = "log10",
    color_lab = "SDL unit",
    colorset = "SWSDLName",
    pal_list = list("ggsci::default_jama"),
    facet_wrapper = "env_group",
    base_list = list(
      base_lev = "climatebaseadapt0",
      comp_fun = "relative",
      add_eps = "auto",
      group_cols = c("env_obj", "polyID")
    ),
    smooth_arglist = list()
  )

sdl_fit_sdl

Figure 4: Change in proportion of environmental objectives met in each scenario, relative to the historical baseline, dependent on the shift in flow. Fits are loess smoothers. colours indicate SDL unit, which have single lines. Each point is an `env_obj`.

We can make a very similar plot, looking at the environmental groups, a smooth fit of Figure 1 . We use a position argument (which passes to {ggplot2}, and so has the same syntax) to see overplotted points, and an add_eps to avoid zeros to relativise and plot all the data.

sdl_fit_groups <- obj_sdl_to_plot |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    x_col = "flow_multiplier",
    y_lab = "Proportion met",
    x_lab = "Change in flow",
    transx = "log10",
    transoutcome = "log10",
    color_lab = "Environmental\ngroup",
    colorset = "env_group",
    pal_list = list("scico::berlin"),
    facet_row = "SWSDLName",
    facet_col = ".",
    base_list = list(
      base_lev = "climatebaseadapt0",
      comp_fun = "relative",
      add_eps = "auto",
      group_cols = c("env_obj", "polyID")
    ),
    smooth_arglist = list(),
    position = position_jitter(width = 0.01, height = 0)
  )

sdl_fit_groups

Figure 5: Change in proportion of environmental objectives met in each scenario, relative to the historical baseline, dependent on the shift in flow. Fits are loess smoothers. colours indicate Environmental groups, which have single lines. Each point is an `env_obj`.

As we saw above, we can use method = 'lm' to plot a regression, though in general we do not expect these relationships to be linear, and mathematically characterising them will be a complex task that is not the purview of plotting.

A linear fit of the SDL units ( Figure 6 ) is one example of how this might work. It is useful to know here that deviations from a 1:1 line on logged axes as here means that the outcomes are responding disproportionately more (steeper) or less (shallower) than the underlying changes to flow.

sdl_lm_sdl <- obj_sdl_to_plot |>
  plot_outcomes(
    outcome_col = "ewr_achieved",
    x_col = "flow_multiplier",
    y_lab = "Proportion met",
    x_lab = "Change in flow",
    transx = "log10",
    transoutcome = "log10",
    color_lab = "SDL unit",
    colorset = "SWSDLName",
    pal_list = list("ggsci::default_jama"),
    facet_wrapper = "env_group",
    base_list = list(
      base_lev = "climatebaseadapt0",
      comp_fun = "relative",
      add_eps = "auto",
      group_cols = c("env_obj", "polyID")
    ),
    smooth_arglist = list(method = "lm")
  )

sdl_lm_sdl

Figure 6: Change in proportion of environmental objectives met in each scenario, relative to the historical baseline, dependent on the shift in flow. Fits are linear regressions. colours indicate SDL unit, which have single lines. Each point is an `env_obj`.