Important outcome of 2019 review

WGSAM notes that streamlining processes for updating data would be helpful to facilitate future key-runs. For all meetings after 2019, WGSAM will require draft key-run results and documentation two weeks to one month prior to the meeting so that a more thorough review can be completed before the meeting, reserving meeting time for model comparisons, ensemble modelling, and agreeing on recommendations. This means that data updates must be requested and provided in a timely manner. If the full results and documentation are not available prior to the meeting, WGSAM will not conduct a key-run review.

Background

This document provides criteria for consistent review of models by the multispecies assessment working group of the International Council for the Exploration of the Sea (ICES WGSAM). For nearly a decade, WGSAM has reviewed model “key-runs” as part of its Terms of Reference. Recently, WGSAM reviewed key-runs for the North Sea SMS model in 2014 and 2017, the North Sea EwE model in 2015, and the Baltic EwE model in 2016. Key-run reviews are scheduled for the Baltic Sea Gadget and SMS models and the Irish Sea EwE model in 2019.

WGSAM Term of Reference b for 2019-2021 reads:

Update of key-runs (standardized model runs updated with recent data, producing agreed output and agreed upon by WGSAM participants) of multispecies and ecosystem models for different ICES regions. The key-runs provide information on natural mortality for inclusion in various single species assessments. Deliverables: Report on output of multispecies models including stock biomass and numbers and natural mortalities for use by single species assessment groups and external users.

Because WGSAM is increasingly asked to provide model framework reviews as well as key-run reviews, we have drafted this document to provide consistent guidelines and review criteria both for reviewers and for groups submitting models for review. Guidelines are based on experience from past reviews (see WGSAM reports from 2013-2018 and, e.g., https://www.st.nmfs.noaa.gov/science-quality-assurance/cie-peer-reviews/peer-review-reports) and on best practices outlined in the literature [1,2].

Model Life Cycle and Objectives for Evaluation

The U.S. National Research Council has summarized the general objectives for model evaluation and tailored them to different stages of the model life cycle, with reference to models used in environmental regulation processes [1]. The application of multispecies and ecosystem models within fishery management processes is similar enough that this summary provides a useful framework for our criteria.

The general objectives of model review are threefold ([1], p. 108):

  1. Is the model based on generally accepted science and computational methods?
  2. Does it work, that is, does it fulfill its designated task or serve its intended purpose?
  3. Does its behavior approximate that observed in the system being modeled?

The model life cycle further specifies review priorities.

Figure: Model Life Cycle (NRC 2007, redrawn)

WGSAM receives most requests for model review after the problem identification and conceptual model stage. However, it is important to provide documentation of these processes to reviewers so that the completed model can be evaluated.

In addition, models involved in a management process may face a tradeoff between complexity and transparency, where the need to account for many interactions and processes may make the model harder to explain to decision-makers, and perhaps harder for them to accept [1]. Because the audience for WGSAM key-runs tends to be other scientists, evaluating the extent to which models are transparent to a scientific, stock-assessment-oriented audience is appropriate here.

We consider WGSAM reviews to be “peer review.”

Peer review attempts to ensure that the model is technically adequate, competently performed, properly documented, and satisfies established quality requirements through the review of assumptions, calculations, extrapolations, alternate interpretations, methodology, acceptance criteria, and/or conclusions pertaining to a model or its application. [1]

Key-run Reviews

As described above, model key-runs are currently used to provide inputs to other assessment models; specifically, natural mortality (\(M\)) time series. This places key-runs clearly within the “Model Use” phase of the life cycle. This means that reviews should evaluate (from the figure above):

  1. Appropriateness of the model for the problem (problem identification)
  2. Assumptions (scientific basis, computational infrastructure; adequacy of conceptual model)
  3. Input data quality
  4. Comparison with observations
  5. Uncertainty/sensitivity analysis
  6. Peer review (WGSAM’s role, but consider previous reviews from model construction)

Reviewers will rely on submitted documentation to address these issues. Where documentation is inadequate to address a given point, this will be noted. Review criteria for each point are outlined below, and presentations should include this information.

Is the model appropriate for the problem?

Define the problem and explain why this model is appropriate for it (or, for reviewers, explain why it is not).

Define the focal species and the spatial and temporal resolution needed to address the problem.

Current uses:

For example, we are asked to provide M-at-age time series for North Sea and Baltic herring, cod, whiting, haddock, sprat, sandeel, and possibly other species. Spatial scale is at the stock level and temporal resolution is annual, starting at a stock-specific year and going to 2018.

Therefore, the multispecies model(s) must provide this output, and sensitivity in this particular output is most important. However, there are other potential uses for these models that have yet to be defined.
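
To make the deliverable concrete, here is a minimal sketch, with entirely hypothetical values and stock code, of an M-at-age table in a form that single-species assessments could read in:

```python
# Minimal sketch (hypothetical values) of the deliverable format: an annual
# M-at-age time series per stock for use by single-species assessments.
import pandas as pd

m_at_age = pd.DataFrame(
    {
        "stock": ["her.27.3a47d"] * 4,   # hypothetical stock code
        "year": [2017, 2017, 2018, 2018],
        "age": [0, 1, 0, 1],
        "M": [1.10, 0.65, 1.05, 0.62],   # predation + residual mortality
    }
)
print(m_at_age.pivot(index="year", columns="age", values="M"))
```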

A new use for WKIRISH:

The aim with the Irish Sea Ecopath is to use the model to “fine tune” the quota advice within the predefined EU Fmsy ranges: in “good” conditions fishing could be at the top of the range, while in “poor” conditions it should be lower in the range. The range has already been evaluated as giving good yield while remaining precautionary, so it is suitable for use in ICES advice; reviewers should keep this intended use in mind.

For the Irish Sea EwE model, key outputs will be used to determine where the reference point should sit within the MSY range for each species. Therefore, outputs defining ecosystem conditions and both ecosystem and species productivity under the prevailing conditions are most important.

Is the scientific basis of the model sound?

If the model is an established and previously reviewed framework, a general comment to that effect is sufficient. If it is not, WGSAM would apply the methods outlined for “Constructed Model” review in the flowchart above, i.e., a model framework review (see General Model Reviews below).

Is the input data quality and parameterization sufficient for the problem?

See the problem definition above. Which datasets are adequate, which could be improved, and which are missing?

Show the input data as a simple chart: the beginning and end of each time series, gaps, differing time series lengths, and the spatial resolution of the data.
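
A minimal sketch of such a coverage chart, using hypothetical dataset names and years, could look like this:

```python
# Minimal sketch (hypothetical dataset names and years): plot input data
# coverage so reviewers can see start/end of each time series and any gaps.
import matplotlib.pyplot as plt

# Each dataset maps to the list of years it covers (gap years simply omitted).
coverage = {
    "Catch at age": list(range(1974, 2019)),
    "Survey index": list(range(1983, 2019)),
    "Stomach contents": [1981, 1991, 2013],  # sparse sampling years
}

fig, ax = plt.subplots(figsize=(8, 2.5))
for i, (name, years) in enumerate(coverage.items()):
    ax.scatter(years, [i] * len(years), marker="s", s=20)
ax.set_yticks(range(len(coverage)))
ax.set_yticklabels(coverage.keys())
ax.set_xlabel("Year")
ax.set_title("Input data coverage (illustrative)")
plt.tight_layout()
plt.show()
```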

Give information on input data pedigree/quality: a reference for the source, whether it is survey data or output from another model, and whether confidence intervals or other uncertainty measures are available and used in the model.

Categorize the assumptions behind modeled ecological or biological processes. Emphasize those related to species interactions (predation, competition), environmental pressures, and also fleet dynamics if needed to address the problem. If the model is spatial, how do these processes happen in space?

Is the parameterization consistent with scientific knowledge (e.g., PREBAL diagnostics [3] for general relationships across trophic levels, sizes, etc.)?
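
As an illustration (not the PREBAL procedure itself, and with invented group names, trophic levels, and biomasses), a simple check of how biomass declines across trophic levels might look like the sketch below; the decision thresholds of Link (2010) [3] are not reproduced here:

```python
# Minimal PREBAL-style sketch (hypothetical groups and values): regress
# log10(biomass) on trophic level and report the slope, which reviewers can
# compare against the guidance in Link (2010) [3].
import numpy as np

groups = {  # group: (trophic level, biomass in t km^-2) -- illustrative only
    "Phytoplankton": (1.0, 50.0),
    "Zooplankton":   (2.1, 18.0),
    "Sprat":         (3.2, 4.0),
    "Herring":       (3.4, 6.0),
    "Cod":           (4.3, 1.2),
}

tl = np.array([v[0] for v in groups.values()])
log_b = np.log10([v[1] for v in groups.values()])
slope, intercept = np.polyfit(tl, log_b, 1)
print(f"Decline in log10(biomass) per trophic level: {slope:.2f}")

# Flag groups far from the fitted line as candidates for closer inspection.
resid = log_b - (slope * tl + intercept)
for name, r in zip(groups, resid):
    if abs(r) > 0.5:  # arbitrary illustrative cutoff, not a PREBAL rule
        print(f"Check parameterization of {name} (residual {r:.2f})")
```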

Does model output compare well with observations?

Here we refer to the more detailed performance criteria developed in [2]. We have modified them for our purposes.

Characterize the reference dataset used for comparisons. Were the data used to construct this model? Is the reference dataset from another model? Describe the reference data source(s).

  1. (if projection is an important use) All functional groups persist in an unfished, unperturbed run

  2. (if projection is an important use) The model stabilizes over the last ~20 years of an unfished, unperturbed 80-100 year run

  3. The key-run should define the hindcast time period over which agreement with other data sources or assessments is needed. The review will determine whether the model fits adequately within that period. Error ranges are needed for the comparison or reference datasets.

  4. Focal species should match biomass and catch trends over the hindcast time period. For full-system models, species comprising a majority of biomass should also match general hindcast trends. Suggested tests include modeling efficiency, RMSE, etc. [4,5,8]; see the skill metric sketch after this list.

  5. Patterns of temporal variability are captured (emergent, or forced with e.g. recruitment time series)

  6. Productivity of focal species (or of groups totaling ~80% of system biomass in full-system models) should qualitatively match life history expectations (PREBAL diagnostics [3])

  7. Natural mortality decreases with age for the majority of groups

  8. Age and length structure qualitatively matches expectations for the majority of groups

  9. Predicted diets qualitatively match empirical diet compositions for the majority of groups

  10. Spatial distributions of outputs match reference datasets for spatial models (most important if output is required at the spatial resolution of the model; comment if the match holds in aggregate but not at finer resolution)

  11. Ecosystem indicators (e.g. the relationship between abundance and body size, the pelagic-to-demersal ratio, the Large Fish Indicator) match reference data if needed for the problem
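
As an illustration of the suggested skill tests, the sketch below computes RMSE and modeling efficiency (MEF) for a hypothetical pair of observed and predicted series; the metric definitions follow the skill-assessment literature cited above [4,5,8], but the data are invented:

```python
# Minimal sketch of two suggested skill metrics (hypothetical data):
# root mean squared error (RMSE) and modeling efficiency (MEF).
import numpy as np

obs = np.array([120.0, 135.0, 150.0, 140.0, 160.0])   # e.g. survey biomass
pred = np.array([118.0, 140.0, 145.0, 150.0, 155.0])  # model hindcast

rmse = np.sqrt(np.mean((pred - obs) ** 2))

# MEF: 1 = perfect fit; values <= 0 mean the model predicts the observations
# no better than their mean does.
mef = 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

print(f"RMSE = {rmse:.1f}, MEF = {mef:.2f}")
```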

Uncertainty

Has uncertainty been assessed in the output of interest? Has a sensitivity analysis been performed, and how do its results affect those outputs?

The key-run should show estimates of uncertainty in the output quantity of interest. A full uncertainty analysis that estimates confidence intervals is best where possible. If that is not possible, list the key sources of uncertainty and the expected bounds on outputs based on those sources (possibly derived from sensitivity analysis); in other words, design the sensitivity analysis to approximate an uncertainty analysis.

Specific analyses, examining the sensitivity of key outputs, include:

  1. Retrospective analysis (5-year peel of all input data); see the sketch after this list

  2. Forecast uncertainty: remove only the last 3-5 years of the survey index to see how well the model performs in forecast mode, given the catch that actually occurred.

  3. Sensitivity to stomach data and other key or low-confidence data sources

  4. Sensitivity to key parameters: consumption rates, residual mortality (M1, M0)

  5. Sensitivity to initial conditions
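
For the retrospective analysis, one common way to summarize the peels is Mohn's rho; the criteria above do not prescribe a particular statistic, so the sketch below, with invented values, is only one option:

```python
# Minimal sketch (hypothetical values): summarize a 5-year retrospective peel
# with Mohn's rho, a common retrospective statistic in ICES assessments.
import numpy as np

# Terminal-year estimates of an output (e.g. M or biomass) from each peeled
# run, paired with the full run's estimate for the same year.
peel_estimates = np.array([0.42, 0.38, 0.45, 0.40, 0.37])
full_estimates = np.array([0.40, 0.40, 0.41, 0.39, 0.39])

rel_diff = (peel_estimates - full_estimates) / full_estimates
mohns_rho = rel_diff.mean()
print(f"Mohn's rho over 5 peels: {mohns_rho:.3f}")
```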

For complex models with long runtimes, simpler ways to address uncertainty may be appropriate [2].

Best practice is to retain multiple parameterizations that meet the above criteria to allow scenario testing across that range. Parameter uncertainty can be addressed even in complex models; a simple method uses bounding (e.g. base, low-bound, and high-bound productivity scenarios; [9]).
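
A minimal sketch of the bounding approach, with a stand-in model function and invented productivity multipliers, might look like this:

```python
# Minimal sketch of bounding scenarios (all names and values hypothetical):
# run the same model under base, low, and high productivity parameterizations
# and report the envelope of the output of interest.
def run_model(productivity_multiplier):
    """Stand-in for a key-run; returns, e.g., terminal-year M for one stock."""
    base_m = 0.45
    return base_m * (1.2 - 0.2 * productivity_multiplier)  # toy response

scenarios = {"low": 0.8, "base": 1.0, "high": 1.2}
results = {name: run_model(mult) for name, mult in scenarios.items()}
lo, hi = min(results.values()), max(results.values())
print(results)
print(f"Bounding envelope for the output of interest: [{lo:.3f}, {hi:.3f}]")
```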

Previous peer review

What did previous reviews point out, and have those issues been addressed?

Review of constructed models should have evaluated spatial and temporal resolution, algorithm choices, data availability and software tools, quality assurance/quality control of code, and test scenarios.

Review recommendations

WGSAM key-run review reports will address the sections above, and then make a recommendation for the appropriate uses of model outputs.

WGSAM key-run review reports will also end with a list of recommendations for items to be addressed in future key-runs.

Section below not used in 2019

General Model Reviews

WGSAM has provided model framework reviews for the LeMans ensemble (2016), FLBEIA (2017), and a multispecies state-space model (2017). Here we outline more general model framework review guidelines for future meetings.

Model frameworks may be at different stages of the model life cycle than the key-runs described above, although to date WGSAM has received requests for review closest to the “Constructed Model” phase. This means that reviews should evaluate (from the figure above):

  1. Spatial and temporal resolution
  2. Algorithm choices
  3. Assumptions (scientific basis, computational infrastructure; adequacy of conceptual model)
  4. Data availability/software tools
  5. Quality assurance/quality control (code testing)
  6. Test scenarios
  7. Corroboration with observations
  8. Uncertainty/sensitivity analysis
  9. Peer review (WGSAM’s role, but consider previous reviews from prior steps)

Spatial and temporal resolution

Does the model address the scales appropriate to its proposed use? If not, reviewers should identify the spatial and temporal scales to which the model is best suited.

Algorithm choices

Assumptions (scientific basis, computational infrastructure; adequacy of conceptual model)

Data availability/software tools

Quality assurance/quality control (code testing)

Has code been made available to reviewers? While a full code review is beyond the scope of WGSAM, reviewers with expertise in a particular model type may ask to review portions of the model code and/or data inputs.

Test scenarios

Corroboration with observations

Once again, criteria developed by [2] are relevant; see above.

Uncertainty/sensitivity analysis

Peer review (WGSAM’s role, but consider previous reviews from prior steps)

References

1. NRC. Chapter 4: Model Evaluation. In: Models in Environmental Regulatory Decision Making. Washington, D.C.: The National Academies Press; 2007. pp. 104–169. doi:10.17226/11972
2. Kaplan IC, Marshall KN. A guinea pig’s tale: Learning to review end-to-end marine ecosystem models for management applications. ICES Journal of Marine Science. 2016;73: 1715–1724. doi:10.1093/icesjms/fsw047
3. Link JS. Adding rigor to ecological network models by evaluating a set of pre-balance diagnostics: A plea for PREBAL. Ecological Modelling. 2010;221: 1580–1591. doi:10.1016/j.ecolmodel.2010.03.012
4. Sterman JD. Appropriate summary statistics for evaluating the historical fit of system dynamics models. Dynamica. 1984;10 (Part II, Winter): 51–66. Available: https://www.systemdynamics.org/assets/dynamica/102/4.pdf
5. Stow CA, Jolliff J, McGillicuddy DJ, Doney SC, Allen JI, Friedrichs MAM, et al. Skill assessment for coupled biological/physical models of marine systems. Journal of Marine Systems. 2009;76: 4–15. doi:10.1016/j.jmarsys.2008.03.011
6. Lehuta S, Petitgas P, Mahévas S, Huret M, Vermard Y, Uriarte A, et al. Selection and validation of a complex fishery model using an uncertainty hierarchy. Fisheries Research. 2013;143: 57–66. doi:10.1016/j.fishres.2013.01.008
7. Lehuta S, Girardin R, Mahévas S, Travers-Trolet M, Vermard Y. Reconciling complex system models and fisheries advice: Practical examples and leads. Aquatic Living Resources. 2016;29: 208. doi:10.1051/alr/2016022
8. Olsen E, Fay G, Gaichas S, Gamble R, Lucey S, Link JS. Ecosystem Model Skill Assessment. Yes We Can! Bianchi CN, editor. PLOS ONE. 2016;11: e0146467. doi:10.1371/journal.pone.0146467
9. Saltelli A, Annoni P. How to avoid a perfunctory sensitivity analysis. Environmental Modelling & Software. 2010;25: 1508–1517. doi:10.1016/j.envsoft.2010.04.012