--- title: "Getting Started" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting Started} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, message = FALSE} library(dplyr) # Data manipulation library(tidyr) # Tidy data library(sf) # Simple features for R (spatial objects) library(sit) # Analyse MRR data from SIT experiments ``` ## Overview The purpose of `sit` is to help analyse mark-release-recapture (MRR) data from sterile insect technique (SIT) field experiments. There are two basic stages in any analysis: 1. Importing data into `sit` 2. Retrieving results from `sit` Here, we will quickly review some concepts needed to work with `sit` while going through a fictitious case study. Links to more detailed materials will be provided along the way. ## Data model ```{r data-structure, echo = FALSE, fig.cap = cap, dpi=72, out.width = ifelse(knitr::is_latex_output(), "50%", "100%")} cap <- "Data model diagram." knitr::include_graphics("data_model.png") ``` The main concepts involved in a MRR SIT experiment are represented in the Figure above and can be summarised as follows: Sterile male mosquitoes are released in a __release event__. Relevant properties are: the number of released individuals, `n`, the dye `colour` use for identification, the `date` and time of release and the release `site`. __Release sites__ can be either _points_ or the whole study _area_ (in the case of areal/line releases). In the former case, release sites have a `geometry` represented by spatial coordinates, while in the latter case the `geometry` is empty. The relationship between `release_events` and `release_sites` implies that there can be any number of release events from the same site. Mosquitoes (both sterile and wild) are subsequently captured in adult __traps__. Relevant properties are: their location (point `geometry`), the `area` (_study_ or _control_) and a identification `code` as used in the field. Furthermore, wild females lay their eggs in another type of __traps__ (ovitraps). Both adult and egg traps can be of different __types__, which is why each trap is associated with a specific __trap type__. Properties of trap types are: its full `name` (e.g. _BG-Sentinel_) and a shorter `label` (e.g. `BGS`), the target `stage` (adult or egg) and an optional `description`. Finally, traps are regularly __surveyed__ in order to collect the experimental data. Both adult and egg surveys record the surveyed trap and the period of capture. __Adult surveys__ also record the `population_id` of the captured individuals (either _wild_ or the _colour_ of one of the released populations), their `sex` and the total number `n` of individuals of each group. __Egg surveys__ instead, record the number of eggs which were `fertile` or not. ## Package organisation First, load the package as follows: ```{r load-sit} library(sit) ``` The package provides a series of _functions_ to __import__ data and to __retrieve__ results. All functions in the package are prefixed by `sit_`. This is convenient for finding function names in combination with the auto-complete feature that is provided by most _development environments_ such as RStudio or VSCode. The typical workflow is as follows: 1. __Import__ data tables from files (`.csv`, `.xls(x)`, ...) into R `data.frame`s. 2. __Transform__ and reshape the data into the structures required by import functions in `sit`. Here is a list of the main `sit` functions used at this stage: ```{r import-functions, echo = FALSE} tibble::tribble( ~Function, ~Description, "`sit_revents()`", "Import release events and sites.", "`sit_traps()`", "Import traps", "`sit_adult_surveys()`", "Import adult surveys", "`sit_egg_surveys()`", "Import egg surveys", "`sit()`", "Gather data about traps, release events and field survey data into a `sit` object." ) %>% knitr::kable( caption = "Import functions." ) ``` 3. __Gather__ all the data (traps, releases, surveys) into a `sit` object. 4. __Query__ the `sit` object to verify that everything was correctly interpreted. 5. __Retrieve__ results from the `sit` object. A list of relevant `sit` functions is shown below: ```{r estimating-functions, echo = FALSE} tibble::tribble( ~Function, ~Description, "`sit_competitiveness()`", "Competitiveness of sterile males", "`sit_mdt()`", "Mean Distance Travelled (MDT)", "`sit_flight_range()`", "Flight Range", "`sit_diffusion()`", "Diffusion coefficient", "`sit_survival()`", "Survival statistics", "`sit_wild_size()`", "Size of the wild male population" ) %>% knitr::kable( caption = "Estimating functions." ) ``` ## Introductory example This section will skim through a complete analysis. Most details will be deferred to more specific articles. The objective is to quickly acquire a perspective of the general workflow. See `vignette("importing", package = "sit")` for specific details on the format and structure required by the import functions. The package provides some realistic but fictitious data files that can be located with `system.file()`. Specifically: ```{r list-extdata} list.files(system.file("extdata", package = "sit")) ``` ### 1. Read data tables into R ```{r read-data} releases_raw <- read.csv( system.file("extdata/releases.csv", package = "sit") ) bgs_traps_raw <- read.csv( system.file("extdata/bgs_traps_coordinates.csv", package = "sit") ) adults_raw <- read.csv( system.file("extdata/adult_surveys.csv", package = "sit") ) eggs_raw <- read.csv( system.file("extdata/egg_surveys.csv", package = "sit") ) ``` ### 2. Transform raw data This step will require some basic skills of data-manipulation in R. An accessible introduction to what we'll use can be found in the [_Introduction to `dplyr`_](https://dplyr.tidyverse.org/articles/dplyr.html) and `tidyr`'s tutorial on [_Pivoting_](https://tidyr.tidyverse.org/articles/pivot.html). The raw data of __release events__ have geographical coordinates for the _point releases_. There is also one _areal release_ with missing coordinates. ```{r head-releases} head(releases_raw) ``` The former need to be transformed into a spatial object of POINT features (using the [package `sf`](https://r-spatial.github.io/sf/)), while the areal release is treated separately. ```{r point-releases} ## Point releases p_releases <- releases_raw %>% dplyr::filter(!is.na(lon)) %>% # Observations with non-missing coords st_as_sf( # Convert to spatial coords = c("lon", "lat"), # Variables with coordinates crs = 4326 # Code for GPS coordinates ) ``` ```{r print-point-releases, echo = FALSE} print(p_releases) ``` ```{r areal-releases} ## Areal release a_releases <- releases_raw %>% dplyr::filter(is.na(lon)) %>% # Observations with missing coords dplyr::select(-lon, -lat) # Select variables (to remove) ``` ```{r print-areal-releases, echo = FALSE} print(a_releases) ``` __Combine__ _point_ and _areal_ releases using `c()`. ```{r combine-releases} ix_releases <- c(sit_revents(p_releases), sit_revents(a_releases)) ``` The __traps__ data file contains only an id and GPS coordinates. The file name indicates they are _BGS_ traps (a type of _adult_ traps). We have to know (e.g. from project documentation) that they are located in the study area. ```{r head-traps} head(bgs_traps_raw) ``` We need to transform the object into a spatial object of POINT features (using the [package `sf`](https://r-spatial.github.io/sf/)), and add some of the missing variables. ```{r adult-traps} a_traps <- bgs_traps_raw %>% mutate( # Modify (add) variables type = "BGS", area = "sit" ) %>% st_as_sf( # Convert to spatial coords = c("lon", "lat"), # Variables with coordinates crs = 4326 # Code for GPS coordinates ) ``` Note that declaring `type = 'BGS'` implies that these traps target the _adult_ stage, since: ```{r sit-trap-types} sit_trap_types() ``` Furthermore, there are 14 ovitraps in the study area and 7 more in a control area, although there is no data file for these, since coordinates are not strictly required (albeit recommended). We declare these simply by passing a vector of trap codes, the _area_ and a valid _type_ to `sit_traps()`. Finally, we combine (`c()`) all trap types together into a single object. ```{r ix-traps} ix_traps <- c( sit_traps(a_traps), sit_traps(sprintf("CS%02d", 1:14), area = 'sit', type = 'OVT'), sit_traps(sprintf("BL%02d", 1:7), area = 'control', type = 'OVT') ) ``` Survey data of adult traps have been collected across multiple columns. ```{r head-adults} head(adults_raw) ``` We need to _reshape_ (pivot) the table to make it [_tidy_](https://tidyr.tidyverse.org/articles/tidy-data.html) (one variable per column) and add the implicit information about the _duration_ of each survey (1 day). ```{r tidy-adults} ( adult_surveys <- adults_raw %>% ## Reshape into long format pivot_longer( cols = wild_female:pink_male, names_to = "pop_sex", values_to = "n" ) %>% ## Separate variables separate( pop_sex, into = c("population", "sex"), sep = "_" ) %>% ## Include the missing duration of the sampling period mutate( duration = 1 ) ) ``` ```{r ix-adults} ix_adults <- sit_adult_surveys(adult_surveys) ``` We have a similar situation with egg surveys: ```{r head-eggs} head(eggs_raw) ``` ```{r tidy-eggs} egg_surveys <- eggs_raw %>% ## Reshape into long format and convert the hatched ## status into a logical variable. pivot_longer( cols = fertile:sterile, names_to = "fertile", names_transform = list( fertile = function(.) {. == "fertile"} # makes it logical ), values_to = "n" ) %>% ## Include the missing duration of the sampling period mutate( duration = 7 ) ``` ```{r ix-eggs} ix_eggs <- sit_egg_surveys(egg_surveys) ``` ### 3. Build the `sit` object. Finally, integrate all the data about release events, traps and surveys into a `sit` object, using the function of the same name: ```{r ix-sit} ix_sit <- sit( traps = ix_traps, release_events = ix_releases, adult_surveys = ix_adults, egg_surveys = ix_eggs ) ``` This object that we have just built is (almost) identical to the data set `sit_prototype` provided by the package. The only difference is the [coordinate reference system](importing.html#spatial-coordinates) used. Thus, for testing the package, you can directly use `sit_prototype` which is a fully specified `sit` object, and skip the importing operations entirely. ### 4. __Query__ the `sit` object Various standard methods allow you to verify that everything was correctly interpreted, whereas more specific methods let you retrieve the data from the `sit` object. _Printing_ the `sit` object provides a short numerical overview of the data it contains: ```{r print-sit} print(ix_sit) # ix_sit # Same ``` More detailed information about the `sit` object can be retrieved using other generic methods. See [Retrieving results](retrieving.html). The same functions used for importing traps (`sit_traps()`), release events (`sit_revents()`) and surveys (`sit_adult_surveys` and `sit_egg_surveys()`) also work as _extractors_ of the corresponding data from a `sit` object. ```{r extract-release-events} sit_revents(ix_sit) ``` ```{r extract-traps} sit_traps(ix_sit) ``` ```{r extract-adult-surveys} sit_adult_surveys(ix_sit) ``` ```{r extract-egg-surveys} sit_egg_surveys(ix_sit) ``` ### 5. __Retrieve__ results from the `sit` object. The main objective of the package is to facilitate the estimation of relevant quantities concerning the __competitiveness__ of sterile male, their __dispersal__ dynamics, their __survival__ rates and the __size__ of the wild male population. See the vignette [_Retrieving Results_](retrieving.html) and the reference documentation for more details about all the options for using these functions. ```{r competitiveness} sit_competitiveness(ix_sit) ``` ```{r mdt} sit_mdt(ix_sit, by = "population") sit_mdt(ix_sit, by = "age") ``` ```{r flight_range} sit_flight_range(ix_sit) ``` ```{r diffusion} sit_diffusion(ix_sit) ``` ```{r survival} sit_survival(ix_sit) ``` ```{r wild-population-size} sit_wild_size(ix_sit) ``` ## Saving and loading Once you have your data imported as a `sit` object, you might want to save it for back up, for continuing work at a later time without having to re-import the original data from scratch, for sharing the `sit` object with someone else, etc. R provides saving and loading mechanisms for any R object as follows: ```{r} ## Rather than a temporary file, you might want to define a file name in your ## working directory (`getwd()`) or somewhere else as a character string. ## Use the conventional extension `rds`. E.g. ## my_file <- 'C:/work/project/my_sit.rds' my_file <- tempfile() saveRDS(ix_sit, file = my_file) new_sit <- readRDS(my_file) identical(new_sit, ix_sit) ``` ## Reproducing results Beyond saving the `sit` object with the experimental data, it is also a good idea to save the __analyses__ and the results, rather than copying them over to a document and then forgetting how they were computed. Working with R is very convenient, since you can simply save the _script_ that produces all the results. A __R script__ is just a text file (with conventional extension `.R`) with the sequence of commands and function calls that perform the desired analysis. Should you want to check where does a result come from, you simply need to look in the script, which is self-documenting. This makes it (almost) trivial to __reproduce__ the results. Either by you or by someone else. In all justice, issues often arise due to differences in the underlying environment (package versions, platform, etc.). However, it is a good start. An even better approach is to _code_ the analysis report entirely. Making the analysis (e.g. with a R script) and then transferring results (e.g. copying-and-pasting) to a manuscript in separate steps is error-prone and difficult to keep in sync as (inevitably) the calculations get updated and corrected. Instead, we can write the manuscript with the code required to produce the results, all in a single document, which is then _compiled_ into a report. This approach is called [_Literate Programming_](https://en.wikipedia.org/wiki/Literate_programming) and is not precisely new: it was introduced by Donald Knuth back in 1984. In R, a popular implementation of this idea is [R Markdown](https://rmarkdown.rstudio.com/). It is, by the way, how this document has been produced. RStudio integrates R Markdown in its interface, which makes it very easy to get started. Go to `File > New File > R Markdown...` and follow the instructions. Otherwise, you can write a R Markdown file (which is a text file with a conventional extension `.Rmd`) and _render_ the compiled report using `rmarkdown::render()`. A final recommendation. Keep all the files (data, code, docs, reports) from a data project organised within a directory. If you use RStudio, setting up a _RStudio project_ in that directory to facilitates switching to and out of it quickly and sharing projects across computers and people. Some useful references for more insight: - Hadley Wickham and Garrett Grolemund (2017) _R for Data Science_. - Chapter 8: _Workflow: projects_. https://r4ds.had.co.nz/workflow-projects.html. - Chapter 27: _R Markdown_. https://r4ds.had.co.nz/r-markdown.html - Rafael A. Irizarry (2021)_Introduction to Data Science_. Chapter 40: _Reproducible projects with RStudio and R markdown_. https://rafalab.github.io/dsbook/reproducible-projects-with-rstudio-and-r-markdown.html