---
title: "Getting Started"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


```{r setup, message = FALSE}
library(dplyr)  # Data manipulation
library(tidyr)  # Tidy data
library(sf)     # Simple features for R (spatial objects)
library(sit)    # Analyse MRR data from SIT experiments
```


## Overview

The purpose of `sit` is to help analyse mark-release-recapture (MRR) data from
sterile insect technique (SIT) field experiments.

There are two basic stages in any analysis:

1. Importing data into `sit`

2. Retrieving results from `sit`

Here, we will quickly review some concepts needed to work with `sit` while
going through a fictitious case study.
Links to more detailed materials will be provided along the way.


## Data model

```{r data-structure, echo = FALSE, fig.cap = cap, dpi=72, out.width = ifelse(knitr::is_latex_output(), "50%", "100%")}
cap <- "Data model diagram."
knitr::include_graphics("data_model.png")
```


The main concepts involved in a MRR SIT experiment are represented in the Figure
above and can be summarised as follows:

Sterile male mosquitoes are released in a __release event__. Relevant properties
are: the number of released individuals, `n`, the dye `colour` use for
identification, the `date` and time of release and the release `site`.

__Release sites__ can be either _points_ or the whole study _area_ (in the case
of areal/line releases). In the former case, release sites have a `geometry`
represented by spatial coordinates, while in the latter case the `geometry` is
empty.

The relationship between `release_events` and `release_sites` implies that there
can be any number of release events from the same site.

Mosquitoes (both sterile and wild) are subsequently captured in adult __traps__.
Relevant properties are: their location (point `geometry`), the `area` (_study_
or _control_) and a identification `code` as used in the field. Furthermore,
wild females lay their eggs in another type of __traps__ (ovitraps).

Both adult and egg traps can be of different __types__, which is why each trap
is associated with a specific __trap type__. Properties of trap types are: its
full `name` (e.g. _BG-Sentinel_) and a shorter `label` (e.g. `BGS`), the target
`stage` (adult or egg) and an optional `description`.

Finally, traps are regularly __surveyed__ in order to collect the experimental
data. Both adult and egg surveys record the surveyed trap and the period of
capture.
__Adult surveys__ also record the `population_id` of the captured individuals
(either _wild_ or the _colour_ of one of the released populations), their `sex`
and the total number `n` of individuals of each group.
__Egg surveys__ instead, record the number of eggs which were `fertile` or not.


## Package organisation

First, load the package as follows:
```{r load-sit}
library(sit)
```

The package provides a series of _functions_ to __import__ data and to 
__retrieve__ results.

All functions in the package are prefixed by `sit_`. This is convenient for
finding function names in combination with the auto-complete feature that is
provided by most _development environments_ such as RStudio or VSCode.

The typical workflow is as follows:

1. __Import__ data tables from files (`.csv`, `.xls(x)`, ...) into R
`data.frame`s.

2. __Transform__ and reshape the data into the structures required by import
functions in `sit`. Here is a list of the main `sit` functions used at this stage:

    ```{r import-functions, echo = FALSE}
    
    tibble::tribble(
      ~Function, ~Description,
      "`sit_revents()`", "Import release events and sites.",
      "`sit_traps()`", "Import traps",
      "`sit_adult_surveys()`", "Import adult surveys",
      "`sit_egg_surveys()`", "Import egg surveys",
      "`sit()`", "Gather data about traps, release events and field survey data into a `sit` object."
    ) %>% 
      knitr::kable(
        caption = "Import functions."
      )
    ```

3. __Gather__ all the data (traps, releases, surveys) into a `sit` object.

4. __Query__ the `sit` object to verify that everything was correctly
interpreted.

5. __Retrieve__ results from the `sit` object. A list of relevant `sit`
functions is shown below:

    ```{r estimating-functions, echo = FALSE}
    
    tibble::tribble(
      ~Function, ~Description,
      "`sit_competitiveness()`", "Competitiveness of sterile males",
      "`sit_mdt()`", "Mean Distance Travelled (MDT)",
      "`sit_flight_range()`", "Flight Range",
      "`sit_diffusion()`", "Diffusion coefficient",
      "`sit_survival()`", "Survival statistics",
      "`sit_wild_size()`", "Size of the wild male population"
    ) %>% 
      knitr::kable(
        caption = "Estimating functions."
      )
    ```


## Introductory example

This section will skim through a complete analysis.
Most details will be deferred to more specific articles.
The objective is to quickly acquire a perspective of the general workflow.
See `vignette("importing", package = "sit")` for specific details on the format and structure required by the import functions.

The package provides some realistic but fictitious data files that can be
located with `system.file()`. Specifically:

```{r list-extdata}
list.files(system.file("extdata", package = "sit"))
```


### 1. Read data tables into R

```{r read-data}
releases_raw <- read.csv(
  system.file("extdata/releases.csv", package = "sit")
)

bgs_traps_raw <- read.csv(
  system.file("extdata/bgs_traps_coordinates.csv", package = "sit")
)

adults_raw <- read.csv(
  system.file("extdata/adult_surveys.csv", package = "sit")
)

eggs_raw <- read.csv(
  system.file("extdata/egg_surveys.csv", package = "sit")
)
```

### 2. Transform raw data

This step will require some basic skills of data-manipulation in R.
An accessible introduction to what we'll use can be found in the [_Introduction to `dplyr`_](https://dplyr.tidyverse.org/articles/dplyr.html) and `tidyr`'s tutorial on [_Pivoting_](https://tidyr.tidyverse.org/articles/pivot.html).

The raw data of __release events__ have geographical coordinates for the _point
releases_. There is also one _areal release_ with missing coordinates.
```{r head-releases}
head(releases_raw)
```

The former need to be transformed into a spatial object of POINT features
(using the [package `sf`](https://r-spatial.github.io/sf/)),
while the areal release is treated separately.


```{r point-releases}
## Point releases
p_releases <- releases_raw %>%
  dplyr::filter(!is.na(lon)) %>%      # Observations with non-missing coords
  st_as_sf(                           # Convert to spatial
    coords = c("lon", "lat"),           # Variables with coordinates
    crs = 4326                          # Code for GPS coordinates
  )
```

```{r print-point-releases, echo = FALSE}
print(p_releases)
```

```{r areal-releases}
## Areal release
a_releases <- releases_raw %>%
  dplyr::filter(is.na(lon)) %>%    # Observations with missing coords
  dplyr::select(-lon, -lat)        # Select variables (to remove)
```

```{r print-areal-releases, echo = FALSE}
print(a_releases)
```

__Combine__ _point_ and _areal_ releases using `c()`.

```{r combine-releases}
ix_releases <- c(sit_revents(p_releases), sit_revents(a_releases))
```


The __traps__ data file contains only an id and GPS coordinates. The file name
indicates they are _BGS_ traps (a type of _adult_ traps). We have to know (e.g.
from project documentation) that they are located in the study area.

```{r head-traps}
head(bgs_traps_raw)
```

We need to transform the object into a spatial object of POINT features
(using the [package `sf`](https://r-spatial.github.io/sf/)), and add some
of the missing variables.

```{r adult-traps}
a_traps <- bgs_traps_raw %>% 
  mutate(                         # Modify (add) variables
    type = "BGS",
    area = "sit"
  ) %>% 
  st_as_sf(                       # Convert to spatial
    coords = c("lon", "lat"),         # Variables with coordinates
    crs = 4326                        # Code for GPS coordinates
  )
```

Note that declaring `type = 'BGS'` implies that these traps target the _adult_
stage, since:
```{r sit-trap-types}
sit_trap_types()
```

Furthermore, there are 14 ovitraps in the study area and 7 more in a control
area, although there is no data file for these, since coordinates are not
strictly required (albeit recommended).
We declare these simply by passing a vector of trap codes, the _area_ and
a valid _type_ to `sit_traps()`.
Finally, we combine (`c()`) all trap types together into a single object.

```{r ix-traps}
ix_traps <- c(
  sit_traps(a_traps),
  sit_traps(sprintf("CS%02d", 1:14), area = 'sit', type = 'OVT'),
  sit_traps(sprintf("BL%02d", 1:7), area = 'control', type = 'OVT')
)
```

Survey data of adult traps have been collected across multiple columns. 

```{r head-adults}
head(adults_raw)
```

We need to _reshape_ (pivot) the table to make it [_tidy_](https://tidyr.tidyverse.org/articles/tidy-data.html) (one variable per column) and add
the implicit information about the _duration_ of each survey (1 day).

```{r tidy-adults}
(
  adult_surveys <- 
    adults_raw %>% 
    ## Reshape into long format
    pivot_longer(
      cols = wild_female:pink_male,
      names_to = "pop_sex",
      values_to = "n"
    ) %>% 
    ## Separate variables
    separate(
      pop_sex,
      into = c("population", "sex"),
      sep = "_"
    ) %>% 
    ## Include the missing duration of the sampling period
    mutate(
      duration = 1
    )
)
```

```{r ix-adults}
ix_adults <- sit_adult_surveys(adult_surveys)
```

We have a similar situation with egg surveys:

```{r head-eggs}
head(eggs_raw)
```

```{r tidy-eggs}
egg_surveys <-
  eggs_raw %>%
  ## Reshape into long format and convert the hatched
  ## status into a logical variable.
  pivot_longer(
    cols = fertile:sterile,
    names_to = "fertile",
    names_transform = list(
      fertile = function(.) {. == "fertile"}  # makes it logical
    ),
    values_to = "n"
  ) %>%
  ## Include the missing duration of the sampling period
  mutate(
    duration = 7
  )
```


```{r ix-eggs}
ix_eggs <- sit_egg_surveys(egg_surveys)
```


### 3. Build the `sit` object.

Finally, integrate all the data about release events, traps and surveys into
a `sit` object, using the function of the same name:

```{r ix-sit}
ix_sit <- sit(
  traps          = ix_traps,
  release_events = ix_releases,
  adult_surveys  = ix_adults,
  egg_surveys    = ix_eggs
)
```

This object that we have just built is (almost) identical to the data set
`sit_prototype` provided by the package.
The only difference is the [coordinate reference system](importing.html#spatial-coordinates) used.

Thus, for testing the package, you can directly use `sit_prototype` which is
a fully specified `sit` object, and skip the importing operations entirely.


### 4. __Query__ the `sit` object

Various standard methods allow you to verify that everything was correctly
interpreted, whereas more specific methods let you retrieve the data from the
`sit` object.

_Printing_ the `sit` object provides a short numerical overview of the data it
contains:

```{r print-sit}
print(ix_sit)
# ix_sit         # Same
```

More detailed information about the `sit` object can be retrieved using other 
generic methods. See [Retrieving results](retrieving.html).

The same functions used for importing traps (`sit_traps()`), release events 
(`sit_revents()`) and surveys (`sit_adult_surveys` and `sit_egg_surveys()`)
also work as _extractors_ of the corresponding data from a `sit` object.


```{r extract-release-events}
sit_revents(ix_sit)
```

```{r extract-traps}
sit_traps(ix_sit)
```

```{r extract-adult-surveys}
sit_adult_surveys(ix_sit)
```

```{r extract-egg-surveys}
sit_egg_surveys(ix_sit)
```


### 5. __Retrieve__ results from the `sit` object.

The main objective of the package is to facilitate the estimation of relevant
quantities concerning the __competitiveness__ of sterile male, their
__dispersal__ dynamics, their __survival__ rates and the __size__ of the wild
male population.

See the vignette [_Retrieving Results_](retrieving.html) and the reference
documentation for more details about all the options for using these functions.

```{r competitiveness}
sit_competitiveness(ix_sit)
```


```{r mdt}
sit_mdt(ix_sit, by = "population")
sit_mdt(ix_sit, by = "age")
```

```{r flight_range}
sit_flight_range(ix_sit)
```

```{r diffusion}
sit_diffusion(ix_sit)
```


```{r survival}
sit_survival(ix_sit)
```


```{r wild-population-size}
sit_wild_size(ix_sit)
```

## Saving and loading

Once you have your data imported as a `sit` object, you might want to save it
for back up, for continuing work at a later time without having to re-import the
original data from scratch, for sharing the `sit` object with someone else, etc.

R provides saving and loading mechanisms for any R object as follows:

```{r}
## Rather than a temporary file, you might want to define a file name in your
## working directory (`getwd()`) or somewhere else as a character string.
## Use the conventional extension `rds`. E.g.
## my_file <- 'C:/work/project/my_sit.rds'
my_file <- tempfile()
saveRDS(ix_sit, file = my_file)
new_sit <- readRDS(my_file)
identical(new_sit, ix_sit)
```


## Reproducing results

Beyond saving the `sit` object with the experimental data, it is also a good
idea to save the __analyses__ and the results, rather than copying them over to
a document and then forgetting how they were computed.

Working with R is very convenient, since you can simply save the _script_ that
produces all the results.
A __R script__ is just a text file (with conventional extension `.R`) with the
sequence of commands and function calls that perform the desired analysis.

Should you want to check where does a result come from, you simply need to look
in the script, which is self-documenting. This makes it (almost) trivial to
__reproduce__ the results. Either by you or by someone else. In all justice,
issues often arise due to differences in the underlying environment (package
versions, platform, etc.). However, it is a good start.

An even better approach is to _code_ the analysis report entirely.
Making the analysis (e.g. with a R script) and then 
transferring results (e.g. copying-and-pasting) to a manuscript in separate
steps is error-prone and difficult to keep in sync as (inevitably) the
calculations get updated and corrected.
Instead, we can write the manuscript with the code required to produce the
results, all in a single document, which is then _compiled_ into a report.

This approach is called [_Literate
Programming_](https://en.wikipedia.org/wiki/Literate_programming) and is not
precisely new: it was introduced by Donald Knuth back in 1984. In R, a popular
implementation of this idea is [R Markdown](https://rmarkdown.rstudio.com/). It
is, by the way, how this document has been produced.

RStudio integrates R Markdown in its interface, which makes it very easy to get
started. Go to `File > New File > R Markdown...` and follow the instructions.
Otherwise, you can write a R Markdown file (which is a text file with a 
conventional extension `.Rmd`) and _render_ the compiled report using `rmarkdown::render()`.

A final recommendation. Keep all the files (data, code, docs, reports) from a
data project organised within a directory. If you use RStudio, setting up a
_RStudio project_ in that directory to facilitates switching to and out of it
quickly and sharing projects across computers and people.

Some useful references for more insight:

- Hadley Wickham and Garrett Grolemund (2017) _R for Data Science_. 
  - Chapter 8: _Workflow: projects_. https://r4ds.had.co.nz/workflow-projects.html.
  - Chapter 27: _R Markdown_. https://r4ds.had.co.nz/r-markdown.html

- Rafael A. Irizarry (2021)_Introduction to Data Science_. Chapter 40: _Reproducible projects with RStudio and R markdown_. https://rafalab.github.io/dsbook/reproducible-projects-with-rstudio-and-r-markdown.html