---
title: "Retrieving Results"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Retrieving Results}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: sit.bib
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, message = FALSE}
library(dplyr)   # Data manipulation
library(ggplot2) # Plotting
library(tidyr)   # Tidy data
library(sf)      # Simple features for R (spatial objects)
library(sit)     # Analyse MRR data from SIT experiments
```


```{r plot-theme, echo = FALSE}
theme_set(
  theme_bw() +
    theme(
      panel.grid.minor = element_blank(),
      panel.grid.major.x = element_blank()
    )
)
```


## Overview of the experimental setup

Printing the `sit` object gives a very short description of its contents:

```{r print-sit}
sit_prototype
```

Use `summary()` to get a more detailed summary of each of the `sit` components:

```{r summary-sit}
summary(sit_prototype)
```


Use `plot()` to get a very basic image of the experimental spatial set up.

```{r}
plot(sit_prototype)
```

You can make more professional and even interactive maps by leveraging other R
packages of your choice. This is beyond the scope of `sit`, but I'll provide a
few examples here using the [package `tmap`](https://r-tmap.github.io/tmap/).
To learn more on mapping, see the chapter [Making maps with
R](https://geocompr.robinlovelace.net/adv-map.html) in Robin Lovelace's
excellent book _Geocomputation with R_.

```{r interactive-map, results = "hold"}
library(tmap)
tmap_cmode <- tmap_mode("view")
tm_basemap("CartoDB") +
tm_shape(
  sit_traps(sit_prototype, area = "sit", stage = "adult")
) +
  tm_dots() +
  tm_shape(
    sit_revents(sit_prototype, type = "point") %>%
      st_as_sf()
  ) +
  tm_bubbles(
    col = "colour",
    jitter = 1       # Jitter a bit to see the overlapping release points
  )
tmap_mode(tmap_cmode)
```

Note how we retrieved only `adult` traps from the `sit` area, and only the
`point` releases, out of all available traps and releases.
We will see how that worked in the following section.

## Extraction functions

Extraction functions allow retrieving the data contained in a `sit` object,
while filtering specific groups as needed.

Here we describe the list of arguments of each extraction function, with a few
examples. This is very much borrowed from the help pages, but focused on the
extraction method only.

### `sit_trap_types()`

Arguments:
```{r echo = FALSE}
print_variable_table(
  "x", "A `sit` object.",
  header = FALSE
)
```

This one is simple: it only takes a `sit` object and returns the list of
trap types. E.g.

```{r}
sit_trap_types(sit_prototype)
```

The result is a `data.frame` itself. Thus, if you really need to
filter some observations, you can put your R-skills to use:

```{r}
sit_trap_types(sit_prototype) %>% 
  dplyr::filter(stage == "egg")
```


### `sit_traps()`

Arguments:
```{r echo = FALSE}
print_variable_table(
  "x", "A `sit` object.",
  "type", "Character vector. Labels of requested trap types. By default, `sit_trap_types()$label`.",
  "stage", "Character vector. Either `adult`, `egg` or both (default).",
  "area", "Character vector. Either `control`, `sit` or both (default).",
  header = FALSE
)
```

Retrieve the table of ovitraps:
```{r}
sit_traps(sit_prototype, type = "OVT")
```

Note that this coincides with the table of traps for the `stage = 'egg'`.

However, traps in the `control` area are only a subset:
```{r}
sit_traps(sit_prototype, area = "control")
```


### `sit_revents()`

Arguments:
```{r echo = FALSE}
print_variable_table(
  "x",    "A `sit` object.",
  "type", "Character vector. Which __types__ of release events to retrieve.
  Either `point` or `areal` or both (default).",
  header = FALSE
)
```

```{r}
sit_revents(sit_prototype, type = 'point')
sit_revents(sit_prototype, type = 'areal')
sit_revents(sit_prototype)
```


### `sit_adult_surveys()` and `sit_egg_surveys()`

Extract survey data, possibly filtering results.

Arguments:
```{r echo = FALSE}
print_variable_table(
  "x",         "A `sit` object",
  "area",      "Character. Either `control`, `sit` or both (default).",
  "trap_type", "Character. Any of `sit_trap_types(x)$label`. Typically, `OVT` for `sit_egg_surveys()` and `BGS` or `HLC` for `sit_adult_surveys()`.",
  "following_releases", "A `sit_release` object with a subset of release events or missing (default) for all release events. Use in combination with `following_days` to filter surveys of the target populations within a given number of days after release. Note that counts of wild populations will __always__ be included in the results with a value of `NA` in `pop_col`.",
  "following_days", "Integer or missing (default). Number of days after releases to return, if `following_releases` is not missing.",
  "species",   "Character or NULL (default). Filter surveys for a particular species, or ignore the species if NULL.",
  header = FALSE
)
```


Retrieve all the egg surveys in the control area:
```{r ex-adults-area}
sit_egg_surveys(sit_prototype, area = "control")
```

Retrieve the adult surveys from the day that followed the _areal_ release.
Only counts from the target release, or the wild population, please.
```{r}
sit_adult_surveys(
  sit_prototype,
  following_releases = sit_revents(sit_prototype, type = 'areal'),
  following_days = 1
)
```


## SIT results

### 1. Descriptive summaries

Dispersal of sterile males in space and time. 

```{r}
## Retrieve age since release date of each survey
sm_ages <- survey_ages(
  surveys  = sit_adult_surveys(sit_prototype),
  releases = sit_revents(sit_prototype, type = 'point')
) 

## Summarise counts by trap and age-group (by 5 days)
sm_summary <- 
  sm_ages %>% 
  mutate(
    age_group = cut(age, 5*(0:6))
  ) %>% 
  filter(!is.na(age_group)) %>%   # remove ages 30+ (max is 36)
  group_by(trap_code, age_group) %>% 
  summarise(n = sum(n), .groups = 'drop')

## Join this data to the geospatial information of the traps
sm_sf <- sit_traps(sit_prototype, area = "sit", stage = "adult") %>% 
  left_join(sm_summary, by = c(code = 'trap_code'))

tm_shape(sm_sf) +
  tm_bubbles(size = 'n') +
  tm_facets(by = 'age_group')
```


### 2. Competitiveness


The competitiveness $\gamma$ of the sterile male individuals is defined as their relative capacity to mate with a wild female, compared to a wild male.

Thus, in a homogeneously mixed population with $M_s$ sterile males and $M_w$ wild males, the probability that a mating occurs with a sterile individual is 
\[
  p_s = \frac{\gamma M_s}{M_w + \gamma M_s} = 
  \frac{\gamma R_{sw}}{1 + \gamma R_{sw}},
\]
where $R_{sw} = M_s/M_w$ is the sterile-wild male ratio.
At a given sterile-wild male ratio $R_{sw} = M_s/M_w$ we observe a fertility rate $H_s$ in the field.

Assuming a residual fertility rate $H_{rs}$ for sterile males and a natural fertility rate $H_w$ for wild males, the observed fertility rate $H_s$ in the field is:
\begin{equation}
  H_s = p_s H_{rs} + (1-p_s)H_w = 
  \frac{\gamma R_{sw}}{1 + \gamma R_{sw}} H_{rs} + 
  \frac{1}{1 + \gamma R_{sw}} H_w.
\end{equation}

Solving the above equation for $\gamma$ we obtain the Fried index, which
estimates the competitiveness based on observed fertility rates and the
sterile-wild male ratio:
\begin{equation}
  F = \frac{H_w – H_s}{H_s-H_{rs}} \bigg/ R_{sw}
\end{equation}
where $H_w$ is the percentage egg hatch in the control area (i.e. the natural fertility); $H_s$ is the percentage egg hatch in the release area (observed fertility under a sterile-wild male ratio $R_{sw}$) and $H_{rs}$ is the residual fertility of sterile males.

This requires the estimation of:

1. The sterile-wild males ratio $R_{sw}$. Estimated from the observed ratios in __adult surveys in the following 7 days after areal releases__ by default.

1. The natural fertility $H_w$. Estimated from egg-hatching rates in the control area.

1. The observed fertility in the release area $H_s$. Estimated from egg-hatching rates in the SIT area.

1. The residual fertility of sterile males $H_{rs}$.
    This depends on the level of radiation used for sterilisation, and the influx of females from beyond the SIT area. This cannot be estimated from SIT data and
    needs to be fixed by the user. Using $H_{rs} = 0$ reduces to the so-called _CIS index_.
    

```{r compet-default}
sit_competitiveness(sit_prototype, residual_fertility = 0.05)
```

If there were more than one areal release, we could compute the estimated
competitiveness for each one separately by specifying `following_release`.
We can also control the number of days following the release to be used for the
calculation of the sterile-wild male ratio. 

```{r compet-ten-days}
sit_competitiveness(
  sit_prototype,
  following_releases = sit_revents(sit_prototype, type = "areal")[1,],
  following_days = 10,
  residual_fertility = 0.05
)
```

We can be shocked by the difference in the estimate after simply widening the
observation window to 10 days rather than 7, and legitimately ask which
time-span is more appropriate.

Notice that the only component of the calculation that changed sensibly is the sterile-wild male ratio, which essentially halved.

Let's see how the relationship between captured sterile and wild males evolve
in time. We can look at the ratio, but the difference is a more natural scale.

```{r sterile-wild-male-ratio}
areal_release <- sit_revents(sit_prototype, type = "areal")

areal_surveys <- sit_adult_surveys(
    sit_prototype,
    area = "sit",
    following_releases = areal_release,
    following_days = 10
  )

sterile_wild_male_ratio(areal_surveys) %>% 
  mutate(
    age = as.numeric(datetime_end - areal_release$date)
  ) %>% 
  group_by(age) %>% 
  summarise(
    sterile = sum(sterile),
    wild = sum(wild),
    swdiff = sterile - wild
  ) %>% 
  ggplot(aes(age, swdiff)) +
  geom_point() +
  geom_smooth(method = 'loess', formula = 'y ~ x') +
  labs(x = "Age", y = "Sterile - wild difference")
```

Starting from day 8, the number of captured sterile males drops consistently
with respect to the captured wild males,
which explains the difference in the estimate of competitiveness.

Since the calculation pools observations assuming a constant ratio, we can 
conclude that 7 days was an appropriate time-span for these data.

### 3. Dispersal

#### Mean Distance Travelled (MDT)

Since the dispersion evolves in time, typically increasing with time since
release, we compute the MDT at each day $j$ and,
potentially, for the population corresponding to the release event $k$.

\begin{equation}
  \text{MDT}_{jk} = \sum_{i = 1}^{n_t} w_{ik}\,n_{ijk}\,d_i \bigg/
  \sum_{i = 1}^{n_t} w_{ik}\,n_{ijk}
\end{equation}
where $n_t$ is the total number of traps, $d_i$ is the distance from trap $i$ to the release site of event $k$ and $n_{ijk}$ is the number of sterile males surveyed at trap $i$, in day $j$ after release event $k$.

Furthermore, a set of _weights_ $w_{ik}$ for each trap and release event can
be used, in order to account for a inhomogeneous arrangement of the traps,
as explained below.
This is activated by the argument `spatial_adjustment` in `sit_mdt()` and 
`sit_flight_range()`.
Without spatial adjustment, the weights are set as $w_{ik} = 1$ and the MDT
reduces to the so-called Mean Dispersal Distance (MDD).

Depending on the needs and hypotheses, the calculation could also aggregate 
observations by days, populations or both.

If most traps were concentrated far away from the release points, then MDT would
be artificially inflated as a consequence.
A way of correcting for this is to weight the counts by the inverse relative 
local trap density at each trap.
However, a spatial density estimate of traps would be too noisy with only a couple dozen traps and suffer from considerable edge effects.
Thus, we assume _isotropy_ (neglect the direction) and consider
only the __radial__ density from the release site.

Trap density with respect to the distance from the release point increases
linearly with the distance for a uniform distribution of traps.

The relative density is the ratio between the observed density and the expected density under uniformity (straight line).
While the $\omega_i$ is the inverse of the ratio at the distance of the corresponding trap (Figure below).

```{r plot-inverse-density, echo = FALSE}
dispersal_dat <- attr(sit_mdt(sit_prototype), "dispersal_data")
plot(inverse_radial_density(unique(dispersal_dat$dist_m)))
```

This approach also suffers from edge effect.
However, the trick to solve it is much simpler and the results more robust.
The figure above shows the uncorrected density in grey and the adjusted 
density as a black curve with points at the observed distances.

The only issue would be for trap patterns that are strongly concentrated in specific directions at different distances.
However, this seems unlikely.

With this spatial adjustment (performed by default), we can assess the MDT by
each released population (_point releases_ only), and we can also __plot__ the
histogram of travelled distances weighted by the number of observations and the 
spatial adjustment:

```{r mdt-demo}
(mdt_prototype <- sit_mdt(sit_prototype, by = "population"))
plot(mdt_prototype)
```

The histogram __excludes__ the most extreme 2.5 % of observations on each side,
to improve visualisation.
The vertical lines display the mean distances travelled by population.
They appear shifted towards 0 due to a large number of adjusted distances near
0 that are not shown.


Contrast with the results without the spatial adjustment.
The estimated distances are shorter, due to a over-represented number of traps
at shorter distances.

```{r mdt-noadjust}
sit_mdt(sit_prototype, by = "population", spatial_adjustment = FALSE)
```

We can also consider the MDT by age (pooling across populations).

```{r mdt-age}
sit_mdt(sit_prototype, by = "age")
```

There is no obvious increase of MDT with time. It could be one particular
population which is distorting the results, but we don't have enough
observations to split the MDT by age and population.

On the other hand, estimates get noisier for larger ages since observations are
more scarce. If we restricted the analysis to the first 10 days of age, the 
pattern is more apparent.

```{r mdt-trend}
sit_mdt(sit_prototype, by = "age") %>% 
  filter(age <= 10) %>% 
  ggplot(aes(age, mdt)) +
  geom_point() +
  geom_smooth(method = "lm", formula = 'y ~ x')
```

#### Flight Range

This is an alternative measure of the dispersal capacity of sterile males based
on __quantiles__ rather than __averages__.

The method regresses the log-distances ($+1$ to avoid divergence) from the release point as a function of the cumulative (trap-density adjusted) catches.
This relationship is modelled linearly, and the expected distances at the 50 and 90% adjusted catches with respect to the total are defined as the corresponding flight ranges `FR50` and `FR90`.

The function `sit_flight_range()` uses the levels 50 and 90 by default and
allows making a basic plot of the underlying model for verification.

```{r flight-range-pool}
(fr_prototype <- sit_flight_range(sit_prototype))
```


```{r flight-range-plot}
fr_plot <- plot(fr_prototype)
```

Moreover, the `plot` method returns the underlying data invisibly, which allows
making a more customised figure. E.g.

```{r fr-custom-plot}
fr_plot %>% 
  ggplot(aes(cx, logd, colour = population)) +
  geom_point(show.legend = FALSE) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, show.legend = FALSE) +
  labs(x = "Cumulative captures", y = "Log-distance (log-m)") +
  scale_colour_manual(values = unique(fr_plot$population))
```


The function also allows defining any number of quantiles and gives results
either disaggregated by population (default) or all populations pooled together.


```{r flight-range-levels}
sit_flight_range(sit_prototype, levels = c(25, 50, 75), pool = TRUE)
```

#### Diffusion

Assuming that mosquitoes fly randomly (Brownian motion), their dispersion
follow [Fick's first
law](https://en.wikipedia.org/wiki/Fick%27s_laws_of_diffusion#Fick's_first_law)
of diffusion, which postulates that flux goes from regions of high
concentration to regions of low concentration, with a magnitude that is
proportional to the concentration gradient (spatial derivative). The
proportionality constant is the diffusivity $D$ and depend of the species'
behaviour for that location, season and weather conditions.

Under this model, [the __Mean Squared Displacement__ (MSD) of mosquitoes from
their release point at time $t$
is](https://en.wikipedia.org/wiki/Fick%27s_laws_of_diffusion#Example_solution_2:_Brownian_particle_and_Mean_squared_displacement)
\[
  \text{MSD}(t) = 4Dt.
\]

From which we can solve for the Diffusion coefficient by using the distances
from the release point at which sterile males were captured to estimate the
MSD at a given age.

The function `sit_diffusion()` computes the MSD of the individuals (by
population, by default, or possibly pooling populations) at the different
observed ages, and derives $D$ as the regression slope. The resulting object can
be plotted to see the underlying regression model.

```{r diffusion}
(diff_prototype <- sit_diffusion(sit_prototype))
```

```{r diffusion-plot}
plot(diff_prototype)
```

### 4. Survival

We assume that for a released population of sterile males there is a certain _probability of daily survival_ ($\pi_\text{PDS}$), which represents the probability that an individual (or the proportion of individuals that) makes it to the next day.

The proportion $p_j$ of sterile males still alive at day $j$ decreases geometrically, starting from some initial survival $p_0$ at day 0:

\[
  p_j = p_0 \, \pi_\text{PDS}^j.
\]

Assuming that each day $j$ we capture a fraction $c$ of the released individuals that remain alive, we have that the expected number of surveyed individuals at day $j$ is 


\[
  E[y_j] = c\, p_0 \, \pi_\text{PDS}^j.
\]

In log-scale,
\[
  \log(E[y_j]) = \alpha + \beta \cdot j,
\]
where $\alpha = \log(c\, p_0)$ and $\beta = \log(\pi_\text{PDS})$.

This leads to estimating PDS by regressing the log-number of catches (+ 1, to avoid divergences) on the number of days since release (_age_).

The methodology was taken from @kay_aedes_1998, but originally developed in @gillies_studies_1961.
There is no appropriate statistical model other than the observation the the rate of decrease in the caught individuals is approximately constant, which leads to the regression approach.

The __recapture__ $\theta$ and __survival__ $S$ rates are computed as

\begin{equation}
  \begin{aligned}
    \theta & = \exp(\alpha) / (N + \exp(\alpha)) \\
    S & = \exp(\beta) / (1 + \theta)^{1/d},
  \end{aligned}
\end{equation}
where $\alpha$ and $\beta$ are the regression coefficients of the log-catches against age, $N$ is the number of individuals released and $d$ is the number of days after release (_age_ in days). We use $d = 1$.

The function `sit_survival()` returns a table with the _Probability of Daily Survival_ (PDS), the _Average Life Expectancy_ `ALE`, _Recapture Rate_ `RR`, _Survival Rate_ `SR`), by population (by default, unless `pool = TRUE`)

```{r survival-prototype}
(surv_prototype <- sit_survival(sit_prototype))
```

```{r survival-plot}
plot(surv_prototype)
```


### 5. Density or size of the wild population

An estimation of the __density__, or equivalently, the __size__ of the wild
mosquito population is necessary for the determination of the appropriate
quantity of sterile males to be released for controlling the population
effectively.

A simple estimate is obtained as follows [@thompson_sampling_2012, Ch. 18].
Let the total captures at a day $t$ be the sum of $m_t$ marked and $n_t$ wild mosquitoes.
Assuming that the proportion of marked individuals in the sample equals that in the population of size $P$:
\[
  \frac{m_t}{m_t + n_t} = \frac{M_t}{M_t + P},
\]
where $M_t = R\,S^{a_t}$ is the number of marked individuals captured at time $t$, with $R$ the number of released adults, $S$ the daily survival rate and $a_t$ the number of days since release (_age_). 
I.e., the number of marked individuals at time $t$ is the
remaining number from those released that survived for $a_t$ days.

The __Lincoln Index__ (_a.k.a._ the __Petersen estimator__) has been used as a simple estimate of the wild __male__ population size, assuming that the survival rate of an individual remains constant.
Here we use a _modified_ version that corrects for small samples and compensates for daily survival.
\[
  P_t = R\,S^{a_t}\,(n_t + 1) / (m_t + 1).
\]

The values of $R$, $n_t$, $m_t$ and $t$ can be gathered from the __adult surveys
data__. The calculation requires the estimation of the survival rate $S$.

This provides multiple estimates of the population size, one for each age $t$ and for each released population, which can be averaged to compute a final estimate based on all the available data.

The calculation needs to be split by released population. First, because
survival rates could be different among populations, but also because the number
$n_t$ of captured wild males at a given date can be compared against the number
of captured sterile males $m_t$ at different ages. Still, the function allows
pooling data across population as if they were the same.

The function returns a point estimate and the standard deviation.

```{r wild-size-prototype}
sit_wild_size(sit_prototype)
```


## Bibliography