---
title: "Using the Fully Bayesian Model in rubias"
author: "Ben Moran"
date: "6/3/2019"
output: 
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Using the Fully Bayesian Model in rubias}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  fig.width = 7,
  fig.height = 5
)
```

The default model in `rubias` is a conditional model in which inference is done with the baseline
(reference) allele counts fixed.  The package now also includes a fully Bayesian GSI model, as is
implemented in the software packages `BAYES` and `cBayes`, among others. This
model differs from the conditional model in two main respects: 

 - Individuals within the mixture that are allocated (on any particular step of the MCMC) to one of
 the reference samples have their genotypes added to that reference sample's allele counts. 
 - These updated allele counts are used as Dirichlet parameters to explicitly draw an estimate of
 baseline allele frequecies, from which genotype likelihoods are calculated (in the conditional
 model, genotype likelihoods are pulled straight from the Dirichlet parameters using a Compound
 Dirichlet-Multinomial distribution).

Ideally, updating the reference allele counts with mixture genotypes will refine the estimate of
allele frequencies in that sample. This is especially useful in cases where the reference dataset 
is small relative to the number of mixture individuals, as well as those in which the mixture 
contains populations not present in the baseline. However, the fully Bayesian model is much more 
computationally intensive, and can exhibit pathological behavior when reference collections are 
not well-differentiated.

## Running the Model with Baseline Resampling

Load the required packages first:
```{r, eval=FALSE}
library(rubias)
library(tidyverse)
```

```{r, echo=FALSE}
# this is what we actually evaluate.
library(rubias)

# all the following libraries can be loaded with "library(tidyverse)"
# but then you have to put tidyverse in the Suggests because this is
# in the vignette, and that is bad practice, so, load the packages separately...
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
```

The basic way to invoke the fully Bayesian model is to use the `infer_mixture` function with
the `method` option set to "BR" (for "baseline resampling"). **NOTE:** We use very few reps here
to not take too much time when checking this package for CRAN.  You should use more...
```{r test_br}
full_model <- infer_mixture(
  reference = small_chinook_ref, 
  mixture = small_chinook_mix, 
  gen_start_col = 5, 
  method = "BR",
  reps = 100,
  burn_in = 35
  )
```

Note that the output generated by "BR" is still a list of four tidy data frames, but the 
`bootstrapped_proportions` data frame of methods "MCMC" and "PB" is replaced with an 
`allele_frequencies` data frame. This data frame contains the posterior mean allele frequencies 
(theta) for each population, updated based on the allocations of mixture individuals throughout 
MCMC. The first three columns specify the mixture collection, locus and allele in question, 
and all subsequent columns report the frequency for a particular population. 

```{r view_theta}
full_model$allele_frequencies
```

Also note that the log-likelihoods and Z-scores included in the `indiv_posteriors` output data 
frame are calculated using only the reference allele counts prior to MCMC, and so do **not** 
utilize the full model results.

Let's compare the results of this to those from the conditional model:
```{r compare}
set.seed(15)
cond_model <- infer_mixture(
  reference = small_chinook_ref, 
  mixture = small_chinook_mix, 
  gen_start_col = 5, 
  method = "MCMC"
  )

comppi <- cond_model$mixing_proportions %>%
  mutate(cond_pi = pi, full_pi = full_model$mixing_proportions$pi)

ggplot(comppi, aes(x = cond_pi, y = full_pi, colour = collection)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  facet_wrap(~mixture_collection) +
  theme(legend.position = "bottom")
  

```

The two methods are largely in agreement for this small test set.

### Initializing Mixture Proportions with the Conditional Model

When collections are poorly resolved, the fully Bayesian model may show pathological behaviors 
such as allocating all individuals to only one of the closely related populations. One way to 
reduce this behavior is to initialize the fully Bayesian model with the output from the conditional
model. `rubias` explicitly supports this through the options `prelim_reps` and `prelim_burn_in`: if
changed from the `NULL` default, these parameters specify conditional MCMC cycles to perform, the
posterior mean mixing proportions of which are used as the initial mixing proportion estimates in
the fully Bayesian model. For example:

```{r full_prelim}

set.seed(15)
prelim_full_model <- infer_mixture(
  reference = small_chinook_ref, 
  mixture = small_chinook_mix, 
  gen_start_col = 5, 
  method = "BR",
  reps = 100,
  burn_in = 35,
  prelim_reps = 100,
  prelim_burn_in = 50,
  )

prelimpi <-  prelim_full_model$mix_prop_traces %>%
  filter(sweep == 0) %>%
  select(-sweep) %>%
  mutate(prelim_pi = pi, full_pi = prelim_full_model$mixing_proportions$pi)

ggplot(prelimpi, aes(x = prelim_pi, y = full_pi, colour = collection)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  facet_wrap(~mixture_collection) +
  theme(legend.position = "bottom")

```

In this case, the results are largely identical to those from uniform initial proportions, 
potentially because the collections are well-resolved to begin with.

## Managing Parallelization

To incorporate baseline updates while keeping runtimes reasonable, genotype likelihood calculations
in the fully Bayesian model are parallelized using `RcppParallel`. By default, `RcppParallel`
runs one thread per core on your machine; to check the number of cores available, use the
`detectCores()` function from package `parallel`. However, this default behavior is dangerous on
HPC systems, where a mismatch between the number of cores requested by the user and the number of
threads demanded by `RcppParallel` might cause the job to abort. In these cases, the number of
threads should be manually set with `RcppParallel::setThreadOptions`. For example, to set the
number of threads used to 1:
```{r setcores}
RcppParallel::setThreadOptions(numThreads = 1)
```
Then you would run it just like before:

```r
full_model <- infer_mixture(
  reference = small_chinook_ref, 
  mixture = small_chinook_mix, 
  gen_start_col = 5, 
  method = "BR"
  )
```

This will slow things down relatively to using multiple cores.

To reset the threading options to utilize all 
available cores, just use:

```{r resetcores}
RcppParallel::setThreadOptions(numThreads = RcppParallel::defaultNumThreads())
```