Title: | Evaluation of Genotyping Error in Genotype-by-Sequencing Data |
---|---|
Description: | This is a small, lightweight package that lets users investigate the distribution of genotypes in genotype-by-sequencing (GBS) data where they expect (by and large) Hardy-Weinberg equilibrium, in order to assess rates of genotyping errors and the dependence of those rates on read depth. It implements a Markov chain Monte Carlo (MCMC) sampler using 'Rcpp' to compute a Bayesian estimate of what we call the heterozygote miscall rate for restriction-associated digest (RAD) sequencing data and other types of reduced representation GBS data. It also provides functions to generate plots of expected and observed genotype frequencies. Some background on these topics can be found in a recent paper "Recent advances in conservation and population genomics data analysis" by Hendricks et al. (2018) <doi:10.1111/eva.12659>, and another paper describing the MCMC approach is in preparation with Gordon Luikart and Thierry Gosselin. |
Authors: | Eric C. Anderson [aut, cre] |
Maintainer: | Eric C. Anderson <[email protected]> |
License: | CC0 |
Version: | 0.0.2 |
Built: | 2024-11-07 03:57:48 UTC |
Source: | https://github.com/eriqande/whoa |
Under the assumption of Hardy-Weinberg equilibrium, this function uses the estimated allele frequencies from the data set in v to compute expected genotype frequencies, and then reports these along with the observed genotype frequencies. Loci come out named as CHROM–POS.
exp_and_obs_geno_freqs( v = NULL, d012 = NULL, prop_indv_required = 0.5, prop_loci_required = 0.5 )
exp_and_obs_geno_freqs( v = NULL, d012 = NULL, prop_indv_required = 0.5, prop_loci_required = 0.5 )
v |
a ‘vcfR’ object. Exactly one of |
d012 |
an integer matrix (or a numeric matrix, which will be coerced to
be of integer type) with individuals in columns, and markers in rows.
0 denotes a genotype homozygous for the reference allele, 1 is a heterozygote, 2 is a
homozygote for the alternate allele, and -1 denotes missing data. This matrix is not
required to have column (sample) names. They won't be used if they are present. But,
the matrix must have rownames, which should be in the format of CHROM–POS (i.e. the "chromosome"
name (or the "contig" name) followed by a "–" followed by the position of the marker in the "chromosome").
Exactly one of |
prop_indv_required |
loci will be dropped if a proportion of individuals less than prop_indv_required have non-missing data at that locus. Default is 0.5 |
prop_loci_required |
individual will be dropped if their proportion of non-missing loci is less than prop_loci_required. Default is 0.5. |
Returns a tibble with the following columns: snp
= the locus name
as CHROM–POS; p
= The frequency of the alternate (ALT) allele; ntot
= the total
number of individuals with no missing data at the locus; geno
= column
telling which genotype (0, 1, or 2) is referred to; p_exp
= expected
frequency of the genotype; p_obs
= observed frequency of genotype;
n_exp
= expected number of such genotypes; n_obs
= observed
number of such genotypes; z_score
= simple statistic giving how far
the observed genotype frequency is from that expected under Hardy-Weinberg
equilibrium.
eao <- exp_and_obs_geno_freqs(v = lobster_buz_2000) # if you wanted to run that on an 012 matrix, # it would be like this: eao012 <- exp_and_obs_geno_freqs(d012 = lobster_buz_2000_as_012_matrix)
eao <- exp_and_obs_geno_freqs(v = lobster_buz_2000) # if you wanted to run that on an 012 matrix, # it would be like this: eao012 <- exp_and_obs_geno_freqs(d012 = lobster_buz_2000_as_012_matrix)
return a ‘ggplot2’ plot object of observed and expected genotype freqs
geno_freqs_scatter(gfc, alpha = 0.2, max_plot_loci = 500)
geno_freqs_scatter(gfc, alpha = 0.2, max_plot_loci = 500)
gfc |
a tibble like that created by exp_and_obs_geno_freqs() |
alpha |
the transparency (alpha) parameter to apply to the points in the scatterplot. Default is 0.2. |
max_plot_loci |
By default this plots only 500 loci, sampled randomly, to keep ‘ggplot2’ taking forever to plot, for example, 100K points. If you want to plot all the points, set this to a number larger than the number of single nucleotide polymorphisms (SNPs) in the data set. |
# get the expected and observed geno freqs gfreqs <- exp_and_obs_geno_freqs(lobster_buz_2000) g <- geno_freqs_scatter(gfreqs) # now g is a 'ggplot2' object.
# get the expected and observed geno freqs gfreqs <- exp_and_obs_geno_freqs(lobster_buz_2000) g <- geno_freqs_scatter(gfreqs) # now g is a 'ggplot2' object.
This function calls internal C++ routines that perform Markov chain Monte Carlo (MCMC) to sample from the posterior distribution of the heterozygote miscall rate for each read depth category.
infer_m( v, minBin, indivs = NULL, init_m = 0.1, num_sweeps = 500, burn_in = 100 )
infer_m( v, minBin, indivs = NULL, init_m = 0.1, num_sweeps = 500, burn_in = 100 )
v |
a ‘vcfR’ object holding the information from a variant call format (VCF) file with the genotype and depth data. If you are going to be breaking the estimates down by read depth categories, the VCF file must have a DP field for every genotype. |
minBin |
minimum number of observations for each read depth bin. If you have 10K markers and 50 individuals then you have about 500,000 genotypes to play with. Requiring bins with at least 5,000 genotypes in them will give you less than 100 bins. You can play around with this number to get the right number of bins. The algorithm breaks the read depths up into bins that have at least minBin genotypes in them. |
indivs |
a character vector holding the names of the individuals from the VCF file to include in the analysis. By default this is NULL, in which case everyone from the file is included. |
init_m |
the initial value of the heterozygote miscall rate assumed for each read depth bin. By default this is 0.1. |
num_sweeps |
the number of sweeps of the MCMC algorithm to do. Default is 500, which is a little on the short side. Run multiple times and make sure the values obtained are similar across runs to assess convergence. |
burn_in |
how many sweeps from the beginning of the chain to discard when computing the posterior means and quantiles. Default is 100. Note that full traces of the visited m values for every read depth bin are returned as well, so that the behavior of the chain in those early steps can be investigated. |
The read depth bins are
determined by passing to the function minBin
—the minimum number of observations desired for each read
depth bin. The function then breaks the observations into bins so that each read depth bin
has at least minBin observations.
Note that if you want to estimate the heterozygote miscall rate overall (i.e., not conditioning each estimate on a read depth bin), then simply give a very large number (larger than the number of markers times the number of individuals) for minBin. For example, you could use a number like 1e15 for minBin. As a consequence, all the genotypes will be put into a single read depth bin.
A list with six components:
A tibble with 6 columns: bin = the index of the read depth bin; mean = the posterior mean estimate of the the heterozygote miscall rate in that bin; lo95 = the low endpoint of the 95 95 and mean_dp = the mean read depth in the bin.
A tibble with all the values visited for m for every read depth bin. This tibble has three columns: bin = the index of the read depth bin; sweep = the sweep number, value = the value of m for that read depth bin in that particular sweep.
A tibble summarizing how many genotypes of different read depths appear in each bin.
A tibble with a different summary of the read-depth bins.
Number of MCMC sweeps used.
Number of sweeps discarded as burn in.
# Shorter run than recommended (for quick example...) im <- infer_m(lobster_buz_2000, minBin = 1000, num_sweeps = 100, burn_in = 20)
# Shorter run than recommended (for quick example...) im <- infer_m(lobster_buz_2000, minBin = 1000, num_sweeps = 100, burn_in = 20)
A data set from the study by Benestan et al. (2016).
lobster_buz_2000
lobster_buz_2000
A ‘vcfR’ object in which the original data set has been reduced to just the 36 lobsters from the BUZ population and a randomly sampled 2000 SNPs from the 10,156 originally available.
This is the sort of object obtained after calling vcfR::read.vcfR()
on a variant call format (VCF) file.
https://datadryad.org/stash/dataset/doi:10.5061/dryad.q771r
A data set from the study by Benestan et al. (2016).
lobster_buz_2000_as_012_matrix
lobster_buz_2000_as_012_matrix
This is an integer matrix with positions in rows and samples in columns.
It is an 012 matrix that corresponds to lobster_buz_2000. It is useful as
an example of the necessary format for the d012 argument to exp_and_obs_geno_freqs
.
https://datadryad.org/stash/dataset/doi:10.5061/dryad.q771r
The standard way within R of pulling values out of a named vector really bogs down on large data sets. So I will do this instead.
make_it_012(M)
make_it_012(M)
M |
a character matrix of variant call format (VCF) genotypes and no dimnames. Allowable values are "0/0", and "0|0", which get coverted to integer 0; "0/1", "0|1", "1/0", and "1|0", which get converted to integer 1; and "1/1", and "1|1", which get converted to integer 2. Everything else gets converted to -1 to denote missing data. |
An integer matrix of values which are 0, 1, 2, or -1.
# get an 012 matrix from the lobster data tmp <- t(vcfR::extract.gt(lobster_buz_2000, element = "GT")) dimnames(tmp) <- NULL g <- make_it_012(tmp)
# get an 012 matrix from the lobster data tmp <- t(vcfR::extract.gt(lobster_buz_2000, element = "GT")) dimnames(tmp) <- NULL g <- make_it_012(tmp)
This just returns a ‘ggplot2’ plot object that plots the read depth bins on the x-axis and the posterior mean m estimates (and credible intervals) on the y-axis, and depicts the number of genotypes in each read depth bin using color.
posteriors_plot(P)
posteriors_plot(P)
P |
the tibble that is the m_posteriors component of |
a ‘ggplot2’ plot object.
# get something to plot (short run for example) im <- infer_m(lobster_buz_2000, minBin = 1000, num_sweeps = 100, burn_in = 20) # then plot it g <- posteriors_plot(im$m_posteriors) # now g is a 'ggplot2' object
# get something to plot (short run for example) im <- infer_m(lobster_buz_2000, minBin = 1000, num_sweeps = 100, burn_in = 20) # then plot it g <- posteriors_plot(im$m_posteriors) # now g is a 'ggplot2' object
The name is found in the capitals here: W
here's my H
eterozygotes?! O
bservations
of genotyping A
ccuracy.
whoa
main high-level functionsThe main function in the package whoa is infer_m
.
This function infers the heterozygote miscall rate (the rate at
which true heterozygotes have been miscalled as homozygotes in
genotype-by-sequencing data) for calls made upon genotypes falling
within different read depth categories.
The output from infer_m
is easily plotted by passing the
m_posteriors component of the output list from infer_m
into
posteriors_plot
.
The package comes with a small data set, lobster_buz_2000
, which
was read in from a variant call format (VCF) file and is now stored in the package as a ‘vcfR’ object.