Title: | Simulating Admixed Genotypes Without Replacement |
---|---|
Description: | A genomic simulation approach for creating biologically informed individual genotypes from empirical data that 1) samples alleles from populations without replacement, 2) segregates alleles based on species-specific recombination rates. 'gscramble' is a flexible simulation approach that allows users to create pedigrees of varying complexity in order to simulate admixed genotypes. Furthermore, it allows users to track haplotype blocks from the source populations through the pedigrees. |
Authors: | Eric C. Anderson [aut, cre] , Rachael M. Giglio [aut] , Matt G. DeSaix [aut] , Timothy J. Smyser [aut] |
Maintainer: | Eric C. Anderson <[email protected]> |
License: | CC0 |
Version: | 1.0.1.9000 |
Built: | 2024-11-18 03:09:28 UTC |
Source: | https://github.com/eriqande/gscramble |
For 'gscramble' to work properly, all variant positions on a chromosome (found in the meta data file) must be equal to or less than the total chromosome length found in the recombination map. In other words, the variant positions must be within the the total length of each chromosome. The check_chrom_lengths() function checks that variant positions on each chromosome do not exceed the total chromosome length from the recombination map. Input files for the function include 1) the meta data file which contains 3 columns (the chromosome ID as a character, the position of the variant as a numeric, and the name of the variant as a character) and 2) the recombination map which contains 5 columns (the chromosome ID as a character, the total length of the chromosome as a numeric, the starting position of the recombination bin as a numeric, the last position of the recombination bin as a numeric, and the recombination probability for the given bin as a numeric).
check_chrom_lengths(meta, rec)
check_chrom_lengths(meta, rec)
meta |
a tibble with meta data for the genotype data. It must consist of the columns (chrom, pos, variant_id). |
rec |
a tibble with recombination map information for your study species.It must contain 5 columns (chrom, chrom_len, start_pos, end_pos, rec_prob). |
This function will return a message for each chromosome stating whether the chromosome lengths are accurate
# The example uses the built in datasets M_meta and RecRates # to use the check_chrom_lengths() function check_chrom_lengths(M_meta,RecRates)
# The example uses the built in datasets M_meta and RecRates # to use the check_chrom_lengths() function check_chrom_lengths(M_meta,RecRates)
These conditions can be checked for a GSP with no inbreeding loops
simply by ensuring that the amount of genetic material coming into
each individual is the same as the amount going out (either as
segregated gametes or as samples). If the amount of material coming
out of any individual in the GSP is greater than the amount coming
in, then an error is thrown. If the amount coming out is less than
the amount coming in, then a warning about the GSP is thrown.
Messages printed via message()
and warning()
indicate which
individuals in the GSP are problematic. All problematic individuals
are listed before an error is thrown with stop()
.
check_gsp_for_validity_and_saturation(GP)
check_gsp_for_validity_and_saturation(GP)
GP |
A gsp in list format as produced by the function
|
This function does not return anything.
# get the 13 member pedigree in tibble form as the package # and turn it into a list GP <- prep_gsp_for_hap_dropping(GSP) # check it. (This passes) check_gsp_for_validity_and_saturation(GP) ## The following will show a failure, so we wrap it in tryCatch ## so CRAN check does not flag it as a problem. # Read in a gsp with errors and then make sure all the # error in it are caught bad <- readr::read_csv(system.file("extdata/13-member-ped-with-errors.csv", package = "gscramble")) # check_gsp_for_validity_and_saturation() is called internally from # within prep_gsp_for_hap_dropping(), after creating a list-from GSP. # This will show the error produced by check_gsp_for_validity_and_saturation(). badL <- tryCatch( prep_gsp_for_hap_dropping(bad), error = function(x) 0, warning = function(x) 0 )
# get the 13 member pedigree in tibble form as the package # and turn it into a list GP <- prep_gsp_for_hap_dropping(GSP) # check it. (This passes) check_gsp_for_validity_and_saturation(GP) ## The following will show a failure, so we wrap it in tryCatch ## so CRAN check does not flag it as a problem. # Read in a gsp with errors and then make sure all the # error in it are caught bad <- readr::read_csv(system.file("extdata/13-member-ped-with-errors.csv", package = "gscramble")) # check_gsp_for_validity_and_saturation() is called internally from # within prep_gsp_for_hap_dropping(), after creating a list-from GSP. # This will show the error produced by check_gsp_for_validity_and_saturation(). badL <- tryCatch( prep_gsp_for_hap_dropping(bad), error = function(x) 0, warning = function(x) 0 )
After a GSP has been passed through prep_gsp_for_hap_dropping()
it
is in a list format with the individuals ordered in such a way that
it should be easy to check for any inbreeding loops in it (which
are not allowed!). This version uses a simple recursive
approach to compute the ancestry vector for each individual, and it
detects inbreeding by the occurrence of the same ID in the ancestry
vector more than once. This might be slow on large pedigrees, but for
most that people would use, this should be fine.
check_pedigree_for_inbreeding(GP)
check_pedigree_for_inbreeding(GP)
GP |
A gsp in list format as produced by the function
|
Note that the ancestry vector produced by this is not ordered the way the ancestry vectors are in my package CKMRpop—for simplicity I just get a list of ancestors in whatever order they happen to be reached.
This function does not return anything. It throws an error via stop()
if
inbreeeding loops are found in the pedigree. Before throwing that error it lists
the individuals with repeated occurrences in their ancestry vectors via the
message()
function.
# get the 13 member pedigree in the data object GSP and # turn it into a list GP <- prep_gsp_for_hap_dropping(GSP) # check it for inbreeding. (There is none) check_pedigree_for_inbreeding(GP) ## This one will fail, so we wrap it in tryCatch so CRAN ## check doesn't find it a problem. # To see what happens if there are inbreeding loops, make some GP_inbred <- GP # make 12 be inbred trough individual 6 GP_inbred$`12`$par1$par = "13" # make 8 inbred (because both of its founder parents are the same!) GP_inbred$`8`$par2$par = "4" # now try that: tryCatch( check_pedigree_for_inbreeding(GP_inbred), error = function(x) 0, warning = function(x) 0 )
# get the 13 member pedigree in the data object GSP and # turn it into a list GP <- prep_gsp_for_hap_dropping(GSP) # check it for inbreeding. (There is none) check_pedigree_for_inbreeding(GP) ## This one will fail, so we wrap it in tryCatch so CRAN ## check doesn't find it a problem. # To see what happens if there are inbreeding loops, make some GP_inbred <- GP # make 12 be inbred trough individual 6 GP_inbred$`12`$par1$par = "13" # make 8 inbred (because both of its founder parents are the same!) GP_inbred$`8`$par2$par = "4" # now try that: tryCatch( check_pedigree_for_inbreeding(GP_inbred), error = function(x) 0, warning = function(x) 0 )
This operates on the output of segregate to add up the lengths of all the segments segregated to different individuals to thus compute the admixture fractions of each sampled individual.
computeQs_from_segments(S, check_total_length = TRUE)
computeQs_from_segments(S, check_total_length = TRUE)
S |
the tibble output from |
check_total_length |
TRUE means it checks the total genome length in each individual to make sure it checks out. |
This function returns a tibble with the following columns:
gpp
: the genomic simulation pedigree within which the individual sample was
simulated.
index
: the index which gives which instance of the GSP the sample is from
ped_sample_id
: the id number of that the sampled individual had in the
genomic simulation pedigree.
samp_index
: the index of the sample taken. Some individuals in some
genomic simulation pedigrees can produce more than one sample. This number
tells you which sample it is.
pop_origin
: the "pedigree" population of origin of the segments that contributed
to the group_length
. These are
the simple "A", "B", "C", etc. designations given in the genomic simulation
pedigree.
group_origin
: Which group of samples the segments contributing to
the group_length
originated from. These are the groups of samples
that were mapped onto the simple pedigree pop_origin
s by the reppop
request.
group_length
: the total length of segments from this group in this individual
in this reppop index from this gpp (in bases).
tot_length
: the total number of bases from all origins carried by this
individual.
admixture_fraction
: the fraction of all bases in the simulated individual
that originate from the group in group_origin
.
#### Get output from segregate to use as input #### # We construct an example here where we will request segregation # down a GSP with two F1s and F1B backcrosses between two hypothetical # populations, A and B. gsp_f1f1b <- create_GSP("A", "B", F1 = TRUE, F1B = TRUE) # We will imagine that in our marker data there are three groups # labelled "grp1", "grp2", and "grp3", and we want to create the F1Bs with backcrossing # only to grp3. reppop <- tibble::tibble( index = as.integer(c(1, 1, 2, 2)), pop = c("A", "B", "A", "B"), group = c("grp3", "grp1", "grp3", "grp2") ) # combine those into a request request <- tibble::tibble( gpp = list(gsp_f1f1b), reppop = list(reppop) ) # now run it through segregate() set.seed(5) # just for reproducibility in example... simSegs <- segregate(request, RecRates) #### Now we can run those through computeQs_from_segments() #### Qs <- computeQs_from_segments(simSegs) Qs
#### Get output from segregate to use as input #### # We construct an example here where we will request segregation # down a GSP with two F1s and F1B backcrosses between two hypothetical # populations, A and B. gsp_f1f1b <- create_GSP("A", "B", F1 = TRUE, F1B = TRUE) # We will imagine that in our marker data there are three groups # labelled "grp1", "grp2", and "grp3", and we want to create the F1Bs with backcrossing # only to grp3. reppop <- tibble::tibble( index = as.integer(c(1, 1, 2, 2)), pop = c("A", "B", "A", "B"), group = c("grp3", "grp1", "grp3", "grp2") ) # combine those into a request request <- tibble::tibble( gpp = list(gsp_f1f1b), reppop = list(reppop) ) # now run it through segregate() set.seed(5) # just for reproducibility in example... simSegs <- segregate(request, RecRates) #### Now we can run those through computeQs_from_segments() #### Qs <- computeQs_from_segments(simSegs) Qs
This function allows the user to choose two populations and create a GSP input for 'gscramble'.
The input requires two different population IDs of class character as well as at least one TRUE statement for one of
the hybrid parameters (F1, F2, F1B, F1B2). The GSP will indicate hybrid individuals that will be sampled based on
which F1, F2, F1B, and F1B2 parameters are TRUE. F1= TRUE means F1 hybrids will be sampled. F2=TRUE means F2 hybrids
will be sampled. F1B2=TRUE means F1 backcross hybrids will be sampled. F1B2=TRUE means F1 backcross 2 hybrids will be sampled.
Default setting for these parameters is FALSE. The function then outputs a GSP in tibble format that can be used for other
functions in 'gscramble' including check_pedigree_for_inbreeding()
and prep_gsp_for_hap_dropping()
.
create_GSP( pop1, pop2, F1 = FALSE, F2 = FALSE, F1B = FALSE, F1B2 = FALSE, AllowSinglePop = FALSE )
create_GSP( pop1, pop2, F1 = FALSE, F2 = FALSE, F1B = FALSE, F1B2 = FALSE, AllowSinglePop = FALSE )
pop1 |
character name for population 1 |
pop2 |
character name for population 2 |
F1 |
logical indicating whether you would like to have sampled F1 hybrids in the output. |
F2 |
logical indicating whether you would like to have sampled F2 hybrids in the output. |
F1B |
logical indicating whether you would like to have sampled F1 backcross hybrids in the output. |
F1B2 |
logical indicating whether you would like to have sampled F1 backcross 2 hybrids in the output. |
AllowSinglePop |
Set to true if you want all the founders from just a single population. This has some use cases... |
This function returns a GSP in tibble format with the user argument for pop1 and pop2 autopopulated in the hpop1 and hpop2 columns.
# create a GSP that generates hybrids and samples F1s and F1B's from pops A and B gsp <- create_GSP("A", "B", F1 = TRUE, F1B = TRUE) #create a GSP that generates hybrids and samples F1s, F2s, F1Bs, F1B2s from pops A and B gsp <- create_GSP("Pop_X", "Pop_Z", F1 = TRUE, F2 = TRUE, F1B = TRUE, F1B2 = TRUE)
# create a GSP that generates hybrids and samples F1s and F1B's from pops A and B gsp <- create_GSP("A", "B", F1 = TRUE, F1B = TRUE) #create a GSP that generates hybrids and samples F1s, F2s, F1Bs, F1B2s from pops A and B gsp <- create_GSP("Pop_X", "Pop_Z", F1 = TRUE, F2 = TRUE, F1B = TRUE, F1B2 = TRUE)
This one asks for the number of reps to do, and it also automatically does it over chromosomes and returns the results in a nice tidy tibble.
drop_segs_down_gsp(GSP, RR, Reps)
drop_segs_down_gsp(GSP, RR, Reps)
GSP |
the pedigree to use for the simulation, in the format of
the package data |
RR |
the recombination rates in the format of the package data
|
Reps |
the number of times to do the simulation. Different replicates are denoted by the index column in the output tibble. |
The output from this function is a tibble. Each row represents one segment of genetic
material amongst the sampled individuals from the genomic permutation pedigrees. The columns give
information about the provenance and destination of that segment as follows.
Each segment exists in one of the samples (samp_index
) from a sampled individual
with a ped_sample_id
in a given samp_index
within the individual. Further, it is on one of two gametes
(gamete_index
) that segregated into the individual, and it came from a certain founding
population (pop_origin
).
And, of course, the segment occupies the space from start
to end
on a chromosome chrom
.
Finally, the index of the founder haplotype on the given gpp that this segement descended from is
given in rs_founder_haplotype
which is short for "rep-specific founder haplotype". This final
piece of information is crucial for segregating variation from the individuals in the Geno
file
onto these segments. The gamete_segments
column is a list column with duplicated entries for
each chromosome in an individual.
simSegs <- drop_segs_down_gsp(GSP, RecRates, 4)
simSegs <- drop_segs_down_gsp(GSP, RecRates, 4)
These are for the example of how to use plink_map2rec_rates()
.
A tibble with two columns: chrom
and bp
.
These lengths were taken from the maximal values in Tortereau, Flavie, et al. "A high density recombination map of the pig reveals a correlation between sex-specific recombination and GC content." BMC genomics 13.1 (2012): 1-12. It is available for download from https://static-content.springer.com/esm/art%3A10.1186%2F1471-2164-13-586/MediaObjects/12864_2012_4363_MOESM1_ESM.txt. After downloading the data were processed to remove inconsistencies with the marker data set used for M_meta and Geno.
This is the Segments object from the 'gscramble' tutorial.
It is included as a data object to use in the example for
the function plot_simulated_chromosome_segments()
.
A tibble like that produced by the segregate()
function.
Made from package functions
This dataset represents 3 distinct populations of feral swine in the United States.
To make this dataset computationally efficient, only 3 chromosomes (12, 17, 18) from the pig genome were used.
Further, loci were reduced to the 100 most informative loci for distinguishing the 3 populations.
Each row of the genotype matrix includes the genotypes for a single individual.
The genotype matrix is in a standard "two-column" format for a diploid species
where two adjacent columns make up a locus and each column of a locus contains an allele.
Genotype data is stored in the character class.
Missing data is represented by NAs.
Individual IDs and population assignment information can be found in I_meta
Locus and chromosome information for genotypes can be found in M_meta
A character matrix. Each row represents an individual and columns contain the genotypes for individuals in a "two-column" format where two adjacent columns make up a locus with an allele in each column.
USDA-APHIS-WS-National Wildlife Research Center Feral Swine Genetic Archive
This function turns character-based alleles into integers
and writes the necessary headers, etc. It preferentially uses
the "id" column if it exists in M$ret_ids
. Otherwise it uses
the indiv
column for the sample names.
gscramble2newhybrids( M, M_meta, z = NULL, s = NULL, retain = NULL, outfile = tempfile() )
gscramble2newhybrids( M, M_meta, z = NULL, s = NULL, retain = NULL, outfile = tempfile() )
M |
the output from |
M_meta |
the Marker meta data file. |
z |
A vector of length two. The values
are regular expressions that the sample names that you want to have
-z 0 or -z 1 should match. For example |
s |
a single regular expressions that matches individuals that should be given the -s option. For example "SH|CCT" |
retain |
a vector of loci to retain. |
outfile |
path to the file to write the newhybrids data set to. For CRAN compliance, this is, by default, a temp file. But you can change it to be anything valid. |
It allows you to set the -s and -z through some regular expression mapping.
This function relies a lot on some tidyverse functions for pivoting, etc. As such, it is not intended for data sets with tens of thousands of markers. You oughtn't be using NewHybrids with so many markers, anyway!
A list with three components:
outfile
: outfile name of saved data.
genos
: Genotypes
allele_names
: Allele names.
# get output from segments2markers(): example("segments2markers") # copy that result to a new variable M <- s2m_result # then run it gscramble2newhybrids(M, M_meta)
# get output from segments2markers(): example("segments2markers") # copy that result to a new variable M <- s2m_result # then run it gscramble2newhybrids(M, M_meta)
Writes genetic and individual information in 'gscramble's
I_meta, M_meta, and Geno like objects into a uncompressed plink
.ped
and .map
files.
gscramble2plink(I_meta, M_meta, Geno, prefix = tempfile())
gscramble2plink(I_meta, M_meta, Geno, prefix = tempfile())
I_meta |
a tibble of individual meta data
with at least the columns of |
M_meta |
a tibble of marker meta data with at least the columns
of |
Geno |
a character matrix of genotypes. Num-indivs rows and num-markers * 2 columns, with missing denoted denoted by NA. |
prefix |
the file path and prefix into which to write out the files |
Returns TRUE if successful.
gscramble2plink(I_meta, M_meta, Geno)
gscramble2plink(I_meta, M_meta, Geno)
A GSP is a pedigree with no inbreeding which specifies
This is a relatively complex GSP. The tibble GSP
specifies its
structure using the following columns:
ind
: the numeric identifier for the individual specific to the row
(we will call that the "focal individual").
par1
: numeric identifier of the first parent of the focal individual.
Must be NA for pedigree founders.
par2
: numeric identifier of the second parent of the focal individual.
Must be NA for pedigree founders.
ipar1
: the number of gametes that will be incoming from the first parent
to the focal individual. Must be NA for pedigree founders. Note that this
must be equal to ipar2
.
ipar2
: the number of gametes that will be incoming from the second parent
to the focal individual. Must be NA for pedigree founders. Note that this
must be equal to ipar1
.
hap1
: character name for the first haplotype of a founder (if the focal
individual is a founder). Must be NA for pedigree non-founders.
hap2
: character name for the second haplotype of a founder (if the focal
individual is a founder). Must be NA for pedigree non-founders.
hpop1
: character ID of the population from which haplotype 1 comes from.
Must be NA for pedigree non-founders.
hpop2
: character ID of the population from which haplotype 2 comes from.
Must be NA for pedigree non-founders.
sample
: character ID for the sample from the focal individual. NA if no samples
are taken from the focal individual, and must be NA for any pedigree founders.
osample
: numeric value giving the number or samples that are taken from this
individual. osample
must be less than or equal to ipar1
and ipar2
. If
osample
is less than ipar1
and ipar2
, then some gametes must get passed
on to descendants of the focal individual.
A tibble
The CSV version of this is in extdata/13-member-ped.csv
.
Created by the developers.
createGSP()
This is a list that is used by the function createGSP()
. There are 15
different genomic simulation pedigrees, specified as tibbles, in this
list.
A list of 15 tibbles
Written by package authors Rachael and Tim
This takes the tibble representation of a GSP and writes it to a dot file to be rendered into a graph using the dot command from the GraphViz package. You can easily get GraphViz using Miniconda or check out the GraphViz downloads page. If you have the dot executable in your PATH, then dot will be run on the dot file and an SVG and a PNG image of the graph.
gsp2dot( g, path = file.path(tempfile(), "file_prefix"), edge_label_font_size = 18, indiv_node_label_font_size = 18, sample_node_label_font_size = 18, haplo_origin_colors = c("lightblue", "orange", "blue", "green", "cadetblue", "dodgerblue3", "darkolivegreen1", "forestgreen", "lightpink", "red2", "sandybrown", "orangered", "plum3", "purple4", "palegoldenrod", "peru"), sam_node_color = "violet", sample_edge_label_color = "purple", parent_edge_label_color = "red" )
gsp2dot( g, path = file.path(tempfile(), "file_prefix"), edge_label_font_size = 18, indiv_node_label_font_size = 18, sample_node_label_font_size = 18, haplo_origin_colors = c("lightblue", "orange", "blue", "green", "cadetblue", "dodgerblue3", "darkolivegreen1", "forestgreen", "lightpink", "red2", "sandybrown", "orangered", "plum3", "purple4", "palegoldenrod", "peru"), sam_node_color = "violet", sample_edge_label_color = "purple", parent_edge_label_color = "red" )
g |
a GSP tibble. |
path |
the path to the file prefix to use (to this will be appended
.dot, and .png or .svg, if dot is on your system). By default these
paths are in a temporary directory, because packages are not allowed to
write to user home directories by default. Supply a path with prefix,
like |
edge_label_font_size |
The font size of the number annotations along the edges. |
indiv_node_label_font_size |
the font size of the labels for the individual nodes |
sample_node_label_font_size |
the font size of the labels for the individual nodes |
haplo_origin_colors |
The colors for different origins of haplotypes. By default there are only sixteen. If you have more populations that founders may come from, you should provide a vector with more than 16 colors. |
sam_node_color |
The color given to the sample nodes in the GSP. |
sample_edge_label_color |
Color for the numeric annotations along the edges leading to samples. |
parent_edge_label_color |
Color for the numeric annotations along the edges leading from parents to offspring. |
It can be tricky knowing whether or not R or Rstudio will read
your Unix rc files and populate your paths appropriately. If you want to
test whether dot
in on your PATH as it is when running under R, try:
Sys.which("dot")
at your R console. If that returns an empty string,
(""
), then you need to do something else to make sure R can find dot
on your system.
A vector of file paths. The first is the path of the dot file that was produced. The second and third are only present if dot was found in the PATH. They are the paths of the png and svg files that were produced.
gsp_file <- system.file("extdata/13-member-ped.csv", package = "gscramble") g <- readr::read_csv(gsp_file) paths <- gsp2dot(g) paths
gsp_file <- system.file("extdata/13-member-ped.csv", package = "gscramble") g <- readr::read_csv(gsp_file) paths <- gsp2dot(g) paths
This has 3 founders
A tibble
For details on the columns, see the documentation for GSP
.
Created by the developers.
This has 4 founders, each one from a different population, and it provides four samples that are the product of F1 matings (an A x B F1 crossing with a C x D F1.)
A tibble
For details on the columns, see the documentation for GSP
.
Created by the developers.
The tibble 'I_meta' contains the Individuals IDs and group specifications for 78 individuals. Each row of 'I_meta'
contains the metadata for an individual. 'I_meta' has two columns, a 'group' column with the group or population assignment
for a given individual and an 'indiv' column with the individual sample IDs. The information in each column are stored as characters.
The order of the individual IDs corresponds to the genotypes found in Geno
.
A tibble with two columns: group and indiv.
USDA-APHIS-WS-National Wildlife Research Center Feral Swine Genetic Archive
This data set contains the metadata for the 100 most divergent SNP loci for three feral swine populations
sampled in the United States. To make the dataset more computationally efficient, only 3 chromosomes were
used (12,17, and 18). The metadata for the SNP loci is in a tibble with three columns:
chrom (character), pos (numeric), variant_id (character).
The column 'chrom' contains the chromosome ID where the SNP is located, the column 'pos' gives the base pair
location on the chromosome, and the column 'variant_id' gives the name of the SNP.
Each row of the metadata tibble contains the metadata for a given SNP locus.
The individual genotypes for each of these SNP loci can be found in Geno
.
A tibble with three columns: chrom, pos, and variant_id.
Chromosome, position, and variant IDs are from the Sus scrofa 10.2 genome https://www.ncbi.nlm.nih.gov/assembly/GCF_000003025.5/.
This function assumes that M is a matrix with L rows (number of markers) and
2 * N (N = number of individuals) columns.
There are two ways that the data might be permuted. In the first,
obtained with preserve_haplotypes = FALSE
,
the position of missing data within the matrix is held constant, but all
non-missing sites within a row (i.e. all gene copies at a locus) get
scrambled amongst the samples. In the second way, just the columns are
permuted. This preserves haplotypes in the data, if there are any.
The second approach should only be used if haplotypes are inferred in
the individuals.
mat_scramble( M, preserve_haplotypes = FALSE, row_groups = NULL, preserve_individuals = FALSE )
mat_scramble( M, preserve_haplotypes = FALSE, row_groups = NULL, preserve_individuals = FALSE )
M |
a matrix with L rows (number of markers) and 2 * N columns where N is the number of individuals. Missing data must be coded as NA |
preserve_haplotypes |
logical indicating whether the haplotypes set to be TRUE |
row_groups |
if not NULL must be a list of indexes of adjacent rows
that are all in the same groups. For example: |
preserve_individuals |
logical indicating whether the genes within each individual should stay togeter. |
There is now an additional way of permuting: if
preserve_individuals = TRUE
, then entire individuals are permuted.
If preserve_haplotypes = FALSE
, then the gene copies at each locus
are randomly ordered within each individual before permuating them.
If preserve_haplotypes = TRUE
then that initial permutation is not
done. This should only be done if the individuals are phased and that
phasing is represented in how the genotypes are stored in the matrix.
This function returns a matrix of the same dimensions and storage.mode
as the input, M
; however the elements have been permuted according to the
options specified by the users.
# make a matrix with alleles named as I.M.g, where I is individual # number, M is marker number, and g is either "a" or "b" depending # on which gene copy in the diploid it is. 4 indivs and 7 markers... Mat <- matrix( paste( rep(1:4, each = 7 * 2), rep(1:7, 4 * 2), rep(c("a", "b"), each = 7), sep = "." ), nrow = 7 ) # without preserving haplotypes S1 <- mat_scramble(Mat) # preserving haplotypes with markers 1-7 all on one chromosome S2 <- mat_scramble(Mat, preserve_haplotypes = TRUE) # preserving haplotypes with markers 1-3 on one chromosome and 4-7 on another S3 <- mat_scramble(Mat, row_groups = list(1:3, 4:7)) # preserving individuals, but not haplotypes, with two chromosomes S4 <- mat_scramble(Mat, row_groups = list(1:3, 4:7), preserve_individuals = TRUE) # preserving individuals by chromosome, but not haplotypes, with two chromosomes S5 <- mat_scramble(Mat, row_groups = list(1:3, 4:7), preserve_individuals = "BY_CHROM") # preserving individuals by chromosome, and preserving haplotypes, with two chromosomes S6 <- mat_scramble(Mat, row_groups = list(1:3, 4:7), preserve_individuals = "BY_CHROM", preserve_haplotypes = TRUE)
# make a matrix with alleles named as I.M.g, where I is individual # number, M is marker number, and g is either "a" or "b" depending # on which gene copy in the diploid it is. 4 indivs and 7 markers... Mat <- matrix( paste( rep(1:4, each = 7 * 2), rep(1:7, 4 * 2), rep(c("a", "b"), each = 7), sep = "." ), nrow = 7 ) # without preserving haplotypes S1 <- mat_scramble(Mat) # preserving haplotypes with markers 1-7 all on one chromosome S2 <- mat_scramble(Mat, preserve_haplotypes = TRUE) # preserving haplotypes with markers 1-3 on one chromosome and 4-7 on another S3 <- mat_scramble(Mat, row_groups = list(1:3, 4:7)) # preserving individuals, but not haplotypes, with two chromosomes S4 <- mat_scramble(Mat, row_groups = list(1:3, 4:7), preserve_individuals = TRUE) # preserving individuals by chromosome, but not haplotypes, with two chromosomes S5 <- mat_scramble(Mat, row_groups = list(1:3, 4:7), preserve_individuals = "BY_CHROM") # preserving individuals by chromosome, and preserving haplotypes, with two chromosomes S6 <- mat_scramble(Mat, row_groups = list(1:3, 4:7), preserve_individuals = "BY_CHROM", preserve_haplotypes = TRUE)
This is done prior to assigning random genomic fragments of individuals in the sample to the founders of the GSP, to be dropped to the samples.
perm_gs_by_pops(GS, preserve_haplotypes = FALSE, preserve_individuals = FALSE)
perm_gs_by_pops(GS, preserve_haplotypes = FALSE, preserve_individuals = FALSE)
GS |
the tibble that is the output from rearrange_genos |
preserve_haplotypes |
If TRUE then the Geno data is assumed phased (first allele at an individual on one haplotype and second allele on the other) and those haplotypes are preserved in this permutation of genomic material amongst the founders. |
preserve_individuals |
If TRUE then whole individuals are permuted
around the data set and the two gene copies at each locus are randomly
permuted within each individual. If |
Returns a list of the same format as the output of rearrange_genos
.
Plus one additional component. Each component of the return list is itself
an unnamed list with one component (makes it easier to use bind_rows
to
create a tibble of list columns from these). The components, once unlisted are:
G
: a matrix—the original genotype data matrix
I
: the I_meta tibble
M
: the M_meta tibble
G_permed
: the genotype matrix after permutation.
# first get the output of rearrange_genos RG <- rearrange_genos(Geno, I_meta, M_meta) # then permute by the populations PG <- perm_gs_by_pops(RG)
# first get the output of rearrange_genos RG <- rearrange_genos(Geno, I_meta, M_meta) # then permute by the populations PG <- perm_gs_by_pops(RG)
This is a convenience function to convert PLINK map format to
the format used in the 'gscramble' RecRates
object. By default,
this function will use the positions of the markers and assume
recombination rates of 1 cM per megabase. If the marker positions
are also available in Morgans in the PLINK map file, the these can be
used by setting use_morgans
to TRUE.
plink_map2rec_rates( map, use_morgans = FALSE, cM_per_Mb = 1, chrom_lengths = NULL )
plink_map2rec_rates( map, use_morgans = FALSE, cM_per_Mb = 1, chrom_lengths = NULL )
map |
path to the plink |
use_morgans |
logical. IF true, the third column in the PLINK map
file (assumed to have the position of the markers in Morgans) will be used
to calculate the |
cM_per_Mb |
numeric. If |
chrom_lengths |
if you know the full length of each chromosome, you can
add those in a tibble with columns |
For simplicity, this function will assume that the length of the chromosome
is just one base pair beyond the last marker. That is typically not correct
but will have no effect, since there are no markers to be typed out beyond
that point. However, if you know the lengths of the chromosomes and want to
add those in there, then pass them into the chrom_lengths
option.
A tibble that provides the recombination rates for the segments of the genome.
mapfile <- system.file( "extdata/example-plink-with-morgans.map.gz", package = "gscramble" ) # get a rec-rates tibble from the positions of the markers, # assuming 1 cM per megabase. rec_rates_from_positions <- plink_map2rec_rates(mapfile) # get a rec-rates tibble from the positions of the markers, # assuming 1.5 cM per megabase. rec_rates_from_positions_1.5 <- plink_map2rec_rates( mapfile, cM_per_Mb = 1.5 ) # get a rec-rates tibble from the cumulative Morgans position # in the plink map file rec_rates_from_positions_Morg <- plink_map2rec_rates( mapfile, use_morgans = TRUE ) # get a rec-rates tibble from the cumulative Morgans position # in the plink map file, and extend it out to the full length # of the chromosome (assuming for that last part of the chromosome # a map of 1.2 cM per megabase.) rec_rates_from_positions_Morg_fl <- plink_map2rec_rates( mapfile, use_morgans = TRUE, cM_per_Mb = 1.2, chrom_lengths = example_chrom_lengths )
mapfile <- system.file( "extdata/example-plink-with-morgans.map.gz", package = "gscramble" ) # get a rec-rates tibble from the positions of the markers, # assuming 1 cM per megabase. rec_rates_from_positions <- plink_map2rec_rates(mapfile) # get a rec-rates tibble from the positions of the markers, # assuming 1.5 cM per megabase. rec_rates_from_positions_1.5 <- plink_map2rec_rates( mapfile, cM_per_Mb = 1.5 ) # get a rec-rates tibble from the cumulative Morgans position # in the plink map file rec_rates_from_positions_Morg <- plink_map2rec_rates( mapfile, use_morgans = TRUE ) # get a rec-rates tibble from the cumulative Morgans position # in the plink map file, and extend it out to the full length # of the chromosome (assuming for that last part of the chromosome # a map of 1.2 cM per megabase.) rec_rates_from_positions_Morg_fl <- plink_map2rec_rates( mapfile, use_morgans = TRUE, cM_per_Mb = 1.2, chrom_lengths = example_chrom_lengths )
This will read .ped and .map files (which can be gzipped, but cannot be the binary .bed or .bim plink format). The population specifier of each individual is assumed to be the first column (the FID column) in the .ped file.
plink2gscramble(ped = NULL, map = NULL, prefix = NULL, gz_ext = FALSE)
plink2gscramble(ped = NULL, map = NULL, prefix = NULL, gz_ext = FALSE)
ped |
path to the plink |
map |
path to the plink |
prefix |
If map and ped are not given as explicit paths to the file, you can give the prefix, and it will search for the two files with the .ped and .map extensions on the end of the prefix. |
gz_ext |
Logical. If TRUE, and specifying files by prefix, this will add a .gz extension to the map and ped files. |
A list with three components:
I_meta
: meta data about the individuals in the file. This will
include the columns of group
(value of the first column of the
ped file) and indiv
(the ID of the individual stored in
second column of the ped file). And wil also include the other four
columns of the plink ped specification, named as follows: pa
ma
, sex_code
, pheno
.
M_meta
: meta data about the markers. A tibble with the columns
chrom
, pos
, and variant_id
and link_pos
. The link_pos
column
holds the information about marker position in Morgans or cM that was
included in the map
file.
Geno
: a character matrix of genotypes with number-of-indviduals rows
and number-of-markers * 2 columns. Missing genotypes in this matrix
are coded as NA
.
ped_plink <- system.file("extdata/example-plink.ped.gz", package = "gscramble") map_plink <- system.file("extdata/example-plink.map.gz", package = "gscramble") result <- plink2gscramble(ped_plink, map_plink)
ped_plink <- system.file("extdata/example-plink.ped.gz", package = "gscramble") map_plink <- system.file("extdata/example-plink.map.gz", package = "gscramble") result <- plink2gscramble(ped_plink, map_plink)
This function uses the information in the tibble about segments dropped down a genome simulation pedigree to plot the chromomomes of an individual colored by the population of origin of each segment.
plot_simulated_chromomsome_segments( Segs, RR = NULL, fill_by_group_origin = FALSE, rel_heights = c(chrom_ht = 4, chrom_gap = 0.8, spark_gap = 0.2 * !is.null(RR), spark_box = 2.6 * !is.null(RR), unit_gap = 4), bottom_gap = 0.3, spark_thick = 0.2, spark_splat = 0.25 )
plot_simulated_chromomsome_segments( Segs, RR = NULL, fill_by_group_origin = FALSE, rel_heights = c(chrom_ht = 4, chrom_gap = 0.8, spark_gap = 0.2 * !is.null(RR), spark_box = 2.6 * !is.null(RR), unit_gap = 4), bottom_gap = 0.3, spark_thick = 0.2, spark_splat = 0.25 )
Segs |
a tibble of segments |
RR |
a tibble of recombination rates in bins in the format of RecRates. If this is included, the recombination rates in cM/Mb are plotted atop the chromosomes as a little sparkline. If it is not included, then the there are no little sparklines above the chromosomes. |
fill_by_group_origin |
If FALSE (the default) the fill color of segments
is mapped to the pop_origin, which is where the founder haplotypes came from according
to the |
rel_heights |
a vector the the relative heights of the different elements of each chromosomal unit of the plot. This is a named vector with the following elements, listed in order of the bottom of each unit to the top:
|
bottom_gap |
the y value of the bottom chromosome unit. Basically the absolute distance between the y=0 line and the start of the plotted material. Should typically be between 0 and 1. |
spark_thick |
thickness of the line that draws the recombination rate sparkline. |
spark_splat |
fraction by which the unit gap should be reduced when there are sparklines being drawn. |
This function returns a ggplot object. Each facet of the plot shows
the chromosomes of a different sampled individual from a particular replicate
simulation from a particular genome simulation pedigree. The facets are titled
like: GSP 1, Idx 2, ID 8[3]
, which means that the chromosomes shown in the panel
are from the third sampled set of chromosomes from the individual with ID 8 from the
simulation from genome simulation pedigree 1 with index 2.
s <- example_segments rr <- RecRates g <- plot_simulated_chromomsome_segments(s) g_with_sparklines <- plot_simulated_chromomsome_segments(s, rr)
s <- example_segments rr <- RecRates g <- plot_simulated_chromomsome_segments(s) g_with_sparklines <- plot_simulated_chromomsome_segments(s, rr)
Just a simple function that makes a list-based data structure that makes it easy to gene-drop chromosome segments down the gsp. The basic idea is to get everyone into a data structure in which they are ordered such that by the time you get to segregating segments from anyone, you already have the segments segregated into them. This works by each individual having a list of gametes (post-recombination) coming out of them, and a list of "uniters" which are the gametes coming into them. We just point the uniters to gametes in the previous generation and then make sure that we shuffle the order of the gametes when they come out of someone.
prep_gsp_for_hap_dropping(gsp)
prep_gsp_for_hap_dropping(gsp)
gsp |
A tibble that holds the genome simulation pedigree (GSP). This is a tibble in which each row specifies an individual in the GSP. The columns of the tibble are:
|
This function returns a named list, which is a linked-list type of
structure that contains the same information that is in gsp
, but
it makes it easier to access when traversing the pedigree.
The length of the list is
nrow(gsp)
. The names are the character versions of the ind
column.
Each component of the list refers to an individual row from gsp
. All
of these list elements are themselves lists. (i.e., the information
about a single individual is stored in a list.) Every such individual
list has at least the two elements:
isSample
: TRUE if samples are taken from the individual. FALSE otherwise.
isFounder
: TRUE if the individual is a founder. FALSE otherwise.
nGamete
: The total number of gametes that will be segregated out
of this individual along edges to the offspring of the individual in the pedigree.
This is the sum of all the red numbers alongside edges below the individual in the
GSP.
If an individual's isSample
is TRUE, then its list also contains the following elements:
nSamples
: the number of diploid genomes sampled out of this individual.
This is the purple number along the edge to the sample node below the individual
in the GSP "picture".
If an individual's, isFounder
is TRUE then its list also contains
the following elements:
hpop1
: the population from which haplotype 1 in this (founder) individual originated.
hpop2
: the population from which haplotype 2 in this (founder) individual originated.
fh_idx1
: this stands for "founding haplotype index 1. It is a unique integer
that identifies haplotype one in this founder individual. This integer is unique over
all haplotypes in all the founder individuals.
fh_idx2
' the unique integer identifier for haplotype two in this founder
individual.
If an individual's isFounder
is FALSE, then its list also contains
the following elements:
par1
and par2
. Each of these is a list with the elements:
par
: the character identifier of the first (if in par1
) of the
second (if in par2
) parent of the individual.
gam_idx
: This tells us which of the gametes in the parent (1 or 2)
depending on if this is in par1
or par2
, gets segregated to the
individual. NEED TO EXPLAIN MORE. SINCE THINGS GET PERMUTED, ETC.
# get the 13 member complex pedigree in tibble form as the # package data object GSP and prep it: GSP_list <- prep_gsp_for_hap_dropping(GSP)
# get the 13 member complex pedigree in tibble form as the # package data object GSP and prep it: GSP_list <- prep_gsp_for_hap_dropping(GSP)
This function first reorders individuals in the columns of the matrix so that every population is together. Then it rearranges genotypes into separate columns for each haplotype (or "halflotype" if they are unphased.) This prepares the matrix for different kinds of a permutation (within or between populations, for example).
rearrange_genos(G, Im, Mm)
rearrange_genos(G, Im, Mm)
G |
the genotype matrix (N rows and 2L columns) |
Im |
the meta data for the N samples. |
Mm |
the meta data for the L markers. |
It returns a list. One component is the matrix, another is the updated individual meta data, and the third is the marker meta data.
Returns a list. Each component of the return list is itself
an unnamed list with one component (makes it easier to use bind_rows
to
create a tibble of list columns from these). The components, once unlisted are:
G
: a matrix—the rearranged genotype data matrix
I
: the I_meta tibble
M
: the M_meta tibble
RG <- rearrange_genos(Geno, I_meta, M_meta)
RG <- rearrange_genos(Geno, I_meta, M_meta)
This function uses the observed recombination fractions such as those
in the data object RecRates. These are observed recombination fractions
in a series of adjacent small bins that are defined by a start position
start_pos
and ending position end_pos
. This function operates on the
recombination rates for only a single chromosome at a time, so it will
typically be wrapped up inside a purrr::map()
function to operate over
multiple chromosomes.
recomb_point(M, at_least_one = TRUE)
recomb_point(M, at_least_one = TRUE)
M |
a tibble that has the columns start_pos, end_pos, and rec_prob (where rec_prob is the probability of a recombination occurring during meiosis within the interval defined by start_pos and end_pos. |
at_least_one |
if this is TRUE then at least one recombination occurs on every chromosome (see Details). If FALSE then the total number of recombinations is simulated as a Poisson r.v. with mean equal to the sum of the recombination fractions. |
There are two main modes by which this function operates. If
at_least_one == TRUE
, then the chromosome will always have at least
one recombination. In this case, the position of that first recombination
is chosen according to the recombination rates. Subsequently, the remaining
number of recombinations is chosen by the random variable Y, which is
the greater of zero and X - 1, where X is a
Poisson r.v. with mean given by the sum of the
recombination fractions. These additional recombinations are placed
randomly according to the rec_probs
but without interference.
If at_least_one == FALSE
then the total number of recombinations is
simulated as a Poisson r.v. with mean equal to the sum of the
recombination fractions. Again, their placement assumes no interference.
Locations within each bin are chosen uniformly. These locations are represented as real numbers (rather than as integers) and those are used for describing segments, as well. This simplifies matters such as condensing information about multiple recombinations that occurred at the same base pair. In practice, this will have negligible effects, since it is so unlikely that a recombination will ever occur in the same place.
If no recombination occurs, this just returns a zero-length numeric vector.
Returns a numeric vector of recombination breakpoints along the chromome. The values are sorted in ascending order.
# for an example, create a tibble of bins, roughly 1 Mb each, # on a chromosome of length roughly 150 Mb, and we assign each # a rec_prob around 0.01 ends <- seq(1e6, 150e6, by = 1e6) ends <- ends + floor(runif(length(ends), min = -1e4, max = 1e4)) set.seed(5) M <- tibble::tibble( start_pos = c(0, ends[-length(ends)] + 1), end_pos = ends, rec_prob = abs(rnorm(length(ends), 0.01, 0.004)) ) recomb_point(M)
# for an example, create a tibble of bins, roughly 1 Mb each, # on a chromosome of length roughly 150 Mb, and we assign each # a rec_prob around 0.01 ends <- seq(1e6, 150e6, by = 1e6) ends <- ends + floor(runif(length(ends), min = -1e4, max = 1e4)) set.seed(5) M <- tibble::tibble( start_pos = c(0, ends[-length(ends)] + 1), end_pos = ends, rec_prob = abs(rnorm(length(ends), 0.01, 0.004)) ) recomb_point(M)
Chromosome, start position and end position and probability of recombination within the bin for chromosomes in pigs.
A tibble with four columns: chrom, chrom_len, start_pos, end_pos, and rec_prob.
These rates were estimated in: Tortereau, Flavie, et al. "A high density recombination map of the pig reveals a correlation between sex-specific recombination and GC content." BMC genomics 13.1 (2012): 1-12. It is available for download from https://static-content.springer.com/esm/art%3A10.1186%2F1471-2164-13-586/MediaObjects/12864_2012_4363_MOESM1_ESM.txt. After downloading the data were processed to remove inconsistencies with the marker data set used for M_meta and Geno.
This function assumes that all individuals are named as numerics and that their haplotypes in hap1 and hap2 are named Xa and Xb,respectively, and their samples are named sX, where X is an integer,
renumber_GSP(G, add)
renumber_GSP(G, add)
G |
a GSP tibble |
add |
amount to add to each label |
Returns a GSP just like the input, but with the identity numbers
of the individuals in it incremented by add
.
# get an example GSP G <- create_GSP(pop1 = "p1", pop2 = "p2", F1B2 = TRUE)
# get an example GSP G <- create_GSP(pop1 = "p1", pop2 = "p2", F1B2 = TRUE)
A reppop
table in 'gscramble' is used to define how the founder populations
in a GSP (typically named something like "A", "B", etc.) are mapped to the
groups/populations of individuals, as specified in the individual meta data
(an example of which is found in I_meta
).
A tibble with three columns: index
, which must be of type integer,
pop
, and group
of type character.
This particular version shows a situation where individuals from group Pop1 will be
sampled as founders from A and from group Pop2 will be sampled as founders from
B into the GSP. Since this RepPop1
example has two values of index: 1 and 2,
it specifies that individuals from the populations will be sampled without replacment,
two times into the founders on the pedigree.
The developers created this.
A reppop
table in 'gscramble' is used to define how the founder populations
in a GSP (typically named something like "A", "B", etc.) are mapped to the
groups/populations of individuals, as specified in the individual meta data
(an example of which is found in I_meta
).
A tibble with three columns: index
, which must be of type integer,
pop
, and group
of type character.
This particular version shows a situation where individuals from four different
groups (Pop1, Pop2, Pop3, and Pop4) get mapped to four different founder groups
(A, B, C, D) in the GSP.
Since this RepPop4
example has three values of index: 1, 2, and 3,
it specifies that there will be three rounds of sampling of individuals
from the populations to be the founders on the pedigree. That will be done
entirely without replacement (individuals are not replaced after each round!)
The developers created this.
A small helper function.
seg2tib(s)
seg2tib(s)
s |
a gamete in segment format |
Returns a tibble with columns tmp_seg_names
, start
, and end
,
that show the origin (in tmp_seg_names
) of segments that start at start
and end at end
.
# first make a segment that has pieces from a few different founders V <- c(Amy = 0, Bob = 10000, Joe = 30000, Frieda = 40000) seg2tib(V)
# first make a segment that has pieces from a few different founders V <- c(Amy = 0, Bob = 10000, Joe = 30000, Frieda = 40000) seg2tib(V)
Map alleles from scrambled founders to the sampled segments from a GSP.
segments2markers( Segs, Im, Mm, G, preserve_haplotypes = FALSE, preserve_individuals = FALSE )
segments2markers( Segs, Im, Mm, G, preserve_haplotypes = FALSE, preserve_individuals = FALSE )
Segs |
the simulated segments. A tibble like that returned from
|
Im |
the individual meta data, like that in |
Mm |
the marker meta data formatted like that in |
G |
the marker genotype data as a matrix like |
preserve_haplotypes |
If TRUE then the Geno data is assumed phased (first allele at an individual on one haplotype and second allele on the other) and those haplotypes are preserved in this permutation of genomic material amongst the founders. |
preserve_individuals |
If TRUE then whole individuals are permuted
around the data set and the two gene copies at each locus are randomly
permuted within each individual. If |
A list with three components:
ret_geno
: A character matrix where each row is an individual and each pair of
columns are the alleles at a locus, thus it is N x 2L where N is the number of
individuals and L is the number of markers.
ret_ids
: A tibble providing the individual meta data with columns groups
and indiv
.
hyb_Qs
: A tibble of the admixture Q values.
#### First, get input segments for the function #### # We construct an example here where we will request segregation # down a GSP with two F1s and F1B backcrosses between two hypothetical # populations, A and B. set.seed(5) gsp_f1f1b <- create_GSP("A", "B", F1 = TRUE, F1B = TRUE) # We will imagine that in our marker data there are three groups # labelled "Pop1", "Pop2", and "Pop3", and we want to create the F1Bs with backcrossing # only to Pop3. reppop <- tibble::tibble( index = as.integer(c(1, 1, 2, 2)), pop = c("A", "B", "A", "B"), group = c("Pop3", "Pop1", "Pop3", "Pop2") ) # combine those into a request request <- tibble::tibble( gpp = list(gsp_f1f1b), reppop = list(reppop) ) # now segegate segments. Explicitly pass the markers # in M_meta so that the order of the markers is set efficiently. segs <- segregate(request, RecRates, M_meta) #### Now, use segs in an example with segments2markers() #### # this uses several package data objects that are there for examples # and illustration. s2m_result <- segments2markers(segs, I_meta, M_meta, Geno)
#### First, get input segments for the function #### # We construct an example here where we will request segregation # down a GSP with two F1s and F1B backcrosses between two hypothetical # populations, A and B. set.seed(5) gsp_f1f1b <- create_GSP("A", "B", F1 = TRUE, F1B = TRUE) # We will imagine that in our marker data there are three groups # labelled "Pop1", "Pop2", and "Pop3", and we want to create the F1Bs with backcrossing # only to Pop3. reppop <- tibble::tibble( index = as.integer(c(1, 1, 2, 2)), pop = c("A", "B", "A", "B"), group = c("Pop3", "Pop1", "Pop3", "Pop2") ) # combine those into a request request <- tibble::tibble( gpp = list(gsp_f1f1b), reppop = list(reppop) ) # now segegate segments. Explicitly pass the markers # in M_meta so that the order of the markers is set efficiently. segs <- segregate(request, RecRates, M_meta) #### Now, use segs in an example with segments2markers() #### # this uses several package data objects that are there for examples # and illustration. s2m_result <- segments2markers(segs, I_meta, M_meta, Geno)
Given a collection of genomic simulation pedigrees and requests
for how many simulations should be done (in the request
input),
as well as recombination rates, this simulates the segregation
of segments down through the pedigrees
segregate(request, RR, MM = NULL)
segregate(request, RR, MM = NULL)
request |
a tibble with list columns "gpp" and "reppop".
Each element of the gpp column is a tibble giving a genomic simulation
pedigree as documented as the input for |
RR |
the recombination rates in the format of the package data |
MM |
the marker meta data tibble (like M_meta). If this is NULL (the default) that
is fine. If not, then it uses the order of the markers in MM to define
the levels of a chrom_f column so that we can sort the rows of the output
correctly, with respect to markers in the Genotype data frame. This will
let us more efficiently subscript the markers out of the matrix. If MM is
not present, then the function will create |
The output from this function is a tibble. Each row represents one segment of genetic
material amongst the sampled individuals from the genomic permutation pedigrees. The columns give
information about the provenance and destination of that segment as follows.
Each segment exists in one of the samples (samp_index
) from a sampled individual
with a ped_sample_id
on a given gpp
(the index giving the row of the request input tibble)
in a given index
within the individual. Further, it is on one of two gametes
(gamete_index
) that segregated into the individual, and it came from a certain founding
population (pop_origin
) that corresponds to the named groups in the genotype file (group_origin
).
And, of course, the segment occupies the space from start
to end
on a chromosome chrom
.
Finally, the index of the founder haplotype on the given gpp that this segement descended from is
given in rs_founder_haplotype
which is short for "rep-specific founder haplotype". This final
piece of information is crucial for segregating variation from the individuals in the Geno
file
onto these segments. Finally, the column sim_level_founder_haplo
assigns a unique index for each founder
haplotype. This is necessary because any simulation can involve multiple gpps and/or indexes of gpps,
and the founders in each of those must all be unique within a simulation. so that those haplotypes
can all, eventually, be accessed easily out of the genotype matrix.
# We construct an example here where we will request segregation # down a GSP with two F1s and F1B backcrosses between two hypothetical # populations, A and B. gsp_f1f1b <- create_GSP("A", "B", F1 = TRUE, F1B = TRUE) # We will imagine that in our marker data there are three groups # labelled "grp1", "grp2", and "grp3", and we want to create the F1Bs with backcrossing # only to grp3. reppop <- tibble::tibble( index = as.integer(c(1, 1, 2, 2)), pop = c("A", "B", "A", "B"), group = c("grp3", "grp1", "grp3", "grp2") ) # combine those into a request request <- tibble::tibble( gpp = list(gsp_f1f1b), reppop = list(reppop) ) result1 <- segregate(request, RecRates) # here we pass it some markers, too result2 <- segregate(request, RecRates, M_meta) result1 result2
# We construct an example here where we will request segregation # down a GSP with two F1s and F1B backcrosses between two hypothetical # populations, A and B. gsp_f1f1b <- create_GSP("A", "B", F1 = TRUE, F1B = TRUE) # We will imagine that in our marker data there are three groups # labelled "grp1", "grp2", and "grp3", and we want to create the F1Bs with backcrossing # only to grp3. reppop <- tibble::tibble( index = as.integer(c(1, 1, 2, 2)), pop = c("A", "B", "A", "B"), group = c("grp3", "grp1", "grp3", "grp2") ) # combine those into a request request <- tibble::tibble( gpp = list(gsp_f1f1b), reppop = list(reppop) ) result1 <- segregate(request, RecRates) # here we pass it some markers, too result2 <- segregate(request, RecRates, M_meta) result1 result2
This takes the output of segregate()
and deals with the multiple gpp's and reps
to come up with a unique index for each found haplotype, so that those haplotypes
can all, eventually, be accessed easily out of the genotype matrix.
Along the way, this function does some light checking to make sure that the
rs_founder_haplo
values are dense within gpp
and index
as they should be.
sim_level_founder_haplos(S)
sim_level_founder_haplos(S)
S |
tibble of segments like that produced by |
This function returns a result that is basically the output of segregate()
with
an additional column added to it: sim_level_founder_haplo
. This is the index
of the haplotype within each group_origin
that should be used. For details of the
other columns in the output tibble, see the documentation for segregate
.
#### Get output from segregate to use as input #### # We construct an example here where we will request segregation # down a GSP with two F1s and F1B backcrosses between two hypothetical # populations, A and B. gsp_f1f1b <- create_GSP("A", "B", F1 = TRUE, F1B = TRUE) # We will imagine that in our marker data there are three groups # labelled "grp1", "grp2", and "grp3", and we want to create the F1Bs with backcrossing # only to grp3. reppop <- tibble::tibble( index = as.integer(c(1, 1, 2, 2)), pop = c("A", "B", "A", "B"), group = c("grp3", "grp1", "grp3", "grp2") ) # combine those into a request request <- tibble::tibble( gpp = list(gsp_f1f1b), reppop = list(reppop) ) # now run it through segregate() set.seed(5) # just for reproducibility in example... simSegs <- segregate(request, RecRates) #### Now we can run those through sim_level_founder_haplos() #### fh <- sim_level_founder_haplos(simSegs) fh
#### Get output from segregate to use as input #### # We construct an example here where we will request segregation # down a GSP with two F1s and F1B backcrosses between two hypothetical # populations, A and B. gsp_f1f1b <- create_GSP("A", "B", F1 = TRUE, F1B = TRUE) # We will imagine that in our marker data there are three groups # labelled "grp1", "grp2", and "grp3", and we want to create the F1Bs with backcrossing # only to grp3. reppop <- tibble::tibble( index = as.integer(c(1, 1, 2, 2)), pop = c("A", "B", "A", "B"), group = c("grp3", "grp1", "grp3", "grp2") ) # combine those into a request request <- tibble::tibble( gpp = list(gsp_f1f1b), reppop = list(reppop) ) # now run it through segregate() set.seed(5) # just for reproducibility in example... simSegs <- segregate(request, RecRates) #### Now we can run those through sim_level_founder_haplos() #### fh <- sim_level_founder_haplos(simSegs) fh
This function doesn't choose the recombination points. That is done with
the function recomb_point()
,
and the results are passed into this function. The two inputs V1
and
V2
represent the two gametes coming into an individual on the pedigree.
Recombination occurs within that individual, and the two resulting gametes
from that recombination are the output. Typically this is not the way
things happen, of course. Generally, only one of the two resulting gametes
from the recombination will be segregated to a surviving offspring. But,
since we are interested in segregating genetic material without duplicating
or destroying any of it, we keep track of both gametes that result
from the meiosis/recombination.
xover(V1, V2, R)
xover(V1, V2, R)
V1 |
integer vector of recombination points already existing on the first incoming gamete. Its names are the names of the founder haplotype that the left end originates from (i.e. it is from the named haplotype up until it changes at each point). For example c(A = 0, B = 12890, B = 30000) would work for a 30 Kb chromosome in which there is a single recombination just to the right of the point 12890. (In that example, positions 1 through 12890 are from founder haplotype A, while positions 12891 to 30000 are from founder haplotype B.) Note that these vectors have to have a first value of 0 and a final value of the chromosome length. |
V2 |
integer vector of breakpoints of the second incoming gamete. Format is just like it is for V1. |
R |
a vector of new breakpoints to insert into the existing ones on each gamete.
This is usually returned from the function |
This sends back two updated gametes, V1 and V2, but with the new points of recombination stuck in there. Note, for two incoming gametes there are two outgoing gametes, but we aren't "re-using" any genomic sequence.
#' # make the two gametes/chromosomes coming into the function. #' # Each one has length 30000 and a single existing recombination V1 <- c(A = 0, B = 10000, B = 30000) V2 <- c(C = 0, D = 20000, D = 30000) # now, set a new recombination point at position 15000 xover(V1, V2, R = 15000) # set three recombination points at 5,000, 15,000, and 25,000: xover(V1, V2, R = c(5000, 15000, 25000)) # no recombinations (R is a zero length numeric vector) xover(V1, V2, R = numeric(0))
#' # make the two gametes/chromosomes coming into the function. #' # Each one has length 30000 and a single existing recombination V1 <- c(A = 0, B = 10000, B = 30000) V2 <- c(C = 0, D = 20000, D = 30000) # now, set a new recombination point at position 15000 xover(V1, V2, R = 15000) # set three recombination points at 5,000, 15,000, and 25,000: xover(V1, V2, R = c(5000, 15000, 25000)) # no recombinations (R is a zero length numeric vector) xover(V1, V2, R = numeric(0))