FAQ

Frequently Asked Questions

What is this site and how does the process work?


EthnoGene.com helps customers gain more insight into their genetic heritage. The website is run by the EthnoGene.com team and is based in the United States. EthnoGene is an online startup company which operates a decentralized network of servers to process DNA data files.

Customers upload their raw DNA data to our site (after taking a DNA test with AncestryDNA, 23andMe, or another major DNA provider), we analyze the data, and we send a report to the email address provided within 1 to 15 business days. The report contains highly detailed ancestry data that is sometimes more specific than that of the major DNA testing providers, since our reference populations form an extremely diverse data set. Please read our disclaimer, terms and conditions, and privacy statement.

The data is parsed and compared to reference data sets, and a set of algorithms then assigns percent matches to reference populations. To form the data set for our reference panel, we have used data from several sources which include the Human Genome Diversity Project, the Pan Asian SNP Consortium, the Population Reference Sample data set, the 1000 Genomes Project, the International HapMap project, and our own customer outreach program, which adds consenting individuals with a deep family history in a particular region to the reference data set. This customer outreach program has given us access to data which is representative of numerous underserved ethnic groups and regions.

Our algorithms speculatively match customers to approximate regions and ethnic groups using information inferred from the reference data set. Please note that our test does not conclusively prove that you have ancestry within the listed population groups. These are speculative algorithmic matches and as such entail a moderate degree of inaccuracy.
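
As a rough illustration of the parsing step, the sketch below reads a tab-separated raw data export into a dictionary. It is not our production parser. It assumes the layout commonly used by consumer raw data exports (comment lines starting with "#", then rsid, chromosome, position, and either one genotype column or two allele columns). The function name parse_raw_dna and the sample rows are hypothetical.

```python
import io
from typing import Dict, Iterable, Tuple


def parse_raw_dna(lines: Iterable[str]) -> Dict[str, Tuple[str, int, str]]:
    """Parse a raw data export into {rsid: (chromosome, position, genotype)}."""
    genotypes: Dict[str, Tuple[str, int, str]] = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines, '#' comment lines, and a possible column-header row.
        if not line or line.startswith("#") or line.lower().startswith("rsid"):
            continue
        fields = line.split("\t")
        if len(fields) == 4:      # rsid, chromosome, position, genotype
            rsid, chrom, pos, geno = fields
        elif len(fields) == 5:    # rsid, chromosome, position, allele1, allele2
            rsid, chrom, pos, allele1, allele2 = fields
            geno = allele1 + allele2
        else:
            continue              # unrecognized row layout; ignore it
        genotypes[rsid] = (chrom, int(pos), geno)
    return genotypes


# Hypothetical sample data, made up for this example.
sample = io.StringIO(
    "# hypothetical example export\n"
    "rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAG\n"
    "rs0000002\t1\t2000\tCC\n"
    "rs0000003\t2\t3000\tA\tG\n"
)
parsed = parse_raw_dna(sample)
print(parsed)
```

Keying the result by rsid makes it straightforward to look up the same markers in a reference data set.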




Why should I use your service?


At times, taking a DNA heritage test leaves more questions than answers. The ethnic percentages reported by the major DNA testing services are not always specific or accurate. Our service seeks to provide you with more answers through an extensive analysis and comparison against numerous reference groups. We reference a large number of distinct populations, as can be seen here. This may provide a more specific analysis than other services.




Why is it so hard to find an accurate heritage analysis?


Analyzing DNA for accurate heritage measurements has always been an extremely tricky process, because humans have migrated throughout history and ethnic groups have not always corresponded to national borders. Most countries which are thought to have been completely homogeneous for centuries tend to have varying ethnic compositions which don't always correspond to the ethnic group thought to be predominant in that area. While we will try to the best of our ability, accurate DNA assessments are not guaranteed due to the above factors.




How do you protect user privacy?


We take our users' privacy very seriously. We will NEVER sell your information to any third parties. We only ask you for the minimum amount of information (your first name, raw DNA data, email, and any information you would like to share regarding your ethnic composition). Please see our privacy policy here.




How long does the process take?


From the date of purchase, the process takes anywhere from 1 to 15 business days. This is variable and dependent on customer volume.

When DNA data files are received, we place them in a queue to be algorithmically parsed on a series of dedicated servers. At times, the queue on one server may be longer or shorter than the queue on another server, resulting in varying file processing times.
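
To see why queue length alone changes the turnaround, here is a minimal sketch. The server names, queue lengths, and the throughput of 10 files per day are made-up numbers for illustration, not figures from our system.

```python
import math


def days_until_processed(files_ahead: int, files_per_day: int) -> int:
    """Whole days until a file is processed, given the files queued ahead of it."""
    if files_per_day <= 0:
        raise ValueError("files_per_day must be positive")
    # The file itself also has to be processed, hence the + 1.
    return math.ceil((files_ahead + 1) / files_per_day)


# Hypothetical queue lengths on two servers with the same throughput.
queues = {"server_a": 12, "server_b": 45}
waits = {
    name: days_until_processed(ahead, files_per_day=10)
    for name, ahead in queues.items()
}
print(waits)
```

Two files submitted at the same moment can therefore finish days apart, depending only on which queue they land in.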




How do I receive my results?


Your results are delivered in report form to the email address you provided when filling out your DNA data file submission form.





How do I interpret my results?


The report contains matches to overall regions, an ethnic percentage breakdown, and specific regional matches. The results are most accurate for overall regional matches and less accurate for the specific ethnic matches, because populations have migrated across regional borders for centuries. It is common for users to receive ethnic matches to neighboring populations, since these populations all carry similar genetic markers.

Specific regional estimates are also given, but please note that these are highly speculative. Such a match indicates that you matched data samples which are correlated with these approximate locations. It does not necessarily indicate that you have direct ancestry from that region. In addition, neighboring areas always contain similar genetic markers, so matching to specific locations will always entail a high degree of inaccuracy.

Trace ethnic matches are also given in the report. Trace matches are low-confidence ethnic estimates which are generated during our ethnic determination process. Because these values are very speculative, they are not factored into the main ethnic estimation.




Why does it state "Daily File Upload Limit Reached" when I click 'Order Now'?


The upper processing limit our system can reliably handle per day is approximately 100 DNA data files. Recent surges in customer volume have substantially exceeded this amount, thereby increasing the processing time.

To expedite the processing time, file uploads have been restricted to 70 files per day. This means that purchasing the report and uploading files are disabled each day once this limit is reached.

Unfortunately, an upload limit is necessary at the present time to keep the processing time within the 15 business day range. If you are unable to submit your DNA data file because the upload limit has been reached, please submit the file again the next day.

Our process is algorithmically based. Files are put in a queue and parsed using multiple algorithms, results are generated, and the resulting reports are emailed to clients. In terms of computing power, our system is limited to a certain number of files per day and thus cannot handle the recent threefold to fourfold increases in customer volume. Hence a file upload limit is necessary at the present time.
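
The daily cap can be pictured with a small counter that resets each day. This is only a sketch of the idea, not our actual order system. The class name DailyUploadLimiter and the dates are hypothetical, while the limit of 70 comes from the answer above. With a capacity of roughly 100 files per day and intake capped at 70, about 30 files of existing backlog can be cleared per day.

```python
import datetime
from typing import Optional


class DailyUploadLimiter:
    """Accept at most `daily_limit` uploads per calendar day."""

    def __init__(self, daily_limit: int = 70) -> None:
        self.daily_limit = daily_limit
        self._day: Optional[datetime.date] = None
        self._count = 0

    def try_accept(self, today: datetime.date) -> bool:
        # Reset the counter when a new day starts.
        if today != self._day:
            self._day = today
            self._count = 0
        if self._count >= self.daily_limit:
            return False  # "Daily File Upload Limit Reached"
        self._count += 1
        return True


limiter = DailyUploadLimiter(daily_limit=70)
day1 = datetime.date(2020, 1, 6)
day2 = datetime.date(2020, 1, 7)

# 100 customers try to upload on day 1; only the first 70 are accepted.
accepted_day1 = sum(limiter.try_accept(day1) for _ in range(100))
# The next day the counter has reset, so uploads are accepted again.
accepted_day2 = limiter.try_accept(day2)
print(accepted_day1, accepted_day2)
```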

For any questions, please contact us at Admin@EthnoGene.com or see the remaining questions within the FAQ section.




What constitutes your reference panel?


We use a series of in-house algorithms based on the STRUCTURE and ADMIXTURE programs in conjunction with our reference population data sets to complete our analyses. To form the data set for our reference panel, we have used data from several sources which include the Human Genome Diversity Project, the Pan Asian SNP Consortium, the Population Reference Sample data set, the 1000 Genomes Project, the International HapMap project, and our own customer outreach program which adds consenting individuals with a deep regional history in a particular region to the reference data set. This customer outreach program has given us access to data which is representative of numerous underserved ethnic groups and regions.

The individuals within the reference panel have, at minimum, all four grandparents as members of the same ethnic group. Principal component analysis (PCA) was used to separate any outlying samples that did not conform to the delineated parameters of each individual group cluster. In the most precise cases, our algorithms can detect haplotypes specific to certain groups down to the 1% level (~6 generations), but it is important to note that this varies according to the algorithmic interpretation of the sequences as belonging to particular groups, along with the variations within the samples constituting the reference panel data points.

Our algorithms speculatively match customers to approximate regions and ethnic groups using information inferred from the reference data sets. Please note that our test does not conclusively prove that you have ancestry within the listed population groups, since these are speculative algorithmic matches and as such entail a moderate degree of inaccuracy which varies according to the algorithmic interpretation of your DNA file.
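
The link between the 1% level and roughly 6 generations follows from simple halving: on average, a single ancestor g generations back contributes about (1/2)^g of your autosomal DNA. The sketch below works through that arithmetic. It describes the average only, since the amount actually inherited from any one distant ancestor varies because of recombination. The function names are our own for this example.

```python
import math


def expected_fraction(generations_back: int) -> float:
    """Average share of DNA inherited from one ancestor that many generations back."""
    return 0.5 ** generations_back


def generations_for_fraction(fraction: float) -> float:
    """Number of generations at which the average share drops to `fraction`."""
    return math.log2(1.0 / fraction)


for g in range(1, 8):
    print(g, f"{expected_fraction(g):.4%}")

# The 1% level falls between 6 generations (about 1.56%) and 7 (about 0.78%).
print(generations_for_fraction(0.01))
```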




Are you affiliated with any other site(s)?


We are not affiliated with any other site or sites. We solely run EthnoGene.com and do not have any intention of branching out into any other site or sites in the foreseeable future.




How are matches to regions within countries determined?


To form the data set for our reference panel, we have used data from several sources which include the Human Genome Diversity Project, the Population Reference Sample data set, the 1000 Genomes Project, the International HapMap project, and our own customer outreach program which adds consenting individuals with a deep regional history in a particular region to the reference data set. This customer outreach program has given us access to data which is representative of numerous underserved ethnic groups and regions. We have compiled data sets which represent regions and approximate regions within countries. Our algorithms reference these data sets when analyzing customer DNA files.

These algorithms approximate the matches to regions within countries based on the given data set. Where outright distinctions are not present, inferences are made from algorithmic extrapolations of the reference data, in cases where information inferred from reference points within the available data suggests a match to a particular region or set of provinces within a country.

The regional matches within our report cover areas within countries for which we have corresponding regional data. Where region-specific data does not differ significantly from the available data for the overall set of neighboring regions, algorithmic inferences based on reference points are used to infer the specific region.

Our algorithms speculatively match customers to approximate regions and ethnic groups using information inferred from the reference data set. Please note that our test does not conclusively prove you have definitive ancestry within the listed population groups since these are speculative algorithmic matches and as such entail a moderate degree of inaccuracy.




Do your calculations involve a Hidden Markov Model?


Since we are trying to predict the probability that each observation belongs to each possible class while using multivariate data as an input, conditional random fields (CRFs) are used instead of a hidden Markov model. CRFs are probabilistic frameworks that result in a probability distribution and are useful for labeling and segmenting structured data. We model the distribution of the unobserved ancestral states at each window (within the data), given the observed genetic information, as a CRF.

The model we use is a linear-chain CRF: we define emission functions that relate the observed data to the unobserved ancestral state at each window, along with transition functions that associate the unobserved ancestral states at adjoining SNPs. The functions and their parameters define the distribution of the unobserved ancestral states given the observed data.

The CRF parameters are determined by model assumptions as well as the training data. The CRF was trained with maximum likelihood techniques, which choose weights that maximize the conditional probability of the labels given each training data set. A normalization function was used on the data to permit interaction between weights at different locations in each data set.
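
To make the linear-chain idea concrete, the sketch below computes the posterior probability of each ancestral state at each window with the standard forward-backward recursion. It is an illustration only: the two states, the emission scores, and the transition scores are hand-picked hypothetical numbers, whereas a real model would learn its weights from training data.

```python
import math
from typing import List


def logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def crf_marginals(emission: List[List[float]],
                  transition: List[List[float]]) -> List[List[float]]:
    """Posterior P(state at window t | all windows) for a linear-chain CRF.

    emission[t][s]   : log-score of state s at window t
    transition[r][s] : log-score of moving from state r to state s
    """
    n, k = len(emission), len(emission[0])
    # Forward pass: total score of all paths ending in state s at window t.
    alpha = [emission[0][:]]
    for t in range(1, n):
        alpha.append([
            emission[t][s]
            + logsumexp([alpha[t - 1][r] + transition[r][s] for r in range(k)])
            for s in range(k)
        ])
    # Backward pass: total score of all paths leaving state s after window t.
    beta = [[0.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [
            logsumexp([transition[s][r] + emission[t + 1][r] + beta[t + 1][r]
                       for r in range(k)])
            for s in range(k)
        ]
    log_z = logsumexp(alpha[-1])  # log of the normalization constant
    return [[math.exp(alpha[t][s] + beta[t][s] - log_z) for s in range(k)]
            for t in range(n)]


# Hypothetical scores: window 2 weakly prefers state 1 when viewed in isolation.
emission = [[2.0, 0.0], [2.0, 0.0], [0.0, 0.5], [2.0, 0.0]]
# Switching ancestry between adjacent windows is penalized.
transition = [[0.0, -2.0], [-2.0, 0.0]]

marginals = crf_marginals(emission, transition)
for t, row in enumerate(marginals):
    print(t, [round(p, 3) for p in row])
```

In this toy input the third window is assigned mostly to state 0, because its neighbors strongly favor that state and switching is penalized. This is the kind of context sharing the transition functions provide.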




How is your process output smoothed?


To smooth the output, an expectation-maximization (EM) algorithm is used to calculate maximum a posteriori (MAP) estimates. A MAP estimate approximates an unknown quantity from observed data by taking the mode (the most probable value) of the posterior distribution. MAP estimates can be thought of as regularizations of maximum likelihood estimates, as prior information is added to prevent overfitting.

Each ancestral population on the reference panel has a specific haplotype pattern that is indicative of samples belonging to that group. After these patterns are determined in the samples used during the training procedure, they are used to improve the accuracy of the mathematical models.

The EM algorithm is used to find MAP estimates of parameters in our model, where the model is influenced by the presence of unobserved variables. A linear quadratic estimation algorithm is used to assume a joint probability distribution over the input variables, with a weighted average being used to give greater weight to estimates that have more certainty. The resulting estimates are smoothed and used to get updated parameters that model the data most accurately.
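
A small, self-contained example of EM producing a MAP estimate is sketched below. It is a textbook-style mixture model chosen for illustration, not our actual smoothing procedure. It assumes each segment has a known likelihood under each population, treats the segment's true population as the unobserved variable, and places a Dirichlet prior (the alpha values) on the ancestry proportions. All numbers are made up.

```python
from typing import List


def map_em_proportions(likelihoods: List[List[float]],
                       alpha: List[float],
                       iterations: int = 200) -> List[float]:
    """MAP estimate of mixture proportions under a Dirichlet(alpha) prior.

    likelihoods[i][j] is P(segment i | population j). Requires alpha[j] >= 1.
    """
    n, k = len(likelihoods), len(likelihoods[0])
    pi = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: expected number of segments attributed to each population.
        counts = [0.0] * k
        for row in likelihoods:
            weighted = [pi[j] * row[j] for j in range(k)]
            total = sum(weighted)
            for j in range(k):
                counts[j] += weighted[j] / total
        # M-step: mode of the Dirichlet posterior (the MAP update).
        denom = n + sum(alpha) - k
        pi = [(counts[j] + alpha[j] - 1.0) / denom for j in range(k)]
    return pi


# Hypothetical data: four segments look like population 0, two like population 1.
likelihoods = [[0.9, 0.1]] * 4 + [[0.2, 0.8]] * 2

ml_pi = map_em_proportions(likelihoods, alpha=[1.0, 1.0])   # flat prior: plain ML
map_pi = map_em_proportions(likelihoods, alpha=[3.0, 3.0])  # prior favors balance
print(ml_pi, map_pi)
```

With a flat prior the update reduces to ordinary maximum likelihood. With the stronger prior the estimate is pulled toward equal proportions, which is the regularizing effect described above.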




Do your algorithms use a local classifier?


The task of the algorithm acting as a local classifier is to designate segments of DNA as matching with populations on the reference panel. The haplotype is split into segment windows which are treated independently of one another on a classification basis. A statistical inference strategy is used to designate the most appropriate class for a selected instance (segment window) with a probability being given for the segment being a member of each of the possible classes.
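
The windowing step can be sketched in a few lines. The example assumes a haplotype coded as a string of 0s and 1s (one character per biallelic SNP) and a fixed window size. Both are simplifications for illustration, and the per-window probabilities shown are invented rather than produced by a classifier.

```python
from typing import Dict, List


def split_into_windows(haplotype: str, window_size: int) -> List[str]:
    """Cut a haplotype into consecutive, non-overlapping segment windows."""
    return [haplotype[i:i + window_size]
            for i in range(0, len(haplotype), window_size)]


def most_likely_class(probabilities: Dict[str, float]) -> str:
    """Pick the class with the highest probability for one window."""
    return max(probabilities, key=probabilities.get)


windows = split_into_windows("0110100110", window_size=4)
print(windows)

# Hypothetical per-window class probabilities from some local classifier.
window_probs = [
    {"pop_a": 0.7, "pop_b": 0.3},
    {"pop_a": 0.4, "pop_b": 0.6},
    {"pop_a": 0.55, "pop_b": 0.45},
]
labels = [most_likely_class(p) for p in window_probs]
print(labels)
```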

A random decision forest is used as the local classifier. It contains a bagging (bootstrap aggregating) algorithm that samples ancestry types from the reference panel with uniform probability, then selects a haplotype at random from that ancestral group with uniform probability. This process ensures a greater degree of balance when one ancestral group has more samples present than another in the panel. Greater accuracy of posterior class probability estimates from the bootstrapped classifier is attained by substituting the per-tree majority unit vote with a fractional vote, with the fractional vote being dependent on the characteristics of the selected node that the target haplotype is mapping to.
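
Both ideas can be illustrated without a full random forest. In the sketch below, balanced_bootstrap draws an ancestry first and a haplotype second, so a group with many samples does not dominate. The two voting functions compare a whole vote per tree with a fractional vote. We assume here that the fractional vote is the class mix of the leaf node the target haplotype falls into, which is one plausible reading. The panel, the leaf counts, and all names are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def balanced_bootstrap(panel: Dict[str, List[str]], n_draws: int,
                       rng: random.Random) -> List[Tuple[str, str]]:
    """Sample an ancestry uniformly, then a haplotype uniformly within it."""
    ancestries = sorted(panel)
    sample = []
    for _ in range(n_draws):
        ancestry = rng.choice(ancestries)
        sample.append((ancestry, rng.choice(panel[ancestry])))
    return sample


def majority_vote(leaf_counts: List[Dict[str, int]]) -> Dict[str, float]:
    """Each tree casts one whole vote for the most common class in its leaf."""
    classes = sorted({cls for leaf in leaf_counts for cls in leaf})
    votes = Counter(max(leaf, key=leaf.get) for leaf in leaf_counts)
    return {cls: votes[cls] / len(leaf_counts) for cls in classes}


def fractional_vote(leaf_counts: List[Dict[str, int]]) -> Dict[str, float]:
    """Each tree contributes the class fractions found in its leaf."""
    classes = sorted({cls for leaf in leaf_counts for cls in leaf})
    totals = {cls: 0.0 for cls in classes}
    for leaf in leaf_counts:
        size = sum(leaf.values())
        for cls in classes:
            totals[cls] += leaf.get(cls, 0) / size
    return {cls: totals[cls] / len(leaf_counts) for cls in classes}


# Hypothetical unbalanced panel: 90 haplotypes for group A, 10 for group B.
panel = {"A": [f"a{i}" for i in range(90)], "B": [f"b{i}" for i in range(10)]}
draws = balanced_bootstrap(panel, 10000, random.Random(0))
share_b = sum(1 for ancestry, _ in draws if ancestry == "B") / len(draws)
print(share_b)  # close to 0.5 despite B being only 10% of the panel

# Hypothetical leaf contents reached by one target haplotype in three trees.
leaves = [{"A": 6, "B": 4}, {"A": 7, "B": 3}, {"A": 1, "B": 9}]
print(majority_vote(leaves))
print(fractional_vote(leaves))
```

In the toy leaves, two of three trees lean toward A, so the majority vote favors A. The fractional vote takes into account how strongly each leaf leans and ends up slightly favoring B.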

Groups on the reference panel are used as training data for the random forests, which gauge the probability of each ancestry on segment windows of reference haplotypes. On each segment, the alleles at select SNPs serve as biallelic predictor variables, with the response variable being the inferred ancestry in that segment window. Random forests are ideal for the characterization of haplotype structures since they can discern complex interactions in sets of variables even when uninformative variables are present. During the training period, the random forest generates one set of model parameters after being trained on segments from reference panel haplotypes. The second set of model parameters comes from assumptions based on our admixture model that allow for correlations between linked markers within each target haplotype. Our algorithm can analyze data sets that have markers with admixture linkage disequilibrium between them, since it utilizes a modified set of algorithms built on the STRUCTURE and ADMIXTURE programs.






© 2020  b-MOLA by NCCO International Ltd