Raw Data DNA Analysis
In March 2017, I compared my Family Tree DNA and MyHeritage DNA raw data. I had tested with FTDNA at home on Nov 25, 2016 and with MyHeritage DNA at RootsTech on Feb 10, 2017.

Since then, I ordered tests and tested at home with 23andMe on Dec 2, 2017, AncestryDNA on Dec 12, 2017, and Living DNA on June 23, 2018. So I now have five sets of my own raw data from different testing companies that I can compare.

You never know what the companies are doing, so just to make sure, I downloaded my Build 37 raw data from Family Tree DNA again and compared it with the download I did on Jan 12, 2017. Nothing had changed. The files were identical. That’s good.

Raw Data File Contents

All 5 companies list your data, one SNP per line. Some companies include some lines of text description at the top, followed by a title line naming the fields, followed by the SNP data.
Here for example is the beginning of my Ancestry DNA raw data file: [figure]

And this is the beginning of my Family Tree DNA raw data file: [figure]

Here’s a comparison of all five of my raw data files: [figure]

Family Tree DNA and MyHeritage DNA files are both set up similarly as .csv files (comma delimited) with each field in double quotes. The other 3 companies use plain text files separating fields with a space or tab. Both types of files can easily be loaded into Excel and the fields will be placed properly into columns for you.

The first field for each SNP in all the files is the RSID (Reference SNP cluster Identifier), which basically is a name for the SNP. I checked, and in each raw data file, no RSID was listed more than once.

The RSID is followed by the chromosome number and the position in base pairs, on the forward strand, where the SNP is located on the chromosome.
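For programmers, the two file layouts described above can be read with something like the following Python sketch. It is a rough illustration only, not how any company or tool actually does it. The function name `parse_raw_data` is made up, and it assumes that descriptive lines at the top start with `#` and that a title line can be recognized because its position field is not a number.

```python
import csv

def parse_raw_data(text):
    """Parse raw-data text into {rsid: (chromosome, position, genotype)}.

    Assumptions (not guaranteed by any company): description lines start
    with '#', and title lines have a non-numeric third field.
    """
    snps = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or descriptive text
        if '"' in line or "," in line:
            fields = next(csv.reader([line]))  # quoted, comma-delimited layout
        else:
            fields = line.split()              # tab- or space-delimited layout
        if len(fields) < 4 or not fields[2].isdigit():
            continue  # a title line, wherever in the file it appears
        rsid, chrom, pos = fields[0], fields[1], int(fields[2])
        # Joining the remaining fields merges separate allele1/allele2
        # columns into the same 2-character form the other files use.
        genotype = "".join(fields[3:])
        snps[rsid] = (chrom, pos, genotype)
    return snps
```

Because title lines are skipped by content rather than by line number, the same loop works whether a file has one title line, none, or a repeated one further down.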
The position of the SNP can change when the powers that be come out with a new “build” of the genome. Several years ago, Build 36 was the standard, but most companies now use Build 37. They have already come out with a Build 38, but so far all of the companies are sticking to Build 37 because it really is a lot of work to change for little gain with regards to matching people to each other. All 5 of my raw data files are from Build 37, so (theoretically at least) the chromosome and position of any SNP should match.
I’ll check that later in this article in the section: “RSIDs with more than one Position”.

The value of the SNP is called “result” by Family Tree DNA and MyHeritage DNA, “allele1 and allele2” by AncestryDNA, and “genotype” by 23andMe and Living DNA. Ancestry DNA puts a space between the two allele values. The other companies list the two alleles together as a single 2-character string.

The SNPs from all five companies are listed by chromosome and then by position within the chromosome. Chromosomes 1 to 22 (the autosomes) are listed first. The sex chromosomes X and Y and the mitochondrial MT follow. Ancestry DNA numbers them instead: X as 23 and 25, Y as 24 and MT as 26. Ancestry uses 25 for the few SNPs that they probe that are in the pseudoautosomal regions of the X and Y chromosomes.
These are the tips of the X that actually combine with the Y chromosome just like autosomal genes do.

Family Tree DNA embeds a 2nd title line between the last SNP on the 22nd chromosome and the first SNP on the X chromosome. Don’t get caught by this. Be sure to remove this second title line if you are analyzing a Family Tree DNA raw data file in a spreadsheet or with programming.

RSIDs and SNPedia

The RSID, which you can think of as the name of the SNP, is usually represented by the letters “rs” followed by a number. SNPedia has information on a fair percentage of these RSIDs and you can look them up to find out what that particular SNP has been found to do. For example, the SNPedia entry for one such SNP will tell you that it is on chromosome 11 at position 66560624, is part of gene ACTN3, and is said to have an effect on muscle performance.
Values of (C,C) could contribute to better performing muscles, (C,T) is a mix of muscle types, and (T,T) could contribute to impaired muscle performance. Medical interpretation of SNPs is not something I have any experience with, so I will make no attempt to do that.

When testing companies test SNPs that do not already have an RSID defined, they often invent their own. 23andMe has used “i” followed by a number. Family Tree DNA and MyHeritage DNA have used “VG” followed by the chromosome number followed by “S” followed by a number. And Living DNA came up with a whole set of different RSID names, each of which must have some meaning to them. In my raw data, I found the following number of SNPs with these prefixes: [figure]

At the time I’m writing this, the number of SNPs in SNPedia is 109,335. SNPedia says that 49,082 of those are tested by Ancestry.com’s v2 platform and 24,761 by 23andMe’s v5 platform with 16,453 in common between them.
There are 13,916 tested by Family Tree DNA. They say that 1,504 of their defined SNPs are in common to most platforms.

Number of SNPs by Chromosome

All companies read and provide raw data for the SNPs from the autosomes (chromosomes 1 to 22) as well as the X chromosome. MyHeritage DNA, Ancestry DNA and 23andMe provide Y chromosome SNPs. Ancestry DNA and 23andMe provide mitochondrial (MT) SNPs.

Below is the number of SNPs by chromosome in my raw data: [figure]

You’ll notice that the FTDNA and MyHeritage numbers of SNPs are identical for all the autosomes and differ by only 16 for the X chromosome.
That’s because both companies use the same chip and the same lab, run by Gene By Gene (the parent company of Family Tree DNA). Differences in the reads between the two are indicative of the error rate in one set of raw data. My March 2017 comparison of the two sets of raw data found 42 differences out of 702,442 autosomal SNPs, indicating an error rate of less than 0.01%.
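The per-chromosome counts and the difference count above are simple to reproduce. The sketch below is one possible way, not the exact method I used. It assumes each file has already been loaded into a dictionary of the form `{rsid: (chromosome, position, genotype)}`, that a no-call is written as `--` (this varies by company), and that allele order does not matter, so AG and GA count as the same read. All function names are made up for illustration.

```python
from collections import Counter

# Ancestry DNA's numeric chromosome codes as described earlier;
# any other label passes through unchanged.
ANCESTRY_CODES = {"23": "X", "25": "X", "24": "Y", "26": "MT"}

def normalize_chromosome(chrom):
    return ANCESTRY_CODES.get(chrom, chrom)

def count_by_chromosome(snps):
    """Number of SNPs per (normalized) chromosome in one raw data file."""
    return Counter(normalize_chromosome(chrom) for chrom, _pos, _gt in snps.values())

def discordance(snps_a, snps_b, no_call="--"):
    """Return (differences, compared) over RSIDs that are called in both files."""
    differences = compared = 0
    for rsid in snps_a.keys() & snps_b.keys():
        gt_a, gt_b = snps_a[rsid][2], snps_b[rsid][2]
        if no_call in (gt_a, gt_b):
            continue  # assumption: no-calls are left out of the comparison
        compared += 1
        if sorted(gt_a) != sorted(gt_b):  # assumption: AG and GA are the same read
            differences += 1
    return differences, compared

# The rate quoted above: 42 differences out of 702,442 autosomal SNPs.
print(f"{42 / 702_442:.4%}")  # prints 0.0060%
```

Whether no-calls should count as differences is a judgment call. Including them would give a higher rate than the one quoted.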
MyHeritage does include some Y chromosome results in its raw data, but Family Tree DNA does not.

Note that Living DNA’s autosomal file does not include Y or mt values. They provide separate files for those. But Living DNA’s Y file only includes the SNP names.
It does not include positions or values. I have 308 entries in my Y file from Living DNA. I would have to assume that these are just the variant Y SNPs that they found for me. You need to look them up in a separate table of Y SNPs. The table indicates what the primary variants are.
But I don’t think their list gives all the Y SNPs that they test. For mt, Living DNA gives the position and value.
But I only have 19 entries in my file, so I assume they are only giving variants. I am pretty sure again those are not all the mt SNPs that they test. Because Living DNA seems to be only including variants, my SNPs are not representative of what everyone will get, and I’m not including them in my analysis.

Ancestry’s X Chromosome in More Detail

Ancestry divides its X data into what it calls chromosomes 23 and 25. The latter is said to represent the pseudoautosomal region which I described earlier. My 27,973 X SNPs from my Ancestry DNA raw data are made up of 27,473 chromosome 23 SNPs and just 500 pseudoautosomal chromosome 25 SNPs.

This is the range of positions and counts of my designated chromosome 23 versus chromosome 25 SNPs: [figure]

Ancestry DNA’s chromosome 25 regions in my raw data include 339 SNPs up to position 2,697,868, which is the starting tip of the X chromosome and is the first pseudoautosomal region. And then there are 63 SNPs at the ending tip of the chromosome in the second pseudoautosomal region.

For some reason, Ancestry DNA assigns 13 SNPs from 2,700,157 to 8,549,940 to chromosome 25, even though that range is outside the pseudoautosomal region and is where it also assigns 1,256 SNPs to chromosome 23.
Then between 88,720,459 and 92,164,248, they have another 84 SNPs assigned to chromosome 25, and I’m not sure why.

The SNP designated 25 at position 117,610,641 in my raw data file is all alone and is likely an incorrect entry by Ancestry DNA.

138 of those Ancestry chromosome 25 SNPs are also included in my raw data from 23andMe, who simply include them as X chromosome SNPs and don’t differentiate them like Ancestry DNA does.

SNPs in common between companies

It is quite important to know how many SNPs are shared between companies. I compared my 5 sets of raw data in pairs and counted the SNPs shared. The numbers on the diagonal in bold are the number of SNPs in my raw data just from that company. The numbers below the diagonal are the number shared. The percentages above the diagonal are the percent shared out of the total SNPs that the two companies have:

percent shared = #shared / (#c1 + #c2 - #shared)

The first table shows the shared autosomal SNPs that I have between my raw data files from the five companies. Below that are the comparable numbers from the ISOGG Wiki. [figure]

The FTDNA number 698,179 that I’ve marked in their chart has to be wrong because it can’t be less than the number FTDNA shares with MyHeritage.
The numbers are fairly close to mine. I know from looking at several different people’s raw data from Family Tree DNA that there is variation in the number of SNPs included in one company’s raw data from test to test.

Family Tree DNA and MyHeritage DNA provide identical autosomal SNPs. They share about 44% with AncestryDNA. 23andMe and Living DNA, who both use the v5 chip, share over 90% with each other, but only about 14% with the other companies.
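The percent-shared formula above, #shared / (#c1 + #c2 - #shared), is just the size of the intersection divided by the size of the union of the two SNP lists. Here is a small Python sketch of the pairwise comparison. It assumes SNPs are matched by RSID alone and that each company's RSIDs have already been collected into a set. The function names are made up for illustration.

```python
from itertools import combinations

def percent_shared(rsids_a, rsids_b):
    """100 * #shared / (#c1 + #c2 - #shared), or 0.0 if both sets are empty."""
    shared = len(rsids_a & rsids_b)
    total = len(rsids_a) + len(rsids_b) - shared
    return 100.0 * shared / total if total else 0.0

def pairwise_overlap(files):
    """files: {company: set of RSIDs}. Returns {(c1, c2): (shared, percent)}."""
    result = {}
    for (name_a, a), (name_b, b) in combinations(files.items(), 2):
        result[(name_a, name_b)] = (len(a & b), percent_shared(a, b))
    return result

def shared_by_all(files):
    """RSIDs present in every company's set."""
    return set.intersection(*files.values())
```

Matching by RSID is the simplest choice, but it will miss SNPs that two companies test under different names, so counts from this sketch could come out a little lower than a comparison by chromosome and position.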
Only 110,231 autosomal SNPs were included in my raw data by all five companies.

Those low overlap percentages are what make it difficult to find matching segments between data from the v5 chip and data from the old chip. Some companies like Family Tree DNA do not yet accept transfers of raw data from 23andMe or Living DNA because of that. MyHeritage DNA uses imputation to estimate the missing SNPs. GEDmatch is still working to develop a more reliable method to compare v5 chip data with earlier data through its GEDmatch Genesis project.

Here’s the same data, but for the X chromosome: [figure]

The ISOGG Wiki doesn’t yet have X data in their table for MyHeritage DNA, Living DNA or the new v5 chip of 23andMe.

Here are my tables for the Y chromosome and for mitochondrial: [figure]

RSIDs with more than one Position

All my raw data files were from Build 37 of the genome. So every RSID should map to one SNP on one specific chromosome at one position. That was true within any one set of raw data, where every RSID was just given once.

But once you combine multiple sets of raw data, you’ll find the same RSID tested in different files. This is the count of the number of RSIDs by the number of files each was found in: [figure]

So you would expect those RSIDs that are in more than one raw data file to be at the same position on the same chromosome in each file.
It turns out that in my files 68 of those RSIDs are not at exactly the same position.

All but 1 are differences with the 23andMe raw data. And most of them are minor.

29 differences have the 23andMe position being just 1 less than the Living DNA position, e.g. RSID rs498648 is on chromosome 1. In my 23andMe raw data file it is at position 176,957,452 and in my Living DNA file, it is at position 176,957,453. Now this is just 1 position different and isn’t important at all for genealogical purposes. But for a programmer who may want to develop tools for handling raw data, even a difference of one can cause a problem.
None of these 29 differences have RSIDs that are in the other 3 raw data files or in SNPedia, so I can’t tell which one might be the correct one.

34 of the differences are very small ones on the mt chromosome where 23andMe is 1 more (31 times), or 2 more (twice) or 3 more (once) than the Ancestry DNA position. For one of these RSIDs, Ancestry DNA lists position 611 on chromosome 26, and 23andMe lists position 613 on chromosome MT.
Of these RSIDs, 32 are listed in SNPedia and SNPedia agrees with Ancestry DNA in all cases.

One more difference is SNP rs3857360, which is in both my Family Tree DNA and my MyHeritage DNA raw data files as position 102,989,428 on chromosome 5, but has a position one higher at 23andMe. This SNP is not in SNPedia.

But there are four differences between 23andMe and Living DNA that concern me the most because the RSID is used for two completely different locations. These 4 are: [figure]

Two of the values at 23andMe are no-calls, but of the other two, one doesn’t match, with a TT at 23andMe and an AA at Living DNA. That already is indicative that these might be different SNPs that one of the companies has named incorrectly.
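For anyone who wants to run the same check on their own files, here is a hedged Python sketch of the idea: collect every location reported for each RSID, keep the RSIDs whose locations disagree, and separate small offsets from completely different locations. The function names and the cutoff of 3 base pairs are my own choices for illustration (3 is the largest small offset I found above). The sketch assumes chromosome labels have already been made consistent, e.g. Ancestry DNA’s 26 converted to MT.

```python
from collections import defaultdict

def find_position_conflicts(files):
    """files: {company: {rsid: (chromosome, position)}}.

    Returns {rsid: {company: (chromosome, position)}} for RSIDs that are
    reported at more than one location across the files.
    """
    seen = defaultdict(dict)
    for company, snps in files.items():
        for rsid, location in snps.items():
            seen[rsid][company] = location
    return {rsid: locs for rsid, locs in seen.items()
            if len(set(locs.values())) > 1}

def classify_conflict(locations, tolerance=3):
    """'offset' if same chromosome and within `tolerance` bases, else 'different location'."""
    chromosomes = {chrom for chrom, _pos in locations.values()}
    positions = [pos for _chrom, pos in locations.values()]
    if len(chromosomes) == 1 and max(positions) - min(positions) <= tolerance:
        return "offset"
    return "different location"

# Two of the RSIDs discussed above, with the positions from my files.
files = {
    "23andMe": {"rs498648": ("1", 176957452), "rs3857360": ("5", 102989429)},
    "Living DNA": {"rs498648": ("1", 176957453)},
    "Family Tree DNA": {"rs3857360": ("5", 102989428)},
}
for rsid, locations in find_position_conflicts(files).items():
    print(rsid, classify_conflict(locations), locations)
```

Both of these come out as "offset". An RSID that two files place on different chromosomes, or far apart on the same one, would come out as "different location", which is the kind that deserves a closer look.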
Hi Louis,

Thanks for the very interesting and informative post. I recently (March 2019) got my autosomal DNA test results from Living DNA. When I compared the number of shared SNPs between Living DNA, FTDNA and Ancestry, I got different percentages of SNPs shared between them than what you found: 39% and 23% for FTDNA and Ancestry, respectively. After doing some more research I realized this was due to Living DNA recently switching (late Oct 2018) to the new Affymetrix chip, whereas your Living DNA results pre-dated the switch and were on the Illumina GSA chip. The Affymetrix chip data should result in better matches at GEDmatch Genesis for FTDNA, MyHeritage, Ancestry and Living DNA or when all results are combined. If you’re interested in the SNPs used by Living DNA, I’m happy to share the list.

As an aside, I got my Living DNA results back in about 3 weeks, which was a pleasant surprise.

I am wondering if there is a chart or list somewhere showing the actual overlaps, not just the counts?
I have already done the Ancestry DNA test, but instead of doing all the other ones to make a combined file, which would be the best two to follow up with in order to obtain the greatest total SNP coverage? Right now, I am guessing that along with my AncestryDNA, I should do a MyHeritage kit and a Living DNA kit. The extra Y chromosome SNPs in the 23andMe test aren’t a concern, as I have ordered the Big Y-700 from FTDNA during this recent sale.

Thank you.
This is super interesting and well done. I think ultimately the double position and double rs-ID problems are the fault of dbSNP, the underlying public database. I have heard many professional geneticists say that they now only use chr:pos:A1A2:build as an ID because of this frustration (that’s “chromosome”, “position”, “allele 1”, “allele 2”, “genome build”). But obviously that’s not nice for us in genealogy, and it’s also on the DTC companies to fix.

One thing I would like to see more investigation of is imputation. For so many of these SNPs that are missing from one company and not from another, it would be possible to perfectly fill in the missing part because the pair is in perfect linkage disequilibrium. That means missing SNPs from one company actually could become a non-issue.
Is that something you’ve ever looked into more? You mention MyHeritage uses it, but it seems like it could be an overall solution?

The general method of imputation is to use samples from a population that best fit known values to fill in missing values. Imputed values will thus not catch many variants, and may result in non-matching segments with close relatives who should match and matching segments with people who shouldn’t match.

As a result, I don’t like imputation. MyHeritage is considered to be the company giving the poorest matching results and I attribute that to their imputation and stitching.

The correct solution would be for companies to use one person’s extra SNPs in a matching segment to add to and fill in the other person’s SNPs.
Most of the major consumer genetic testing companies allow customers to download their raw data files. These files contain the “letters” (nucleotides A, C, G, T) that comprise DNA. The raw data can be uploaded to a variety of different services for free and/or paid analyses. The list below provides information on services which are of particular interest for or which have been used by genetic genealogists.
It is not intended to be a comprehensive list and is provided for information only. Inclusion on this list does not imply recommendation or endorsement.

Tool category: Meaning
web tools (not desktop tools): website/page with no (desktop) software installation needed
'genealogical': can do analysis of a person's DNA against DNA from other relatives or ancestral populations (e.g. ancestry/ethnicity/ancient DNA, compare/analyze relatives, find new relatives)

Genealogical web tools
Kevin Borland's database focused on DNA of deceased individuals, linked to a set of DNA reconstruction tools for creating kits representing deceased ancestors.

Utilities for analysing raw DNA data.

A website which provides a range of online and offline tools for analysing DNA data. The basic service is free. There is a monthly subscription fee for the premium service.

A not-for-profit community website run by academics affiliated with Columbia University and the New York Genome Center. The site offers a biogeographical analysis, imputation and a relative-matching feature.

A free utility to compare autosomal DNA data files from all three testing companies and to compare Gedcom files. A number of other very useful tools are also provided, some for a fee.

A service using raw DNA data to report on genes, traits, and ancient origins. Traits covered include eye color, lactose intolerance, alcohol flush reaction, and taste and smell sensitivity. Generates inheritance trees to show which genes were passed down from grandparents and parents to a child; identifies whether these genes originated in Europe, Asia, Eurasia, or Africa. Its Grandchild Report calculates what percentage of DNA a child inherited from each grandparent.

A set of online tools from Stanford University for analysing your personal genomic data. For further explanation see this.

A tool provided by Andrew Riha for analysing raw data files.