How to Analyze Big Y Results

by Philip Shaddock
"Big Y" is a marketing term for the NGS (Next Generation Sequencing) YDNA test sold by Family Tree DNA. It tests a portion of the male Y chromosome to find changes in the genetic code called SNPs (Single Nucleotide Polymorphisms). These one-letter changes at select locations on the Y chromosome are inherited. They occur infrequently, about once every 144 years. They are used in family trees to denote a new branch of the family. The Shattocke family tree shows how I use a combination of SNPs and STRs to define the historical branching of the family. Click on the image to enlarge it.


Branch-defining SNPs are found in Big Y results through analysis of those results. You have to dig to find them. FTDNA has recently done a much better job of presenting the results of their Big Y test but if my haplogroup is any indication, there is still a pressing need to engage the help of outside services to help find those branch-defining SNPs.  

In the case of the FTDNA Big Y results, individuals have some information about the Big Y results of other members of their haplogroup, but it is limited. And they have no access to individuals in nearby haplogroups. Comparing your results to others is the keystone of mutational analysis so these are critical limitations. As a project administrator at FTDNA I have privileged access to the Big Y results information posted on my members' pages. But I do not have access to all the information. There is important information in VCF and BAM files (so-called raw results), which I do not have access to by default, although all my own haplogroup members have provided me with these files at my request. However, I still cannot compare the results of my project members to the Big Y results on a neighboring branch. I am writing to these people to ask them to participate in my research, to the benefit of us both.

Some may find this method useful if they want to compare their SNP results to another person or group of persons outside their project at FTDNA. Only people who have done the FTDNA Big Y test can be used for this method as it uses the FTDNA variants data, which will differ from tests at other companies.

With that said, there is still a lot of information an individual can derive from Big Y results using the more limited information in the new Hg38 Big Y results provided by FTDNA.

If you do not know what an SNP or STR is you should acquire that understanding before reading this page. I provide a brief introduction on the master page to this one. 

Why Do Your Own Analysis?

There are a couple of reasons why you might want to analyze your own Big Y results. Analyzing your own results requires that you learn the basics of mutational analysis, which despite the name I have given it, is actually very simple. It largely consists of looking for patterns in VCF data made simple by the layout of the spreadsheet I recommend creating. You paid a lot of money for the test and you should get the maximum amount of information and understanding of your family tree from that investment. Also, the information coming back from third party analysis is not always correct, and not always complete because a relatively small percentage of people submit their results to third party analysis. Since mutational analysis is based on comparing your results with others, that is a very important limitation. 

In fact I have found that I have had to tend my own family tree and communicate what I have found to FTDNA and to each of the third party services I use to get all their trees into sync. So it pays to be your own SNP gardener.

A case in point is the Y19751 SNP which all but FTDNA identified as the terminal SNP for the branch of my family I call the Stogumber Shattocks. FTDNA  uses a very strict quality control system that can sometimes miss an important SNP like Y19751. Some might argue that FTDNA is just being very careful, validating only those SNPs with a very high confidence level for genealogical purposes. FTDNA shows the SNP as positive for only four of the twelve Stogumber Shattucks who have Big Y tested. The other eight Shattucks are positive for the SNP, but the quality of that positive is below the standard set by FTDNA. 

What does "confidence" or "quality" mean in this context? When DNA is sequenced it is first broken into small fragments. The same fragment has one or dozens of copies. Each of those fragments is "read." Some reads will be incorrect. Some will be highly ambiguous. Others cannot be read ("no call"). The more positive reads you have for a location on the Y chromosome, the higher the probability that you have captured the correct genetic "letter" (A, C, G, or T) for that SNP. For the 12799203 SNP FTDNA's algorithm determined that there were not enough samples with quality reads at location 12799203 to accept it as a branch defining SNP. FTDNA requires ten high quality reads to make an SNP a branch defining SNP. For the 12799203 SNP only four of the 12 Stogumber Shattockes who paid for Big Y tests were awarded the 12799203 unnamed SNP.

The 22 Shattockes who are not Stogumber Shattocks, but belong to the other branches of the family, are negative for this SNP. This fact gives us a lot of confidence that the Y19751 SNP was inherited from a common ancestor of the Stogumber Shattockes. Although the SNP is hard to detect in 75% of Stogumber Shattucks, it is in fact an SNP that is only shared among Stogumber Shattucks.  

The fact that FTDNA did not initially include the Y19751 SNP in their tree is reason to seek the opinion of other experts, including Thomas Krahn of YSEQ, Alex Williamson of YNET and the team at YFull. However, as the project administrator for my family group, ultimately the decision about its inclusion in the family tree lies with me and whoever else in my family project wants to participate in that decision. 

Branch Defining SNPs

Mutational analysis is based on finding SNPs that are inherited exclusively by a branch of the family. The term commonly used to describe an SNP that defines a branch of the family is "terminal SNP." A better term might be "placeholder SNP" because the SNP chosen to represent a new branch of the family might be one of several SNPs that have been inherited by all members of the branch and you may not know which SNP came first. And so-called terminal SNPs usually are above newer branches of the family so they are not in that sense terminal. 

My family tree is a succession of "terminal" SNPs.
I use two terminal SNPs to describe the haplogroup my family members belong to, shown at the bottom of the line of descent. That is because YFull has designated SNP Y16884 as the terminal SNP for the family and FTDNA has designated a different SNP as the terminal SNP for the family, Y16895. This makes it easier to compare information from the two companies. Did I not say that you have to take analysis of your Big Y data into your own hands? I tested for the two SNPs in other nearby branches. None of them were positive for either SNP. 

In any event, Y16884 / Y16895 are hardly terminal. The Shattocke family tree at the top of this article has many branches that descend from Y16884 / Y16895.

The SNP Variant Spreadsheet

So how do you find these SNPs? FTDNA provides lists of SNPs found in your DNA (called named and unnamed variants) and a graphical way of viewing the quality of individual SNPs (the chromosome browser). However, they do not provide a way to compare one family member's Big Y results to another in a graphical format. Most people will find their "matching" system difficult to use and if they have missed important branch defining SNPs it will not be of much value. I have developed a method for developing a Project SNP Comparison Spreadsheet using the information supplied by FTDNA on the Big Y results page of individual members. This spreadsheet gives you a visual comparison of SNPs and a place where you can note the results of further investigations. You can see what this spreadsheet looks like by downloading the Project SNP Variant Spreadsheet I have created for members of my project. I have removed member names to protect their privacy.  


The spreadsheet contains the shared and not shared SNPs that are held in common by the members of my project. And I have annotated it with further research. To my surprise, I was able to independently validate the branch-defining SNPs found not only by FTDNA but the third party services as well. While I am not representing it as a replacement for these services, it has helped me understand the research. And I discovered branch-defining SNPs in a recent Big Y result, long before FTDNA delivers the raw results (BAM file), long before Alex Williamson has analyzed the VCF file, and months ahead of a YFull analysis of the BAM file. 

The information I used to create the spreadsheet comes from the Big Y results page of individual member accounts. FTDNA provides lists of named and unnamed variant SNPs.



Named variants are those SNPs that have been discovered to have genealogical significance by a lab. The name of the lab is represented by the letter in front of a number. (The "Y" in Y19751 refers to the YFull lab.) Unnamed variants are often described as "private SNPs," meaning they are found in an individual and have not as yet been found in another individual within the family group. For example, YFull has named the SNP that defines the Stogumber branch of my family as Y19751. FTDNA does not consider this SNP to be branch defining for the Stogumbers so they use its position name: 12799203. 

The trick is to get the list of variant SNPs FTDNA provides into a spreadsheet so that you can compare one person's SNPs with another's. 

Creating the SNP Variants Comparison Spreadsheet

In a sub-page to this one I show you how to do that. I download the list of SNP variants from the Big Y results page. I then use a Visual Basic macro embedded in an Excel spreadsheet to line up columns of named and unnamed variants so that they can be visually compared. See the steps for creating the spreadsheet here.
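For readers who do not use Excel macros, the same lining-up idea can be sketched in Python. This is not the Visual Basic macro itself, only an illustration of the layout: one row per member, one column per variant, and a blank cell where a member lacks the variant. The member labels and their variant lists are hypothetical, as are the function names.

```python
import csv
import io

# Hypothetical example data: member labels and variant lists are illustrative only.
VARIANTS_BY_MEMBER = {
    "Member 1": {"Y16895", "12799203"},
    "Member 2": {"Y16895", "12799203"},
    "Member 3": {"Y16895", "A8033"},
}


def build_comparison_table(variants_by_member):
    """Return a header row plus one row per member, with each variant in its own column."""
    variants = sorted(set().union(*variants_by_member.values()))
    rows = [["Member"] + variants]
    for member in sorted(variants_by_member):
        have = variants_by_member[member]
        # Repeat the variant name where the member has it; leave the cell blank otherwise.
        rows.append([member] + [v if v in have else "" for v in variants])
    return rows


def table_to_csv(rows):
    """Serialize the rows as CSV text that a spreadsheet program can open."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


TABLE = build_comparison_table(VARIANTS_BY_MEMBER)
print(table_to_csv(TABLE))
```

Because every variant gets its own column, a shared variant shows up as a filled vertical run and a missing one as a gap, which is the visual pattern the spreadsheet relies on.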

Analyzing the SNP Variants Comparison Spreadsheet

You are trying to discover if two people share a private SNP. If they do, then they are probably descended from a common ancestor and you have discovered a branch of the family, or confirmed its existence from other genealogical studies. It is not always the case that branch defining SNPs are discovered among unnamed variants. A private SNP can be a named or unnamed variant. For example the branch of my family called the Byars - Parrish Shattockes shares a named variant called A8033, which had previously been discovered as a shared SNP among members of a completely different branch of the human family. The Y16884 SNP, which marks the common ancestor of all Shattockes, has also been discovered in two other completely different branches of the human family, the E haplogroup and the J haplogroup.

So how do you decide if a named SNP is also a branch defining SNP? You have to look at the SNP within the context of your haplogroup. Is there an SNP that can only be found in a sub-group? If the named or unnamed variant is found only among the members of a sub-group it is likely to be a branch defining SNP. All it takes is two members of a sub-group to define a new branch of the family.

The spreadsheet makes this very easy to do.
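The rule described above (a variant found only within one sub-group, and in at least two of its members) can also be expressed as a short Python sketch. The member labels, branch assignments and the private position 1234567 are hypothetical, and the function name is my own. The output is a list of candidates to investigate, not a verdict.

```python
from collections import defaultdict

# Hypothetical example data for illustration only.
VARIANTS_BY_MEMBER = {
    "Member 1": {"Y16895", "12799203"},
    "Member 2": {"Y16895", "12799203"},
    "Member 3": {"Y16895", "A8033"},
    "Member 4": {"Y16895", "A8033"},
    "Member 5": {"Y16895", "1234567"},
}
BRANCH_BY_MEMBER = {
    "Member 1": "Stogumber",
    "Member 2": "Stogumber",
    "Member 3": "Byars-Parrish",
    "Member 4": "Byars-Parrish",
    "Member 5": "North Molton",
}


def branch_defining_candidates(variants_by_member, branch_by_member, min_members=2):
    """Return {variant: (branch, members)} for variants confined to a single branch."""
    carriers = defaultdict(set)
    for member, variants in variants_by_member.items():
        for variant in variants:
            carriers[variant].add(member)

    candidates = {}
    for variant, members in carriers.items():
        branches = {branch_by_member[m] for m in members}
        # Keep the variant only if every carrier is in the same branch
        # and at least `min_members` people share it.
        if len(branches) == 1 and len(members) >= min_members:
            candidates[variant] = (branches.pop(), sorted(members))
    return candidates


CANDIDATES = branch_defining_candidates(VARIANTS_BY_MEMBER, BRANCH_BY_MEMBER)
for variant, (branch, members) in sorted(CANDIDATES.items()):
    print(variant, branch, members)
```

Y16895 is dropped because it appears in every branch, and the private variant is dropped because only one member carries it. A member whose reads were too weak to be listed simply does not count as a carrier here, which is why low-quality positives still need to be checked by hand.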

Let's look at a specific example, the hard to detect Y19751 named variant that initially showed up as unnamed variant 12799203 in FTDNA's table of Big Y results.


The image at left shows a portion of my haplogroup's spreadsheet.

On the left of the spreadsheet is a color-coded list of member names. The color codes refer to the major branches of the family. For example, the first four yellowish brown highlighted names are Shattockes who belong to the North Molton branch. The different shades of yellowish brown represent the further subdivision of the North Molton Shattockes into Yarnscombe and Burrington branches.  

The Stogumber Shattockes are found in the gray highlighted area of the project members' names. This is the area where the 12799203 variant is found. FTDNA's algorithm found this "unnamed variant" in four of the 34 Shattockes who Big Y tested. Those four members with the unnamed 12799203 variant are all members of the Stogumber Shattockes and they positively show that a "G" was found in place of an ancestral "C" at this location. While FTDNA decided this was not a branch defining SNP and left the location unnamed, YFull did a deeper analysis and decided that it was indeed a branch defining SNP. They named it Y19751.

I investigated the other members of the group to see what SNP reads were found at their 12799203 locations. Remember that "reads" are simply the letter of the genetic alphabet (A,C,G, or T) found at the location of an SNP. The caveat is that NGS (next generation sequencing) is an imperfect technology so the quality or confidence of the read at specific locations can vary from one person to the next, even between closely related individuals.  

You can see that only four Shattucks met FTDNA's standard for this SNP at the 12799203 location. I understand that 10 high quality reads is the standard. Only four individual results met this standard, with the rest failing because they ranged between one positive read and six positive reads. 
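To make the threshold idea concrete, here is a small Python sketch that sorts members by their number of positive reads. The threshold of ten comes from my understanding stated above; the read counts and member labels are hypothetical, and FTDNA's real algorithm certainly weighs more than a simple count.

```python
MIN_QUALITY_READS = 10  # threshold as understood in this article, not an official FTDNA constant

# Hypothetical positive-read counts at one location, for illustration only.
POSITIVE_READS = {
    "Member 1": 11,
    "Member 2": 6,
    "Member 3": 1,
    "Member 4": 0,
}


def classify_read_counts(positive_reads_by_member, threshold=MIN_QUALITY_READS):
    """Label each member by how their positive-read count compares to the threshold."""
    labels = {}
    for member, reads in positive_reads_by_member.items():
        if reads >= threshold:
            labels[member] = "passes threshold"
        elif reads > 0:
            labels[member] = "positive reads, below threshold"
        else:
            labels[member] = "no positive reads"
    return labels


LABELS = classify_read_counts(POSITIVE_READS)
for member, label in LABELS.items():
    print(member, label)
```

The middle category is the interesting one: members with a few positive reads are not reported as positive, yet they are not negative either.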

I have drawn a dotted line around Stogumber results for location 12799203. How do I know they are all descended from a common ancestor? Later I will show you how to do this using STR (Short Tandem Repeat) results and genealogical research. But let us see how far we can get just using SNP data.

The key to finding a mutant (C -> G in this case) SNP unique to a branch of a family is to determine if it is found in other branches. So we need to determine whether or not other Shattockes in other branches of the family are negative for this SNP (Y19751-). That means they have the ancestral value for the SNP, "C." It is actually easier to determine a negative than a positive. 

In the case of FTDNA named variants, like Y16884, you can use the chromosome browser to determine the quality of an SNP result. Remember that YFull uses this SNP as a terminal SNP, but FTDNA uses Y16895 as their terminal SNP. This screen capture from the SNP Variant Comparison spreadsheet shows why:


All but one of the Big Y results pass FTDNA's quality standard for the Y16895 terminal SNP. Only 12 of 34 Big Y results pass the same test for Y16884. I have filled in the number of positive reads. Where did I get these from? I used FTDNA's chromosome browser to look at each individual's reads. You call up the chromosome browser by doing a search for the SNP in the named variant column.

If you search for a named variant that is not in the list you will come up with no results. Set the "Derived" column to "Show All" to find Y16884. (Derived means the sample is positive for the SNP; that is, the genetic letter at this location is the variant rather than the ancestral value, T in place of C.) For this result, the system does not know if the sample is positive or negative for the SNP, so it has put a question mark in the Derived column. To bring up the chromosome browser, click on the Y16884 link.
The graphic shows that there were eight reads, shown as pink "Cs."

Clicking on one of the pink Cs pops up a display of the quality of the individual read. In this case the individual read is rated at a very high confidence level, 99.9999%. But interestingly enough, the SNP at this location 6885283 (aka Y16884) fails to appear in this person's list of named variants because it has only 8 positive reads. It falls below FTDNA's standard of ten positive reads for a quality SNP.

The fact you can evaluate the SNP across all the members of your project means you can decide for yourself if you are going to accept this SNP as the terminal SNP for your haplogroup. You will probably decide that based on a lot of other evidence, the nature of which I will cover in a bit. 

Unfortunately the chromosome browser is not very useful for unnamed SNPs like 12799203, the Stogumber Shattocke terminal SNP. When you do a search for an unnamed SNP that an individual is shown as not having, it cannot find the location. You can probably assume it is negative, meaning the sample has the ancestral value for the SNP. There is another way of checking. FTDNA provides a spreadsheet of all the unprocessed variant SNPs, both named and unnamed. 


Searching in this spreadsheet shows the results at that location. This is especially useful when determining if another member is negative for the SNP. Searching for the 12799203 SNP in North Molton Shattockes produced this finding.


The North Molton Shattocke had the ancestral letter "C" for this SNP, making him negative.
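The same lookup can be sketched in Python for data in standard VCF text form. This assumes the usual VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, sample) and a "chrY" chromosome label; the layout of FTDNA's own export is not described here, so treat both as assumptions. The sample rows are made up, and position 1234567 is hypothetical.

```python
# Made-up VCF rows for illustration; real files carry many more header lines and fields.
SAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chrY\t1234567\t.\tA\tG\t40\tPASS\t.\tGT\t1",
    "chrY\t12799203\t.\tC\t.\t50\tPASS\t.\tGT\t0",
])


def find_vcf_record(vcf_text, position, chrom="chrY"):
    """Return the first record at `position` as a dict, or None if the position is absent."""
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 5 or fields[0] != chrom or int(fields[1]) != position:
            continue
        genotype = None
        if len(fields) >= 10:
            keys = fields[8].split(":")
            values = fields[9].split(":")
            if "GT" in keys:
                genotype = values[keys.index("GT")]
        return {"pos": position, "ref": fields[3], "alt": fields[4], "genotype": genotype}
    return None


def called_letter(record):
    """Translate the genotype index into a letter: 0 is REF, 1 is the first ALT, and so on."""
    if record is None or record["genotype"] is None:
        return None
    first = record["genotype"].replace("|", "/").split("/")[0]
    if first == ".":
        return None  # no call
    index = int(first)
    return record["ref"] if index == 0 else record["alt"].split(",")[index - 1]


RECORD = find_vcf_record(SAMPLE_VCF, 12799203)
print(RECORD, called_letter(RECORD))
```

Keep in mind that the reference letter in a VCF file is not always the ancestral letter, so the result still has to be read against what you know about the SNP (here, ancestral "C" and variant "G").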

Many times NGS technology like the Big Y test fails to provide unambiguous results for a location. In cases where there is poor or no information about an SNP, FTDNA allows you to test individual SNPs, using a different form of sequencing that provides a more accurate result, called Sanger sequencing. It costs $39 per SNP. A less expensive option is the same service offered by YSEQ. They can test a single SNP for $18 (plus $5 for shipping on the first test). 

One of the benefits of submitting your raw Big Y results to YFull is you get a report that shows where a named SNP is found (in the YFull database or Thomas Krahn's YBrowse database). They provide information on how many reads the position had. They report if the SNP has been verified by Sanger sequencing. They report if the SNP falls within a region of the Y chromosome that is known to yield reliable, valid SNPs for genealogical purposes (the combBED region).

The Y19751 SNP meets all these criteria as you can see in the graphic above. It has been verified by Sanger sequencing, it is in the YFull and YBrowse databases and it falls within the combBED region of the Y chromosome.

Alex Williamson at YNET also gives information about whether the SNP falls within the coverage area of the Big Y test.

Determining if the SNP is Useful For Genealogy

The entire Y chromosome is not tested by the Big Y test, only areas where SNPs useful for genealogical purposes are found. Plus the Big Y test does not provide full coverage of the genealogically significant areas, and the coverage can vary from one person to the next. So how do you know if the SNP you are studying is valid? YFull and Alex's YNET information pages tell you if the SNP is valid. But there is another way of finding out, without recourse to these services. 

The information is found in the raw results. When your results first come back from FTDNA, the BAM file is missing but there is a VCF zip available right away. This file contains information about the SNPs that vary from the standard reference. You can download it from your results page:


These raw results are not available to project administrators, so the admin will have to enter his or her own personal account to download it, or ask the person he or she is studying to download and email it. Click on "Download VCF file."

In the zip archive is a file titled regions.bed.

This is simply a list of all the valid regions that the Big Y test covered. Loaded into MS Word this is what it looks like:

I have highlighted two regions. They are defined with the starting chromosome position (e.g. 4936614) and ending chromosome position (e.g. 4937181) of a region. 

To determine if the Z36 SNP is valid, I have to find its actual chromosome location. I used the YBrowse tool described in the next section. Simply entering "Z36" in the search box of YBrowse gave me the Hg38 position of Z36: 5093077. (At the time I am writing this in December 2017, my VCF file was not available in Hg38 format, so I had to convert the position to Hg19 format using YFull's Check SNP information pop-up, which gave 4961118.)

If you look at the highlighted items in the above graphic, you will see that chromosome location 4961118 falls between the two regions marked as valid. You can therefore conclude that Z36 is not valid for genealogical purposes. In fact it lies in a transposable region of the Y, meaning that it is subject to crossover.
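Scanning a long regions.bed file by eye is tedious, so here is a minimal Python sketch of the same check. The first region is the one quoted above; the second region is hypothetical, since only its existence and not its coordinates is given here. The sketch assumes the file follows the usual BED convention (tab-separated chromosome, 0-based start, exclusive end), and the position you test must be in the same genome build (Hg19 or Hg38) as the file.

```python
# First region is from the article; the second is a hypothetical neighbour for illustration.
SAMPLE_BED = "chrY\t4936614\t4937181\nchrY\t4968000\t4968500\n"


def parse_bed(bed_text, chrom="chrY"):
    """Return a sorted list of (start, end) tuples for the given chromosome."""
    regions = []
    for line in bed_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] != chrom:
            continue  # skip blank lines, headers and other chromosomes
        regions.append((int(fields[1]), int(fields[2])))
    return sorted(regions)


def position_is_covered(position, regions):
    """Check a 1-based position against BED intervals (0-based start, exclusive end)."""
    return any(start < position <= end for start, end in regions)


REGIONS = parse_bed(SAMPLE_BED)
print(position_is_covered(4961118, REGIONS))  # the Hg19 position found for Z36
print(position_is_covered(4937000, REGIONS))  # a position inside the first region
```

With these two regions the Z36 position falls in the gap between them, which matches the conclusion reached by reading the file directly.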

See the YBrowse section to see how to use it as a tool to discover if the Y chromosome location you are exploring is in a valid region of the Y.

A caveat is that FTDNA's criteria for BED regions are too strict, potentially missing SNPs that would be useful for genealogy. This is why I recommend passing the BAM file to YFull to get every single bit of information from your Big Y test.

I give the last word to Mike Wadna who has this to say about BED files: "FTDNA designed Big Y to target coverage of “gold” regions as defined David Poznik of Stanford University’s Bustamente lab. The Big Y REGIONS.BED file shows actual coverage from a Big Y test run but it should roughly follow the Poznik gold regions. Later a fellow named Adamov further refined the gold regions into what he called CombBed regions. YFull uses Adamov’s CombBed regions for age estimates. Later, Iain McDonald extended and refined these region definitions for what are called McDonald regions. Just because a variant falls outside of these regions does NOT mean the variant is unstable or otherwise bad. However, these are safer regions in terms of being well studied."

YBrowse: Finding the Name of FTDNA Unnamed Variants

You might be wondering how I knew that the named SNP Y19751 is found at chromosome position 12799203. There is an ISOGG online tool created by Thomas Krahn at YSEQ that provides that information: http://ybrowse.org/gb2/gbrowse/chrY/?.

Using the named or unnamed variant, YBrowse will return named SNPs at the different locations on the Y chromosome.


To use the browser enter the named or unnamed SNP into the search box.

Named SNPs, like Y19751, can simply be entered into the search box.
Unnamed SNPs, like 12799203 need to be entered in this format: 

                    chrY:12799203..12799203

The SNPs found at that location will be displayed under "SNPs." I used YBrowse to annotate the unnamed SNPs in my spreadsheet, and to provide the alternate SNP names for named SNPs.
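If you are annotating many unnamed variants, a tiny Python helper can build the search text for you. It only formats the string in the pattern shown above; it does not contact YBrowse, and the function name is my own.

```python
def ybrowse_search_term(variant):
    """Return the text to type into the YBrowse search box for a named or unnamed variant."""
    text = variant.strip().replace(",", "")  # tolerate positions written as 20,309,601
    if text.isdigit():
        # Unnamed variant: a bare position becomes a one-base range on chrY.
        return f"chrY:{text}..{text}"
    # Named variant: entered as-is.
    return text


print(ybrowse_search_term("12799203"))
print(ybrowse_search_term("Y19751"))
```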

There is another very useful feature of YBrowse. It will identify regions of the Y chromosome that are troublesome. For example when I searched for an SNP I found in a sample at the location chrY:20,309,601, I discovered it is found in the DYZ19 region of the Y chromosome, a troublesome area best ignored. It is a 400 kb heterochromatic region, consisting of "massively amplified tandem repeats of low sequence complexity." It is said to be a region that is difficult to sequence and so cannot be mapped to the Y chromosome with sufficient resolution.

Validating Big Y Conclusions By Other Means

If you look at the SNP Variant Spreadsheet I created you will see that it has major areas with holes. I have heard Big Y results described as "swiss cheese" for the missing or erroneous SNPs. I have highlighted in yellow the SNPs in the spreadsheet that at first glance appear to be branch defining SNPs. But examining other Shattockes with results at the same Y chromosome location I was able to determine that they occur at locations that are unreliable and should not be used for genealogical purposes. The third party services will be able to confirm this conclusion. 

Analyzing your own Big Y results using the list of variant named and unnamed SNPs and the chromosome browser has its limitations, so I recommend sending your VCF file to Alex Williamson at YNET and the BAM file to the team at YFull. But there is a way of checking the conclusions you have come to while you wait for such third party analysis. I will show you how that is done using the Y19751 SNP as an example. 

Validating with STR Results

Can the Y19751 SNP be trusted as a branch defining SNP? How do we know that the other 75% of Stogumber Shattucks, those with weak signals for the SNP, are descended from a common ancestor? There is another way of validating SNP results and that is through another type of Y chromosome mutation FTDNA also tests: STRs or Short Tandem Repeats. As it turns out the Stogumber Shattockes, who bear the Y19751 mutation, have signature STRs that confidently identify them as descendants of a common ancestor.

The image at the left is from a spreadsheet I maintain for Shattocke STRs for the DYS552 marker. This is a signature STR for Stogumber Shattocks because all members of this branch of the family have twenty three repeats for the marker and no other branch of the Shattocke family has this number of repeats for DYS552. The three red asterisks associated with DYS552 show that it is a relatively slow moving STR, meaning it is slow to gain or lose a repeat. In fact since the common ancestor of Shattockes, who was born approximately in AD 1360, the marker has only lost one repeat and that was in the Stogumber Shattocke branch of the family. This STR acts like an SNP for my Shattuck relatives. 
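The test for a signature STR (every member of the branch has the same repeat count, and nobody outside the branch has it) can be sketched in Python as follows. The member labels are hypothetical. The value 23 for DYS552 is from the text above; the 24 in other branches is inferred from the one-repeat loss just described, and the DYS393 values are invented purely to show a marker that fails the test.

```python
# Illustrative data; see the assumptions stated above.
REPEATS_BY_MEMBER = {
    "Member 1": {"DYS552": 23, "DYS393": 13},
    "Member 2": {"DYS552": 23, "DYS393": 13},
    "Member 3": {"DYS552": 24, "DYS393": 13},
    "Member 4": {"DYS552": 24, "DYS393": 13},
}
BRANCH_BY_MEMBER = {
    "Member 1": "Stogumber",
    "Member 2": "Stogumber",
    "Member 3": "North Molton",
    "Member 4": "Byars-Parrish",
}


def signature_markers(repeats_by_member, branch_by_member, branch):
    """Return {marker: repeats} shared by all of `branch` and by no one outside it."""
    inside = [m for m in repeats_by_member if branch_by_member[m] == branch]
    outside = [m for m in repeats_by_member if branch_by_member[m] != branch]
    if not inside:
        return {}

    signatures = {}
    for marker in sorted(repeats_by_member[inside[0]]):
        inside_values = {repeats_by_member[m].get(marker) for m in inside}
        if len(inside_values) != 1 or None in inside_values:
            continue  # branch members disagree, or someone lacks a result
        value = inside_values.pop()
        outside_values = {repeats_by_member[m].get(marker) for m in outside}
        if value not in outside_values:
            signatures[marker] = value
    return signatures


SIGNATURES = signature_markers(REPEATS_BY_MEMBER, BRANCH_BY_MEMBER, "Stogumber")
print(SIGNATURES)
```

A match on a single fast-moving marker would mean little, so in practice the mutation speed of each marker found this way still has to be considered.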

Here is a link to an edited version of the spreadsheet, with project member names removed, and a number of branches truncated. But you can see the green highlighted signature markers: https://drive.google.com/open?id=1LiJNoo9G2M4OGrsVEJra7ej_zNNLg8OzGmqkKEBaQB4. I have a spreadsheet for the 400 odd STRs returned by YFull in their analysis of the BAM file. That is even more useful for discovering signature STRs.

As it stands, neither FTDNA nor the third parties have a system in place that allows you to validate genealogical conclusions based on SNP observations coupled with STR results analysis. However, YFull does have robust tools for investigating STR results on your own, including estimates of how fast a particular STR mutates, which I include in my STR spreadsheet as 2 to 5 red asterisks, with 5 the slowest. They have other STR information as well, including percentage amounts for the frequency of historical STR values. This is very helpful when you are trying to determine which samples have ancestral values for an STR.

YFull derives 300-400 STRs above the standard YDNA-111 STRs from the Big Y BAM file. Apparently FTDNA plans to also offer these extra STRs sometime in the next year. While the quality of these STRs extracted from BAM files is not as good as the first 111 STRs returned from Sanger sequencing, with enough of them they can be really useful. The additional STRs overcome a problem with STRs, which is that they mutate much more quickly than SNPs. That makes them very useful for studies within the past 600 years. But some STRs can change several times during that time frame and can either lose or gain a repeat. When looking at the number of repeats for a rapidly changing STR you do not know if it gained and then lost a repeat, then gained one again. That is three changes to the number of repeats, even though the count appears to have changed only once! The additional STRs returned by YFull (and eventually FTDNA) allow you to use more than one STR to identify the split in a branch.

A major branch of the Shattocke family are the Byars - Parrishs, who are the result of a Shattocke to Byars NPE in the early years of the Chesapeake Bay colony in Virginia (circa 1640). This is a branch that has yielded very few useful SNPs in the past 370 years. So I have used multiple STRs to find how the family branched during that period.

My primary focus is on Shattocke family branching since the common ancestor Y16884 / Y16895, who is estimated by YFull to have lived around AD 1400. The infrequency of branch defining SNP mutations makes the use of STRs in my family tree essential. Finding more than one non-modal STR that is shared by a group of my members increases the confidence in the use of STRs in defining major branches of the tree.

Validating with Genealogical Research

There is yet another way to confirm that Stogumber Shattucks descend from a common ancestor. There are well-documented paper trails for Shattucks who emigrated to the Massachusetts Bay colony in the early 17th century. In fact the author of a book about the Massachusetts Shattucks, Lemuel Shattuck, is one of the fathers of modern genealogical research. He designed the 1850 federal census form in the United States. His book, Memorials of the Descendants of William Shattuck (1855), traces the descendants of a single male pilgrim to the Massachusetts Bay colony in the early 17th century.

You need to do your own research and pull information from more than one source.  You need to become a genetic genealogist.

Conclusions


The method I use is not intended to completely supplant analysis done by FTDNA and third party services. I am still going to use third party analysis because they have more sophisticated software tools, and services like YFull study the BAM file, which contains twice the amount of information found in a VCF file. A big benefit to paying YFull $49 to analyze my Big Y results is that they provide a number of tools on their site for evaluating SNPs and they deliver up to 500 STRs dug out of the BAM file. (FTDNA has said they will be providing the additional STRs at some time in the future.) YFull also compares your results to those people in the nearest branches to yours, a key advantage because the keystone of mutational analysis is the comparison of one person's genetic results to another's. Finally they estimate the age of the branch defining SNP. This feature alone is worth the price in my opinion. FTDNA may also be providing age analysis in the future.

Alex Williamson provides his VCF file analysis service for free for people who descend from the P312 SNP. He has additional tools for investigating your SNPs. With that said, I must say that I am very pleased with the information and tools now offered by FTDNA and think its greatest value is that it gives you the opportunity to take charge of your own results. You might ask why you might want to create your own tree. In the case of Shattockes, there is a branch defining SNP that was discovered at YSEQ called Y17163 that is a branch above our family's Y16884 / Y16895 SNP. (You can see it in the graphic above of our line of descent.) The person whose results led to the discovery of this SNP did not test at FTDNA. So the most complete and accurate version of my family tree is on my site, not elsewhere. And that is the best reason for doing my own Big Y analysis.