Administrator/Researcher: Angela J. Cone
Costa Tsirigakis - Founder J2 Y DNA project & admin/researcher from 2006 - mid 2008
See bottom of page for Project Admin. information


Marker distributions in J2 sub-clades*
(The potential significance of each marker in distinguishing sub-clades is statistically quantified)

This page contains the Marker analysis results summary (see bottom of page for citation requirements).

Further links (below) lead to separate pages. For each DYS marker, there is a graph showing the frequency distribution of the repeat values for each sub-clade, and an analysis of whether there are significant differences in marker values between sub-clades. A brief statement is also made about what the results for that marker are likely to mean in practical terms. A size-reduced example of what these look like is shown to the right.

 

 Results summary

 First 12 markers
(in FTDNA order)

 Second results panel
(in FTDNA order)

 Third results panel
(markers tested by all labs first, in FTDNA order, and then markers tested by FTDNA only)

Please note:
• This analysis is derived from the SNP-tested haplotypes that were in the J2 Y-DNA project on 19 July 2006. The reliability and applicability of these analyses depend on how globally representative these haplotypes are of their sub-clades.
It is possible, due to the demographics of people who DNA test and the relatively low sample size, that these marker values are not globally representative. In future, after more results are added, allele frequency distributions for some sub-clades may change slightly (which, accordingly, may change some of the preliminary observations and/or conclusions). The reliability and representativeness of this data set should increase as more haplotypes are added.
• J2* (in YCC nomenclature) is a paraphyletic clade (i.e. it probably contains many different "sub-clades"). The J2 project is aware that there are very distinct clusters in the J2* results, and that lumping these clusters together may confound the statistical analysis.
However, if we were to divide J2* using subjective criteria, this would invalidate the scientifically objective nature of the analysis.
When there are adequately scientifically verified UEPs ("unique event polymorphisms", i.e. SNP markers or equivalent) to define these clusters, and sufficient numbers of project members have tested for these UEPs, these additional objectively defined clades will be included in the analyses. For instance, the project is currently beginning to receive results for DYS 413, so we will be able to separate J2a* from J2a1*. Both of these clades are paraphyletic (J2a1* is nested within J2a*); however, the increased resolution will still be informative.

Results summary

The overall aim of this phase of results analysis was to statistically quantify the relative importance of each marker in distinguishing between the different sub-clades in haplogroup J2. It is recognised that a marker's degree of statistical significance might not translate easily into practical application. Therefore, the next analysis phases will aim to put these results into a form that has greater practical application (and that is more easily understood by the average person).

The relative importance of each marker was determined by graphing the marker distributions (grouped by SNP defined sub-clade), and also analysing the repeat values for each marker statistically. We will first summarise the overall results (ie. which markers are of most use), and then briefly explain how to interpret the statistical results. The individual graphs and statistical results for each of the markers can be seen via the links towards the top of this web page.

Which markers are most (statistically) significant in distinguishing between J2a and J2b?

In order, the 11 most significantly useful markers are:
YCA II b, DYS 389i, DYS 390, DYS 459a, DYS 437, DYS 456, DYS 19/394, DYS 439, DYS 448, DYS 385a, DYS 460

Additional markers that were (weakly) statistically significant are:
DYS 447, DYS 385b, DYS 389ii (389ii-389i), DYS 391, DYS 442, DYS 449, DYS 454*

Markers that were not statistically significant in distinguishing between J2a and J2b are:
DYS 388, CDYb, DYS 438, YCA IIa, DYS 392, DYS 459b, DYS 607*, DYS 576, DYS 455, CDYa, DYS 458*,
DYS 393, DYS 570*, GATA-H4, DYS 426.

Which markers that may not have been significant in distinguishing between J2a and J2b may be significant in distinguishing between other clades?

DYS 454 and DYS 437 may be of significant use in distinguishing J2e2 from J2e1
(*needs to be confirmed with a larger sample size of J2e2*)

DYS 607 may be of moderate use in distinguishing J2f,
DYS 570 may be of moderate use in excluding J2f,
DYS 458 may be of moderate use in distinguishing J2f from J2*

Project Modal values for each marker within the main SNP defined clades
(In order of the relative significance of the markers)

(note: sample sizes for J2e2 and J2f1 are quite low, so in some instances modal values do not exist. These modals are derived from the SNP tested haplotypes within the J2 Y-DNA project on 19 July 2006, and may differ slightly from modals derived from different data sets).

 

(Rows are markers, columns are clades. "-" means no modal value exists for that clade.)

Marker     J2e1  J2e2  J2*   J2f   J2f1
YCA IIb    20    20    22    22    22
389i       12    12    13    13    14
390        24    24    23    23    22
459a       8     8     8     9     9
437        16    14    14    15    14
456        13    14    16    15    -
19/394     15    15    15    14    14
439        12    -     12    11    11
448        19    19    20    20    21
385a       14    16    13    13    14
454        11    12    11    11    11
460        11    11    10    10    10
447        28    24    26    26    26
385b       17    18    16    16    15
389ii      16    -     16    17    17
391        10    -     10    10    10
442        11    11    12    12    -
449        29    31    30    29    -
607        14    14    14    13    13
570        17    19    17    17    17
458        16    16    15    18    15
388        15    15    15    16    15
CDYb       39    -     36    35    34
438        9     9     9     9     9
YCA IIa    19    19    19    19    19
392        11    11    11    11    11
459b       9     9     9     9     9
576        18    18    16    19    18
455        11    11    11    11    11
CDYa       36    -     36    35    34
393        12    12    12    12    12
GATA-H4    10    -     11    11    10
426        11    11    11    11    11

Marker values that deviate from the modals for a sub-clade do not necessarily exclude the possibility of that haplotype belonging to that sub-clade. Some marker values vary quite a bit from the modals, and different clusters within a defined grouping may also have different modal values.
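
As an illustration of how a modal value can be derived, and why one may not exist for a small sample, here is a minimal Python sketch. The function name and the sample values are hypothetical, and treating a tie for the most common value as "no modal" is an assumption rather than the project's documented rule.

```python
from collections import Counter


def modal_value(repeat_values):
    """Return the single most common repeat value, or None if there is none.

    Assumption: an empty sample, or a tie for the most common value,
    is treated as "no modal value exists".
    """
    counts = Counter(repeat_values).most_common()
    if not counts:
        return None  # no haplotypes tested for this marker
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # two values are equally common, so no unique mode
    return counts[0][0]


# Hypothetical repeat values for one marker in one sub-clade (not project data).
large_sample = [14, 14, 15, 14, 16, 14, 13]
small_sample = [14, 15]

modal_large = modal_value(large_sample)  # one value clearly dominates
modal_small = modal_value(small_sample)  # tie, so no modal value
```

With only a handful of haplotypes in a sub-clade, ties like the one in `small_sample` become likely, which is one way a blank can arise in a table of modals.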

How to interpret the Statistical results

The statistical tests were done using the statistical analysis package "Statistix".

We did statistical analyses for each marker at 3 sequential levels of taxonomic resolution.

• First, we looked at the differences between the two main divisions within haplogroup J2: J2a (M410) and J2b (M12).
We write this comparison in the Sengupta et al. (2006) nomenclature. For these analyses, we have inferred that all SNP-tested haplotypes that are M12- are M410+ (elsewhere in the project we do not automatically make this inference).
• We then looked at the differences between the main sub-clade groupings within J2: J2e (M12), J2f (M67) and J2*.
We write this comparison in the YCC nomenclature that FTDNA still uses. Most members' SNP test results were reported in this nomenclature, so it is used here for ease of use.
• Finally, we looked at the differences at the finest resolution of sub-clade groupings: J2e1, J2e2, J2f1, J2f* and J2*.
We again write this comparison in the YCC nomenclature.
When comparing two groups, the appropriate statistical test to use is a "T-test". The larger the absolute value of the "T" statistic, the more significant the result (the minus sign just indicates whether "a" or "b" had the greater average). The more significant the result, the smaller "P" becomes. "P" is the probability that a difference at least this large between the two groups could have arisen from chance alone.
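
To make the "T" statistic concrete, the following Python sketch computes a two-sample T statistic by hand. It assumes the pooled-variance (Student's) form of the test; the source does not say which variant "Statistix" applied, and the repeat values are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Student's two-sample T statistic using a pooled variance estimate."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled sample variance: each group's variance weighted by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical repeat values for one marker (not project data).
j2a_values = [22, 22, 23, 22, 21, 22]
j2b_values = [20, 20, 19, 20, 21, 20]

t_a_vs_b = pooled_t_statistic(j2a_values, j2b_values)
t_b_vs_a = pooled_t_statistic(j2b_values, j2a_values)
```

Swapping the order of the groups only flips the sign of the result, which is why the sign carries no information about significance. Converting "T" to "P" requires the t-distribution with `n_a + n_b - 2` degrees of freedom, which a statistics package supplies.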
 

When comparing three or more groups, the appropriate statistical test to use is an ANOVA ("Analysis of Variance"). The larger the "F" statistic, the more significant the result.
When we get a significant result in an ANOVA test it doesn't explicitly tell us which of the groups we are comparing are significantly different to each other. To find this out we have to perform a "pairwise comparison test".

The "T" and "F" values are not directly equivalent (ie. T=5.0 isn't the same level of significance as F=5.0)
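
The sketch below computes a one-way ANOVA "F" statistic from first principles in Python. The data are hypothetical and the function name is illustrative. It also shows a general statistical fact not stated on this page: with exactly two groups, "F" equals the square of the pooled-variance "T" statistic, which is one reason the two numbers are not on the same scale.

```python
from math import sqrt
from statistics import mean


def anova_f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k = len(groups)      # number of groups
    n = len(all_values)  # total number of observations
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical repeat values for one marker (not project data).
j2e_values = [20, 20, 19, 20, 21, 20]
j2f_values = [22, 22, 23, 22, 21, 22]
j2star_values = [22, 23, 22, 21, 22, 22]

f_three_groups = anova_f_statistic([j2e_values, j2f_values, j2star_values])

# With only two groups, F is the square of the pooled-variance T statistic,
# so the equivalent |T| is the square root of F.
f_two_groups = anova_f_statistic([j2e_values, j2f_values])
equivalent_abs_t = sqrt(f_two_groups)
```

In this made-up two-group case, F is 30 while the equivalent |T| is about 5.48, so the same evidence yields very different numbers on the two scales.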


For the pairwise comparisons we have used the Tukey HSD test. "Statistix" presents the results using A/B/C coding, which shows which groups are different to each other and which are not. Groups/clades that share a letter code are not significantly different to each other. For example, suppose J2e has the code "A", and J2f and J2* have the code "B". J2* and J2f are not different to each other (as they have been assigned the same code), but their code is different to that for J2e, therefore they are both significantly different to J2e. If J2f instead had the code "AB", it would mean that J2e and J2* are different to each other, but J2f would not be significantly different to either J2e or J2* (it shares a letter with J2e and a letter with J2*).
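
The letter-code rule can be expressed as a short Python check. This is only a sketch of how to read the codes, not of the Tukey HSD calculation itself, and the function name is hypothetical.

```python
def significantly_different(code_a, code_b):
    """Two groups differ significantly only if their codes share no letter."""
    return not (set(code_a) & set(code_b))


# First case described in the text: J2e is "A", J2f and J2* are "B".
codes = {"J2e": "A", "J2f": "B", "J2*": "B"}
e_vs_f = significantly_different(codes["J2e"], codes["J2f"])     # no shared letter
f_vs_star = significantly_different(codes["J2f"], codes["J2*"])  # both "B"

# Second case: J2f is "AB", so it shares a letter with both other groups.
codes_ab = {"J2e": "A", "J2f": "AB", "J2*": "B"}
e_vs_f_ab = significantly_different(codes_ab["J2e"], codes_ab["J2f"])
f_vs_star_ab = significantly_different(codes_ab["J2f"], codes_ab["J2*"])
e_vs_star_ab = significantly_different(codes_ab["J2e"], codes_ab["J2*"])
```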

The usual statistical level for acceptance is P=0.05 (ie. 1/20 ) or lower. If large numbers of individual statistical tests are done, the acceptance level should be modified to take into consideration the fact that if you do 20 statistical tests, you'd expect one false positive from chance alone.
For these analyses, a more conservative statistical acceptance level of 0.0005 is used. Results that are significant at the 0.0005 level are reported as *Significant* coloured in red. Results with P between 0.05 and 0.0005 are reported as *Significant* coloured in black, and it is stated that the level of statistical significance for that marker is weak.
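
The two acceptance levels, and the reason for tightening them, can be sketched in Python as follows. The function names and label strings are hypothetical, and treating a "P" value exactly equal to a threshold as passing it is an assumption.

```python
USUAL_LEVEL = 0.05     # conventional acceptance level (1 in 20)
STRICT_LEVEL = 0.0005  # conservative level used for these analyses


def classify_p_value(p):
    """Label a P value according to the two acceptance levels described above."""
    if p <= STRICT_LEVEL:
        return "significant"         # reported in red on the marker pages
    if p <= USUAL_LEVEL:
        return "weakly significant"  # reported in black
    return "not significant"


def expected_false_positives(n_tests, level):
    """Expected number of chance 'significant' results if no real differences exist."""
    return n_tests * level


strong = classify_p_value(0.0001)
weak = classify_p_value(0.01)
none = classify_p_value(0.2)

# Twenty tests at the usual level: about one false positive expected by chance.
false_positives_usual = expected_false_positives(20, USUAL_LEVEL)
false_positives_strict = expected_false_positives(20, STRICT_LEVEL)
```

At the stricter level the same twenty tests would be expected to produce about 0.01 false positives rather than one.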

Why don't we divide up the results for the different clusters within J2*, and have them as separate groups in the analysis?

At the moment (with current project member results) we can only define the main clusters within J2* by their haplotype values. If we divided them into groups on the basis of their marker values, and then analysed marker values within each group, the analysis results would only serve to "confirm"/reinforce any pre-conceived assumptions we might have regarding the haplotype values for that putative clade. This would constitute "circular reasoning", and is therefore not good scientific practice. That is why we are using groups that have been defined by SNP/"UEP*" results, rather than groups defined by haplotype clusters.
Hypothetically, a cluster may represent a true distinct biological lineage (i.e. a sub-clade), but within that clade there may be haplotypes that deviate from the norm. If we defined that clade by its normal haplotype values, some of the haplotypes that form part of that clade would be missed.

*"unique event polymorphism": a type of DNA change that has usually occurred only once (or twice) in the Y-chromosome family tree, so that we can say for sure that all who have that change are descended from a single individual, and all who have that change belong to a distinct lineage/clade/branch. Most of the time the markers we use for our haplotypes change far too rapidly to be considered UEPs. Sometimes haplotype similarity/dissimilarity correlates with how related individuals are, but often it does not accurately represent the true relationships between haplotypes on a deep ancestry scale.

Citation requirements

The project believes it's important to clearly delineate between results that have been adequately verified by proper scientific methodology ("fact"), and results that are still in the process of scientific verification ("preliminary results/hypotheses"). These preliminary results are placed on public display on the proviso that people understand that these results are still undergoing the scientific verification process, and therefore are not yet ready to be regarded as "absolute fact".
You may cite (but not reproduce) the findings on these J2 Y-DNA marker analysis results web pages for non-commercial purposes, provided that you:
1) Attribute them to either "the J2 Y-DNA project", or "Tsirigakis & Cone"
2) State that these findings are from the July 2006 preliminary J2 Y-DNA marker analysis.
3) Provide a link to the marker analysis summary page:
     http://www.j2-ydnaproject.net/analysisphase2.html

Creative Commons License
This work is licensed under a
Creative Commons Attribution-No Derivative Works 3.0 License.
This work can be freely cited, if it is attributed to:
The J2 Y-DNA project
http://www.j2-ydnaproject.net

Angela Cone - Co-administrator from mid 2006 - mid 2008
Administrator from mid 2008 - 2013

Costa Tsirigakis - Founder J2 Y DNA project & admin from 2006 - mid 2008