Cladistics.


The most important feature of the type of systematics known as cladistics is that:
- For each character there is an "ancestral state" and a "derived state".
- The derived state should preferably arise from a "unique evolutionary event".

If we use the example of wings in vertebrates:
(this is a very simplified example)
- The character would be "type of forelimb",
- The "ancestral" state (i.e. the character state of the common ancestor) would be "forelimb is used for walking", and
- The "derived" state (i.e. the character state of a subset of descendants) would be "forelimb is used for flying".

In this particular instance, the derived character state is not from a "unique evolutionary event": it has arisen several times in the phylogenetic tree (e.g. birds, bats and pterosaurs belong to completely different taxonomic groups that have independently evolved "flying forelimbs" from ancestors with "walking forelimbs"). Therefore the character "wing" is probably not a good one to use in a full cladistic analysis of vertebrates. Sharing the same character state as a result of separate evolutionary events is called "convergence" or "parallelism". Another reason why it wouldn't be a suitable character is that some birds have wings but do not fly, and pterosaurs are thought to have used their forelimbs for both walking and flying. Some of the problems with this character could be solved by redefining the character states to fit the morphology rather than the function. However, the fact that the character is not a "unique event" remains unchanged.
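One way to see that "flying forelimb" is not a unique event is to count the minimum number of state changes the character needs on a tree. The sketch below uses Fitch parsimony, a standard counting method that is not mentioned in the text above, on a small tree. The tree topology and the choice of taxa are simplifying assumptions made for illustration only.

```python
# Toy tree written as nested pairs. The topology is a simplified assumption.
TREE = (("Bat", "Mouse"),
        ("Lizard", ("Crocodile", ("Pterosaur", ("Triceratops", "Bird")))))

# 0 = forelimb used for walking (ancestral), 1 = forelimb used for flying (derived)
FORELIMB = {"Bat": 1, "Mouse": 0, "Lizard": 0, "Crocodile": 0,
            "Pterosaur": 1, "Triceratops": 0, "Bird": 1}

def fitch(node, states):
    """Return (possible states at this node, minimum changes at or below it)."""
    if isinstance(node, str):          # a tip: its state is observed directly
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:                         # the two sides agree: no extra change needed
        return shared, left_cost + right_cost
    # the two sides disagree: at least one change is required here
    return left_set | right_set, left_cost + right_cost + 1

root_states, changes = fitch(TREE, FORELIMB)
print(root_states, changes)  # {0} 3
```

A character derived from a unique event with no reversals would need exactly one change. Here the minimum is three, so the derived state cannot be explained by a single origin on this tree.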

In real life, a character as broad as "wing" or "type of forelimb" would not be used. It is more likely to be something like "number of papillae on prothorax", with character states such as "ancestral state = 3 papillae" and "derived state = 4 papillae".

Sharing the same character state due to descent from a common ancestor is called "Homology" and this is the type of derived character we want in cladistics.

How does this relate to Y-DNA?

Within the context of the NRY (non-recombining Y) phylogenetic tree, mutations such as single nucleotide polymorphisms (SNPs) are suitable as characters. With this type of mutation, it is 100% clear what the ancestral state is and what the derived state is. Markers such as short tandem repeats (STRs) are not suitable as characters. With this type of marker it is not 100% clear what the ancestral state is and what the derived state is.
We explain why this is so in more depth below.

Black and White analogy:

 

A binary (two-state) character is like a strip in which black changes abruptly to white. It's 100% clear where "black" ends and "white" starts. It is 100% unambiguous.

A multi-state character is like a strip in which black fades gradually into white. It is not 100% clear where "black" ends and "white" starts. It is ambiguous. Exactly where do you draw the line?

To apply the analogy to Y-DNA: with binary characters (like SNPs) the "ancestral" and "derived" states are 100% clear. On the other hand, with multi-state characters (like STRs) the "ancestral" and "derived" states are not clear.
With STRs, defining ancestral vs. derived is even more difficult than in the black and white analogy. It is difficult to define what would be considered "ancestral" or "derived" at all, let alone how to separate them from each other.

Within the context of the NRY phylogenetic tree, the ancestral state is the character state that the "Y-chromosomal progenitor" is hypothesized to have had. The Y-chromosomal progenitor is the man who is the most recent common ancestor of all living Y-chromosomal lineages (the popular media would call him "Y-chromosomal Adam").
For a character to be considered valid, it must be possible to objectively determine the character state that this hypothetical ancestor had. With STR markers it would be difficult (if not impossible) to determine the character state that the Y-chromosomal progenitor had.

As an example, consider the marker DYS 393. The marker data presented on the SMGF website shows that it has the following repeat value distribution:

However, the above graph doesn't show the full distribution of values. The axis in the graph below is truncated to show the full distribution of values.

As can be seen, it has repeat values ranging from 10 to 16 (most values being between 12 and 15) and an overall modal value of 13 in the sampled population.
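The summary above (range, bulk of values, modal value) can be computed from a table of repeat counts. The sketch below uses invented counts, because the SMGF figures behind the graphs are not reproduced here. Only the range of 10 to 16 and the modal value of 13 come from the text, and the name `dys393_counts` is hypothetical.

```python
from collections import Counter

# HYPOTHETICAL counts, invented for illustration. They are not SMGF data.
# Keys are repeat values, values are numbers of sampled men.
dys393_counts = Counter({10: 1, 11: 4, 12: 310, 13: 2600, 14: 420, 15: 60, 16: 2})

def summarize(counts):
    """Return (lowest observed value, highest observed value, modal value)."""
    observed = sorted(value for value, n in counts.items() if n > 0)
    modal = max(counts, key=counts.get)   # the value with the largest count
    return observed[0], observed[-1], modal

low, high, modal = summarize(dys393_counts)
print(low, high, modal)  # 10 16 13

# Share of the sample with 12 to 15 repeats
total = sum(dys393_counts.values())
bulk = sum(n for value, n in dys393_counts.items() if 12 <= value <= 15)
print(round(bulk / total, 3))
```

With counts this lopsided, the rare values at 10 and 16 would be invisible on a full-scale axis, which is why a truncated axis is needed to see the whole range.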

In the black and white analogy you would be able to label one end of the spectrum "ancestral" and the other end "derived". You would then have the problem of deciding where "ancestral" ends and "derived" begins (in the wing analogy, this corresponds to fossil evidence of intermediate forms). However, with STRs you cannot even do that.

Firstly, we can see what the distribution is in the present day, but we cannot really tell what value the "Y-chromosomal progenitor" had. We could assume that the most parsimonious conclusion is that the modal value equals the ancestral value. However, there is no way to know this for sure. For starters, the sampling above is biased towards individuals of north-western European ancestry and is therefore not globally representative.

Outgroups:
Normally you could deduce the ancestral state by comparing values with an outgroup (i.e. chimpanzees, gorillas, orangutans), but STR markers are likely to mutate just as quickly in the outgroup as they do in humans. So again, you wouldn't know for sure.

Let's pretend for a moment that we had some way of saying for sure that the ancestral value was 13. Just how would we score ancestral vs. derived? Would we call 13 repeats "ancestral" and any other repeat value "derived", or would we arbitrarily assign some midway cut-off point? If we did that, there would essentially be two different versions of "derived" as well (10 repeats derived, and 16 repeats derived). The next problem is: can we be sure that a haplotype with the "ancestral" value of 13 really has the ancestral state? In some haplotypes the value of 13 might result from a mutation downwards from 14, or a mutation upwards from 12.

The fact is that STRs mutate so quickly that it would be like thousands of animal lineages independently evolving wings for flight, then reverting to the ancestral state of using forelimbs for walking, and then evolving wings again, potentially several times in each individual lineage. A character such as this would never be used in cladistics. Having convergence occur so many times is one problem; reversion of character state is yet another reason why the character could not be used in cladistics.
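The reversion problem can be illustrated with a small simulation. The sketch below assumes a simple stepwise model in which a marker gains or loses one repeat with a fixed probability per generation. The model, the mutation rate and the number of generations are illustrative assumptions, not measured values for any real marker.

```python
import random

def simulate_lineage(start, generations, mu, rng):
    """Follow one paternal line. Each generation the repeat count changes by
    +1 or -1 with probability mu (assumed), otherwise it is inherited unchanged."""
    value, mutations = start, 0
    for _ in range(generations):
        if rng.random() < mu:
            value += rng.choice((-1, 1))
            mutations += 1
    return value, mutations

def hidden_reversions(n_lineages=500, start=13, generations=1000, mu=0.002, seed=1):
    """Count lineages that end on the starting value, and how many of those
    got there only after mutating away and coming back."""
    rng = random.Random(seed)
    at_start = reverted = 0
    for _ in range(n_lineages):
        value, mutations = simulate_lineage(start, generations, mu, rng)
        if value == start:
            at_start += 1
            if mutations > 0:
                reverted += 1
    return at_start, reverted

at_start, reverted = hidden_reversions()
print(f"{at_start} lineages show 13 repeats; {reverted} of them mutated and came back")
```

Every lineage counted in `reverted` would be scored as "ancestral" from its present-day value alone, even though its history includes at least two mutation events.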

Bimodal STR distributions
"Virtual SNPs" or "just another STR"

There is only one type of instance in which we would be able to apply ancestral vs. derived character states to STR markers.

That is, if the distribution of marker repeats was bimodal, e.g.:



 

In this hypothetical example, the bimodal distribution would result from a single mutation event in which the number of repeats almost doubled (or alternatively a single mutation event in which several repeats were deleted). The original distribution would be ancestral, and the distribution that corresponds to the descendants of the multi repeat mutation event would be "derived". Here the character state of "ancestral" and "derived" can be objectively distinguished, as there is no overlap between the two distributions.

However,... what if the marker distribution was like this instead?



Once the two distributions begin to overlap, the character states of "ancestral" and "derived" can no longer be objectively distinguished. You could assign an ancestral status to values between 10 and 14, and a derived status to values between 17 and 21. However, you would not be able to assign an "ancestral" or "derived" status to values of 15 or 16 repeats. Therefore the character would no longer be strictly binary in character state. In addition, in a real-life example you wouldn't know the full extent to which the two distributions overlap. The overlap could instead be as below:


In real life, the "ancestral" vs. "derived" values wouldn't be colour coded; all you'd see is the below:

The danger of using STR markers is that even if a marker has had (for instance) a multi-repeat deletion or addition equivalent to a unique event polymorphism (UEP), and the distributions of marker values for the "ancestral" and "derived" states seem to be distinct, the distributions of the two groups can and will eventually overlap.
In addition, it's important to remember that even if existing research seems to indicate that the distributions do not overlap, the populations sampled might not be an accurate representation of global variation. Later research may reveal that overlap does indeed exist (and therefore the marker cannot be considered equivalent to a UEP).
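The overlapping example above can be written as a small scoring rule. The sketch below assumes the ancestral distribution spans 10 to 16 repeats and the derived distribution spans 15 to 21, which is one reading of that example; the function name `classify` and the range arguments are hypothetical.

```python
def classify(repeats, ancestral=(10, 16), derived=(15, 21)):
    """Score a repeat value against two (assumed) inclusive ranges."""
    in_ancestral = ancestral[0] <= repeats <= ancestral[1]
    in_derived = derived[0] <= repeats <= derived[1]
    if in_ancestral and in_derived:
        return "ambiguous"        # falls in the overlap: cannot be scored
    if in_ancestral:
        return "ancestral"
    if in_derived:
        return "derived"
    return "outside both ranges"

for repeats in range(10, 22):
    print(repeats, classify(repeats))
```

In practice the true extent of each range is unknown, so the boundaries passed to `classify` would themselves be guesses, and widening either range enlarges the set of values that cannot be scored.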

Creative Commons License
This work is licensed under a
Creative Commons Attribution-No Derivative Works 3.0 License.
This work can be freely cited, if it is attributed to:
The J2 Y-DNA project or
Angela Cone (2007)
Msc Evolutionary Ecology
http://www.j2-ydnaproject.net/cladistics.html