EvoViz

Thursday, 3 March 2011

Areas of Endemism and Event-Based Methods

In a recent Journal of Biogeography editorial, 'Ontology of areas of endemism', Brian Crother and Christopher Murray argue that areas of endemism should be the preferred unit in historical biogeography, including in event-based methods such as Dispersal-Vicariance Analysis and Lagrange. I disagree.

Event-based methods reconstruct the history of a clade from an observed distribution of taxa and their evolutionary relationships, given a biogeographic model that defines three things.

How geographical space is divided into units
How those units relate through time
How organisms respond to different configurations of units.

Crother and Murray state that the geographic units should be areas of endemism. Like the 'niche', an 'area of endemism' is a simple concept that most biogeographers recognise, yet they cannot agree on a clear, simple definition of it. It is, however, essentially an area occupied by a group of species that share similar ranges. It is argued that shared ranges imply a shared history of the taxa and, by inference, a history of place. A common phylogenetic pattern confirms hypotheses of shared taxon history. Areas of endemism are further believed to be hierarchical, as they are (usually) created by vicariance, and they are geographical units that can be nested within other such units, e.g. Jamaica can be nested within North America.

I am interested in reconstructing the spatial and temporal history of a family of freshwater fish species as they diversified with the rise of the Trans-Mexican Volcanic Belt. I know where the fish live now and their evolutionary relationships, and I have a partial hydrological history inferred from geological evidence. The fish cannot disperse between drainage basins, and the configuration of the drainage basins changes through time. Basins split, coalesce, or part of one may be exchanged with an adjacent basin in a river capture event. While identifying areas of endemism occupied by sister taxa may help identify past splitting and exchange events, the units that determine the history of the clade are the basins themselves and, to a lesser extent, environmental variation within basins. These basins change in extent and connectivity through time, and are not in any way hierarchical.

Areas of endemism may exist, and they may be correlated with shared history, but they are only one of many artifacts left by history. To reconstruct taxon histories, fragmented information from many sources must be integrated to build scenarios consistent with all the available information. Areas of endemism are just one such source.

Monday, 14 February 2011

GeoPhyloBuilder for ArcGIS v1.2 released

GeoPhyloBuilder for ArcGIS v1.2 introduces two new tip-fan options, 'Internal+' and 'Drop', which together with 'Internal' and 'Tip' make four tip-fan options.

The release also fixes bugs in the positioning of nodes and branches when placed using overlap and disjunction between sister clades.

GeoPhyloBuilder for ArcGIS v1.2 can be downloaded from SourceForge.

Friday, 4 February 2011

Overlap-Disjunction Analysis Geophylogeny

In overlap-disjunction analysis (ODA), tree nodes are placed at the centroid of the region of disjunction or overlap between sister clade ranges.

An ODA geophylogeny of the Goodeidae is displayed against four palaeohydrological models for Central Mexico. Five clades are identified by colour, and observations are treated as polytomies of their first internal node. The direction circle has a radius of 100 km. Box divisions are every 5 million years from present (bottom).
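The node placement rule can be sketched in PostGIS SQL. This is only an illustration, not how GeoPhyloBuilder is implemented: the table clade_range and its columns are hypothetical, sister clades are assumed to reference each other through sister_id, and the "region of disjunction" is approximated here by the shortest line between two non-overlapping ranges, which is an assumption of this sketch.

```sql
-- Hypothetical table: clade_range(clade_id integer, sister_id integer, geom geometry)
-- One row per clade, holding its range polygon and the id of its sister clade.
SELECT
  a.clade_id,
  b.clade_id AS sister_id,
  CASE
    -- Overlap: place the node at the centroid of the shared region
    WHEN ST_Intersects(a.geom, b.geom)
      THEN ST_Centroid(ST_Intersection(a.geom, b.geom))
    -- Disjunction (assumed stand-in): midpoint of the shortest line between the ranges
    ELSE ST_Centroid(ST_ShortestLine(a.geom, b.geom))
  END AS node_geom
FROM clade_range a
JOIN clade_range b ON b.clade_id = a.sister_id
WHERE a.clade_id < b.clade_id;  -- keep each sister pair once
```

A different definition of the disjunction region, for example the gap polygon between the two ranges, would change only the ELSE branch.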

Wednesday, 2 February 2011

A Macroscope of Evolution

Microscopes allow us to see very small things; similarly, macroscopes help us see very big things.

The evolution of life on earth is a very large thing indeed. It has existed more than 2 billion years and extends over 500 million square kilometers. It is currently composed of at least 4 million species, believed to be a small fraction of the total number that have ever lived. Each species is composed of many individuals that vary in morphology, genetics and behaviour. All these individuals are linked through an often tree-like space-time network of common descent, which formed by interacting with the planet, and with itself.

That's pretty big and complicated!

One way of understanding entities that extend over wide ranges of spatial and temporal scales is to plot information against logarithmic axes of space and time. My macroscope of evolution shows objects of study, biological processes, scientific disciplines and the structures generated.

Monday, 3 January 2011

Choosing between alternative biological names with uBio

Background

The format and organisation of taxonomic information in the GPDD leaves a lot to be desired.

Queries on compound name entries require extensive use of wildcards, which increases query complexity and reduces efficiency. If I want fields with unambiguous single names, I must decide between alternative names.

I am going to use some of the biological name services to answer my questions, and in doing so I hope to learn more about what each does and how they interconnect, or not, as the case may be.

I began with the phyla in the TaxonomicPhylum field, three of which have compound names: Chromophyta (Heterokontophyta), Cnidaria (Coelenterata), and Dinophyta (Pyrrophyta).

Universal Biological Indexer and Organizer (uBio)

uBio is collated from a wide range of sources compiled by taxonomists and other scientists, so data quality should be good. It is a Taxonomic Name Server (TNS) composed of two parts: NameBank stores the names and the facts that link names, while ClassificationBank stores classifications and taxon concepts. I type the first name, "Chromophyta", into the box and press search. One match, but does it help?

NamebankID is the unique ID of the name in NameBank, and the LSID is its resolvable Life Science Identifier. A clickable list of common names is given, one of which is "heterokonts", along with some information on record insertion. Clicking "view metadata" returns the string,

urn:lsid:ubio.org:namebank:10129547 Vernacular

This tells me it is a common name. "Heterokontophyta" also returns a single record.

A record with the metadata string,

urn:lsid:ubio.org:namebank:1560200 Heterokontophyta Heterokontophyta 5 Heterokontophyta Heterokontophyta Scientific Name Canonical form Phylum

Therefore, according to NameBank, "Heterokontophyta" is the correct scientific name for the phylum. But what is the metadata string format, and why are Heterokontophyta and Chromophyta not cross-linked if they refer to the same taxon? These questions must remain for another day.

What of the other two name pairs?

Both Cnidaria and Coelenterata appear to be valid scientific names, but the returned metadata does not help me make an informed decision between the two. Wikipedia searches provide an answer,

"Cnidarians were for a long time grouped with Ctenophores in the phylum Coelenterata, but increasing awareness of their differences caused them to be placed in separate phyla."

"Coelenterata is an obsolete long term encompassing two animal phyla, the Ctenophora (comb jellies) and the Cnidaria (coral animals, true jellies, sea anemones, sea pens, and their allies)"

A search of ZooBank, the official registry of Zoological Nomenclature, returns no acts for either name. So, Cnidaria it is.

NameBank searches for Dinophyta and Pyrrophyta reveal the latter to be a vernacular name. Again, Wikipedia provides a quick and easily understood explanation of why both names exist: dinoflagellates have been classified as both plants and animals! So Dinophyta they are, although both Wikipedia and the University of California Museum of Paleontology give the phylum as "Dinoflagellata". So, it looks like neither name is correct.

Will other taxonomic name services provide more information than uBio? We shall see.

Wednesday, 22 December 2010

Biological names in the Global Population Dynamics Database (Part 1)

Joining Data

To connect information on different organisms, data must be joined using one or more common attributes. The attributes we join on include, but are not limited to:

names,
geographical extent,
temporal duration,
evolutionary and genealogical lineage, and
ecological relationships.

So far I have only skirted the complex world of names, working instead with spatiotemporal and evolutionary relationships. Those halcyon days are, however, over, as I join mammal time-series in the Global Population Dynamics Database (GPDD) to the Pantheria trait database, range maps from the IUCN and a supertree through common Latin binomials.

Pantheria, the range maps and the supertree use the Wilson and Reeder mammal taxonomy; thus, providing there are no typos, the joins should be clean. In contrast, the GPDD is collated from many different sources, and consequently its names are not drawn from a common taxonomy.

GPDD names

In the GPDD, data on taxa is stored in the taxon table, although, surprisingly, there is no field that contains simple Latin binomials.

Information on names is spread across several fields in the taxon table, including TaxonName that contains the 'full name' of the entity counted with various embellishments. Entries in TaxonName include,

Latin binomials,
Latin trinomials (with and without ssp),
Latin binomials with variety (var.),
higher taxon name only, e.g. Ursus,
conglomerates of synonyms, e.g. Cutara (Curtara) kuri and Cutara/Curtara kuri
conglomerates of synonyms with typos, e.g. Curtara (Curtara) kuri
conglomerates of more than one species, e.g. Eudiaptomas gracilioides and Eudiaptomas gracilis
ad-hoc names, e.g. Unknown Insect sp142,
other mysterious names, e.g. Cahita n. sp. "a" = "aa" !???!

In addition to TaxonName, five fields store names relating to higher taxonomic levels. These are TaxonomicGenus, TaxonomicFamily, TaxonomicOrder, TaxonomicClass and TaxonomicPhylum. These fields also contain conglomerate names.

The final field of interest is TaxonomicLevel which codes the taxonomic level to which the taxon is differentiated. A TaxonomicLevel of 'Species' is, unfortunately, no guarantee of a valid binomial at that level as some species are 'differentiated', but only have ad-hoc names.

White-space, letter case and hyphenation

Some quick SELECT DISTINCT queries on the taxonomy columns revealed inconsistency in the writing of names and thus duplication resulting from,

white-space following names ("Name" and "Name "),
different cases ("Name" and "name"), and
the use of hyphens ("subname" and "sub-name").

These were relatively simple to fix. White-space was removed with the PostgreSQL TRIM function, for example,

UPDATE gpdd.taxon SET "TaxonomicPhylum" = trim(both ' ' from taxon."TaxonomicPhylum")

The few case and hyphenation errors were corrected with further UPDATE statements.
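As a rough sketch of what such corrections might look like: the literal values below are placeholders borrowed from the list above, not the actual misspellings found in the GPDD, and the choice of the TaxonomicOrder column is an assumption for illustration.

```sql
-- Placeholder values only: the real misspellings are not listed in this post.

-- Case: rewrite the lower-case variant to the preferred spelling
UPDATE gpdd.taxon
SET "TaxonomicOrder" = 'Name'
WHERE "TaxonomicOrder" = 'name';

-- Hyphenation: rewrite the hyphenated variant to the unhyphenated one
UPDATE gpdd.taxon
SET "TaxonomicOrder" = 'subname'
WHERE "TaxonomicOrder" = 'sub-name';
```

Matching on exact literal values, rather than applying lower() or replace() to the whole column, keeps each correction explicit and avoids altering names in which a hyphen or capital letter is legitimate.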

Extracting Latin binomials from TaxonName

So, TaxonName contains a hideous mixture of valid and ad-hoc biological names, and I must extract Latin binomials from this mess. Two problems must be solved:

The Latin binomials must be parsed from TaxonName using SQL, and
name duplication from homonyms and typographical error must be adjudicated using internet name resources.

Parsing TaxonName

The strategy is to sequentially process each 'word' in TaxonName, where words are separated by white-space. This SQL splits TaxonName on the first white space into the first word (genus) and the remainder (r_genus).

-- Split TaxonName on the first space: first word -> genus, rest -> r_genus
SELECT
  "TaxonID" AS id,
  "TaxonName" AS name,
  CASE position(' ' in "TaxonName")
    WHEN 0 THEN "TaxonName"  -- single word: the whole name is the first word
    ELSE substring("TaxonName" from 1 for position(' ' in "TaxonName") - 1)
  END AS genus,
  CASE position(' ' in "TaxonName")
    WHEN 0 THEN NULL  -- nothing follows a single word
    ELSE substring("TaxonName" from position(' ' in "TaxonName") + 1)
  END AS r_genus
FROM
  gpdd.taxon
ORDER BY "TaxonName";
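The same word-by-word strategy can also be sketched with PostgreSQL's split_part function, which returns the n-th field of a delimited string. This is an alternative illustration rather than the query used here; the column alias second_word is hypothetical, and the sketch assumes words are separated by exactly one space.

```sql
-- Alternative sketch using split_part (assumes single-space separators)
SELECT
  "TaxonID" AS id,
  "TaxonName" AS name,
  -- First word; split_part returns the whole string when there is no space
  split_part("TaxonName", ' ', 1) AS genus,
  -- Second word, or NULL when the name has only one word
  NULLIF(split_part("TaxonName", ' ', 2), '') AS second_word
FROM
  gpdd.taxon
ORDER BY "TaxonName";
```

The second word is not guaranteed to be a species epithet: for entries such as "Cutara (Curtara) kuri" or "Unknown Insect sp142" it is a subgenus in brackets or part of an ad-hoc label, so it still needs checking before it is used in a binomial.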

Provided the taxon is defined at the TaxonomicLevel of genus, species or subspecies, the field genus should contain the generic name, which in turn should match TaxonomicGenus. To check this I added some fields to the first query and created a view, which I then queried for mismatches between the two fields.

-- View of taxa identified to genus level or below, with the first word split out
CREATE VIEW gpddname AS
SELECT
  "TaxonID" AS id,
  "TaxonName" AS name,
  "TaxonomicPhylum",
  "TaxonomicClass",
  "TaxonomicOrder",
  "TaxonomicFamily",
  "TaxonomicGenus",
  CASE position(' ' in "TaxonName")
    WHEN 0 THEN "TaxonName"  -- single word: the whole name is the first word
    ELSE substring("TaxonName" from 1 for position(' ' in "TaxonName") - 1)
  END AS genus,
  CASE position(' ' in "TaxonName")
    WHEN 0 THEN NULL  -- nothing follows a single word
    ELSE substring("TaxonName" from position(' ' in "TaxonName") + 1)
  END AS r_genus
FROM
  gpdd.taxon
WHERE "TaxonomicLevel" IN ('Genus', 'Species', 'Subspecies')
ORDER BY "TaxonomicPhylum", "TaxonomicClass", "TaxonomicOrder", "TaxonomicFamily", "TaxonomicGenus";

SELECT * FROM gpddname WHERE "genus" <> "TaxonomicGenus"

108 rows were returned. 102 had a genus in TaxonomicGenus, but "Unknown species ..." in TaxonName. The latter were excluded from the binomial query,

SELECT * FROM gpddname WHERE "genus" <> "TaxonomicGenus" AND genus <> 'Unknown'

So, what of the other six?

Verifying Biological Names

Should I use the spelling of the genus in TaxonomicGenus or the one extracted from TaxonName in the Latin binomial? Similarly, given a choice of synonyms, which should be selected? I am no taxonomist, but I have sat through many talks on initiatives to make taxonomic names accessible, so stay tuned for a non-taxonomist's search for clean and verified biological names in the exotic world of uBio, the Catalogue of Life and the Global Names Index.

Tuesday, 19 October 2010

My First Music Promo

Pappamidi - 'Gotta Rush On' - from the album 'Chronicle of the Nineties'. A 'photo-animation' built with Adobe Photoshop, Squirlz Morph and Microsoft Movie Maker. Hopefully, a bit more interesting than your average YouTube music slideshow - especially the wiggling VW Beetle at the end.