Wednesday 22 December 2010

Biological names in the Global Population Dynamics Database (Part 1)

Joining Data
To connect information on different organisms data must joined using one or more common attributes. The attributes we join on include, but are not limited to;
  1. names,
  2. geographical extent
  3. temporal duration,
  4. evolutionary and genealogical lineage, and
  5. ecological relationships.
So far I have only skirted the complex world of names, working instead with spatiotemporal and evolutionary relationships. Those halcyon days are, however, over, as I join mammal time-series in the Global Population Dynamics Database (GPDD), to the Pantheria trait database, range maps from the ICUN and a supertree through common Latin binomials.

Pantheria, the range maps and supertree use the Wilson and Reeder mammal taxonomy, thus, providing there are no typos, the joins should be clean. In contrast, the GPDD is collated from many different sources and consequentially names are not drawn from a common taxonomy.

GPDD names
In the GPDD data on taxa is stored in the taxon table, although, surprisingly, the is no field which contains simple Latin binomials.

Information on names is spread across several fields in the taxon table, including TaxonName that contains the 'full name' of the entity counted with various embellishments. Entries in TaxonName include,
  1. Latin binomials,
  2. Latin trinomials (with and without ssp),
  3. Latin binomial with varierty (var.),
  4. Higher taxon name only e.g. Ursus
  5. conglomerates of synonyms, e.g. Cutara (Curtara) kuri and Cutara/Curtara kuri
  6. conglomerates of synonyms with typos, e.g. Curtara (Curtara) kuri
  7. conglomerates of more than one species, e.g. Eudiaptomas gracilioides and Eudiaptomas gracilis
  8. ad-hoc names, e.g. Unknown Insect sp142,
  9. other mysterious names, e.g. Cahita n. sp. "a" = "aa" !???!
In addition to TaxonName, five fields store names relating to higher taxonomic levels. These are TaxonomicGenus, TaxonomicFamily, TaxonomicOrder, TaxonomicClass and TaxonomicPhylum. These fields also contain conglomerate names.

The final field of interest is TaxonomicLevel which codes the taxonomic level to which the taxon is differentiated. A TaxonomicLevel of 'Species' is, unfortunately, no guarantee of a valid binomial at that level as some species are 'differentiated', but only have ad-hoc names.

White-space, letter case and hyphenation
Some quick SELECT DISTINCT queries on the taxonomy columns revealed inconsistency in the writing of names and thus duplication resulting from,
  1. white-space following names ("Name" and "Name "),
  2. different cases ("Name" and "name"), and
  3. the use of hyphens ("subname" and "sub-name ").
These were relatively simple to fix. White-space was removed with the PostgreSQL TRIM function, for example,

UPDATE gpdd.taxon SET "TaxonomicPhylum" = trim(both ' ' from taxon."TaxonomicPhylum")

The few case and hyphenation errors were corrected with further UPDATE statements.

Extracting Latin binomials from TaxonName
So, TaxonName contains a hideous mixture of valid and ad-hoc biological names and I must extract Latin binomials from this mess? Two problems must be solved,
  1. The Latin binomials must be parsed from TaxonName using SQL, and
  2. name duplication from homonyms and typographical error must be adjudicated using internet name resources.
Parsing TaxonName
The strategy is to sequentially process each 'word' in TaxonName, where words are separated by white-space. This SQL splits TaxonName on the first white space into the first word (genus) and the remainder (r_genus).

SELECT
"TaxonID" as id,
"TaxonName" as name,
CASE position(' ' in "TaxonName")
WHEN 0 THEN "TaxonName"
ELSE substring("TaxonName" from 0 for position(' ' in "TaxonName"))
END AS genus,
CASE position(' ' in "TaxonName")
WHEN 0 THEN NULL
ELSE substring("TaxonName" from position(' ' in "TaxonName") + 1)
END as r_genus
FROM
gpdd.taxon
ORDER BY "TaxonName"

Providing the taxon is defined at the TaxonomicLevel of genera, species or subspecies, the field genus should contain the generic epithet which in turn should match the TaxonomicGenus. To check this I added some fields to the first query and created a view against which I queried equality between the two fields.

CREATE VIEW gpddname AS
SELECT
"TaxonID" as id,
"TaxonName" as name,
"TaxonomicPhylum",
"TaxonomicClass",
"TaxonomicOrder", "TaxonomicFamily","TaxonomicGenus",
CASE position(' ' in "TaxonName")
WHEN 0 THEN "TaxonName"
ELSE substring("TaxonName" from 0 for position(' ' in "TaxonName"))
END AS genus,
CASE position(' ' in "TaxonName")
WHEN 0 THEN NULL
ELSE substring("TaxonName" from position(' ' in "TaxonName") + 1)
END as r_genus
FROM
gpdd.taxon
WHERE "TaxonomicLevel" IN ('Genus','Species','Subspecies')
ORDER BY "TaxonomicPhylum","TaxonomicClass","TaxonomicOrder", "TaxonomicFamily","TaxonomicGenus"

SELECT * FROM gpddname WHERE "genus" <> "TaxonomicGenus"

108 rows were returned. 102 had a genus in TaxonomicGenus, but "Unknown species ..." in TaxonName. The latter were excluded from the binomial query,

SELECT * FROM gpddname WHERE "genus" <> "TaxonomicGenus" AND genus <> 'Unknown'

So, what of the other six?

Verifying Biological Names
Should I use the spelling of the genus in TaxonomicGenus or extracted from TaxonName in the Latin binomial? Similarly, given a choice of synonyms which should be selected? I am no taxonomist, but I have sat through many talks on initiatives to make taxonomic names accessible, so, stay tuned for a non-taxonomist's search of clean and verified biological names in the exotic world of uBio, the Catalogue of Life and the Global Names Index .