Wednesday 22 December 2010

Biological names in the Global Population Dynamics Database (Part 1)

Joining Data
To connect information on different organisms, data must be joined using one or more common attributes. The attributes we join on include, but are not limited to:
  1. names,
  2. geographical extent,
  3. temporal duration,
  4. evolutionary and genealogical lineage, and
  5. ecological relationships.
So far I have only skirted the complex world of names, working instead with spatiotemporal and evolutionary relationships. Those halcyon days are, however, over, as I join mammal time-series in the Global Population Dynamics Database (GPDD) to the Pantheria trait database, range maps from the IUCN and a supertree, all through common Latin binomials.

Pantheria, the range maps and the supertree all use the Wilson and Reeder mammal taxonomy, so, provided there are no typos, the joins should be clean. In contrast, the GPDD is collated from many different sources and consequently its names are not drawn from a common taxonomy.

GPDD names
In the GPDD, data on taxa are stored in the taxon table, although, surprisingly, there is no field which contains simple Latin binomials.

Information on names is spread across several fields in the taxon table, including TaxonName, which contains the 'full name' of the entity counted with various embellishments. Entries in TaxonName include:
  1. Latin binomials,
  2. Latin trinomials (with and without ssp),
  3. Latin binomial with variety (var.),
  4. Higher taxon name only, e.g. Ursus,
  5. conglomerates of synonyms, e.g. Cutara (Curtara) kuri and Cutara/Curtara kuri
  6. conglomerates of synonyms with typos, e.g. Curtara (Curtara) kuri
  7. conglomerates of more than one species, e.g. Eudiaptomas gracilioides and Eudiaptomas gracilis
  8. ad-hoc names, e.g. Unknown Insect sp142,
  9. other mysterious names, e.g. Cahita n. sp. "a" = "aa" !???!
In addition to TaxonName, five fields store names relating to higher taxonomic levels. These are TaxonomicGenus, TaxonomicFamily, TaxonomicOrder, TaxonomicClass and TaxonomicPhylum. These fields also contain conglomerate names.

The final field of interest is TaxonomicLevel which codes the taxonomic level to which the taxon is differentiated. A TaxonomicLevel of 'Species' is, unfortunately, no guarantee of a valid binomial at that level as some species are 'differentiated', but only have ad-hoc names.

White-space, letter case and hyphenation
Some quick SELECT DISTINCT queries on the taxonomy columns revealed inconsistency in the writing of names and thus duplication resulting from,
  1. white-space following names ("Name" and "Name "),
  2. different cases ("Name" and "name"), and
  3. the use of hyphens ("subname" and "sub-name").
These were relatively simple to fix. White-space was removed with the PostgreSQL TRIM function, for example,

UPDATE gpdd.taxon SET "TaxonomicPhylum" = trim(both ' ' from taxon."TaxonomicPhylum")

The few case and hyphenation errors were corrected with further UPDATE statements.
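
As an illustration of the kind of check the SELECT DISTINCT queries performed, here is a minimal Python sketch that groups spellings differing only in white-space, letter case or hyphenation. The function names normalise_name and find_variants are hypothetical, and the capitalisation rule (first letter upper case, the rest lower case) is an assumption that suits single-word higher taxon names rather than every field.

```python
import re


def normalise_name(name):
    """Trim and collapse white-space, then capitalise the first letter only."""
    cleaned = re.sub(r"\s+", " ", name.strip())
    return cleaned[:1].upper() + cleaned[1:].lower()


def find_variants(names):
    """Group raw spellings that collapse to the same normalised, hyphen-free key.

    Only groups with more than one distinct raw spelling are returned.
    """
    groups = {}
    for raw in names:
        key = normalise_name(raw).replace("-", "")
        groups.setdefault(key, set()).add(raw)
    return {key: sorted(raws) for key, raws in groups.items() if len(raws) > 1}


if __name__ == "__main__":
    sample = ["Name", "Name ", "name", "sub-name ", "subname", "Other"]
    for key, raws in find_variants(sample).items():
        print(key, raws)
```

Each group it reports is a candidate for one of the UPDATE statements described above; the decision on which spelling to keep still has to be made by hand.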

Extracting Latin binomials from TaxonName
TaxonName, then, contains a hideous mixture of valid and ad-hoc biological names, and I must extract Latin binomials from this mess. Two problems must be solved:
  1. The Latin binomials must be parsed from TaxonName using SQL, and
  2. name duplication from homonyms and typographical errors must be adjudicated using internet name resources.

Parsing TaxonName
The strategy is to sequentially process each 'word' in TaxonName, where words are separated by white-space. This SQL splits TaxonName on the first white space into the first word (genus) and the remainder (r_genus).

SELECT
    "TaxonID" AS id,
    "TaxonName" AS name,
    -- first word: everything before the first space (the whole name if there is none)
    CASE position(' ' in "TaxonName")
        WHEN 0 THEN "TaxonName"
        ELSE substring("TaxonName" from 1 for position(' ' in "TaxonName") - 1)
    END AS genus,
    -- remainder: everything after the first space (NULL if there is none)
    CASE position(' ' in "TaxonName")
        WHEN 0 THEN NULL
        ELSE substring("TaxonName" from position(' ' in "TaxonName") + 1)
    END AS r_genus
FROM
    gpdd.taxon
ORDER BY "TaxonName"
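
For readers who want to test the splitting rule outside the database, the same logic can be sketched in a few lines of Python. This is only an illustration of the rule, not part of the GPDD workflow, and the function name split_first_word is hypothetical.

```python
def split_first_word(taxon_name):
    """Split a name on its first space into (first word, remainder).

    Mirrors the SQL: a name with no space is returned whole, with None
    as the remainder.
    """
    head, separator, rest = taxon_name.partition(" ")
    return head, (rest if separator else None)


if __name__ == "__main__":
    for name in ["Lynx canadensis", "Ursus", "Unknown Insect sp142"]:
        print(split_first_word(name))
```

Applying the function again to the remainder gives the second word, which is how the 'sequential' processing of words would proceed.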

Provided the taxon is defined at the TaxonomicLevel of genus, species or subspecies, the field genus should contain the generic epithet, which in turn should match TaxonomicGenus. To check this I added some fields to the first query and created a view, which I then queried for mismatches between the two fields.

CREATE VIEW gpddname AS
SELECT
    "TaxonID" AS id,
    "TaxonName" AS name,
    "TaxonomicPhylum",
    "TaxonomicClass",
    "TaxonomicOrder",
    "TaxonomicFamily",
    "TaxonomicGenus",
    CASE position(' ' in "TaxonName")
        WHEN 0 THEN "TaxonName"
        ELSE substring("TaxonName" from 1 for position(' ' in "TaxonName") - 1)
    END AS genus,
    CASE position(' ' in "TaxonName")
        WHEN 0 THEN NULL
        ELSE substring("TaxonName" from position(' ' in "TaxonName") + 1)
    END AS r_genus
FROM
    gpdd.taxon
WHERE "TaxonomicLevel" IN ('Genus', 'Species', 'Subspecies')
ORDER BY "TaxonomicPhylum", "TaxonomicClass", "TaxonomicOrder", "TaxonomicFamily", "TaxonomicGenus"

SELECT * FROM gpddname WHERE "genus" <> "TaxonomicGenus"

108 rows were returned. 102 had a genus in TaxonomicGenus, but "Unknown species ..." in TaxonName. The latter were excluded from the binomial query,

SELECT * FROM gpddname WHERE "genus" <> "TaxonomicGenus" AND genus <> 'Unknown'

So, what of the other six?

Verifying Biological Names
Should I use the spelling of the genus in TaxonomicGenus or the one extracted from TaxonName in the Latin binomial? Similarly, given a choice of synonyms, which should be selected? I am no taxonomist, but I have sat through many talks on initiatives to make taxonomic names accessible, so stay tuned for a non-taxonomist's search for clean and verified biological names in the exotic world of uBio, the Catalogue of Life and the Global Names Index.


Tuesday 19 October 2010

My First Music Promo

Pappamidi - 'Gotta Rush On' - from the 'Album Chronicle of the Nineties'. A 'photo-animation' built with Adobe Photoshop, Squirlz Morph and Microsoft Movie Maker. Hopefully, a bit more interesting than your average YouTube music slideshow - especially the wiggling VW Beetle at the end.

Thursday 30 September 2010

Creating a simple video from a photo in Photoshop

I am putting together a YouTube video to accompany a friend's music track using Wax. Making a basic slideshow by stringing a few photos together with transition effects is pretty simple, but not exactly exciting, so I began exploring how Photoshop image filters can be sequentially applied to an image to quickly create short eye-catching video sequences. For example, this pulsating logo was made in about 20 minutes by repeatedly applying a 5% 'Spherize' filter to a photo.




So, how was this made? First, the photo was clipped and resized to a 640 x 480 resolution image. A new Photoshop video file was then created with the same resolution using File > New.


The .jpg photograph was then added to the video file.

As the image became the 'background layer', it was converted into a normal layer by right-clicking the layer in the layers window and selecting 'Layer from Background...'. The new 'layer 0' was made invisible, and then nine identical copies were created by right-clicking the layer and selecting 'Duplicate Layer...'.

Nine additional frames were then added using 'Duplicate selected frame' (arrowed above). The first frame was then selected and, within it, the first layer made visible.

The second frame is now selected and the first layer made invisible and the second visible. The desired transformation is then applied to the second layer, in this case, Filter > Distort > Spherize, with 5% as the amount.

The last step is now repeated for the other frames, applying 10%, 15%, 20%, etc. filters to the respective image. This can be sped up by using Ctrl-F to apply the last filter used to the selected layer; thus, you can just Ctrl-F twice on the next frame-layer pair, three times on the third, etc.
Use the play controls in the 'Animation (Frames)' window to view.

To write the video file, use File > Export > Render Video. Exporting a 10 frame animation at 30 frames a second produces a smooth but very quick animation just 0.33 seconds long. Lowering the export frame rate to 5 does not increase the length of the animation by a factor of 6. Instead a jerky video 0.4 seconds long is produced. Setting a frame delay (in Photoshop click the small down arrow in a frame window) does increase video length, but again results in a jerky video determined by the delay time (this is how the video was made). Clearly, more frames and a slower rate of change are required. Time to look at Photoshop Actions.

Tuesday 17 August 2010

Standardize Job Applications Now!

I am currently 'resting', as they say in the theater, although in my case resting involves the same workload with the added requirement of filling in job applications. I probably average one a week, which is not bad given they take 2 days each to complete.

An efficiency saving of at least 10-15% could be made very easily if employers would agree on implementing a standard that covers the basic data required by all application forms: name, address, email, employers, education, etc. Each employer could have their own layout into which the data is imported from a file, probably via XML.

Why, when such an efficiency gain can be easily made for little effort, does no such standard exist? And, indeed why does a Google search reveal no efforts to develop one?

The answer is simple - those in charge of the system (the employers) do not gain from such a system. Indeed, it may be argued that they would in fact lose out, as they would have to spend longer sifting through application forms than they do already.

So, I say to all employers and makers of electronic job application software - give the poor man a break and knock together a standard and some I/O routines PDQ. Ta.

Visualizing the Global Population Dynamics Database

The Global Population Dynamics Database (GPDD) is comprised of over 5000 long-term species abundance time series, over 270,000 spatiotemporally referenced data points in all. How can these data be visualized to explore spatiotemporal correlations between series?

The data is four dimensional: latitude, longitude, time and abundance. The geographical coordinates are all in decimal degrees, so are compatible across series. Series differ in temporal resolution and coverage, as well as in their units of abundance.

Dates were converted to decimal years by adding to the year the proportion of the year at which the observation was taken. For example, 0.04167 was added to 'January' abundances: ((month number - 1) * 1/12) + 1/24, with January as month 1.
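
The conversion can be sketched in Python as below. This is an illustrative sketch rather than the code actually used; the function name decimal_year is hypothetical, and it assumes monthly observations are placed at mid-month with months numbered from 1 (January).

```python
def decimal_year(year, month):
    """Convert a year and month (1 = January) to a mid-month decimal year."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    # (month - 1) / 12 is the start of the month; 1/24 moves to its middle.
    return year + (month - 1) / 12 + 1 / 24


if __name__ == "__main__":
    print(decimal_year(1900, 1))
    print(decimal_year(1900, 7))
```

Numbering months from 1 and subtracting 1 gives the 0.04167 offset quoted for January.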

Log abundance series were back-transformed, and then all series were standardized (SS) by subtracting the mean and dividing by the variance.
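
A minimal Python sketch of this step follows. The function names back_transform and standardise are hypothetical, the logarithm base (10) is an assumption because the base is not stated here, and the sample variance is used as one possible reading of 'variance'. The sketch follows the text in dividing by the variance; a conventional z-score would divide by the standard deviation instead.

```python
import statistics


def back_transform(log_values, base=10.0):
    """Undo a log transform; the base is an assumption, not taken from the GPDD."""
    return [base ** value for value in log_values]


def standardise(values):
    """Subtract the mean and divide by the (sample) variance, as described above."""
    mean = statistics.mean(values)
    variance = statistics.variance(values)
    if variance == 0:
        raise ValueError("cannot standardise a constant series")
    return [(value - mean) / variance for value in values]


if __name__ == "__main__":
    series = back_transform([0.0, 1.0, 2.0])
    print(series)
    print(standardise([1.0, 2.0, 3.0]))
```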

Here are two fly-by animations of these data, both done in ArcScene:



2903 European series (131,000 observations) as line-plots offset in longitude by 6 * SS (above the mean to the east, below to the west). Series are coloured by species and displayed against annual mean temperature in 1900, 1921, 1941, 1961, 1981 and 2001. This approach shows the overall distribution of series but little other pattern is revealed without interactive manipulation.



Abundance of Lynx (L. lynx, L. canadensis & L. rufus), with spheres proportional to SS. Note the pronounced and highly spatially and temporally correlated population cycles in L. canadensis compared to L. rufus.

Friday 6 August 2010

Animating geophylogenies: fish and fire

The Trans-Mexican Volcanic Belt is a mountain range, developed over the last 15 million years, that runs east-west across central Mexico. As the name suggests, the region is highly geologically active, with large-scale tectonic uplift, frequent volcanic eruptions, stratovolcanoes rising to over 5000m, and frequent earthquakes and faulting. This activity, combined with erosive river capture and changing rainfall, resulted in wide-spread hydrological change including the development of large lakes in the Pliocene and Pleistocene (5.4-2.4 Mya).

Some 15 million years ago the ancestor of the Goodeidae freshwater fish family was caught up in this dynamic landscape as it migrated south from the southwest USA with a drying climate. The 40-odd species of Goodeidae alive today co-evolved with the hydrological landscape, which itself is the product of geological and climatic forces. Each hydrological event could split a species' range, causing it to diversify into two (allopatric speciation), while in the lakes sympatric speciation involving ecological and behavioural diversification may have occurred.

I have information on species distributions and the evolutionary relationships between species, geological data and partial hydrological reconstructions. How can this information be brought together to help construct a single history from the fragmented parts?




The animation took me a couple of days to put together in ArcMap. It shows a gaps (disjunctions) and marginal overlap (<50%) geophylogeny built with GeophyloBuilder for ArcGIS 1.1 from a time-scaled molecular tree (Webb et al. 2004) and species records (from Dr. Constantino Marcias Garcia, UNAM). Gaps and marginally overlapping polygons are displayed from orange (disjunct) to grey (50% overlap). The node associated with each DAVA polygon and its daughter branches are shown as black dots and red arrows (pointing downstream). This biological pattern overlays some palaeo-channel, lake and watershed reconstructions (de Cserna & Alvarez 1995), and the spatiotemporal pattern of extrusive volcanics (pink; from Luca Ferrari, UNAM). The Colima graben, which the palaeolake may have drained through, is shown in purple.

The animation ends with all branches of the geophylogeny being displayed, the width of the downstream arrows being inversely proportional to age (thin = old). The temporal scale is roughly 16 million years from start to end, with the biogeographic pattern displayed by the dates from the phylogeny and the volcanics by their stratigraphic range. The temporal extent of the hydrological reconstructions has been visually set to maximise congruence with the biogeographic pattern.

What does it show? Well, disjunction and marginal overlap between the DAVA polygons indicate where a species' range may have been split in the past, resulting in speciation. The arrows point to the centroid of where the daughter clade ranges are today. Some events are congruent with change in the hydrological network and the pattern of volcanism, others remain unexplained but suggest regions where hydrological change may have occurred in the past.

References
de Cserna Z, Alvarez R. Quaternary drainage development in Central Mexico and the threat of an environmental disaster: a geological appraisal. Environ. Eng. Geosci. 1995 1: 29-34.
Webb SA, Graves JA, Marcias-Garcia C, Magurran AE, Ó Foighil D, Ritchie MG. Molecular phylogeny of the live-bearing Goodeidae (Cyprinodontiformes). Mol. Phylogenet. Evol. 2004 30: 527-544.

Tuesday 20 July 2010

An uncertain existence

My contract at Imperial finishes at the end of the month. NERC have agreed in principle to additional funding for a nifty graphical interface to the Entangled Bank database that I have been creating for the last year. However, contractual issues are delaying the funding and indeed may potentially scupper the whole exercise. Presuming the money is going to arrive, I will then have to apply to do the work in an open competition.

Such situations are common in academia and extremely stressful to live through, especially when they affect others. I am lucky that Mary is very supportive, but this situation weighs heavily on our lives. How long do I wait for the funding to come through? - these things can take months. Should I apply for another post-doc and move the family for yet another short-term contract, or find a (probably less interesting) permanent non-academic post? Applying for jobs is very time consuming, so these are important decisions. Write papers or job applications? Stop and establish a long-term relationship with a place, or take another temporary holding position? The older we, and Ben, get, the more stability and a home beckon.

Wednesday 7 July 2010

PostGIS select across the +180/-180 meridian with a bounding box

Many geographical information systems, e.g. PostGIS, treat the earth as a Cartesian plane, despite data being in a geographical coordinate system with latitudes and longitudes. This is annoying when, for example, you want to select geometries that overlap, or are within, a bounding box that crosses the +180/-180 meridian. I have just implemented PostGIS cross +180/-180 meridian bounding box searches in the Entangled Bank as follows:

To select geometries that OVERLAP a box that extends from 90E over the meridian to 90W, intersect (&&) the_geom with a pair of boxes either side of the meridian:

the_geom && ST_MakeBox2D(ST_Point(+90, -90), ST_Point(+180, +90))
OR the_geom && ST_MakeBox2D(ST_Point(-180, -90), ST_Point(-90, +90))

Selecting geometries that are WITHIN the box is slightly more complex, as those that intersect the line bounding the select box must be excluded. Because && compares bounding boxes only, the line test uses ST_Intersects on the actual geometries (the lines must carry the same SRID as the_geom). In PostGIS, using the well-known-text format to code the lines, the query is:

(the_geom && ST_MakeBox2D(ST_Point(+90, -90), ST_Point(+180, +90))
OR the_geom && ST_MakeBox2D(ST_Point(-180, -90), ST_Point(-90, +90)))
AND NOT ST_Intersects(the_geom, ST_LineFromText('LINESTRING(+180 +90, +90 +90, +90 -90, +180 -90)'))
AND NOT ST_Intersects(the_geom, ST_LineFromText('LINESTRING(-180 +90, -90 +90, -90 -90, -180 -90)'))

When I have time I shall write the required function.
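
Until then, here is a rough Python sketch of a helper that builds the OVERLAP condition for an arbitrary box. It is not the Entangled Bank code: the function names split_box and overlap_clause are hypothetical, and it assumes the convention that a box whose western longitude is greater than its eastern longitude crosses the +180/-180 meridian.

```python
def split_box(west, south, east, north):
    """Return the box as one or two (xmin, ymin, xmax, ymax) tuples.

    Assumption: west > east means the box crosses the +180/-180 meridian,
    so it is split into an eastern and a western part.
    """
    if west <= east:
        return [(west, south, east, north)]
    return [(west, south, 180, north), (-180, south, east, north)]


def overlap_clause(west, south, east, north, column="the_geom"):
    """Build a SQL condition testing bounding-box overlap with each part."""
    tests = [
        f"{column} && ST_MakeBox2D(ST_Point({x0}, {y0}), ST_Point({x1}, {y1}))"
        for x0, y0, x1, y1 in split_box(west, south, east, north)
    ]
    return "(" + " OR ".join(tests) + ")"


if __name__ == "__main__":
    # The 90E to 90W box used in the example above.
    print(overlap_clause(90, -90, -90, 90))
```

The coordinates are interpolated directly into the SQL text, so the helper should only be given numbers, never untrusted strings.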



Tuesday 6 July 2010

How to prune a tree given a list of nodes to include?

I have a tree and a list of nodes that define a subtree. How do I prune the tree?

It took me a while to work this one out, so to save anyone else the trouble here is some pseudo code showing how I went about it.

Begin your tree by setting the least common ancestor of the node list as the root node, then 'walk the tree' using the subroutine walk: walk(parent, parent, 0) :

sub walk (parent, ancestor, distance from ancestor to parent) {
    For each (child of parent with nodes) {
        Case (Count(children) of parent with nodes) {
            0: Return
            1: Case(Count(grandchildren of child with nodes)) {
                0: add_child()
                1: walk(child, ancestor, distance())
                >1: add_child()
                    walk(child, child, 0)
            }
            >1: Case(Count(grandchildren of child with nodes)) {
                0: add_child()
                1: walk(child, ancestor, distance())
                >1: add_child()
                    walk(child, child, 0)
            }
        }
    }
    return
}

Where, 'with nodes' is true where the node or any descendant nodes are in the list.

add_child(ancestor) adds the child as a descendant of the ancestor with distance from the ancestor to the child.

distance() adds the distance from parent to child to the accumulated distance from the ancestor to the parent.
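
For comparison, here is a runnable Python sketch of the same idea. It is a bottom-up variant rather than a line-by-line transcription of the pseudo code above. The tree representation (a dict mapping each node to a list of (child, branch length) pairs) and the function name prune are assumptions made for illustration. Nodes that are not in the list and have only one surviving child are skipped, with their branch lengths summed, and the result is rooted at the deepest node that is either in the list or has more than one surviving child.

```python
def prune(tree, root, keep):
    """Prune tree to the nodes in keep.

    tree: dict mapping node -> list of (child, branch_length); leaves may be absent.
    Returns (new_root, pruned_tree) in the same representation.
    """
    pruned = {}

    def visit(node):
        # Collect the retained node from each child subtree, adding the
        # length of the branch that leads to that child.
        hits = []
        for child, length in tree.get(node, []):
            result = visit(child)
            if result is not None:
                kept, distance = result
                hits.append((kept, distance + length))
        if node in keep or len(hits) > 1:
            pruned[node] = hits          # listed node or branching point: retain it
            return node, 0.0
        if len(hits) == 1:
            return hits[0]               # pass-through node: skip it, keep the distance
        return None                      # nothing to keep at or below this node

    result = visit(root)
    new_root = result[0] if result is not None else None
    return new_root, pruned


if __name__ == "__main__":
    tree = {
        "root": [("A", 1), ("B", 2)],
        "A": [("C", 1), ("D", 1)],
        "C": [("E", 0.5), ("F", 0.5)],
        "B": [("G", 1)],
    }
    print(prune(tree, "root", {"E", "D", "G"}))
```

The recursion visits every node once, so very deep trees may need an iterative version to stay within Python's recursion limit.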