2015
Journal article  Open Access

Retrieving taxa names from large biodiversity data collections using a flexible matching workflow

Vanden Berghe E, Coro G, Bailly N, Fiorellato F, Aldemita C, Ellenbroek A, Pagano P

Modeling and Simulation  Taxon Name Parsing  Name Matcher Chain  Taxonomy  Taxonomic Authority File  Ecology  Taxon names matching  Computational Theory and Mathematics  Computer Science Applications  Behavior and Systematics  Evolution  Taxonomic nomenclature  Applied Mathematics  Ecological Modeling  Digital Libraries 

In the domain of biological classification there are several taxon name matching services that can search for a species' scientific name in a large collection of taxonomic names. Many of these services are available online, and many others run on the computers of individual scientists. While these systems may work very well, most suffer from the fact that the list of names used as a reference, and the criteria used to decide on a match, are hard-coded in the engine that performs the name matching. In this paper we present BiOnym, a taxon name matching system that separates reference name lists, search criteria and matching engine. The user is offered a choice of several taxonomic reference lists, including the option to upload their own list onto the system. Furthermore, BiOnym is a flexible workflow, which embeds and combines techniques using lexical matching algorithms as well as expert knowledge. It is also an open platform that allows developers to contribute new techniques. In this paper we demonstrate the benefits of this approach in terms of the efficiency and effectiveness of the information retrieval process with respect to other solutions.
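The separation described in the abstract (reference list, match criteria, and matching engine as independent parts) can be illustrated with a minimal Python sketch. This is not BiOnym's implementation: the `Matcher` class, the `match_name` function, the score functions, the thresholds, and the three-name reference list are all hypothetical, and plain Levenshtein edit distance stands in for whatever lexical matchers the actual workflow uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def exact_score(query: str, candidate: str) -> float:
    """1.0 for a case-insensitive exact match, otherwise 0.0."""
    return 1.0 if query.lower() == candidate.lower() else 0.0


def edit_score(query: str, candidate: str) -> float:
    """Similarity in [0, 1]: edit distance normalised by the longer string."""
    q, c = query.lower(), candidate.lower()
    longest = max(len(q), len(c))
    return 1.0 if longest == 0 else 1.0 - levenshtein(q, c) / longest


@dataclass
class Matcher:
    """One step of the chain: a score function plus its acceptance threshold."""
    name: str
    score: Callable[[str, str], float]
    threshold: float


def match_name(
    query: str, reference: Sequence[str], chain: Sequence[Matcher]
) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Run the matchers in order; return the first one that yields candidates."""
    for matcher in chain:
        scored = [(cand, matcher.score(query, cand)) for cand in reference]
        hits = [(cand, s) for cand, s in scored if s >= matcher.threshold]
        if hits:
            return matcher.name, sorted(hits, key=lambda h: -h[1])
    return None, []


# Hypothetical reference list and criteria; either can be swapped
# without touching match_name (the "engine").
reference = ["Gadus morhua", "Thunnus albacares", "Salmo salar"]
chain = [Matcher("exact", exact_score, 1.0), Matcher("edit", edit_score, 0.8)]

step, hits = match_name("Gadus morua", reference, chain)  # misspelt query
print(step, hits)
```

Because the reference list and the chain are passed in as arguments, a user-supplied list or an additional matcher can be plugged in without modifying the engine function, which is the design point the abstract makes.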

Source: ECOLOGICAL INFORMATICS, vol. 28, pp. 29-41



BibTeX entry
@article{oai:it.cnr:prodotti:331989,
	title = {Retrieving taxa names from large biodiversity data collections using a flexible matching workflow},
	author = {Vanden Berghe E and Coro G and Bailly N and Fiorellato F and Aldemita C and Ellenbroek A and Pagano P},
	doi = {10.1016/j.ecoinf.2015.05.004},
	year = {2015}
}

IMARINE
Data e-Infrastructure Initiative for Fisheries Management and Conservation of Marine Living Resources

