SciXML

Background

SciXML is an:

“XML vocabulary for textual content and structure in scientific articles. The format can be used for any discipline in principle. The structural information expresses section structure, headings, footnotes etc. Paragraphs are the lowest level.”

Taken from the SciXML SourceForge home page, http://sourceforge.net/projects/scixml/ .

SciXML was originally developed as part of Simone Teufel’s PhD thesis (Teufel, 1999). Papers on SciXML usually reference this original document and a later conference paper that updates it, Rupp et al, 2006.

Ian Lewin’s paper on using hand-coded rules to convert publishers’ XML files to SciXML provides an accessible introduction to, and justification of, the schema, including this opening statement from his abstract:

“SciXML is designed to represent the standard hierarchical structure of scientific articles and represents a candidate common document representation framework for text-mining.”

Research projects that have used SciXML include:

  • CitRAZ Rhetorical Citation Maps and Domain-Independent Argumentative Zoning
  • FlySlip Integrating Literature, Experiments and Curation in Drosophila Genomics Research
  • Sciborg Extracting the Science from Scientific Publications

Simone was PI on all three projects.

In our JISC bid we noted the use of SciXML in the PaperBrowser project (Karamanis et al, 2008), which supports curation of the FlyBase genomic database:

“PaperBrowser has demonstrated the value of representing layout information in a suitable markup language (SciXML). Such layout is normally self-consistent, but varies between publications.”

PaperBrowser benefits from SciXML input because the different sections in the source document as well as their headings and sub-headings are identified in a consistent manner. PaperBrowser itself though, uses its own XML schema, FBXML, to record document data and make it available to curators through PaperBrowser’s viewer.
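To make the point about consistent section identification concrete, here is a minimal Python sketch that walks a small SciXML-like fragment and lists its section outline. The element names used (PAPER, TITLE, BODY, DIV, HEADER, P) and the sample content are assumptions made for illustration and should be checked against the actual SciXML schema; this is not PaperBrowser's own code.

```python
import xml.etree.ElementTree as ET

# Hypothetical SciXML-like fragment. The element names are assumptions
# for illustration, not taken from the SciXML schema itself.
SAMPLE = """<PAPER>
  <TITLE>An example article</TITLE>
  <BODY>
    <DIV>
      <HEADER>Introduction</HEADER>
      <P>First paragraph.</P>
      <DIV>
        <HEADER>Background</HEADER>
        <P>Nested paragraph.</P>
      </DIV>
    </DIV>
    <DIV>
      <HEADER>Methods</HEADER>
      <P>Another paragraph.</P>
      <P>And one more.</P>
    </DIV>
  </BODY>
</PAPER>"""


def outline(element, depth=1):
    """Return (depth, heading, paragraph_count) for each nested DIV, in document order."""
    rows = []
    # findall("DIV") only matches direct children, so nesting is handled by recursion.
    for div in element.findall("DIV"):
        heading = div.findtext("HEADER", default="").strip()
        paragraphs = len(div.findall("P"))
        rows.append((depth, heading, paragraphs))
        rows.extend(outline(div, depth + 1))
    return rows


root = ET.fromstring(SAMPLE)
sections = outline(root.find("BODY"))
for depth, heading, paragraphs in sections:
    print("  " * (depth - 1) + f"{heading} ({paragraphs} paragraphs)")
```

Because every publisher's layout is mapped onto the same nesting of sections and headings, a downstream tool only needs one traversal like this, whatever the source publication.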

Outside of Cambridge Computer Labs, SciXML seems to be little used as yet. The one example appears to be the recently completed, JISC-funded ART project at Aberystwyth (though Simone and Colin Batchelor of the RSC were involved too). The project created “an ontology based article preparation tool” which has been used to manually mark up 255 papers which “cover topics in physical chemistry and biochemistry and were provided by the Royal Society of Chemistry (RSC) Publishing.” Read more about this project in their poster presented at the JISC Repositories & Preservation Programme Meeting, Birmingham, May 2009.

There is a follow-on project, “Automating the recognition of scientific concepts”, for which a Research Associate is being recruited. The project aims to extend the functionality of the existing tool SAPIENT by incorporating machine learning methods for the recognition of core scientific concepts such as ‘Conclusion’, ‘Method’ and ‘Result’ in research papers.

Summary

SciXML was intended to capture the logical structure of scientific papers. There is a reliable body of evidence to indicate it successfully meets this goal. However, the corpus has been restricted to chemistry, so a potential contribution would be to validate its applicability in another scientific domain.

Implications

In our work we already consider two XML formats: TEI Lite and taXMLit. Within the timescale of the ABLE project we could not realistically incorporate a third except as a proof of concept for alternative XSL transformations.

In follow-on projects there could be an advantage in exploiting SciXML.

    Firstly, as a schema focused on the structure of scientific documents it offers some advantages over the more general TEI Lite. However, TEI Lite currently has better support for images and diagrams. This deficiency is being addressed in the latest developments of SciXML.

    Secondly, it would give us access to the workflow and toolset used in the Sciborg project, which closely matches what we want to achieve, albeit processing literature on chemistry rather than biodiversity. Of particular relevance is Oscar3, which parses documents looking for chemical named entities and could be modified for our needs.

Note that SciXML cannot be considered a rival to taXMLit, as it does not offer taXMLit’s level of atomisation or its biodiversity-specific tagset.

Not to be confused with

SCIXML: Scottish Care Information XML, developed by NHSScotland as a common data exchange format.

References

Lewin, I., 2007. Using hand-crafted rules and machine learning to infer SciXML document structure. In Proceedings of the 6th UK e-science All Hands Meeting.

Rupp, C.J., Copestake, A., Teufel, S., and Waldron, B., 2006. Flexible interfaces in the application of language technology to an eScience corpus. In Proceedings of the 4th UK E-Science All Hands Meeting.

Karamanis, N., Seal, R., Lewin, I., McQuilton, P., Vlachos, A., Gasperin, C., Drysdale, R. and Briscoe, E., 2008. Natural Language Processing in aid of FlyBase Curation. BMC Bioinformatics 9. pp.193-204.

Teufel, S., 1999. Argumentative zoning: Information extraction from scientific text. Unpublished PhD thesis, University of Edinburgh. Available at: http://www.cl.cam.ac.uk/users/sht25/thesis/t.pdf.

Comments

TDWG's view on SciXML

SciXML has yet to make an impact in biodiversity. Searching TDWG's website for SciXML returns no results. Checking the Modularisation of Standard Schemas page directly (part of Napier's Taxonomic Concept Transfer Schema wiki), I found mention of the various biodiversity XML schemas, including taXMLit, and a few schemas from outside our domain that might prove interesting, such as CML (Chemical Markup Language), but not SciXML.

Sciborg's Oscar3

Oscar3

Oscar3 is used in Sciborg. According to the tool’s web site:

Oscar3 is a tool for shallow, chemistry-specific parsing of chemical documents. It identifies (or attempts to identify):

  • Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms, some enzymes and reaction names.
  • Ontology terms: if you can do it by string-matching, you can get OSCAR to do it.
  • Chemical data: Spectra, melting/boiling point, yield etc. in experimental sections.

The second bullet is a problem as far as ABLE is concerned, because we cannot identify terms by string-matching in the absence of a comprehensive taxonomic database to match against.
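The limitation can be illustrated with a minimal Python sketch of dictionary-based string-matching. This is not how Oscar3 is implemented; the function name find_terms, the tiny lexicon and the sample sentence are all hypothetical and chosen only to show what happens when the lexicon is incomplete.

```python
import re


def find_terms(text, lexicon):
    """Return (start, end, matched_text) for each case-insensitive whole-word lexicon match."""
    if not lexicon:
        return []
    # Try longer terms first so a full binomial wins over its genus alone.
    ordered = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


# Hypothetical, deliberately incomplete lexicon of taxon names.
LEXICON = {"Drosophila melanogaster", "Drosophila"}
TEXT = "Drosophila melanogaster and Drosophila simulans were compared with Apis mellifera."

matches = [term for _, _, term in find_terms(TEXT, LEXICON)]
print(matches)
```

With this lexicon the matcher finds the first binomial in full, recognises only the genus of the second, and misses the third name entirely. Coverage is bounded by the dictionary, which is why the approach depends on a comprehensive taxonomic database.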

Oscar3 has additional tools to support the enhancement and maintenance of its own dictionary: “online management of a chemical/stopword lexicon”, as well as support for the “manual editing of SciXML fragments containing named entities, for creating of gold standards and training data.”

So, reworking Oscar3 to support taxonomic parsing is potentially interesting as a long-term project, but it is not something for use within ABLE.
