Biomedical Named Entity extraction using general purpose NER Models


By: Praveen Koduganty | June 15, 2015

Biomedical and healthcare research and practice have accelerated the rate at which information is created and published, in the form of scientific publications, EMRs, transcription records and more. To effectively tag, index and manage this fast-growing body of knowledge, Named Entity Recognition (NER) is the first step: it extracts key entities such as people, organizations, chemicals, diseases, genes, proteins and anatomical constituents. NER is a challenging task and still an active area of research. State-of-the-art NER systems for English produce near-human performance, but systems developed for one domain generally do not perform well on others. Biomedical vocabulary is particularly challenging: new research keeps creating new entities, and abbreviations and long, complex constructions make it difficult to get accurate results or to use rule-based methods.

These challenges are underscored by the results I got using general-purpose, publicly available text extraction resources, namely named entity recognizers and APIs/services. This is an apples-to-oranges comparison, but it does serve as a baseline to get started with. For the observations below, I used the online demos for OpenCalais and Alchemy, and developed simple Groovy and Python scripts for the Stanford CoreNLP, LingPipe and NLTK NERs; the code is shared here.
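
The scripts themselves are not reproduced in this text. As a minimal sketch of what the NLTK side of such a script might look like (not the author's actual code), the following assumes NLTK is installed together with its tokenizer, POS tagger and named entity chunker data; the exact resource names to download vary between NLTK versions. The function names `entities_from_tree` and `nltk_entities` are illustrative.

```python
def entities_from_tree(tree):
    """Collect (entity text, label) pairs from an ne_chunk-style tree."""
    entities = []
    for node in tree:
        # Named-entity chunks are labeled subtrees; ordinary tokens are
        # plain (word, tag) tuples and have no label() method.
        if hasattr(node, "label"):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities


def nltk_entities(text):
    """Run NLTK's default tokenizer, POS tagger and NE chunker over text."""
    # Imported lazily so entities_from_tree can be used without NLTK.
    import nltk

    entities = []
    for sentence in nltk.sent_tokenize(text):
        tokens = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokens)
        entities.extend(entities_from_tree(nltk.ne_chunk(tagged)))
    return entities
```

Calling `nltk_entities` on an abstract returns a list of `(text, label)` pairs drawn from the general-purpose label set of NLTK's bundled chunker, which is why gene and protein mentions have no suitable category to land in.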

All but the Genia-trained classifier (LingPipe) performed abysmally; the Genia model fared better because Genia covers genes, proteins and general anatomy. I also happened to select sample 1 for its gene mentions and sample 2 for its anatomy mentions. Note that these results do not reflect a weakness of the classifiers themselves; they merely point out the huge variation between a classifier trained on general-purpose data and one trained for a specific domain and task.
OpenCalais and Alchemy perform a bit better than the general classifiers, but these are free services and therefore limited in scope; I am not sure whether they offer domain-specific services. They are generally tuned for a wide audience and a wide range of domains, and to consume minimal computing resources while providing fair results.
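
To compare several recognizers side by side, their outputs can be merged into one row per entity with one column per tool. The sketch below is one possible way to do that; `comparison_rows` and `format_table` are hypothetical helper names, and the tool names and labels in the usage note are made-up placeholders rather than results from this post.

```python
def comparison_rows(results, tools):
    """Merge per-tool output into rows of [entity, label per tool...].

    results maps a tool name to a list of (entity, label) pairs;
    tools fixes the column order. Missing labels become empty strings.
    """
    order = []       # entities in order of first appearance
    by_entity = {}   # entity -> {tool: label}
    for tool in tools:
        for entity, label in results.get(tool, []):
            if entity not in by_entity:
                by_entity[entity] = {}
                order.append(entity)
            by_entity[entity][tool] = label
    return [[e] + [by_entity[e].get(t, "") for t in tools] for e in order]


def format_table(rows, tools):
    """Render the rows as pipe-delimited text with a header line."""
    lines = [" | ".join(["Entity"] + list(tools))]
    lines.extend(" | ".join(row) for row in rows)
    return "\n".join(lines)
```

For example, `comparison_rows({"tool_a": [("coat protein", "Organization")], "tool_b": [("coat protein", "protein_molecule")]}, ["tool_a", "tool_b"])` yields a single row holding both labels for "coat protein", which makes disagreements between tools easy to spot.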

Named Entity Recognition classifiers and trained models
Publicly available toolkits implement state-of-the-art machine learning classifiers such as conditional random fields (CRF), hidden Markov models (HMM), support vector machines (SVM) and naive Bayes; notable among them are OpenNLP, Stanford CoreNLP, NLTK and LingPipe. These toolkits provide a framework for training your own models on pre-annotated datasets, but creating comprehensive training datasets is immensely effort-intensive and very specialized.
Publicly available trained NER models and corpora, created either by academic institutions like LDC or for competitive conferences like CoNLL or MUC, are based on general language models with wide, general coverage in mind. NER models trained on such general-purpose (Brown) or domain-specific (GeneTag, Genia) corpora are either too general or too domain- and task-specific, and therefore cannot be used with acceptable accuracy for biomedical content.

         | Stanford CoreNLP    | NLTK                | LingPipe
POS      | MaxEnt              | Naive Bayes         | HMM
NER      | CRF                 | Naive Bayes         | HMM
Entities | person,org,loc,misc | person,org,loc,misc | Gene, Protein, Cell, Other
Models   | CoNLL,MUC           | CoNLL,Brown         | Brown,Genia,GeneTag
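
The comparison in this post is qualitative. If gold-standard annotations were available for the sample articles, "acceptable accuracy" could be quantified with the usual entity-level precision, recall and F1. The sketch below assumes exact matching on `(entity text, label)` pairs, which is only one of several possible matching criteria; `precision_recall_f1` is an illustrative name.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match entity-level scores over (entity text, label) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A general-purpose model that emits only person, organization and location labels can never match a gold gene or protein annotation under this criterion, so its recall on biomedical entity types is zero by construction.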

Semantic Services/APIs
These are web services or REST APIs that read unstructured text and identify entities, facts and events within it. Notable services include OpenCalais, a Thomson Reuters company, and Alchemy, an IBM company. These services are available to the general public either free or for a nominal subscription fee.
OpenCalais and Alchemy go a bit further by using proprietary algorithms and methods that identify and classify named entities with the help of other knowledge sources, such as the ontologies and taxonomies DBpedia, Freebase and MeSH. The results are better, but not good enough for specialized domains.

Sample Article 1
For the following input, the results are summarized below.

Autoregulation of the bacteriophage P22 scaffolding protein gene.

During the formation of each bacteriophage P22 head, about 250 molecules of the product of gene 8, scaffolding protein, coassemble with and dictate correct assembly of the coat protein into a proper shell structure. At approximately the time that DNA is inserted inside the coat protein shell, all of the scaffolding protein molecules leave the structure. They remain active and participate in several subsequent rounds of shell assembly. Previous work has shown that scaffolding protein gene expression is affected by the head assembly process and has generated the hypothesis that unassembled scaffolding protein negatively modulates the expression of its own gene but that it lacks this activity when complexed with coat protein in proheads. To test this model, a P22 restriction fragment containing the scaffolding and coat protein genes was cloned under control of the lac promoter. These cloned genes were then expressed in an in vitro DNA-dependent transcription-translation reaction. The addition of purified scaffolding protein to this reaction resulted in reduced scaffolding protein synthesis relative to coat and tail protein synthesis to an extent and at a protein concentration that was consistent with the observed reduction in vivo. We conclude that scaffolding protein synthesis is autoregulated and that scaffolding protein is the only phage-coded protein required for this process. In addition, these experiments provide additional evidence that this autoregulation is posttranscriptional.

Authors: E Wyckoff, S Casjens

Entity Calais Alchemy Stanford NLTK LingPipe
Autoregulation Location
bacteriophage P22 scaffolding protein gene Organization DNA_domain_or_region
bacteriophage P22 head Position protein_family_or_group
250 Number Cardinal
DNA Misc Person
scaffolding protein protein_family_or_group
coat protein protein_molecule
coat protein shell protein_family_or_group
shell assembly other_name
protein gene expression other_name
head body_part
P22 restriction fragment DNA_domain_or_region
scaffolding and coat protein genes protein_family_or_group
lac promoter DNA_domain_or_region
cloned genes DNA_domain_or_region
in vitro DNA-dependent transcription-translation reaction other_name
coat cell_type
protein synthesis other_name
phage-coded protein protein_family_or_group
Wyckoff Person

Sample Article 2
For the following input, the results are summarized below.

Cortical evoked potentials in spinal surgery.

Few complications of surgery are as devastating as paraplegia in a patient who has been operated on for correction of a spinal abnormality. Early warning of potential damage to the spinal cord is highly desirable. This is possible with cortical evoked potentials. It measures electroencephalographic activity after peripheral nerve stimulation, and may be used during surgery without interrupting the procedure. In this report, 42 patients were studied by means of cortical evoked potentials. Findings in 35 patients were unremarkable. Six showed changes that made the test valuable; four of these six patients had changes that were ominous. In three of the four, the changes were appreciated and corrections were made to avoid paralysis. In one, changes were not appreciated and the patient became paralyzed. In one case, improvement in conduction was evident when the deformity of the spine was corrected. In another case, the latency was momentarily prolonged when a feeder artery to the cord was ligated. Although this technique requires special training and equipment, its value justifies the trouble and expense. Certainly its common use in spinal surgery is inevitable.

Authors: D Faust, L T Happel

Entity Calais Alchemy Stanford NLTK LingPipe
Cortical evoked potentials other_name
spinal surgery Medical Treatment Field Term other_name
surgery Medical Treatment Date other_name
paraplegia Medical Condition
patient multi_cell
spinal abnormality other_name
spinal cord Field Term body_part
cortical Location cell_type
nerve lipid
42 Number Cardinal
35 Number Cardinal
Six Number Location
three Number Cardinal
four Number Cardinal
paralysis Medical Condition
one Number Date
deformity Medical Condition
improvement protein_molecule
spine multi_cell
latency other_name
cord lipid
equipment body_part
D Faust Person Person carbohydrate
T Happel Person lipid