Biomedical Named Entity extraction using general purpose NER Models

Biomedical and healthcare research and practice have accelerated the rate at which information is created and published in the form of scientific publications, electronic medical records (EMR), transcription records and other documents. To tag, index and manage this fast-growing body of knowledge, Named Entity Recognition (NER) is the first step: it extracts key entities such as people, organizations, chemicals, diseases, genes, proteins and anatomical constituents. NER is a challenging task and still an active area of research. State-of-the-art NER systems for English produce near-human performance, but systems developed for one domain generally do not perform well on others. Biomedical vocabulary is particularly challenging: new research keeps creating new entities, and abbreviations and long, complex constructions make it difficult to get accurate results or to use rule-based methods.

 

The challenges above are underscored by the results I got from general-purpose, publicly available text extraction resources, namely named entity recognizers and APIs/services. This is an apples-to-oranges comparison, but it serves as a baseline to get started with. For the observations below, I used the online demos of OpenCalais and Alchemy, and wrote simple Groovy and Python scripts for the Stanford CoreNLP, LingPipe and NLTK NERs; the code is shared here.

 

All but the Genia-trained classifier (LingPipe) performed abysmally. LingPipe's advantage is expected, since Genia covers genes, proteins and general anatomy, and I happened to select sample 1 for its gene mentions and sample 2 for its anatomy mentions. Note that these results do not reflect a weakness in the classifiers; they merely point out the huge variation between a classifier trained on general-purpose data and one trained for a specific domain and task. OpenCalais and Alchemy perform a bit better than the general classifiers, but I used the free services, which are limited in scope, and I am not sure whether they offer domain-specific services. They are generally tuned for a wide audience and a wide range of domains, and to consume minimal computing resources while providing fair results.

 

Named Entity Recognition classifiers and trained models

Publicly available toolkits implement state-of-the-art machine learning classifiers such as conditional random fields (CRF), hidden Markov models (HMM), support vector machines (SVM) and naive Bayes; notable among them are OpenNLP, Stanford CoreNLP, NLTK and LingPipe. These toolkits provide a framework for training your own models on pre-annotated datasets, but creating comprehensive training datasets is immensely effort-intensive and requires specialized expertise.

Publicly available trained NER models and corpora, created either by academic institutions such as the LDC or for competitive conferences such as CoNLL and MUC, are based on general language models built with wide, general coverage in mind. NER models trained on such general-purpose (Brown) or specific (GeneTag, Genia) corpora are either too general or too domain- and task-specific, and therefore cannot be used with acceptable accuracy on biomedical content.
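To make "pre-annotated dataset" a little more concrete: corpora of the CoNLL kind typically mark entities token by token with IOB-style tags (B- opens an entity, I- continues it, O is outside any entity). The sketch below collapses such tags back into entity mentions. The helper name `iob_to_spans`, the example tokens and the tag labels are my own illustration, not taken from any of the corpora or toolkits named above.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse token-level IOB tags into (entity text, label) pairs.

    Hypothetical helper for illustration; a stray I- tag whose label differs
    from the entity being built is treated as the start of a new entity.
    """
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)
        else:  # "O": close any entity that is still open
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None

    if current:  # entity that runs to the end of the sentence
        spans.append((" ".join(current), label))
    return spans


if __name__ == "__main__":
    # Invented annotation, loosely modelled on the vocabulary of sample 1.
    tokens = ["The", "lac", "promoter", "controls", "coat", "protein", "genes"]
    tags = ["O", "B-DNA", "I-DNA", "O", "B-protein", "I-protein", "O"]
    for text, entity_label in iob_to_spans(tokens, tags):
        print(f"{entity_label}: {text}")
```

Every token in a training corpus needs a tag like this, which is a large part of why building a domain-specific corpus is so labor-intensive.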

 

| | Stanford CoreNLP | NLTK | LingPipe |
|---|---|---|---|
| POS | MaxEnt | Naive Bayes | HMM |
| NER | CRF | Naive Bayes | HMM |
| Entities | person, org, loc, misc | person, org, loc, misc | Gene, Protein, Cell, Other |
| Models | CoNLL, MUC | CoNLL, Brown | Brown, Genia, GeneTag |

 

Semantic Services/APIs

A semantic service is a web service or REST API that reads unstructured text and identifies entities, facts and events within it. Notable services include OpenCalais, a Thomson Reuters company, and Alchemy, an IBM company. These services are available to the general public either free or for a nominal subscription fee.

OpenCalais and Alchemy go a bit further, using proprietary algorithms and methods that identify and classify named entities with the help of other knowledge sources, such as the ontologies and taxonomies DBpedia, Freebase and MeSH. The results are better, but still not good enough for specialized domains.


Sample Article 1

For the following input, the results are summarized below.

Autoregulation of the bacteriophage P22 scaffolding protein gene. During the formation of each bacteriophage P22 head, about 250 molecules of the product of gene 8, scaffolding protein, coassemble with and dictate correct assembly of the coat protein into a proper shell structure. At approximately the time that DNA is inserted inside the coat protein shell, all of the scaffolding protein molecules leave the structure. They remain active and participate in several subsequent rounds of shell assembly. Previous work has shown that scaffolding protein gene expression is affected by the head assembly process and has generated the hypothesis that unassembled scaffolding protein negatively modulates the expression of its own gene but that it lacks this activity when complexed with coat protein in proheads. To test this model, a P22 restriction fragment containing the scaffolding and coat protein genes was cloned under control of the lac promoter. These cloned genes were then expressed in an in vitro DNA-dependent transcription-translation reaction. The addition of purified scaffolding protein to this reaction resulted in reduced scaffolding protein synthesis relative to coat and tail protein synthesis to an extent and at a protein concentration that was consistent with the observed reduction in vivo. We conclude that scaffolding protein synthesis is autoregulated and that scaffolding protein is the only phage-coded protein required for this process. In addition, these experiments provide additional evidence that this autoregulation is posttranscriptional. Authors: E Wyckoff, S Casjens

| Entity | Calais | Alchemy | Stanford | NLTK | LingPipe |
|---|---|---|---|---|---|
| Autoregulation | | | | Location | |
| bacteriophage P22 scaffolding protein gene | | | | Organization | DNA_domain_or_region |
| bacteriophage P22 head | Position | | | | protein_family_or_group |
| 250 | | | Number | Cardinal | |
| DNA | | | Misc | Person | |
| scaffolding protein | | | | | protein_family_or_group |
| coat protein | | | | | protein_molecule |
| coat protein shell | | | | | protein_family_or_group |
| shell assembly | | | | | other_name |
| protein gene expression | | | | | other_name |
| head | | | | | body_part |
| P22 restriction fragment | | | | | DNA_domain_or_region |
| scaffolding and coat protein genes | | | | | protein_family_or_group |
| lac promoter | | | | | DNA_domain_or_region |
| cloned genes | | | | | DNA_domain_or_region |
| in vitro DNA-dependent transcription-translation reaction | | | | | other_name |
| coat | | | | | cell_type |
| protein synthesis | | | | | other_name |
| phage-coded protein | | | | | protein_family_or_group |
| Wyckoff | | | Person | | |
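One quick way to read the table above is to count how many of the listed mentions each system labelled at all. The sketch below does that for sample 1. The rows are transcribed from the table, but the data layout and the function name `labels_per_system` are my own choices for illustration, and a label is not necessarily a correct label (NLTK tagging "DNA" as Person still counts), so this measures coverage, not accuracy.

```python
from collections import Counter

SYSTEMS = ["Calais", "Alchemy", "Stanford", "NLTK", "LingPipe"]

# (mention, {system: label}); only the non-empty cells of the table are listed.
ROWS = [
    ("Autoregulation", {"NLTK": "Location"}),
    ("bacteriophage P22 scaffolding protein gene",
     {"NLTK": "Organization", "LingPipe": "DNA_domain_or_region"}),
    ("bacteriophage P22 head",
     {"Calais": "Position", "LingPipe": "protein_family_or_group"}),
    ("250", {"Stanford": "Number", "NLTK": "Cardinal"}),
    ("DNA", {"Stanford": "Misc", "NLTK": "Person"}),
    ("scaffolding protein", {"LingPipe": "protein_family_or_group"}),
    ("coat protein", {"LingPipe": "protein_molecule"}),
    ("coat protein shell", {"LingPipe": "protein_family_or_group"}),
    ("shell assembly", {"LingPipe": "other_name"}),
    ("protein gene expression", {"LingPipe": "other_name"}),
    ("head", {"LingPipe": "body_part"}),
    ("P22 restriction fragment", {"LingPipe": "DNA_domain_or_region"}),
    ("scaffolding and coat protein genes",
     {"LingPipe": "protein_family_or_group"}),
    ("lac promoter", {"LingPipe": "DNA_domain_or_region"}),
    ("cloned genes", {"LingPipe": "DNA_domain_or_region"}),
    ("in vitro DNA-dependent transcription-translation reaction",
     {"LingPipe": "other_name"}),
    ("coat", {"LingPipe": "cell_type"}),
    ("protein synthesis", {"LingPipe": "other_name"}),
    ("phage-coded protein", {"LingPipe": "protein_family_or_group"}),
    ("Wyckoff", {"Stanford": "Person"}),
]


def labels_per_system(rows):
    """Count, for each system, how many mentions received any label."""
    counts = Counter({system: 0 for system in SYSTEMS})
    for _mention, labels in rows:
        for system in labels:
            counts[system] += 1
    return counts


if __name__ == "__main__":
    counts = labels_per_system(ROWS)
    for system in SYSTEMS:
        print(f"{system:<9} labelled {counts[system]:>2} of {len(ROWS)} mentions")
```

On these 20 rows the tally comes to Calais 1, Alchemy 0, Stanford 3, NLTK 4 and LingPipe 16, which matches the impression that only the Genia-trained model engages with the gene and protein vocabulary.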

 

Sample Article 2

For the following input, the results are summarized below.

Cortical evoked potentials in spinal surgery. Few complications of surgery are as devastating as paraplegia in a patient who has been operated on for correction of a spinal abnormality. Early warning of potential damage to the spinal cord is highly desirable. This is possible with cortical evoked potentials. It measures electroencephalographic activity after peripheral nerve stimulation, and may be used during surgery without interrupting the procedure.In this report, 42 patients were studied by means of cortical evoked potentials. Findings in 35 patients were unremarkable. Six showed changes that made the test valuable; four of these six patients had changes that were ominous. In three of the four, the changes were appreciated and corrections were made to avoid paralysis. In one, changes were not appreciated and the patient became paralyzed. In one case, improvement in conduction was evident when the deformity of the spine was corrected. In another case, the latency was momentarily prolonged when a feeder artery to the cord was ligated.Although this technique requires special training and equipment, its value justifies the trouble and expense. Certainly its common use in spinal surgery is inevitable. Authors: D Faust, L T Happel
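
| Entity | Calais | Alchemy | Stanford | NLTK | LingPipe |
|---|---|---|---|---|---|
| Cortical evoked potentials | | | | | other_name |
| spinal surgery | Medical Treatment | Field Term | | | other_name |
| surgery | Medical Treatment | | | Date | other_name |
| paraplegia | Medical Condition | | | | |
| patient | | | | | multi_cell |
| spinal abnormality | | | | | other_name |
| spinal cord | | Field Term | | | body_part |
| cortical | | | | Location | cell_type |
| nerve | | | | | lipid |
| 42 | | | Number | Cardinal | |
| 35 | | | Number | Cardinal | |
| Six | | | Number | Location | |
| three | | | Number | Cardinal | |
| four | | | Number | Cardinal | |
| paralysis | Medical Condition | | | | |
| one | | | Number | Date | |
| deformity | Medical Condition | | | | |
| improvement | | | | | protein_molecule |
| spine | | | | | multi_cell |
| latency | | | | | other_name |
| cord | | | | | lipid |
| equipment | | | | | body_part |
| D Faust | | Person | Person | | carbohydrate |
| T Happel | | | Person | | lipid |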
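Because each system uses its own tag set, the labels above cannot be compared one to one, but it is still instructive to pull out the mentions that more than one system labelled and read the labels side by side. The sketch below does that for sample 2. The rows are transcribed from the table; the function name `contested_mentions` and its `min_systems` parameter are hypothetical names of mine, not part of any of the tools used.

```python
NUM = {"Stanford": "Number", "NLTK": "Cardinal"}  # shared by several rows

# (mention, {system: label}); only the non-empty cells of the table are listed.
ROWS = [
    ("Cortical evoked potentials", {"LingPipe": "other_name"}),
    ("spinal surgery", {"Calais": "Medical Treatment",
                        "Alchemy": "Field Term", "LingPipe": "other_name"}),
    ("surgery", {"Calais": "Medical Treatment",
                 "NLTK": "Date", "LingPipe": "other_name"}),
    ("paraplegia", {"Calais": "Medical Condition"}),
    ("patient", {"LingPipe": "multi_cell"}),
    ("spinal abnormality", {"LingPipe": "other_name"}),
    ("spinal cord", {"Alchemy": "Field Term", "LingPipe": "body_part"}),
    ("cortical", {"NLTK": "Location", "LingPipe": "cell_type"}),
    ("nerve", {"LingPipe": "lipid"}),
    ("42", dict(NUM)),
    ("35", dict(NUM)),
    ("Six", {"Stanford": "Number", "NLTK": "Location"}),
    ("three", dict(NUM)),
    ("four", dict(NUM)),
    ("paralysis", {"Calais": "Medical Condition"}),
    ("one", {"Stanford": "Number", "NLTK": "Date"}),
    ("deformity", {"Calais": "Medical Condition"}),
    ("improvement", {"LingPipe": "protein_molecule"}),
    ("spine", {"LingPipe": "multi_cell"}),
    ("latency", {"LingPipe": "other_name"}),
    ("cord", {"LingPipe": "lipid"}),
    ("equipment", {"LingPipe": "body_part"}),
    ("D Faust", {"Alchemy": "Person", "Stanford": "Person",
                 "LingPipe": "carbohydrate"}),
    ("T Happel", {"Stanford": "Person", "LingPipe": "lipid"}),
]


def contested_mentions(rows, min_systems=2):
    """Return the rows that at least `min_systems` systems labelled."""
    return [(mention, labels) for mention, labels in rows
            if len(labels) >= min_systems]


if __name__ == "__main__":
    for mention, labels in contested_mentions(ROWS):
        side_by_side = ", ".join(f"{s}={l}" for s, l in labels.items())
        print(f"{mention}: {side_by_side}")
```

Rows such as "D Faust" (Person for Alchemy and Stanford, carbohydrate for LingPipe) show the flip side of a domain-trained model: it forces everything, author names included, into its biomedical categories.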