Biomedical Named Entity extraction using general purpose NER Models
Biomedical and healthcare research and practice have accelerated the rate at which information, in the form of scientific publications, electronic medical records (EMR), transcription records and more, is created and published. To tag, index and manage this fast-growing body of knowledge effectively, Named Entity Recognition (NER) is the first step: it extracts key entities such as people, organizations, chemicals, diseases, genes, proteins and anatomical constituents. NER is a challenging task and still an active area of research. State-of-the-art NER systems for English produce near-human performance, but NER systems developed for one domain generally do not perform well on others. Biomedical vocabulary is particularly challenging: new research keeps creating new entities, and abbreviations and long, complex constructions make it difficult to get accurate results or to use rule-based methods.
That being said, the challenges above are underscored by the results I got using general-purpose, public text extraction resources, namely named entity recognizers and APIs/services. This is an apples-to-oranges comparison, but it does serve as a baseline to get started with. For the observations below, I used the online demos for OpenCalais and Alchemy, and developed simple Groovy and Python scripts for the Stanford CoreNLP, LingPipe and NLTK NERs; the code is shared here.
All but the Genia-trained classifier (LingPipe) performed abysmally. The LingPipe model fared better because Genia pertains to genes, proteins and general anatomy, and I happened to select sample 1 for its gene mentions and sample 2 for its anatomy mentions. Note that these results do not reflect a weakness in the classifiers themselves; they merely point out the huge variation between a classifier trained on general-purpose data and one trained for a specific domain and task. OpenCalais and Alchemy perform a bit better than the general classifiers, but these are free services and therefore limited in scope; I am not sure whether they offer domain-specific services. They are generally tuned for a wide audience and a broad range of domains, and to consume minimal computing resources while providing fair results.
Named Entity Recognition classifiers and trained models
Public-use, state-of-the-art machine learning classifiers include conditional random fields (CRF), hidden Markov models (HMM), support vector machines (SVM), naive Bayes and others; notable toolkits that implement them are OpenNLP, Stanford CoreNLP, NLTK and LingPipe. These toolkits provide a framework for training your own models on pre-annotated datasets, but creating comprehensive training datasets is immensely effort-intensive and highly specialized.
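To make "pre-annotated dataset" concrete, here is a minimal sketch of the token-per-line BIO layout commonly used for NER training data, together with a small parser that groups the tags back into entity spans. The sample sentence, its tags and the function names `read_bio` and `bio_to_entities` are my own illustrations; they are not taken from Genia, GeneTag or any of the toolkits above.

```python
from typing import List, Optional, Tuple

# Hypothetical annotation in "token tag" form, one token per line.
# B- marks the first token of an entity, I- a continuation, O a non-entity token.
SAMPLE = """\
scaffolding B-protein
protein I-protein
is O
cloned O
under O
the O
lac B-DNA
promoter I-DNA
"""


def read_bio(text: str) -> List[Tuple[str, str]]:
    """Parse 'token tag' lines into (token, tag) pairs, skipping blank lines."""
    pairs = []
    for line in text.splitlines():
        if line.strip():
            token, tag = line.split()
            pairs.append((token, tag))
    return pairs


def bio_to_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity text, label) spans."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label: Optional[str] = None
    for token, tag in pairs:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            # Close any open entity; a stray I- tag is treated as a new start.
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(token)
        else:
            # An "O" tag closes whatever entity was open.
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


if __name__ == "__main__":
    print(bio_to_entities(read_bio(SAMPLE)))
```

Every token in a training corpus needs a tag like this, assigned by someone who knows the domain, which is where the annotation effort goes.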
Public-use NER trained models and corpora, created either by academic institutions like the LDC or during competitive conferences like CoNLL or MUC, are based on general language models with wider and more general coverage in mind. NER models trained on such general-purpose (Brown) or specific (GeneTag, Genia) corpora are either too general or too domain- and task-specific, and therefore cannot be used with acceptable accuracy for biomedical content.
Component | Stanford CoreNLP | NLTK | LingPipe |
---|---|---|---|
POS | MaxEnt | Naive Bayes | HMM |
NER | CRF | Naive Bayes | HMM |
Entities | person,org,loc,misc | person,org,loc,misc | Gene, Protein, Cell, Other |
Models | CoNLL,MUC | CoNLL,Brown | Brown,Genia,GeneTag |
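
One way to put a number on "acceptable accuracy" is entity-level precision, recall and F1 against a hand-annotated gold set, counting a prediction as correct only when both the entity text and its label match. The sketch below assumes that exact-match convention; the `gold` and `predicted` sets are hypothetical and are not measurements from any of the systems in this post.

```python
from typing import Set, Tuple

Entity = Tuple[str, str]  # (entity text, label)


def precision_recall_f1(
    gold: Set[Entity], predicted: Set[Entity]
) -> Tuple[float, float, float]:
    """Exact-match scoring: text and label must both agree with the gold set."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold annotations and system output, for illustration only.
gold = {
    ("scaffolding protein", "protein"),
    ("coat protein", "protein"),
    ("lac promoter", "DNA"),
}
predicted = {
    ("scaffolding protein", "protein"),
    ("lac promoter", "DNA"),
    ("DNA", "person"),  # a spurious general-purpose label
}

if __name__ == "__main__":
    p, r, f = precision_recall_f1(gold, predicted)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Scoring on exact matches is strict: a tagger that finds the right span but assigns a label from a different scheme (person vs. protein, say) gets no credit, which is part of why cross-domain comparisons like the one in this post look so lopsided.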
Semantic Services/APIs
A semantic service is a web service or REST API that reads unstructured text and identifies entities, facts and events within it. Notable services include OpenCalais, a Thomson Reuters company, and Alchemy, an IBM company. These services are available to the general public either free or for a nominal subscription fee.
OpenCalais and Alchemy go a bit further by using proprietary algorithms and methods that identify and classify named entities with the help of other knowledge sources, such as ontologies and taxonomies like DBpedia, Freebase and MeSH. The results are better, but still not good enough for specialized domains.
Sample Article 1
For the following input, the results are summarized below.
Autoregulation of the bacteriophage P22 scaffolding protein gene. During the formation of each bacteriophage P22 head, about 250 molecules of the product of gene 8, scaffolding protein, coassemble with and dictate correct assembly of the coat protein into a proper shell structure. At approximately the time that DNA is inserted inside the coat protein shell, all of the scaffolding protein molecules leave the structure. They remain active and participate in several subsequent rounds of shell assembly. Previous work has shown that scaffolding protein gene expression is affected by the head assembly process and has generated the hypothesis that unassembled scaffolding protein negatively modulates the expression of its own gene but that it lacks this activity when complexed with coat protein in proheads. To test this model, a P22 restriction fragment containing the scaffolding and coat protein genes was cloned under control of the lac promoter. These cloned genes were then expressed in an in vitro DNA-dependent transcription-translation reaction. The addition of purified scaffolding protein to this reaction resulted in reduced scaffolding protein synthesis relative to coat and tail protein synthesis to an extent and at a protein concentration that was consistent with the observed reduction in vivo. We conclude that scaffolding protein synthesis is autoregulated and that scaffolding protein is the only phage-coded protein required for this process. In addition, these experiments provide additional evidence that this autoregulation is posttranscriptional. Authors: E Wyckoff, S Casjens
Entity | Calais | Alchemy | Stanford | NLTK | LingPipe |
---|---|---|---|---|---|
Autoregulation | Location | ||||
bacteriophage P22 scaffolding protein gene | Organization | DNA_domain_or_region | |||
bacteriophage P22 head | Position | protein_family_or_group | |||
250 | Number | Cardinal | |||
DNA | Misc | Person | |||
scaffolding protein | protein_family_or_group | ||||
coat protein | protein_molecule | ||||
coat protein shell | protein_family_or_group | ||||
shell assembly | other_name | ||||
protein gene expression | other_name | ||||
head | body_part | ||||
P22 restriction fragment | DNA_domain_or_region | ||||
scaffolding and coat protein genes | protein_family_or_group | ||||
lac promoter | DNA_domain_or_region | ||||
cloned genes | DNA_domain_or_region | ||||
in vitro DNA-dependent transcription-translation reaction | other_name | ||||
coat | cell_type | ||||
protein synthesis | other_name | ||||
phage-coded protein | protein_family_or_group | ||||
Wyckoff | Person |
Sample Article 2
For the following input, the results are summarized below.
Cortical evoked potentials in spinal surgery. Few complications of surgery are as devastating as paraplegia in a patient who has been operated on for correction of a spinal abnormality. Early warning of potential damage to the spinal cord is highly desirable. This is possible with cortical evoked potentials. It measures electroencephalographic activity after peripheral nerve stimulation, and may be used during surgery without interrupting the procedure.In this report, 42 patients were studied by means of cortical evoked potentials. Findings in 35 patients were unremarkable. Six showed changes that made the test valuable; four of these six patients had changes that were ominous. In three of the four, the changes were appreciated and corrections were made to avoid paralysis. In one, changes were not appreciated and the patient became paralyzed. In one case, improvement in conduction was evident when the deformity of the spine was corrected. In another case, the latency was momentarily prolonged when a feeder artery to the cord was ligated.Although this technique requires special training and equipment, its value justifies the trouble and expense. Certainly its common use in spinal surgery is inevitable. Authors: D Faust, L T Happel
Entity | Calais | Alchemy | Stanford | NLTK | LingPipe |
---|---|---|---|---|---|
Cortical evoked potentials | other_name | ||||
spinal surgery | Medical Treatment | Field Term | other_name | ||
surgery | Medical Treatment | Date | other_name | ||
paraplegia | Medical Condition | ||||
patient | multi_cell | ||||
spinal abnormality | other_name | ||||
spinal cord | Field Term | body_part | |||
cortical | Location | cell_type | |||
nerve | lipid | ||||
42 | Number | Cardinal | |||
35 | Number | Cardinal | |||
Six | Number | Location | |||
three | Number | Cardinal | |||
four | Number | Cardinal | |||
paralysis | Medical Condition | ||||
one | Number | Date | |||
deformity | Medical Condition | ||||
improvement | protein_molecule | ||||
spine | multi_cell | ||||
latency | other_name | ||||
cord | lipid | ||||
equipment | body_part | ||||
D Faust | Person | Person | carbohydrate | ||
T Happel | Person | lipid |