Biomedical Named Entity extraction using general purpose NER Models

Added:15 Jun 2015
Author:Praveen Koduganty

Biomedical and healthcare research and practice have accelerated the rate at which information is created and published, in the form of scientific publications, electronic medical records (EMR), transcription records and more. To tag, index and manage this fast-growing body of knowledge effectively, Named Entity Recognition (NER) is the first step: it extracts key entities such as people, organizations, chemicals, diseases, genes, proteins and anatomical constituents. NER is a challenging task and still an active area of research. State-of-the-art NER systems for English produce near-human performance, but systems developed for one domain generally do not perform well on others. Biomedical vocabulary is particularly challenging: new research keeps creating new entities, and abbreviations and long, complex constructions make it difficult to get accurate results or to use rule-based methods.

These challenges are underscored by the results I got from general-purpose, publicly available text extraction resources, namely named entity recognizers and APIs/services. This is an apples-to-oranges comparison, but it does serve as a baseline to get started with. For the observations below, I used the online demos for OpenCalais and Alchemy, and wrote simple Groovy and Python scripts for the StanfordCoreNLP, LingPipe and NLTK NERs; the code is shared here.

All but the Genia-trained classifier (LingPipe) performed abysmally. Genia pertains to genes, proteins and general anatomy, which matches the samples: I happened to select sample 1 to have gene mentions and sample 2 to have anatomy mentions. Note that these results do not reflect a weakness of the classifiers themselves; they merely point out the huge variation between a classifier trained on general-purpose data and one trained for a specific domain and task. OpenCalais and Alchemy perform a bit better than the general classifiers, but these are free services and therefore limited in scope; I am not sure whether they offer domain-specific services. They are generally tuned for a wide audience and a wide range of domains, and to consume minimal computing resources while providing fair results.

Named Entity Recognition classifiers and trained models

Public-use, state-of-the-art machine learning classifiers include conditional random fields (CRF), hidden Markov models (HMM), support vector machines (SVM) and naive Bayes; notable toolkits built on them are OpenNLP, StanfordCoreNLP, NLTK and LingPipe. These toolkits provide a framework for training your own models on pre-annotated datasets, but creating comprehensive training datasets is immensely effort-intensive and highly specialized work.
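To make "pre-annotated dataset" concrete, here is a minimal Python sketch of the token-per-line, BIO-tagged layout commonly used for NER training data, and of how tagged tokens are grouped back into entities. The sample sentence, its tags and the function names are all invented for illustration; they are not taken from any corpus or from the scripts used in this post.

```python
from typing import List, Tuple

# Hypothetical training snippet: one "token tag" pair per line, BIO scheme
# (B- opens an entity, I- continues it, O is outside any entity).
# The tags are invented for illustration, not copied from a real corpus.
SAMPLE = """
The O
lac B-DNA
promoter I-DNA
controls O
scaffolding B-protein
protein I-protein
synthesis O
. O
"""


def parse_annotations(text: str) -> List[Tuple[str, str]]:
    """Split each non-empty line into a (token, tag) pair."""
    pairs = []
    for line in text.strip().splitlines():
        token, tag = line.split()
        pairs.append((token, tag))
    return pairs


def bio_to_spans(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, entity type) spans."""
    spans, tokens, label = [], [], None
    for token, tag in pairs:
        if tag.startswith("I-") and tag[2:] == label:
            # Same type as the entity in progress: extend it.
            tokens.append(token)
            continue
        # Any other tag closes the entity in progress, if there is one.
        if tokens:
            spans.append((" ".join(tokens), label))
        if tag == "O":
            tokens, label = [], None
        else:
            # "B-x" (or a stray "I-x") opens a new entity of type x.
            tokens, label = [token], tag[2:]
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans


print(bio_to_spans(parse_annotations(SAMPLE)))
```

Every token in a training corpus needs a tag like this, assigned by someone who knows the domain, which is why building such datasets takes so much effort.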

Public-use trained NER models and corpora, created either by academic institutions like the LDC or for competitive evaluations at conferences like CoNLL or MUC, are based on general language models with wide, general coverage in mind. NER models trained on general-purpose (Brown) or specific (GeneTag, Genia) corpora are either too general or too domain- and task-specific, and therefore cannot be used with acceptable accuracy for biomedical content.

	Stanford CoreNLP	NLTK	LingPipe
POS	MaxEnt	Naive Bayes	HMM
NER	CRF	Naive Bayes	HMM
Entities	person,org,loc,misc	person,org,loc,misc	Gene, Protein, Cell, Other
Models	CoNLL,MUC	CoNLL,Brown	Brown,Genia,GeneTag

Semantic Services/APIs

These are web services or REST APIs that read unstructured text and identify entities, facts and events within it. Notable services include OpenCalais, a Thomson Reuters company, and Alchemy, an IBM company. These services are available to the general public either free or for a nominal subscription fee.

OpenCalais and Alchemy go a bit further: they use proprietary algorithms and methods to identify and classify named entities, drawing on additional knowledge such as ontologies and taxonomies (DBpedia, Freebase, MeSH etc.). The results are better, but still not good enough for specialized domains.
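How these services actually use ontologies is proprietary, so the following is only a sketch of the simplest form the idea can take: looking up terms from a controlled vocabulary in the text. The four-entry gazetteer, its categories and the function name are hypothetical; a real term list would be derived from a resource such as MeSH.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mini-gazetteer mapping lower-case terms to categories.
# A real one would be built from a vocabulary such as MeSH.
GAZETTEER: Dict[str, str] = {
    "spinal cord": "Anatomy",
    "spine": "Anatomy",
    "paraplegia": "Medical Condition",
    "paralysis": "Medical Condition",
}


def tag_with_gazetteer(text: str, gazetteer: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (matched text, category) for every gazetteer term found in text."""
    # Longest terms first, so a multi-word term wins over a shorter overlap.
    terms = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0), gazetteer[m.group(0).lower()]) for m in pattern.finditer(text)]


text = (
    "Early warning of potential damage to the spinal cord is highly desirable. "
    "Corrections were made to avoid paralysis."
)
print(tag_with_gazetteer(text, GAZETTEER))
```

A lookup like this only finds terms that are already in the vocabulary, spelled the way the vocabulary spells them, which is one reason it falls short on newly coined entities and abbreviations.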

Sample Article 1

For the following input, the results are summarized below.

Autoregulation of the bacteriophage P22 scaffolding protein gene. During the formation of each bacteriophage P22 head, about 250 molecules of the product of gene 8, scaffolding protein, coassemble with and dictate correct assembly of the coat protein into a proper shell structure. At approximately the time that DNA is inserted inside the coat protein shell, all of the scaffolding protein molecules leave the structure. They remain active and participate in several subsequent rounds of shell assembly. Previous work has shown that scaffolding protein gene expression is affected by the head assembly process and has generated the hypothesis that unassembled scaffolding protein negatively modulates the expression of its own gene but that it lacks this activity when complexed with coat protein in proheads. To test this model, a P22 restriction fragment containing the scaffolding and coat protein genes was cloned under control of the lac promoter. These cloned genes were then expressed in an in vitro DNA-dependent transcription-translation reaction. The addition of purified scaffolding protein to this reaction resulted in reduced scaffolding protein synthesis relative to coat and tail protein synthesis to an extent and at a protein concentration that was consistent with the observed reduction in vivo. We conclude that scaffolding protein synthesis is autoregulated and that scaffolding protein is the only phage-coded protein required for this process. In addition, these experiments provide additional evidence that this autoregulation is posttranscriptional. Authors: E Wyckoff, S Casjens

Entity	Calais	Stanford	NLTK	LingPipe
Autoregulation			Location
bacteriophage P22 scaffolding protein gene			Organization	DNA_domain_or_region
bacteriophage P22 head	Position			protein_family_or_group
250		Number	Cardinal
DNA		Misc	Person
scaffolding protein				protein_family_or_group
coat protein				protein_molecule
coat protein shell				protein_family_or_group
shell assembly				other_name
protein gene expression				other_name
head				body_part
P22 restriction fragment				DNA_domain_or_region
scaffolding and coat protein genes				protein_family_or_group
lac promoter				DNA_domain_or_region
cloned genes				DNA_domain_or_region
in vitro DNA-dependent transcription-translation reaction				other_name
coat				cell_type
protein synthesis				other_name
phage-coded protein				protein_family_or_group
Wyckoff		Person
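As a rough way to read the table above, the Python sketch below counts how many rows each tool labelled at all. The rows are transcribed from the table; the variable and function names are my own. Note that it counts labels assigned, not labels that are correct: NLTK tagging "DNA" as Person still counts as one.

```python
from typing import Dict, List, Tuple

TOOLS = ("Calais", "Stanford", "NLTK", "LingPipe")

# (entity, Calais, Stanford, NLTK, LingPipe), transcribed from the table above.
# An empty string means the tool assigned no label to that entity.
ROWS: List[Tuple[str, str, str, str, str]] = [
    ("Autoregulation", "", "", "Location", ""),
    ("bacteriophage P22 scaffolding protein gene", "", "", "Organization", "DNA_domain_or_region"),
    ("bacteriophage P22 head", "Position", "", "", "protein_family_or_group"),
    ("250", "", "Number", "Cardinal", ""),
    ("DNA", "", "Misc", "Person", ""),
    ("scaffolding protein", "", "", "", "protein_family_or_group"),
    ("coat protein", "", "", "", "protein_molecule"),
    ("coat protein shell", "", "", "", "protein_family_or_group"),
    ("shell assembly", "", "", "", "other_name"),
    ("protein gene expression", "", "", "", "other_name"),
    ("head", "", "", "", "body_part"),
    ("P22 restriction fragment", "", "", "", "DNA_domain_or_region"),
    ("scaffolding and coat protein genes", "", "", "", "protein_family_or_group"),
    ("lac promoter", "", "", "", "DNA_domain_or_region"),
    ("cloned genes", "", "", "", "DNA_domain_or_region"),
    ("in vitro DNA-dependent transcription-translation reaction", "", "", "", "other_name"),
    ("coat", "", "", "", "cell_type"),
    ("protein synthesis", "", "", "", "other_name"),
    ("phage-coded protein", "", "", "", "protein_family_or_group"),
    ("Wyckoff", "", "Person", "", ""),
]


def labels_per_tool(rows) -> Dict[str, int]:
    """Count, for each tool, the rows in which it assigned any label."""
    counts = {tool: 0 for tool in TOOLS}
    for _entity, *labels in rows:
        for tool, label in zip(TOOLS, labels):
            if label:
                counts[tool] += 1
    return counts


counts = labels_per_tool(ROWS)
for tool in TOOLS:
    print(f"{tool}: {counts[tool]} of {len(ROWS)} rows labelled")
```

The tally makes the gap visible at a glance, but judging whether each label is right still requires reading the rows.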

Sample Article 2

For the following input, the results are summarized below.

Cortical evoked potentials in spinal surgery. Few complications of surgery are as devastating as paraplegia in a patient who has been operated on for correction of a spinal abnormality. Early warning of potential damage to the spinal cord is highly desirable. This is possible with cortical evoked potentials. It measures electroencephalographic activity after peripheral nerve stimulation, and may be used during surgery without interrupting the procedure. In this report, 42 patients were studied by means of cortical evoked potentials. Findings in 35 patients were unremarkable. Six showed changes that made the test valuable; four of these six patients had changes that were ominous. In three of the four, the changes were appreciated and corrections were made to avoid paralysis. In one, changes were not appreciated and the patient became paralyzed. In one case, improvement in conduction was evident when the deformity of the spine was corrected. In another case, the latency was momentarily prolonged when a feeder artery to the cord was ligated. Although this technique requires special training and equipment, its value justifies the trouble and expense. Certainly its common use in spinal surgery is inevitable. Authors: D Faust, L T Happel

Entity	Calais	Alchemy	Stanford	NLTK	LingPipe
Cortical evoked potentials					other_name
spinal surgery	Medical Treatment	Field Term			other_name
surgery	Medical Treatment			Date	other_name
paraplegia	Medical Condition
patient					multi_cell
spinal abnormality					other_name
spinal cord		Field Term			body_part
cortical				Location	cell_type
nerve					lipid
42			Number	Cardinal
35			Number	Cardinal
Six			Number	Location
three			Number	Cardinal
four			Number	Cardinal
paralysis	Medical Condition
one			Number	Date
deformity	Medical Condition
improvement					protein_molecule
spine					multi_cell
latency					other_name
cord					lipid
equipment					body_part
D Faust		Person	Person		carbohydrate
T Happel			Person		lipid

Tags: Alchemy, Biomedical, CRF, HMM, Lingpipe, Named Entity, NER, NLTK, OpenCalais, Stanford, SVM
