Named Entity Recognition


By: Sagar Gole | June 26, 2015

Introduction

Named Entity Recognition (NER) is a subtask of information extraction. It is the process of classifying elements of a text into predefined categories such as the names of persons, organizations and locations, dates and times, percentages, etc. For medical datasets, the categories can also cover diseases, anatomy, and gene and protein names. This data enriches a document’s metadata and helps to improve search retrieval.

In practice, NER means annotating sequences of words in a text with these categories.

An NER system takes an unannotated block of text as input:

 “After graduating from Madras Institute of Technology (MIT – Chennai) in 1960, Dr. Abdul Kalam joined Aeronautical Development Establishment of Defence Research and Development Organisation (DRDO) as a scientist.”

and produces an annotated block of text that highlights the names of entities:

“After graduating from [Madras Institute of Technology]/ORGANIZATION (MIT – [Chennai]/LOCATION) in [1960]/DATE, Dr. [Abdul Kalam]/PERSON joined [Aeronautical Development Establishment of Defence Research and Development Organisation]/ORGANIZATION ([DRDO]/ORGANIZATION) as a scientist.”
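One way to think about this annotated form is as the original text plus a list of labelled character ranges. The sketch below illustrates that idea only; the EntitySpan class and the render method are hypothetical names made up for this example and are not part of any NER library.

```java
import java.util.Arrays;
import java.util.List;

class EntitySpanSketch {
    /** Hypothetical value type: a labelled character range [start, end) in a text. */
    static final class EntitySpan {
        final int start;
        final int end;
        final String label;

        EntitySpan(int start, int end, String label) {
            this.start = start;
            this.end = end;
            this.label = label;
        }

        /** Builds a span for the first occurrence of phrase in text. */
        static EntitySpan of(String text, String phrase, String label) {
            int start = text.indexOf(phrase);
            return new EntitySpan(start, start + phrase.length(), label);
        }
    }

    /** Renders spans as [mention]/LABEL; spans must be sorted and must not overlap. */
    static String render(String text, List<EntitySpan> spans) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (EntitySpan span : spans) {
            out.append(text, cursor, span.start);           // plain text before the entity
            out.append('[').append(text, span.start, span.end)
               .append("]/").append(span.label);            // the annotated entity
            cursor = span.end;
        }
        out.append(text.substring(cursor));                 // plain text after the last entity
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Dr. Abdul Kalam joined DRDO in 1960.";
        List<EntitySpan> spans = Arrays.asList(
                EntitySpan.of(text, "Abdul Kalam", "PERSON"),
                EntitySpan.of(text, "DRDO", "ORGANIZATION"),
                EntitySpan.of(text, "1960", "DATE"));
        System.out.println(render(text, spans));
    }
}
```

The spans here are written by hand; in a real pipeline they would come from the NER system.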

NER is useful:

  • To identify what a document is about.
  • To enhance search retrieval in terms of faceting.
  • To boost the document for result ranking.
  • For linking documents based on the concepts within them.
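As a rough illustration of the faceting use case, extracted entities can be counted per category so that a search UI can offer filters such as "ORGANIZATION: DRDO (2)". This is only a sketch with hand-written input; the class name, the method name and the {mention, label} array layout are assumptions made for this example.

```java
import java.util.Map;
import java.util.TreeMap;

class EntityFacetSketch {
    /**
     * Counts mentions per label. Each input row is {mention, label}.
     * Returns label -> (mention -> count), both sorted alphabetically.
     */
    static Map<String, Map<String, Integer>> facets(String[][] entities) {
        Map<String, Map<String, Integer>> byLabel = new TreeMap<>();
        for (String[] entity : entities) {
            String mention = entity[0];
            String label = entity[1];
            byLabel.computeIfAbsent(label, k -> new TreeMap<>())
                   .merge(mention, 1, Integer::sum);   // increment the mention count
        }
        return byLabel;
    }

    public static void main(String[] args) {
        String[][] entities = {
                { "DRDO", "ORGANIZATION" },
                { "Abdul Kalam", "PERSON" },
                { "Chennai", "LOCATION" },
                { "DRDO", "ORGANIZATION" } };
        System.out.println(facets(entities));
    }
}
```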

The following APIs and services are available for entity extraction:

  • GATE supports NER across many languages and domains out of the box, and is usable through a graphical interface as well as a Java API.
  • NETagger includes the Java-based Illinois Named Entity Recognition tool, trained for the standard 4 types as well as for an extended set of entities.
  • OpenNLP includes rule-based and statistical named entity recognition.
  • Stanford CoreNLP includes a Java-based CRF named entity recognition tool.

In this post we will use Stanford CoreNLP for entity extraction in Java.

Stanford NER is a Java implementation of a Named Entity Recognizer. Stanford NER is also known as CRFClassifier. The software provides a general implementation of (arbitrary order) linear chain Conditional Random Field (CRF) sequence models. That is, by training your own models, you can actually use this code to build sequence models for any task.
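To give a feel for what a linear-chain sequence model does at prediction time, here is a toy sketch of Viterbi decoding: each token gets a score per label, each pair of neighbouring labels gets a transition score, and the best-scoring label sequence is chosen as a whole. This is not Stanford's implementation; the class name, the two labels and all the scores are made up for illustration.

```java
class ViterbiSketch {
    /**
     * Finds the highest-scoring label sequence.
     * emission[t][y]   = score of label y for token t (toy values)
     * transition[p][y] = score of label p being followed by label y (toy values)
     */
    static int[] decode(double[][] emission, double[][] transition) {
        int n = emission.length;
        int k = transition.length;
        double[][] best = new double[n][k]; // best score of any path ending in label y at token t
        int[][] back = new int[n][k];       // previous label on that best path
        best[0] = emission[0].clone();
        for (int t = 1; t < n; t++) {
            for (int y = 0; y < k; y++) {
                int arg = 0;
                for (int p = 1; p < k; p++) {
                    if (best[t - 1][p] + transition[p][y]
                            > best[t - 1][arg] + transition[arg][y]) {
                        arg = p;
                    }
                }
                best[t][y] = best[t - 1][arg] + transition[arg][y] + emission[t][y];
                back[t][y] = arg;
            }
        }
        // Pick the best final label, then follow the back pointers.
        int last = 0;
        for (int y = 1; y < k; y++) {
            if (best[n - 1][y] > best[n - 1][last]) {
                last = y;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        String[] labels = { "O", "PERSON" };
        // Toy scores for the tokens "Dr.", "Abdul", "Kalam", "joined".
        double[][] emission = { { 1.0, 0.0 }, { 0.0, 2.0 }, { 0.4, 0.5 }, { 2.0, 0.0 } };
        double[][] transition = { { 0.5, 0.2 }, { 0.0, 1.0 } };
        StringBuilder out = new StringBuilder();
        for (int y : decode(emission, transition)) {
            out.append(labels[y]).append(' ');
        }
        System.out.println(out.toString().trim());
    }
}
```

With these toy numbers the decoded sequence is O PERSON PERSON O. Note that "Kalam" scores almost the same for both labels on its own; it is the PERSON-to-PERSON transition score that keeps it inside the entity. A trained CRF learns such scores from annotated data.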

Available Stanford NER models:

  • 3-class model: Location, Person, Organization
  • 4-class model: Location, Person, Organization, Misc
  • 7-class model: Time, Location, Organization, Person, Money, Percent, Date

Required library and classifier file:

Library:

  • stanford-corenlp-[version].jar

Classifier file:

  • NER Classifier – english.all.7class.distsim.crf.ser.gz

 

Input Text:

 “After graduating from Madras Institute of Technology (MIT – Chennai) in 1960, Dr. Abdul Kalam joined Aeronautical Development Establishment of Defence Research and Development Organisation (DRDO) as a scientist.”

NER Processing

1. Load NER Classifier

import java.io.File;

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

static String DIR_PATH = "ner";
static String NER_CLASSIFIER_FILE = "english.all.7class.distsim.crf.ser.gz";

// Loaded once when the class is initialized; stays null if loading fails.
static AbstractSequenceClassifier<CoreLabel> classifier = null;
static {
    try {
        String classifierPath = DIR_PATH + File.separator
                + NER_CLASSIFIER_FILE;
        if (!new File(classifierPath).exists()) {
            System.out.println(classifierPath + " does not exist.");
        } else {
            // Deserialize the trained CRF model from disk.
            classifier = CRFClassifier
                    .getClassifierNoExceptions(classifierPath);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

2. Annotate the text

public String annotateText(String query) {
    String output = null;
    try {
        if (query != null && !query.isEmpty() && classifier != null) {
            // Returns the text with every token tagged as token/LABEL;
            // the label "O" marks tokens that are not part of an entity.
            output = classifier.classifyToString(query);
            System.out.println(output);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return output;
}

3. Parse the output

import java.util.regex.Matcher;
import java.util.regex.Pattern;

private void parseText(String text) {
    // Labels of the 7-class model (note: the label is PERCENT, not PERCENTAGE).
    String[] ENTITY_LIST = { "person", "location", "date", "organization",
            "time", "money", "percent" };
    if (text == null) {
        return;
    }
    for (String entityValue : ENTITY_LIST) {
        String entity = entityValue.toUpperCase();
        String tag = "/" + entity;
        // A token is any run of characters without whitespace or '/'.
        // Match one or more consecutive tokens that carry the same tag,
        // e.g. "Abdul/PERSON Kalam/PERSON".
        Pattern pattern = Pattern.compile(
                "([^\\s/]+" + tag + "[ ]*)*[^\\s/]+" + tag);
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            // Strip the tags to recover the plain entity text.
            String entityText = matcher.group().replace(tag, "");
            System.out.println(entityText + " : " + entity);
        }
    }
}

Parsing output:

[Figure: NER parsing output]
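As an alternative to running one regular expression per entity type, the tagged string can also be walked once, grouping neighbouring tokens that share a label. The sketch below assumes the token/LABEL format with "O" for non-entity tokens; the class name is hypothetical and the sample input is hand-written for illustration, not real classifier output.

```java
import java.util.ArrayList;
import java.util.List;

class SlashTagGrouper {
    /** Returns "mention : LABEL" strings in the order the entities appear. */
    static List<String> group(String tagged) {
        List<String> entities = new ArrayList<>();
        StringBuilder mention = new StringBuilder();
        String current = "O";
        for (String item : tagged.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/');      // the label follows the last '/'
            if (slash < 0) {
                continue;                           // not in token/LABEL form, skip it
            }
            String token = item.substring(0, slash);
            String label = item.substring(slash + 1);
            if (!label.equals(current)) {           // label changed: close the open entity
                flush(entities, mention, current);
                current = label;
            }
            if (!label.equals("O")) {
                if (mention.length() > 0) {
                    mention.append(' ');
                }
                mention.append(token);
            }
        }
        flush(entities, mention, current);          // entity at the very end of the text
        return entities;
    }

    private static void flush(List<String> out, StringBuilder mention, String label) {
        if (mention.length() > 0) {
            out.add(mention + " : " + label);
            mention.setLength(0);
        }
    }

    public static void main(String[] args) {
        String tagged = "Dr./O Abdul/PERSON Kalam/PERSON joined/O "
                + "DRDO/ORGANIZATION in/O 1960/DATE ./O";
        for (String entity : group(tagged)) {
            System.out.println(entity);
        }
    }
}
```

Like the regular-expression version, this merges two different entities of the same type when they sit directly next to each other, because the tags alone do not mark where one entity ends and the next begins.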
