Named Entity Recognition

Introduction: Named Entity Recognition (NER) is a subtask of information extraction. It classifies elements of a text into predefined categories such as the names of persons, organizations, and locations, dates and times, and percentages. For medical datasets, it also categorizes diseases, anatomy, and gene and protein names. This data enriches a document’s metadata and improves search retrieval.


NER annotates sequences of words in a text.

An NER system takes an unannotated block of text as input:

“After graduating from Madras Institute of Technology (MIT – Chennai) in 1960, Dr. Abdul Kalam joined Aeronautical Development Establishment of Defence Research and Development Organisation (DRDO) as a scientist.”

It then produces an annotated block of text that highlights the names of entities:

“After graduating from [Madras Institute of Technology]/ORGANIZATION (MIT – [Chennai]/LOCATION) in [1960]/DATE, Dr. [Abdul Kalam]/PERSON joined [Aeronautical Development Establishment of Defence Research and Development Organisation]/ORGANIZATION ([DRDO]/ORGANIZATION) as a scientist.”
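
The bracket notation above can be produced from a list of entity spans. The following is a minimal sketch, assuming each entity is given as character offsets plus a label; the class BracketAnnotator, the Span type, and the shortened sample sentence with its hand-assigned offsets are made up for this illustration and are not part of any NER library.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only.
final class BracketAnnotator {

    // One entity: character offsets [start, end) into the text, plus a label.
    static final class Span {
        final int start;
        final int end;
        final String label;

        Span(int start, int end, String label) {
            this.start = start;
            this.end = end;
            this.label = label;
        }
    }

    // Wraps each span as [text]/LABEL. Spans are assumed not to overlap.
    static String annotate(String text, List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt((Span s) -> s.start));
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Span s : sorted) {
            out.append(text, cursor, s.start); // unannotated text before the span
            out.append('[').append(text, s.start, s.end).append("]/").append(s.label);
            cursor = s.end;
        }
        out.append(text, cursor, text.length()); // trailing unannotated text
        return out.toString();
    }

    public static void main(String[] args) {
        // Offsets below were assigned by hand, not by a classifier.
        String text = "Dr. Abdul Kalam joined DRDO in 1960.";
        List<Span> spans = List.of(
            new Span(4, 15, "PERSON"),
            new Span(23, 27, "ORGANIZATION"),
            new Span(31, 35, "DATE"));
        System.out.println(annotate(text, spans));
        // prints: Dr. [Abdul Kalam]/PERSON joined [DRDO]/ORGANIZATION in [1960]/DATE.
    }
}
```

Keeping entities as offset spans rather than as an annotated string makes it easy to render them in other ways later, for example as metadata fields.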

NER is useful:

  • To identify what a document is about.
  • To enhance search retrieval in terms of faceting.
  • To boost the document for result ranking.
  • For linking documents based on the concepts within them.
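
As a rough illustration of the faceting use case, the sketch below counts how many documents mention each entity, grouped by entity type. The class name EntityFacets and the input shape (one map per document, from entity type to the entity names found in it) are assumptions made for this example; a real search engine would typically compute facets from indexed fields instead.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper for illustration only.
final class EntityFacets {

    // Returns entity type -> (entity name -> number of documents mentioning it).
    static Map<String, Map<String, Integer>> count(List<Map<String, Set<String>>> docs) {
        Map<String, Map<String, Integer>> facets = new TreeMap<>();
        for (Map<String, Set<String>> doc : docs) {
            for (Map.Entry<String, Set<String>> e : doc.entrySet()) {
                Map<String, Integer> bucket =
                    facets.computeIfAbsent(e.getKey(), k -> new TreeMap<>());
                for (String name : e.getValue()) {
                    bucket.merge(name, 1, Integer::sum); // one count per document
                }
            }
        }
        return facets;
    }

    public static void main(String[] args) {
        // Hand-written sample data standing in for NER output on two documents.
        Map<String, Set<String>> doc1 = Map.of(
            "PERSON", Set.of("Abdul Kalam"),
            "LOCATION", Set.of("Chennai"));
        Map<String, Set<String>> doc2 = Map.of(
            "LOCATION", Set.of("Chennai"));
        System.out.println(count(List.of(doc1, doc2)));
    }
}
```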

The following APIs and services are available for entity extraction:

  • GATE supports NER across many languages and domains out of the box and is usable via a graphical interface as well as a Java API.
  • NETagger includes the Java-based Illinois Named Entity Recognition tool, trained for the standard 4 types as well as for an extended set of entities.
  • OpenNLP includes rule-based and statistical named entity recognition.
  • Stanford CoreNLP includes a Java-based CRF named entity recognition tool.

In this post we will use Stanford CoreNLP for entity extraction in Java.

Stanford NER is a Java implementation of a Named Entity Recognizer, also known as CRFClassifier. The software provides a general implementation of (arbitrary-order) linear-chain Conditional Random Field (CRF) sequence models. That is, by training your own models, you can use this code to build sequence models for any task.
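
For reference, the standard textbook form of a first-order linear-chain CRF is given below. It models the probability of a label sequence y (one label per token) given the token sequence x. The symbols f_k (feature functions) and λ_k (learned weights) are generic notation, not identifiers from the Stanford code, and higher-order models condition on more than one previous label.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

Here Z(x) sums over all possible label sequences y', so the probabilities of all labelings of x add up to 1.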

Stanford NER available models

  • 3 class model : Location, Person, Organization
  • 4 class model : Location, Person, Organization, Misc
  • 7 class model : Time, Location, Organization, Person, Money, Percent, Date

Required library and classifier file:

Library:

  • stanford-corenlp-[version].jar

Classifier file:

  • NER Classifier – english.all.7class.distsim.crf.ser.gz
 

Input Text:

“After graduating from Madras Institute of Technology (MIT – Chennai) in 1960, Dr. Abdul Kalam joined Aeronautical Development Establishment of Defence Research and Development Organisation (DRDO) as a scientist.”

NER Processing

1. Load NER Classifier
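import java.io.File;
import java.net.URLDecoder;

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.util.CoreMap;


// Directory and file name of the serialized 7-class model
static String DIR_PATH = "ner";
static String NER_CLASSIFIER_FILE = "english.all.7class.distsim.crf.ser.gz";

// Loaded once when the class is initialized; stays null if loading fails
static AbstractSequenceClassifier<CoreMap> classifier = null;
static {
    try {
        String classifierPath = DIR_PATH + File.separator +
            NER_CLASSIFIER_FILE;
        if (!new File(classifierPath).exists()) {
            System.out.println(classifierPath + " does not exist.");
        } else {
            // No-op for this literal path; only matters for URL-encoded paths
            classifierPath = URLDecoder.decode(classifierPath, "UTF-8");
            classifier = CRFClassifier
                .getClassifierNoExceptions(classifierPath);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}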


2. Annotate the text
// Returns the text with each token tagged inline (for example Chennai/LOCATION),
// or null if the input is empty or the classifier failed to load.
public String annotateText(String query) {

    String text = query;
    String output = null;
    try {
        if (text != null && !text.isEmpty() && classifier != null) {
            output = classifier.classifyToString(text);
            System.out.println(output);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return output;
}

3. Parse the output
// Requires java.util.regex.Matcher and java.util.regex.Pattern
private void parseText(String text) {

    // Labels of the 7-class model; the tag is PERCENT, not PERCENTAGE
    String[] ENTITY_LIST = {
        "person",
        "location",
        "date",
        "organization",
        "time",
        "money",
        "percent"
    };
    try {
        if (text != null) {
            for (String entityValue: ENTITY_LIST) {
                String entity = entityValue.toUpperCase();
                // Match a run of consecutive tokens that all carry the same
                // /ENTITY tag. Tokens are limited to letters, digits, '.' and '%'.
                Pattern pattern = Pattern
                    .compile("([a-zA-Z0-9.%]+(/" + entity +
                        ")[ ]*)*[a-zA-Z0-9.%]+(/" + entity + ")");
                Matcher matcher = pattern.matcher(text);
                while (matcher.find()) {
                    // Strip the tags so only the entity text remains
                    String inputText = matcher.group();
                    inputText = inputText.replaceAll("/" + entity, "");
                    System.out.println(inputText + " : " + entity);
                }
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
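
As an alternative to the regular expression, the tagged string can be split into tokens and consecutive tokens with the same tag grouped together. This is a minimal sketch, assuming the annotated output consists of whitespace-separated word/TAG tokens with O marking non-entity tokens; the class SlashTagParser and the hand-written sample string are hypothetical and were not produced by running the classifier.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class SlashTagParser {

    // Groups consecutive tokens that share a non-"O" tag into one entity and
    // returns "entity text : TAG" strings in order of appearance.
    static List<String> parse(String tagged) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash < 0) {
                continue; // not in word/TAG form, skip it
            }
            String word = token.substring(0, slash);
            String tag = token.substring(slash + 1);
            if (!tag.equals(currentTag)) {
                flush(entities, current, currentTag); // tag changed: close the run
                currentTag = tag;
            }
            if (!tag.equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(word);
            }
        }
        flush(entities, current, currentTag);
        return entities;
    }

    // Emits the entity collected so far, if any, and resets the buffer.
    private static void flush(List<String> entities, StringBuilder current, String tag) {
        if (current.length() > 0) {
            entities.add(current + " : " + tag);
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample in the assumed word/TAG format.
        String tagged = "Dr./O Abdul/PERSON Kalam/PERSON joined/O "
            + "DRDO/ORGANIZATION in/O 1960/DATE ./O";
        for (String entity : parse(tagged)) {
            System.out.println(entity);
        }
    }
}
```

Unlike the per-label regular expression, this approach makes a single pass over the text, reports entities in document order, and is not limited to a fixed set of characters within a token.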

Parsing Output is:

