Open NLP Name Finder Model Training

Named Entity Recognition
The Name Finder detects entities such as persons, locations, organizations, dates, times, and money amounts in text. These entities are detected using a trained model. Each model is specific to a language and an entity type.
Open NLP provides the following pre-trained name finder models:

  1. en-ner-location.bin
  2. en-ner-money.bin
  3. en-ner-organization.bin
  4. en-ner-percentage.bin
  5. en-ner-person.bin
  6. en-ner-date.bin
These are trained on various freely available corpora.

Open NLP Name Finder Training
To detect custom entities with the Name Finder API, we have to train a model for the required entity type and language.

Training Tool
Open NLP provides a command line tool to train the models.

  1. To train the model, we have to provide the data in the Open NLP name finder training format, which is one sentence per line.
  2. We can also use other formats to train the model; in every case, the sentences must be tokenized and contain spans that mark the entities.
  3. Documents within the training file are separated by empty lines, which trigger a reset of the adaptive feature generators.
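The document rule in point 3 can be illustrated with a small standalone sketch. It is not part of Open NLP: the class name TrainingDocuments and its split method are hypothetical helpers. The sketch only groups the lines of a training text into documents at each empty line, which is a convenient way to check a training file before running the trainer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Open NLP: groups training lines into documents.
class TrainingDocuments {

    // Splits the training text into documents; an empty line ends the current document.
    static List<List<String>> split(String trainingText) {
        List<List<String>> documents = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : trainingText.split("\\R", -1)) {
            if (line.trim().isEmpty()) {
                // Empty line: close the current document, if it has any sentences.
                if (!current.isEmpty()) {
                    documents.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(line.trim()); // one sentence per line
            }
        }
        if (!current.isEmpty()) {
            documents.add(current);
        }
        return documents;
    }

    public static void main(String[] args) {
        String text = "First sentence of document one .\n"
                + "Second sentence of document one .\n"
                + "\n"
                + "Only sentence of document two .\n";
        List<List<String>> docs = split(text);
        for (int i = 0; i < docs.size(); i++) {
            System.out.println("Document " + (i + 1) + ": " + docs.get(i).size() + " sentence(s)");
        }
    }
}
```

Consecutive empty lines are treated as a single boundary here; that is a choice made for this sketch, not a statement about how the Open NLP reader behaves.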

Sample training data for a medical entity (en-ner-medical.train file):

<START:disease> Cancer <END> is the uncontrolled growth of abnormal cells anywhere in a body .
These abnormal cells are termed <START:disease> cancer <END> cells , <START:disease> malignant <END> cells , or <START:disease> tumor <END> cells .
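To make the markup concrete, the sketch below reads one annotated line and reports which tokens each entity span covers. It is a simplified standalone reader written for illustration, not Open NLP's own parser; the names AnnotatedSentence and EntitySpan are hypothetical. It assumes tokens and tags are separated by whitespace, as in the training format described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Open NLP: reads one annotated training line.
class AnnotatedSentence {

    // An entity span over token indexes; end is exclusive.
    static final class EntitySpan {
        final String type;
        final int start;
        final int end;

        EntitySpan(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + start + ".." + end + ")";
        }
    }

    final List<String> tokens = new ArrayList<>();
    final List<EntitySpan> spans = new ArrayList<>();

    // Parses a whitespace-tokenized line containing <START:type> ... <END> tags.
    static AnnotatedSentence parse(String line) {
        AnnotatedSentence sentence = new AnnotatedSentence();
        String openType = null; // type of the span currently open, if any
        int openStart = -1;     // token index where that span began
        for (String part : line.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue;
            }
            if (part.startsWith("<START:") && part.endsWith(">")) {
                if (openType != null) {
                    throw new IllegalArgumentException("Nested <START> tag: " + line);
                }
                openType = part.substring("<START:".length(), part.length() - 1);
                openStart = sentence.tokens.size();
            } else if (part.equals("<END>")) {
                if (openType == null) {
                    throw new IllegalArgumentException("<END> without <START>: " + line);
                }
                sentence.spans.add(new EntitySpan(openType, openStart, sentence.tokens.size()));
                openType = null;
            } else {
                sentence.tokens.add(part); // ordinary token
            }
        }
        if (openType != null) {
            throw new IllegalArgumentException("Missing <END> tag: " + line);
        }
        return sentence;
    }

    public static void main(String[] args) {
        AnnotatedSentence s = parse(
                "These abnormal cells are termed <START:disease> cancer <END> cells .");
        for (EntitySpan span : s.spans) {
            System.out.println(span + " -> " + s.tokens.subList(span.start, span.end));
        }
    }
}
```

The tags themselves are not counted as tokens, so span indexes refer to positions in the plain sentence.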

Command to train the model:
$ opennlp TokenNameFinderTrainer -model en-ner-medical.bin -lang en -data en-ner-medical.train -encoding UTF-8

Additionally, it is possible to specify the number of iterations and the cutoff.
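The number of iterations controls how many passes the training algorithm makes. The cutoff is generally understood as the minimum number of times a feature must occur in the training data to be kept. The following sketch illustrates only the cutoff idea on made-up feature strings; the class FeatureCutoff is hypothetical and is not how Open NLP implements it internally.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a hypothetical helper, not Open NLP's trainer code.
class FeatureCutoff {

    // Returns the features seen at least `cutoff` times across all training events.
    static Set<String> retained(List<List<String>> events, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> event : events) {
            for (String feature : event) {
                counts.merge(feature, 1, Integer::sum); // count each occurrence
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

With a higher cutoff, rare features are dropped, which tends to give a smaller model at the cost of ignoring infrequent evidence.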

Training API
We can also train the name finder using the training API. Open NLP recommends using the training API instead of the command line tool when training from within an application. To train the model, we follow three steps:

  • Read the data from the training file (input stream)
  • Train the model using the NameFinderME.train method (training process)
  • Store the model (TokenNameFinderModel) to a file (output stream)
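import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

// Note: these calls follow the older Open NLP API used in this article;
// newer releases may use different signatures.
public class NameFinderTrainer {

    // Encoding of the training file (assumed UTF-8, matching the -encoding option of the command above).
    private static final String CHAR_ENCODING = "UTF-8";

    // Trains a name finder model from the tagged corpus stream and writes it to modelStream.
    public String train(String lang, String entity, InputStream taggedCorpusStream, OutputStream modelStream) {
        Charset charset = Charset.forName(CHAR_ENCODING);
        try {
            // Step 1: read the training data, one annotated sentence per line.
            ObjectStream<String> lineStream = new PlainTextByLineStream(taggedCorpusStream, charset);
            ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
            OutputStream modelOut = null;
            try {
                // Step 2: train the model; null means no additional resources are passed.
                TokenNameFinderModel model = NameFinderME.train(lang, entity, sampleStream, null);
                // Step 3: write the trained model to the output stream.
                modelOut = new BufferedOutputStream(modelStream);
                if (model != null) {
                    model.serialize(modelOut);
                }
                return entity + " model trained successfully";
            } catch (Exception ex) {
                ex.printStackTrace();
            } finally {
                sampleStream.close();
                if (modelOut != null) {
                    modelOut.close();
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return "Something went wrong in the training module.";
    }

    // Convenience overload that takes file paths instead of streams.
    public String train(String lang, String entity, String taggedCorpusFile, String modelFile) {
        try {
            return train(lang, entity, new FileInputStream(taggedCorpusFile), new FileOutputStream(modelFile));
        } catch (Exception e) {
            e.printStackTrace();
        }
        return "Something went wrong in the training module.";
    }
}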

train("en", "medical", "/home/Opennlp/en-ner-custom-medical.train","/home/Opennlp/en-ner-custom-medical.bin");

This creates the model for the disease entity (en-ner-custom-medical.bin) in the /home/Opennlp directory.