Open NLP Name Finder Model Training


By: Sagar Gole | October 30, 2015


Named Entity Recognition

The Name Finder detects entities such as person, location, organization, money, date and time in text. These entities are detected using a trained model, and each model is specific to a language and an entity type.

Open NLP provides the following pre-trained name finder models:

  1. en-ner-location.bin
  2. en-ner-money.bin
  3. en-ner-organization.bin
  4. en-ner-percentage.bin
  5. en-ner-person.bin
  6. en-ner-date.bin

These are trained on various freely available corpora.

 

Open NLP Name Finder Training

 

To detect custom entities with the Name Finder APIs, we have to train a model for the required entity type and language.

 

Training Tool

 

Open NLP provides a command line tool to train the models.

  1. To train a model, we have to provide the data in the Open NLP name finder training format, which is one sentence per line.
  2. Other formats are also supported. The sentence must be tokenized and contain spans which mark the entities.
  3. Documents within the training file are separated by empty lines, which trigger a reset of the adaptive feature generators.
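
The format rules above can be sketched in a few lines of plain Java. This is only an illustration and not part of Open NLP: the class TrainingLineBuilder, its Entity type and both method names are hypothetical. It assumes the tokens are already split, and that the entity ranges are sorted and do not overlap.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an Open NLP class) that renders training lines. */
class TrainingLineBuilder {

    /** A half-open token range [start, end) labelled with an entity type. */
    static final class Entity {
        final int start;
        final int end;
        final String type;

        Entity(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /**
     * Joins the tokens with single spaces and wraps every entity in
     * <START:type> ... <END>. Entities must be sorted and non-overlapping.
     */
    static String annotate(String[] tokens, List<Entity> entities) {
        StringBuilder line = new StringBuilder();
        int next = 0; // index of the next entity that has not been closed yet
        for (int i = 0; i < tokens.length; i++) {
            if (next < entities.size() && entities.get(next).start == i) {
                line.append("<START:").append(entities.get(next).type).append("> ");
            }
            line.append(tokens[i]);
            if (next < entities.size() && entities.get(next).end == i + 1) {
                line.append(" <END>");
                next++;
            }
            if (i < tokens.length - 1) {
                line.append(' ');
            }
        }
        return line.toString();
    }

    /** Writes one sentence per line and one empty line between documents. */
    static String joinDocuments(List<List<String>> documents) {
        StringBuilder out = new StringBuilder();
        for (int d = 0; d < documents.size(); d++) {
            if (d > 0) {
                out.append('\n'); // the empty line marks a document boundary
            }
            for (String sentence : documents.get(d)) {
                out.append(sentence).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Entity> entities = new ArrayList<Entity>();
        entities.add(new Entity(0, 2, "disease"));
        String[] tokens = {"Breast", "cancer", "is", "common", "."};
        System.out.println(annotate(tokens, entities));
    }
}
```

Keeping the tokens and the entity ranges separate until the last step makes it harder to produce a line where a tag is glued to a word, which the whitespace-based training format would not recognize as a tag.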

 

Sample training data for medical entity (en-ner-medical.train file):

<START:disease> Cancer <END> is the uncontrolled growth of abnormal cells anywhere in a body .
These abnormal cells are termed <START:disease> cancer <END> cells , <START:disease> malignant <END> cells , or <START:disease> tumor <END> cells .
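
Before training, it can help to check that every <START:type> tag in the file has a matching <END>. The following is a minimal sketch of such a check in plain Java. The class AnnotationReader and its method are hypothetical names, not Open NLP APIs, and the sketch assumes whitespace-separated tokens and typed tags only (it does not handle a bare <START> tag).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical checker (not an Open NLP class) for annotated training lines. */
class AnnotationReader {

    /**
     * Returns one "type:text" entry for every <START:type> ... <END> span
     * in a whitespace-tokenized line, and fails on unbalanced tags.
     */
    static List<String> entities(String line) {
        List<String> found = new ArrayList<String>();
        String type = null;                    // type of the span currently open
        StringBuilder text = new StringBuilder();
        for (String token : line.trim().split("\\s+")) {
            if (token.startsWith("<START:") && token.endsWith(">")) {
                if (type != null) {
                    throw new IllegalArgumentException("nested tag: " + token);
                }
                type = token.substring("<START:".length(), token.length() - 1);
                text.setLength(0);
            } else if (token.equals("<END>")) {
                if (type == null) {
                    throw new IllegalArgumentException("<END> without <START>");
                }
                found.add(type + ":" + text);
                type = null;
            } else if (type != null) {
                // token inside an open span: collect it as entity text
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(token);
            }
        }
        if (type != null) {
            throw new IllegalArgumentException("unclosed <START:" + type + ">");
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "These abnormal cells are termed <START:disease> cancer <END> cells .";
        System.out.println(entities(line));
    }
}
```

Running this over each line of the training file and reporting the line number of any exception gives a quick check on hand-annotated data before it is passed to the trainer.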

 

Command to train the model:

$ opennlp TokenNameFinderTrainer -model en-ner-medical.bin -lang en -data en-ner-medical.train -encoding UTF-8

Additionally, it is possible to specify the number of iterations and the cutoff.

 

Training API

 

We can also train the name finder using the training API. Open NLP recommends using the training API instead of the command line tool.

To train the model, we have to follow three steps:

  • Read the data from the training file (Input Stream)
  • Train the model using the NameFinderME.train method (Training process)
  • Store the model (TokenNameFinderModel) to a file (Output Stream)
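
// Imports needed by the two methods below.
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

// Encoding of the training file. The original listing does not show this
// constant; UTF-8 is assumed here to match the -encoding flag used earlier.
private static final String CHAR_ENCODING = "UTF-8";

public String train(String lang, String entity,
		InputStream taggedCorpusStream, OutputStream modelStream) {
	Charset charset = Charset.forName(CHAR_ENCODING);
	try {
		// Step 1: read the training file line by line and parse each line
		// into a NameSample.
		ObjectStream<String> lineStream = new PlainTextByLineStream(
				taggedCorpusStream, charset);
		ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
				lineStream);
		TokenNameFinderModel model;
		OutputStream modelOut = null;
		try {
			// Step 2: train the model.
			model = NameFinderME.train(lang, entity, sampleStream, null);
			// Step 3: write the trained model to the output stream.
			modelOut = new BufferedOutputStream(modelStream);
			if (model != null) {
				model.serialize(modelOut);
				// Report success only when a model was actually written.
				return entity + " model trained successfully";
			}
		} catch (Exception ex) {
			ex.printStackTrace();
		} finally {
			sampleStream.close();
			if (modelOut != null) {
				modelOut.close();
			}
		}
	} catch (Exception e) {
		e.printStackTrace();
	}
	return "Something went wrong in the training module.";
}

// Convenience overload that takes file paths instead of streams.
public String train(String lang, String entity, String taggedCorpusFile,
		String modelFile) {
	try {
		return train(lang, entity, new FileInputStream(taggedCorpusFile),
			new FileOutputStream(modelFile));
	} catch (Exception e) {
		e.printStackTrace();
	}
	return "Something went wrong in the training module.";
}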

Call the “train” method of the above code from the main method:

train("en", "medical", "/home/Opennlp/en-ner-custom-medical.train","/home/Opennlp/en-ner-custom-medical.bin");

It will create a model for the disease entity in the /home/Opennlp directory.
