Tokenization Using Apache OpenNLP

In this post we will split a given string into tokens using the Apache OpenNLP library.

This post is a continuation of the previous post, Find Sentences using Apache Open NLP. Refer to that post first for setting up your project with the Apache OpenNLP library. In the previous post we used the SentenceDetectorME, SentenceModel and Span classes of the OpenNLP library.

The Apache OpenNLP library provides three tokenizers: SimpleTokenizer, WhitespaceTokenizer and TokenizerME. The first two split the string using fixed rules (punctuation and whitespace), while TokenizerME uses an already trained model. SimpleTokenizer and WhitespaceTokenizer are obtained through their static INSTANCE field rather than through a constructor; TokenizerME is constructed from a TokenizerModel.

SimpleTokenizer

This tokenizer splits the text into words and separates out punctuation such as exclamation marks, periods and commas. It provides two methods to tokenize a given string, and we will try both.
  • The first step is to get the instance of SimpleTokenizer.
//Get the SimpleTokenizer instance
SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
  • Call the tokenize() method on this object, passing the string to tokenize as input.
//Call the tokenize method
String[] tokens = tokenizer.tokenize(text);

  • Print the tokens. The output will contain the following tokens:
//Print the tokens
System.out.println(Arrays.toString(tokens));
Hi . How
are you ?
Hope everything is
going well .
Welcome to ChatBot
Lesson1 . We
will try to
understand Open Apache
NLP for sentence
detection

  • The other method, tokenizePos(), returns the positions of the tokens instead of the tokens themselves. Its return type is an array of Span objects.
Span[] spans = tokenizer.tokenizePos(text);
for(Span s: spans){
System.out.println(s);
}

The output of the above loop will begin like this:

[0..2)
[2..3)
[4..7)
[8..11)
[12..15)
[15..16)
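
Each span is a half-open interval [start..end) of character offsets into the input, so the token text can be recovered with substring. The following is a minimal plain-Java sketch (it does not depend on OpenNLP) that uses the six offsets printed above. The class name SpanOffsetsDemo, the tokensAt helper and the sample text are assumptions made for illustration; the sample text is inferred from the printed tokens. With OpenNLP's own Span objects, the same idea would be text.substring(s.getStart(), s.getEnd()).

```java
import java.util.Arrays;

public class SpanOffsetsDemo {

    // Hypothetical helper: cuts one token per {start, end} pair out of the text.
    public static String[] tokensAt(String text, int[][] spans) {
        String[] tokens = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            // The end offset is exclusive, which matches String.substring(begin, end)
            tokens[i] = text.substring(spans[i][0], spans[i][1]);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed sample: the start of the input, inferred from the printed output
        String text = "Hi. How are you?";
        // The six spans shown in the output above
        int[][] spans = { {0, 2}, {2, 3}, {4, 7}, {8, 11}, {12, 15}, {15, 16} };
        // Prints [Hi, ., How, are, you, ?]
        System.out.println(Arrays.toString(tokensAt(text, spans)));
    }
}
```

Note that offset 3 (the space after the period) is not covered by any span, because whitespace is not a token.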

WhitespaceTokenizer

This tokenizer splits the sentence on whitespace. It also has two methods: tokenize(), which returns the tokens as a String array, and tokenizePos(), which returns an array of Span objects.

Here is the code to get the WhitespaceTokenizer instance and print the tokens and their positions.
//Get the WhitespaceTokenizer instance
WhitespaceTokenizer tokenizer = WhitespaceTokenizer.INSTANCE;

//Call the tokenize method, this will split based on spaces
String[] tokens = tokenizer.tokenize(text);

//Print the tokens
for(String s: tokens){
System.out.println(s);
}

Span[] spans = tokenizer.tokenizePos(text);
for(Span s: spans){
System.out.println(s);
}

The output will look like this:
Hi. How are
you? Hope everything
is going well.
Welcome to ChatBot
Lesson1. We will
try to understand
Open Apache NLP
for sentence detection

[0..3) [4..7) [8..11)
[12..16) [17..21) [22..32)
[33..35) [36..41) [42..47)
[48..55) [56..58) [59..66)
[67..75) [76..78) [79..83)
[84..87) [88..90) [91..101)
[102..106) [107..113) [114..117)
[118..121) [122..130) [131..140)

The implementation is similar to the SimpleTokenizer.
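
To make the offsets above concrete, here is a minimal plain-Java sketch of whitespace splitting with positions. It is not OpenNLP's implementation, only an illustration of the idea; the class name WhitespaceSpansSketch and the whitespaceSpans helper are hypothetical, and the sample text is reconstructed from the tokens and offsets shown in the output.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSpansSketch {

    // Returns {start, end} offset pairs (end exclusive), one per run of non-whitespace characters.
    public static List<int[]> whitespaceSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < text.length(); i++) {
            boolean ws = Character.isWhitespace(text.charAt(i));
            if (!ws && start < 0) {
                start = i;                         // a token begins here
            } else if (ws && start >= 0) {
                spans.add(new int[] {start, i});   // the token ended just before i
                start = -1;
            }
        }
        if (start >= 0) {
            spans.add(new int[] {start, text.length()}); // token that runs to the end
        }
        return spans;
    }

    public static void main(String[] args) {
        // Assumed sample text, reconstructed from the output shown in this post
        String text = "Hi. How are you? Hope everything is going well. "
                + "Welcome to ChatBot Lesson1. We will try to understand "
                + "Open Apache NLP for sentence detection";
        for (int[] s : whitespaceSpans(text)) {
            System.out.println("[" + s[0] + ".." + s[1] + ") " + text.substring(s[0], s[1]));
        }
    }
}
```

Because punctuation is not whitespace, it stays attached to the preceding word, which is why "Hi." covers [0..3) here instead of being split into two tokens.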

TokenizerME

This class creates tokens based on the trained model it is given. The model is first loaded into a TokenizerModel, which is then passed to the TokenizerME constructor.
  • Load the trained model into a TokenizerModel object. I have used an existing pre-trained English model (en-token.bin) provided by the Apache OpenNLP project.
//Load the token model; try-with-resources closes the stream afterwards
TokenizerModel model;
try (InputStream is = new FileInputStream("src/main/resources/models/en-token.bin")) {
    model = new TokenizerModel(is);
}

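  • This class has three methods: tokenize(), tokenizePos() and getTokenProbabilities(). Create a TokenizerME from the model, and then they can be used like this:
//Create the tokenizer from the loaded model
TokenizerME tokenizerME = new TokenizerME(model);

//Print the tokens
for(String token: tokenizerME.tokenize(text)){
System.out.println(token);
}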

//Get the position of the tokens
Span[] spans = tokenizerME.tokenizePos(text);
for(Span span: spans){
System.out.println(span);
}

//Get the token probabilities for the most recent tokenize()/tokenizePos() call
double[] prob = tokenizerME.getTokenProbabilities();
for(double s: prob){
System.out.println(s);
}

The tokens and positions printed by the above code will look like this:

Hi . How
are you ?
Hope everything is
going well .
Welcome to ChatBot
Lesson1 . We
will try to
understand Open Apache
NLP for sentence
detection


[0..2) [2..3) [4..7)
[8..11) [12..15) [15..16)
[17..21) [22..32) [33..35)
[36..41) [42..46) [46..47)
[48..55) [56..58) [59..66)
[67..74) [74..75) [76..78)
[79..83) [84..87) [88..90)
[91..101) [102..106) [107..113)
[114..117) [118..121) [122..130)
[131..140)
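
The probabilities are returned as one value per token, in the same order as the tokens of the most recent call, so the two arrays can be walked in parallel. As a rough sketch of how they might be used, the plain-Java example below flags tokens whose probability falls under a threshold. The class name TokenConfidenceSketch and the lowConfidence helper are hypothetical, and the probability values are made-up placeholders, not output from a real model.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenConfidenceSketch {

    // Hypothetical helper: returns the tokens whose probability is below the threshold.
    public static List<String> lowConfidence(String[] tokens, double[] probs, double threshold) {
        if (tokens.length != probs.length) {
            throw new IllegalArgumentException("expected one probability per token");
        }
        List<String> flagged = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (probs[i] < threshold) {
                flagged.add(tokens[i]);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // In real use these would come from tokenizerME.tokenize(text)
        // followed by tokenizerME.getTokenProbabilities()
        String[] tokens = {"Hi", ".", "How", "are", "you", "?"};
        // Placeholder values for illustration only, not produced by a model
        double[] probs = {0.98, 0.91, 1.0, 1.0, 0.99, 0.72};
        // Prints [., ?]
        System.out.println(lowConfidence(tokens, probs, 0.95));
    }
}
```
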
You can find the code in the GitHub repository.

This concludes tokenization using the Apache OpenNLP library in Java, which is another step in using natural language processing. Please share your feedback and suggestions in the comments.