We know that computers understand programming languages, but how about making them understand human language, the language that you and I speak? Natural Language Processing (NLP)...
By: Shelly Singh | July 16, 2015
There are several open APIs that provide text analysis and content discovery services. We conducted an informal study of some of the free services to identify their capabilities and to understand new developments in the area of semantic analysis. I had last looked at these tools in 2010. At that time they mainly offered extraction of four generic Named Entity types: Person, Organization, Place and Date. The industry has come a long way since then. This is evident in the much richer named entity extraction, detailed sentiment analysis and precise classification. The tools also add another semantic angle by linking to Linked Open Data (LOD) sources on the web such as Wikipedia, DBPedia and Yago. To top it off, most of the tools have amazing visualizations.
Here are some details of interesting features we came across during the study.
Named Entity Types (from 4 to 40)
Earlier, extraction covered four generic types: Person, Organization, Date and Place. Give or take a couple, that was pretty much what we got from NER, at least from free NER. Now many more entity types are addressed. They are still generic types, but far more detailed: JobTitle, Position, Role, Measurement, Quantity, Facility, Building, College, University and Company, to name a few. AlchemyAPI states on its website that it extracts over 100 entity types in all! The nice thing is that this is done with a good amount of precision. The tools have more developed dictionaries and more sophisticated pattern recognition, which together give high-quality NER. In addition, more and more machine-readable dictionaries are being published by government and semi-government agencies: dictionaries of company names, organization names, drug brand names and so on. All of these help the extraction of entities.
On several open APIs we see specialized entity types such as Military Organization, Crime Report, Case, Drug, Chemical and Protein. Alchemy, OpenCalais and Cogito all have very exhaustive NER that goes well beyond the generic types.
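To make the "dictionaries plus pattern recognition" idea concrete, here is a minimal Python sketch. It is not how AlchemyAPI, OpenCalais or Cogito work internally (their methods are not published in this article); the dictionary entries, the unit list and the function name `extract_entities` are all made up for illustration.

```python
import re

# Hypothetical toy dictionaries; a real service would use far larger,
# curated lists (company registers, drug brand names, and so on).
GAZETTEER = {
    "Company": ["Acme Corp", "Globex"],
    "University": ["Stanford University"],
    "JobTitle": ["Chief Executive Officer", "professor"],
}

# Hypothetical pattern for one entity type: a number followed by a unit.
MEASUREMENT = re.compile(r"\b\d+(?:\.\d+)?\s?(?:kg|km|mg|cm)\b")


def extract_entities(text):
    """Return (entity_type, surface_text) pairs in order of appearance."""
    found = []
    # Dictionary lookup: case-insensitive whole-word match for every name.
    for entity_type, names in GAZETTEER.items():
        for name in names:
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((match.start(), entity_type, match.group()))
    # Pattern recognition: entities that no dictionary could enumerate.
    for match in MEASUREMENT.finditer(text):
        found.append((match.start(), "Measurement", match.group()))
    return [(etype, surface) for _, etype, surface in sorted(found)]
```

Calling `extract_entities("The professor at Stanford University joined Acme Corp.")` would tag the job title, the university and the company. Adding a new entity type is then mostly a matter of adding a new dictionary or pattern, which may be part of why published dictionaries help so much.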
The semantic analysis tools (I am not calling it text analysis any more; the analysis is exceedingly meaningful, very SEMANTIC) also identify keywords like magic. These keywords are not simple words but phrases, arrived at using collocations and background-foreground methods. Moreover, only relevant and truly reflective words and phrases make it to the list of keywords. Keywords are selected by weighing several factors: they appear often in the text, they are not so common in general that they stop being differentiating, and they still carry significant meaning. The algorithms seem to be really sophisticated.
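One simple reading of a background-foreground method is to compare how often a term occurs in the document (foreground) with how often it occurs in general text (background). The sketch below is an assumption about the general technique, not the algorithm of any tool in the study; the function names, the add-one smoothing and the `min_count` threshold are illustrative choices.

```python
import re
from collections import Counter


def terms(text):
    """Lowercased single words plus adjacent word pairs (candidate phrases)."""
    words = re.findall(r"[a-z']+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def keyword_scores(foreground, background, min_count=2):
    """Rank foreground terms by how much more frequent they are than in the background."""
    fg = Counter(terms(foreground))
    bg = Counter(terms(background))
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    scores = {}
    for term, count in fg.items():
        if count < min_count:
            continue  # too rare in the document to be reflective of it
        fg_rate = count / fg_total
        # Add-one smoothing so terms unseen in the background do not divide by zero.
        bg_rate = (bg[term] + 1) / (bg_total + 1)
        scores[term] = fg_rate / bg_rate
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With a product review as foreground and unrelated general text as background, a phrase such as "scratch resistant" scores high because it is frequent in the review and absent from the background, while "the" scores low because it is common everywhere.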
Most tools also identify the topics that the text pertains to. This is not trivial, as the topics themselves often do not appear in the original text. The tools figure out the topics by considering multiple factors. One mechanism is to compare the text with other texts whose topics are known (say, text from Wikipedia) and use similarity to identify topics. Tools generally couple this with a mapping of the text onto a conceptual space to gain precision. Different tools use different names for this feature: some call it themes, some concepts, some topics. It is a very valuable feature and definitely expedites information retrieval.
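The similarity mechanism can be sketched in a few lines of Python. This is an illustrative assumption, not the implementation of any tool mentioned here: the reference texts are tiny made-up word lists standing in for something like Wikipedia articles, and plain bag-of-words cosine similarity stands in for whatever richer representation the tools use.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Word counts for a lowercased text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_topics(text, reference_texts):
    """Score each known topic by similarity between its reference text and the input."""
    query = bag_of_words(text)
    scored = [
        (topic, cosine(query, bag_of_words(reference)))
        for topic, reference in reference_texts.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical reference texts keyed by topic label.
REFERENCE_TEXTS = {
    "Finance": "bank interest loan credit market stock investor",
    "Medicine": "doctor patient drug treatment hospital disease clinical",
}
```

For the input "The investor asked the bank about loan interest.", `rank_topics` puts "Finance" first even though the word "finance" never occurs in the input, which is the point made above about topics not appearing in the original text.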
Almost all tools provide a classification of the content, in some cases a taxonomic classification that drills down several levels. This feature works on almost all tools, and with very high precision. Can we say the classification problem is solved?
Sentiment analysis was only budding in 2010. Now we see more precise and thorough sentiment analysis. Quite often the degree or polarity of a sentiment depends on the context or the domain, so the word alone isn’t enough to make a decision. For example, ‘resistant’ may by itself carry a negative sentiment, but as part of the phrase ‘scratch resistant’ it becomes positive. Tools like AlchemyAPI and Semantria provide not only document-level sentiment but also target sentiment, which is the sentiment expressed towards an entity or a keyword. With this, one can dig out sentiment for specific entities (people or products) in a very large text of mixed sentiment. If you think this is new, there is more. Cogito provides emotion analysis! Emotions like fear, success, defeat, pride and modernity are all extracted by Cogito.
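Both ideas, phrase context overriding a word's polarity and sentiment scoped to a target, can be illustrated with a toy lexicon. The Python sketch below is an assumption for illustration only and does not reflect how AlchemyAPI or Semantria compute sentiment; the lexicon entries and function names are made up, and targets are limited to single words.

```python
import re

# Hypothetical toy lexicons; real tools use much larger, domain-aware resources.
WORD_POLARITY = {"resistant": -1, "great": 1, "poor": -1, "slow": -1, "love": 1}
PHRASE_POLARITY = {("scratch", "resistant"): 1}  # a phrase overrides its words


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def score_tokens(tokens):
    """Sum polarities, preferring a two-word phrase match over single words."""
    score = 0
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASE_POLARITY:
            score += PHRASE_POLARITY[pair]
            i += 2  # skip both words so 'resistant' is not counted again
        else:
            score += WORD_POLARITY.get(tokens[i], 0)
            i += 1
    return score


def document_sentiment(text):
    """Sentiment of the whole text."""
    return score_tokens(tokenize(text))


def target_sentiment(text, target):
    """Sentiment summed over only the sentences that mention the target word."""
    total = 0
    for sentence in re.split(r"[.!?]+", text):
        tokens = tokenize(sentence)
        if target.lower() in tokens:
            total += score_tokens(tokens)
    return total
```

For the review "The screen is scratch resistant and great. The battery is poor and slow.", the document-level score cancels out to neutral, while the target scores separate a positive screen from a negative battery. Scoping by sentence is the crudest possible choice; it stands in here for whatever entity-level analysis the tools actually perform.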
The features provided by semantic analysis tools are certainly very advanced. Moreover, the tools support a good number of languages. And yes, they are also coping with an even more challenging language: the language of Twitter! This space has certainly matured. My favourites from the entire lot of features are ‘emotion extraction’ and, of course, ‘linking to open data’ for its usefulness.