Named Entity Recognition (NER)

Named entity recognition (NER) is often the first step in information extraction: it seeks to locate and classify named entities in text into pre-defined categories. In any text document, particular terms represent specific entities that are more informative and have a unique context. These terms are known as named entities, and they refer to real-world objects such as people, places, and organizations, which are often denoted by proper names. A naive approach to finding them would be to look at the noun phrases in text documents.

Image: NER does not evaluate the truth of statements (credit: Christopher Marshall)

How does NER work?

There are some basic steps that every NER model takes, and that is what we are going to talk about now.

The first step in named entity recognition is preparing the data to be parsed. Preparing data for NLP is a fairly long and involved journey; we are talking about building a pipeline that can do the following tasks for us:

  • Sentence boundary segmentation
  • Word tokenization
  • Part-of-speech tagging
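As an illustration, the pipeline above can be sketched with nothing but the standard library. This is a toy version: the regexes and the capitalization-based tagger are deliberate simplifications, and a real pipeline would use a library such as spaCy or NLTK for each step.

```python
import re

def segment_sentences(text):
    # Naive sentence boundary segmentation: split after ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Naive word tokenization: words and punctuation become separate tokens
    return re.findall(r"\w+|[^\w\s]", sentence)

def pos_tag_stub(tokens):
    # Stand-in POS tagger: capitalized tokens get NNP, everything else NN
    return [(t, "NNP" if t[:1].isupper() else "NN") for t in tokens]

text = "Tim Cook runs Apple. The company is based in Cupertino."
for sentence in segment_sentences(text):
    print(pos_tag_stub(tokenize(sentence)))
```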

The second step in named entity recognition is searching the tokens from the previous step against a knowledge base. The search can also be backed by deep learning models; the advantage of that approach is better results on new words that were not seen before (as opposed to a pure ontology lookup, which would return no results in this situation).
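A minimal sketch of the knowledge-base side of this step (the entries and labels below are hypothetical; a production system would pair this lookup with a learned model to handle unseen words):

```python
# Hypothetical mini knowledge base mapping surface forms to entity types
knowledge_base = {
    "google": "ORGANIZATION",
    "london": "LOCATION",
    "mark": "PERSON",
}

def lookup(token):
    # Exact-match search against the knowledge base; returns None for
    # unseen words -- the failure case a learned model handles better
    return knowledge_base.get(token.lower())

tokens = ["Mark", "works", "at", "Google", "in", "London"]
print([(t, lookup(t)) for t in tokens])
```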

The third step in named entity recognition applies when we get more than one result for a single search. In that situation we need a statistical model to choose the best entity for our input.
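A toy sketch of such disambiguation, scoring each candidate entity type by how many context words it shares with a set of cue words (the candidates and cue words here are invented for illustration; real systems use trained statistical models):

```python
# Two hypothetical candidate readings for the mention "Amazon"
candidates = {
    "ORGANIZATION": {"company", "retail", "stock", "ceo"},
    "LOCATION": {"river", "rainforest", "brazil", "jungle"},
}

def disambiguate(context_words, candidates):
    # Score each candidate by how many context words it shares,
    # then pick the highest-scoring entity type
    scores = {label: len(cues & set(context_words))
              for label, cues in candidates.items()}
    return max(scores, key=scores.get)

context = ["the", "company", "reported", "strong", "stock", "growth"]
print(disambiguate(context, candidates))  # -> ORGANIZATION
```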

Industry Use Cases Of NER

With NER, you can, at a glance, understand the subject or theme of a body of text and quickly group texts based on their relevancy or similarity. Some notable NER use cases include:

Human resources: Speed up the hiring process by summarizing applicant CVs; improve internal workflows by categorizing employee complaints and questions.

Customer support: Improve response times by categorizing user requests, complaints, and questions, then filtering based on priority keywords.

Search and recommendation: Improve the speed and relevance of search results and recommendations by summarizing descriptive text, reviews, and discussions.

Content classification: Browse content more easily and gain insights into trends by identifying the subjects and themes of blog posts and news articles.

Health care: Improve patient care standards and reduce workloads by extracting essential information from lab reports.

There are a number of excellent open-source libraries that can get you going, including NLTK, spaCy, and Stanford NER. I have touched upon each of them below, and I will also discuss the pros and cons of each library later on.

1) Named Entity Recognition with spaCy

One of the best NER libraries in the industry at the moment is spaCy. It provides functionality for both basic and advanced NLP tasks.

Install spaCy and download the English model

Next is the implementation:

The output is a list of (text, label) pairs for each entity spaCy finds.

2) Named Entity Recognition with NLTK

NLTK has many modules/features built-in like tokenization, POS chunking

etc. There are two major options with NLTK’s named entity recognition:

either recognize all named entities, or recognize named entities as their

respective type, like people, places, locations, etc.

Implementation example:

Depending on your goals, you may use the binary option as you see fit. Here is an example of the types of named entities you can get if you set binary to false:

The Stanford NER library is very similar to NLTK and can be used alongside it. We can import it with from nltk.tag import StanfordNERTagger.

We can implement it as shown below:
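A sketch of how it could be wired up. Note that the tagger needs Java plus the Stanford NER jar and a trained model downloaded separately, so the file names in the usage comment are placeholders, not real paths:

```python
from nltk.tag import StanfordNERTagger

def stanford_ner(tokens, model_path, jar_path):
    # Requires Java plus the Stanford NER jar and a trained CRF model,
    # both downloaded separately from the Stanford NLP site
    tagger = StanfordNERTagger(model_path, jar_path, encoding="utf-8")
    return tagger.tag(tokens)

# Example usage (paths are placeholders):
# stanford_ner("Mark works at Google in London .".split(),
#              "english.all.3class.distsim.crf.ser.gz",
#              "stanford-ner.jar")
```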

Comparative analysis of modules

spaCy vs Stanford NER

| spaCy | Stanford NER |
| --- | --- |
| It is faster compared to other libraries. | It is slower. |
| It has three models (en_core_web_sm, en_core_web_md, en_core_web_lg) with different accuracy levels. | It has good command of the English language. |
| Training custom models with spaCy has an advantage: spaCy reads a sentence into a doc object that stores POS tags, dependencies, entities, etc., so the same named entity model can be used anywhere without loading other spaCy modules for those tasks. | Different libraries or functions need to be loaded or built to perform different tasks on a sentence. |

spaCy vs NLTK

| spaCy | NLTK |
| --- | --- |
| It is small in size and faster than NLTK. | It is a huge library with more functions. |
| Returns a document object whose words and sentences are objects themselves. | Takes strings as input and returns strings as output. |
| Comes with token dependencies: within a sentence it identifies which words are interconnected and which words enhance the meaning of others. | Doesn't provide any such dependency check. |
| Constructs a syntactic tree for each sentence. | Attempts to split the text into sentences. |
| Its sentence tokenizer doesn't match the performance of NLTK's. | Its sentence tokenizer performs better than spaCy's. |

spaCy pre-trained models

| | en_core_web_sm | en_core_web_md | en_core_web_lg |
| --- | --- | --- | --- |
| Size | 11 MB | 91 MB | 789 MB |
| Performance over English | 85.21%-85.55% | 86.17%-86.25% | 86.36%-86.55% |
| Syntax accuracy | 89.71%-97.05% | 90.09%-97.15% | 90.17%-97.22% |

Summary

In this article, we learned about NER (named entity recognition), its use cases in various industries, and several pre-existing libraries for performing NER. In the next part of this blog we will explain how to re-train, fine-tune, and apply transfer learning to pre-trained models.