Named entity recognition (NER) is often the first step toward information extraction: it seeks to locate and classify named entities in text into pre-defined categories. In any text document, there are particular terms that represent specific entities, are more informative, and have a unique context. These terms are known as named entities; more specifically, they are terms that refer to real-world objects such as people, places and organizations, and they are often denoted by proper names. A naive approach would be to find these by looking at the noun phrases in a text.
How does NER work?
There are some basic steps that every NER model should take, and that is what we are going to talk about now.
The first step in Named Entity Recognition is preparing the data to be parsed. Preparing data for NLP is quite a long and involved journey: we need to build a pipeline that can perform the following tasks for us:
- Sentence boundary segmentation
- Word tokenization
- Part of Speech tagging
The second step in Named Entity Recognition is searching the tokens we got from the previous step against a knowledge base. The search can also be made using deep learning models. The advantage of this approach is that it gets better results when identifying new words that were not seen before (as opposed to an ontology lookup, where we would get no results in this case).
The third step in Named Entity Recognition happens when we get more than one result for a single search. In such a situation we need a statistical model to correctly choose the best entity for our input.
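The disambiguation step can be illustrated with a toy sketch. The candidate entities and scores below are entirely made up; in practice the scores would come from a statistical model conditioned on the surrounding context:

```python
# Toy illustration: a knowledge-base lookup for the mention "Paris"
# returns several candidates, each with a (hypothetical) model score.
candidates = {
    "Paris": [
        {"entity": "Paris, France", "type": "LOCATION", "score": 0.91},
        {"entity": "Paris Hilton",  "type": "PERSON",   "score": 0.07},
        {"entity": "Paris, Texas",  "type": "LOCATION", "score": 0.02},
    ]
}

def best_entity(mention):
    """Keep the highest-scoring candidate for an ambiguous mention."""
    return max(candidates[mention], key=lambda c: c["score"])

print(best_entity("Paris"))  # → {'entity': 'Paris, France', ...}
```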
Industry Use Cases Of NER
With NER, you can, at a glance, understand the subject or theme of a body
of text and quickly group texts based on their relevancy or similarity. Some
notable NER use cases include:
- Human resources: Speed up the hiring process by summarizing applicant CVs; improve internal workflows by categorizing employee complaints and questions.
- Customer support: Improve response times by categorizing user requests, complaints and questions, then filtering based on priority keywords.
- Search and recommendation: Improve the speed and relevance of search results and recommendations by summarizing descriptive text, reviews and other content.
- Content classification: Peruse content more easily and gain insights into trends by identifying the subjects and themes of blog posts and news articles.
- Health care: Improve patient care standards and reduce workloads by extracting essential information from lab reports.
There are a number of excellent open-source libraries that can get you started with NER. I will briefly cover each of them below, and also talk about the pros and cons of each library later on.
1) Named Entity Recognition with spaCy
One of the best NER libraries in the industry at the moment is spaCy. It offers lots of functionality for basic and advanced NLP tasks.
Install spaCy and download the English model
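A minimal setup, assuming pip is available (the model name `en_core_web_sm` is spaCy's small English pipeline; the medium and large variants are installed the same way):

```shell
pip install spacy
python -m spacy download en_core_web_sm
```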
Next is the implementation itself: we run the model over a sentence and print the entities it finds.
2) Named Entity Recognition with NLTK
NLTK has many modules and features built in, such as tokenization, part-of-speech tagging and chunking. There are two major options with NLTK’s named entity recognition: either recognize all named entities without distinguishing them, or recognize named entities along with their respective type, such as people, places and organizations.
Depending on your goals, you may use the binary option as you see fit. Here are examples of the types of named entities you can get with binary set to False: ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, PERCENT, FACILITY and GPE.
The Stanford NER library is very similar to the NLTK library and can be used together with NLTK. We can import its wrapper with from nltk.tag import StanfordNERTagger and implement it as shown below:
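A sketch of the NLTK wrapper around Stanford NER. The jar and model paths below are assumptions — point them at your own download of the Stanford NER distribution (it also requires a Java runtime):

```python
import os
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# Hypothetical local paths -- adjust to wherever you unpacked Stanford NER
MODEL = "stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz"
JAR = "stanford-ner/stanford-ner.jar"

def stanford_entities(text):
    """Tag a sentence with the Stanford NER model via NLTK's wrapper."""
    tagger = StanfordNERTagger(MODEL, JAR, encoding="utf-8")
    # Returns a list of (token, label) pairs, e.g. ("London", "LOCATION")
    return tagger.tag(word_tokenize(text))

if os.path.exists(JAR):  # only run when the jars are actually installed
    print(stanford_entities("Mark Smith works at Google in London."))
```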
Comparative analysis of modules
spaCy vs Stanford NER

| spaCy | Stanford NER |
| --- | --- |
| It is faster as compared to other libraries. | It is slower. |
| It has three English models, en_core_web_sm, en_core_web_md and en_core_web_lg, with different accuracy levels. | It has good command over the English language. |
| Training custom models with spaCy has an advantage: spaCy reads a sentence and converts it to a Doc object, which stores information such as POS tags, dependencies and entities. Hence the same named entity model can be used anywhere without loading any other spaCy module for the respective tasks. | Different libraries or functions need to be loaded or built to perform different tasks over a sentence. |
spaCy vs NLTK

| spaCy | NLTK |
| --- | --- |
| It is small in size and faster than NLTK. | It is a huge library with more functions. |
| spaCy returns a Doc object whose words and sentences are objects themselves. | Takes strings as input and returns strings as output. |
| It comes with dependency parsing, which identifies which words in a sentence are interconnected and which words enhance the meaning of other words. | NLTK doesn’t provide any such dependency check. |
| spaCy constructs a syntactic tree for each sentence. | NLTK attempts to split the text into sentences. |
| Its sentence tokenizer doesn’t match the performance of NLTK’s. | Its sentence tokenizer performs better than spaCy’s. |
spaCy pre-trained models

| | en_core_web_sm | en_core_web_md | en_core_web_lg |
| --- | --- | --- | --- |
| Performance over English language | 85.21%–85.55% | 86.17%–86.25% | 86.36%–86.55% |
In this article, we have learned about NER (Named Entity Recognition), its use cases in various industries, and several pre-existing libraries for performing NER. In the next part of this blog we will explain how to re-train, fine-tune and apply transfer learning to pre-trained models.