Domain-specific Named Entity Recognition (NER)

Your entities deserve to be recognized

What broadly distinguishes one domain from another? It is the named entities. For any useful NLP/NLU task, it is essential to pick out the entities in the text you are dealing with.

However, this task is quite domain-specific. No matter how good a pre-trained entity recognizer is at recognizing entities, it cannot serve every domain. Let us explore how to use spaCy v3 to train a custom NER pipeline.

The different approaches to entity recognition

spaCy has an NER component that is native to its pretrained ...-md and ...-lg models. However, these models are general-purpose and not domain-specific.

spaCy also has a rule-based entity recognizer called the entity ruler, which identifies and labels entities based on exact matches. We have to provide the mapping from each entity matching expression to its label. This works very well for closed domains where the named entities form a finite set. The flip side is that it fails on entities it has not seen, which have to be recognized from the context. Check out this case study to know more.
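As a rough illustration of the entity ruler, here is a minimal sketch on a blank English pipeline, so no pretrained model is needed. The labels CAR_MAKER and CAR_MODEL, the patterns, and the sample sentence are made up for this example.

```python
import spacy

# Blank English pipeline: tokenizer only, no pretrained components required.
nlp = spacy.blank("en")

# Add the rule-based entity ruler and hand it our pattern-to-label mapping.
# The labels and patterns below are hypothetical, chosen for illustration.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Phrase pattern: matches the exact text "Tesla".
    {"label": "CAR_MAKER", "pattern": "Tesla"},
    # Token pattern: matches "Model Y" regardless of letter case.
    {"label": "CAR_MODEL", "pattern": [{"LOWER": "model"}, {"LOWER": "y"}]},
])

doc = nlp("Tesla cut the price of the Model Y, while Rivian held firm.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Note that "Rivian" is not labeled, because no pattern mentions it. This is the limitation described above: the ruler only finds what it has been told about.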

The other way is to train a custom spaCy NER pipeline that recognizes named entities based on what it has learned from the training data. This way, it can also recognize entities it hasn’t seen, making it well suited for open domains such as automotive news.

Training format

Keeping in mind that some features have changed in spaCy v3, here are the steps to train a custom NER model for your domain.

  • Generate your dataset: If you have ready-made datasets for your domain, by all means use them. However, if all you have is a mass of text containing your entities, you have to convert it to a dataset format that spaCy can understand. You will also need to split your dataset into training, validation, and testing partitions.

  • Convert to spaCy binary format: The datasets need to be converted to spaCy's binary format (.spacy files) for training.

  • Generate the configuration file: spaCy requires a configuration file for the training process. Most of the parameters can be left at their defaults.

  • Execute the training: Training runs from the command line via the spacy train command. Its CLI options let you save the trained NER model to a specified location. As training progresses, the results, including precision and recall, are printed to the console.

  • Test your model: You can now load your custom NER model into your spaCy pipeline and put it through its paces using your test dataset.
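The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the annotated sentences, the labels, the 60/20/20 split, the helper names split and to_docbin, and the corpus and output paths are all hypothetical. The configuration, training, and evaluation commands appear as comments because they are run from the shell.

```python
import random
from pathlib import Path

import spacy
from spacy.tokens import DocBin

# Hypothetical annotations: (text, [(start_char, end_char, label), ...]).
ANNOTATED = [
    ("Tesla unveiled the Model Y in 2019.", [(0, 5, "CAR_MAKER"), (19, 26, "CAR_MODEL")]),
    ("Ford recalled the Mustang last week.", [(0, 4, "CAR_MAKER"), (18, 25, "CAR_MODEL")]),
    ("Toyota will build the Corolla in Mississippi.", [(0, 6, "CAR_MAKER"), (22, 29, "CAR_MODEL")]),
    ("Rivian delayed deliveries again.", [(0, 6, "CAR_MAKER")]),
    ("Honda refreshed the Civic for 2025.", [(0, 5, "CAR_MAKER"), (20, 25, "CAR_MODEL")]),
]


def split(examples, train=0.6, dev=0.2, seed=0):
    """Shuffle and split examples into train, dev (validation), and test lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_dev = round(len(items) * dev)
    return items[:n_train], items[n_train:n_train + n_dev], items[n_train + n_dev:]


def to_docbin(nlp, examples):
    """Convert (text, spans) pairs into a DocBin, spaCy's binary corpus format."""
    doc_bin = DocBin()
    for text, spans in examples:
        doc = nlp.make_doc(text)  # tokenize only
        ents = []
        for start, end, label in spans:
            span = doc.char_span(start, end, label=label)
            if span is not None:  # None means the offsets do not align with tokens
                ents.append(span)
        doc.ents = ents
        doc_bin.add(doc)
    return doc_bin


if __name__ == "__main__":
    nlp = spacy.blank("en")
    out_dir = Path("corpus")  # hypothetical output directory
    out_dir.mkdir(exist_ok=True)
    train_set, dev_set, test_set = split(ANNOTATED)
    for name, part in (("train", train_set), ("dev", dev_set), ("test", test_set)):
        to_docbin(nlp, part).to_disk(out_dir / f"{name}.spacy")

# Remaining steps, run from the shell (file names are assumptions):
#   python -m spacy init config config.cfg --lang en --pipeline ner
#   python -m spacy train config.cfg --output ./output \
#       --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy
#   python -m spacy evaluate ./output/model-best corpus/test.spacy
#
# Afterwards the trained pipeline can be loaded like any other:
#   nlp = spacy.load("output/model-best")
#   print([(ent.text, ent.label_) for ent in nlp("some domain text").ents])
```

A real dataset would need far more than five sentences. The toy data here only shows the expected shape of the annotations and the conversion to .spacy files.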

For a quick start and demo code, check out this GitHub repo.