Some observations from implementing Named Entity Recognition algorithms in the real world.

04.04.21 | BY Simon Mahony


We spend a large part of our time working with text data, and as a result we often find ourselves implementing some kind of NLP tool, either as part of our product or on behalf of customers. One of those tools is Named Entity Recognition (NER) – a process for identifying entities, e.g. people and places, in text. There is a wide range of possible use cases for NER, but ours are usually focused on investing, where we have used it to augment search tools, automatically connect individuals and companies, and screen huge volumes of text for known bad actors. While we work with NER regularly, and with text data every day, we don't consider ourselves to be experts in NER specifically – our interest is practical in nature, and as such our experience is in implementing other people's models rather than developing new ones from scratch.

This short piece contains some of the lessons we've learned. It's primarily aimed at the potential NER user without much experience of the technology, but we hope there will be something here too for the algorithm developers and vendors, who may find it useful to see the experiences of a more narrowly commercially focused organisation.

What is NER?

Before we go any further, for those who aren't familiar with NER: it is, roughly speaking, the process of going from this:

Elon Musk invites Vladimir Putin to connect on the Clubhouse app.

to this:

(Elon Musk)[PERSON] invites (Vladimir Putin)[PERSON] to connect on the (Clubhouse)[ORGANISATION] app.

As you can see, there are basically two components to NER: (1) identifying the strings (i.e. words) which are entities, and (2) identifying the type of each entity. The main benefit of the technology is that, because it is automated, you can process volumes of text that would be completely unfeasible for a human, which makes possible tools and applications that previously weren't.
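
To make that concrete, here is a minimal sketch using the open-source spaCy library – just one of many options, and not necessarily one of the tools discussed below:

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # small, general-purpose English model
doc = nlp("Elon Musk invites Vladimir Putin to connect on the Clubhouse app.")

for ent in doc.ents:
    print(ent.text, ent.label_)

# Typical output (exact spans and label names vary by model):
#   Elon Musk        PERSON
#   Vladimir Putin   PERSON
#   Clubhouse        ORG
```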

One NER algorithm to rule them all?

One of the most immediately striking things when you attempt to implement NER in real use cases is that the available solutions are very generic and may not fit your specific use case that well. One way in which this appears is in the labels used to categorise entities.

The overwhelming majority of solution developers and vendors use four categories of entity: people, places, organisations, and miscellaneous. This is mostly for the historical reason that these were the categories used in the CoNLL-2003 shared task, which has acted as a touchstone for the industry ever since. Nor are the categories bad or unreasonable – given a task to group entities into four intuitive groups, many people would likely arrive at something similar.

The problem is that, reasonable though those categories may be, they don't necessarily make sense in every use case. In our use cases, for example, we typically don't care about organisations as such but about companies – 'organisation' is a broader category which also includes things like the U.N. and the FBI.

The practical implication of this is that many off-the-shelf NER models may simply not be that useful for your application and may require some additional work or post-processing steps to deliver what you want. This situation seems to be improving, as some more targeted vendors and developers are beginning to offer a wider range of commercially useful categories, but given the number of possible entity categories we might want to target, we are probably some way off having readily available models for all of them, and it may never be fully achievable. We are also seeing more platforms which offer some 'train your own' capabilities, and this may be a promising path.
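
To illustrate the kind of additional work we mean, here is a simplified sketch of remapping a generic four-label output onto a narrower, company-focused schema. The label names and the blocklist are purely illustrative – they are not a description of any particular vendor's output or of our production pipeline:

```python
# Hypothetical remapping from generic NER labels to a company-focused schema.
NON_COMPANY_ORGS = {"U.N.", "UN", "FBI", "SEC"}  # illustrative blocklist of non-company organisations

def to_company_schema(entities):
    """entities: list of (text, label) pairs produced by an off-the-shelf NER model."""
    remapped = []
    for text, label in entities:
        if label in {"ORG", "ORGANISATION"}:
            if text in NON_COMPANY_ORGS:
                continue  # an organisation, but not a company we care about
            remapped.append((text, "COMPANY"))
        elif label in {"PER", "PERSON"}:
            remapped.append((text, "PERSON"))
        # locations and miscellaneous entities are dropped in this use case
    return remapped
```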

What this touches on is that, practically speaking, it is a little misleading to think of NER as a generic problem with a generic solution. Perhaps one day there will be one algorithm to rule them all, but today the best solution for identifying drugs in research papers is unlikely to be the same one as for identifying companies in regulatory filings. It may be more helpful for both consumers and producers to think of NER as a family of solutions to a range of different problems which share some formal characteristics. From a commercial perspective this is no bad thing, because it is usually far easier to solve problems in a domain- or application-specific way than in a general one, as we will see below.

What does good look like?

One of the surprising facts about implementing NER solutions in the wild is not just that it is tricky to work out which solution is best, but that it is also much harder than you might think to establish an objective, general measure of quality against which to compare them.

If you wanted to know which NER solution was best for you, a reasonable starting place might be the publicly available benchmarking data. Unfortunately, our experience is that these benchmarks have not been especially useful. Any benchmark is only useful if it matches your intended use case in the relevant ways, and we have found that the absolute and relative performance of NER algorithms on benchmarks was quite different to their performance on our specific data (mainly company filings and regulatory releases). Further, just as with a CPU or database benchmark, implementation details matter and can contribute significantly to the outcome. Even when the implementation details in a benchmark have been honestly chosen, they may not be applicable to your circumstances and can again result in the benchmark being unrepresentative.

These issues are true of all benchmarking processes to a greater or lesser extent, but there are some features of NER that make it difficult even in theory to establish a general and objective measure of quality. Consider an algorithm which returns the following:

Philip (Seymour Hoffman)[PERSON] was a prolific actor.

It correctly identifies "Seymour Hoffman" as a person, but misses the full name. Should this receive a full point, a half point or no point? And what about cases in which the correct string is identified but the wrong label is applied? There just isn't a right answer here which doesn't depend on the context and use case – even among the relatively homogeneous set of use cases we have encountered, it has sometimes made sense to use very strict scoring criteria and at other times to weight things differently.
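
For anyone building their own evaluation, the sketch below shows one way of making that choice explicit: a strict mode that only rewards exact span-and-label matches, and a lenient mode that rewards any overlap with the correct label (one of several possible conventions). F1 is simply the harmonic mean of precision and recall:

```python
def _overlaps(a, b):
    """Two (start, end, label) spans match leniently if the labels agree and the character ranges intersect."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]

def score(gold, pred, strict=True):
    """gold, pred: sets of (start, end, label) tuples over the same text."""
    if strict:
        tp_pred = tp_gold = len(gold & pred)  # exact span and label match only
    else:
        tp_pred = sum(any(_overlaps(p, g) for g in gold) for p in pred)
        tp_gold = sum(any(_overlaps(g, p) for p in pred) for g in gold)
    precision = tp_pred / len(pred) if pred else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Under strict scoring the Philip Seymour Hoffman example above gets no credit; under the lenient convention it counts in full – which is exactly the kind of decision that depends on your use case.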

What this means is that, if commercial circumstances allow, it is far preferable to have (1) an independent benchmark which reflects the data you will be running the models on, and (2) a scoring model which reflects the value of different kinds of partial match in your use case. This might seem intimidating or impractical, but we have found it more achievable than you might expect – we have built test datasets of >90,000 tokens, hand labelled multiple times (with additional human reviews), in under a week with a small team. In general, for situations where you are considering investing substantial resources into a solution or where you are sensitive to performance differences, we have found the ROI to be compelling.

Algos in action: YMMV

You could be forgiven for assuming that all NER solutions would perform about the same – after all, they all use roughly similar methodologies, are designed by people with similar skills and experience, and tend to be in the same ballpark on claimed performance. This has not been our experience at all, and we have seen very wide differences in performance on our independent benchmarks. For example, below is a table showing the F1 scores (a value between 0 and 1, higher being better) of different providers (OS = open source, COM = commercial) on one of our benchmarks; you can see that, in a sense, the best-performing model was almost twice as good as the worst.

Provider   F1-Score
COM1       0.36
OS1        0.42
OS2        0.44
OS3        0.46
OS4        0.47
COM2       0.54
COM3       0.56
OS6        0.63
COM4       0.68

There are of course other dimensions on which one algorithm can be superior to another, including speed (latency and/or throughput), ease of use, cost, etc. NER algorithms are highly parallelisable, so for self-hosted solutions the limiting factor is usually going to be the size of the compute grid you have access to and the cost of running it. Here again the differences can be enormous, and a range of 100x is not uncommon, which, depending on the size of your dataset, may make some options unfeasible.
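
As a rough sketch of what we mean by parallelisable: for a self-hosted model the work splits cleanly across documents, so throughput scales with the number of workers you can afford to run. run_ner() below is a placeholder for whichever model you happen to be evaluating:

```python
from concurrent.futures import ProcessPoolExecutor

def tag_document(text):
    return run_ner(text)  # placeholder: call whichever self-hosted model you are evaluating

def tag_corpus(texts, workers=8):
    # Each document is independent, so this is an embarrassingly parallel workload;
    # the practical limit is compute cost rather than the algorithm itself.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(tag_document, texts, chunksize=64))
```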

Another important and often overlooked consideration is that with a vendor API you do not typically have visibility into which model is being used or what changes are being deployed. There are ways to monitor the stability of a vendor algorithm in production, but in practice these are cumbersome and not widely used – functionally these are black boxes. Again, the exact use case matters: for use cases that are highly sensitive to changes or refinements in the underlying model, you may need to self-host, seek guarantees from your vendor, and/or put a monitoring system in place.
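
One simple form of monitoring, sketched below, is to re-run a fixed 'canary' set of documents through the vendor API on a schedule and alert when the outputs change. call_vendor_api() and alert() are placeholders for your own infrastructure; this is a sketch of the idea rather than a recommendation of any particular setup:

```python
import hashlib
import json

def snapshot(canary_texts):
    """Hash the vendor's output on a fixed set of documents so any change is easy to detect."""
    outputs = [call_vendor_api(text) for text in canary_texts]  # placeholder API call
    blob = json.dumps(outputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def check_drift(canary_texts, previous_hash):
    current = snapshot(canary_texts)
    if current != previous_hash:
        alert("Vendor NER output changed on the canary set")  # placeholder alerting hook
    return current
```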

A side note on performance is that we have seen some success using post-processing steps on the outputs of the models. Because we tend to care about specific data types (in our case usually company filings), we can often achieve meaningful accuracy improvements with simple heuristics which exploit domain-specific formal or content-based features of our datasets. Below is the same table as above, showing F1 scores after some very simple post-processing steps were applied. We can't promise that these results will be replicable in every case, but for commercial applications this is certainly worth considering, because the datasets being used will often have features that lend themselves to this kind of exploitation – for example, formal/structural characteristics or a tight domain focus.

Provider   F1-Score   % uplift
COM1       0.36       0.2%
OS1        0.47       11.2%
OS2        0.49       11.4%
OS3        0.51       11.2%
OS4        0.52       11.3%
COM2       0.61       12.2%
COM3       0.61       9.7%
OS6        0.67       5.2%
COM4       0.75       10.0%
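
For a flavour of what such post-processing can look like, here is a simplified example of the kind of heuristic that exploits formal features of company filings. The suffix list and rules are illustrative only – they are not the actual steps behind the numbers above:

```python
import re

# Legal suffixes that strongly suggest an organisation entity is in fact a company.
COMPANY_SUFFIX = re.compile(r"\b(Ltd|Limited|Inc|LLC|LLP|PLC|GmbH|Corp)\.?$", re.IGNORECASE)

def post_process(entities):
    """entities: list of (text, label) pairs produced by an NER model."""
    cleaned = []
    for text, label in entities:
        if len(text) < 2 or text.isnumeric():
            continue  # filings are full of codes and figures that models sometimes tag as entities
        if label in {"ORG", "ORGANISATION"} and COMPANY_SUFFIX.search(text):
            label = "COMPANY"  # promote to the narrower category we actually care about
        cleaned.append((text, label))
    return cleaned
```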

Takeaways

As people who work with text data every day, we are excited about the potential role NER technologies can play in a wide range of applications as they improve and evolve. To recap, our observations for potential users and developers are:

  1. Many of the solutions currently available are very generic, meaning a little more effort may be required to deliver what you want.
  2. A definition of 'good' is less obvious than it might seem, and benchmarking against your own definition is both advisable and surprisingly achievable.
  3. In our experience, performance varies greatly across providers, and it may be possible to improve results with some simple post-processing steps.