Improve named entity recognition in spaCy with custom rules

spaCy allows you to augment the model-based named entity recognition with custom rules. I found that the documentation on this is a bit lacking, particularly in cross-references.

Here I would like to gather some links and hints for working with custom rules.

The example I use throughout is labeling US-style telephone numbers (e.g. “(123)-456-7890” or “123-456-7890”).


My first recommendation is to add a lot of tests for your use case. When you download a new model, add another custom rule to the tokenizer, or just upgrade spaCy, things might break. The differences in named entity recognition between the provided small, medium and large English models are significant.

So we aim for tests in the style of:

texts = [("This is Fred and his number is 123-456-7890.", 1),
         ("Peter (001)-999-4321 is the big Apple Microsoft.", 1),
         ("Peter (001)-99-4321 is the big Apple Microsoft.", 0)]
for text, num_tel in texts:
    # run the pipeline on the single sentence, not on the whole list
    doc = nlp(text)
    telephones = [ent for ent in doc.ents if ent.label_ == "TELEPHONE"]
    self.assertEqual(len(telephones), num_tel)
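
The loop above lives inside a test case. A minimal, self-contained sketch of such a test could look like the following. Note that `count_telephone_entities` and `TELEPHONE_RE` are hypothetical names, and the regular expression is only a stand-in for the spaCy pipeline so that the structure runs without a downloaded model; in a real test the helper would count the TELEPHONE entities in `doc.ents`.

```python
import re
import unittest

# Stand-in for the spaCy pipeline (assumption): a plain regular expression that
# mimics the TELEPHONE rule so the test structure runs without a model.
TELEPHONE_RE = re.compile(r"(?<![\d-])\(?\d{3}\)?-\d{3}-\d{4}(?![\d-])")


def count_telephone_entities(text):
    """Hypothetical helper; a real test would count TELEPHONE labels in doc.ents."""
    return len(TELEPHONE_RE.findall(text))


class TelephoneRuleTest(unittest.TestCase):
    CASES = [
        ("This is Fred and his number is 123-456-7890.", 1),
        ("Peter (001)-999-4321 is the big Apple Microsoft.", 1),
        ("Peter (001)-99-4321 is the big Apple Microsoft.", 0),
    ]

    def test_telephone_counts(self):
        for text, expected in self.CASES:
            # subTest reports every failing sentence instead of stopping at the first
            with self.subTest(text=text):
                self.assertEqual(count_telephone_entities(text), expected)
```

Using `subTest` keeps one test method per rule while still showing which sentence broke after a model or tokenizer change.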

Phrase patterns vs. token patterns

The documentation describes phrase patterns and token patterns. What might be misleading is that its example for phrase patterns uses a single word.

[Screenshot: phrase patterns and token patterns as shown in the documentation (2020/10/29)]

The point here is that for a phrase pattern you can match arbitrary text spanning multiple tokens with a single pattern, while for token patterns you specify a list of patterns with one pattern per token. Example:

# phrase pattern 
pattern_phrase = {"label": "FRUIT", "pattern": "apple pie"}

# similar token pattern
pattern_token = {"label": "FRUIT", "pattern": [{"TEXT": "apple"}, {"TEXT": "pie"}]}

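Note that those two are not completely equivalent. The token pattern depends on the tokenizer, which means that if the rules of the tokenizer change, the pattern might not match anymore. At the same time, the token pattern is meant to be insensitive to the whitespace between the words, e.g. “apple  pie” with multiple spaces. Be aware, though, that spaCy usually emits additional whitespace as a token of its own, so it is worth covering such a case with a test.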

For our example it’s obvious we will have to use token patterns because phrase patterns allow only matching of exact strings.

Pattern options

It’s not explicitly mentioned in the documentation for the entity ruler, but you can use the same patterns as described in the token matcher documentation. Most notably, the REGEX option also works.

So one option for our use case is to specify the token patterns as follows:

pattern_tel = {"label": "TELEPHONE", "pattern": [
    {"TEXT": {"REGEX": r"\(?\d{3}\)?"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{3}"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{4}"}}
]}

Note that this relies on the tokenizer splitting the text at “-”, so we actually match on five tokens here. Another option is to first use a custom component that merges such phone numbers into a single token.
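
One detail worth testing: as far as I can tell, spaCy applies a REGEX predicate with search semantics (like `re.search`), so an unanchored `\d{3}` would also accept a longer token such as “12345”. The following sketch illustrates this with the standard library only. The helper `tokens_match` and the hand-written token lists are assumptions for illustration; the real tokenizer may split the text differently.

```python
import re


def tokens_match(tokens, token_regexes):
    """Return True if every token satisfies its regex (re.search semantics)."""
    if len(tokens) != len(token_regexes):
        return False
    return all(re.search(rx, tok) for rx, tok in zip(token_regexes, tokens))


# Regexes as in the article (unanchored) and an anchored variant.
# The literal "-" tokens are modelled as regexes here for simplicity.
unanchored = [r"\(?\d{3}\)?", r"-", r"\d{3}", r"-", r"\d{4}"]
anchored = [r"^\(?\d{3}\)?$", r"^-$", r"^\d{3}$", r"^-$", r"^\d{4}$"]

# Hand-written token lists (assumption: the tokenizer splits at "-").
good = ["123", "-", "456", "-", "7890"]
too_long = ["12345", "-", "4567", "-", "78901"]

print(tokens_match(good, unanchored))      # True
print(tokens_match(too_long, unanchored))  # True: substrings are enough
print(tokens_match(good, anchored))        # True
print(tokens_match(too_long, anchored))    # False: whole token must fit
```

If over-long digit groups should be rejected, anchoring each REGEX with `^` and `$` is a cheap safeguard.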

Adding the pattern

This one is straightforward:

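import spacy
from spacy.pipeline import EntityRuler

pattern_tel = {"label": "TELEPHONE", "pattern": [
    {"TEXT": {"REGEX": r"\(?\d{3}\)?"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{3}"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{4}"}}
]}
# pattern_name and pattern_apple are not shown in this post; the following
# definitions are assumptions made so that the snippet runs.
pattern_name = {"label": "VIP", "pattern": "Fred"}
pattern_apple = {"label": "FRUIT", "pattern": "apple pie"}

nlp = spacy.load('en_core_web_sm')
ruler = EntityRuler(nlp)
ruler.add_patterns([pattern_tel, pattern_name, pattern_apple])
doc = nlp("This is Fred and his number is 123-456-7890 to get an apple  pie")
print([(ent, ent.label_) for ent in doc.ents])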


[(Fred, 'PERSON'), (123, 'CARDINAL')]

Seems there’s something wrong.

Adding entity ruler at the correct position

The documentation states that you have to add the entity ruler before the “ner” component, but it neither describes how nor links to an explanation in that paragraph. So let’s do this:

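import spacy
from spacy.pipeline import EntityRuler

pattern_tel = {"label": "TELEPHONE", "pattern": [
    {"TEXT": {"REGEX": r"\(?\d{3}\)?"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{3}"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{4}"}}
]}
# pattern_name and pattern_apple are not shown in this post; the following
# definitions are assumptions that are consistent with the output below.
pattern_name = {"label": "VIP", "pattern": "Fred"}
pattern_apple = {"label": "FRUIT", "pattern": "apple pie"}

nlp = spacy.load('en_core_web_sm')
ruler = EntityRuler(nlp)
ruler.add_patterns([pattern_tel, pattern_name, pattern_apple])
# spaCy v2 style; spaCy v3 expects nlp.add_pipe("entity_ruler", before="ner"),
# which creates and returns the ruler
nlp.add_pipe(ruler, before="ner")
doc = nlp("This is Fred and his number is 123-456-7890 to get an apple  pie")
print([(ent, ent.label_) for ent in doc.ents])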


[(Fred, 'VIP'), (123-456-7890, 'TELEPHONE')]


We’ve seen how to add a regex-based custom rule to the named entity recognition of spaCy. In a subsequent article we will explore more complex patterns.

Paper summary: Maximizing Mutual Information for Tacotron

Maximizing Mutual Information for Tacotron, Peng Liu, Xixin Wu, Shiyin Kang, Guangzhi Li, Dan Su, Dong Yu

This publication investigates local information preference as a source of errors in the Tacotron model.

Autoregressive (AR) models predict their next output given their previous outputs, which is described by a conditional probability P(x_i|x_<i). If they depend only on a single previous output, they have the Markov property. AR models are commonly trained using teacher forcing, which means that the previous values x_<i in each autoregressive step are taken from the training data (ground truth) instead of the model's own predictions. When both ground truth and predictions are used, this is commonly called scheduled sampling or curriculum learning. A conditional autoregressive (CAR) model furthermore bases each prediction on additional input, P(x_i|x_<i, t). In the case of Tacotron this is some sort of linguistic specification t, e.g. the characters of an orthographic transcription or the phones of a phonetic transcription.
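
The difference between teacher forcing and feeding back the model's own predictions can be sketched with a toy example. The “model” `predict_next` below is a deliberately imperfect linear predictor and purely an assumption for illustration; it has nothing to do with Tacotron itself.

```python
def predict_next(prev):
    """Toy autoregressive 'model' (assumption): a slightly wrong linear predictor."""
    return 0.9 * prev + 1.0


def run(ground_truth, teacher_forcing):
    """Produce one prediction per step, conditioning on truth or on own output."""
    outputs = []
    prev = ground_truth[0]
    for target in ground_truth[1:]:
        pred = predict_next(prev)
        outputs.append(pred)
        # Teacher forcing feeds the true value back in;
        # otherwise the model's own prediction is reused.
        prev = target if teacher_forcing else pred
    return outputs


truth = [0.0, 1.0, 2.0, 3.0, 4.0]
forced = run(truth, teacher_forcing=True)
free = run(truth, teacher_forcing=False)

print("teacher forcing:", forced)
print("free running:   ", free)
```

With teacher forcing the error stays bounded per step, while in free-running mode the errors of earlier steps accumulate.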

The hypothesis of this paper is that a significant share of the errors produced by Tacotron is caused by local information preference, which intuitively means that a given prediction is mostly based on the autoregressive part, i.e. the previous predictions x_<i, and not on the additional input t. For Tacotron this means the model relies on the previous Mel spectral frame while the linguistic input is mostly ignored.

The authors furthermore hypothesize that this is the reason why a reduction factor > 1 works well to make the training more robust. The reduction factor determines how many frames are produced per inference step. If the model has to produce multiple frames given only one previous frame, it is forced to make use of the additional input t to still satisfy the loss function. They also argue that this might be the reason why the 0.5 prenet dropout and the rather large frame shift in Tacotron are vital for good results.
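
To make the effect of the reduction factor concrete, here is a small sketch of which previous frame is available at each decoder step. The function name `decoder_schedule` and the use of `None` for the initial frame are assumptions for illustration, and integers stand in for Mel frames.

```python
def decoder_schedule(frames, reduction_factor):
    """Return (previous_frame, frames_to_predict) pairs, one per decoder step."""
    schedule = []
    prev = None  # stands for the initial frame fed to the first decoder step
    for i in range(0, len(frames), reduction_factor):
        chunk = frames[i:i + reduction_factor]
        schedule.append((prev, chunk))
        # only the last frame of the chunk is fed back as autoregressive input
        prev = chunk[-1]
    return schedule


frames = list(range(10))  # stand-in for 10 Mel frames

for step in decoder_schedule(frames, 3):
    print(step)
```

With a reduction factor of 3, each step has to predict up to three frames from a single previous frame, so the further a frame is from that input, the less it can be explained by the autoregressive part alone.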

The proposed methods aim to reward the model for making use of the additional input t. Dropout Frame Rate (DFR) randomly sets the previous input x_{i-1} to the global mean, so that the model cannot rely solely on it to make future predictions. The other option is Maximizing Mutual Information (MMI) between the linguistic input and the predicted output features, to ensure that the model learns a meaningful mapping instead of some subjectively meaningless representation that just satisfies the loss function.
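
The frame dropping idea can be sketched in a few lines. This is a rough illustration under assumptions, not the authors' implementation: the function name `drop_frames`, the per-frame independent decision and the way the global mean is computed are all made up for this example.

```python
import random


def drop_frames(prev_frames, global_mean, drop_rate, rng):
    """Replace each previous frame by the global mean with probability drop_rate."""
    return [list(global_mean) if rng.random() < drop_rate else frame
            for frame in prev_frames]


# Two-dimensional toy "frames" and their mean over the whole (tiny) data set.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
global_mean = [sum(dim) / len(frames) for dim in zip(*frames)]

rng = random.Random(0)
decoder_inputs = drop_frames(frames, global_mean, drop_rate=0.5, rng=rng)
print(decoder_inputs)
```

Wherever a frame was replaced, the decoder input carries no information about the previous output, so the prediction for that step has to come from the linguistic input t.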

The results show that both DFR and MMI decrease the number of errors produced by the model and also achieve a good alignment at a much earlier training step than without those additions. Furthermore, MMI closes the gap between training and validation error often seen when training Tacotron.

Artificial intelligence in speech synthesis

(Dear international readers, this content appeared first in the German “IT Freelancer Magazin”; the full article is unfortunately only available in German.)

It was early 2012 when I applied for a doctorate in the field of speech synthesis. Honestly, I could not imagine much about it at the time, but the job posting included machine learning and artificial intelligence. And it sounded somehow… different. Today, more than five years later, I have a fairly good overview of the topic, but in conversations with other people I keep running into the same lack of understanding that I had back then. This article aims to outline, in broad strokes and in generally understandable terms, the tasks and problems of modern speech synthesis.


