Artificial Intelligence, Speech Technology and Software Development

Month: October 2020

Improve named entity recognition in spaCy with custom rules

spaCy allows to augment the model-based named entity recognition with custom rules. I found that the documentation on this is a bit lacking (https://spacy.io/usage/rule-based-matching#entityruler) cross-references.

Here I would like to gather some links and hints for working with custom rules.

The example I use throughout is labeling US-style telephone numbers (e.g. “(123)-456-7890” or “123-456-7890”).

Tests

My first recommendation is to add a lot of tests for your use case. When you download a new model, add another custom rule to the tokenizer or just upgrade spacy, things might break. The differences in named entity recognition between the provided small, medium and large English models are significant.

So we aim for tests in the style of:

texts = [("This is Fred and his number is 123-456-7890.", 1),
         ("Peter (001)-999-4321 is the big Apple Microsoft.", 1),
         ("Peter (001)-99-4321 is the big Apple Microsoft.", 0)]
for text, num_tel in texts:
    doc = nlp(texts) 
    self.assertEqual(len([ent for ent in doc.ents if ent.label_ == "TELEPHONE"]), num_tel)

Phrase patterns vs. token patterns

The documentation describes phrase- and token patterns here https://spacy.io/usage/rule-based-matching#entityruler. What might be misleading here is that the example for phrase patterns uses a single word.

Phrase patterns and token patterns from https://spacy.io/usage/rule-based-matching#entityruler (2020/10/29)

The point here is that for a phrase pattern you can match arbitrary text spanning multiple tokens with a single pattern, while for token patterns you specify a list of patterns with one pattern per token. Example:

# phrase pattern 
pattern_phrase = {"label": "FRUIT", "pattern": "apple pie"}

# similar token pattern
pattern_token = {"label": "FRUIT", "pattern": [{"TEXT": "apple"}, {"TEXT": "pie"}]}

Note that those two are not completely equivalent. The token pattern is dependent on the tokenizer. Which means if the rules of the tokenizer change, the pattern might not match anymore. At the same time the token pattern also matches “apple pie” (multiple whitespaces between the words).

For our example it’s obvious we will have to use token patterns because phrase patterns allow only matching of exact strings.

Pattern options

It’s not explicitly mentioned in the documentation for the entity ruler but you can use the same patterns as described in the token matcher documentation here: https://spacy.io/usage/rule-based-matching#adding-patterns-attributes. Most notably, the REGEX option as described here also works: https://spacy.io/usage/rule-based-matching#regex

So one option for our use case is to specify the token patterns as follows:

pattern_tel = {"label": "TELEPHONE", "pattern": [
    {"TEXT": {"REGEX": r"\(?\d{3}\)?"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{3}"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{4}"}}
    ]}

It has to be noted that this relies on the tokenizer to split the tokens by “-“. So we actually match on 5 tokens here. Another option is to first use a custom component to merge such phone numbers as described here https://spacy.io/usage/rule-based-matching#matcher-pipeline.

Adding the pattern

This one is straightforward:

import spacy
from spacy.pipeline import EntityRuler 

pattern_tel = {"label": "TELEPHONE", "pattern": [
    {"TEXT": {"REGEX": r"\(?\d{3}\)?"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{3}"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{4}"}}
    ]}

nlp = spacy.load('en_core_web_sm')
ruler = EntityRuler(nlp)
ruler.add_patterns([pattern_tel, pattern_name, pattern_apple]) 
nlp.add_pipe(ruler)
doc = nlp("This is Fred and his number is 123-456-7890 to get an apple  pie") 
print([(ent, ent.label_) for ent in doc.ents])

Output:

[(Fred, 'PERSON'), (123, 'CARDINAL')]

Seems there’s something wrong.

Adding entity ruler at the correct position

The documentation states that you have to add the entity ruler before the “ner” component but does not describe how, nor gives a link in this paragraph. So here we go: https://spacy.io/api/language#add_pipe. So let’s do this:

import spacy
from spacy.pipeline import EntityRuler 

pattern_tel = {"label": "TELEPHONE", "pattern": [
    {"TEXT": {"REGEX": r"\(?\d{3}\)?"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{3}"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{4}"}}
    ]}

nlp = spacy.load('en_core_web_sm')
ruler = EntityRuler(nlp)
ruler.add_patterns([pattern_tel, pattern_name, pattern_apple]) 
nlp.add_pipe(ruler, before="ner")
doc = nlp("This is Fred and his number is 123-456-7890 to get an apple  pie") 
print([(ent, ent.label_) for ent in doc.ents])

Output:

[(Fred, 'VIP'), (123-456-7890, 'TELEPHONE')]

Summary

We’ve seen how to add a regex based custom rule to the named entity recognition of spaCy. In a subsequent article we will explore more complex patterns.

On classic university education

In the vast Internet one regularly encounters dismissive statements about formal education, dismissing it as useless in the so-called real world. Here I would like to take a different stance and elaborate on the positive aspects of university.

First let’s set the context straight:
I studied at the Vienna University of Technology and Medical University of Vienna with no tuition. I can see that throwing yourself in debt like it’s common in the USA is something you have to carefully consider.
I know that nowadays there are lots of high-quality MOOCs out there. This wasn’t the case when I studied. I think you can get a lot out of them. But a large portion of the best ones out there are created by universities or at least based on university courses (like the infamous Machine Learning course by Andrew Ng). Also you have to be really careful to not become a cherry-picker. I would definitely not spent so much time with Mathematics without being forced to. Now I am glad I did – actually I wished they would have forced me to even more, even though I had to take about 5 statistics-focused and 4 general math courses.

A quite common complaint is that universities teach many esoteric topics or to a degree which you don’t need in your day-to-day life. But in fact, if they would not teach it, who would? Who would work on the foundations of technology if no one taught it?
As an example, when learning 3D graphics at university you usually start with topics like Bresenham’s line drawing algorithm, look at the projection matrices, probably implement that stuff from scratch. Obviously in 99% of the Unity3D-game-dev jobs out there you don’t REALLY need this knowledge. It will help you with the understanding but in fact you can do a lot without even knowing what a matrix is (I’ve worked on 3D viz in Java3D and programmed fixed pipeline OpenGL in C++ using NeHes tutorials in 2000 before knowing what a rotation matrix is). But who will write the next Unreal Engine or develop the latest raytracing techniques if everyone just taught using existing game engines?

But besides this global view on education, why is it interesting for the individual who does not plan to work in research?

Your career will likely last decades. It’s well worth spending a few of them on foundations. This is the kind of knowledge you can’t easily pick up on the job. In my first year I easily spent 15-20 hours a week on Mathematics. If I remember correctly it was one lecture a day where I reviewed the material from the last day in the train commute and afterwards reviewed it again. Then spent about 8 hours every Sunday on the weekly exercises which then have to presented at the blackboard during another weekly unit (where I usually also prepared my self yet again before the unit).
You will never get the chance to dive so deep and use so much time once you are part of the workforce. You can easily learn React on the job, but they probably won’t pay you for solving differential equation exercises during work. And in fact I often wished I put more credits into such courses. Programming exercises were useless for me – I already worked as embedded and network developer part-time and got paid for non-toy exercises. Writing simple FTP clients in Java were a waste of time when I worked on a network monitoring solution in my job. So the courses I did really benefit from were those exotic foundational topics.

That being said, there were also more than enough interesting things I could work on at university that you’ll likely work on in the “real world” afterwards. The real world is often boring in comparison. In university you might deal with surgery robots and then in your job it’s much more likely to fix computers of doctors, maintain a patient database or write format converters in Java.

Let’s look at some of the cooler topics I encountered at university:
– 3D renderer from scratch (Computer Graphics 1)
– Internet security challenges, gaining ranks and titles, with leaderboard and option to join a CTF (Internet Security 1)
– X-Ray segmentation algorithms (Medical Computer Vision)
– Iris recognition systems (Seminar work)
– Played with a cathode ray tube (Physics practicals)
– Knowledge based system for infant ventilation (Knowledge based systems)
– Networked 3D game (Computer Graphics 2)
– Neuroprosthetics (Summer course)
– Developed a SNMP client for some exotic embedded device (Practical)
– Recorded EMG of my face muscles (Clinical signal processing)
– Multimodal image registration of ophthalmologic images (Thesis at General Hospital Vienna)
– Played with Virtual Reality equipment before it was so readily available in the consumer market (Virtual and Augmented Reality)

Enjoy university while it lasts. Having the opportunity for dense learning and working on such a variety of topics in a short time span is rare after graduation. Of course a job will teach you a lot, but not 40h+ a week. It’s not unlikely you spend a large portion of your work week with repetitive and not exactly enlightening tasks. Also: you can still specialize on one topic the subsequent 30-40 years.

© 2020 neuratec blog

Theme by Anders NorenUp ↑