neuratec blog

Improve named entity recognition in spaCy with custom rules

October 29, 2020 / Markus Toman

spaCy allows you to augment the model-based named entity recognition with custom rules. I found that the documentation on this (https://spacy.io/usage/rule-based-matching#entityruler) is a bit lacking in cross-references.

Here I would like to gather some links and hints for working with custom rules.

The example I use throughout is labeling US-style telephone numbers (e.g. “(123)-456-7890” or “123-456-7890”).

Tests

My first recommendation is to add a lot of tests for your use case. When you download a new model, add another custom rule to the tokenizer, or just upgrade spaCy, things might break. The differences in named entity recognition between the provided small, medium and large English models are significant.

So we aim for tests in the style of:

# (text, expected number of TELEPHONE entities)
texts = [("This is Fred and his number is 123-456-7890.", 1),
         ("Peter (001)-999-4321 is the big Apple Microsoft.", 1),
         ("Peter (001)-99-4321 is the big Apple Microsoft.", 0)]
for text, num_tel in texts:
    doc = nlp(text)  # run the pipeline on the single text, not on the whole list
    self.assertEqual(len([ent for ent in doc.ents if ent.label_ == "TELEPHONE"]), num_tel)
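One way to wrap such a loop in a complete test case is sketched below. It is only a sketch under assumptions: fake_nlp is a hypothetical regex-based stand-in for the real spaCy pipeline so that the example runs without a model, and count_label and TelephoneTest are names made up for illustration. subTest reports each failing sentence separately instead of stopping at the first failure.

```python
import re
import unittest
from types import SimpleNamespace

# (text, expected number of TELEPHONE entities)
CASES = [
    ("This is Fred and his number is 123-456-7890.", 1),
    ("Peter (001)-999-4321 is the big Apple Microsoft.", 1),
    ("Peter (001)-99-4321 is the big Apple Microsoft.", 0),
]


def count_label(doc, label):
    """Count the entities in doc.ents that carry the given label."""
    return sum(1 for ent in doc.ents if ent.label_ == label)


def fake_nlp(text):
    """Hypothetical stand-in for a spaCy pipeline: tags phone-like substrings."""
    hits = re.findall(r"\(?\d{3}\)?-\d{3}-\d{4}", text)
    ents = [SimpleNamespace(text=hit, label_="TELEPHONE") for hit in hits]
    return SimpleNamespace(ents=ents)


class TelephoneTest(unittest.TestCase):
    # Assumption: replace fake_nlp with the real pipeline in an actual project.
    nlp = staticmethod(fake_nlp)

    def test_telephone_counts(self):
        for text, num_tel in CASES:
            with self.subTest(text=text):
                doc = self.nlp(text)
                self.assertEqual(count_label(doc, "TELEPHONE"), num_tel)
```

Saved in a test module, this can be run with python -m unittest.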

Phrase patterns vs. token patterns

The documentation describes phrase patterns and token patterns here: https://spacy.io/usage/rule-based-matching#entityruler. What might be misleading is that the example for phrase patterns uses a single word.

Phrase patterns and token patterns from https://spacy.io/usage/rule-based-matching#entityruler (2020/10/29)

The point here is that for a phrase pattern you can match arbitrary text spanning multiple tokens with a single pattern, while for token patterns you specify a list of patterns with one pattern per token. Example:

# phrase pattern 
pattern_phrase = {"label": "FRUIT", "pattern": "apple pie"}

# similar token pattern
pattern_token = {"label": "FRUIT", "pattern": [{"TEXT": "apple"}, {"TEXT": "pie"}]}

Note that those two are not completely equivalent. The token pattern depends on the tokenizer, which means that if the rules of the tokenizer change, the pattern might no longer match. At the same time, the token pattern is not tied to the exact string: it can also match “apple  pie” (multiple whitespace characters between the words), as long as the tokenizer does not turn the extra whitespace into a token of its own.

For our example it’s obvious we will have to use token patterns because phrase patterns allow only matching of exact strings.

Pattern options

It’s not explicitly mentioned in the documentation for the entity ruler but you can use the same patterns as described in the token matcher documentation here: https://spacy.io/usage/rule-based-matching#adding-patterns-attributes. Most notably, the REGEX option as described here also works: https://spacy.io/usage/rule-based-matching#regex

So one option for our use case is to specify the token patterns as follows:

pattern_tel = {"label": "TELEPHONE", "pattern": [
    {"TEXT": {"REGEX": r"\(?\d{3}\)?"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{3}"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{4}"}}
    ]}

Note that this relies on the tokenizer splitting the tokens at “-”, so we actually match on 5 tokens here. Another option is to first use a custom component to merge such phone numbers, as described here: https://spacy.io/usage/rule-based-matching#matcher-pipeline.
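One thing worth checking with such a pattern is whether the per-token regexes should be anchored. The sketch below emulates the token pattern with plain re on hand-written token lists. It assumes that the REGEX operator behaves like re.search (a match anywhere inside the token text), and the token lists are assumptions about how a tokenizer might split the input; matches, UNANCHORED and ANCHORED are names made up for this illustration.

```python
import re

# Per-token regexes of pattern_tel; "^-$" emulates the exact match {"TEXT": "-"}.
UNANCHORED = [r"\(?\d{3}\)?", r"^-$", r"\d{3}", r"^-$", r"\d{4}"]
# Anchored variant: each regex has to cover the whole token text.
ANCHORED = [r"^\(?\d{3}\)?$", r"^-$", r"^\d{3}$", r"^-$", r"^\d{4}$"]


def matches(tokens, regexes):
    """Return True if some run of consecutive tokens satisfies the regexes in order."""
    width = len(regexes)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(re.search(rx, tok) for rx, tok in zip(regexes, window)):
            return True
    return False


# Hand-written token lists (assumed tokenizer output, not produced by spaCy here).
phone = ["123", "-", "456", "-", "7890"]
too_long = ["0123", "-", "4567", "-", "78901"]

print(matches(phone, UNANCHORED), matches(phone, ANCHORED))
print(matches(too_long, UNANCHORED), matches(too_long, ANCHORED))
```

Under these assumptions the unanchored regexes also accept the too-long tokens, while the anchored ones reject them and still accept the regular phone number.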

Adding the pattern

This one is straightforward:

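import spacy
from spacy.pipeline import EntityRuler  # spaCy 2.x style import

pattern_tel = {"label": "TELEPHONE", "pattern": [
    {"TEXT": {"REGEX": r"\(?\d{3}\)?"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{3}"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{4}"}}
    ]}
# Assumption: pattern_name and pattern_apple are not defined in the post;
# these are illustrative stand-ins.
pattern_name = {"label": "VIP", "pattern": "Fred"}
pattern_apple = {"label": "FRUIT", "pattern": "apple pie"}

nlp = spacy.load('en_core_web_sm')
ruler = EntityRuler(nlp)
ruler.add_patterns([pattern_tel, pattern_name, pattern_apple])
nlp.add_pipe(ruler)  # appended at the end of the pipeline, i.e. after "ner"
doc = nlp("This is Fred and his number is 123-456-7890 to get an apple  pie")
print([(ent, ent.label_) for ent in doc.ents])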

Output:

[(Fred, 'PERSON'), (123, 'CARDINAL')]

Seems there’s something wrong.

Adding entity ruler at the correct position

The documentation states that you have to add the entity ruler before the “ner” component, but it neither describes how nor gives a link in that paragraph. So here we go: https://spacy.io/api/language#add_pipe. Let’s do this:

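import spacy
from spacy.pipeline import EntityRuler  # spaCy 2.x style import

pattern_tel = {"label": "TELEPHONE", "pattern": [
    {"TEXT": {"REGEX": r"\(?\d{3}\)?"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{3}"}},
    {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{4}"}}
    ]}
# Assumption: pattern_name and pattern_apple are not defined in the post;
# these are illustrative stand-ins.
pattern_name = {"label": "VIP", "pattern": "Fred"}
pattern_apple = {"label": "FRUIT", "pattern": "apple pie"}

nlp = spacy.load('en_core_web_sm')
ruler = EntityRuler(nlp)
ruler.add_patterns([pattern_tel, pattern_name, pattern_apple])
# Insert the ruler before the statistical NER so that its entities take precedence.
# In spaCy 3.x the equivalent is: ruler = nlp.add_pipe("entity_ruler", before="ner")
nlp.add_pipe(ruler, before="ner")
doc = nlp("This is Fred and his number is 123-456-7890 to get an apple  pie")
print([(ent, ent.label_) for ent in doc.ents])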

Output:

[(Fred, 'VIP'), (123-456-7890, 'TELEPHONE')]

Summary

We’ve seen how to add a regex-based custom rule to the named entity recognition of spaCy. In a subsequent article we will explore more complex patterns.

On classic university education

October 15, 2020 / Markus Toman

On the vast Internet one regularly encounters statements dismissing formal education as useless in the so-called real world. Here I would like to take a different stance and elaborate on the positive aspects of university.

First let’s set the context straight:
I studied at the Vienna University of Technology and the Medical University of Vienna with no tuition. I can see that going into debt, as is common in the USA, is something you have to consider carefully.
I know that nowadays there are lots of high-quality MOOCs out there. This wasn’t the case when I studied. I think you can get a lot out of them. But a large portion of the best ones are created by universities or at least based on university courses (like the famous Machine Learning course by Andrew Ng). Also, you have to be really careful not to become a cherry-picker. I would definitely not have spent so much time on Mathematics without being forced to. Now I am glad I did – actually I wish they had forced me to do even more, even though I had to take about 5 statistics-focused and 4 general math courses.

A quite common complaint is that universities teach many esoteric topics, or teach topics to a depth you don’t need in your day-to-day life. But if they did not teach these things, who would? Who would work on the foundations of technology if no one taught them?
As an example, when learning 3D graphics at university you usually start with topics like Bresenham’s line drawing algorithm, look at the projection matrices, and probably implement that stuff from scratch. Obviously in 99% of the Unity3D game dev jobs out there you don’t REALLY need this knowledge. It will help your understanding, but in fact you can do a lot without even knowing what a matrix is (I worked on 3D viz in Java3D and programmed fixed-pipeline OpenGL in C++ using NeHe’s tutorials in 2000 before knowing what a rotation matrix is). But who will write the next Unreal Engine or develop the latest raytracing techniques if everyone is just taught to use existing game engines?

But besides this global view on education, why is it interesting for the individual who does not plan to work in research?

Your career will likely last decades. It’s well worth spending a few years of it on foundations. This is the kind of knowledge you can’t easily pick up on the job. In my first year I easily spent 15-20 hours a week on Mathematics. If I remember correctly it was one lecture a day, where I reviewed the material from the previous day on the train commute and afterwards reviewed it again. Then I spent about 8 hours every Sunday on the weekly exercises, which then had to be presented at the blackboard during another weekly unit (for which I usually prepared myself yet again beforehand).
You will never get the chance to dive so deep and invest so much time once you are part of the workforce. You can easily learn React on the job, but they probably won’t pay you for solving differential equation exercises during work. And in fact I often wish I had put more credits into such courses. Programming exercises were useless for me – I already worked part-time as an embedded and network developer and got paid for non-toy exercises. Writing simple FTP clients in Java was a waste of time when I worked on a network monitoring solution in my job. So the courses I really did benefit from were those exotic foundational topics.

That being said, there were also more than enough interesting things I could work on at university that you are unlikely to work on in the “real world” afterwards. The real world is often boring in comparison. At university you might deal with surgery robots, while in your job you are much more likely to fix doctors’ computers, maintain a patient database or write format converters in Java.

Let’s look at some of the cooler topics I encountered at university:
– 3D renderer from scratch (Computer Graphics 1)
– Internet security challenges, gaining ranks and titles, with leaderboard and option to join a CTF (Internet Security 1)
– X-Ray segmentation algorithms (Medical Computer Vision)
– Iris recognition systems (Seminar work)
– Played with a cathode ray tube (Physics practicals)
– Knowledge based system for infant ventilation (Knowledge based systems)
– Networked 3D game (Computer Graphics 2)
– Neuroprosthetics (Summer course)
– Developed an SNMP client for some exotic embedded device (Practical)
– Recorded EMG of my face muscles (Clinical signal processing)
– Multimodal image registration of ophthalmologic images (Thesis at General Hospital Vienna)
– Played with Virtual Reality equipment before it was so readily available in the consumer market (Virtual and Augmented Reality)

Enjoy university while it lasts. Having the opportunity for dense learning and working on such a variety of topics in a short time span is rare after graduation. Of course a job will teach you a lot, but not for 40+ hours a week. It’s not unlikely that you will spend a large portion of your work week on repetitive and not exactly enlightening tasks. Also: you can still specialize in one topic for the subsequent 30-40 years.

Paper summary: Maximizing Mutual Information for Tacotron

June 5, 2020 / Markus Toman

Maximizing Mutual Information for Tacotron, Peng Liu, Xixin Wu, Shiyin Kang, Guangzhi Li, Dan Su, Dong Yu
https://arxiv.org/abs/1909.01145

This publication investigates local information preference as a source of errors in the Tacotron model.

Autoregressive (AR) models predict their next output given their previous outputs, described by a conditional probability P(x_i|x_<i). If they depend only on a single previous output, they have the Markov property. AR models are commonly trained using teacher forcing, which means that the previous values x_<i in each autoregressive step are taken from the training data (ground truth) instead of the prediction. When both ground truth and prediction are used, this is commonly called scheduled sampling or curriculum learning. A conditional autoregressive (CAR) model furthermore bases each prediction on additional input t, i.e. P(x_i|x_<i, t). In the case of Tacotron, t is some sort of linguistic specification, e.g. characters of an orthographic transcription or phones of a phonetic transcription.

The hypothesis of this paper is that a significant share of the errors produced by Tacotron is caused by local information preference, which intuitively means that a given prediction is mostly based on the autoregressive part, i.e. the previous predictions x_<i, and not on the additional input t. For Tacotron this means that the prediction relies on the previous Mel spectral frame while the linguistic input is mostly ignored.

The authors furthermore hypothesize that this is the reason why a reduction factor > 1 works well to make the training more robust. The reduction factor determines how many frames are produced per inference step. If the model has to produce multiple frames given only one previous frame, it is forced to make use of the additional input t to still satisfy the loss function. Furthermore they argue that this might also be the reason why the 0.5 prenet dropout and the rather large frame shift in Tacotron are vital for good results.

The proposed methods aim to reward the model for making use of the additional input t. Dropout Frame Rate (DFR) randomly sets the previous input x_{i-1} to the global mean, so that the model cannot rely solely on the previous frames to make future predictions. The other option is Maximizing Mutual Information (MMI) between linguistic input and predicted output features, to ensure that the model learns a meaningful mapping instead of some subjectively meaningless representation that just satisfies the loss function.
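The frame-dropping idea can be illustrated with a few lines of plain Python. This is a minimal sketch of the idea as summarized above, not the authors' implementation: drop_frames, its parameters and the toy numbers are made up, frames are plain lists of floats, and how the global mean is computed is left to the caller.

```python
import random


def drop_frames(prev_frames, global_mean, rate, rng=random):
    """Replace each previous frame by global_mean with probability `rate`.

    prev_frames: frames (lists of floats) that would be fed back as decoder input.
    global_mean: one frame holding the mean values (assumed to be precomputed).
    """
    return [
        list(global_mean) if rng.random() < rate else list(frame)
        for frame in prev_frames
    ]


frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
mean = [0.3, 0.4]
print(drop_frames(frames, mean, rate=0.0))  # nothing is replaced
print(drop_frames(frames, mean, rate=1.0))  # every frame is replaced by the mean
```

With a rate between 0 and 1, only some of the previous frames carry information, so a model trained on such inputs has to draw on the additional input t more often.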

The results show that both DFR and MMI decrease the number of errors produced by the model and also achieve a good alignment at a much earlier step than without those additions. Furthermore, MMI closes the gap between training and validation error often seen when training Tacotron.

Teaching material on Markov Models

August 13, 2019 / Markus Toman

I’ve created a short exercise on Markov Models for my students. Here’s the story:

“You are a valued employee of MegaCorp Inc.

Locked in your basement office you’ve been gathering information about your coworkers Adalbert, Beatrice and Celis. They’re working in shifts on a single machine and you’ve been monitoring and logging their daily routines. Unfortunately they found out, disabled their webcams, removed your Trojan and cut off your access to switches and routers. Even though you are blind now, it’s still mandatory for you to know who is working at any given time, because only Adalbert isn’t vigilant enough to notice when you sneak up to the ground floor to snatch some donuts and cake from their office fridge.

Except… you still got the old logs and you can see the LEDs of their switch blinking. Unfortunately you don’t have physical access, but perhaps you can still try to figure out whose shift it is?”

So what’s the task?

We model this problem using Markov Models (obviously)
Our Markov states are the activities of a given co-worker
Each activity produces different LED blinking patterns – so each state/activity emits an observation
Three iterations:
- Task 1: Markov chain, only modeling states (activities) over time
- Task 2: Markov chain with observations, modeling states (activities) and observations (blinking LEDs) over time, given we have access to both
- Task 3: Hidden Markov model, modeling states (activities) and observations (blinking LEDs) over time, given we have access only to the observations
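For Task 1, the core step is estimating transition probabilities from logged state sequences. The following is a minimal sketch and not taken from the notebook: it assumes the activity sequences have already been parsed into lists of integer state codes, and transition_matrix is a name made up for illustration.

```python
def transition_matrix(sequences, n_states):
    """Estimate P(next state | current state) by counting transitions."""
    counts = [[0] * n_states for _ in range(n_states)]
    for seq in sequences:
        # Pair every state with its successor within the same sequence.
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # States that were never left get an all-zero row instead of a division error.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Two toy sequences over the states 0 and 1 (made-up data).
toy = [[0, 0, 1, 0], [0, 1, 1]]
for row in transition_matrix(toy, n_states=2):
    print(row)
```

Each row of the result is the distribution over the next state given the current one, which is all a plain Markov chain needs besides a start distribution.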

Let’s look at the data:

You own logs of daily activities for each of the coworkers A, B and C
The logs are CSVs in the format: date|sequence of activities
- Activities: Working (0), pause (1), online gaming (2), browsing the internet (3), streaming TV series (4)
Furthermore you’ve been (independently) gathering data on LED blinking patterns for different activities in the format: activity|pattern
- Pattern: none (0), very low frequency (1), low frequency (2), medium frequency (3), high frequency (4), very high frequency (5)
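For Task 3, one standard way to decide whose shift it is would be to compute how likely an observed blinking sequence is under each coworker's hidden Markov model and pick the most likely one. The forward algorithm does this computation. The sketch below is not taken from the notebook; the parameter names and the toy numbers are made up, and observations are assumed to be integer pattern codes.

```python
def forward_likelihood(obs, start, trans, emit):
    """Probability of a non-empty observation sequence under one HMM.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability that state s emits observation o
    """
    states = range(len(start))
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = [start[s] * emit[s][obs[0]] for s in states]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
            for s in states
        ]
    return sum(alpha)


# Toy model with 2 states and 2 observation symbols (made-up numbers).
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.8, 0.2], [0.3, 0.7]]
print(forward_likelihood([0, 1], start, trans, emit))
```

With one such model per coworker, the coworker whose model gives the highest likelihood for the observed LED sequence is the best guess.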

Get started

You can get the Python/Jupyter notebook here:

https://github.com/m-toman/osue_exercise2/blob/master/osue_markov_exercise_tofix.ipynb

And check out what the correct plots should look like here:

https://github.com/m-toman/osue_exercise2/blob/master/exercise.ipynb

You can also try it on Google colab:

https://colab.research.google.com/drive/1U9BvfaaxiBRsWG3o3HQTC_wRLNaBYAHI

Have fun!

Reading list 10/2018

October 8, 2018 / Markus Toman

I’m a notorious parallel reader, jumping between books all the time.
Here I give a couple of opinions on the books currently in my pipeline:

Deep Learning, Ian Goodfellow and Yoshua Bengio and Aaron Courville
Fluent Python, Luciano Ramalho
Seven Languages in Seven Weeks, Bruce Tate
Clojure for the Brave and True, Daniel Higginbotham
A Tour of C++, Bjarne Stroustrup
The Rust Programming Language, Steve Klabnik and Carol Nichols
(Hyperion, Dan Simmons)

Should recursion be avoided?

September 18, 2018 / Markus Toman

I’ve recently answered a question on the German Quora (https://de.quora.com/Warum-sollte-man-in-Java-lieber-auf-die-Rekursion-verzichten/answer/Markus-Toman) that asked: “Why should you refrain from recursion in Java?”.

I think we should avoid recursion by default, except if:

- you are working in a language where recursion is essential and highly optimized (Haskell, LISP dialects, Erlang/Elixir and most other functional languages)
- the recursive version is much easier to maintain and understand AND does not negatively affect performance to an unreasonable degree
- the recursive algorithm is easier to parallelize (typically because of immutability)

In this post we discuss this topic, starting with high-level Python, going over to C++ and finally ending up in assembly.
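A small Python illustration of why recursion is a questionable default in a language that does not optimize it: CPython limits the recursion depth, so a straightforward recursive sum fails for large inputs where the loop does not. The function names below are made up for this sketch.

```python
import sys


def sum_recursive(n):
    """Sum 1..n recursively; one stack frame per step."""
    return 0 if n == 0 else n + sum_recursive(n - 1)


def sum_iterative(n):
    """Sum 1..n with a loop; constant stack depth."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


print(sum_recursive(100), sum_iterative(100))

deep = sys.getrecursionlimit() + 100  # deeper than the interpreter allows
print(sum_iterative(deep))
try:
    sum_recursive(deep)
except RecursionError as err:
    print("recursive version failed:", err)
```

Raising the limit with sys.setrecursionlimit only moves the problem, since every call still consumes a stack frame.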

Evolution of programming – is simple better?

August 13, 2018 / Markus Toman

Disclaimer: this is an opinionated article, flowing directly from brain to keyboard without lots of references and explanations. I know much of its content is debatable and it just reflects my current opinions and mindset, likely to change without notice.

Recently I watched the Design Patterns Coursera course (https://www.coursera.org/learn/design-patterns) and noticed that it’s full of Gang of Four (GoF) design patterns. If you are a programmer reading about programming online, you’ll find that discussions and material on GoF have been very sparse in recent years, especially with the rising functional programming hype (if we can call it a hype).

I wondered about this and a bit of investigation led me to (mostly) the following opinions:

- GoF patterns are now common knowledge and therefore nobody talks about them anymore, just like nobody talks about using functions or control structures anymore.
- They were a failure, born in the object-oriented Java mindset, made software even more complex and went overboard with useless abstractions.
- They were there to help with programming language shortcomings which are less and less of an issue now, especially with functional features entering existing languages.
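To illustrate the third opinion with one example: the classic Strategy pattern, which in a purely class-based style needs an interface plus one class per strategy, shrinks to passing a function once functions are first-class values. The sketch below is only an illustration; the names and the 20% rate are made up.

```python
def no_tax(price):
    """Pricing strategy that leaves the price unchanged."""
    return price


def add_vat(price):
    """Pricing strategy that adds a made-up 20% tax."""
    return round(price * 1.2, 2)


def checkout(prices, pricing=no_tax):
    """Total the prices, applying the strategy that was passed in as a function."""
    return sum(pricing(p) for p in prices)


print(checkout([10, 20]))
print(checkout([10, 20], pricing=add_vat))
```

Whether this makes the pattern obsolete or just invisible is exactly the kind of question the opinions above disagree on.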

So, was OOP a failure? Will functional programming save us? Are strapped-on design patterns useless and replaced by simpler language features? How did we get to the current situation, and where will we end up?

Artificial intelligence in speech synthesis

September 14, 2017 / Markus Toman

(Dear international readers, this content appeared first in the German “IT Freelancer Magazin” and is unfortunately only available in German)

It was early 2012 when I applied for a doctorate in the field of speech synthesis. To be honest, I could not imagine much about it at the time, but the job posting included machine learning and artificial intelligence. And somehow it sounded… different. Today, more than 5 years later, I have a pretty good overview of the topic, but in conversations with other people I keep running into the same lack of understanding that I had back then. This article aims to outline, in broad strokes and in generally understandable terms, the tasks and problems of modern speech synthesis.

Examples for performance optimization in C and C++, Part 2

August 21, 2017 / Markus Toman

This is part 2 of “Examples for performance optimization in C and C++”, you can find part 1 here: Examples for performance optimization in C and C++, Part 1

Examples for performance optimization in C and C++, Part 1

July 18, 2017 / Markus Toman

This brief case report shows an example of the performance gains that can be achieved in C and C++ through simple analysis and code restructuring. It does not describe performance optimization in detail.