Quick, is Popeye a person? The answer really is, “it depends”. And we think that’s the right one.
Named Entity Recognition is one of the basic building blocks of natural language processing. It’s crucial for any kind of sentiment analysis or text analysis, because if you aren’t sure whether the “Washington” you’re talking about is Washington D.C., George Washington, or Washington Memorial Hospital, then your results will be, well, meaningless.
One of our products is “information extraction”: identifying People, Locations, Organizations and the like in any amount of text, in any language. If you’re asking us to help you do named entity identification, odds are you want to use it for some specific purpose. So flexibility is important.
This post is about some of the cases in Korean that we came across this week that I think are the most fun.
What’s a person?
First, do you want 뽀빠이 (‘Popeye’) and 스파이더맨 (‘Spider-Man’) or not? If you are trying to track and understand real-world events, probably not (sorry, fans). But if you are trying to understand the ways in which languages express the actions of people, then you probably do want them.
If Spider-Man isn’t a person, is ‘web-sling’ not a verb?
Moving to humans, what about 달라이 라마 (‘Dalai Lama’)? Technically, this is the name of a position: the head of the Gelug school of Tibetan Buddhism (the phrase is from Mongolian ‘ocean’ plus Tibetan ‘teacher’).
The actual man that we call the Dalai Lama is བསྟན་འཛིན་རྒྱ་མཚོ་ (‘Tenzin Gyatso’), the 14th Dalai Lama. In most contexts, people who are referring to “The President” mean some particular president (in English news, Barack Obama). And in most texts—English or Korean, certainly—the Dalai Lama means a particular individual. But your system needs to be able to discern when you’re talking about a generic role and when you’re talking about a person. Not all languages have capitalization to help (and even in English, we don’t always use capitalization to make a distinction between the president and The President).
The importance of context
Let’s say you want to automatically extract real people. In that case, you probably don’t want to get 현무 (Hyunmoo), one of four legendary gods. But there’s a famous MC whose full name is 전현무 (Jeon Hyunmoo) who can be referred to as 현무.
Do you want gods and creatures to count as “People” in your Named Entity Recognition system? In Korea, there’s a famous MC who shares the name of this legendary god.
Similarly, 탑 could be referring to a ‘pagoda’ or a ‘tower’, but it’s also the name of a famous rapper in the boy band Big Bang.
사야 is the name of a character in the movie Blood: The Last Vampire (Saya). But this also looks like a form of the verb ‘to buy’ when you put it in a ‘must buy’ context. One way to incorporate “context” is to understand whether you’re talking about movies. A simpler linguistic way relies on the fact that Korean verbs appear at the end of sentences, so if the word 사야 appears at the end of a Korean sentence, odds are that the sentence is about shopping and not about a 400-year-old samurai.
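To make the position trick concrete, here is a toy sketch in Python. It is not how our system works; the function name `guess_reading`, the whitespace tokenization, and the two-token “end of sentence” window are all assumptions made for illustration. The window is there because in a ‘must buy’ sentence 사야 is usually followed by a short auxiliary such as 해.

```python
AMBIGUOUS = "사야"  # the name 'Saya', or a form of the verb 'to buy'


def guess_reading(sentence, window=2):
    """Guess how 사야 is being used, based only on where it sits.

    Returns 'verb', 'name', or None when 사야 does not appear.
    The `window` parameter is a hypothetical knob: how many tokens
    from the end still count as "the end of the sentence".
    """
    # Naive tokenization: drop final punctuation, split on spaces.
    tokens = sentence.strip().rstrip(".!?").split()
    for position, token in enumerate(tokens):
        if not token.startswith(AMBIGUOUS):
            continue
        at_end = position >= len(tokens) - window
        # A bare 사야 near the end looks like the verb; anything else
        # (earlier in the sentence, or with a particle attached, as in
        # 사야는) is treated as a possible name.
        if token == AMBIGUOUS and at_end:
            return "verb"
        return "name"
    return None


print(guess_reading("우유를 사야 해"))          # a 'must buy' sentence
print(guess_reading("사야는 뱀파이어를 사냥한다."))  # Saya as the subject
```

A real system would use a morphological analyzer and far richer context than token position, so treat this as a picture of the idea and nothing more.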
Culture, a hot guy and an awesome alphabet
Sometimes what you have to understand is more cultural in nature. 세븐 is the way you represent the English word ‘seven’ in Korean. That is, you pronounce 세븐 as something like ‘sebeun’ in Korean. I bring this up for two reasons. One is to show you how awesome the Hangul (Korean) alphabet is. The other reason is that it’s the name of a famous Korean singer (‘Se7en’ when it appears in Latin characters). I also call reason #2 “Friday eye candy”.
세븐 would like to introduce you to the Hangul alphabet
Back to the alphabet: 세 is a single character that represents a syllable. It’s made up of two parts: ㅅ, which is ‘s’, and ㅔ, which is ‘e’, so together they are ‘se’. Now things get even more exciting. The single character 븐 (‘beun’) is made up of ㅂ for the ‘b’ sound, ㅡ for an ‘eu’ sound AND ALSO THERE’S A ㄴ SMOOSHED IN THERE! That’s what gets the ‘n’ sound. You can see the full list of combinations here. There are 11,172 mathematically possible characters (19 initial consonants × 21 vowels × 28 final-consonant options, counting “no final consonant” as one of them), although as you can imagine all but a couple thousand of those are basically impossible in Korean phonology.
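If you want to see the smooshing for yourself, Unicode lays out the precomposed Hangul syllables in a regular grid starting at U+AC00, so you can pull a character apart with plain arithmetic. Here is a minimal Python sketch; the names `decompose`, `LEADS`, `VOWELS` and `TAILS` are just ones I picked for this example.

```python
HANGUL_BASE = 0xAC00  # code point of 가, the first precomposed syllable

# The letters in Unicode's order: 19 initial consonants, 21 vowels,
# and 27 final consonants plus "no final consonant" (28 options).
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

SYLLABLE_COUNT = len(LEADS) * len(VOWELS) * len(TAILS)  # 19 * 21 * 28


def decompose(syllable):
    """Split one precomposed Hangul syllable into (lead, vowel, tail)."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < SYLLABLE_COUNT:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    # Syllables are ordered by lead, then vowel, then tail.
    lead, rest = divmod(index, len(VOWELS) * len(TAILS))
    vowel, tail = divmod(rest, len(TAILS))
    return LEADS[lead], VOWELS[vowel], TAILS[tail]


print(SYLLABLE_COUNT)
for character in "세븐":
    print(character, decompose(character))
```

The empty string in the third slot for 세 means there is no final consonant, while 븐 comes back with ㄴ in that slot.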
At any rate, Hangul is such a cool alphabet that it gets a holiday: Hangul Day is on October 9th in South Korea. We like this alphabet so much that Hangul Day is one of our official company holidays. No joke.
Read more about…
- Improving privacy with Named Entity Recognition
- The best street names
- Burning Man camp names vs. Fortune 500 company names
- Horse naming
- John Wayne and hyponyms but mostly hyponyms
- 16 places that aren’t anywhere
– Tyler Schnoebelen (@TSchnoebelen)