One of the things we talked about in our post about love across 122 years of pop music was that song/movie/book titles are tough for Named Entity Recognition. We said that people’s names are generally easy to identify, but it turns out this doesn’t extend very far down the animal kingdom. You know what else is hard to detect?
Many of the technologies that we use are built on top of probability theory and related statistics. For a long time, probability theory was a second-class citizen in the Math World because of its association with gambling. The world of Big Data Analytics can trace many of its foundational technologies to gambling traditions of dusty horse tracks thousands of years old. But unless you have a good dictionary of horse names, it’s about as difficult to identify the names of horses with statistical machine learning as it is to predict the outcome of the races themselves. Mostly this post is about naming conventions for horses. But I’ll close with betting strategies, so stay tuned.
The Kentucky Derby is about to happen, so let’s look at some of the names of horses that have finished in the top four since 1875 (“137 years of horse names”).
- Atswhatimtalkinabout (4th in 2003)
- Went The Day Well (4th in 2012)
- No Le Hace (2nd in 1972)
- J.R.’s Pet (4th in 1974)
- Run Dusty Run (2nd in 1977)
- Funny Cide (1st in 2003)
- Winning Colors (1st in 1988)
- Lucky Debonair (1st in 1965)
- Once Again (3rd in 1889)
- With Regards (4th in 1942)
We get all sorts of things—smooshedtogetherphrases, prepositional phrases, verb phrases, adverbials, expressions (“Viva America”, “Boola Boola”), possessives where the noun that’s possessed is still part of the named entity (sometimes these don’t even have apostrophes).
The sample above was chosen to show diversity, but it’s a little heavy on multiple-word names. In the data considered here, 57.5% of horse names (316 / 550) were made up of multiple words.
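To make "a little heavy on multiple-word names" concrete, here is a minimal sketch that compares the ten sample names above with the overall 316 / 550 figure. The whitespace-based definition of "multi-word" and the helper name `is_multi_word` are my assumptions for illustration, not necessarily how the original count was done.

```python
# The ten sample names listed above, copied from the post.
sample_names = [
    "Atswhatimtalkinabout",
    "Went The Day Well",
    "No Le Hace",
    "J.R.'s Pet",
    "Run Dusty Run",
    "Funny Cide",
    "Winning Colors",
    "Lucky Debonair",
    "Once Again",
    "With Regards",
]


def is_multi_word(name: str) -> bool:
    """Assumption: a name is multi-word if it has more than one whitespace-separated token."""
    return len(name.split()) > 1


sample_share = sum(is_multi_word(n) for n in sample_names) / len(sample_names)
overall_share = 316 / 550  # counts reported in the post

print(f"sample:  {sample_share:.1%}")   # sample:  90.0%
print(f"overall: {overall_share:.1%}")  # overall: 57.5%
```

Under that definition, nine of the ten sample names are multi-word, well above the share in the full data.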
Let’s do some part-of-speech analysis of the names. We’ll start with the noun-y stuff since that’s the biggest chunk of the data.
Most horse names are nouns
400 of the 550 horse names can be fairly easily classified as nouns/noun phrases, but there’s a fair amount of diversity. Here’s the break-down for these 400:
- 54 of these horses have or are place names, like “Hudson County”, “Seattle Slew” (this one is named for the ‘sloughs’ of Seattle), “Thunder Gulch”, “Spokane”, “Omaha”, “Strathmore”
- 60 are what I might call roles/jobs: “War Admiral”, “Son of John”, “Prince of Thieves”, “Dust Commander”, “Inventor”, “Exterminator”, “Native Dancer”, “Forty Niner”
- 119 have human-like names: “Sir Ribot” (there are lots of Sirs and exalted titles), “Ben Ali”, “Dr. Barkley”, “Ned O.”, “Lady Navarre”, “Omar Khayyam”, “Mata Hari”, “King Celebrity”, “My Dad George”, “Dapper Dan”
- 161 are common nouns/noun phrases: “Tale of Ekati”, “Bad News”, “Barbs Delight”, “Brevity”, “Phalanx”, “Secretariat”, “Citation”, “Gallant Fox”, “Damask”, “Hydromel”
There are a handful of nominal names I decided not to subclassify: “Alexandria” is a fine name and a historic city; “Bourbon” is a drink named after a place named after a royal family called the House of Bourbon. And speaking of House of __, “Vera Cruz” was probably named after the place in Mexico…but it’s a solid human-like name for a drag queen/horse, too.
About the non-nouny names
What’s happening in the other names?
- There are 30 names that are verbs/verb phrases—mostly imperatives like “Make Music for Me”, “Shut Up”, “Thrive”, and “Never Bend”.
- There are 18 adjectival names: “Agile”, “Charismatic”, “Economic”, “Faultless”, “Rickety”, “Real Quiet”
- 6 prepositional phrases: “On My Honor”, “Under Fire”, “At The Threshold”
- 6 phrases from languages other than English: “Gato Del Sol”, “Semper Ego”
- 5 smooshed phrases: “Stepenfetchit”, “Imawildandcrazyguy”
- 4 are odd-ball phrases: “Went The Day Well”, “Classic Go Go”
- 2 are adverbial: “Once Again”, “Decidedly”
There are 21 that are ambiguous, like “Upset”, “Favor”, “Regret”, “Assault”, “Affirmed”, “Misstep”, and “Mate”.
Thanks to the Internet, I know that Zal (2nd place in 1907) was likely named after a legendary Persian warrior. Of course, sometimes Wikipedia ruins things. For example, the image I had of Runnymede (2nd place in 1882) was a galloping Ganymede needing a handkerchief. It turns out that Runnymede is a water meadow near the Thames. Thanks a lot, facts.
Despite the Internet, there are still a lot of unknowns: 58 of the 550 (mostly single words) were left out of the analysis here because their origin was unclear (to me). Was “Ruhe” (3rd place, 1951) really chosen because it’s the German word for ‘rest/calm’? If you know what to do with “Reigh Count”, “Challedon”, “Staretor”, “Flamma”, “Jumron”, “Tompion”, or “Goyamo”, please leave a comment!
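As a quick sanity check, the counts quoted in this post can be added up to see whether they account for all 550 names. This is only a sketch: it assumes the categories do not overlap, and the dictionary keys are my own shorthand labels.

```python
TOTAL = 550  # top-four finishers considered in the post

# Noun-y names: 400 in total, with four reported subtypes.
NOUN_TOTAL = 400
noun_subtypes = {"place": 54, "role/job": 60, "human-like": 119, "common noun": 161}

# Non-noun-y categories as reported above.
non_noun = {
    "verb": 30,
    "adjective": 18,
    "prepositional": 6,
    "non-English": 6,
    "smooshed": 5,
    "odd-ball": 4,
    "adverbial": 2,
}
AMBIGUOUS = 21
UNKNOWN_ORIGIN = 58

# Nominal names that were deliberately not subclassified ("a handful").
unsubclassified = NOUN_TOTAL - sum(noun_subtypes.values())
accounted_for = NOUN_TOTAL + sum(non_noun.values()) + AMBIGUOUS + UNKNOWN_ORIGIN

print(unsubclassified)  # 6
print(accounted_for)    # 550
```

Read this way, the numbers reconcile: six nouns left unsubclassified, and every one of the 550 names lands in exactly one bucket.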
Place your bets!
A big part of what we do around here is probabilities. So you may naturally ask, “This year, should I bet on Frac Daddy or Will Take Charge? How about Charming Kitten or Itsmyluckyday?”
Now, I can help you make money. And I can also help you save money. That’s what I’m about to do. If you want to bet on horses based on their names: don’t.
This is a case—as is probably evident to all but the most onanistic onomasticists—where it’s really hard to conceive how it is that a name would give a horse a leg up. The best hypotheses would be something like “people who like certain types of names tend to have better horses” or “horses with certain names get more love, attention, and training”. It would be fascinating (and money-making) if either of these were true but I find nothing with significance or effect size to report. Predictive modeling requires measures that matter.
The data are, of course, limited to the top-four placing horses. But in that data, there is nothing in the names that distinguishes first-place horses from 2nd/3rd/4th place finishers. For example, a quarter of the horses in the data finish in first place and a quarter of noun-y horses finish first. Whenever it looks like there’s a difference in the overall percentages, the counts are too small to really indicate anything. Sorry, book-makers and bet-placers!
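To illustrate why small counts "don't really indicate anything", here is a sketch of a two-sided Fisher's exact test on a 2x2 table, written with the standard library only. The numbers are hypothetical: I am imagining a six-name category in which three horses won, compared against 400 names of which 100 (a quarter) won. They are not results from the actual data, and this is not necessarily the test used for the analysis above.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are groups of names; columns are (finished first, did not finish first).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x first-place finishes in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    p_obs = prob(a)
    # Sum every table that is at least as unlikely as the observed one.
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))


# HYPOTHETICAL counts, for illustration only:
# small category: 3 winners out of 6; comparison group: 100 winners out of 400.
p_value = fisher_exact_two_sided(3, 3, 100, 300)
print(f"p = {p_value:.3f}")
```

Even though 50% looks very different from 25%, a six-horse category is far too small for that gap to clear a conventional 0.05 threshold.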
– Tyler Schnoebelen (@TSchnoebelen)