Idibon’s focus is on language technologies, but we also have pretty good chops when it comes to spatial data—check out our post on hosting FEMA’s aerial damage assessments following Hurricane Sandy. Geotagging gives us a way to understand language in terms of latitude/longitude. That’s often a way to make text analysis even more insightful and actionable.
Recently, while we were stitching together textual and geographic information, we found 16 places that (almost) slip through the cracks.
The Natural Earth database is a great resource for geolocation, giving the outlines of ~250 countries and territories so that people can easily map coordinates to countries. In addition to a fairly standard English-language name, another piece of information that comes back is a country ISO code, which acts as a unique identifier so we can then pull in information from other data sources.
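To make "mapping coordinates to countries" concrete, here is a minimal sketch of a point-in-polygon test using ray casting. The outlines below are made-up rectangles with hypothetical names, not Natural Earth data, and real country outlines (multiple rings, holes, borders that cross the antimeridian) need more care than this.

```python
# Toy outlines: each "country" is a single ring of (lon, lat) vertices.
# The names and shapes are hypothetical, for illustration only.
OUTLINES = {
    "Squareland": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "Boxia": [(20, 0), (30, 0), (30, 5), (20, 5)],
}

def point_in_ring(lon, lat, ring):
    """Ray casting: count how many edges a ray going east from the point crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def locate(lon, lat):
    """Return the name of the first outline containing the point, or None."""
    for name, ring in OUTLINES.items():
        if point_in_ring(lon, lat, ring):
            return name
    return None
```

With these toy shapes, `locate(5, 5)` falls inside "Squareland" and `locate(15, 5)` falls in the gap between the two outlines, so it returns `None`.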
In particular, we use the ISO code to get information from GeoNames, which has a ton of details about places in the world (like giving alternative names for a place in lots of different languages). The vast majority of places in the Natural Earth data have ISO codes. But 16 of them don’t. We can still find these places in GeoNames; we just can’t look them up by direct reference. A tour of these mismatches takes us around the world, mostly to controversial places. You’ll be well aware of some of them but others you’ve probably never heard of.
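The lookup described above can be sketched roughly as follows: try the ISO code first, fall back to a name match, and flag whatever is left for a human to resolve. All of the records here are hand-made stand-ins, the field names are hypothetical, and treating "-99" as the "no ISO code" placeholder is an assumption of this sketch rather than something taken from the pipeline described in this post.

```python
# Assumed placeholder for "no ISO code" in the toy records below.
MISSING = "-99"

# Hypothetical stand-ins for rows from the outline dataset.
outline_records = [
    {"name": "France", "iso_a2": "FR"},
    {"name": "Somaliland", "iso_a2": MISSING},
    {"name": "Siachen Glacier", "iso_a2": MISSING},
]

# Hypothetical stand-ins for a gazetteer, indexed two ways.
gazetteer_by_iso = {"FR": {"label": "France"}}
gazetteer_by_name = {"somaliland": {"label": "Somaliland"}}

def link(record):
    """Return (gazetteer entry or None, how the match was made)."""
    code = record["iso_a2"]
    if code != MISSING and code in gazetteer_by_iso:
        return gazetteer_by_iso[code], "iso"
    # Fall back to a case-insensitive name match.
    key = record["name"].casefold()
    if key in gazetteer_by_name:
        return gazetteer_by_name[key], "name"
    # Nothing matched automatically: leave it for manual review.
    return None, "unmatched"

methods = {r["name"]: link(r)[1] for r in outline_records}
```

Keeping track of how each match was made is a small design choice that pays off later: the "name" and "unmatched" groups are exactly the records worth checking by hand.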
1: Siachen Glacier
The Karakoram range of the Himalayas has the highest density of tall peaks in the world (it doesn’t have Mount Everest but it has the number two peak in the world, K2). You’ll find references to it in Rudyard Kipling’s Kim and it’s a big part of Greg Mortenson’s Three Cups of Tea.
In addition to being the boundary between two colliding continents, the Karakoram has been the boundary of two colliding nuclear powers: India and Pakistan. Skirmishes are major news events in the area, but more soldiers have died from weather conditions than combat.
The Siachen Glacier is 43 miles (70 km) long, which is really big. But India, which controls the area, can expect to have less and less of it. By 2035, it’s estimated that it will be only 1/5 of its current size. (Go watch Chasing Ice.)
2 and 3: Serranilla Bank and Bajo Nuevo Bank
If you look in Wikipedia, you might wonder why Banco Serranilla is so disputed given that it is mostly underwater. Serranilla Bank has been a point of conflict between Colombia, Honduras, Nicaragua, and the US; Bajo Nuevo Bank has been disputed by Colombia, Jamaica, Nicaragua, and the US. It looks like most of these claims are pretty dormant, except for the Nicaragua vs. Colombia claims. Colombia has been occupying both of them, and last November the International Court of Justice said, yep, Colombia has sovereignty over the areas.
4 and 5: Scarborough Reef and the Spratly Islands
There’s a lot of stuff going on in the South China Sea. The Scarborough Reef/Shoal (Huangyan Island in Chinese) involves a dispute between China, Taiwan, and the Philippines. The People’s Republic effectively has control over the area, though there were some military conflicts last year with the Philippines and it’s still an actively contested area. The shoal is about 58 square miles (150 sq km).
The Spratly Islands cover a much bigger area: about 164,100 square miles of sea (425,000 sq km), though the land in this area is only about 1.9 square miles total (4.9 sq km). They are also more disputed: China, Taiwan, and the Philippines, but also Malaysia and Brunei.
6: Baikonur
Baikonur is the site of the Baikonur Cosmodrome (which is obviously one of the best words possible). That’s the world’s first and biggest space launch facility: Sputnik 1 and Vostok 1 both took off from here.
Baikonur is not so much a disputed city as it is a “rented” one. It’s situated in Kazakhstan but Russia administers it.
7: Coral Sea Islands
These are mostly uninhabited islands and reefs northeast of Queensland, Australia. But they are involved in what is now my favorite dispute in the world because the national anthem for the underdog in the dispute is Gloria Gaynor’s version of I Am What I Am. In January of 2004, the Gay & Lesbian Kingdom of the Coral Sea Islands claimed the territory. Their first stamps were issued in 2006, “with the aim of creating a high and distinctive reputation amongst the philatelic fraternity”. Obviously, that is worth repeating: “philatelic fraternity”.
8 and 9: Northern Cyprus and the Cyprus UN Buffer Zone
Cyprus gained independence from British rule in 1960, with a constitution meant to treat Greek Cypriots and Turkish Cypriots fairly. These protections started being threatened relatively soon thereafter. Then there was a coup in 1974, a possible annexation to Greece, and a Turkish invasion and…can we go back to the Coral Sea Islands? Turkey is really the only nation that recognizes Northern Cyprus as a nation of its own. It is separated from the rest of Cyprus by the UN Buffer Zone, which is about 134 square miles (346 sq km). There are about 1.1 million people on the whole island: about 300,000 live in Northern Cyprus and about 10,000 live in the buffer zone.
10 and 11: Dhekelia and Akrotiri
Wait! We’re not done with Cyprus. When the British Empire said Cyprus could be independent in 1960 it said “Well, except we want to keep about 3% of it.” Mainly because military bases there are a great asset (you’re close to the Suez Canal).
12: Clipperton Island
The French control this coral atoll (2.3 sq mi/6 sq km). Or rather coconut palms control it but the Minister of Overseas France lists it on his LinkedIn profile. Good line from Wikipedia: “It has had no permanent inhabitants since 1945. It is visited on occasion by fishermen, French Navy patrols, scientific researchers, film crews, and shipwreck survivors.”
13: Somaliland
What we see as “Somalia” on most of today’s maps used to be two parts, one ruled by the British and one by the Italians (and actually even the Italian part, which includes Mogadishu, was eventually under British rule). “British Somaliland” was a protectorate up until 1960 (the same year Cyprus got its independence). The two parts of Somalia became independent at separate times, but joined together by the end of 1960. There was a coup in 1969 and the military took control and took the country in a communist direction (“Major General Mohamed Siad Barre, Chairman of the Supreme Revolutionary Council”).
In 1991, Barre was overthrown and the northern part of Somalia, the part that was British Somaliland, declared independence. You are probably aware of all the chaos of Somalia; most of that has been in the south. The story of the violence there is long and difficult, but Somaliland has been relatively stable and functional. No one recognizes it as a nation, however.
14: Guantanamo Bay
You are almost certainly aware that the US Navy operates a base in Cuba known as Guantanamo Bay. It’s the largest harbor on the south side of Cuba and its steep hills keep it cut off from the rest of Cuba. The Cuban-American Treaty of 1903 gave the US a lease, but Cuba considers that treaty invalid (obtained by threats of force) and has been protesting it since 1959. The US sends Cuba a rent check for $4,085 every year. Only one of these has ever been cashed (back in 1959; Castro says it was by mistake since it was still during the early part of the Cuban Revolution).
15: Kosovo
You are also probably aware of Kosovo, an area long associated with Serbia that has an Albanian majority. Kosovo declared its independence in 2008. There are 101 countries that recognize it as such, though Serbia does not. (Northern Cyprus does but the Republic of Cyprus does not.)
16: Indian Ocean Territory
The British claim over 21,000 square miles (54,400 sq km) of ocean, made up of about 23 sq miles of land (60 sq km). The largest island is Diego Garcia (17 sq mi/44 sq km), which the US and the UK operate as a joint military facility. The native population of Chagossians was forcibly evicted in the 1960s (to Mauritius and the Seychelles). They are making some strides in winning court battles (meanwhile there are British conservation plans that have pleasant environmental rationales but are at least partly about keeping the Chagossians at bay).
According to economist Peter Hammond, the fact that the British still control both the British Indian Ocean Territory and the Pitcairn Islands (in the Pacific) is the reason that the sun still does not, technically, set on the British Empire.
Pitfalls of place names
Another major natural language processing task is turning places-that-come-in-the-form-of-words into places-in-the-form-of-coordinates. Consider coming across Montmartre. If I give you no other context, you’d want to guess that I’m talking about something in the north of Paris. But am I talking about the area or the hill that the area is named after? The Moulin Rouge is in Montmartre but it isn’t on Montmartre. And if the occurrence were actually part of Cafe Montmartre then it could be a number of places. Yelp tells me there’s a Cafe Montmartre in Reston, Virginia. It’s also known in literary circles as a cafe in Prague where Franz Kafka hung out. (Fair is fair: Harry’s New York Bar is most famously a bar in Paris.)
These are the kinds of reasons why you want at least a little bit more natural language processing intelligence in your Named Entity Recognition system rather than just using keywords. This also helps you know if someone talking about US is discussing the United States of America or just really yelling the first person plural.
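As a toy illustration of why context beats a bare keyword match, the sketch below guesses whether a standalone "US" is the country or a shouted "us" by looking at the words around it. This is a deliberately crude heuristic with a hypothetical function name, not how a real Named Entity Recognition system works; it only shows that the same token needs different treatment depending on its neighbors.

```python
import re

def classify_us(sentence):
    """Guess what an upper-case 'US' token means, or None if there is none."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    if "US" not in tokens:
        return None
    # Ignore the token itself and one-letter words like "I" and "A".
    others = [t for t in tokens if t != "US" and len(t) > 1]
    # If everything else is upper-case too, the writer is probably shouting.
    if others and all(t.isupper() for t in others):
        return "pronoun"
    return "country"
```

So "The US signed the treaty" comes out as the country, while "THEY NEVER LISTEN TO US" comes out as the pronoun. A keyword matcher would have tagged both as the United States.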
One of the ways maps are useful is that they relate places together: New York is in New York is in the US is in North America is on planet Earth. Here again you have a little bit of trouble: there are lots of alternative words and phrases (“Big Apple”, “NY”). And what do you do with places that don’t have obvious latitude/longitude coordinates (“Atlantis”) or which have really high altitude coordinates (“Heaven”)? What do you do about places that don’t exist anymore (“USSR”, “Free Independent Republic of West Florida”)? If you’ve a date in Constantinople she’ll be waiting in Istanbul.
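A minimal sketch of the "is in" chain and the alias problem might look like this. The tables are tiny and hand-made, the canonical names are choices made for this example, and mapping "NY" to the state rather than the city is an arbitrary decision that a real system would have to make from context.

```python
# Hypothetical alias table: lower-cased surface form -> canonical name.
ALIASES = {
    "big apple": "New York City",
    "ny": "New York State",  # arbitrary choice; "NY" is genuinely ambiguous
}

# Hypothetical containment table: place -> the place it sits inside.
PARENT = {
    "New York City": "New York State",
    "New York State": "United States",
    "United States": "North America",
    "North America": "Earth",
}

def containment_chain(name):
    """Return the place and everything that contains it, innermost first."""
    canonical = ALIASES.get(name.casefold(), name)
    known = set(PARENT) | set(PARENT.values())
    if canonical not in known:
        # Places like "Atlantis" have no entry, so there is nothing to chain.
        return []
    chain = [canonical]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain
```

Returning an empty list for unknown places keeps the hard cases (mythical, defunct, or renamed places) visible instead of silently guessing a location for them.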
– Tyler Schnoebelen (@TSchnoebelen)
ps–Special thanks to Mark Johnston for finding the mismatch that made this post possible and for his general geo-wizardry!