Archive | October, 2011

Very basic TGrep2

31 Oct

“Grep” is a way of searching for strings in files, so it’s a pretty basic tool for your linguistics toolbox. For example, if you’re at a (Unix) command prompt and type:

grep -wi "word" file.txt

You’ll get back a list of all the lines in file.txt that contain “word”.

  • The -wi means that grep will only search for whole words (that’s the w) and will be insensitive to case (so it’ll get word, Word, even wOrD).
  • Note that whether you use single quotes, double quotes, or no quotes depends on your shell and some other things.
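
A couple of other standard grep switch combinations also come in handy (these are plain grep options, nothing linguistics-specific):

grep -wic "word" file.txt
grep -win "word" file.txt

The -c counts the matching lines instead of printing them, and the -n prints a line number next to each match.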

If you’d like me to cover grep more in future posts, let me know–most of the time I get questions about TGrep2, not grep, since there are oodles of grep tutorials all over the web. For example, this one geared for linguists: http://arts.anu.edu.au/linguistics/misc/comp_resources/grep.html.

TGrep2 has the “grep” morpheme at its heart–the T is for “trees” because TGrep/TGrep2 search through syntactic trees to find lines that match a given syntactic structure. Mainly it’s used on the Penn Treebank data: Wall Street Journal stuff, the Brown Corpus, ATIS, and, maybe the most commonly used, the Switchboard corpus.

The hardest part about using a parsed corpus is figuring out the trees. So start off by getting examples of very simple structures similar to what you want. From the WSJ, I search for “he really”.

If I do something like this:

tgrep2 "/^he$/ . /^really$/"

Then I’ll get dumb output:

he

Because all it’s going to return to you is the node matching the FIRST part of your tgrep2 pattern. We’ll see some workarounds for this, but for the moment, the point is that we need to have an S (for Sentence) in front.

To keep “he really” together, put parentheses around them. And relate them by saying “I want sentences that dominate ‘he really’.”

tgrep2 -l "S << (/^he$/ . /^really$/)"

That will get you things like:

(S (`` ``)
   (S (SBAR-NOM-SBJ (WHNP-2 (WP What))
                    (S (NP-SBJ-1 (PRP he))
                       (ADVP (RB really))
                       (VP (VBD wanted)
                           (S (NP-SBJ (-NONE- *-1))
                              (VP (TO to)
                                  (VP (VB know)
                                      (NP (-NONE- *T*-2))))))))
      (VP (VBD was)
          (PP-PRD (IN about)
                  (NP (DT a)
                      (JJ particular)
                      (NN company)))))
   (, ,)
   (CC but)
   (S (NP-SBJ (PRP you))
      (VP (VBD did)
          (RB n't)
          (VP (VB know)
              (NP (DT that)))))
   (. .))

If you wanted sentences that IMMEDIATELY dominated “he really”, then you’d just use one “<” instead of the “<<”.
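
As a sketch of the difference (exact results will depend on your corpus), the single-< version of the earlier query would be:

tgrep2 -l "S < (/^he$/ . /^really$/)"

With this particular pattern you’ll probably get nothing back, since “he” and “really” sit under NP and ADVP nodes rather than directly under S. The single < matters more when you’re relating phrase-level tags, for example something like:

tgrep2 -l "S < /^NP-SBJ/"

which asks for sentences whose subject NP is a direct child of the S.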

But let’s look again at the query we just ran:

tgrep2 -l "S << (/^he$/ . /^really$/)"

What are each of the pieces doing?

  • tgrep2: calls the program–it does need to know where to find things, so hopefully you’ve set up your path.
  • -l: this “switch” is what makes the trees display in “long form”, with everything laid out with indents and whatnot. Alternatively, you could use -t to just print the words (no long line-after-line stuff, no POS tags, no parentheses). If you leave both of these off, you’ll get a hybrid–all the POS tags and parentheses but no indenting.
  • We’ve talked about the S, the <<, and the parentheses (“I want sentences dominating something that’s in parentheses, which I want kept together”).
  • The slashes are what give you a regular expression. In the case of /^he$/, we’re saying we only want words that start with “he” (that’s the caret) and which end with “he” (that’s the dollar sign). If you left off the dollar sign, you would start picking up matches like “hen” and “hedonism”.
  • The dot says that you want “he” to immediately precede “really”.  If you just want any kind of preceding (letting other words intervene), use two dots: ..
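
For instance, here’s a variant of the earlier query with the two-dot version (just a sketch; what comes back depends on the corpus):

tgrep2 -t "S << (/^he$/ .. /^really$/)"

This should also pick up sentences where other words come between “he” and “really” (“he was really…”, and so on).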

Let’s take another example. Here, I’m curious about “under” vs. “beneath”.

tgrep2 "NP < (PP <1 (IN [ < beneath | < under]))"

So what’s happening?

  • This will return NPs that contain a PP headed by either “beneath” or “under”.
  • You can probably see the domination part (NP < PP)
  • The “<1” means that I want an “IN” tag that is the first child of the “PP” tag. You can tweak the number to get the n-th child of something.
  • The brackets and the | say that I want the IN to dominate either “beneath” or “under”
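
If you then want a rough frequency comparison of the two prepositions in this frame, one way to get it (a sketch using ordinary Unix piping) is to run the pattern once per word with -t and count the output lines:

tgrep2 -t "NP < (PP <1 (IN < under))" | wc -l
tgrep2 -t "NP < (PP <1 (IN < beneath))" | wc -l

With -t, each match should come out as a single line of words, so wc -l gives you a match count.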

I’ll wrap up with a few other tips and two examples I got from Hal Tily:

tgrep2 "VP < (/^VB/ << /^load/) < (/^PP/ <1 (IN < with))"
tgrep2 "* < (* < either) < (* < or)"

The first one here searches for VPs that have the verb “load” using a “with” PP. The second one finds all structures with an “either…or” construction.

  • Notice that the tag matched in the load-example is /^VB/, that is, it starts with “VB” but doesn’t have to stop there–there are various VB flavors (VBD, VBG, VBN, and so on), so this will match all of them.
  • Similarly, we want to match “load”, “loading”, “loads”, etc. so we search just for /^load/ NOT for /^load$/.
  • In the either-or example, we use asterisks as a “wildcard” to match anything. (This is an update from TGrep, which used “__”; that still works in TGrep2, but asterisks are a more familiar wildcard to most people.)
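
So if you come across old TGrep recipes, the double-underscore version of the same query should still run under TGrep2:

tgrep2 "__ < (__ < either) < (__ < or)"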

Other thoughts:

  • If you mark a node with \’, then what you’ll print out is that node. That stops you from having to reverse a bunch of >> and << relationships in order to get your desired node on the far left of your query. The following query will let you get the adjectives that appear before nominal “jog”.
    • tgrep2 -t "NP < (/^N/ < jog) << \'JJ"
  • Using -a will get you multiple matches within each sentence instead of just the first.
  • Using -i will make matching case-insensitive
  • You can stack these up: so for a long-form tree that is case-insensitive and matches multiple occurrences per sentence, you’d do something like:
    • tgrep2 -ail "S << (/^he$/ . /^really$/)"
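
And once a query does what you want, you can always redirect its output into a file for later poking around (the filename here is just an illustration):

tgrep2 -ail "S << (/^he$/ . /^really$/)" > he_really_trees.txt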

There’s a lot more to be said about TGrep2, but this should give you a basic orientation. You can find the manual here: http://tedlab.mit.edu/~dr/Tgrep2/tgrep2.pdf

Top five LDC corpora

30 Oct

In this post, I’d like to start off reviewing some of the most popular corpora that the Linguistic Data Consortium provides–with a few possibilities for alternatives. If you have a favorite corpus, send it in!

1. TIMIT Acoustic-Phonetic Continuous Speech Corpus

If you’re interested in speech recognition, here’s one of your main resources. It’s basically 630 speakers (covering 8 American English dialect regions), each reading 10 “phonetically rich sentences”. Plus, the recordings are time-aligned with orthographic and phonetic transcripts. It’s been hand-verified and it’s pre-split into training/test subsets.

2. Web 1T 5-gram Version 1

This is basically Google n-gram stuff for English (unigrams to 5-grams). So if you want collocates and word frequencies, this is pretty good. There are 1 trillion word tokens, after all.

  • 95 billion sentences
  • 13 million unigrams
  • 1 billion 5-grams

This data was released in 2006, though, so there should be more up-to-date resources.

There’s also a 2010 (Mandarin) Chinese 5-gram web corpus: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2010T06

A 2009 Japanese 7-gram web corpus: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T08

And a 2009 “European” 5-gram corpus covering Czech, Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish, and Swedish: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T25

3. CELEX2 (but why not try SUBTLEX?)

This corpus, circa 1996, gives you ASCII versions of three lexical databases for English, Dutch, and German. You get:

  • orthography variations
  • phonological stuff like syllables and stress
  • morphology
  • word class, argument structures
  • word frequency, lemma frequency (based on “recent and representative text corpora”).

In truth, if you just want word counts for American English, then consider using SUBTLEXus: http://subtlexus.lexique.org/. They make the case that CELEX is actually a poor source of frequency information (I’ll let you follow the link for their arguments against it and against Kucera and Francis). Actually, if you go ahead and check out http://elexicon.wustl.edu/, you can download words (and non-words) with reaction times and all the morphology/phonology/syntax stuff that CELEX2 gives you.

4. TIDIGITS

Okay, I had never heard of this one. The main use for this corpus is speech recognition–for digits. You get 111 men, 114 women, 50 boys, and 51 girls, each pronouncing 77 different digit sequences, recorded in 1982.

5. ECI Multilingual Text

So the European Corpus Initiative Multilingual Corpus 1 (ECI/MCI) has 46 subcorpora totaling 92 million words (marked up, but you can get the non-marked-up stuff, too).

Twelve of the component corpora are parallel corpora, with translations across 2-9 other languages.

Most of the stuff is journalistic, and there are some dictionaries, literature, and international organization publications/proceedings/reports. The material seems to come mostly from the 1980s and early 1990s.

Anyone have a favorite corpus of UN delegates talking and being translated into a bunch of different languages?

Languages available: Albanian, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, French, Gaelic, German, Italian, Japanese, Latin, Lithuanian, Mandarin Chinese, Modern Greek, Northern Uzbek, Norwegian, Norwegian Bokmaal, Norwegian Nynorsk, Portuguese, Russian, Serbian, Slovenian, Spanish, Standard Malay, Swedish, Turkish

Getting started with Stanford corpora

29 Oct

(Much of this blog is general-purpose information, but this post is pretty specific to people at Stanford.)

To get started with our corpora, please email the corpus TA (that’s me–tylers at stanford). What you need to do depends a bit on the corpora you want to use–here are the instructions on how to get approved for access.

Now, let’s say you have approval. A number of our corpora are stored on Stanford servers, which means round-the-clock access (other corpora involve checking out CDs). We’re going to be overhauling what’s stored on the servers, btw, so if you have any requests, let me know.

How to connect to AFS and the online corpora

  1. You’ll need to be able to connect to the Stanford servers, so download “terminal emulation” software. Stanford recommends Secure CRT for Windows or LelandSSH for the Mac.
  2. Once you’ve got a terminal emulation program, use it to connect to cardinal.stanford.edu, corn.stanford.edu, or spires.stanford.edu (using “ssh”).
  3. You can find our corpora by changing to this directory (cd = change directory):
    • cd /afs/ir/data/linguistic-data/
  4. “ls” will list the contents of the directory and you can jump into interesting subdirectories by using “cd”. If this is feeling unfamiliar to you, you probably want to ask me or one of your geekier friends for some help.
  5. Readme files give useful information. To read one of them (or any other text file), try this command, and use the space bar to get to the next page (a full session is sketched right after this list).
    • less readme.txt
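
Putting steps 2-5 together, a session looks roughly like this (substitute your own SUNetID; the exact readme filename varies from directory to directory):

ssh yourSUNetID@corn.stanford.edu
cd /afs/ir/data/linguistic-data
ls
less readme.txt

In less, the space bar pages forward and q quits.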

Adding TGrep2 to your path

When you add something to your “path”, it means that you don’t have to type as much later on. You’ll want to do this if you have any desire to use the syntactically parsed portions of, say, the Wall Street Journal or Switchboard.

  1. To add stuff to your account so you can use tgrep2 on the Wall Street Journal wherever you are, type:
    • cat >>~/.bashrc
    • export PATH=$PATH:/afs/ir/data/linguistic-data/bin/linux_2_4
    • export TGREP2_CORPUS=/afs/ir/data/linguistic-data/Treebank/tgrep2able/wsj_mrg.t2c.gz
  2. Note that if you prefer, you can make Switchboard your default. Instead of “wsj_mrg.t2c.gz”, type “swbd.t2c.gz” above.
  3. Press Ctrl+D to finish. Then log out and log back in, because your path won’t change until you do (or see the note after this list for a way around the re-login).
  4. Note that you can always call the OTHER corpus in TGrep2 by using a command like this:
    • tgrep2 -c /afs/ir/data/linguistic-data/Treebank/tgrep2able/EITHER-'wsj'-OR-'swbd'.t2c.gz
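
As a final sketch of how this fits together: after the cat >> step above, the tail of your ~/.bashrc should contain those two export lines. If you don’t want to log out and back in, sourcing the file refreshes your current session, and then you can query the default corpus or point -c at the other one (assuming I’m reading the -c switch right, the search pattern just follows the corpus file):

source ~/.bashrc
echo $TGREP2_CORPUS
tgrep2 "S << (/^he$/ . /^really$/)"
tgrep2 -c /afs/ir/data/linguistic-data/Treebank/tgrep2able/swbd.t2c.gz "S << (/^he$/ . /^really$/)"

The echo line just confirms that the default corpus variable got picked up.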