Archive | October, 2011

Very basic TGrep2

31 Oct

“Grep” is a way of searching for strings in files, so it’s a pretty basic tool for your linguistics toolbox. For example, if you’re at a (Unix) command prompt and type:

grep -wi "word" file.txt

You’ll get back a list of all the lines in file.txt that contain “word”.

  • The -wi means that grep will only search for whole words (that’s the w) and will be insensitive to case (so it’ll get word, Word, even wOrD).
  • Note that whether you use single quotes, double quotes, or no quotes depends on your shell and some other things.
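
A couple of other standard grep switch combinations also come in handy (these are plain grep options, nothing linguistics-specific):

grep -wic "word" file.txt
grep -win "word" file.txt

The -c counts the matching lines instead of printing them, and the -n prints a line number next to each match.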

If you’d like me to cover grep more in future posts, let me know–most of the time I get questions about TGrep2, not grep, since there are oodles of grep tutorials all over the web. For example, this one geared for linguists: http://arts.anu.edu.au/linguistics/misc/comp_resources/grep.html.

TGrep2 has the “grep” morpheme at its heart–the T is for “trees” because TGrep/TGrep2 search through syntactic trees to find lines that match a given syntactic structure. Mainly it’s used on the Penn Treebank data: Wall Street Journal stuff, the Brown Corpus, ATIS, and, maybe the most commonly used, the Switchboard corpus.

The hardest part about using a parsed corpus is figuring out the trees. So start off by getting examples of very simple structures similar to what you want. From the WSJ, I search for “he really”.

If I do something like this:

tgrep2 "/^he$/ . /^really$/"

Then I’ll get dumb output:

he

Because all it’s going to return to you is the node matching the FIRST part of your tgrep2 pattern. We’ll see some workarounds for this, but for the moment, the point is that we need to have an S (for Sentence) in front.

To keep “he really” together, put parentheses around them. And relate them by saying “I want sentences that dominate ‘he really’.”

tgrep2 -l "S << (/^he$/ . /^really$/)"

That will get you things like:

(S (`` ``)
   (S (SBAR-NOM-SBJ (WHNP-2 (WP What))
                    (S (NP-SBJ-1 (PRP he))
                       (ADVP (RB really))
                       (VP (VBD wanted)
                           (S (NP-SBJ (-NONE- *-1))
                              (VP (TO to)
                                  (VP (VB know)
                                      (NP (-NONE- *T*-2))))))))
      (VP (VBD was)
          (PP-PRD (IN about)
                  (NP (DT a)
                      (JJ particular)
                      (NN company)))))
   (, ,)
   (CC but)
   (S (NP-SBJ (PRP you))
      (VP (VBD did)
          (RB n't)
          (VP (VB know)
              (NP (DT that)))))
   (. .))

If you wanted sentences that IMMEDIATELY dominated “he really”, then you’d just use one “<” instead of the “<<”.
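
As a sketch of the difference (exact results will depend on your corpus), the single-< version of the earlier query would be:

tgrep2 -l "S < (/^he$/ . /^really$/)"

With this particular pattern you’ll probably get nothing back, since “he” and “really” sit under NP and ADVP nodes rather than directly under S. The single < matters more when you’re relating phrase-level tags, for example something like:

tgrep2 -l "S < /^NP-SBJ/"

which asks for sentences whose subject NP is a direct child of the S.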

But let’s look again at the query we just ran:

tgrep2 -l "S << (/^he$/ . /^really$/)"

What are each of the pieces doing?

  • tgrep2: calls the program–it does need to know where to find things, so hopefully you’ve set up your path.
  • -l: this “switch” is what makes the trees display in “long form”, with everything laid out with indents and whatnot. Alternatively, you could use -t to just print the words (no long line-after-line stuff, no POS tags, no parentheses). If you leave both of these off, you’ll get a hybrid–all the POS tags and parentheses but no indenting.
  • We’ve talked about the S, the <<, and the parentheses (“I want sentences dominating something that’s in parentheses, which I want kept together”).
  • The slashes are what give you a regular expression. In the case of /^he$/, we’re saying we only want words that start with “he” (that’s the caret) and which end with “he” (that’s the dollar sign). If you left off the dollar sign, you would start picking up matches like “hen” and “hedonism”.
  • The dot says that you want “he” to immediately precede “really”.  If you just want any kind of preceding (letting other words intervene), use two dots: ..
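
For instance, here’s a variant of the earlier query with the two-dot version (just a sketch; what comes back depends on the corpus):

tgrep2 -t "S << (/^he$/ .. /^really$/)"

This should also pick up sentences where other words come between “he” and “really” (“he was really…”, and so on).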

Let’s take another example. Here, I’m curious about “under” vs. “beneath”.

tgrep2 "NP < (PP <1 (IN [ < beneath | < under]))"

So what’s happening?

  • This will return NPs that contain a PP headed by either “beneath” or “under”.
  • You can probably see the domination part (NP < PP)
  • The “<1” means that I want an “IN” tag that is the first child of the “PP” tag. You can tweak the number to get the n-th child of something.
  • The brackets and the | say that I want the IN to dominate either “beneath” or “under”
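
If you then want a rough frequency comparison of the two prepositions in this frame, one way to get it (a sketch using ordinary Unix piping) is to run the pattern once per word with -t and count the output lines:

tgrep2 -t "NP < (PP <1 (IN < under))" | wc -l
tgrep2 -t "NP < (PP <1 (IN < beneath))" | wc -l

With -t, each match should come out as a single line of words, so wc -l gives you a match count.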

I’ll wrap up with a few other tips and two examples I got from Hal Tily:

tgrep2 "VP < (/^VB/ << /^load/) < (/^PP/ <1 (IN < with))"
tgrep2 "* < (* < either) < (* < or)"

The first one here searches for VPs that have the verb “load” using a “with” PP. The second one finds all structures with an “either…or” construction.

  • Notice that the tag matched in the load-example is /^VB/, that is, it starts with “VB” but doesn’t have to stop there–there are various VB flavors (VBD, VBG, VBN, and so on), so this will match all of them.
  • Similarly, we want to match “load”, “loading”, “loads”, etc. so we search just for /^load/ NOT for /^load$/.
  • In the either-or example, we use asterisks as a “wildcard” to match anything. (This is an update from TGrep, which used “__”; that still works in TGrep2, but asterisks are a more familiar wildcard to most people.)
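
So if you come across old TGrep recipes, the double-underscore version of the same query should still run under TGrep2:

tgrep2 "__ < (__ < either) < (__ < or)"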

Other thoughts:

  • If you mark a node with \’, then what you’ll print out is that node. That stops you from having to reverse a bunch of >> and << relationships in order to get your desired node on the far left of your query. The following query will let you get the adjectives that appear before nominal “jog”.
    • tgrep2 -t "NP < (/^N/ < jog) << \'JJ"
  • Using -a will get you multiple matches within each sentence instead of just the first.
  • Using -i will make matching case-insensitive
  • You can stack these up: so for a long-form tree that is case-insensitive and matches multiple occurrences per sentence, you’d do something like:
    • tgrep2 -ail "S << (/^he$/ . /^really$/)"
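
And once a query does what you want, you can always redirect its output into a file for later poking around (the filename here is just an illustration):

tgrep2 -ail "S << (/^he$/ . /^really$/)" > he_really_trees.txt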

There’s a lot more to be said about TGrep2, but this should give you a basic orientation. You can find the manual here: http://tedlab.mit.edu/~dr/Tgrep2/tgrep2.pdf

Top five LDC corpora

30 Oct

In this post, I’d like to start off reviewing some of the most popular corpora that the Linguistic Data Consortium provides–with a few possibilities for alternatives. If you have a favorite corpus, send it in!

1. TIMIT Acoustic-Phonetic Continuous Speech Corpus

If you’re interested in speech recognition, here’s one of your main resources. It’s basically 630 speakers (covering 8 American English dialect regions), each reading 10 “phonetically rich sentences”. Plus, the recordings are time-aligned with orthographic and phonetic transcripts. It’s been hand-verified and it’s pre-split into training/test subsets.

2. Web 1T 5-gram Version 1

This is basically Google n-gram stuff for English (unigrams to 5-grams). So if you want collocates and word frequencies, this is pretty good. There are 1 trillion word tokens, after all.

  • 95 billion sentences
  • 13 million unigrams
  • 1 billion 5-grams

This data was released in 2006, though, so there should be more up-to-date resources.

There’s also a 2010 (Mandarin) Chinese 5-gram web corpus: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2010T06

A 2009 Japanese 7-gram web corpus: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T08

And a 2009 “European” 5-gram corpus covering Czech, Dutch, French, German, Italian, Polish, Portuguese, Romanian, Spanish, and Swedish: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T25

3. CELEX2 (but why not try SUBTLEX?)

This corpus, circa 1996, gives you ASCII versions of three lexical databases for English, Dutch, and German. You get:

  • orthography variations
  • phonological stuff like syllables and stress
  • morphology
  • word class, argument structures
  • word frequency, lemma frequency (based on “recent and representative text corpora”).

In truth, if you just want word counts for American English, then consider using SUBTLEXus: http://subtlexus.lexique.org/. They make the case that CELEX is actually a poor source of frequency information (I’ll let you follow the link for their arguments against it and against Kucera and Francis). Actually, if you go ahead and check out http://elexicon.wustl.edu/, you can download words (and non-words) with reaction times and all the morphology/phonology/syntax stuff that CELEX2 gives you.

4. TIDIGITS

Okay, I had never heard of this one. The main use for this corpus is speech recognition–for digits. You get 111 men, 114 women, 50 boys, and 51 girls, each pronouncing 77 different digit sequences, recorded in 1982.

5. ECI Multilingual Text

So the European Corpus Initiative Multilingual Corpus 1 (ECI/MCI) has 46 subcorpora totaling 92 million words (marked up, but you can get the non-marked-up stuff, too).

Twelve of the component corpora are parallel corpora, with translations across 2-9 other languages.

Most of the stuff is journalistic, and there are some dictionaries, literature, and international organization publications/proceedings/reports. The material seems to come mostly from the 1980s and early 1990s.

Anyone have a favorite corpus of UN delegates talking and being translated into a bunch of different languages?

Languages available: Albanian, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, French, Gaelic, German, Italian, Japanese, Latin, Lithuanian, Mandarin Chinese, Modern Greek, Northern Uzbek, Norwegian, Norwegian Bokmaal, Norwegian Nynorsk, Portuguese, Russian, Serbian, Slovenian, Spanish, Standard Malay, Swedish, Turkish

Getting started with Stanford corpora

29 Oct

(Much of this blog is general-purpose information, but this post is pretty specific to people at Stanford.)

To get started with our corpora, please email the corpus TA (that’s me–tylers at stanford). What you need to do depends a bit on the corpora you want to use–here are the instructions on how to get approved for access.

Now, let’s say you have approval. A number of our corpora are stored on Stanford servers, which means round-the-clock access (other corpora involve checking out CDs). We’re going to be overhauling what’s stored on the servers, btw, so if you have any requests, let me know.

How to connect to AFS and the online corpora

  1. You’ll need to be able to connect to the Stanford servers, so download “terminal emulation” software. Stanford recommends Secure CRT for Windows or LelandSSH for the Mac.
  2. Once you’ve got a terminal emulation program, use it to connect to cardinal.stanford.edu, corn.stanford.edu, or spires.stanford.edu (using “ssh”).
  3. You can find our corpora by changing to this directory (cd = change directory):
    • cd /afs/ir/data/linguistic-data/
  4. “ls” will list the contents of the directory and you can jump into interesting subdirectories by using “cd”. If this is feeling unfamiliar to you, you probably want to ask me or one of your geekier friends for some help.
  5. Readme files give useful information. To read one of them (or any other text file), try this command, and use the space bar to get to the next page (a full session is sketched right after this list).
    • less readme.txt
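
Putting steps 2-5 together, a session looks roughly like this (substitute your own SUNetID; the exact readme filename varies from directory to directory):

ssh yourSUNetID@corn.stanford.edu
cd /afs/ir/data/linguistic-data
ls
less readme.txt

In less, the space bar pages forward and q quits.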

Adding TGrep2 to your path

When you add something to your “path”, it means that you don’t have to type as much later on. You’ll want to do this if you have any desire to use the syntactically parsed portions of, say, the Wall Street Journal or Switchboard.

  1. To add stuff to your account so you can use tgrep2 on the Wall Street Journal wherever you are, type:
    • cat >>~/.bashrc
    • export PATH=$PATH:/afs/ir/data/linguistic-data/bin/linux_2_4
    • export TGREP2_CORPUS=/afs/ir/data/linguistic-data/Treebank/tgrep2able/wsj_mrg.t2c.gz
  2. Note that if you prefer, you can make Switchboard your default. Instead of “wsj_mrg.t2c.gz”, type “swbd.t2c.gz” above.
  3. Press Ctrl+D to finish. Then log out and log back in, because your path won’t change until you do (or see the note after this list for a way around the re-login).
  4. Note that you can always call the OTHER corpus in TGrep2 by using a command like this:
    • tgrep2 -c /afs/ir/data/linguistic-data/Treebank/tgrep2able/EITHER-'wsj'-OR-'swbd'.t2c.gz
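
As a final sketch of how this fits together: after the cat >> step above, the tail of your ~/.bashrc should contain those two export lines. If you don’t want to log out and back in, sourcing the file refreshes your current session, and then you can query the default corpus or point -c at the other one (assuming I’m reading the -c switch right, the search pattern just follows the corpus file):

source ~/.bashrc
echo $TGREP2_CORPUS
tgrep2 "S << (/^he$/ . /^really$/)"
tgrep2 -c /afs/ir/data/linguistic-data/Treebank/tgrep2able/swbd.t2c.gz "S << (/^he$/ . /^really$/)"

The echo line just confirms that the default corpus variable got picked up.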