Getting started with Stanford corpora

29 Oct

(Much of this blog is general-purpose information, but this post is pretty specific to people at Stanford.)

To get started with our corpora, please email the corpus TA (that’s me–tylers at stanford). What you need to do depends a bit on the corpora you want to use–here are the instructions on how to get approved for access.

Now, let’s say you have approval. A number of our corpora are stored on Stanford servers, which means round-the-clock access (other corpora involve checking out CDs). We’re going to be overhauling what’s stored on the servers, btw, so if you have any requests, let me know.

How to connect to AFS and the online corpora

  1. You’ll need to be able to connect to the Stanford servers, so download “terminal emulation” software. Stanford recommends SecureCRT for Windows or LelandSSH for the Mac.
  2. Once you’ve got a terminal emulation program, use it to connect to one of the Stanford servers (using “ssh”).
  3. You can find our corpora by changing to this directory (cd = change directory):
    • cd /afs/ir/data/linguistic-data/.
  4. “ls” will list the contents of the directory and you can jump into interesting subdirectories by using “cd”. If this is feeling unfamiliar to you, you probably want to ask me or one of your geekier friends for some help.
  5. Readme files give useful information. To read one of them (or anything else), try this command and use the space bar to get to the next page:
    • less readme.txt
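
If you find yourself repeating steps 3 to 5, they can be wrapped up in a small shell function. This is only a sketch: `browse_corpora` and `CORPUS_ROOT` are made-up names (not part of any Stanford setup), and it assumes you are already logged in to a machine that can see the AFS directory.

```shell
# Hypothetical helper: list a corpus directory and point out its readme files.
CORPUS_ROOT="/afs/ir/data/linguistic-data"

browse_corpora() {
    # Use the directory given as the first argument, or the corpus root.
    dir="${1:-$CORPUS_ROOT}"
    if [ ! -d "$dir" ]; then
        echo "Cannot find $dir (are you logged in to a Stanford server?)" >&2
        return 1
    fi
    # Same as "cd" followed by "ls", but run in a subshell so that
    # your own working directory does not change.
    ( cd "$dir" && ls )
    # Mention any readme files worth opening with "less".
    for f in "$dir"/[Rr][Ee][Aa][Dd][Mm][Ee]*; do
        [ -e "$f" ] && echo "readme: $f"
    done
    return 0
}
```

Typing `browse_corpora` on its own looks at the top-level directory; `browse_corpora /afs/ir/data/linguistic-data/Treebank` looks inside a subdirectory. You would still open any readme it reports with `less`.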

Adding TGrep2 to your path

When you add something to your “path”, it means that you don’t have to type as much later on. You’ll want to do this if you have any desire to use the syntactically parsed portions of, say, the Wall Street Journal or Switchboard.
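
To see what being on the path means in practice, you can ask the shell whether it can find a command by name. The function below is a small sketch; `on_path` is a made-up name, and the only thing it relies on is the standard `command -v` lookup.

```shell
# Report whether the shell can find a command by searching $PATH.
on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 is not on your path yet"
        return 1
    fi
}
```

For example, `on_path tgrep2` should tell you whether the setup in this section has taken effect.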

  1. To add stuff to your account so you can use tgrep2 on the Wall Street Journal wherever you are, type:
    • cat >>~/.bashrc
    • export PATH=$PATH:/afs/ir/data/linguistic-data/bin/linux_2_4
    • export TGREP2_CORPUS=/afs/ir/data/linguistic-data/Treebank/tgrep2able/wsj_mrg.t2c.gz
  2. Note that if you prefer, you can make Switchboard your default. Instead of “wsj_mrg.t2c.gz”, type “swbd.t2c.gz” in the TGREP2_CORPUS line above.
  3. Press Ctrl+D to finish typing. Then log out and log back in, because your path won’t change until you do.
  4. Note that you can always call the OTHER corpus in TGrep2 by using a command like this:
    • tgrep2 -c /afs/ir/data/linguistic-data/Treebank/tgrep2able/EITHER-'wsj_mrg'-OR-'swbd'.t2c.gz
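
One catch with the `cat >>` approach is that running it twice adds the same lines twice. Here is a sketch of a more cautious version of the same setup. The function name `setup_tgrep2` and its two optional arguments are my own invention; the paths are the ones given in the steps above.

```shell
# Hypothetical helper: append the TGrep2 settings to a startup file,
# skipping the step if they are already there.
setup_tgrep2() {
    rcfile="${1:-$HOME/.bashrc}"   # file to edit
    corpus="${2:-wsj_mrg}"         # "wsj_mrg" or "swbd"
    base="/afs/ir/data/linguistic-data"

    if [ -f "$rcfile" ] && grep -q "TGREP2_CORPUS" "$rcfile"; then
        echo "TGrep2 settings already present in $rcfile"
        return 0
    fi

    # Single quotes keep $PATH literal, so it is expanded at login time
    # rather than at the moment this function runs.
    {
        echo 'export PATH=$PATH:'"$base/bin/linux_2_4"
        echo "export TGREP2_CORPUS=$base/Treebank/tgrep2able/$corpus.t2c.gz"
    } >> "$rcfile"
    echo "Added TGrep2 settings to $rcfile; log out and back in to use them"
}
```

Calling `setup_tgrep2` with no arguments makes the Wall Street Journal the default; `setup_tgrep2 ~/.bashrc swbd` picks Switchboard instead. Either way you still need to log out and log back in afterwards.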
