“Grep” is a way of searching for strings in files, so it’s a pretty basic tool for your linguistics toolbox. For example, if you’re at a (Unix) command prompt and type:
grep -wi "word" file.txt
you’ll get back all the lines in file.txt that contain “word”.
- The -wi means that grep will only search for whole words (that’s the w) and will be insensitive to case (so it’ll get word, Word, even wOrD).
- Note that whether you need single quotes, double quotes, or no quotes depends on your shell and on which special characters are in your search pattern.
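If it helps to see what the -w and -i switches are doing, here is a minimal Python sketch of the same idea. It is not how grep is implemented: the function name grep_wi and the sample lines are made up for illustration, and \b is only an approximation of grep's notion of a whole word.

```python
import re

def grep_wi(word, lines):
    """Return the lines that contain `word` as a whole word, ignoring case."""
    # \b stands in for grep's -w (whole word); re.IGNORECASE stands in for -i.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [line for line in lines if pattern.search(line)]

# Hypothetical file contents, one string per line.
sample = [
    "A word to the wise.",
    "Word order matters.",
    "That is a wOrD, too.",
    "Swordfish and keywords do not count.",
]

for line in grep_wi("word", sample):
    print(line)
```

The last sample line is skipped because “word” only shows up there inside longer words.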
If you’d like me to cover grep more in future posts, let me know. Most of the time I get questions about TGrep2, not grep, since there are oodles of grep tutorials all over the web. For example, this one geared for linguists: http://arts.anu.edu.au/linguistics/misc/comp_resources/grep.html.
TGrep2 has the “grep” morpheme at its heart. The T is for “trees” because TGrep/TGrep2 search through syntactic trees to find the ones that match a given syntactic structure. Mainly it’s used on the Penn Treebank data: Wall Street Journal stuff, the Brown Corpus, ATIS, and, maybe the most commonly used, the Switchboard corpus.
The hardest part about using a parsed corpus is figuring out the trees. So start off by getting examples of very simple structures similar to what you want. From the WSJ, I search for “he really”.
If I do something like this:
tgrep2 "/^he$/ . /^really$/"
Then I’ll get dumb output, because all it’s going to return to you is the FIRST part of what you tgrep2 (here, just the “he”). We’ll see some workarounds for this, but for the moment, the point is that we need to have an S (for Sentence) in front.
To keep “he really” together, put parentheses around them. And relate them to the S by saying “I want sentences that dominate ‘he really’”.
tgrep2 -l "S << (/^he$/ . /^really$/)"
That will get you things like:
(S (`` ``)
(S (SBAR-NOM-SBJ (WHNP-2 (WP What))
(S (NP-SBJ-1 (PRP he))
(ADVP (RB really))
(VP (VBD wanted)
(S (NP-SBJ (-NONE- *-1))
(VP (TO to)
(VP (VB know)
(NP (-NONE- *T*-2))))))))
(VP (VBD was)
(PP-PRD (IN about)
(NP (DT a)
(S (NP-SBJ (PRP you))
(VP (VBD did)
(VP (VB know)
(NP (DT that)))))
If you wanted sentences that IMMEDIATELY dominated “he really”, then you’d just use one “<” instead of “<<”.
But let’s look again at the query we just ran:
tgrep2 -l "S << (/^he$/ . /^really$/)"
What are each of the pieces doing?
- tgrep2: calls the program. Your shell does need to know where to find it, so hopefully you’ve set up your path.
- -l: this “switch” is what makes the trees display in “long form”, with everything laid out with indents and whatnot. Alternatively, you could use -t to just print the words (no long line-after-line stuff, no POS tags, no parentheses). If you leave both of these off, you’ll get a hybrid: all the POS tags and parentheses but no indenting.
- We’ve talked about the S, the <<, and the parentheses (“I want sentences dominating something that’s in parentheses, which I want kept together”).
- The slashes are what give you a regular expression. In the case of /^he$/, we’re saying the match has to start at the beginning of the word (that’s the caret) and stop at the end of it (that’s the dollar sign), so only “he” itself qualifies. If you left off the dollar sign, you would also start getting matches like “hen” and “hedonism”.
- The dot says that you want “he” to immediately precede “really”. If you just want any kind of preceding (letting other words intervene), use two dots: ..
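The anchoring in those regular expressions is easy to try outside TGrep2. The sketch below uses Python’s re module on a made-up word list. TGrep2’s regex flavor isn’t identical to Python’s, but the caret and dollar sign behave the same way for simple patterns like these.

```python
import re

# Hypothetical word list, just for trying out the anchors.
words = ["he", "hen", "hedonism", "she", "the"]

# /^he$/ : anchored at both ends, so only the word "he" itself matches.
anchored = [w for w in words if re.search(r"^he$", w)]

# /^he/ : anchored at the start only, so anything beginning with "he" matches.
prefix_only = [w for w in words if re.search(r"^he", w)]

# /he/ : no anchors, so "he" anywhere inside the word is enough.
unanchored = [w for w in words if re.search(r"he", w)]

print(anchored)      # ['he']
print(prefix_only)   # ['he', 'hen', 'hedonism']
print(unanchored)    # ['he', 'hen', 'hedonism', 'she', 'the']
```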
Let’s take another example. Here, I’m curious about “under” vs. “beneath”.
tgrep2 "NP < (PP <1 (IN [ < beneath | < under]))"
So what’s happening?
- This will return NPs that immediately dominate a PP whose first child is an IN (preposition) tag dominating either “beneath” or “under”.
- You can probably see the immediate domination part (NP < PP).
- The “<1” means that I want an “IN” tag that is the first child of the “PP” tag. You can tweak the number to get the n-th child of something.
- The brackets and the | say that I want the IN to dominate either “beneath” or “under”
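Here’s a small Python sketch of what that query is checking, in case spelling it out helps. It’s a hand-written stand-in for this one pattern, not a general TGrep2 matcher: trees are written as (label, children) tuples, and the function name and the three toy NPs are made up for illustration.

```python
# Trees are (label, children) tuples; words are plain strings.
def np_with_under_or_beneath(node):
    """Rough equivalent of: NP < (PP <1 (IN [< beneath | < under]))"""
    label, children = node
    if label != "NP":
        return False
    for child in children:                       # NP < PP: PP is an immediate child
        if isinstance(child, str) or child[0] != "PP" or not child[1]:
            continue
        first = child[1][0]                      # <1: look only at the PP's first child
        if isinstance(first, str) or first[0] != "IN":
            continue
        if "beneath" in first[1] or "under" in first[1]:   # [< beneath | < under]
            return True
    return False

def np(*children):
    return ("NP", list(children))

the_cat = np(("DT", ["the"]), ("NN", ["cat"]))
the_table = np(("DT", ["the"]), ("NN", ["table"]))

# "the cat under the table": IN is the first child of the PP, so this matches.
under_np = np(the_cat, ("PP", [("IN", ["under"]), the_table]))
# "the cat on the table": wrong preposition, no match.
on_np = np(the_cat, ("PP", [("IN", ["on"]), the_table]))
# "the cat just under the table": IN is the PP's second child, so <1 rules it out.
late_in_np = np(the_cat, ("PP", [("ADVP", [("RB", ["just"])]), ("IN", ["under"]), the_table]))

for example in (under_np, on_np, late_in_np):
    print(np_with_under_or_beneath(example))
```

The third example is the interesting one: the preposition is right, but because it isn’t the first child of the PP, the <1 keeps it from matching.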
I’ll wrap up with a few other tips and two examples I got from Hal Tily:
tgrep2 "VP < (/^VB/ << /^load/) < (/^PP/ <1 (IN < with))"
tgrep2 "* < (* < either) < (* < or)"
The first one here searches for VPs that have the verb “load” and a “with” PP. The second one finds all structures with an “either…or” construction.
- Notice that the tag matched in the load-example is /^VB/, that is, it starts with “VB” but doesn’t have to stop there–there are various VB flavors, so this will match all of them.
- Similarly, we want to match “load”, “loading”, “loads”, etc. so we search just for /^load/ NOT for /^load$/.
- In the either-or example, we use asterisks as a “wildcard” to match anything. (This is an update from TGrep, which used “__”. That still works in TGrep2, but asterisks are a more familiar wildcard to most people.)
- If you mark a node with \', then what you’ll print out is that node. That saves you from having to reverse a bunch of >> and << relationships in order to get your desired node on the far left of your query. The following query will get you the adjectives inside NPs that contain nominal “jog”.
tgrep2 -t "NP < (/^N/ < jog) << \'JJ"
- Using -a will get you multiple matches within each sentence instead of just the first.
- Using -i will make matching case-insensitive.
- You can stack these up: so for a long-form tree that is case-insensitive and matches multiple occurrences per sentence, you’d do something like:
tgrep2 -ail "S << (/^he$/ . /^really$/)"
There’s a lot more to be said about TGrep2, but this should give you a basic orientation. You can find the manual here: http://tedlab.mit.edu/~dr/Tgrep2/tgrep2.pdf