Dealing with knockouts in R (ditching Goldvarb)

14 Nov

Once you’ve got your corpora all situated and annotated, the next part is analysis.

Robin Melnick offers this guest blog post for folks learning R, especially if you’re a sociolinguist moving off of Goldvarb (if this is the case, check out Daniel Johnson’s website for other resources: http://danielezrajohnson.com/index.html).

Heeeere’s Robin:

———

I wrote up these instructions for a sociolinguistics colleague at another institution who’s in the throes of moving her life from Goldvarb to R. Pretty straightforward stuff for you R veterans but possibly useful for anyone less experienced in regression analysis.

One nice feature of Goldvarb is that it automatically identifies “knockouts,” i.e., factor levels for which one value of the dependent variable never occurs (an empty cell in the cross-tabulation). R, on the other hand, lets you proceed with regression even when you have such an invariant factor level. This typically results in a fixed effect with a hugely inflated standard error and estimated beta coefficient, so you really want to remove these before you fit your model.
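
To see the problem concretely, here is a minimal sketch. It rebuilds token-level data for three of the auxiliary types from the counts in the table further down in this post (‘d’ and ‘l’, which vary, plus the knockout ‘z’) and fits a plain logistic regression with glm(). The names `counts`, `toy`, and `inverted` are made up for the illustration, and the model is an assumption; the original analysis may have been set up differently.

```r
# Counts for three aux types, copied from the table in this post.
counts <- data.frame(
  aux = c("d", "l", "z"),
  y   = c(9, 27, 0),      # inverted tokens
  n   = c(623, 87, 60)    # non-inverted tokens
)

# Expand the counts into one row per token (hypothetical stand-in for 'ba').
toy <- data.frame(
  aux      = factor(rep(rep(counts$aux, 2), times = c(counts$y, counts$n))),
  inverted = rep(c(1, 0), times = c(sum(counts$y), sum(counts$n)))
)

# Plain logistic regression with 'd' as the reference level.
fit <- glm(inverted ~ aux, data = toy, family = binomial)

# Compare the estimate and standard error for 'l' (varies) and 'z' (knockout).
est <- coef(summary(fit))[, c("Estimate", "Std. Error")]
print(round(est, 2))
```

The row for ‘z’ should stand out: its estimate is far larger in magnitude than the one for ‘l’, and its standard error is larger by orders of magnitude, which is the symptom described above.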

The good news is that it’s straightforward and easily done.

I’ll illustrate with some of John Rickford’s Bajan (Barbadian Creole) data, in particular where we were looking at question formation and the factors constraining (predicting) inversion.

My dataframe is ‘ba’.

My dependent variable is ‘var’ (whether or not the given question token is inverted), with values ‘y’ and ‘n’.

For each predictor (independent variable) I want to see if there are any structural zeros. To do this we generate the table that Goldvarb does for you automatically. Let’s illustrate with auxiliary type, ‘aux’.

> table(ba$aux,ba$var)
       y   n
  b   0  11
  d   9 623
  g   0   5
  l  27  87
  m   4  90
  x   0   5
  z   0  60
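
If you also want row percentages next to the counts, something along these lines could help. This is only a sketch: it rebuilds the table from the counts printed above so that it runs without the corpus, and the names `aux.table` and `aux.pct` are just for illustration. With the real data you would presumably pass table(ba$aux,ba$var) to prop.table() directly.

```r
# Counts copied from the aux-by-inversion table above.
aux.table <- as.table(matrix(
  c(0, 9, 0, 27, 4, 0, 0,          # 'y' column
    11, 623, 5, 87, 90, 5, 60),    # 'n' column
  ncol = 2,
  dimnames = list(aux = c("b", "d", "g", "l", "m", "x", "z"),
                  var = c("y", "n"))
))

# Row percentages: share of inverted vs. non-inverted tokens per aux type.
aux.pct <- round(100 * prop.table(aux.table, margin = 1), 1)
print(aux.pct)
```

A knockout shows up as a row with 0 in one column and 100 in the other.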

Just like Goldvarb we visually inspect the table to see which aux types we want to remove/keep. We have four “knockouts” (here, all in favor of non-inversion), leaving three with variation: ‘d’, ‘l’, and ‘m’ (these represent do-support, copula be, and modals). To keep only tokens with these:

> ba = ba[ba$aux%in%c('d','l','m'),]

This says to replace ba with itself, keeping just those rows for which aux is among the three factor levels we want. To see that it worked, let’s look at the table again:

> table(ba$aux,ba$var)
      y   n
  b   0   0
  d   9 623
  g   0   0
  l  27  87
  m   4  90
  x   0   0
  z   0   0

We can see that tokens corresponding to the knockout levels have been removed. There is, however, one last step: In R, we need to actually tell it to entirely remove the levels from the factor, not just the corresponding tokens. We can now do that with:

> ba$aux = ba$aux[drop=TRUE]

This tells R to remove from the factor any levels for which there are no corresponding tokens left. A final view of the table confirms we have what we want:

> table(ba$aux,ba$var)
      y   n
  d   9 623
  l  27  87
  m   4  90
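
A possible alternative for the level-dropping step is base R’s droplevels(), which removes unused levels from a single factor or from every factor column of a data frame at once. The sketch below uses a tiny made-up data frame (`toy`, not the Bajan data) so that it runs on its own.

```r
# Toy stand-in for 'ba' (hypothetical data, not the Bajan corpus).
toy <- data.frame(
  aux = factor(c("b", "d", "d", "l", "l", "m", "m", "z")),
  var = factor(c("n", "y", "n", "y", "n", "y", "n", "n"))
)

# Keep only the tokens whose aux type shows variation.
toy <- toy[toy$aux %in% c("d", "l", "m"), ]
print(levels(toy$aux))   # the emptied levels 'b' and 'z' are still listed

# Drop unused levels from every factor column in one go.
toy <- droplevels(toy)
print(levels(toy$aux))   # only 'd', 'l', 'm' remain
```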

Repeat this procedure for each of your factors. The approach above closely parallels what you do with Goldvarb – visual inspection, then manual encoding of what to remove. We can, however, write R code that does all of the above automatically:

> aux.table = table(ba$aux,ba$var)
> aux.keep  = aux.table[(aux.table[,1]>0) & (aux.table[,2]>0), , drop=FALSE]
> ba        = ba[ba$aux%in%rownames(aux.keep),]
> ba$aux    = ba$aux[drop=TRUE]

The first line generates the table as before, but stores it (as aux.table). The second line creates a second table (aux.keep) from the first, keeping only those rows where both columns are non-zero (i.e., not a knockout); the drop=FALSE keeps the result a table even if just one level survives, so that it still has row names. The names of the rows in this reduced table are the names of the levels that we want to keep. The third line keeps only those rows in the dataframe for which aux is among those row names (rownames(aux.keep)). The fourth line is as before, fully removing the levels for which no tokens remain. This does exactly what we did before, just without the visual inspection of the table. You’d still repeat this for each IV.
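
To avoid retyping those four lines for every IV, they could be wrapped in a small function. The following is only a sketch under some assumptions: `drop.knockouts` is a hypothetical helper (not part of R or any package), the toy data frame is invented, and the column names `var`, `aux`, and `subj` are placeholders.

```r
# Hypothetical helper: remove knockout levels, one predictor at a time.
drop.knockouts <- function(data, dv, ivs) {
  for (iv in ivs) {
    tab  <- table(data[[iv]], data[[dv]])
    # Keep only levels with at least one token in every DV column.
    keep <- rownames(tab)[apply(tab > 0, 1, all)]
    data <- data[data[[iv]] %in% keep, , drop = FALSE]
    # Rebuild the factor so the emptied levels disappear.
    data[[iv]] <- factor(data[[iv]])
  }
  data
}

# Invented toy data: aux types 'z' and 'b' are knockouts (all 'n').
toy <- data.frame(
  var  = c("y", "n", "n", "n", "y", "n", "y", "n", "n", "n"),
  aux  = c("d", "d", "l", "l", "l", "m", "m", "z", "z", "b"),
  subj = c("pro", "np", "pro", "np", "np", "pro", "pro", "np", "pro", "np")
)

clean <- drop.knockouts(toy, dv = "var", ivs = c("aux", "subj"))
print(table(clean$aux, clean$var))
print(table(clean$subj, clean$var))
```

One caveat with this design: the predictors are handled in order, so removing tokens for a later predictor can in principle create a new knockout in an earlier one. It is worth looking at the final tables, or running the function a second time, before fitting the model.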

———————–

Any questions? Add a comment or write to Robin directly at rmelnick at stanford (you know the rest).

2 Responses to “Dealing with knockouts in R (ditching Goldvarb)”

  1. Tal Linzen November 14, 2011 at 9:02 pm #

    Thanks for the post, Robin. Can you explain why knockouts result in a fixed effect with a falsely huge error and estimated beta coefficient?

    • Robin Melnick November 17, 2011 at 11:11 pm #

      Certainly we want to distinguish here between the statistics and the science.

      A “knockout” again is the Goldvarb term for when within your sample all tokens representing a certain level of an independent variable (IV) behave invariantly, meaning they’re all associated with just one value of the dependent variable (DV). Noting these is certainly important to the science involved: it says that, within your sample, your DV is completely predictable from this single IV for this level.

      Just on the statistics, the problem is that logistic regression is designed to fit a model of variation and can’t (or doesn’t well) handle such invariability. For each level of each IV, the regression algorithm seeks a coefficient that does the best job of predicting (fitting) each token in the data in combination with the coefficients for other levels/factors. The effect of a “knockout” level is that whenever such a value/level is encountered, the DV suddenly becomes invariable, and the values for all other factors have no effect on predicting the DV for such a token. The only way the regression algorithm can handle this is to make the coefficient for such a factor effectively infinite, so much larger in magnitude than everything else that it will always “win out.” This in turn has unpredictable effects on the fitting of other factors.

      Best,
      Robin
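
The “effectively infinite” point can be checked numerically. As a rough sketch (the 60 tokens match the ‘z’ row in the post; the variable names and the chosen log-odds values are arbitrary), the log-likelihood contributed by a level whose tokens are all non-inverted keeps improving as the log-odds assigned to that level get more negative, so there is no finite best value:

```r
# Log-likelihood of 60 tokens that are all non-inverted, as a function
# of the log-odds of inversion (eta) assigned to that level.
n.tokens <- 60
eta      <- c(-2, -5, -10, -20)

# log P(non-inverted) = log(1 - plogis(eta)), computed stably.
loglik <- n.tokens * plogis(eta, lower.tail = FALSE, log.p = TRUE)
print(data.frame(eta, loglik))
```

Each step toward minus infinity moves the log-likelihood closer to its ceiling of zero without ever reaching it, which is why the fitting routine keeps pushing the coefficient outward until it gives up.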
