QUANTITATIVE THINKING FOR CORPUS LINGUISTS
One of the huge advantages of using corpora is the frequency information that they offer us. Many corpus-based studies rely to some degree on frequencies in their argumentation, citing, for example, figures for the frequency of a particular lexical expression or grammatical pattern in a particular functional, social, or regional variety of a language, or for the frequency with which two words occur together.
Imagine that you are a corpus linguist investigating the use of the modals will and shall + INFINITIVE (I will learn statistics vs. I shall learn statistics) as means of expressing futurity in different varieties of English. Say you have found that will+INFINITIVE occurs 2,316 times in the LOB corpus of British English and 1,974 times in the KOLHAPUR corpus of Indian English, while shall+INFINITIVE occurs 363 times in each of the two corpora, i.e. that the distribution is as follows:
Corpus      will+INFINITIVE   shall+INFINITIVE
LOB         2,316             363
KOLHAPUR    1,974             363
The first question you are now faced with is how you can find out whether this distribution of modal auxiliaries is potentially important, i.e. whether, mathematically speaking, it is statistically significant. A distribution is said to be statistically significant if the likelihood that it has come about accidentally is below a certain level (more about this level later). Note that the question of whether a distribution is statistically significant is quite different from the question of whether you can say something interesting about it. “Statistical significance” refers exclusively to the question of whether the distribution could have come about by chance -- it says nothing at all about whether the distribution is linguistically significant (i.e., whether it tells us something about the way in which language works). However, if the distribution is not statistically significant, then the issue of linguistic interest does not even arise in the first place. Thus, statistical significance is a precondition for linguistic significance, but not a guarantee.
The Chi-Square Test
If we want to determine whether a distribution could have come about by chance, the first thing we must do is calculate what exactly the chance distribution would be. In other words, we have to know what distribution we would have expected based on chance. Once we know these expected frequencies, we can compare them to our observed frequencies to see whether the latter are different enough for us to say that they cannot have come about by accident. So let us see how we can calculate the expected frequencies for the distribution of the two modals discussed earlier.
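The calculation can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the downloadable spreadsheet, and the function names expected_frequencies and chi_square are made up for this example. Each expected frequency is the row total times the column total divided by the grand total, and each cell contributes (observed - expected) squared, divided by expected, to the Chi-Square value.

```python
# Observed frequencies from the text: rows are corpora, columns are modals.
observed = [
    [2316, 363],  # LOB: will, shall
    [1974, 363],  # KOLHAPUR: will, shall
]

def expected_frequencies(table):
    """Return the chance distribution: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square(table):
    """Sum (observed - expected)^2 / expected over all cells of the table."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

expected = expected_frequencies(observed)
for name, row in zip(["LOB", "KOLHAPUR"], expected):
    print(name, [round(e, 2) for e in row])
print("Chi-Square:", round(chi_square(observed), 2))
```

Run as written, this gives expected frequencies of 2291.25 (will) and 387.75 (shall) for LOB and 1998.75 and 338.25 for KOLHAPUR, and a Chi-Square value of about 3.96. Rounding each cell's contribution to two decimals before adding them up (0.27 + 1.58 + 0.31 + 1.81) gives 3.97, which is presumably how the figure discussed in the text was obtained.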
But wait! What does this value mean? Is 3.97 good or bad? Is it a big enough difference? A real statistics package (and the spreadsheet you can download below) will convert this number into a probability of error for you. This is the level we mentioned above -- way back before you knew how to do a Chi-Square test -- i.e. the level at which you are prepared to state with certainty that you are or are not dealing with a chance distribution. In statistics, it is customary to expect a probability of error of five per cent or lower in order to say that something is significant; in other words, a distribution is statistically significant if the likelihood that it has come about by chance is below five per cent. If the likelihood is below one per cent, we say that the distribution is very significant, and if it is below 0.1 per cent, we say that it is highly significant. Incidentally, in statistics, percentages are written as decimal numbers, so five per cent is written as 0.05, one per cent is written as 0.01, and you can work out for yourself what 0.1 per cent is written as.
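As a rough illustration of what such a package does, the conversion can be done by hand for the simplest case. The sketch below assumes a 2-by-2 table like ours, which has one degree of freedom (a notion explained in the next paragraph); in that case the probability of error equals erfc(sqrt(X2 / 2)). It is not valid for larger tables, and the function names are invented for this example.

```python
import math

def probability_of_error_df1(chi_square_value):
    """Probability of error for ONE degree of freedom (2-by-2 tables only)."""
    return math.erfc(math.sqrt(chi_square_value / 2))

def significance_label(p):
    """Apply the conventional thresholds described in the text."""
    if p < 0.001:
        return "highly significant"
    if p < 0.01:
        return "very significant"
    if p < 0.05:
        return "significant"
    return "not significant"

p = probability_of_error_df1(3.97)
print(round(p, 3), significance_label(p))
```

For our value of 3.97 this yields a probability of error of about 0.046, i.e. just under the five-per-cent threshold.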
So how do we look up the probability of error in a table such as this? Well, we determine the degrees of freedom (df) for our distribution, then we find the appropriate line and check whether our Chi-Square value is bigger than one of the numbers given in that line. If so, we check the column heading, and we finally know at what level our distribution is significant. Oh no, you moan, what is a “degree of freedom”? Haven't we come across enough strange concepts already? Unfortunately, we haven't, but degrees of freedom are easy enough to understand: the df of a table is the number of rows (without totals) minus one, multiplied by the number of columns (without totals) minus one. In our case, this is (2-1) * (2-1), which is 1 * 1 = 1. So we go to the first line of Table 5, and start checking: 3.97 is bigger than 3.84, and it is not bigger than any of the other figures in this line. The value 3.84 appears in the column headed 5%, so our distribution is significant at the five-per-cent level. In statistical parlance, p<0.05. If we were to report this result in a paper, we would give the Chi-Square value and the degrees of freedom too -- we might say something like “This paper has shown that the modal shall is used significantly more frequently in Indian English than in British English (X2=3.97 (df=1), p<0.05)” (the X is meant to represent the Greek letter Chi, as shown in the formula below). And now, we are finally through, except for the...
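Before moving on to the exercises, here is the look-up procedure sketched in Python. The critical values are the standard textbook figures for the 5%, 1% and 0.1% levels, rounded to two decimals; only the 3.84 for one degree of freedom is quoted in the text itself, so the rest should be read as an assumption about what the table contains. The names CRITICAL_VALUES, degrees_of_freedom and look_up_significance are invented for this illustration.

```python
# Critical Chi-Square values per degree of freedom for p = 0.05, 0.01, 0.001.
# Standard reference values (assumed, not copied from the table in the text).
CRITICAL_VALUES = {
    1: [(0.05, 3.84), (0.01, 6.63), (0.001, 10.83)],   # 2-by-2 tables
    2: [(0.05, 5.99), (0.01, 9.21), (0.001, 13.82)],   # 2-by-3 tables
    4: [(0.05, 9.49), (0.01, 13.28), (0.001, 18.47)],  # 3-by-3 tables
}

def degrees_of_freedom(n_rows, n_cols):
    """(rows - 1) * (columns - 1), not counting total rows or columns."""
    return (n_rows - 1) * (n_cols - 1)

def look_up_significance(chi_square_value, df):
    """Return the strictest level whose critical value is exceeded, or None."""
    best = None
    for level, critical in CRITICAL_VALUES[df]:
        if chi_square_value > critical:
            best = level
    return best

df = degrees_of_freedom(2, 2)
level = look_up_significance(3.97, df)
print(f"df={df}, p<{level}" if level else f"df={df}, not significant")
```

For the will/shall distribution this prints df=1, p<0.05, matching the result reported above.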
Exercises
For each of the following distributions, calculate the Chi-Square values and check whether they are significant.
Note: You can use the tables in the text above to perform the calculations for exercises 1 to 5, but it may be easier to use this summary. When you are secure in your use of the procedure, you may switch to this .xls spreadsheet, which contains a 2-by-2 test, as well as a 2-by-3 test and a 3-by-3 test, which you need for some of the later exercises. All solutions can be found in the answer sheet (PDF, 116KB).
1. The LOB corpus (British English) contains 2,316 cases of will+INFINITIVE and 505 cases of the contracted form 'll+INFINITIVE; the BROWN corpus (American English) contains 2,237 cases of the full form and 441 cases of the contracted form. Does this distribution differ from the expected one, and if so, is the difference significant?
2. The KOLHAPUR corpus (Indian English) contains 1,974 cases of will+INFINITIVE and 230 cases of the contracted form 'll+INFINITIVE. Contrast this with the British English data from Exercise 1. Does the distribution differ from the expected one, and if so, is the difference significant?
3. The LOB corpus contains 3,184 cases of will/'ll/shall+INFINITIVE and 172 cases of going-to+INFINITIVE (as in I am going to learn statistics). The BROWN corpus contains 2,945 cases of will/'ll/shall+INFINITIVE and 144 cases of going-to+INFINITIVE. Do Americans use the going-to-future more frequently than British speakers?
4. The KOLHAPUR corpus contains 2,567 cases of will/'ll/shall+INFINITIVE and 75 cases of going-to+INFINITIVE. Do Indian English speakers use the going-to-future more or less frequently than British speakers?
5. Now compare all three varieties with respect to the will/'ll/shall-future and the going-to-future. You will have to use the downloadable .xls file for this and some of the following exercises (open the file and select the appropriate spreadsheet (2-by-2, 2-by-3, or 3-by-3) by clicking on the tabs at the bottom of the page).
6. Let's look at something other than the future tense for a moment. Gries and Stefanowitsch find that the verb give occurs 461 times with the ditransitive construction (Billy gave Diane a present) and 146 times with the to-dative. The ditransitive occurs with other verbs 574 times and the to-dative 1,773 times. Calculate the expected frequencies for give in the two constructions and determine whether the distribution is significant.
7. Stubbs, arguing that the adjective small is a neutral description of size while the adjective little has a connotation of “cuteness”, cites the following co-occurrence frequencies from a 200-million-word corpus: little girl(s) 3,100, little boy(s) 2,000, small girl(s) 100, small boy(s) 440. (a) What are the expected frequencies for each adjective-noun combination, and is this 2-by-2 distribution significant? (b) Assume that in the same corpus, the combination little child(ren) occurs 1,000 times and small child(ren) occurs 2,250 times. Repeat your calculations for this 2-by-3 distribution.
8. Back to the future (tense)! Test the following distribution using the 3-by-3 sheet in the .xls file: LOB will 2,316, shall 363, going-to 172; BROWN will 2,237, shall 267, going-to 144; KOLHAPUR will 1,974, shall 363, going-to 75.
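If you would rather check the larger tables in exercises 5, 7(b) and 8 with a script, something along the following lines should do. It is only a sketch of an equivalent calculation, not a description of how the .xls file works; chi_square_test is a hypothetical helper, and the two example tables are the will/shall distribution from the main text and a made-up 2-by-3 table, not exercise data.

```python
def chi_square_test(table):
    """Return (chi_square, df, expected) for a table of observed frequencies.

    `table` is a list of rows without totals; all rows must have equal length.
    """
    if len({len(row) for row in table}) != 1:
        raise ValueError("all rows must have the same number of columns")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected frequency of a cell: row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi_square = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(col_totals) - 1)
    return chi_square, df, expected

# Worked example from the text (2-by-2): LOB and KOLHAPUR, will vs. shall.
chi2, df, _ = chi_square_test([[2316, 363], [1974, 363]])
print(round(chi2, 2), df)

# A made-up 2-by-3 table whose rows are exactly proportional: observed and
# expected frequencies coincide, so the Chi-Square value is zero.
chi2_flat, df_flat, _ = chi_square_test([[10, 20, 30], [20, 40, 60]])
print(chi2_flat, df_flat)
```

To use it on an exercise, pass the observed frequencies as one list per row (one row per corpus, say) and compare the resulting value with the line of the table of critical values that corresponds to the returned df.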
The data used in the text and in the exercises is taken from:
Berglund, Ylva. 1997. Future in Present-day English: Corpus-based evidence on the rivalry of expressions. ICAME Journal 21: 7-19.
Gries, Stefan, and Anatol Stefanowitsch. 2004. Extending collostructional analysis: a corpus-based perspective on 'alternations'. International Journal of Corpus Linguistics 9.1: 97-129.
Stubbs, Michael. 2001. Words and phrases: corpus studies of lexical semantics. Oxford: Blackwell.
For those who are interested, the formula for calculating the Chi-Square value is as follows (r stands for “row”, c stands for “column”, O stands for “observed frequency”, and E stands for “expected frequency”):
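In LaTeX notation, and assuming the standard Pearson form of the statistic, this can be written as below. The second equation, which gives the expected frequency of a cell, is added for convenience; R_r (total of row r), C_c (total of column c) and N (grand total) are labels introduced here, not symbols from the text.

```latex
\chi^2 = \sum_{r} \sum_{c} \frac{(O_{rc} - E_{rc})^2}{E_{rc}},
\qquad
E_{rc} = \frac{R_r \, C_c}{N}
```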
© 2004 Anatol Stefanowitsch
Version 1.0 / Last Update October 2005