ANATOL STEFANOWITSCH
 

QUANTITATIVE THINKING FOR CORPUS LINGUISTS

One of the huge advantages of using corpora is the frequency information that they offer us. Many corpus-based studies rely to some degree on frequencies in their argumentation, citing, for example, figures for the frequency of a particular lexical expression or grammatical pattern in a particular functional, social, or regional variety of a language, or for the frequency with which two words occur together.
    Yet despite this reliance on frequencies, there is often a conspicuous absence of analytic statistics in corpus-based studies, i.e., an absence of methods that help researchers determine whether the frequencies they have observed are actually statistically significant. It is unclear why this should be so -- statistical software is more readily available than ever before. Although commercial statistics packages are relatively expensive, there are a number of freeware solutions that compare favorably even where advanced calculations are concerned, let alone for the kind of simple statistical procedures that corpus linguists are likely to need. In addition, spreadsheet applications like OpenOffice Calc and MS Excel, at least one of which most corpus linguists will have access to, have integrated a number of these procedures, and a variety of websites allow you to perform statistical procedures on-line.
    Thus, the most plausible explanation for the absence of analytic statistics in corpus linguistics is that many corpus linguists are unclear about the need for such procedures and are unaccustomed to quantitative thinking in general. Or, to put it more bluntly, that they are scared of statistics. But performing statistical calculations is nothing to be scared of -- it is a little dull perhaps, but a corpus linguist has got to do what a corpus linguist has got to do, so here we go.

Statistical Significance

Imagine that you are a corpus linguist investigating the use of the modals will and shall + INFINITIVE (I will learn statistics vs. I shall learn statistics) as means of expressing futurity in different varieties of English. Say you have found that will+INFINITIVE occurs 2,316 times in the LOB corpus of British English and 1,974 times in the KOLHAPUR corpus of Indian English, while shall+INFINITIVE occurs 363 times in each of the two corpora, i.e. that the distribution is as follows:

         LOB     KOLHAPUR
will     2316    1974
shall    363     363
Table 1: Distribution of Two Modals in Two Varieties of English

The first question you are now faced with is how you can find out whether this distribution of modal auxiliaries is potentially important, i.e. whether, mathematically speaking, it is statistically significant. A distribution is said to be statistically significant if the likelihood that it has come about accidentally is below a certain level (more about this level later). Note that the question of whether a distribution is statistically significant is quite different from the question of whether you can say something interesting about this distribution. “Statistical significance” refers exclusively to the question of whether the distribution could have come about by chance -- it says nothing at all about whether the distribution is linguistically significant (i.e., whether it tells us something about the way in which language works). However, if the distribution is not statistically significant, then the issue of linguistic interest does not even arise in the first place. Thus, statistical significance is a precondition for linguistic significance, but not a guarantee.
    So how do we determine whether the distribution in Table 1 (or any other distribution) is statistically significant? There are several methods, but the most famous and widespread one is probably the Chi-Square test (“chi” is pronounced ky, rhyming with fly or high). So let us go step by step through the calculations necessary to perform this test (once you have understood how it works, you can skip most of these steps by using the spreadsheet available for download at the bottom of this page, or by using one of the many on-line procedures or freeware applications available over the internet).

The Chi-Square Test

If we want to determine whether a distribution could have come about by chance, the first thing we must do is calculate what exactly the chance distribution would be. In other words, we have to know what distribution we would have expected based on chance. Once we know these expected frequencies, we can compare them to our observed frequencies to see whether the latter are different enough for us to say that they cannot have come about by accident. So let us see how we can calculate the expected frequencies for the distribution of the two modals discussed earlier.
    Consider Table 2, which repeats the frequencies shown in Table 1 above (don't click the “Calculate” button yet!):

         LOB     KOLHAPUR
will     2316    1974
shall    363     363
Table 2: Observed Frequencies


The first step consists in calculating the so-called marginal sums, i.e. the sums of each row and each column in the table, as well as the total sum of all four cells (a sum is the result of an addition, so we must add the values in the cells of each row, each column, and the whole table). If you click on the “Calculate” button now, the marginal sums will be calculated for you, but you might want to perform the calculations yourself (your computer should contain a calculator) to make sure you have understood exactly how the results are obtained. By the way, you could in principle enter your own numbers into Table 2, and all calculations here and below would be performed on these numbers, but this would make it difficult to follow our example, so it is best to stick with these numbers for now and play around later.
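For readers who prefer a script to a calculator, the marginal sums can be sketched in a few lines of Python. This is only an illustration of the step just described; the variable names (`observed`, `row_totals`, `col_totals`, `table_total`) are made up for this example and do not come from the spreadsheet.

```python
# Observed frequencies from Table 2 (rows: modals, columns: corpora).
observed = {
    "will":  {"LOB": 2316, "KOLHAPUR": 1974},
    "shall": {"LOB": 363,  "KOLHAPUR": 363},
}
corpora = ["LOB", "KOLHAPUR"]

# Row totals: how often each modal occurs across both corpora.
row_totals = {modal: sum(counts.values()) for modal, counts in observed.items()}

# Column totals: how many future-tense modals each corpus contributes.
col_totals = {c: sum(observed[m][c] for m in observed) for c in corpora}

# Table total: the sum of all four cells.
table_total = sum(row_totals.values())

print(row_totals)   # {'will': 4290, 'shall': 726}
print(col_totals)   # {'LOB': 2679, 'KOLHAPUR': 2337}
print(table_total)  # 5016
```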

These marginal frequencies now allow us, in a second step, to calculate the expected frequencies. The way in which this is done may sound strange to you at first, but is actually quite simple. For each cell, you simply multiply its column total by its row total, and divide the product by the overall table total. For example, for the top left cell in Table 2 -- the one containing the figure 2,316 -- you take the column total (2,679) and multiply it by the row total (4,290). This gives you the rather large figure 11,492,910. You then divide this figure by the table total (5,016), giving you the result 2,291.25. You enter this figure in a new table, and repeat this procedure for each of the remaining three cells. If you click on the “Calculate” button next to Table 3, these calculations will be performed for you. Again, you might want to repeat the calculations manually with your calculator to help you understand how it is done.
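The same rule (row total times column total, divided by the table total) can be written as a short Python sketch. The list layout and the names used here are assumptions made for illustration only.

```python
observed = [[2316, 1974],   # will:  LOB, KOLHAPUR
            [363, 363]]     # shall: LOB, KOLHAPUR

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

# Expected frequency of a cell = row total * column total / table total.
expected = [[r * c / table_total for c in col_totals] for r in row_totals]

for label, row in zip(["will", "shall"], expected):
    print(label, [round(value, 2) for value in row])
# will [2291.25, 1998.75]
# shall [387.75, 338.25]
```

The first printed value matches the 2,291.25 worked out by hand above; the other three are the remaining cells of the expected-frequency table.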

Table 3: Expected Frequencies

Note that the marginal frequencies are not shown in Table 3, since they are obviously the same as in Table 2 (calculate them if you don't believe it)! Now you know how expected frequencies are calculated, but you may not know why they can be calculated in this way. The logic behind this is as follows. The marginal sums tell you how many times each individual “condition” is met in your data: how many times will occurred in total, how many times shall occurred in total, how many of your examples are from the LOB in total, and how many are from KOLHAPUR in total. These frequencies are fixed and you must accept them as given. If the distribution of these frequencies across the four combined conditions (will+LOB, shall+LOB, will+KOLHAPUR, shall+KOLHAPUR) were based on chance, then they should be distributed proportionally.
    What does this mean? Well, 2,679 future tense modals occurred in the LOB, and 2,337 occurred in the KOLHAPUR. In other words, 53.4 per cent of your modals occurred in the LOB (2,679 divided by 5,016), and 46.6 per cent occurred in the KOLHAPUR (2,337 divided by 5,016). If both modals occurred based on chance, then they should follow this general distribution, i.e. 53.4 per cent of all instances of will and 53.4 per cent of all instances of shall should occur in the LOB, and 46.6 per cent of each of the two modals in the KOLHAPUR. 53.4 per cent of 4,290 (the total number of cases of will) is 2,291.25 if you use the unrounded proportion, which is, of course, exactly the number we have in the top left cell of Table 3. The other three cell values can be arrived at in the same way (again, feel free to check this for yourself)!
    To get back to our example, we can see that will occurred more frequently than expected in the LOB, while shall occurred less frequently; in the KOLHAPUR it was the other way around. In other words, it seems to be the case that shall is used more frequently in Indian English than in British English. Why this is the case or whether it tells us something interesting about language need not concern us here, but it is important that you understand the logic by which we arrived at this result, so take your time thinking about it.
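If you would like to see the proportional reasoning at work, the following Python sketch derives the expected frequencies from the corpus shares rather than from the multiplication rule. The names `shares` and `expected` are invented for this illustration; the totals are those given in the text.

```python
col_totals = {"LOB": 2679, "KOLHAPUR": 2337}
row_totals = {"will": 4290, "shall": 726}
table_total = 5016

# Share of all future-tense modals contributed by each corpus.
shares = {corpus: total / table_total for corpus, total in col_totals.items()}
print({corpus: round(100 * share, 1) for corpus, share in shares.items()})
# {'LOB': 53.4, 'KOLHAPUR': 46.6}

# Under chance, every modal is split across the corpora in these same shares.
expected = {
    modal: {corpus: row_total * share for corpus, share in shares.items()}
    for modal, row_total in row_totals.items()
}
print(round(expected["will"]["LOB"], 2))  # 2291.25
```

Because the share is simply the column total divided by the table total, multiplying it by a row total is the same calculation as the multiplication rule, only read in a different order.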

Okay, now that you have understood, you are ready for the third and almost final step: calculating the differences between the expected and observed frequencies in order to see whether they are big enough for you to claim that they cannot be due to chance! This is done in the following way: for each cell, you take the difference between the observed and the expected frequency and multiply it by itself (i.e., you take it to the power of 2). This ensures that we get only positive numbers (remember this from high school?), and it stresses big differences (which get even bigger if you multiply them by themselves), and plays down small differences (which don't get much bigger if you multiply them by themselves). Then you take each of your results and divide it by the corresponding expected frequency (this expresses each difference as a proportion of the expected frequency, which is useful if the expected frequencies differ a lot from cell to cell). For example, for the top left cell, you subtract 2,291.25 from 2,316, giving you 24.75. Taken to the power of 2, this gives you 612.56. You divide this by 2,291.25, giving you 0.2673. Now you repeat this for the other three cells. If you click on the “Calculate” button next to Table 4, these calculations will be performed for you, but -- you've guessed it -- you should check the results for yourself for extra practice.
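As a cross-check on your calculator work, here is a small Python sketch of this third step. The helper name `contribution` is made up for this example, and the expected frequencies are entered directly as they result from the second step.

```python
observed = [[2316, 1974], [363, 363]]              # rows: will, shall
expected = [[2291.25, 1998.75], [387.75, 338.25]]  # expected frequencies (Table 3)

def contribution(o, e):
    """Squared difference between observed and expected, divided by expected."""
    return (o - e) ** 2 / e

contributions = [
    [contribution(o, e) for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

for label, row in zip(["will", "shall"], contributions):
    print(label, [round(value, 4) for value in row])
# will [0.2673, 0.3065]
# shall [1.5798, 1.811]
```

Note that the difference is 24.75 in every cell (only its sign changes), so the cells with the smaller expected frequencies, those for shall, contribute most.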

Table 4: Differences Between Observed and Expected Frequencies

Now, you are ready for the final step: calculating the Chi-Square value for your distribution. This is very easy (and you've deserved a break)! You simply add up the cell values from Table 4. Clicking on the “Calculate” button next to this field will do it for you (relax -- this is so easy, you don't even have to check it with your calculator).

Chi-Square: 3.97
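Putting the steps together, the whole calculation fits into a short Python sketch (variable names are again invented for illustration). Computed without any intermediate rounding, the sum comes out at roughly 3.96, marginally below the 3.97 quoted in the text; the small gap is presumably a matter of rounding and does not affect the conclusions drawn below.

```python
observed = [[2316, 1974], [363, 363]]  # rows: will, shall; columns: LOB, KOLHAPUR

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / table_total  # expected frequency
        chi_square += (o - e) ** 2 / e                   # add this cell's value

print(round(chi_square, 4))  # 3.9646
```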

But wait! What does this value mean? Is 3.97 good or bad? Is it big enough a difference? A real statistics package (and the spreadsheet you can download below) will convert this number into a probability of error for you. This is the level we mentioned above -- way back before you knew how to do a Chi-Square test -- i.e. the level at which you are prepared to state your certainty that you are or are not dealing with a chance distribution. In statistics, it is customary to require a probability of error of five per cent or lower in order to say that something is significant; in other words, a distribution is statistically significant if the likelihood that it has come about by chance is below five per cent. If the likelihood is below one per cent, we say that the distribution is very significant, and if it is below 0.1 per cent, we say that it is highly significant. Incidentally, in statistics, percentages are written as decimal numbers, so five per cent is written as 0.05, one per cent is written as 0.01, and you can work out for yourself what 0.1 per cent is written as.
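To give an idea of what such a conversion involves, here is a minimal Python sketch that is valid only for a 2-by-2 table such as Table 1. It relies on a standard identity for that special case (the probability of error equals erfc of the square root of half the Chi-Square value); it is not a description of how the spreadsheet or any particular package works, and the function names are made up for this example.

```python
import math

def p_value_2x2(chi_square):
    """Probability of error for the Chi-Square value of a 2-by-2 table.

    Assumption: this shortcut holds only for 2-by-2 tables; larger tables
    need a statistics package or a printed table of critical values.
    """
    return math.erfc(math.sqrt(chi_square / 2))

def significance_label(p):
    """Map a probability of error to the labels used in the text."""
    if p < 0.001:
        return "highly significant"
    if p < 0.01:
        return "very significant"
    if p < 0.05:
        return "significant"
    return "not significant"

p = p_value_2x2(3.97)
print(significance_label(p))  # significant
```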
    Okay, you say, but what about our Chi-Square value? Well, in the old days, people had to look up their Chi-Square values in long printed tables, which would have looked something like this:

df    5.0%     1.0%     0.1%
1     3.841    6.635    10.828
2     5.991    9.210    13.816
3     7.815    11.345   16.266
4     9.488    13.277   18.467
...   ...      ...      ...
Table 5: Old-fashioned Table of Chi-Square Values

So how do we look up the probability of error in a table such as this? Well, we determine the degrees of freedom (df) for our distribution, then we find the appropriate line and check whether our Chi-Square value is bigger than one of the numbers given in that line. If so, we check the column heading, and we finally know at what level our distribution is significant. Oh no, you moan, what is a “degree of freedom”? Haven't we come across enough strange concepts already? Unfortunately, you haven't, but degrees of freedom are easy enough to understand: the df of a table is the number of rows (without totals) minus one, multiplied by the number of columns (without totals) minus one. In our case, this is (2-1) * (2-1), which is 1 * 1 = 1. So we go to the first line of Table 5, and start checking: 3.97 is bigger than 3.841, and it is not bigger than any of the other figures in this line. The value 3.841 appears in the column headed 5.0%, so our distribution is significant at the five-per-cent level. In statistical parlance, p<0.05. If we were to report this result in a paper, we would give the Chi-Square value and the degrees of freedom too -- we might say something like “This paper has shown that the modal shall is used significantly more frequently in Indian English than in British English (X2=3.97 (df=1), p<0.05)” (the X is meant to represent the Greek letter Chi, as shown in the formula below). And now, we are finally through, except for the...
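The look-up procedure can also be sketched in Python. The critical values below are copied from Table 5 and therefore cover only one to four degrees of freedom; the function names are invented for this illustration.

```python
# Critical values copied from Table 5: df -> thresholds at 5%, 1%, and 0.1%.
CRITICAL_VALUES = {
    1: (3.841, 6.635, 10.828),
    2: (5.991, 9.210, 13.816),
    3: (7.815, 11.345, 16.266),
    4: (9.488, 13.277, 18.467),
}
LEVELS = (0.05, 0.01, 0.001)

def degrees_of_freedom(n_rows, n_cols):
    """(rows - 1) * (columns - 1), not counting the totals."""
    return (n_rows - 1) * (n_cols - 1)

def significance_level(chi_square, df):
    """Return the strictest level whose threshold is exceeded, or None."""
    reached = None
    for level, threshold in zip(LEVELS, CRITICAL_VALUES[df]):
        if chi_square > threshold:
            reached = level
    return reached

df = degrees_of_freedom(2, 2)
print(df, significance_level(3.97, df))  # 1 0.05
```

A result of `None` means that the Chi-Square value does not exceed even the five-per-cent threshold, i.e. that the distribution is not significant.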

Exercises

For each of the following distributions, calculate the Chi-Square values and check whether they are significant.

Note: You can use the tables in the text above to perform the calculations for exercise 1 to 5, but it may be easier to use this summary. When you are secure in your use of the procedure, you may switch to this .xls spreadsheet, which contains a 2-by-2 test, as well as a 2-by-3 test and a 3-by-3 test, which you need for some of the later exercises. All solutions can be found in the answer sheet (PDF, 116KB).

1. The LOB corpus (British English) contains 2,316 cases of will+INFINITIVE and 505 cases of the contracted form 'll+INFINITIVE; the BROWN corpus (American English) contains 2,237 cases of the full form and 441 cases of the contracted form. Does this distribution differ from the expected one, and if so, is the difference significant?

2. The KOLHAPUR corpus (Indian English) contains 1,974 cases of will+INFINITIVE and 230 cases of the contracted form 'll+INFINITIVE. Contrast this with the British English data from Exercise 1. Does the distribution differ from the expected one, and if so, is the difference significant?

3. The LOB corpus contains 3,184 cases of will/'ll/shall+INFINITIVE and 172 cases of going-to+INFINITIVE (as in I am going to learn statistics). The BROWN corpus contains 2,945 cases of will/'ll/shall+INFINITIVE and 144 cases of going-to+INFINITIVE. Do Americans use the going-to-future more frequently than British speakers?

4. The KOLHAPUR corpus contains 2,567 cases of will/'ll/shall+INFINITIVE and 75 cases of going-to+INFINITIVE. Do Indian English speakers use the going-to-future more or less frequently than British speakers?

5. Now compare all three varieties with respect to the will/'ll/shall-future and the going-to-future. You will have to use the downloadable .xls file for this and some of the following exercises (open the file and select the appropriate spreadsheet (2-by-2, 2-by-3, or 3-by-3) by clicking on the tabs at the bottom of the page).

6. Let's look at something other than the future tense for a moment. Gries and Stefanowitsch find that the verb give occurs 461 times with the ditransitive construction (Billy gave Diane a present) and 146 times with the to-dative. The ditransitive occurred with other verbs 574 times and the to-dative 1,773 times. Calculate the expected frequencies for give in the two constructions and determine whether the distribution is significant.

7. Stubbs, arguing that the adjective small is a neutral description of size while the adjective little has a connotation of “cuteness”, cites the following co-occurrence frequencies from a 200 million word corpus: little girl(s) 3,100, little boy(s) 2,000, small girl(s) 100, small boy(s) 440. (a) What are the expected frequencies for each adjective-noun combination, and is this 2-by-2 distribution significant? (b) Assume that in the same corpus, the combination little child(ren) occurs 1,000 times and small child(ren) occurs 2,250 times. Repeat your calculations for this 2-by-3 distribution.

8. Back to the future (tense)! Test the following distribution using the 3-by-3 sheet in the spreadsheet: LOB will 2,316, shall 363, going-to 172; BROWN will 2,237, shall 267, going-to 144; KOLHAPUR will 1,974, shall 363, going-to 75.

The data used in the text and in the exercises is taken from:

Berglund, Ylva. 1997. Future in Present-day English: Corpus-based evidence on the rivalry of expressions. ICAME Journal 21: 7-19.

Gries, Stefan, and Anatol Stefanowitsch. 2004. Extending collostructional analysis: a corpus-based perspective on 'alternations'. International Journal of Corpus Linguistics 9.1: 97-129.

Stubbs, Michael. 2001. Words and phrases: corpus studies of lexical semantics. Oxford: Blackwell.
 

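For those who are interested, the formula for calculating the Chi-Square value is as follows (r stands for “row”, c stands for “column”, O stands for “observed frequency”, and E stands for “expected frequency”; the sums run over all rows i and all columns j of the table):

\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]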
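Read as a recipe, this general definition (sum over all rows and columns of the squared difference between O and E, divided by E) works for tables of any size, not just 2-by-2. The following Python sketch is one possible way of writing it down; the function name `chi_square` and the list-of-rows input format are assumptions made for this illustration, and the example data are those of Table 1.

```python
def chi_square(observed):
    """Chi-Square value and degrees of freedom for an r-by-c table.

    `observed` is a list of rows, each a list of observed frequencies.
    Assumes every row and column total is greater than zero.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    value = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # E for row i, column j
            value += (o - e) ** 2 / e                  # (O - E)^2 / E
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return value, df

# Table 1 from the text: rows will/shall, columns LOB/KOLHAPUR.
value, df = chi_square([[2316, 1974], [363, 363]])
print(round(value, 2), df)  # 3.96 1
```

The function applies no corrections of any kind, so its results may differ slightly from those of statistics packages that do.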


 

© 2004 Anatol Stefanowitsch
Version 1.0 / Last Update October 2005