4.1 The problem
Until now we have been considering how to describe or summarise a set of data considered simply as an object in its own right. Very often we want to do more than this: we wish to use a collection of observed values to make inferences about a larger set of potential values; we would like to consider a particular set of data we have obtained as representing a larger class. It turns out that to accomplish this is by no means straightforward. What is more, an exhaustive treatment of the difficulties involved is beyond the scope of this book. In this chapter we can only provide the reader with a general outline of the problem of making inferences from observed values. A full understanding of this exposition will depend to some degree on familiarity with the content of later chapters. For this reason we suggest that this chapter is first read to obtain a general grasp of the problem, and returned to later for re-reading in the light of subsequent chapters.
We will illustrate the problem of inference by introducing some of the cases which we will analyse in greater detail in the chapters to come. One, for example, in chapter 8, concerns the size of the comprehension vocabulary of British children between 6 and 7 years of age. It is obviously not possible, for practical reasons, to test all British children of this age. We simply will not have the resources. We can only test a sample of children. We have learned, in chapters 2 and 3, how to make an adequate description of an observed group, by, for example, constructing a histogram or calculating the mean and standard deviation of the vocabulary sizes of the subset of children selected. But our interest is often broader than this; we would like to know the mean and standard deviation which would have been obtained by testing all children of the relevant age. How close would these have been to the mean and standard deviation actually observed? This will depend on the relationship we expect to hold between the group we have selected to measure and the larger group of children from which it has been selected. How far can we assume the characteristics of this latter group to be similar to those of the smaller group which has been observed? This is the classical problem of statistical inference: how to infer from the properties of a part the likely properties of the whole. It will turn up repeatedly from now on. It is worth emphasising at the outset that because of the way in which samples are selected in many studies in linguistics and applied linguistics, it is often simply not possible to generalise beyond the samples. We will return to this difficulty.
A population is the largest class to which we can generalise the results of an investigation based on a subclass. The population of interest (or target population) will vary in type and magnitude depending on the aims and circumstances of each different study or investigation. Within the limits set by the study in question, the population, in statistical terms, will always be considered as the set of all possible values of a variable. We have already referred to one study which is concerned with the vocabulary of 6-7-year-olds. The variable here is scores on a test for comprehension vocabulary size; the population of interest is the set of all possible values of this variable which could be derived from all 6-7-year-old children in the country. There are two points which should be apparent here. First, although as investigators our primary interest is in the individuals whose behaviour we are measuring, a statistical population is to be thought of as a set of values; a mean vocabulary size calculated from a sample of observed values is, as we shall see in chapter 7, an estimate of the mean vocabulary size that would be obtained from the complete set of values which form the target population. The second point that should be apparent is that it is often not straightforward in language studies to define the target population. After all, the set of '6-7-year-old children in Britain', if we take this to refer to the period between the sixth and seventh birthdays, is changing daily; so for us to put some limit on our statistical population (the set of values which would be available from these children) we have to set some kind of constraint. We return to this kind of problem below when we consider sampling frames. For the moment let us consider further the notion of 'target' or 'intended' population in relation to some of the other examples used later in the book.
Utterance length. If we are interested in the change in utterance length over time in children's speech, and collect data which sample utterance length, the statistical population in this case is composed of the length values of the individual utterances, not the utterances themselves. Indeed, we could use the utterances of the children to derive values for many different variables and hence to construct many different statistical populations. If instead of measuring the length of each utterance we gave each one a score representing the number of third person pronouns it contained, the population of interest would then be 'third person pronoun per utterance scores'.
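The idea that one set of utterances can give rise to several statistical populations can be sketched in a few lines of Python. The utterances below are invented for illustration, length is counted in words rather than morphemes to keep the sketch short, and the pronoun list is a simplifying assumption rather than a full inventory.

```python
# Invented utterances, for illustration only (not data from any study).
utterances = [
    "he wants it",
    "no",
    "she gave them the ball",
    "I don't know",
]

# Assumed, simplified list of third person pronouns.
THIRD_PERSON = {"he", "she", "it", "they", "him", "her", "them"}


def length_in_words(utterance):
    """Length of an utterance, counted here in words (an assumption)."""
    return len(utterance.split())


def third_person_count(utterance):
    """Number of third person pronouns in an utterance."""
    return sum(1 for word in utterance.lower().split() if word in THIRD_PERSON)


# Two different variables measured on the same utterances give
# two different sets of values.
length_values = [length_in_words(u) for u in utterances]
pronoun_values = [third_person_count(u) for u in utterances]
```

The utterances themselves appear in neither list; each list holds only the values of one variable, which is what the statistical population consists of.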
Voice onset time (VOT). In the study first referred to in chapter 1, Macken & Barton (1980a) investigated the development of children's acquisition of initial stop contrasts in English by measuring VOTs for plosives the children produced which were attempts at adult voiced and voiceless stop contrasts. The statistical population here is the VOT measurements for /p, b, t, d, k and g/ targets, not the phonological items themselves. Note once again that it is not at all easy to conceptualise the target population. If we do not set any limit, the population (the values of all VOTs for word-initial plosives pronounced by English children) is infinite. It is highly likely, however, that the target population will necessarily be more limited than this as a result of the circumstances of the investigation from which the sample values are derived. Deliberate constraints (for example a sample taken only from children of a certain age) or accidental ones (non-random sampling - see below) will either constrain the population of interest or make any generalisation difficult or even impossible.
Tense marking. In the two examples we have just looked at, the population values can vary over a wide range. For other studies we can envisage large populations in which the individual elements can have one of only a few, or even two, distinct values. In the Fletcher & Peters (1984) study (discussed in chapter 7) one of the characteristics of the language of children in which the investigators were interested was their marking of lexical verbs with the various auxiliaries and/or suffixes used for tense, aspect and mood in English. They examined frequencies of a variety of individual verb forms (modal, past, present, do-support, etc.). However, it would be possible to consider, for example, just past tense marking and to ask, for children of a particular age, which verbs that referred to past events were overtly marked for past tense, and which were not. So if we looked at the utterances of a sample of children of 2;6, we could assign the value 1 to each verb marked for past tense, and zero to unmarked verbs. The statistical population of interest (the values of the children's past referring verbs) would similarly be envisaged as consisting of a large collection of elements, each of which could only have one or the other of these two values.
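A two-valued population of this kind is easy to represent. The sketch below uses invented codings, not values from the Fletcher & Peters study, and relies on one simple fact: when every value is 0 or 1, the mean of the values is the proportion of elements coded 1.

```python
# Invented codings: 1 = verb overtly marked for past tense, 0 = unmarked.
past_marking = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

# Because the values are only 0 or 1, their mean is the proportion
# of past-referring verbs that are marked for past tense.
proportion_marked = sum(past_marking) / len(past_marking)
```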
A population then, for statistical purposes, is a set of values. We have emphasised that in linguistic studies of the kind represented in this book it is not always easy to conceptualise the population of interest. Let us assume for the moment, however, that by various means we succeed in defining our target population, and return to the problem of statistical inference from another direction. While we may be ultimately interested in populations, the values we observe will be from samples. How can we ensure that we have reasonable grounds for claiming that the values from our sample are accurate estimates of the values in the population? In other words, is it possible to construct our sample in such a way that we can legitimately make the inference from its values to those of the population we have determined as being of interest? This is not a question to which we can respond in any detail here. Sampling theory is itself the subject of many books. But we can illustrate some of the difficulties that are likely to arise in making generalisations in the kinds of studies that are used for exemplification in this book, which we believe are not untypical of the field as a whole.
Common sense would suggest that a sample should be representative of the population, that is, it should not, by overt or covert bias, have a structure which is too different from the target population. But more technically (remembering that the statistical population is a set of values), we need to be sure that the values that constitute the sample somehow reflect the target statistical population. So, for example, if the possible range of values for length of utterance for 3-year-olds is 1 to 11 morphemes, with larger utterances possible but very unusual, we need to ensure that we do not introduce bias into the sample by only collecting data from a conversational setting in which an excessive number of yes-no questions are addressed to the child by the interlocutor. Such questions would tend to increase the probability of utterance lengths which are very short - minor utterances like 'yes', 'no', or short sentences like 'I don't know'. The difficulty is that this is only one kind of bias that might be introduced into our sample. Suppose that the interlocutor always asked open-ended questions, like 'What happened?' This might increase the probability of longer utterances by a child or children in the sample. And there must be sources of bias that we do not even contemplate, and cannot control for (even assuming that we can control for the ones we can contemplate).1
Fortunately there is a method of sampling, known as random sampling, that can overcome problems of overt or covert bias. What this term means will become clearer once we know more about probability. But it is important to understand from the outset that 'random' here does not mean that the events in a sample are haphazard or completely lacking in order, but rather that they have been constructed by a procedure that allows every element in the population a known probability of being selected in the sample.
While we can never be entirely sure that a sample is representative (that it has roughly the characteristics of the population relevant to our investigation), our best defence against the introduction of experimenter bias is to follow a procedure that ensures random samples (one such procedure will be described in chapter 5). This can give us reasonable confidence that our inferences from sample values to population values are valid. Conversely, if our sample is not constructed according to a random procedure we cannot be confident that our estimates from it are likely to be close to the population values in which we are interested, and any generalisation will be of a dubious nature.
How are the samples constructed in the studies we consider in this book? Is generalisation possible from them to a target population?
4.3 The theoretical solution
It will perhaps help us to answer these questions if we introduce the notion of a sampling frame by way of a non-linguistic example. This will incidentally clarify some of the difficulties we saw earlier in attempts to specify populations.
Suppose researchers are interested in the birth weights of children born in Britain in 1984 (with a view ultimately to comparing birth weights in that year with those of 1934). As is usual with any investigation, their resources will only allow them to collect a subset of these measurements - but a fairly large subset. They have to decide where and how this subset of values is to be collected. The first decision they have to make concerns the sources of their information. Maternity hospital records are the most obvious choice, but this leaves out babies born at home. Let us assume that health visitors (who are required to visit all new-born children and their mothers) have accessible records which can be used. What is now required is some well-motivated limits on these records, to constitute a sampling frame within which a random sample of birth weights can be constructed.
The most common type of sampling frame is a list (actual or notional) of all the subjects in the group to which generalisation is intended. Here, for example, we could extract a list of all the babies with birth-dates in the relevant year from the records of all health visitors in Great Britain. We could then choose a simple random sample (chapter 5) of n of these babies and note the birth weights in their record. If n is large, the mean weight of the sample should be very similar (chapter 7) to the mean for all the babies born in that year. At the very least we will be able to say how big the discrepancy is likely to be (in terms of what is known as a 'confidence interval' - see chapter 7).
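As a rough illustration of drawing a simple random sample from a list-type sampling frame, consider the Python sketch below. The frame is simulated: the weights, the frame size of 10,000, the sample size of 500 and the seed are all assumptions made for the example, not real records. The procedure for random sampling that the book itself describes is given in chapter 5.

```python
import random
import statistics

rng = random.Random(1984)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: one invented birth weight (kg) per baby.
# These numbers are simulated, not real health visitor records.
frame = [round(rng.gauss(3.4, 0.5), 2) for _ in range(10_000)]


def simple_random_sample(frame, n, rng):
    """Draw n entries from the frame without replacement,
    every entry being equally likely to be chosen."""
    return rng.sample(frame, n)


sample = simple_random_sample(frame, 500, rng)

population_mean = statistics.mean(frame)
sample_mean = statistics.mean(sample)

# How far the sample mean falls from the mean of the whole frame.
discrepancy = abs(sample_mean - population_mean)
```

In a real study only `sample_mean` would be available; the frame mean can be computed here only because the whole frame has been simulated.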
The problem with this solution is that the construction of the sampling frame would be extremely time-consuming and costly. Other options are available. For example, a sampling frame could be constructed in two or more stages. The country (Britain) could be divided into large regions, Scotland, Wales, North-East, West Midlands, etc., and a few regions chosen from this first stage sampling frame. For each of the selected regions a list of Health Districts can be drawn up (second stage) and a few Health Districts chosen, at random, from each region. Then it may be possible to look at the records of all the health visitors in each of the chosen Districts or perhaps a sample of visitors can be chosen from the list (third stage) of all health visitors in each district.2
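The staged selection just described might be sketched as follows. Everything in the frame is a placeholder invented for the example (region, district and visitor names, and the numbers at each stage), and the sketch assumes random selection at every stage, including the choice of regions.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical frame: region -> Health District -> health visitors.
# All names are placeholders invented for this sketch.
frame = {
    f"Region {r}": {
        f"District {r}.{d}": [f"Visitor {r}.{d}.{v}" for v in range(1, 7)]
        for d in range(1, 6)
    }
    for r in range(1, 9)
}


def multistage_sample(frame, n_regions, n_districts, n_visitors, rng):
    """Choose regions, then districts within each chosen region,
    then visitors within each chosen district."""
    chosen = []
    for region in rng.sample(sorted(frame), n_regions):              # stage 1
        districts = frame[region]
        for district in rng.sample(sorted(districts), n_districts):  # stage 2
            visitors = districts[district]
            chosen.extend(rng.sample(visitors, n_visitors))          # stage 3
    return chosen


visitors = multistage_sample(
    frame, n_regions=3, n_districts=2, n_visitors=4, rng=rng
)
```

Only the lists for the chosen regions and districts ever need to be drawn up, which is what makes the staged frame cheaper than a single national list.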
The major constraint is of course resources - the time and money available for data collection and analysis. In the light of this, sensible decisions have to be made about, for example, the number of Health Districts in Britain to be included in the frame; or it may be necessary to limit the inquiry to children born in four months in the year instead of a complete year. In this example, the sampling frame mediates between the population of interest (which is the birth weights of all children born in Britain in 1984) and the sample, and allows us to generalise from the sample values to those in the population of interest.
If we now return to an earlier linguistic example, we can see how the sampling frame would enable us to link our sample with a population of interest. Take word-initial VOTs. Our interest will always be in the individuals of a relatively large group and in the measurements we derive from their behaviour. In the present case we are likely to be concerned with English children between 1;6 and 2;6, because this seems to be the time when they are learning to differentiate voiceless from voiced initial stops using VOT as a crucial phonetic dimension. Our resources will be limited. We should, however, at least have a sampling frame which sets time limits (for instance, we could choose for the lower limit of our age-range children who are 1;6 in a particular week in 1984); we would like it to be geographically well-distributed (we might again use Health Districts); within the sampling frame we must select a random sample of a reasonable size.3
That is how we might go about selecting children for such a study. But how are language samples to be selected from a child? Changing the example, consider the problem of selecting utterances from a young child to measure his mean length of utterance (mlu - see chapter 13). Again it is possible to devise a sampling frame. One method would be to attach a radio microphone to the child, which would transmit and record every single utterance he makes over some period of time to a tape-recorder. Let us say we record all his utterances over a three-month period. We could then attach a unique number to each utterance and choose a simple random sample (chapter 5) of utterances. This is clearly neither sensible nor feasible - it would require an unrealistic expenditure of resources. Alternatively, and more reasonably, we could divide each month into days and each day into hours, select a few days at random and a few hours within each day and record all the utterances made during the selected hours. (See Wells 1985: chapter 1, for a study of this kind.) If this method of selection were to be used it would be better to assume that the child is active only between, say, 7 a.m. and 8 p.m. and select hours from that time period.
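The day-and-hour scheme can be sketched in Python as below. The 30-day month, the numbers of days and hours chosen, and the seed are assumptions made for the example; the waking period of 7 a.m. to 8 p.m. follows the text, so the one-hour slots start at 7 a.m. through 7 p.m.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

DAYS_IN_MONTH = 30                 # assumed month length
WAKING_HOURS = list(range(7, 20))  # slots starting 7 a.m. ... 7 p.m.


def choose_recording_slots(n_days, n_hours, rng):
    """Pick days at random, then hours at random within each chosen day."""
    slots = []
    for day in sorted(rng.sample(range(1, DAYS_IN_MONTH + 1), n_days)):
        for hour in sorted(rng.sample(WAKING_HOURS, n_hours)):
            slots.append((day, hour))
    return slots


# Each (day, hour) pair marks one hour in which every utterance is recorded.
slots = choose_recording_slots(n_days=4, n_hours=3, rng=rng)
```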
In a similar way, it will always be possible to imagine how a sampling frame could be drawn up for any finite population if time and other resources were unlimited. The theory which underlies all the usual statistical methods assumes that, if the results obtained from a sample are to be generalised to a wider population, a suitable sampling frame has been established and the sample chosen randomly from the frame. In practice, however, it is frequently impossible to draw up an acceptable sampling frame - so what, then, can be done?
4.4 The pragmatic solution
In any year a large number of linguistic studies of an empirical nature are carried out by many researchers in many different locations. The great majority of these studies will be exploratory in nature; they will be designed to test a new hypothesis which has occurred to the investigator or to look at a modification of some idea already studied and reported on by other researchers. Most investigators have very limited resources and, in any case, it would be extravagant to carry out a large and expensive study unless it was expected to confirm and give more detailed information on a hypothesis which was likely to be true and whose implications had deep scientific, social or economic significance. Of necessity, each investigator will have to make use of the experimental material (including human subjects) to which he can gain access easily. This will almost always preclude the setting up of sampling frames and the random selection of subjects.
At first sight it may look as if there is an unbridgeable gap here. Statistical theory requires that sampling should be done in a special way before generalisation can be made formally from a sample to a population. Most studies do not involve samples selected in the required fashion. Does this mean that statistical techniques will be inapplicable to these studies? Before addressing this question directly, let us step back for a moment and ask what it is, in the most general sense, that the discipline of statistics offers to linguistics if its techniques are applicable.
What the techniques of statistics offer is a common ground, a common measuring stick by which experimenters can measure and compare the strength of evidence for one hypothesis or another that can be obtained from a sample of subjects, language tokens, etc. This is worth striving after. Different studies will measure quantities which are more or less variable and will include different numbers of subjects and language tokens. Language researchers may find several different directions from which to tackle the same issue. Unless a common ground can be established on which the results of different investigations can be compared using a common yardstick it would be almost impossible to assess the quality of the evidence contained in different studies or to judge how much weight to give to conflicting claims.
Returning to the question of applicability, we would suggest that a sensible way to proceed is to accept the results of each study, in the first place, as though any sampling had been carried out in a theoretically 'correct' fashion. If these results are interesting - suggesting some new hypothesis or contradicting a previously accepted one, for example - then is time enough to question how the sample was obtained and whether this is likely to have a bearing on the validity of the conclusions reached. Let us look at an example.
In chapter 11 we discuss a study by Hughes & Lascaratou (1981) on the gravity of errors in written English as perceived by two different groups: native English-speaking teachers of English and Greek teachers of English. We conclude that there seems to be a difference in the way that the two groups judge errors, the Greek teachers tending to be more severe in their judgements. How much does this tell us about possible differences between the complete population of native-speaking English teachers and Greek teachers of English? The results of the experiment would carry over to those populations - in the sense to be explained in the following four chapters - if the samples had been selected carefully from complete sampling frames. This was certainly not done. Hughes and Lascaratou had to gain the co-operation of those teachers to whom they had ready access. The formally correct alternative would have been prohibitively expensive. However, both samples of teachers at least contained individuals from different institutions. If all the English teachers had come from a single English institution and all the Greek teachers from a single Greek school of languages then it could be argued that the difference in error gravity scores could be due to the attitudes of the institutions rather than the nationality of the teachers. On the other hand, all but one of the Greek teachers worked in Athens (the English teachers came from a wider selection of backgrounds) and we might query whether their attitudes to errors might be different from those of their colleagues in different parts of Greece. Without testing this argument it is impossible to refute it, but on common sense grounds (i.e. the 'common sense' of a researcher in the teaching of second languages) it seems unlikely.
This then seems a reasonable way to proceed. Judge the results as though they were based on random samples and then look at the possibility that they may be distorted by the way the sample was, in fact, obtained. However, this imposes on researchers the inescapable duty of describing carefully how their experimental material - including subjects - was actually obtained. It is also good practice to attempt to foresee some of the objections that might be made about the quality of that material and either attempt to forestall criticism or admit openly to any serious defects.
When the subjects themselves determine to which experimental group they belong, whether deliberately or accidentally, the sampling needs to be done rather more carefully. An important objective of the Fletcher & Peters (1984) study mentioned earlier was to compare the speech of language-normal with that of language-impaired children. In this case the investigators could not randomly assign children to one of these groups - they had already been classified before they were selected. It is important in this kind of study to try to avoid choosing either of the samples such that they belong obviously to some special subgroup.
There is one type of investigation for which proper random sampling is absolutely essential. If a reference test of some kind is to be established, perhaps to detect lack of adequate language development in young children, then the test must be applicable to the whole target population and not just to a particular sample. Inaccuracies in determining cut-off levels for detecting children who should be given special assistance have economic implications (e.g. too great a demand on resources) and social implications (language-impaired children not being detected). For studies of this nature, a statistician should be recruited before any data is collected and before a sampling frame has been established.
With this brief introduction to some of the problems of the relation between sample and population, we now turn in chapter 5 to the concept of probability as a crucial notion in providing a link between the properties of a sample and the structure of its parent population. In the final section of that chapter we outline a procedure for random sampling. Chapter 6 deals with the modelling of statistical populations, and introduces the normal distribution, an important model for our purposes in characterising the relation between sample and population.
1 We have passed over here an issue which we have to postpone for the moment, but which is of considerable importance for much of the research done in language studies. Imagine the case where the population of interest is utterance lengths of British English-speaking pre-school children. We have to consider whether it is better to construct a sample which consists of many utterances from a few children, or one which consists of a small number of utterances from each of many children. We will return to this question, and the general issue of sample size, in chapter 7.
2 The analysis of the data gathered by such complex sampling schemes can become quite complicated and we will not deal with it in this book. Interested readers should see texts on sampling theory or survey design or consult an experienced survey statistician.
3 The issues raised in the first footnote crop up again here; measurements on linguistic variables are more complex than birth weights. We could again ask whether we should collect many word-initial plosives from few children, or few plosives from many children (see chapter 7). A similar problem arises with the sample chosen by several stages. Is it better to choose many regions and a few Health Districts in each region or vice versa?
In this chapter the basic problem of the sample-population relationship has been discussed.
(Statistics in Language Studies by Anthony Woods, Paul Fletcher & Arthur Hughes)