SAMPLING
Many disciplines need to describe the characteristics of some large entity, such as the air quality in a region, the prevalence of smoking in the general population, or the output from a production line of a pharmaceutical company. For practical reasons, it is rarely feasible to assay the entire atmosphere, interview every person in the nation, or test every pill. Sampling is the process whereby information is obtained from selected parts of an entity, with the aim of making general statements that apply to the entity as a whole, or to an identifiable part of it. Opinion pollsters use sampling to gauge political allegiances or preferences for brands of commercial products, whereas water quality engineers employed by public health departments take samples of water to make sure it is fit to drink. The process of drawing conclusions about the larger entity based on the information contained in a sample is known as statistical inference.
There are several advantages to using sampling rather than conducting measurements on an entire population. An important advantage is the considerable savings in time and money that can result from collecting information from a much smaller number of units. When sampling individuals, the reduced number of subjects that need to be contacted may allow more resources to be devoted to finding and persuading nonresponders to participate. The information collected using sampling is also often more accurate, as greater effort can be expended on the training of interviewers, more sophisticated and expensive measurement devices can be used, repeated measurements can be taken, and more detailed questions can be posed.
DEFINITIONS
The term "target population" is commonly used to refer to the group of people or entities (the "universe") to which the findings of the sample are to be generalized. The "sampling unit" is the basic unit (e.g., person, household, pill) around which a sampling procedure is planned. For instance, if one wanted to apply sampling methods to estimate the prevalence of diabetes in a population, the sampling unit would be the person, whereas the household would be the sampling unit for a study to determine the number of households in which one or more persons were smokers. The "sampling frame" is any list of all the sampling units in the target population. Although a complete list of all individuals in a population is rarely available, an alphabetic listing of the residents of a community and a list of registered voters are examples of sampling frames.
SAMPLING METHODS
The general goal of all sampling methods is to obtain a sample that is representative of the target population. In other words, apart from random error, the information derived from the sample is expected to be the same as would have been obtained had a complete census of the target population been carried out. The procedures used to select a sample require some prior knowledge of the target population, which allows a determination of the size of the sample needed to achieve a reasonable estimate (with accepted precision and accuracy) of the characteristics of the population. Most sampling methods attempt to select units such that each has a definable probability of being chosen. Methods that adopt this approach are called "probability sampling methods." Examples of such methods include simple random sampling, systematic sampling, stratified sampling, and cluster sampling.
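As an illustration of how prior knowledge of the target population feeds into the choice of sample size, the sketch below uses the common normal-approximation formula for estimating a proportion from a simple random sample, n = z^2 p(1 - p) / d^2. The formula, the function name, and the example values (an expected prevalence of 25 percent and a margin of error of 5 percentage points) are assumptions introduced here for illustration. It also assumes a target population much larger than the sample.

```python
import math

def sample_size_for_proportion(expected_p, margin, z=1.96):
    """Approximate sample size needed to estimate a proportion.

    expected_p: prior guess at the proportion in the target population
    margin:     acceptable margin of error (half-width of the interval)
    z:          normal quantile; 1.96 corresponds to 95 percent confidence
    Assumes simple random sampling from a large population.
    """
    if not 0 < expected_p < 1:
        raise ValueError("expected_p must lie strictly between 0 and 1")
    if margin <= 0:
        raise ValueError("margin must be positive")
    return math.ceil(z ** 2 * expected_p * (1 - expected_p) / margin ** 2)

# Hypothetical example: smoking prevalence believed to be about 25 percent,
# to be estimated to within 5 percentage points.
print(sample_size_for_proportion(0.25, 0.05))  # 289
```

A prior guess of 0.5 gives the largest (most conservative) sample size for a given margin, which is why it is often used when little is known about the population.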
A random sample is one in which every person (or unit) in the population from which the sample is drawn has a known, nonzero chance of being included. Ideally, the selections that make up the sample are made independently; that is, the selection of one unit does not affect the chance of another unit being selected. The simplest approach, in which each unit has an equal probability of being chosen, is referred to as simple random sampling.
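A simple random sample can be sketched in a few lines, assuming a complete sampling frame is available as a list. The frame of numbered residents and the function name are hypothetical. The sketch draws without replacement, as is usual in surveys, so each unit has the same inclusion probability n/N.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame.

    Every unit has the same chance, n / len(frame), of being included.
    """
    units = list(frame)
    if n > len(units):
        raise ValueError("sample size cannot exceed the size of the sampling frame")
    rng = random.Random(seed)  # seeded only so the draw can be reproduced
    return rng.sample(units, n)

# Hypothetical sampling frame: 1,000 numbered residents.
frame = [f"resident-{i:04d}" for i in range(1, 1001)]
sample = simple_random_sample(frame, 50, seed=1)
print(len(sample), len(set(sample)))  # 50 50 (no unit is drawn twice)
```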
Systematic random sampling involves deciding what fraction of the target population is to be sampled, and then compiling an ordered list of the target population. The ordering may be based on the date a patient entered a clinic, the surname of patients, or other factors. The initial sample unit is then randomly selected from within the first k units on the list, and thereafter every kth individual is sampled. Typically, the integer k is calculated by dividing the size of the target population by the desired sample size. This method of sampling is easy to implement in practice, and the sampling frame can be compiled as the study progresses.
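The procedure just described can be sketched as follows, assuming the ordered list is already complete. The function name and the list of 1,000 numbered units are hypothetical. When the population size is not an exact multiple of the desired sample size, the number of units returned may differ slightly from the target.

```python
import random

def systematic_sample(ordered_frame, n, seed=None):
    """Select every kth unit after a random start within the first k units."""
    N = len(ordered_frame)
    if not 0 < n <= N:
        raise ValueError("desired sample size must be between 1 and the frame size")
    k = N // n                 # sampling interval
    rng = random.Random(seed)
    start = rng.randrange(k)   # random position among the first k units (0-based)
    return ordered_frame[start::k]

# Hypothetical ordered list, e.g. patients in order of clinic entry.
ordered_frame = list(range(1, 1001))
sample = systematic_sample(ordered_frame, 50, seed=3)
print(len(sample))  # 50, because 1000 is an exact multiple of 50 (k = 20)
```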
Stratified random sampling divides the population into distinct nonoverlapping subgroups (strata) according to some important characteristic (e.g., age, income) and then selects a random sample within each subgroup. The investigator can use this method to ensure that each subgroup of interest is represented in the sample. This method generally produces more precise estimates of the characteristics of the target population, unless very small numbers of units are selected within individual strata.
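A minimal sketch of stratified sampling is shown below. It assumes proportional allocation (the same sampling fraction in every stratum), which is only one of several possible allocation schemes. The age groups, the stratum sizes, and the function name are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Draw a simple random sample of the given fraction within each stratum.

    stratum_of: function mapping a unit to the label of its stratum
    Returns a dict mapping stratum label to the list of sampled units.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = {}
    for label, members in strata.items():
        # At least one unit per stratum, so every subgroup is represented.
        n = max(1, round(fraction * len(members)))
        sample[label] = rng.sample(members, n)
    return sample

# Hypothetical population of (person id, age group) pairs.
population = (
    [(i, "18-39") for i in range(600)]
    + [(i, "40-64") for i in range(600, 900)]
    + [(i, "65+") for i in range(900, 1000)]
)
sample = stratified_sample(population, lambda unit: unit[1], 0.10, seed=7)
print({label: len(chosen) for label, chosen in sample.items()})
# {'18-39': 60, '40-64': 30, '65+': 10}
```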
Cluster sampling may be used if the study units form natural groups or if an adequate list of the entire population is difficult to compile. In a national survey, for example, clusters may comprise individuals in a localized geographic area. The clusters or regions are selected, preferably at random. The persons in each selected region are then enumerated, and random samples are drawn from these units of the population. Because sampling is performed at multiple levels, this method is sometimes referred to as multistage sampling.
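The two levels of selection can be sketched as follows, assuming the clusters are already known and each selected cluster can be fully enumerated. The region names, the cluster sizes, and the function name are hypothetical, and the sketch selects clusters with equal probability, which is only one possible design.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: choose clusters at random. Stage 2: sample units within each.

    clusters: dict mapping cluster name to the list of units it contains
    Returns a dict mapping each selected cluster to its sampled units.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # stage 1
    sample = {}
    for name in chosen:
        members = clusters[name]  # enumerate only the selected clusters
        size = min(n_per_cluster, len(members))
        sample[name] = rng.sample(members, size)  # stage 2
    return sample

# Hypothetical survey: 20 regions, each containing 50 persons.
clusters = {
    f"region-{r:02d}": [f"person-{r:02d}-{p:02d}" for p in range(50)]
    for r in range(20)
}
sample = two_stage_cluster_sample(clusters, n_clusters=4, n_per_cluster=10, seed=5)
print(len(sample), [len(units) for units in sample.values()])  # 4 [10, 10, 10, 10]
```

Only the four selected regions need to be enumerated, which is the practical appeal of the method when no list of the whole population exists.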
With nonprobability sampling methods, the probability of being included in the sample is unknown. Examples include convenience samples and samples of volunteers. These types of samples are prone to bias and cannot be assumed to be representative of the target population. For example, people who volunteer are frequently different in many respects from those who do not. Hypothesis tests and statistical inference concerning the sampled units and the target population can only be applied with probability sampling methods. That is, there is no way to assess the validity of the samples obtained using nonprobability sampling strategies.
VALIDITY AND SOURCES OF ERROR
The distribution of values in any sample, no matter how it is selected, will differ from the distribution in the target population by chance alone. The larger the sample, the more likely it is that the sample reflects the characteristic of interest in the target population. However, there are sources of error not related to sampling that may bias comparisons between the sampled units and the target population. First, coverage error (selection bias) may arise when the sampling frame does not fully cover the target population. Second, nonresponse bias may occur when sampled individuals cannot be reached or will not provide the information requested; bias is present if respondents differ systematically from the individuals who do not respond. Finally, measurement error may arise when the measuring device cannot accurately determine the characteristics being measured.
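The effect of sample size on chance variation, noted at the start of this section, can be illustrated with a small simulation. The sketch below assumes a hypothetical population of 10,000 people, of whom 25 percent smoke, and repeatedly draws simple random samples of different sizes. The spread of the resulting estimates should shrink as the sample size grows. The simulation covers random sampling error only and says nothing about coverage, nonresponse, or measurement error.

```python
import random
import statistics

def spread_of_estimates(population, n, repeats, rng):
    """Standard deviation of the sample proportion over repeated samples of size n."""
    estimates = []
    for _ in range(repeats):
        drawn = rng.sample(population, n)
        estimates.append(sum(drawn) / n)  # proportion of smokers in this sample
    return statistics.pstdev(estimates)

# Hypothetical population: 1 = smoker, 0 = nonsmoker (true prevalence 0.25).
population = [1] * 2500 + [0] * 7500
rng = random.Random(0)

for n in (25, 100, 400):
    spread = spread_of_estimates(population, n, repeats=500, rng=rng)
    print(f"sample size {n:>3}: spread of estimates = {spread:.3f}")
```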