Chapter 13

Chapter 13: Big Data Basics: Describing Samples and Populations (p. 357-383)

Introduction
o Loss leader: an item sold at a loss in the hope of making the money back on additional purchases

Descriptive Statistics and Basic Inferences
o Metric: a summary number that allows analysts to compare characteristics of a sample with some population benchmark, characteristics of another sample, or some other critical value
o Inferential statistics: a summary representation of data from a sample that allows us to understand (i.e., infer from sample to population) an entire population
o Two applications of statistics exist: to describe characteristics of the population or sample (descriptive statistics), and to generalize from a sample to a population (inferential statistics)
o Sample statistics: summary measures about variables computed using only data taken from a sample
o Population parameters: summary characteristics describing the properties of a population
o Frequency distribution: a table or chart summarizing the number of times a particular value of a variable occurs
o Percentage distribution: a frequency distribution organized into a table (or graph) that summarizes the percentage values associated with particular values of a variable
o Probability: the long-run relative frequency with which an event will occur
o Proportion: the percentage of elements that meet some criterion for membership in a category
o Top-box score: the proportion of respondents who choose the most positive choice in a multiple-choice question, usually dealing with customer opinion
o Bottom-box score: the proportion of respondents who choose the least favorable response to some question about customer opinion
o Mean: a basic statistic that quantifies central tendency, computed as the arithmetic average; although widely relied upon, the mean can be misleading, particularly when extreme values or outliers are present
o Median: a measure of central tendency that is the midpoint; the value below which half the values in the distribution fall
o Mode: a measure of central tendency; the value that occurs most often
o Range: the simplest representation of dispersion; the distance between the smallest and the largest values of a frequency distribution
o Individual deviation scores: a method of calculating how far any observation is from the mean (the observation minus the mean)
o Variance: a metric of variability or dispersion; its square root is the standard deviation
o Standard deviation: the most popular indicator of spread or dispersion; a quantitative index of a distribution's spread, or variability, computed as the square root of the variance for a distribution

Distinguish Between Population, Sample, and Sampling Distributions
o Normal distribution: a symmetrical, mean-centered, bell-shaped distribution that describes the expected probability distribution of observations
o Standardized normal distribution: a purely theoretical probability distribution that reflects a specific normal curve for the standardized value, Z; the most useful distribution in inferential statistics
o Population distribution: a frequency distribution of the elements of a population
o Sample distribution: a frequency distribution of a sample
o Sampling distribution: a theoretical probability distribution of sample means for all possible samples of a certain size drawn from a particular population
o Standard error of the mean: the standard deviation of the sampling distribution

Central-Limit Theorem
o Central-limit theorem: the theory that, as sample size increases, the distribution of the means of randomly selected samples of size n approaches a normal distribution
o The distribution of averages quickly approaches normal as sample size increases
o Theoretical knowledge about sampling distributions helps solve two basic and very practical marketing analytics problems: estimating population parameters and determining sample size

Estimation of Parameters and Confidence Intervals
o Point estimate: an estimate of the population mean in the form of a single value, usually the sample mean
o Confidence interval estimate: a specified range of numbers within which a population mean is expected to lie; an estimate of the population mean based on the knowledge that it will be equal to the sample mean plus or minus a small sampling error
o Confidence level: a percentage or decimal value that tells how confident a researcher can be about being correct; it states the long-run percentage of confidence intervals that will include the true population mean

Sample Size
o All else equal, the larger the sample, the more accurate the research
o Random sampling error varies with sample size: increasing the sample size decreases the width of the confidence interval at a given confidence level
o Three factors are required to specify sample size: the variance, or heterogeneity, of the population; the magnitude of acceptable error; and the confidence level
o As heterogeneity increases, so must sample size
o Magnitude of error: the acceptable range of random sampling error (the confidence interval), which indicates how precise the estimate must be
o Sequential sampling: the application of results from one or more pilot studies prior to deciding on the sample size for a definitive study
o A general rule of thumb for estimating the value of the standard deviation is to expect it to be one sixth of the range
o In most cases, the size of the population does not have a major effect on the sample size; the variance of the population has a greater effect on sample size requirements than does the population size
o Sample size for a proportion requires the researcher to make a judgment about the confidence level and the maximum allowance for random sampling error

Assess the Potential for Nonresponse Bias
o Researchers provide an assessment of generalizability in their reports
o Nonresponse bias, in particular the bias caused when sample units provide no response, can significantly damage generalizability
o Non-responders must be considered routinely as a threat to external validity because there could be a systematic reason that members selected for inclusion from a sampling frame did not respond
o Auxiliary variables: those that the researcher should build into a survey that allow a comparison between sample units that do not respond and those that do respond
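
Worked sketch (not from the chapter): the confidence-interval and sample-size notes above can be tried out with a few lines of Python. The notes do not state the formulas, so this sketch assumes the standard large-sample ones: interval = sample mean plus or minus Z times the standard error, n = (Z * S / E)^2 for a mean, and n = Z^2 * p * (1 - p) / E^2 for a proportion. All function names are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist, mean, stdev


def z_for_confidence(confidence_level):
    """Two-sided critical Z value for a confidence level given as a decimal."""
    return NormalDist().inv_cdf(0.5 + confidence_level / 2)


def confidence_interval_mean(sample, confidence_level=0.95):
    """Return (lower, upper) bounds for the population mean.

    Assumption: a Z-based interval; with small samples a t value
    would usually be more appropriate.
    """
    n = len(sample)
    point_estimate = mean(sample)
    # Estimated standard error of the mean: sample std. deviation / sqrt(n)
    std_error = stdev(sample) / math.sqrt(n)
    margin = z_for_confidence(confidence_level) * std_error
    return point_estimate - margin, point_estimate + margin


def sample_size_for_mean(std_dev, max_error, confidence_level=0.95):
    """n = (Z * S / E)^2, rounded up to a whole respondent."""
    z = z_for_confidence(confidence_level)
    return math.ceil((z * std_dev / max_error) ** 2)


def sample_size_for_proportion(p, max_error, confidence_level=0.95):
    """n = Z^2 * p * (1 - p) / E^2, rounded up to a whole respondent."""
    z = z_for_confidence(confidence_level)
    return math.ceil(z ** 2 * p * (1 - p) / max_error ** 2)


def std_dev_from_range(smallest, largest):
    """Rule of thumb from the notes: standard deviation is about range / 6."""
    return (largest - smallest) / 6
```

With these assumptions, doubling the standard deviation roughly quadruples the required sample size, which matches the note that sample size must grow as heterogeneity increases. Population size does not appear in any of the formulas, in line with the note that it rarely has a major effect.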