ST 311 Topic 1 - Basic Terminology

Topic 1: Course Information, Data Collection & Summary Statistics

Course Goals
• Differentiate between "good" and "bad" statistical methodology
• Understand how to use statistics to quantify uncertainty
• Learn to use statistical reasoning to draw inferences
Focus:
• less on formulas and computations
• more on understanding statistical concepts

Data Collection - Big Idea
• How data are collected or generated is important if we want to make inference from it. In both experiments and surveys, randomization is necessary to ensure appropriate inference.
• Inference: the process of using sample information to make conclusions about some larger group of interest

Definitions
• Population: The entire group of items/individuals (units) that we want information about.
• Sample: The smaller group, the part of the population we actually examine in order to gather information. The notation we will use to represent the sample size (how many units are in the sample) is n.
• Census: Special case when every unit in the population is measured or surveyed.

More Definitions
• Parameter: A (numerical) summary of a variable for the entire population (typically unknown).
• Statistic: A (numerical) summary of a variable for a sample (used to estimate the parameter).
• Sampling frame: The list of units from which a sample is selected.

The Basic Paradigm
[Diagram with labels: Population, Sample, Inference, Parameters, Statistics]

Example
Average height of all undergrads at NCSU
• What is the population? All NCSU undergraduates
• What is the parameter? The average height of all NCSU undergraduates

Sampling - Motivation
Taking a census to measure the parameter directly is difficult
• Often too expensive or time consuming
Another option
• Take a sample
• Calculate the sample statistic
• Use the sample statistic to estimate the population parameter

Sampling - Methods Matter
• How you take your sample is important
• Bad sample = bogus results
• Randomization is key!

Example: Webberville
• Webberville is a small Midwestern town. It is considering a new project to install solar panels on the roofs of all buildings to reduce energy costs over time.
• The mayor wants to find the average number of solar panels for the buildings in the town.
• To do this, we'll take a random sample of 10 buildings.

Webberville Paradigm
[Figure: all 115 buildings in Webberville, with the 10 buildings selected. Labels: "Average solar panels for all 115 buildings in Webberville" and "Average solar panels for 10 buildings selected".]

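The Webberville paradigm can be sketched in a few lines of Python. This is only an illustration: the panel counts below are made up (the worksheet data are not reproduced here), and the names `population`, `parameter`, `sample`, and `statistic` are assumptions chosen to mirror the definitions above.

```python
import random
from statistics import mean

rng = random.Random(311)  # fixed seed so the run is repeatable

# Hypothetical population: one made-up panel count for each of the 115 buildings.
population = [rng.randint(0, 40) for _ in range(115)]

# Parameter: the average over all 115 buildings (unknown in practice).
parameter = mean(population)

# Statistic: the average over a simple random sample of n = 10 buildings.
sample = rng.sample(population, 10)
statistic = mean(sample)

print(f"parameter (all 115 buildings): {parameter:.2f}")
print(f"statistic (10 sampled buildings): {statistic:.2f}")
```

Rerunning with a different seed gives a different statistic while the parameter of a given population stays fixed, which is the point of the paradigm.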
Webberville Activity
• Split into groups of 3 & work together
• Each person has a copy of the worksheet
• Each person should take two samples
  1. Pick 10 houses that you think are "typical"
  2. Use a random number generator
• Follow the instructions
• Calculate averages & add to plots
• Work on parts 3-8, which we will discuss

Webberville Main Ideas
• We sample because it is often difficult or impossible to conduct a census.
• Personal judgments may make the sample biased.
• As a result, the sample statistics may be consistently higher/lower than the population parameter.
• Random sampling uses a chance mechanism. This avoids bias.

Sample Assessment
A college professor is interested in the average mathematics SAT score for all students at the university where he teaches. He surveys each of the students in the course Introduction to Engineering, and records their math score. From this sample, he calculates the average mathematics SAT score.
1. In this example the parameter of interest is
   a. The average mathematics SAT score for all students at the university.
   b. The average mathematics SAT score for the students he surveyed.
   c. The number of students he surveyed.
   d. The maximum mathematics SAT score you can achieve.

Sample Assessment
2. Is this sample likely to give a biased estimate of the parameter? Explain.
   Yes, because the sample is not representative of the population.

Bad Sampling Methods
• Voluntary response sample: Only those people who volunteer to participate are included in the sample.
  • Ex: Online polls
• Convenience sample: The most convenient (readily available) group is considered as the sample.
  • Ex: Asking people walking by in the Brickyard

Bad Sampling Methods
• These sampling methods are common because they are cheap & easy, but they are not good for much, and certainly not for inference.
• These samples are not representative of the larger population of interest.
• People who participate in volunteer samples tend to have stronger opinions than the general population.

Good Sampling Methods
• Simple Random Sample (SRS): Every different possible sample of the desired size has the same chance of being selected.
• Note: requires a sampling frame that includes everyone in the population and excludes everyone outside of the population

Good Sampling Methods
• Stratified random sample: The population is first divided into nonoverlapping groups (called strata) and a simple random sample is selected from each group. Within a stratum, every person has the same chance of being selected.
• Note: ALL groups are represented in the sample

Good Sampling Methods
• Cluster sample: The population is first divided into nonoverlapping groups (called clusters), a simple random sample of clusters is selected, and all individuals in the selected clusters are included in the sample. Every cluster has the same chance of being selected.
• Note: SOME groups are represented in the sample

Good Sampling Methods
• Systematic sample: The population is a list divided into consecutive segments. One individual is randomly selected from the first segment and the same position is selected from each of the remaining segments. (Select every kth unit from a random starting point.)

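A systematic sample as described above might be sketched as follows. The helper `systematic_sample` and the 100-unit `roster` are hypothetical names introduced only for this illustration, and the sketch assumes the list is at least as long as the requested sample size.

```python
import random

def systematic_sample(units, n, rng):
    """Pick a random position in the first segment, then take the same
    position in each of the remaining segments (every k-th unit)."""
    k = len(units) // n       # segment length; assumes len(units) >= n
    start = rng.randrange(k)  # random starting position in the first segment
    return [units[start + i * k] for i in range(n)]

rng = random.Random(0)
roster = list(range(1, 101))  # hypothetical list of 100 unit IDs
chosen = systematic_sample(roster, 10, rng)
print(chosen)
```

Only the starting point is random; once it is fixed, the rest of the sample is determined by the segment length k.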
Example
Average height of all undergrads at NCSU
We take a sample of 100 students, and find the average height is 66 inches.
• What is the sample? The 100 NCSU undergraduates actually selected
• What is the sample statistic? 66 inches

Ways to Select the Sample
Simple Random Sample (SRS):
• Get a list of all students from the registrar.
• Number them from 1 to 25000.
• Randomly select (using a random number generator) 100 numbers.

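The SRS steps above can be sketched with Python's standard `random` module. The seed and the variable names are assumptions for illustration; the numbers 25000 and 100 come from the slide.

```python
import random

rng = random.Random(2024)      # fixed seed so the draw is repeatable

student_ids = range(1, 25001)  # students numbered 1 to 25000
srs = rng.sample(student_ids, 100)  # 100 distinct numbers, drawn without replacement

print(sorted(srs)[:5])         # a few of the selected student numbers
```

Because `sample` draws without replacement, no student can be selected twice.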
Ways to Select the Sample
Cluster:
• Get a list of classes, and number each class.
• Randomly select a few of the classes.
• Survey everyone in each of the selected classes.

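A rough sketch of the cluster steps above, under loudly stated assumptions: the 40 classes of 30 students each are invented, each student is assumed to belong to exactly one class (so clusters do not overlap), and "a few" is taken to mean 4.

```python
import random

rng = random.Random(7)

# Hypothetical clusters: 40 numbered classes, 30 made-up student labels each.
classes = {
    c: [f"class{c}-student{s}" for s in range(1, 31)]
    for c in range(1, 41)
}

# Randomly select a few of the classes (here 4).
selected_classes = rng.sample(sorted(classes), 4)

# Survey everyone in each of the selected classes.
cluster_sample = [student for c in selected_classes for student in classes[c]]

print(selected_classes, len(cluster_sample))
```

The randomness applies to which classes are chosen, not to which students within a class are surveyed.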
Ways to Select the Sample
Stratified:
• Get a list of all students from the registrar.
• Divide the students into some natural grouping, such as year (freshman, sophomore, junior, or senior).
• Number students in each year.
• Randomly select 25 from each, so that we have a total of 100 students.

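The stratified steps above might look like this in Python. The `registrar` list is fabricated for illustration (years are assigned at random to 25000 made-up IDs); only the four strata and the 25-per-stratum draw come from the slide.

```python
import random

rng = random.Random(42)
years = ["freshman", "sophomore", "junior", "senior"]

# Hypothetical registrar list: (student number, year) pairs with made-up years.
registrar = [(sid, rng.choice(years)) for sid in range(1, 25001)]

# Divide the students into strata by year.
strata = {year: [sid for sid, y in registrar if y == year] for year in years}

# Take a simple random sample of 25 within each stratum.
stratified_sample = {year: rng.sample(ids, 25) for year, ids in strata.items()}

total = sum(len(ids) for ids in stratified_sample.values())
print(total)
```

Every year is guaranteed to appear in the sample, which a single SRS of 100 students does not promise.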
Which type of sample to take?
Factors to consider
• Time
• Money/resources
• What information is available to you
• What questions you ultimately want to investigate

Which Sampling Method to use?
a) Researchers want to test a new math curriculum for third graders. They only have the budget to try the curriculum in 10 elementary schools.
   Cluster

Which Sampling Method to use?
b) A baseball player's agent wants to know the average salary for each position in Major League Baseball by taking a sample of current players.
   Stratified

Bias
One of the primary reasons for using a good sampling method to collect units for observation is that it avoids bias in the data, particularly selection bias, in which the sample participants tend to differ systematically from the population of interest.

Types of Survey Bias
• Undercoverage: Tendency for a sample to differ from the corresponding population because the sampling frame excludes some parts of the population.
• Nonresponse bias: Tendency for a sample to differ from the corresponding population because a subset of the sample cannot be contacted or does not respond.

Types of Survey Bias
• Response bias: Tendency for a sample to differ from the corresponding population because participants respond differently from how they truly feel.

Example
[Comic strip: Hagar the Horrible]
• Which type of bias is pictured here? Response

Methods to Avoid Bias
Undercoverage:
• Choose a sampling frame that best represents the population
Nonresponse:
• Contact people multiple times/ways
• Offer incentives for participation
Response:
• Take care in question wording and ordering.
• Allow for anonymous responses.

Main Takeaway
• One of the most important skills you will learn in this class is how to determine if it is appropriate to make inference to a larger population from a sample based on a description of a study.
• The answer is that appropriate inference is only assured when a random sample is selected using one or more of the provided good sampling methods.

Example - Do you support legalization of marijuana?
What proportion of NC State students support the legalization of marijuana?
Random sample of 300 students
64% are in favor
• What is the population? All NC State students
• What is the parameter? The proportion of all NC State students who support legalization

Example - Do you support legalization of marijuana?
• What is the sample? The 300 students selected
• What is the statistic? 64%
• What is the sampling frame? Student list from the Registrar
• What type of sample was taken? SRS
• Can you make inference for all college students? No. The sample is only representative of NC State students.

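As a small illustration of how the statistic in this example arises, the sketch below recomputes the sample proportion. The count of 192 supporters is not stated on the slide; it is the value implied by 64% of 300, and the variable names are assumptions.

```python
n = 300            # sample size from the example
supporters = 192   # implied by 64% of 300; not stated on the slide

# One True/False response per sampled student.
responses = [True] * supporters + [False] * (n - supporters)

# Statistic: the sample proportion in favor.
p_hat = sum(responses) / len(responses)
print(f"{p_hat:.0%}")
```

The parameter (the proportion among all NC State students) remains unknown; `p_hat` is only an estimate of it.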
Main Takeaway
If the sample is not random, proceed with extreme caution!
• You may not be able to make any conclusions about the original population of interest.
• Instead, you have to think about what population you could make conclusions about based on the sample you took, that is, what population the sample is representative of.

Data Summaries
• Now that we know how to appropriately collect data in a way that represents our population well, how do we summarize it?
• Graphical and numerical summaries are important for telling the story of a dataset.
• Different types of summaries are appropriate for different types of data.

Types of Variables
• A categorical variable places a unit into one of several groups or categories.
• Examples include:
  • Major
  • Car color
  • Letter grades

Types of Variables
• A quantitative variable takes numeric values for which arithmetic operations such as adding and averaging make sense.
• Examples include:
  • Height
  • Age (in years)
  • Exam score (points)

Appropriate Graphical Summaries
• Graphs for summarizing categorical data display the number or proportion of items in each category. Good examples are pie charts & bar charts.
[Figure: NBA Player positions (2005), frequency by position. C: 66 (15.94%); F: 213 (51.45%); G: 135 (32.61%)]

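The percentages in the NBA positions chart can be reproduced from the counts. A minimal sketch, with the dictionary name `counts` chosen for illustration:

```python
# Counts by position, taken from the chart labels above.
counts = {"C": 66, "F": 213, "G": 135}

total = sum(counts.values())

# Percentage of players in each category, rounded to two decimals.
percentages = {pos: round(100 * n / total, 2) for pos, n in counts.items()}

print(total)
print(percentages)
```

A pie chart displays these proportions; a bar chart can display either the counts or the proportions.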
Appropriate Graphical Summaries
• Graphs for summarizing quantitative data display the distribution of the observed values. Good examples are histograms, dot plots, & box plots.
[Figure: NBA Player heights in inches (2005); frequency against height]

Distributions
• Distribution of a quantitative variable: the overall pattern of how often the different possible values occur. A way to describe the data as a whole.
• We usually draw a distribution as a curve over a number line, where the height of the curve represents the frequency (or number) of observations at a specific value.

Shape
• Symmetry: approximately symmetric, right or left skewed
• Left/right skewness refers to where the tail of the distribution is located
[Figure: example curves labeled Left-Skewed, Symmetric, Right-Skewed]

Shape
• Modality: uni-, bi-, or multi-modal.
• Refers to the number of "humps" the distribution has
[Figure: example curves labeled Uni-modal, Bi-modal, Multi-modal]

Histograms
• Displays the number of people/items in a particular range of the data.
• Can draw a smooth curve representation of the histogram to represent the distribution instead!
[Figure: histogram with a Frequency axis]

Histograms
Reading a histogram:
• Approximately 240 people have a shoe size greater than or equal to a size 8 but less than a size 12.
• Approximately 385 people are represented in this histogram.
[Figure: histogram of shoe size with a Frequency axis]

Dotplot
• One dot for each value in the data
• Better for smaller data sets
• Exactly 4 people have a shoe size of 8.5.
• Exactly _ people have a shoe size of 12 or larger.
[Figure: dot plot of shoe size]

Dotplot e One