Week 2
Ning Xu
Aug/16/2021

Week 2 material is a review of the Week 1 content. Last week, you learned the following terms:
• population and sample
• variable categorization
  - categorical vs quantitative
  - explanatory vs response
• sampling bias

1 Topic: population and sample

You should know
• what a population and a sample are,
• why you need a sample and a population in statistics.

Statistics:
• Statistics: to learn some knowledge about a population or to summarize patterns in a population.
• For example, suppose you want to know the average body height of US citizens. With that knowledge,
  - you can use the knowledge or the pattern to predict the height of any US citizen
  - you can analyse the relation between the body height of US citizens and other factors (smoking, french fries)
• In this semester, you will focus on these two tasks.

Sample and Population
• To achieve the two tasks above, you need data.
• You want to know the average body height of US citizens (the pattern or the knowledge you want to know in statistics).
• To find the true/correct/accurate average height, one idea is to interview each and every person in the US (the population). Population: where you find your data and where your data is generated.
• But the issue is that this is usually impossible.
  - it may take too much time or money
  - in many applications, accessing the full population is impossible (e.g., you want to summarize the knowledge about the entire history of human war)
• We have to go another way: you collect a subset of the population (a sample, in this case 100 US citizens), compute their average height (the sample average) and use this value approximately as "the average height of everyone in the US".
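The idea of using a sample average in place of the population average can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "population" below is synthetic (100,000 made-up heights in feet, drawn from a normal distribution with arbitrary parameters), not real US data, and the variable names are hypothetical.

```python
import random
import statistics

# Fixed seed so the sketch is reproducible.
rng = random.Random(0)

# ASSUMPTION: a synthetic population of 100,000 heights in feet.
# The mean (5.6) and spread (0.3) are made up for illustration.
population = [rng.gauss(5.6, 0.3) for _ in range(100_000)]

# The population average: what we want to know, but in practice cannot compute.
population_mean = statistics.fmean(population)

# What we can do instead: collect a sample of 100 people at random
# and compute the sample average.
sample = rng.sample(population, 100)
sample_mean = statistics.fmean(sample)

print(f"population average: {population_mean:.2f} ft")
print(f"sample average:     {sample_mean:.2f} ft")
```

With a sample drawn at random from the whole population, the two printed values should be close to each other, which is why the sample average can stand in for the population average.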
How statistics works (statistical inference):
• collect a sample;
• learn the sample knowledge/pattern;
• use this sample knowledge/pattern approximately as the population knowledge/pattern.

In the body height example,
• the sample knowledge/pattern is the average height in your sample (the sample average);
• the population knowledge/pattern is the average height of each and every citizen in the US.

2 Topic: bias

Bias is the mistake you make when you do statistics.
• How bias arises in statistics:
  - when collecting a sample (sampling bias);
    * the US body height example: to learn the average height of each and every citizen in the US (the population pattern), you need to collect data from each state.
    * however, if you collect data only from New York, you are committing sampling bias (you collect your data in the wrong way).
    * you aim to summarize the knowledge for the US, but what you actually get is only for NY.
    * sampling bias is one of the most important concepts in this unit.
  - when learning the sample knowledge/pattern (estimation bias, second-semester stat course, EMCT1020);
  - when using this sample knowledge/pattern approximately as the population knowledge/pattern (transfer bias, 3rd-yr stat course);
    * you collect a sample of humans, which tells you the average number of hands is 2
    * you use this sample pattern to approximate the population pattern for birds: "based on my sample, I believe that birds on average have two hands."
    * you use a sample pattern to approximate the wrong population pattern.
• In the body height example,
  - the sample knowledge/pattern is the sample mean (6.2 feet);
  - the population knowledge/pattern is the population mean (6 feet).
  - if you collect your sample correctly and your sample is large enough, the sample mean is almost the same as the population mean.
  - in this case, you don't have to worry about sampling bias.
• Sampling bias may be due to the following reasons:
  - you collect the wrong sample (I want to know the average height of US citizens; instead, I collect my data only from Hawaii)
  - your sample size is too small (I want to know the average height of US citizens; I collect only 1 observation from all US citizens)
• This week we focus on the first one (you collect the wrong sample); a few weeks later we focus on the second one (your sample size is too small).
• We will assume that all the samples in our tutorial questions are large enough (say, 1 billion observations).
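The two sources of sampling bias listed above can be illustrated with a small simulation. This is a rough sketch, not real data: the population is synthetic, the two regions and their average heights are made-up assumptions, and `region_a` simply plays the role of "the one state you collected from".

```python
import random
import statistics

# Fixed seed so the sketch is reproducible.
rng = random.Random(1)

# ASSUMPTION: a synthetic population made of two regions whose
# average heights differ. All numbers are made up for illustration.
region_a = [rng.gauss(5.9, 0.3) for _ in range(20_000)]  # "one state only"
region_b = [rng.gauss(5.5, 0.3) for _ in range(80_000)]  # "everywhere else"
population = region_a + region_b
population_mean = statistics.fmean(population)

# Wrong sample: large, but collected from one region only.
wrong_sample_mean = statistics.fmean(rng.sample(region_a, 5_000))

# Correct sample: same size, collected from the whole population.
correct_sample_mean = statistics.fmean(rng.sample(population, 5_000))

# Too-small sample: collected from the whole population, but 1 observation.
tiny_sample_mean = statistics.fmean(rng.sample(population, 1))

print(f"population mean:          {population_mean:.2f} ft")
print(f"wrong sample (region A):  {wrong_sample_mean:.2f} ft")
print(f"correct sample:           {correct_sample_mean:.2f} ft")
print(f"tiny sample (1 person):   {tiny_sample_mean:.2f} ft")
```

In this sketch the wrong sample stays far from the population mean no matter how large it is, because it describes region A rather than the whole population. The tiny sample is drawn from the right place, but a single observation can land anywhere in the range of heights.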
3 Topic: variable categorization

A variable is how you describe the information in your data.
• Using the US body height example:
  - your data is 1 billion observations of US citizens
  - your data contains all the information you need about US demographics
  - now, to extract a certain type of information (the body height), you need to summarize your information into a variable.

There are different ways to extract the same information.
• Example: data on my dining
  - one way to describe the information about my dining: whether you have your breakfast (a Yes or No question).
  - another way: how many dishes you have during your breakfast (a numeric question).
• Categorical/discrete: whether you have your breakfast.
  - feature 1: a finite number of possible outcomes: Yes or No
    * if you use 1 for Yes and 0 for No, you have a 0-1 variable (aka binary)
  - feature 2 (the unique one): there is no relation/order among the different possible values/outcomes of this variable.
    * "Yes and No" or "0 and 1" just represent two statuses
    * there is no order or relation between these two statuses. By order or relation I mean
      · Yes is better than No
      · 1 > 0
• Quantitative: how many dishes you have during your breakfast.
  - you sometimes find a quantitative variable with an infinite number of possible outcomes: any real number in [0, +∞)
  - however, some quantitative variables have only a finite number of possible outcomes.
  - the unique feature of quantitative variables is that there is an order among the different possible values/outcomes.
  - in this context, 1 > 0 since 1 means 1 dish, and 1 dish is more than 0 dishes.
• Example: during breakfast, I offer you only one dish. You can choose to take it or not to take it.
  - variable "whether you have your breakfast": 1 (Yes) or 0 (No)
    * categorical: 1 and 0 just represent two statuses
    * no relation or order between them
    * you can use 0 to represent Yes and 1 to represent No as well, which does not change anything
  - variable "how many dishes you have during your breakfast": 1 or 0
    * numerical even though it has only two outcomes
    * 0 represents no dish and 1 represents one dish. There is an order between 0 and 1 (the number of dishes)
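The difference between the two 0-1 variables in the one-dish example can be made concrete in Python. This is a minimal sketch on made-up data for five breakfasts; the names `coding_a`, `coding_b` and `category_counts` are hypothetical helpers introduced only for this illustration.

```python
from collections import Counter

# ASSUMPTION: made-up records of five breakfasts.
had_breakfast = ["Yes", "No", "Yes", "Yes", "No"]  # categorical variable
dishes_eaten = [1, 0, 1, 1, 0]                     # quantitative variable

# Two possible 0-1 codings of the categorical variable.
coding_a = {"Yes": 1, "No": 0}
coding_b = {"Yes": 0, "No": 1}  # labels swapped


def category_counts(values, coding):
    """Encode the labels as numbers, decode them again, and count each label."""
    decode = {code: label for label, code in coding.items()}
    codes = [coding[v] for v in values]
    return Counter(decode[c] for c in codes)


# Categorical: swapping the codes changes nothing about the information.
counts_a = category_counts(had_breakfast, coding_a)
counts_b = category_counts(had_breakfast, coding_b)
print(counts_a == counts_b)

# Quantitative: the numbers themselves carry meaning, so order and
# arithmetic make sense (1 dish is more than 0 dishes).
total_dishes = sum(dishes_eaten)
print(total_dishes)
```

For the categorical variable the numbers are only labels, so either coding gives the same counts of Yes and No. For the quantitative variable the values cannot be swapped, because the total number of dishes depends on 1 meaning one dish and 0 meaning none.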