# Lab 10: ANLY 500

ANOVA

Shiv Dawar

2023-06-06

Abstract: How do university training and subsequent practical experience affect expertise in data science? To answer this question, we developed methods to assess data science knowledge and the competence to formulate answers, construct code to solve problems, and create reports of outcomes. In this cross-sectional study, undergraduate students, trainees in a certified postgraduate data science curriculum, and data scientists with more than 10 years of experience were tested (100 in total: 20 each of novice, intermediate, and advanced university students, postgraduate trainees, and experienced data scientists). We discuss the results against the background of expertise research and the training of data scientists. Important factors for the continuing professional development of data scientists are proposed.

Dataset:

- Participant type: novice students, intermediate students, advanced university students, postgraduate trainees, and experienced data scientists
- Competence: an average score of data science knowledge and competence based on a knowledge test and short case studies.

APA write ups should include means, standard deviation/error (or a figure), t-values, p-values, effect size, and a brief description of what happened in plain English.

    library(ez)
    library(MOTE)

    ## Warning: package 'MOTE' was built under R version 3.5.3

    library(ggplot2)
    library(pwr)

    ##import the data
    master = read.csv("10_data.csv")

    ##reorder the factors
    master$participant_type = factor(master$participant_type,
                                     levels = c("novice", "intermediate", "advanced",
                                                "postgraduate", "experienced"))
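As a minimal sketch of what the reordering step does, the example below uses a small made-up vector (not the lab data) to show that `factor()` with an explicit `levels` argument sets the level order, and that a misspelled level name quietly turns the matching values into `NA`. The vector `toy` and the deliberate misspelling `"postgrad"` are assumptions for illustration only.

```r
# Toy values standing in for master$participant_type (hypothetical data)
toy = c("advanced", "novice", "experienced", "intermediate", "postgraduate", "novice")

# Without a levels argument, the levels are sorted alphabetically
default_levels = levels(factor(toy))
default_levels

# Explicit ordering, in the same style as the lab code
ordered_levels = c("novice", "intermediate", "advanced", "postgraduate", "experienced")
toy_ordered = factor(toy, levels = ordered_levels)
levels(toy_ordered)

# A misspelled level ("postgrad") matches nothing, so those values become NA
bad_levels = c("novice", "intermediate", "advanced", "postgrad", "experienced")
toy_bad = factor(toy, levels = bad_levels)
n_missing = sum(is.na(toy_bad))
n_missing
```

Checking `sum(is.na(...))` right after a reordering step is a cheap way to catch a spelling mismatch before it removes a whole group from the analysis.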
Data screening:

Assume the data is accurate and that there is no missing data.

Outliers:

a. Examine the dataset for outliers using z-scores with a criterion of 3.00 as p < .001.
b. Why do we have to use z-scores? BECAUSE WE ONLY HAVE ONE DV TO WORK WITH, AND MAHALANOBIS DISTANCE NEEDS AT LEAST TWO COLUMNS
c. How many outliers did you have? NONE
d. Exclude all outliers.

    ##outliers
    zscore = scale(master$competence)
    summary(abs(zscore) < 3)

    ##     V1
    ##  Mode:logical
    ##  TRUE:100

    noout = subset(master, abs(zscore) < 3)

Assumptions

Normality:

a. Include a picture that you would use to assess multivariate normality.
b. Do you think you've met the assumption for normality? MOSTLY OK, A BIT POSITIVELY SKEWED

    ##assumptions set up
    random = rchisq(nrow(noout), 7)
    fake = lm(random ~ ., data = noout)
    fitted = scale(fake$fitted.values)
    standardized = rstudent(fake)

    ##normality
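The plotting call that followed the `##normality` comment did not survive in this copy. As a hedged sketch only, the block below rebuilds the screening steps on simulated scores (the data frame `sim` and its values are made up) and draws a histogram of the studentized residuals, which is one common picture for judging normality. It is an assumption that the lab used the same call.

```r
set.seed(500)  # reproducible simulated data, not the lab data

# Simulated stand-in for the lab file: five groups of 20 competence scores
sim = data.frame(
  participant_type = factor(rep(c("novice", "intermediate", "advanced",
                                  "postgraduate", "experienced"), each = 20)),
  competence = rnorm(100, mean = 45, sd = 10)
)

# z-score screening: keep rows within 3 standard deviations of the mean
zscore = as.vector(scale(sim$competence))
noout = subset(sim, abs(zscore) < 3)

# Fake regression: a random chi-square outcome regressed on every column,
# so the residuals reflect the screened data rather than a real model
random = rchisq(nrow(noout), 7)
fake = lm(random ~ ., data = noout)
standardized = rstudent(fake)

# Histogram of studentized residuals; a roughly symmetric shape centred
# near zero, with most values between -2 and 2, suggests normality is tenable
hist(standardized, breaks = 15)
```

A long right tail in this histogram would correspond to the "a bit positively skewed" judgement given in the answer above.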
Linearity:

a. Include a picture that you would use to assess linearity.
b. Do you think you've met the assumption for linearity? YEAH, LOOKS FINE, DOTS ARE ON THE LINE

    ##linearity
    {qqnorm(standardized)
    abline(0, 1)}
Homogeneity/Homoscedasticity:

a. Include a picture that you would use to assess homogeneity and homoscedasticity.
b. Include the output from Levene's test.
c. Do you think you've met the assumption for homogeneity? (Talk about both components here). THE DOT SCATTERPLOT LOOKS OK, BUT LEVENE'S TEST INDICATES WE ARE IN TROUBLE! IT IS SIGNIFICANT, P < .001
d. Do you think you've met the assumption for homoscedasticity? YES, APPEARS OK

    ##homog and s
    {plot(fitted, standardized)
    abline(0, 0)
    abline(v = 0)}
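Levene's test asks whether the groups differ in spread. The block below is a minimal sketch of the idea, assuming the median-centred (Brown-Forsythe) variant and using simulated data rather than the lab file: take each score's absolute distance from its own group median, then run a one-way ANOVA on those distances. The data frame `sim` and the helper `levene_by_hand` are hypothetical names, and the numbers will not reproduce the lab's output.

```r
set.seed(10)  # simulated data, not the lab data

# Three groups with deliberately unequal spread
sim = data.frame(
  group = factor(rep(c("a", "b", "c"), each = 30)),
  score = c(rnorm(30, 50, 2), rnorm(30, 50, 6), rnorm(30, 50, 12))
)

# Hypothetical helper: Levene-type test as an ANOVA on absolute deviations
levene_by_hand = function(y, g) {
  dev = abs(y - ave(y, g, FUN = median))  # distance from own group median
  tab = anova(lm(dev ~ g))                # one-way ANOVA on those distances
  c(F = tab[["F value"]][1], p = tab[["Pr(>F)"]][1],
    DFn = tab[["Df"]][1], DFd = tab[["Df"]][2])
}

result = levene_by_hand(sim$score, sim$group)
result
```

A small p-value here points to unequal variances, which is the situation where a test that does not assume equal variances, such as `oneway.test()`, is a reasonable follow-up.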
Hypothesis Testing:

Run the ANOVA test.

    ##anova
    noout$partno = 1:nrow(noout)
    options(scipen = 999)
    ezANOVA(data = noout,
            dv = competence,
            between = participant_type,
            wid = partno,
            type = 3,
            detailed = T)

    ## Warning: Converting "partno" to factor for ANOVA.

    ## Coefficient covariances computed by hccm()

    ## $ANOVA
    ##             Effect DFn DFd      SSn      SSd         F
    ## 1      (Intercept)   1  95 203761.6 3397.123 5698.1596
    ## 2 participant_type   4  95  24073.4 3397.123  168.3022
    ##   p
    ## 1 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000131473
    ## 2 0.0000000000000000000000000000000000000000032454881097111318646875055549116950714960694313049
    ##   p<.05       ges
    ## 1     * 0.9836013
    ## 2     * 0.8763357
    ##
    ## $Levene's Test for Homogeneity of Variance
    ##   DFn DFd      SSn      SSd        F             p p<.05
    ## 1   4  95 365.6359 1216.221 7.140029 0.00004529845     *

a. Include the output from the ANOVA test. SO HERE THEY SHOULD DO THE ONEWAY TEST BECAUSE LEVENE'S WAS SIGNIFICANT
b. Was the omnibus ANOVA test significant? YES, P < .05

    oneway.test(competence ~ participant_type, data = noout)

    ##
    ##  One-way analysis of means (not assuming equal variances)
    ##
    ## data:  competence and participant_type
    ## F = 507.48, num df = 4.000, denom df = 44.375, p-value <
    ## 0.00000000000000022

Calculate the following effect sizes:

a. $\eta^2$ = .8763357 PULLED FROM GES
b. $\omega^2$ = .8699983

    eta = .8763357 ##fill in the number here use for power below
    eta

    ## [1] 0.8763357

    ##formula = (SSM - dfM*MSR) / (SStotal + MSR)
    MSR = 3397.123 / 95
    w2 = (24073 - 4 * MSR) / (24073 + 3397 + MSR)
    w2

    ## [1] 0.8699983

Given the $\eta^2$ effect size, how many participants would you have needed to find a significant effect? TEN, 5 * 2 = 10

If you get an error: "Error in uniroot(function(n) eval(p.body) - power, c(2 + 0.0000000001, : f() values at end points not of opposite sign":

- This message implies that the effect size is so large that the estimation of sample size has bottomed out. You should assume sample size required n = 2 *per group*. Mathematically, ANOVA has to have two people per group, although that's a bad idea for sample size planning because of the assumptions of parametric tests.
- Leave in your code, but comment it out so the document will knit.

    feta = sqrt(eta / (1 - eta))
    ##k is the number of groups
    #pwr.anova.test(k = 5, n = NULL, f = feta,
    #               sig.level = .05, power = .80)

Run a post hoc independent t-test with no correction and a Bonferroni correction. Remember, for a real analysis, you would only run one type of post hoc. This question should show you how each post hoc corrects for type 1 error by changing the p-values.

    ##equal variances
    pairwise.t.test(noout$competence,
                    noout$participant_type,
                    p.adjust.method = "none",
                    paired = F,
                    var.equal = T)

    ##
    ##  Pairwise comparisons using t tests with pooled SD
    ##
    ## data:  noout$competence and noout$participant_type
    ##
    ##              novice               intermediate         advanced
    ## intermediate < 0.0000000000000002 -                    -
    ## advanced     < 0.0000000000000002 < 0.0000000000000002 -
    ## postgraduate < 0.0000000000000002 < 0.0000000000000002 0.1776
    ## experienced  < 0.0000000000000002 0.0044               0.000000000016201
    ##              postgraduate
    ## intermediate -
    ## advanced     -
    ## postgraduate -
    ## experienced  0.000000000000022
    ##
    ## P value adjustment method: none

    pairwise.t.test(noout$competence,
                    noout$participant_type,
                    p.adjust.method = "bonferroni",
                    paired = F,
                    var.equal = T)

    ##
    ##  Pairwise comparisons using t tests with pooled SD
    ##
    ## data:  noout$competence and noout$participant_type
    ##
    ##              novice               intermediate         advanced
    ## intermediate < 0.0000000000000002 -                    -
    ## advanced     < 0.0000000000000002 < 0.0000000000000002 -
    ## postgraduate < 0.0000000000000002 < 0.0000000000000002 1.000
    ## experienced  < 0.0000000000000002 0.044                0.00000000016201
    ##              postgraduate
    ## intermediate -
    ## advanced     -
    ## postgraduate -
    ## experienced  0.00000000000022
    ##
    ## P value adjustment method: bonferroni

    ##unequal variances
    post1 = pairwise.t.test(noout$competence,
                            noout$participant_type,
                            p.adjust.method = "none",
                            paired = F,
                            var.equal = F)
    post2 = pairwise.t.test(noout$competence,
                            noout$participant_type,
                            p.adjust.method = "bonferroni",
                            paired = F,
                            var.equal = F)
    post1

    ##
    ##  Pairwise comparisons using t tests with pooled SD
    ##
    ## data:  noout$competence and noout$participant_type
    ##
    ##              novice               intermediate         advanced
    ## intermediate < 0.0000000000000002 -                    -
    ## advanced     < 0.0000000000000002 < 0.0000000000000002 -
    ## postgraduate < 0.0000000000000002 < 0.0000000000000002 0.1776
    ## experienced  < 0.0000000000000002 0.0044               0.000000000016201
    ##              postgraduate
    ## intermediate -
    ## advanced     -
    ## postgraduate -
    ## experienced  0.000000000000022
    ##
    ## P value adjustment method: none

    post2

    ##
    ##  Pairwise comparisons using t tests with pooled SD
    ##
    ## data:  noout$competence and noout$participant_type
    ##
    ##              novice               intermediate         advanced
    ## intermediate < 0.0000000000000002 -                    -
    ## advanced     < 0.0000000000000002 < 0.0000000000000002 -
    ## postgraduate < 0.0000000000000002 < 0.0000000000000002 1.000
    ## experienced  < 0.0000000000000002 0.044                0.00000000016201
    ##              postgraduate
    ## intermediate -
    ## advanced     -
    ## postgraduate -
    ## experienced  0.00000000000022
    ##
    ## P value adjustment method: bonferroni

Include the effect sizes for only Advanced Students vs Post Graduate Trainees and Intermediate students versus Experienced Data Scientists. You are only doing a couple of these to save time.
    M = with(noout, tapply(competence, participant_type, mean))
    stdev = with(noout, tapply(competence, participant_type, sd))
    N = with(noout, tapply(competence, participant_type, length))

    ##advanced versus post is 3 and 4
    effect1 = d.ind.t(m1 = M[3], m2 = M[4],
                      sd1 = stdev[3], sd2 = stdev[4],
                      n1 = N[3], n2 = N[4], a = .05)
    effect1$d

    ##   advanced
    ## -0.5753897

    ##intermediate versus experienced is 2 and 5
    effect2 = d.ind.t(m1 = M[2], m2 = M[5],
                      sd1 = stdev[2], sd2 = stdev[5],
                      n1 = N[2], n2 = N[5], a = .05)
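Two details of the post hoc step can be checked with a small sketch on simulated data (the data frame `sim` is made up). First, a Bonferroni correction multiplies each uncorrected p-value by the number of comparisons, here 10 for five groups, and caps the result at 1; that matches the change from 0.0044 to 0.044 in the output above. Second, `pairwise.t.test()` pools the standard deviation by default (`pool.sd = !paired`), and in that case extra arguments such as `var.equal` are not passed on to `t.test()`. That is consistent with the "unequal variances" output still reading "pooled SD". Setting `pool.sd = FALSE` is one way to request separate-variance tests.

```r
set.seed(42)  # simulated data, not the lab data

sim = data.frame(
  group = factor(rep(c("g1", "g2", "g3", "g4", "g5"), each = 20)),
  score = c(rnorm(20, 20, 3), rnorm(20, 40, 5), rnorm(20, 55, 6),
            rnorm(20, 57, 8), rnorm(20, 48, 10))
)

# Number of pairwise comparisons among k groups: k * (k - 1) / 2
k = nlevels(sim$group)
n_comparisons = choose(k, 2)

# Bonferroni by hand versus p.adjust(), using the two uncorrected
# p-values reported in the lab's post hoc table
p_raw = c(0.177621639852061, 0.00442781018442886)
p_by_hand = pmin(1, p_raw * n_comparisons)
p_builtin = p.adjust(p_raw, method = "bonferroni", n = n_comparisons)

# Pooled-SD tests (the default) versus separate-variance (Welch) tests
pooled = pairwise.t.test(sim$score, sim$group, p.adjust.method = "bonferroni")
welch = pairwise.t.test(sim$score, sim$group, p.adjust.method = "bonferroni",
                        pool.sd = FALSE)
pooled$method
welch$method
```

With a significant Levene's test, the `pool.sd = FALSE` version is arguably the closer match to the Welch-corrected omnibus test, although its p-values would differ from the ones reported in this lab.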
Create a table of the post hoc and effect size values:

    library(knitr) ##provides kable()
    tableprint = matrix(NA, nrow = 3, ncol = 3)

    ##row 1
    ##fill in where it says NA with the values for the right comparison
    ##column 2 = Advanced Students vs Post Graduate Trainees
    ##column 3 = Intermediate students versus Experienced Data Scientists.
    tableprint[1, ] = c("No correction p",
                        post1$p.value[11],
                        post1$p.value[8])

    ##row 2
    tableprint[2, ] = c("Bonferroni p",
                        post2$p.value[11],
                        post2$p.value[8])

    ##row 3
    tableprint[3, ] = c("d value",
                        effect1$d,
                        effect2$d)

    #don't change this
    kable(tableprint,
          digits = 3,
          col.names = c("Type of Post Hoc",
                        "Advanced Students vs Post Graduate Trainees",
                        "Intermediate students versus Experienced Data Scientists"))

| Type of Post Hoc | Advanced Students vs Post Graduate Trainees | Intermediate students versus Experienced Data Scientists |
|:--|:--|:--|
| No correction p | 0.177621639852061 | 0.00442781018442886 |
| Bonferroni p | 1 | 0.0442781018442886 |
| d value | -0.575389669763759 | -0.696064088502951 |

Run a trend analysis.

a. Is there a significant trend? YES
b. Which type? CUBIC (THE HIGHEST-ORDER SIGNIFICANT TREND; THE LINEAR AND QUADRATIC TRENDS ARE ALSO SIGNIFICANT)

    ##trend analysis
    k = 5
    noout$part = noout$participant_type
    contrasts(noout$part) = contr.poly(k)
    output2 = aov(competence ~ part, data = noout)
    summary.lm(output2)

    ##
    ## Call:
    ## aov(formula = competence ~ part, data = noout)
    ##
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max
    ## -12.486  -3.769  -0.201   3.018  20.628
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value             Pr(>|t|)
    ## (Intercept)   45.140      0.598  75.486 < 0.0000000000000002 ***
    ## part.L        23.546      1.337  17.609 < 0.0000000000000002 ***
    ## part.Q       -24.687      1.337 -18.463 < 0.0000000000000002 ***
    ## part.C        -6.055      1.337  -4.528            0.0000172 ***
    ## part^4         1.765      1.337   1.320                 0.19
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 5.98 on 95 degrees of freedom
    ## Multiple R-squared:  0.8763, Adjusted R-squared:  0.8711
    ## F-statistic: 168.3 on 4 and 95 DF,  p-value: < 0.00000000000000022

Make a bar chart of the results from this study:

a. X axis labels and group labels
b. Y axis label
c. Y axis length: the scale runs 0-100. You can add coord_cartesian(ylim = c(0,100)) to your graph to control the y axis length.
d. Error bars
e. Ordering of groups: Use the factor command to put the groups into the appropriate order. To reorder the levels, call factor with only the levels argument, listing the levels in the order you want. Remember, the levels have to be spelled correctly, or the values for any misspelled level will be turned into missing values.

    ##bar chart
    library(ggplot2)

    cleanup = theme(panel.grid.major = element_blank(),
                    panel.grid.minor = element_blank(),
                    panel.background = element_blank(),
                    axis.line.x = element_line(color = "black"),
                    axis.line.y = element_line(color = "black"),
                    legend.key = element_rect(fill = "white"),
                    text = element_text(size = 15))

    bargraph = ggplot(noout, aes(participant_type, competence))
    bargraph +
      cleanup +
      stat_summary(fun = mean, ##use fun.y instead on ggplot2 older than 3.3.0
                   geom = "bar",
                   fill = "white",
                   color = "black") +
      stat_summary(fun.data = mean_cl_normal,
                   geom = "errorbar",
                   position = "dodge",
                   width = .2) +
      xlab("Level of Study of Participant") +
      ylab("Average Competence Rating") +
      coord_cartesian(ylim = c(0, 100)) +
      scale_x_discrete(labels = c("Novice", "Intermediate", "Advanced",
                                  "Post-Graduate", "Experienced"))
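The error bars in the chart come from `mean_cl_normal`, which relies on the Hmisc package being installed. As a hedged sketch of what those bars represent, the block below computes a 95% t-based confidence interval for one group's mean by hand on simulated scores. The vector `scores` and the helper `ci_normal` are made-up names, and the sketch assumes the usual formula of mean plus or minus the t critical value times the standard error.

```r
set.seed(7)  # simulated scores, not the lab data
scores = rnorm(20, mean = 45, sd = 6)

# Hypothetical helper: t-based confidence interval for a single mean
ci_normal = function(x, conf = .95) {
  n = length(x)
  se = sd(x) / sqrt(n)                       # standard error of the mean
  crit = qt(1 - (1 - conf) / 2, df = n - 1)  # two-sided t critical value
  c(y = mean(x), ymin = mean(x) - crit * se, ymax = mean(x) + crit * se)
}

ci = ci_normal(scores)
ci

# Cross-check against the interval reported by a one-sample t-test
check = as.numeric(t.test(scores)$conf.int)
check
```

Applying a helper like this to each participant type (for example with `tapply`) would give the bar heights and error bar limits without going through ggplot2.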
Write up a results section outlining the results from this study. Use two decimal places for statistics (except when relevant for p-values). Be sure to include the following:

a. A reference to the figure you created (the bar chart). This reference allows you to not have to list every single mean and standard deviation.
b. Very brief description of study and variables.
c. The omnibus test value and if it was significant.
d. The two post hoc comparisons listed above describing what happened in the study and their relevant statistics. You would only list the post hoc correction values.
e. Effect sizes for all statistics.

Five different types of data scientists (novice, intermediate, and advanced university students, postgraduate trainees, and experienced data scientists) were examined for their competence levels on data science knowledge. A between-subjects, one-way ANOVA was used to examine the differences in competence across these five types of data scientist, and significant differences in competence were found, F(4, 95) = 168.30, p < .001, $\eta^2$ = .88 (using a Welch correction for homogeneity problems, F(4, 44.38) = 507.48, p < .001, $\eta^2$ = .88). Using an independent t-test with a Bonferroni correction, advanced students were not significantly different from postgraduate trainees, p = 1.00, $d_s$ = -0.58. However, intermediate students were significantly lower than experienced data scientists in their competence levels, p = .04, $d_s$ = -0.70. Means and confidence intervals are included in Figure 1 (the bar chart).
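The omnibus effect sizes quoted in the write-up can be re-derived from the sums of squares printed in the ANOVA table. The sketch below is only a check on that arithmetic. It uses the rounded values from the output, so the last digits may differ slightly from the lab's, and the variable names are chosen here for illustration. Cohen's f is computed the same way as `feta` in the power step.

```r
# Values copied from the ezANOVA output row for participant_type
ss_model = 24073.4   # SSn
ss_resid = 3397.123  # SSd
df_model = 4         # DFn
df_resid = 95        # DFd

# Eta squared: share of total variability attributed to participant type
eta2 = ss_model / (ss_model + ss_resid)

# Omega squared: (SSM - dfM * MSR) / (SStotal + MSR)
ms_resid = ss_resid / df_resid
omega2 = (ss_model - df_model * ms_resid) / (ss_model + ss_resid + ms_resid)

# Cohen's f from eta squared, the effect size that pwr.anova.test() expects
f_effect = sqrt(eta2 / (1 - eta2))

round(c(eta2 = eta2, omega2 = omega2, f = f_effect), 4)
```

Omega squared comes out slightly smaller than eta squared because it subtracts the variability expected by chance, which makes it the less biased of the two.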