Lab 7

.Rmd
School
University of California, Berkeley **We aren't endorsed by this school
Course
STAT 154
Subject
Statistics
Date
Aug 28, 2023
Pages
6
Uploaded by ChiefCrownGrouse38 on coursehero.com
This is a preview
Want to read all 6 pages? Go Premium today.
Already Premium? Sign in here
--- title: An R Markdown document converted from "/Users/saptarshichakraborty/Desktop/STAT 154:254 Labs/Lab 7.ipynb" output: html_document --- # Lab 7: PCA and its Variants ```{r} library(caret) library(tidyr) library(dplyr) library(ggplot2) library(gridExtra) ``` ```{r} show_digit = function(arr784, col = gray(12:1 / 12), ...) { image(matrix(as.matrix(arr784[-785]), nrow = 28)[, 28:1], col = col, labels=FALSE, xaxt = "n", yaxt = "n", ...) } # load image files load_image_file = function(filename) { ret = list() f = file(filename, 'rb') readBin(f, 'integer', n = 1, size = 4, endian = 'big') n = readBin(f, 'integer', n = 1, size = 4, endian = 'big') nrow = readBin(f, 'integer', n = 1, size = 4, endian = 'big') ncol = readBin(f, 'integer', n = 1, size = 4, endian = 'big') x = readBin(f, 'integer', n = n * nrow * ncol, size = 1, signed = FALSE) close(f) data.frame(matrix(x, ncol = nrow * ncol, byrow = TRUE)) } # load label files load_label_file = function(filename) { f = file(filename, 'rb') readBin(f, 'integer', n = 1, size = 4, endian = 'big') n = readBin(f, 'integer', n = 1, size = 4, endian = 'big') y = readBin(f, 'integer', n = n, size = 1, signed = FALSE) close(f) y } ``` ```{r} # load data dat = load_image_file("train-images-idx3-ubyte") # load labels dat_labels = as.factor(load_label_file("train-labels-idx1-ubyte")) ``` ```{r} head(dat) ```
```{r} # view test image show_digit(dat[20000, ]) title('Example Image') ``` ```{r} # We plot 100 cherry-picked images from the training set num = 10 par(mfrow=c(num, num), mar=c(0, 0.2, 1, 0.2)) for (i in 1:(num*num)) { show_digit(dat[i, ]) } ``` Here we will use PCA to reduce the number of features while retaining as much of the variance possible. PCA does this by finding a new set of axes that fit to the variance of the data. At heart, PCA is simply an eigendecomposition of the data which returns a set of eigenvectors and eigenvalues. Eigenvectors and eigenvalues describe the transformations necessary to go from the original axes to a new feature space. ```{r} # use the prcomp function dat.pca <- prcomp( dat, center = TRUE, scale = TRUE ) ``` ```{r} # pca summary sumdat <- summary( dat.pca ) #calculate total variance explained by each principal component df_pca <- data.frame( t( sumdat$importance ) ) df_pca$compnum <- 1:dim(dat)[2] ``` How many components account for 95% of the variance in the data? ```{r} comp95 <- min(which(df_pca$Cumulative.Proportion>=0.95)) comp95 ``` ### Visualizing the cumulative explained variance described by the principal components with a Skree Plot: ```{r} p1 <- ggplot( df_pca, aes( x = compnum, y = Proportion.of.Variance ) ) + geom_line() + ylim( c(0,0.3) ) + # xlim( c(0,20)) + geom_hline( yintercept = 0.01, linetype = 'dotted', col = 'red') + annotate("text", x = 2, y = 0.01, label = expression( "1%" ~ sigma), vjust = -0.5) + theme_minimal() + xlab( 'Principal Component Number' ) + ylab( 'Proportion Explained Variance' ) + ggtitle( 'Skree plot' ) p2 <- ggplot( df_pca, aes( x = compnum, y = Cumulative.Proportion ) ) +
geom_line() + ylim( c(0,1.1) ) + #xlim( c(0,500)) + geom_hline( yintercept = 0.95, linetype = 'dotted', col = 'red') + annotate("text", x = 20, y = 0.98, label = expression( "95%" ~ sigma), vjust = -0.5) + theme_minimal() + xlab( 'Principal Component Number' ) + ylab( 'Cumulative Explained Variance' ) + ggtitle( 'Cumulative Explained Variance' ) grid.arrange( p1, p2, ncol=2) ``` ## Visualizing several PCA components ```{r} par(mfrow = c(2,3)) show_digit(sumdat$rotation[,1]) title(bquote('PC1: ' ~ .(round( df_pca$Proportion.of.Variance[1],3)*100) ~ '% explained variance')) show_digit(sumdat$rotation[,2]) title(bquote('PC2: ' ~ .(round( df_pca$Proportion.of.Variance[2],3)*100) ~ '% explained variance')) show_digit(sumdat$rotation[,3]) title(bquote('PC3: ' ~ .(round( df_pca$Proportion.of.Variance[3],4)*100) ~ '% explained variance')) show_digit(sumdat$rotation[,4]) title(bquote('PC4: ' ~ .(round( df_pca$Proportion.of.Variance[4],4)*100) ~ '% explained variance')) show_digit(sumdat$rotation[,392]) title(bquote('PC392: ' ~ .(round( df_pca$Proportion.of.Variance[392],6)*100) ~ '% explained variance')) show_digit(sumdat$rotation[,784]) title(bquote('PC784: ~' ~ .(df_pca$Proportion.of.Variance[784]) ~ '% explained variance')) ``` ```{r} label = c('T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot') dat_pca_x <- data.frame(dat.pca$x) %>% select( c( 'PC1', 'PC2' ) ) %>% mutate( labels = dat_labels, labels_text = case_when( labels == 0 ~ 'T-shirt/top', labels == 1 ~ 'Trouser', labels == 2 ~ 'Pullover', labels == 3 ~ 'Dress', labels == 4 ~ 'Coat', labels == 5 ~ 'Sandal', labels == 6 ~ 'Shirt', labels == 7 ~ 'Sneaker',
Why is this page out of focus?
Because this is a Premium document. Subscribe to unlock this document and more.
Page1of 6
Uploaded by ChiefCrownGrouse38 on coursehero.com