class: center, middle, inverse, title-slide .title[ # Principal Component Analysis ] .subtitle[ ## EDP 619 Week 10 ] .author[ ### Dr. Abhik Roy ] --- <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.6.0/jquery.min.js"></script> <script type="text/x-mathjax-config"> MathJax.Hub.Register.StartupHook("TeX Jax Ready",function () { MathJax.Hub.Insert(MathJax.InputJax.TeX.Definitions.macros,{ cancel: ["Extension","cancel"], bcancel: ["Extension","cancel"], xcancel: ["Extension","cancel"], cancelto: ["Extension","cancel"] }); }); </script> <style> section { display: flex; display: -webkit-flex; } section { height: 600px; width: 60%; margin: auto; border-radius: 21px; background-color: #212121; } .remark-slide-container { background: #212121; } .hljs-github .hljs { background: transparent; color: #b2dfdb; } .hljs-github .hljs-keyword { color: #64b5f6; } .hljs-github .hljs-literal { color: #64b5f6; } .hljs-github .hljs-number { color: #64b5f6; } .hljs-github .hljs-string { color: #b7b3ef; } .hljs-github .hljs { background: transparent; color: #b2dfdb; } .hljs-github .hljs-keyword { color: #64b5f6; } .hljs-github .hljs-literal { color: #64b5f6; } .hljs-github .hljs-number { color: #64b5f6; } .hljs-github .hljs-string { color: #b7b3ef; } section p { text-align: center; font-size: 30px; background-color: #212121; border-radius: 21px; font-family: Roboto Condensed; font-style: bold; padding: 12px; color: #bff4ee; margin: auto; } #center { text-align: center; } #right { text-align: right; } .center p { margin: 0; position: absolute; top: 50%; left: 50%; -ms-transform: translate(-50%, -50%); transform: translate(-50%, -50%); } .center2 { margin: 0; position: absolute; top: 50%; left: 50%; -ms-transform: translate(-50%, -50%); transform: translate(-50%, -50%); } .tab { display: inline-block; margin-left: 40px; } .tabdbl { display: inline-block; margin-left: 80px; } .tabtpl { display: inline-block; margin-left: 120px; } .obr { display:block; margin-top:-15px; } 
.pull-left-left { float: left; width: 27%; } .pull-right-right { float: right; width: 32%; } img.expand:hover { margin: 0 auto; position: relative; width: 50%; display: flex; justify-content: center; align-items: center; align-content: center; transform: scale(1.5) translateX(-35%); z-index: 99; transition:all 0.5s ease-in-out; -webkit-transition:all 0.2s ease-in-out; } .vertline { border-left: 5px solid #212121; height: 100px; margin-left: 15px; margin-right: 15px; } *, *:before, *:after { box-sizing: border-box; outline: none; } .hover { position: relative; display: flex; align-items: center; justify-content: center; width: 400px; height: 65px; background-color: #e3c0ff; border-radius: 99px; box-shadow: 0 1px 3px rgba(0, 0, 0, 0.12), 0 1px 2px rgba(0, 0, 0, 0.24); transition: all 0.3s cubic-bezier(0.25, 0.8, 0.25, 1); overflow: hidden; } .hover:before, .hover:after { position: absolute; top: 0; display: flex; align-items: center; justify-content: center; width: 50%; height: 100%; transition: 0.25s linear; z-index: 1; } .hover:before { content: ''; left: 0; background-color: #ca86ec; color: #212121; } .hover:after { content: ''; right: 0; background-color: #d896ff; } .hover:hover { background-color: #cc8bff; box-shadow: 0 14px 28px rgba(0, 0, 0, 0.25), 0 10px 10px rgba(0, 0, 0, 0.22); } .hover:hover span { opacity: 0; z-index: -3; } .hover:hover:before { opacity: 0.5; transform: translateY(-100%); } .hover:hover:after { opacity: 0.5; transform: translateY(100%); } .hover span { position: absolute; top: 0; left: 0; display: flex; align-items: center; justify-content: center; text-align: center; width: 100%; height: 100%; color: #212121; font-size: 24px; font-weight: 700; opacity: 1; transition: opacity 0.25s; z-index: 2; white-space:pre; } .hover .doc-link { position: relative; display: flex; align-items: center; justify-content: center; text-align: center; width: 25%; height: 100%; color: whitesmoke; font-size: 24px; text-decoration: none; transition: 0.25s; } 
.hover .doc-link i { text-shadow: 1px 1px rgba(70, 98, 127, 0.7); transform: scale(1); } .hover .doc-link:hover { background-color: rgba(245, 245, 245, 0.1); } .hover .doc-link:hover i { animation: bounce 0.4s linear; } @keyframes bounce { 40% { transform: scale(1.4); } 60% { transform: scale(0.8); } 80% { transform: scale(1.2); } 100% { transform: scale(1); } } .boxl { width: 50%; margin: 5px; text-align: center; } .boxr { margin: 5px; text-align: center; } .picr { display: flex; justify-content: space-around; align-items: center; } </style> <style type="text/css"> .highlight-last-item > ul > li, .highlight-last-item > ol > li { opacity: 0.5; } .highlight-last-item > ul > li:last-of-type, .highlight-last-item > ol > li:last-of-type { opacity: 1; } </style>
--- class: highlight-last-item layout: true --- # Welcome! There are a lot of things going on behind the scenes when using PCAs and this is just a very brief introduction without any audio. I have tried to minimize the jargon and complexity, though some items may not be as clear as others. If you have questions, please feel free to reach out. Additionally you may notice the following icons in the footnotes. These contain links to external sites that provide extra materials that may be of interest to you. <br> <br> <br> <br> <center> <div class='footsbs'> <img src="img/htmlcon-ico.png" alt="HTML icon" width='70' style="padding-right: 20px"> <img src="img/pdfcon-ico.png" alt="PDF icon" width='70' style="padding-right: 20px"> <img src="img/rmdcon-ico.png" alt="Rmarkdown icon" width='70' style="padding-right: 20px"> <img src="img/Rscriptcon-ico.png" alt="Rscript icon" width='70' style="padding-right: 20px"> <img src="img/videocon-ico.png" alt="Video icon" width='70'> </div> </center> <!-- Some information about the audio files. Firstly and importantly, I'm no Morgan Freeman but each page will have some audio that will hopefully help you understand a bit more about what's on the page <center> <audio controls preload="auto"> <source src="audio/criteria/S1_Introduction.mp3" type="audio/mpeg"> Your browser does not support embedded audio. <audio> </center> --> --- # Prerequisites This slideshow assumes that you have a basic understanding of variance and correlations. For a refresher, please take a look at both reviews below -- .pull-left[ .bg-washed.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ **Variance** is essentially a measure of the spread between points in a data set. 
Specifically, it tells us how far each data point in a set is from the mean and, by proxy, from every other data point in that set.<br><br> <center> <img src="img/variance_card.png" height="180px" style="background-color:#212121;"/> </center><br>] ] -- .pull-right[ .bg-washed.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[ **Correlation** gives you an idea of the strength or weakness of the relationship between two variables. In a survey where each item is set to measure a single construct, these are essentially the applicable questions.<br> <center> <img src="img/correlation_card.png" height="180px" style="background-color:#212121;"/> </center><br>] ] --- # More Review If you would like a deeper dive on either area, take a look at the videos below <br> <br> <br> <br> -- .pull-left[ <center> **Variance** </center> <p align="center"> <iframe width="450" height="252" src="https://www.youtube.com/embed/SzZ6GpcfoQY" title="StatQuest Pearson's Correlation, Clearly Explained!!!" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> </p> ] -- .pull-right[ <center> **Correlation** </center> <p align="center"> <iframe width="450" height="252" src="https://www.youtube.com/embed/xZ_z8KWkhXE" title="StatQuest Pearson's Correlation, Clearly Explained!!!" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> </p> ] --- # From Basic to Better -- <br> <br> <center> correlations are great but they don't... </center> <br> <br> -- .centerCenter[ <br> <br> 1. tell you how every question is related to every other question 2. 
differentiate between relevant data and noise ] -- <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <center> ...enter a method called <b><i>Principal Component Analysis</i></b> </center> --- # Principal Component Analysis (PCA) --- ## Steps in a Nutshell -- <center> The basic idea of a PCA can be broken into two steps </center> <br> <br> -- .pull-left[ Locate the directions, or *components*, in a data set with high variance<br><br> <center> <img src="img/pc_card.png" height="300px" style="background-color:#212121;"/> </center> ] -- .pull-right[ Find a limited number of components with high variance that in aggregate can explain most of the overall variance in the data<br><br> <center> <img src="img/pca_card.png" height="300px" style="background-color:#212121;"/> </center> ] --- ## Reducing Complexity -- <center> First an overview of some terms: </center> <br> <br> -- .pull-left[ ***Dimensionality*** - The number of input variables, or *features*, in a data set. In a spreadsheet, you can think of these as the column names.<br><br> <center> <img src="img/dimensionality_spreadsheet.png" height="300px" style="background-color:#212121;"/> </center> ] -- .pull-right[ ***Dimensionality Reduction*** - Statistical techniques used to reduce the number of input variables.<br><br><br><br><br> <center> <img src="img/pca_dim_red.png" height="200px" style="background-color:#212121;"/> </center> ] --- ## The Problem with Dimensions ***Curse of Dimensionality*** - In brief terms, this refers to a few aspects -- + *statistical*. the error rate increases as the number of features increases -- + *computational*. algorithms are harder to design and take exponentially more time to run in high dimensions -- + *practical*. 
a higher number of dimensions theoretically allows more information to be stored, but in reality it rarely helps due to the higher possibility of noise and redundancy in real-world data -- <br> <br> <center> <img src="img/curse_example.png" height="250px" style="background-color:#212121;"/> </center> --- ## Fundamentals of What PCAs Can and Cannot Do PCA is one of the most traditional methods used for dimension reduction. -- .pull-left[ <center> **Primary benefit** </center> It transforms the data into the most informative space, thereby allowing the use of fewer dimensions that retain the needed information from the data while shedding much of the noise ] -- .pull-right[ <center> **Primary drawback** </center> It assumes linearity, so any nonlinear relationship in a given data set is lost, possibly causing a loss in accuracy and in the ability to estimate the likelihood of causality. ] -- <br> <br> <br> <br> <hr style="width:30%"> .centerBottom[ Note, as with most other procedures: *what you gain in efficiency, you lose in precision*. In a nutshell, there is no known perfect method that can both get rid of all of the noise and leave only relevant information. However, with an ever-growing machine learning library of approaches, we could get pretty close well within your lifetime! ] --- ## How Do PCAs Work? -- Before moving on, please note that this is a nutshell explanation of the steps and avoids the mathematics<sup>1</sup>. If you are interested in a more nuanced introduction coupled with the mathematics, watch this amazing lecture by Josh Starmer from [StatQuest](https://statquest.org/about/)<sup>2</sup>. 
<p align="center"> <iframe width="560" height="315" src="https://www.youtube.com/embed/_C9-Bn-7KO4" title="Introduction to Principal Component Analysis (PCA) by Josh Starmer from StatQuest" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> </p> .footnote[ [1] <a href="https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf#chapter.1" target='_blank'> <img src="img/pdf-ico.png" alt="PDF icon" width='35' style="padding-right: 10px ; padding-left: 5px;"> </a> [2]<a href="https://www.youtube.com/c/joshstarmer/videos" target='_blank'> <img src="img/video-ico.png" alt="Video Icon" width='22' style="padding-right: 10px ; padding-left: 5px;"> </a> ] --- ## OK Now Really How Do PCAs Work? -- .center2[ Let's look at a data set with 205 points randomly scattered in three dimensions. Keep in mind that as you move along, the <i>PCA is carving out new dimensions which you will be able to see and interact with</i>. ] <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <center> When applied, a PCA locates the... </center> --- <ol start="1"> <li>center point of data in multi-dimensional space </li> </ol> <br> <br> <br> .panelset.sideways[ .panel[.panel-name[Look] <img src="img/pca1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Interact]
] ] --- <ol start="2"> <li>direction with the greatest variance. This is called the <b>1st component</b> </li> </ol> <br> <br> <br> .panelset.sideways[ .panel[.panel-name[Look] <img src="img/pca2.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Interact]
] ] --- <ol start="3"> <li>direction with the greatest remaining variance that is perpendicular, or <i>orthogonal</i>, to the 1st component. This is called the <b>2nd component</b>. </li> </ol> <br> <br> <br> .panelset.sideways[ .panel[.panel-name[Look] <img src="img/pca3.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Interact]
] ] --- <ol start="4"> <li>direction with the greatest remaining variance that is perpendicular, or <i>orthogonal</i>, to both the 1st and 2nd components. This is called the <b>3rd component</b>. </li> </ol> <br> <br> <br> .panelset.sideways[ .panel[.panel-name[Look] <img src="img/pca4.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Interact]
] ] --- > and it keeps going like this for as many dimensions as we have in a data set... <br> <br> -- > so you can probably imagine that big data sets with hundreds or thousands of columns and rows can take quite a bit of time... <br> <br> -- > but there are many other methods of reducing dimensions like --- .pull-left[ [Hierarchical Clustering](https://uc-r.github.io/hc_clustering)<sup>3</sup> ] .pull-right[ ![](pca_pres_files/figure-html/unnamed-chunk-7-1.png)<!-- --> ] .footnote[ <div class='footsbs'> [3] <a href="https://uc-r.github.io/hc_clustering" target='_blank'> <img src="img/rmd-logo.png" alt="Rmarkdown icon" width='35' style="padding-right: 10px ; padding-left: 5px;"> </a> <a href="scripts/hc.zip" target='_blank' download="Hierarchical Clustering script"> <img src="img/Rscript-ico.png" alt="Rscript icon" width='36'> </a> </div> ] --- .pull-left[ [K-means Clustering](https://uc-r.github.io/kmeans_clustering)<sup>4</sup> ] .pull-right[ ![](pca_pres_files/figure-html/unnamed-chunk-8-1.png)<!-- --> ] .footnote[ <div class='footsbs'> [4] <a href="https://uc-r.github.io/kmeans_clustering" target='_blank'> <img src="img/rmd-logo.png" alt="Rmarkdown icon" width='35' style="padding-right: 10px; padding-left: 5px;"> </a> <a href="scripts/kmeans.zip" target='_blank' download="K-means script"> <img src="img/Rscript-ico.png" alt="Rscript icon" width='36'> </a> </div> ] --- .pull-left[ [t-Distributed Stochastic Neighbor Embedding (t-SNE)](https://rpubs.com/marwahsi/tnse)<sup>5</sup> ] .pull-right[ ![](pca_pres_files/figure-html/tsneplot-1.png)<!-- --> ] .footnote[ <div class='footsbs'> [5] <a href="https://rpubs.com/marwahsi/tnse" target='_blank'> <img src="img/rmd-logo.png" alt="Rmarkdown icon" width='35' style="padding-right: 10px; padding-left: 5px;"> </a> <a href="https://distill.pub/2016/misread-tsne/" target='_blank'> <img src="img/html-ico.png" alt="HTML icon" width='35' style="padding-right: 10px;"> </a> <a href="scripts/tsne.zip" target='_blank' 
download="t-SNE script"> <img src="img/Rscript-ico.png" alt="Rscript icon" width='36'> </a> </div> ] --- Below is an animation of the t-SNE process which shows a complex data set, data reduction and then clustering<sup>6</sup> <center> <img src="img/tsne_anim.gif" style="background-color:#212121;"/> </center> .footnote[ <div class='footsbs'> [6] <a href="https://hypercompetent.github.io/post/gganimate-tweenr-tsne-plot/" target='_blank'> <img src="img/html-ico.png" alt="HTML icon" width='35' style="padding-right: 10px; padding-left: 5px;"> </a> <a href="scripts/tsne_anim.zip" target='_blank' download="t-SNE animation bundle"> <img src="img/Rscript-ico.png" alt="Rscript icon" width='36'> </a> </div> ] --- And just for fun, here are two [PCA](https://rpubs.com/marwahsi/tnse) rotations of the example data set<sup>7</sup> <br> .pull-left[ ![](pca_pres_files/figure-html/pc12plot-1.png)<!-- --> ] .pull-right[ ![](pca_pres_files/figure-html/pc34plot-1.png)<!-- --> ] .footnote[ <div class='footsbs'> [7] <a href="https://uc-r.github.io/pca" target='_blank'> <img src="img/rmd-logo.png" alt="Rmarkdown icon" width='35' style="padding-right: 10px; padding-left: 5px;"> </a> <a href="scripts/pca.zip" target='_blank' download="PCA script"> <img src="img/Rscript-ico.png" alt="Rscript icon" width='36'> </a> </div> ] --- ## Surveys and PCAs In general, using a PCA in survey data analysis helps you to understand -- + how each item is similar to all others and the strength of that relationship -- + which items should likely be kept or removed --- # But Wait There's More! Again, this is just the tip of the iceberg. To really see the power of PCAs, take a look at machine learning. This is just one of many ways to deal with classification and dimensionality. Here are a couple of resources. At this time, it's good just to ignore the coding and to simply get a basic idea of each. .footnote[ [8] If you cannot see the entire page, please load the site in a private window. 
Directions on how to do this are provided for <center> <div class='footsbs'> <a href="https://support.google.com/chrome/answer/95464" target='_blank'> <img src="img/chrome-ico.png" alt="Chrome icon" width='35' style="padding-right: 10px;"> </a> <a href="https://support.mozilla.org/en-US/kb/private-browsing-use-firefox-without-history" target='_blank' download="PCA script"> <img src="img/firefox-ico.png" alt="Firefox icon" width='35' style="padding-right: 10px;"> </a> <a href="https://support.apple.com/guide/safari/browse-privately-ibrw1069/mac" target='_blank' download="PCA script"> <img src="img/safari-ico.png" alt="Safari icon" width='35'> </a> </div> </center> ] -- + [11 Dimensionality reduction techniques you should know in 2021](https://towardsdatascience.com/11-dimensionality-reduction-techniques-you-should-know-in-2021-dcb9500d388b)<sup>8</sup> -- + [Understanding Dimension Reduction and Principal Component Analysis in R for Data Science](https://towardsdatascience.com/understanding-dimension-reduction-and-principal-component-analysis-in-r-e3fbd02b29ae)<sup>8</sup> -- + [Workshop: Dimension reduction with R](https://rpubs.com/Saskia/520216) --- ## That's it! If you have any questions, please reach out -- <br> <br> <br> <br> <br> <br> <br> <br> <br> <center> <br><br> <div class="fade_rule"></div> <br><br> </center> <center> <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><br />This work is licensed under a <br /><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a> </center>
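---

## Optional: The Four Steps as Code

For anyone curious, the four steps from the walkthrough (find the center, then the 1st, 2nd and 3rd components) can be sketched in plain Python. This is only a rough illustration and not the script behind the figures in these slides: the 205 simulated points, the seed, and the function names (`center`, `covariance`, `top_direction`, `pca`) are all made up for this example, and the components are found with a simple technique called power iteration with deflation rather than with the routines a statistics package would normally use.

```python
import math
import random


def center(points):
    """Step 1: find the center point and shift the data so it sits at zero."""
    n, dims = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dims)]
    return [[p[j] - means[j] for j in range(dims)] for p in points]


def covariance(centered):
    """Sample covariance matrix of data that has already been centered."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]


def multiply(matrix, vector):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def top_direction(matrix, iterations=500):
    """Power iteration: the unit direction with the greatest variance."""
    dims = len(matrix)
    v = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        w = multiply(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no variance left to find
            break
        v = [x / norm for x in w]
    variance = sum(a * b for a, b in zip(v, multiply(matrix, v)))
    return v, variance


def pca(points):
    """Steps 2 to 4: repeatedly take the direction of greatest remaining variance."""
    cov = covariance(center(points))
    dims = len(cov)
    components, variances = [], []
    for _ in range(dims):
        v, var = top_direction(cov)
        components.append(v)
        variances.append(var)
        # Deflation: remove the variance along v, so the next direction
        # found is orthogonal to every component found so far.
        cov = [[cov[i][j] - var * v[i] * v[j] for j in range(dims)]
               for i in range(dims)]
    return components, variances


# Hypothetical data: 205 points in three dimensions, two of them correlated.
random.seed(619)
points = []
for _ in range(205):
    a, b, c = random.gauss(0, 3), random.gauss(0, 1.5), random.gauss(0, 0.5)
    points.append([a, 0.5 * a + b, c])

components, variances = pca(points)
explained = [v / sum(variances) for v in variances]
for k, share in enumerate(explained, start=1):
    print(f"Component {k}: {share:.1%} of the variance")
```

The printed shares show how much of the overall variance each component explains, which is what you would look at when deciding how many components are worth keeping.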