class: center, middle, inverse, title-slide .title[ # Topic Modeling ] .subtitle[ ## EDP 618 Week 12 ] .author[ ### Dr. Abhik Roy ] --- <script> function resizeIframe(obj) { obj.style.height = obj.contentWindow.document.body.scrollHeight + 'px'; } </script> <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.6.0/jquery.min.js"></script> <script type="text/x-mathjax-config"> MathJax.Hub.Register.StartupHook("TeX Jax Ready",function () { MathJax.Hub.Insert(MathJax.InputJax.TeX.Definitions.macros,{ cancel: ["Extension","cancel"], bcancel: ["Extension","cancel"], xcancel: ["Extension","cancel"], cancelto: ["Extension","cancel"] }); }); </script> <style> section { display: flex; display: -webkit-flex; } section { height: 600px; width: 60%; margin: auto; border-radius: 21px; background-color: #212121; } .remark-slide-container { background: #212121; } .hljs-github .hljs { background: transparent; color: #b2dfdb; } .hljs-github .hljs-keyword { color: #64b5f6; } .hljs-github .hljs-literal { color: #64b5f6; } .hljs-github .hljs-number { color: #64b5f6; } .hljs-github .hljs-string { color: #b7b3ef; } .hljs-github .hljs { background: transparent; color: #b2dfdb; } .hljs-github .hljs-keyword { color: #64b5f6; } .hljs-github .hljs-literal { color: #64b5f6; } .hljs-github .hljs-number { color: #64b5f6; } .hljs-github .hljs-string { color: #b7b3ef; } section p { text-align: center; font-size: 30px; background-color: #212121; border-radius: 21px; font-family: Roboto Condensed; font-style: bold; padding: 12px; color: #bff4ee; margin: auto; } #center { text-align: center; } #right { text-align: right; } .center p { margin: 0; position: absolute; top: 50%; left: 50%; -ms-transform: translate(-50%, -50%); transform: translate(-50%, -50%); } .center2 { margin: 0; position: absolute; top: 50%; left: 50%; -ms-transform: translate(-50%, -50%); transform: translate(-50%, -50%); } .tab { display: inline-block; margin-left: 40px; } .listtab { display: inline-block; margin-left: 30px; 
} .obr { display:block; margin-top:-15px; } .container { display: flex; } .container > div { flex: 1; /*grow*/ margin-right: 40px; } td, th, tr, table { border: 0 !important; border-spacing:0 !important; overflow-x: hidden; overflow-y: hidden; background-color: unset !important; color: unset !important; } tbody > tr:hover > td { background-color: unset !important; color: unset !important; } .remarkwidth code[class="remark-code"] { white-space: pre-wrap; padding-left: 1.85em; text-indent: -1.85em; } .left-code { color: #777; width: 60%; height: 92%; float: left; } .right-plot { width: 38%; float: right; padding-left: 1%; } .cardquad1 img:hover{ position: relative; transform: translate(-50%,50%) scale(2.0); background-color: #212121; } .cardquad2 img:hover{ position: relative; transform: translate(50%,50%) scale(2.0); background-color: #212121; } .cardquad3 img:hover{ position: relative; transform: translate(50%,-50%) scale(2.0); background-color: #212121; } .cardquad4 img:hover{ position: relative; transform: translate(-50%,-50%) scale(2.0); background-color: #212121; } img{ -webkit-transition: transform 0.5s ease-in-out; -moz-transition: transform 0.5s ease-in-out; -ms-transition: transform 0.5s ease-in-out; -o-transition: transform 0.5s ease-in-out; transition: transform 0.5s ease-in-out; } </style> <style type="text/css"> .highlight-last-item > ul > li, .highlight-last-item > ol > li { opacity: 0.5; } .highlight-last-item > ul > li:last-of-type, .highlight-last-item > ol > li:last-of-type { opacity: 1; } </style>
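---

## Where We're Headed

Everything in this walkthrough builds toward a single call to `LDA()`. As a preview only - a minimal sketch, not the walkthrough itself, with a made-up two-document `docs` tibble and an arbitrary `k = 2` - the whole pipeline has this shape:

```r
library(dplyr)
library(tidytext)     # unnest_tokens(), cast_dtm(), tidy()
library(topicmodels)  # LDA()

# A tiny stand-in corpus, just to show the moving parts
docs <- tibble(doc  = c("a", "b"),
               text = c("budget spending budget audit",
                        "training mentoring training growth"))

docs %>%
  unnest_tokens(word, text) %>%             # tokenize: one row per word
  count(doc, word) %>%                      # word counts per document
  cast_dtm(doc, word, n) %>%                # build a document-term matrix
  LDA(k = 2, control = list(seed = 1)) %>%  # fit a 2-topic model
  tidy(matrix = "beta")                     # per-topic word probabilities
```

The real data will need cleaning first - that is where most of the work (and most of these slides) goes.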
--- class: highlight-last-item layout: true --- # Setting Up -- 1. You can retrieve the *employee sample reviews* survey response data set and both installation and walkthrough
*scripts* by clicking on the icon below<br> <br> <center> <a href="files/topic_modeling_files.zip" target='_blank' download="Topic Modeling Files"> <img src="img/zip-ico.png" alt="Zip icon" width='45'></a> </center> -- 2. Open up RStudio -- 3. Open up <span style="font-family:'Source Code Pro'; color:#b7b3ef">Topic Modeling Install.R</span> -- 4. Open up <span style="font-family:'Source Code Pro'; color:#b7b3ef">Topic Modeling Script.R</span> -- .footnote[Take a look at the various types of files that can be imported in the tidyverse <a href="files/data-import.pdf" target='_blank' download="Data import with the tidyverse"> <img src="img/pdf-ico.png" alt="PDF icon" width='45'></a>] --- # Getting Prepped -- In <span style="font-family:'Source Code Pro'; color:#b7b3ef">Topic Modeling Script.R</span>, run the following commands -- 1. Setting the working directory to the source file location ```r setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) ``` -- 2. Loading the needed packages for this walkthrough ```r library(tidyverse) library(tidytext) library(tm) library(textclean) library(topicmodels) library(ldatuning) library(stopwords) library(textstem) library(broom) ``` .footnote[Alternatively, if you have the **pacman** package, run `pacman::p_load("tidyverse", "tidytext", "tm", "textclean", "topicmodels", "ldatuning", "stopwords", "textstem", "broom")` to install and load everything in one step] --- <ol start="3"> <li> Bringing in the survey data </ol> ```r employee_responses <- read_csv("employee sample reviews.csv") ``` -- <ol start="4"> <li> Bringing in the common names set (as a character vector) </ol> ```r common_names <- read_csv("most common names.csv") %>% simplify_all() %>% .[[1]] ``` -- <ol start="5"> <li> Retrieving stopwords </ol> ```r data("stop_words") ``` --- ## Too Many Files? -- You may have noticed that there are a lot of files. Some of them we'll use, while others are included for completeness.
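--

A quick way to see everything the zip unpacked into your project folder (assuming you already ran the `setwd()` line from *Getting Prepped*):

```r
# List the .csv data sets and .R scripts in the working directory
list.files(pattern = "\\.(csv|R)$")
```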
Below you will see a description of each -- .pull-left[ <p id="center" style="color:#baffc9; border:1px; border-style:solid; border-color:#baffc9; border-radius: 25px; padding: 0.3em; margin-top: -6px"> <span style = "font-weight:bold; font-style:italic;">Data sets we <i>are</i> using </span><br><br> <span style="font-family:'Source Code Pro'; color:#bff4ee">employee sample reviews.csv</span> is a 10% random sampling of the original data set that was created to save computing time for the example given in the walkthrough<br><br> <span style="font-family:'Source Code Pro'; color:#bff4ee">most common names.csv</span> is a list of approximately 97,000+ common U.S. names derived from both U.S. Census and Social Security Administration data </p> ] -- .pull-right[ <p id="center" style="color:#baffc9; border:1px; border-style:solid; border-color:#baffc9; border-radius: 25px; padding: 0.3em; margin-top: -6px"> <span style = "font-weight:bold; font-style:italic;">R Scripts we <i>are</i> using </span><br><br> <span style="font-family:'Source Code Pro'; color:#b7b3ef">Topic Modeling Install.R</span> is the installation file needed for the walkthrough<br><br> <span style="font-family:'Source Code Pro'; color:#b7b3ef">Topic Modeling Script.R</span> is a static copy of the commands used in the walkthrough </p> ] -- .pull-left[ <p id="center" style="color:#ffb3ba; border:1px; border-style:solid; border-color:#ffb3ba; border-radius: 25px; padding: 0.3em; margin-top: -6px"> <span style = "font-weight:bold; font-style:italic;">Data sets we <i>are not</i> using </span><br><br> <span style="font-family:'Source Code Pro'; color:#bff4ee">employee reviews.csv</span> is the original data set and may be found on [Kaggle](https://www.kaggle.com/datasets/fiodarryzhykau/employee-review) </p> ] -- .pull-right[ <p id="center" style="color:#ffb3ba; border:1px; border-style:solid; border-color:#ffb3ba; border-radius: 25px; padding: 0.3em; margin-top: -6px"> <span style = "font-weight:bold; 
font-style:italic;">R scripts we <i>are not</i> using </span><br><br> <span style="font-family:'Source Code Pro'; color:#b7b3ef">Get Common Names.R</span> provides a list of <b>dplyr</b> commands that use the <b>lexicon</b>, <b>babynames</b>, and <b>genderdata</b> packages to create the most common names data set<br><br> <span style="font-family:'Source Code Pro'; color:#b7b3ef">Random Sampling Rows.R</span> provides a list of <b>dplyr</b> commands used to create the sample data set </p> ] --- # Before We Begin -- This is the process we'll cover, albeit lightly. There is a lot more going on under the hood, and you may not recognize all of the terms yet, but if you can get a basic understanding of the process, the rest can be filled in by conducting a topic model! <center> <img src="img/tmwf.png" alt="Basic Topic Modeling Process" width='400'> </center> --- If you didn't know, computers can't understand human languages...not directly, anyway. Enter the idea below: using a medium to communicate with one (or several) <br> <br> <br> <center> <img src="img/nlp.png" alt="Natural Language Processing definition" width='500'> </center> --- Here are a few things we won't be covering in this session, so please read over any areas you are unfamiliar with.
Given that, it is absolutely fine if you cannot fully understand all of these ideas right now - they will hopefully become apparent as we progress .center2[ <div style="text-align: center; color: #212121; border:1px; border-style:solid; border-color:#c7f9f6; background-color: #c7f9f6; border-radius: 15px; padding: 0.65em; width:fit-content;"> hover over<br>any card<br>to make<br>it bigger </div>] -- .pull-left[ <div class="cardquad2"> <center> <img src="img/document.png" alt="Document Definition" width='350'> </center> </div> ] -- .pull-right[ <div class="cardquad1"> <center> <img src="img/corpus.png" alt="Corpus Definition" width='350'> </center> </div> ] -- <br> .pull-left[ <div class="cardquad3"> <center> <img src="img/tf-idf.png" alt="TF-IDF Definition" width='350'> </center> </div> ] -- .pull-right[ <div class="cardquad4"> <center> <img src="img/lda.png" alt="LDA Definition" width='350'> </center> </div> ] --- Here are some basic terms you should try to keep in mind while going through the walkthrough. Again, it is completely fine if you do not understand what these mean in context right now!
.center2[ <div style="text-align: center; color: #212121; border:1px; border-style:solid; border-color:#c7f9f6; background-color: #c7f9f6; border-radius: 15px; padding: 0.65em; width:fit-content;"> hover over<br>any card<br>to make<br>it bigger </div>] -- .pull-left[ <div class="cardquad2"> <center> <img src="img/bow.png" alt="Bag of Words Definition" width='350'> </center> </div> ] -- .pull-right[ <div class="cardquad1"> <center> <img src="img/classification.png" alt="Classification Definition" width='350'> </center> </div> ] -- <br> .pull-left[ <div class="cardquad3"> <center> <img src="img/standardization.png" alt="Standardization Definition" width='350'> </center> </div> ] -- .pull-right[ <div class="cardquad4"> <center> <img src="img/tokenization.png" alt="Tokenization Definition" width='350'> </center> </div> ] --- <br> <br> #<center>Topic Modeling</center> -- <br> <br> <br> <center> <div style="text-align: center; color:#c7f9f6; border:1px; border-style:solid; border-color:#c7f9f6; border-radius: 25px; padding: 0.8em; width:fit-content;"> A type of probabilistic statistical model for<br><br> <div style="display: inline-block; text-align: left;"> (a) discovering the abstract "topics"<br> <span class = "listtab">- or <i>hidden semantic structures</i> -</span><br> <span class = "listtab">that occur in a collection of documents</span><br><br> (b) dimensionality reduction </div> </div> </center> --- ### The Most Annoying Thing About Data -- .center2[<b>The 80/20 Rule</b><sup>1</sup>: <i>Most data scientists spend only 20 percent of their time on actual data analysis and 80 percent of their time finding, cleaning, and reorganizing huge amounts of data</i>] .footnote[<sup>1</sup> Loosely based on an idea called **Pareto's Principle**, which states that *roughly 80% of outcomes come from 20% of causes*] --- .center2[<b><span style = "font-size:2.75rem">Step 1: Assessing Data</span></b>] --- 1.
Take a look at the data set and think about categorizing terms that may skew how terms are assessed -- .pull-left[<span class = "tab">the names of the people are not important so we could replace all of them simply with the word <b>people</b></span>] -- .pull-right[<span class = "tab">employees are prevalent in the data so we could remove the word <s><b>people</b></s> altogether</span>] <br> -- <ol start="2"> <li> Open up an empty text document and try going through on your own to consider terms that could be collapsed </ol> --- .center2[<b><span style = "font-size:2.75rem">Step 2: Preprocessing</span></b>] --- <span style = "font-size:1.75rem;"><b>Cleaning</b> Raw Text</b></span> --- count: false .panel1-sw1a-auto[ ```r *employee_responses ``` ] .panel2-sw1a-auto[ ``` ## # A tibble: 88 × 6 ## id person_name nine_box_category feedb…¹ adjus…² revie…³ ## <dbl> <chr> <chr> <chr> <lgl> <lgl> ## 1 612 Yahir Harvey Category 6: 'High Performer' (… "Requi… TRUE TRUE ## 2 552 Briley Mcknight Category 7: 'Potential Gem' (L… "Basic… FALSE FALSE ## 3 10215 Emerson Rose Category 2: 'Average performer… "Emers… FALSE FALSE ## 4 315 Jay Reid Category 4: 'Inconsistent Play… "I am … FALSE FALSE ## 5 447 Chelsea Ross Category 6: 'High Performer' (… "Chels… FALSE FALSE ## 6 186 Eloise Foster Category 3: 'Solid Performer' … "Elois… FALSE FALSE ## 7 238 Eloise Foster Category 3: 'Solid Performer' … "Elois… FALSE FALSE ## 8 621 Dennis Buchanan Category 8: 'High Potential' (… "Denni… FALSE FALSE ## 9 10020 Kieran Clarke Category 5: 'Core Player' (Mod… "Kiera… FALSE FALSE ## 10 560 Logan Ellis Category 7: 'Potential Gem' (L… "While… FALSE FALSE ## # … with 78 more rows, and abbreviated variable names ¹feedback, ²adjusted, ## # ³reviewed ``` ] --- count: false .panel1-sw1a-auto[ ```r employee_responses %>% * select(feedback) # Select the column with open ended responses ``` ] .panel2-sw1a-auto[ ``` ## # A tibble: 88 × 1 ## feedback ## <chr> ## 1 "Requires additional scope[e. 
Willingness to self start. Outperforms task. m… ## 2 "Basically Briley Mcknight have high potential in his work experience. But … ## 3 "Emerson Rose is a fairly average worker. she puts out satisfactory work an… ## 4 "I am writing a review for Mr. Jay Reid. While he is a great person, unfortu… ## 5 "Chelsea Ross performed at a high level this past year. She is one of the m… ## 6 "Eloise Foster is a follower rather than a leader. While she does complete h… ## 7 "Eloise is a very capable worker and always produces excellent work. She alw… ## 8 "Dennis is a consistently reliable as an employee. His work product is alwa… ## 9 "Kieran is a linchpin of the team, and a dependable worker. Can be trusted w… ## 10 "While I have found some issues with Logan's performance in the past, specif… ## # … with 78 more rows ``` ] --- count: false .panel1-sw1a-auto[ ```r employee_responses %>% select(feedback) %>% # Select the column with open ended responses * mutate(feedback = textclean::replace_non_ascii(feedback)) # Convert to a standard format ``` ] .panel2-sw1a-auto[ ``` ## # A tibble: 88 × 1 ## feedback ## <chr> ## 1 Requires additional scope[e. Willingness to self start. Outperforms task. mo… ## 2 Basically Briley Mcknight have high potential in his work experience. But in… ## 3 Emerson Rose is a fairly average worker. she puts out satisfactory work and … ## 4 I am writing a review for Mr. Jay Reid. While he is a great person, unfortun… ## 5 Chelsea Ross performed at a high level this past year. She is one of the mos… ## 6 Eloise Foster is a follower rather than a leader. While she does complete he… ## 7 Eloise is a very capable worker and always produces excellent work. She alwa… ## 8 Dennis is a consistently reliable as an employee. His work product is always… ## 9 Kieran is a linchpin of the team, and a dependable worker. 
Can be trusted wi… ## 10 While I have found some issues with Logan's performance in the past, specifi… ## # … with 78 more rows ``` ] --- count: false .panel1-sw1a-auto[ ```r employee_responses %>% select(feedback) %>% # Select the column with open ended responses mutate(feedback = textclean::replace_non_ascii(feedback)) %>% # Convert to a standard format * mutate(feedback = str_to_lower(feedback)) # Convert all words to lower case ``` ] .panel2-sw1a-auto[ ``` ## # A tibble: 88 × 1 ## feedback ## <chr> ## 1 requires additional scope[e. willingness to self start. outperforms task. mo… ## 2 basically briley mcknight have high potential in his work experience. but in… ## 3 emerson rose is a fairly average worker. she puts out satisfactory work and … ## 4 i am writing a review for mr. jay reid. while he is a great person, unfortun… ## 5 chelsea ross performed at a high level this past year. she is one of the mos… ## 6 eloise foster is a follower rather than a leader. while she does complete he… ## 7 eloise is a very capable worker and always produces excellent work. she alwa… ## 8 dennis is a consistently reliable as an employee. his work product is always… ## 9 kieran is a linchpin of the team, and a dependable worker. can be trusted wi… ## 10 while i have found some issues with logan's performance in the past, specifi… ## # … with 78 more rows ``` ] --- count: false .panel1-sw1a-auto[ ```r employee_responses %>% select(feedback) %>% # Select the column with open ended responses mutate(feedback = textclean::replace_non_ascii(feedback)) %>% # Convert to a standard format mutate(feedback = str_to_lower(feedback)) %>% # Convert all words to lower case * mutate(feedback = str_remove_all(feedback, "'s")) # Remove all cases of `s ``` ] .panel2-sw1a-auto[ ``` ## # A tibble: 88 × 1 ## feedback ## <chr> ## 1 requires additional scope[e. willingness to self start. outperforms task. mo… ## 2 basically briley mcknight have high potential in his work experience. 
but in… ## 3 emerson rose is a fairly average worker. she puts out satisfactory work and … ## 4 i am writing a review for mr. jay reid. while he is a great person, unfortun… ## 5 chelsea ross performed at a high level this past year. she is one of the mos… ## 6 eloise foster is a follower rather than a leader. while she does complete he… ## 7 eloise is a very capable worker and always produces excellent work. she alwa… ## 8 dennis is a consistently reliable as an employee. his work product is always… ## 9 kieran is a linchpin of the team, and a dependable worker. can be trusted wi… ## 10 while i have found some issues with logan performance in the past, specifica… ## # … with 78 more rows ``` ] --- count: false .panel1-sw1a-auto[ ```r employee_responses %>% select(feedback) %>% # Select the column with open ended responses mutate(feedback = textclean::replace_non_ascii(feedback)) %>% # Convert to a standard format mutate(feedback = str_to_lower(feedback)) %>% # Convert all words to lower case mutate(feedback = str_remove_all(feedback, "'s")) %>% # Remove all cases of `s * mutate(feedback = str_remove_all(feedback, "[[:digit:]]")) # Remove all numbers ``` ] .panel2-sw1a-auto[ ``` ## # A tibble: 88 × 1 ## feedback ## <chr> ## 1 requires additional scope[e. willingness to self start. outperforms task. mo… ## 2 basically briley mcknight have high potential in his work experience. but in… ## 3 emerson rose is a fairly average worker. she puts out satisfactory work and … ## 4 i am writing a review for mr. jay reid. while he is a great person, unfortun… ## 5 chelsea ross performed at a high level this past year. she is one of the mos… ## 6 eloise foster is a follower rather than a leader. while she does complete he… ## 7 eloise is a very capable worker and always produces excellent work. she alwa… ## 8 dennis is a consistently reliable as an employee. his work product is always… ## 9 kieran is a linchpin of the team, and a dependable worker. 
can be trusted wi… ## 10 while i have found some issues with logan performance in the past, specifica… ## # … with 78 more rows ``` ] --- count: false .panel1-sw1a-auto[ ```r employee_responses %>% select(feedback) %>% # Select the column with open ended responses mutate(feedback = textclean::replace_non_ascii(feedback)) %>% # Convert to a standard format mutate(feedback = str_to_lower(feedback)) %>% # Convert all words to lower case mutate(feedback = str_remove_all(feedback, "'s")) %>% # Remove all cases of `s mutate(feedback = str_remove_all(feedback, "[[:digit:]]")) %>% # Remove all numbers * mutate(feedback = str_remove_all(feedback, "[[:punct:]]")) # Remove all punctuation ``` ] .panel2-sw1a-auto[ ``` ## # A tibble: 88 × 1 ## feedback ## <chr> ## 1 requires additional scopee willingness to self start outperforms task motiva… ## 2 basically briley mcknight have high potential in his work experience but in … ## 3 emerson rose is a fairly average worker she puts out satisfactory work and s… ## 4 i am writing a review for mr jay reid while he is a great person unfortunate… ## 5 chelsea ross performed at a high level this past year she is one of the most… ## 6 eloise foster is a follower rather than a leader while she does complete her… ## 7 eloise is a very capable worker and always produces excellent work she alway… ## 8 dennis is a consistently reliable as an employee his work product is always … ## 9 kieran is a linchpin of the team and a dependable worker can be trusted with… ## 10 while i have found some issues with logan performance in the past specifical… ## # … with 78 more rows ``` ] --- count: false .panel1-sw1a-auto[ ```r employee_responses %>% select(feedback) %>% # Select the column with open ended responses mutate(feedback = textclean::replace_non_ascii(feedback)) %>% # Convert to a standard format mutate(feedback = str_to_lower(feedback)) %>% # Convert all words to lower case mutate(feedback = str_remove_all(feedback, "'s")) %>% # Remove all cases 
of `s mutate(feedback = str_remove_all(feedback, "[[:digit:]]")) %>% # Remove all numbers mutate(feedback = str_remove_all(feedback, "[[:punct:]]")) %>% # Remove all punctuation # Remove all instances from a separate list * mutate(feedback = str_remove_all(feedback, paste0("\\b", common_names, "\\b", collapse = "|"))) ``` ] .panel2-sw1a-auto[ ``` ## # A tibble: 88 × 1 ## feedback ## <chr> ## 1 "requires additional scopee willingness to self start outperforms task motiv… ## 2 "basically mcknight have high potential his work experience but recent da… ## 3 " is a fairly average worker puts out satisfactory work and shows potent… ## 4 "i am writing a review for while he is a person unfortunately that doesn… ## 5 " performed at a high level this past year is one of most reliable and so… ## 6 " is a follower rather a leader while does complete tasks a timely mann… ## 7 " is a very capable worker and always produces excellent work always finish… ## 8 " is a consistently reliable as employee his work product is always above … ## 9 " is a linchpin of team and a dependable worker trusted with tasks of mod… ## 10 "while i have found some issues with performance past specifically a lack… ## # … with 78 more rows ``` ] --- count: false .panel1-sw1a-auto[ ```r employee_responses %>% select(feedback) %>% # Select the column with open ended responses mutate(feedback = textclean::replace_non_ascii(feedback)) %>% # Convert to a standard format mutate(feedback = str_to_lower(feedback)) %>% # Convert all words to lower case mutate(feedback = str_remove_all(feedback, "'s")) %>% # Remove all cases of `s mutate(feedback = str_remove_all(feedback, "[[:digit:]]")) %>% # Remove all numbers mutate(feedback = str_remove_all(feedback, "[[:punct:]]")) %>% # Remove all punctuation # Remove all instances from a separate list mutate(feedback = str_remove_all(feedback, paste0("\\b", common_names, "\\b", collapse = "|"))) %>% # Remove any additional terms manually * mutate(feedback = 
str_remove_all(feedback, "mcknight|cook|cunningham|hahn|vargas")) ``` ] .panel2-sw1a-auto[ ``` ## # A tibble: 88 × 1 ## feedback ## <chr> ## 1 "requires additional scopee willingness to self start outperforms task motiv… ## 2 "basically have high potential his work experience but recent days he is… ## 3 " is a fairly average worker puts out satisfactory work and shows potent… ## 4 "i am writing a review for while he is a person unfortunately that doesn… ## 5 " performed at a high level this past year is one of most reliable and so… ## 6 " is a follower rather a leader while does complete tasks a timely mann… ## 7 " is a very capable worker and always produces excellent work always finish… ## 8 " is a consistently reliable as employee his work product is always above … ## 9 " is a linchpin of team and a dependable worker trusted with tasks of mod… ## 10 "while i have found some issues with performance past specifically a lack… ## # … with 78 more rows ``` ] --- count: false .panel1-sw1a-auto[ ```r employee_responses %>% select(feedback) %>% # Select the column with open ended responses mutate(feedback = textclean::replace_non_ascii(feedback)) %>% # Convert to a standard format mutate(feedback = str_to_lower(feedback)) %>% # Convert all words to lower case mutate(feedback = str_remove_all(feedback, "'s")) %>% # Remove all cases of `s mutate(feedback = str_remove_all(feedback, "[[:digit:]]")) %>% # Remove all numbers mutate(feedback = str_remove_all(feedback, "[[:punct:]]")) %>% # Remove all punctuation # Remove all instances from a separate list mutate(feedback = str_remove_all(feedback, paste0("\\b", common_names, "\\b", collapse = "|"))) %>% # Remove any additional terms manually mutate(feedback = str_remove_all(feedback, "mcknight|cook|cunningham|hahn|vargas")) %>% * mutate(feedback = lemmatize_strings(feedback)) # Lemmatize terms ``` ] .panel2-sw1a-auto[ ``` ## # A tibble: 88 × 1 ## feedback ## <chr> ## 1 require additional scopee willingness to self start 
outperform task motivate… ## 2 basically have high potential his work experience but recent day he be do pe… ## 3 be a fairly average worker put out satisfactory work and show potential i wo… ## 4 i be write a review for while he be a person unfortunately that doesnt show … ## 5 perform at a high level this past year be one of much reliable and solid per… ## 6 be a follower rather a leader while do complete task a timely manner do with… ## 7 be a very capable worker and always produce excellent work always finish wor… ## 8 be a consistently reliable as employee his work product be always above and … ## 9 be a linchpin of team and a dependable worker trust with task of moderate co… ## 10 while i have find some issue with performance past specifically a lack of at… ## # … with 78 more rows ``` ] --- count: false .panel1-sw1a-auto[ ```r employee_responses %>% select(feedback) %>% # Select the column with open ended responses mutate(feedback = textclean::replace_non_ascii(feedback)) %>% # Convert to a standard format mutate(feedback = str_to_lower(feedback)) %>% # Convert all words to lower case mutate(feedback = str_remove_all(feedback, "'s")) %>% # Remove all cases of `s mutate(feedback = str_remove_all(feedback, "[[:digit:]]")) %>% # Remove all numbers mutate(feedback = str_remove_all(feedback, "[[:punct:]]")) %>% # Remove all punctuation # Remove all instances from a separate list mutate(feedback = str_remove_all(feedback, paste0("\\b", common_names, "\\b", collapse = "|"))) %>% # Remove any additional terms manually mutate(feedback = str_remove_all(feedback, "mcknight|cook|cunningham|hahn|vargas")) %>% mutate(feedback = lemmatize_strings(feedback)) %>% # Lemmatize terms * mutate(feedback = str_squish(feedback)) # Remove whitespace ``` ] .panel2-sw1a-auto[ ``` ## # A tibble: 88 × 1 ## feedback ## <chr> ## 1 require additional scopee willingness to self start outperform task motivate… ## 2 basically have high potential his work experience but recent day he be do pe… 
## 3 be a fairly average worker put out satisfactory work and show potential i wo… ## 4 i be write a review for while he be a person unfortunately that doesnt show … ## 5 perform at a high level this past year be one of much reliable and solid per… ## 6 be a follower rather a leader while do complete task a timely manner do with… ## 7 be a very capable worker and always produce excellent work always finish wor… ## 8 be a consistently reliable as employee his work product be always above and … ## 9 be a linchpin of team and a dependable worker trust with task of moderate co… ## 10 while i have find some issue with performance past specifically a lack of at… ## # … with 78 more rows ``` ] --- count: false .panel1-sw1a-auto[ ```r employee_responses %>% select(feedback) %>% # Select the column with open ended responses mutate(feedback = textclean::replace_non_ascii(feedback)) %>% # Convert to a standard format mutate(feedback = str_to_lower(feedback)) %>% # Convert all words to lower case mutate(feedback = str_remove_all(feedback, "'s")) %>% # Remove all cases of `s mutate(feedback = str_remove_all(feedback, "[[:digit:]]")) %>% # Remove all numbers mutate(feedback = str_remove_all(feedback, "[[:punct:]]")) %>% # Remove all punctuation # Remove all instances from a separate list mutate(feedback = str_remove_all(feedback, paste0("\\b", common_names, "\\b", collapse = "|"))) %>% # Remove any additional terms manually mutate(feedback = str_remove_all(feedback, "mcknight|cook|cunningham|hahn|vargas")) %>% mutate(feedback = lemmatize_strings(feedback)) %>% # Lemmatize terms mutate(feedback = str_squish(feedback)) %>% # Remove whitespace * mutate(feedback = na_if(feedback, "")) # Replace blanks with NA ``` ] .panel2-sw1a-auto[ ``` ## # A tibble: 88 × 1 ## feedback ## <chr> ## 1 require additional scopee willingness to self start outperform task motivate… ## 2 basically have high potential his work experience but recent day he be do pe… ## 3 be a fairly average worker put out 
satisfactory work and show potential i wo… ## 4 i be write a review for while he be a person unfortunately that doesnt show … ## 5 perform at a high level this past year be one of much reliable and solid per… ## 6 be a follower rather a leader while do complete task a timely manner do with… ## 7 be a very capable worker and always produce excellent work always finish wor… ## 8 be a consistently reliable as employee his work product be always above and … ## 9 be a linchpin of team and a dependable worker trust with task of moderate co… ## 10 while i have find some issue with performance past specifically a lack of at… ## # … with 78 more rows ``` ] --- count: false .panel1-sw1a-auto[ ```r employee_responses %>% select(feedback) %>% # Select the column with open ended responses mutate(feedback = textclean::replace_non_ascii(feedback)) %>% # Convert to a standard format mutate(feedback = str_to_lower(feedback)) %>% # Convert all words to lower case mutate(feedback = str_remove_all(feedback, "'s")) %>% # Remove all cases of `s mutate(feedback = str_remove_all(feedback, "[[:digit:]]")) %>% # Remove all numbers mutate(feedback = str_remove_all(feedback, "[[:punct:]]")) %>% # Remove all punctuation # Remove all instances from a separate list mutate(feedback = str_remove_all(feedback, paste0("\\b", common_names, "\\b", collapse = "|"))) %>% # Remove any additional terms manually mutate(feedback = str_remove_all(feedback, "mcknight|cook|cunningham|hahn|vargas")) %>% mutate(feedback = lemmatize_strings(feedback)) %>% # Lemmatize terms mutate(feedback = str_squish(feedback)) %>% # Remove whitespace mutate(feedback = na_if(feedback, "")) %>% # Replace blanks with NA * drop_na() # Drop all rows containing NA ``` ] .panel2-sw1a-auto[ ``` ## # A tibble: 88 × 1 ## feedback ## <chr> ## 1 require additional scopee willingness to self start outperform task motivate… ## 2 basically have high potential his work experience but recent day he be do pe… ## 3 be a fairly average worker put out
satisfactory work and show potential i wo… ## 4 i be write a review for while he be a person unfortunately that doesnt show … ## 5 perform at a high level this past year be one of much reliable and solid per… ## 6 be a follower rather a leader while do complete task a timely manner do with… ## 7 be a very capable worker and always produce excellent work always finish wor… ## 8 be a consistently reliable as employee his work product be always above and … ## 9 be a linchpin of team and a dependable worker trust with task of moderate co… ## 10 while i have find some issue with performance past specifically a lack of at… ## # … with 78 more rows ``` ] <style> .panel1-sw1a-auto { color: white; width: 98%; hight: 32%; float: top; padding-left: 1%; font-size: 80% } .panel2-sw1a-auto { color: white; width: 0%; hight: 32%; float: top; padding-left: 1%; font-size: 80% } .panel3-sw1a-auto { color: white; width: NA%; hight: 33%; float: top; padding-left: 1%; font-size: 80% } </style> .footnote[ <div style="text-align: center; color: #212121; border:1px; border-style:solid; border-color:#c7f9f6; background-color: #c7f9f6; border-radius: 15px; padding: 0.65em; width:fit-content;"> Please note that <i>removing all instances from a separate list</i> may take up to a minute to complete </div> ] --- ### Assigning a Variable Let's save the entire cleaning process ```r responses_cleaned <- employee_responses %>% select(feedback) %>% mutate(feedback = textclean::replace_non_ascii(feedback)) %>% mutate(feedback = str_to_lower(feedback)) %>% mutate(feedback = str_remove_all(feedback, "'s")) %>% mutate(feedback = str_remove_all(feedback, "[[:digit:]]")) %>% mutate(feedback = str_remove_all(feedback, "[[:punct:]]")) %>% mutate(feedback = str_remove_all(feedback, paste0("\\b", common_names, "\\b", collapse = "|"))) %>% mutate(feedback = str_remove_all(feedback, "mcknight|cook|cunningham|hahn|vargas")) %>% mutate(feedback = lemmatize_strings(feedback)) %>% mutate(feedback = 
str_squish(feedback)) %>% mutate(feedback = na_if(feedback, "")) %>% drop_na() ``` --- ### What Just Happened? -- Let's try doing something similar but with shorter and simpler text taken from the very funny skit [Sharknado Pitch Meeting](https://youtu.be/CYootnc0uew) ```r example_text <- c("Excerpt from Sharknado Pitch Meeting. Creator: Ryan George. (1) It’s peer reviewed. (2) Multiple scientists looked over that and approved of it? (3) No some drunk guy on the pier checked it out. He loved it! (4) That is technically peer reviewed. I think we’re good. --The End-- ") ``` --- 1. Take a look at the raw text data .remarkwidth[ ```r example_text ``` ``` ## [1] "Excerpt from Sharknado Pitch Meeting. \n Creator: Ryan George. \n \n (1) It’s peer reviewed. \n (2) Multiple scientists looked over that and approved of it? \n (3) No some drunk guy on the pier checked it out. He loved it!\n (4) That is technically peer reviewed. I think we’re good.\n \n --The End--\n " ``` ] -- 2. Then we wrangle using a very similar process --- count: false .panel1-sw1b-auto[ ```r *example_text ``` ] .panel2-sw1b-auto[ ``` ## [1] "Excerpt from Sharknado Pitch Meeting. \n Creator: Ryan George. \n \n (1) It’s peer reviewed. \n (2) Multiple scientists looked over that and approved of it? \n (3) No some drunk guy on the pier checked it out. He loved it!\n (4) That is technically peer reviewed. I think we’re good.\n \n --The End--\n " ``` ] --- count: false .panel1-sw1b-auto[ ```r example_text %>% * read_lines() # Parse text into individual lines ``` ] .panel2-sw1b-auto[ ``` ## [1] "Excerpt from Sharknado Pitch Meeting. " ## [2] " Creator: Ryan George. " ## [3] " " ## [4] " (1) It’s peer reviewed. " ## [5] " (2) Multiple scientists looked over that and approved of it? " ## [6] " (3) No some drunk guy on the pier checked it out. He loved it!" ## [7] " (4) That is technically peer reviewed. I think we’re good." 
## [8] " " ## [9] " --The End--" ## [10] " " ``` ] --- count: false .panel1-sw1b-auto[ ```r example_text %>% read_lines() %>% # Parse text into individual lines * as_tibble_col("text") # Create a single tidy column ``` ] .panel2-sw1b-auto[ ``` ## # A tibble: 10 × 1 ## text ## <chr> ## 1 "Excerpt from Sharknado Pitch Meeting. " ## 2 " Creator: Ryan George. " ## 3 " " ## 4 " (1) It’s peer reviewed. " ## 5 " (2) Multiple scientists looked over that and approved of it? " ## 6 " (3) No some drunk guy on the pier checked it out. He loved it!" ## 7 " (4) That is technically peer reviewed. I think we’re good." ## 8 " " ## 9 " --The End--" ## 10 " " ``` ] --- count: false .panel1-sw1b-auto[ ```r example_text %>% read_lines() %>% # Parse text into individual lines as_tibble_col("text") %>% # Create a single tidy column * slice(4:n()) # Remove unnecessary text ``` ] .panel2-sw1b-auto[ ``` ## # A tibble: 7 × 1 ## text ## <chr> ## 1 " (1) It’s peer reviewed. " ## 2 " (2) Multiple scientists looked over that and approved of it? " ## 3 " (3) No some drunk guy on the pier checked it out. He loved it!" ## 4 " (4) That is technically peer reviewed. I think we’re good." ## 5 " " ## 6 " --The End--" ## 7 " " ``` ] --- count: false .panel1-sw1b-auto[ ```r example_text %>% read_lines() %>% # Parse text into individual lines as_tibble_col("text") %>% # Create a single tidy column slice(4:n()) %>% # Remove unnecessary text * mutate(text = textclean::replace_non_ascii(text)) # Convert to a standard format ``` ] .panel2-sw1b-auto[ ``` ## # A tibble: 7 × 1 ## text ## <chr> ## 1 "(1) It's peer reviewed." ## 2 "(2) Multiple scientists looked over that and approved of it?" ## 3 "(3) No some drunk guy on the pier checked it out. He loved it!" ## 4 "(4) That is technically peer reviewed. I think we're good." 
## 5 "" ## 6 "--The End--" ## 7 "" ``` ] --- count: false .panel1-sw1b-auto[ ```r example_text %>% read_lines() %>% # Parse text into individual lines as_tibble_col("text") %>% # Create a single tidy column slice(4:n()) %>% # Remove unnecessary text mutate(text = textclean::replace_non_ascii(text)) %>% # Convert to a standard format * mutate(text = str_to_lower(text)) # Convert all words to lower case ``` ] .panel2-sw1b-auto[ ``` ## # A tibble: 7 × 1 ## text ## <chr> ## 1 "(1) it's peer reviewed." ## 2 "(2) multiple scientists looked over that and approved of it?" ## 3 "(3) no some drunk guy on the pier checked it out. he loved it!" ## 4 "(4) that is technically peer reviewed. i think we're good." ## 5 "" ## 6 "--the end--" ## 7 "" ``` ] --- count: false .panel1-sw1b-auto[ ```r example_text %>% read_lines() %>% # Parse text into individual lines as_tibble_col("text") %>% # Create a single tidy column slice(4:n()) %>% # Remove unnecessary text mutate(text = textclean::replace_non_ascii(text)) %>% # Convert to a standard format mutate(text = str_to_lower(text)) %>% # Convert all words to lower case * mutate(text = str_remove_all(text, "'s")) # Remove all cases of `s ``` ] .panel2-sw1b-auto[ ``` ## # A tibble: 7 × 1 ## text ## <chr> ## 1 "(1) it peer reviewed." ## 2 "(2) multiple scientists looked over that and approved of it?" ## 3 "(3) no some drunk guy on the pier checked it out. he loved it!" ## 4 "(4) that is technically peer reviewed. i think we're good." 
## 5 "" ## 6 "--the end--" ## 7 "" ``` ] --- count: false .panel1-sw1b-auto[ ```r example_text %>% read_lines() %>% # Parse text into individual lines as_tibble_col("text") %>% # Create a single tidy column slice(4:n()) %>% # Remove unnecessary text mutate(text = textclean::replace_non_ascii(text)) %>% # Convert to a standard format mutate(text = str_to_lower(text)) %>% # Convert all words to lower case mutate(text = str_remove_all(text, "'s")) %>% # Remove all cases of `s * mutate(text = str_remove_all(text, "[[:digit:]]")) # Remove all numbers ``` ] .panel2-sw1b-auto[ ``` ## # A tibble: 7 × 1 ## text ## <chr> ## 1 "() it peer reviewed." ## 2 "() multiple scientists looked over that and approved of it?" ## 3 "() no some drunk guy on the pier checked it out. he loved it!" ## 4 "() that is technically peer reviewed. i think we're good." ## 5 "" ## 6 "--the end--" ## 7 "" ``` ] --- count: false .panel1-sw1b-auto[ ```r example_text %>% read_lines() %>% # Parse text into individual lines as_tibble_col("text") %>% # Create a single tidy column slice(4:n()) %>% # Remove unnecessary text mutate(text = textclean::replace_non_ascii(text)) %>% # Convert to a standard format mutate(text = str_to_lower(text)) %>% # Convert all words to lower case mutate(text = str_remove_all(text, "'s")) %>% # Remove all cases of `s mutate(text = str_remove_all(text, "[[:digit:]]")) %>% # Remove all numbers * mutate(text = str_remove_all(text, "[[:punct:]]")) # Remove all punctuation ``` ] .panel2-sw1b-auto[ ``` ## # A tibble: 7 × 1 ## text ## <chr> ## 1 " it peer reviewed" ## 2 " multiple scientists looked over that and approved of it" ## 3 " no some drunk guy on the pier checked it out he loved it" ## 4 " that is technically peer reviewed i think were good" ## 5 "" ## 6 "the end" ## 7 "" ``` ] --- count: false .panel1-sw1b-auto[ ```r example_text %>% read_lines() %>% # Parse text into individual lines as_tibble_col("text") %>% # Create a single tidy column slice(4:n()) %>% # Remove 
unnecessary text mutate(text = textclean::replace_non_ascii(text)) %>% # Convert to a standard format mutate(text = str_to_lower(text)) %>% # Convert all words to lower case mutate(text = str_remove_all(text, "'s")) %>% # Remove all cases of `s mutate(text = str_remove_all(text, "[[:digit:]]")) %>% # Remove all numbers mutate(text = str_remove_all(text, "[[:punct:]]")) %>% # Remove all punctuation * mutate(text = str_remove_all(text, "the end")) # Remove term ``` ] .panel2-sw1b-auto[ ``` ## # A tibble: 7 × 1 ## text ## <chr> ## 1 " it peer reviewed" ## 2 " multiple scientists looked over that and approved of it" ## 3 " no some drunk guy on the pier checked it out he loved it" ## 4 " that is technically peer reviewed i think were good" ## 5 "" ## 6 "" ## 7 "" ``` ] --- count: false .panel1-sw1b-auto[ ```r example_text %>% read_lines() %>% # Parse text into individual lines as_tibble_col("text") %>% # Create a single tidy column slice(4:n()) %>% # Remove unnecessary text mutate(text = textclean::replace_non_ascii(text)) %>% # Convert to a standard format mutate(text = str_to_lower(text)) %>% # Convert all words to lower case mutate(text = str_remove_all(text, "'s")) %>% # Remove all cases of `s mutate(text = str_remove_all(text, "[[:digit:]]")) %>% # Remove all numbers mutate(text = str_remove_all(text, "[[:punct:]]")) %>% # Remove all punctuation mutate(text = str_remove_all(text, "the end")) %>% # Remove term * mutate(text = str_replace_all(text, "multiple scientists", "scientists")) # Replace term ``` ] .panel2-sw1b-auto[ ``` ## # A tibble: 7 × 1 ## text ## <chr> ## 1 " it peer reviewed" ## 2 " scientists looked over that and approved of it" ## 3 " no some drunk guy on the pier checked it out he loved it" ## 4 " that is technically peer reviewed i think were good" ## 5 "" ## 6 "" ## 7 "" ``` ] --- count: false .panel1-sw1b-auto[ ```r example_text %>% read_lines() %>% # Parse text into individual lines as_tibble_col("text") %>% # Create a single tidy column 
slice(4:n()) %>% # Remove unnecessary text mutate(text = textclean::replace_non_ascii(text)) %>% # Convert to a standard format mutate(text = str_to_lower(text)) %>% # Convert all words to lower case mutate(text = str_remove_all(text, "'s")) %>% # Remove all cases of `s mutate(text = str_remove_all(text, "[[:digit:]]")) %>% # Remove all numbers mutate(text = str_remove_all(text, "[[:punct:]]")) %>% # Remove all punctuation mutate(text = str_remove_all(text, "the end")) %>% # Remove term mutate(text = str_replace_all(text, "multiple scientists", "scientists")) %>% # Replace term * mutate(text = str_replace_all(text, "it", "paper")) # Replace term ``` ] .panel2-sw1b-auto[ ``` ## # A tibble: 7 × 1 ## text ## <chr> ## 1 " paper peer reviewed" ## 2 " scientists looked over that and approved of paper" ## 3 " no some drunk guy on the pier checked paper out he loved paper" ## 4 " that is technically peer reviewed i think were good" ## 5 "" ## 6 "" ## 7 "" ``` ] --- count: false .panel1-sw1b-auto[ ```r example_text %>% read_lines() %>% # Parse text into individual lines as_tibble_col("text") %>% # Create a single tidy column slice(4:n()) %>% # Remove unnecessary text mutate(text = textclean::replace_non_ascii(text)) %>% # Convert to a standard format mutate(text = str_to_lower(text)) %>% # Convert all words to lower case mutate(text = str_remove_all(text, "'s")) %>% # Remove all cases of `s mutate(text = str_remove_all(text, "[[:digit:]]")) %>% # Remove all numbers mutate(text = str_remove_all(text, "[[:punct:]]")) %>% # Remove all punctuation mutate(text = str_remove_all(text, "the end")) %>% # Remove term mutate(text = str_replace_all(text, "multiple scientists", "scientists")) %>% # Replace term mutate(text = str_replace_all(text, "it", "paper")) %>% # Replace term * mutate(text = str_replace_all(text, "that", "paper")) # Replace term ``` ] .panel2-sw1b-auto[ ``` ## # A tibble: 7 × 1 ## text ## <chr> ## 1 " paper peer reviewed" ## 2 " scientists looked over paper and 
approved of paper" ## 3 " no some drunk guy on the pier checked paper out he loved paper" ## 4 " paper is technically peer reviewed i think were good" ## 5 "" ## 6 "" ## 7 "" ``` ] --- count: false .panel1-sw1b-auto[ ```r example_text %>% read_lines() %>% # Parse text into individual lines as_tibble_col("text") %>% # Create a single tidy column slice(4:n()) %>% # Remove unnecessary text mutate(text = textclean::replace_non_ascii(text)) %>% # Convert to a standard format mutate(text = str_to_lower(text)) %>% # Convert all words to lower case mutate(text = str_remove_all(text, "'s")) %>% # Remove all cases of `s mutate(text = str_remove_all(text, "[[:digit:]]")) %>% # Remove all numbers mutate(text = str_remove_all(text, "[[:punct:]]")) %>% # Remove all punctuation mutate(text = str_remove_all(text, "the end")) %>% # Remove term mutate(text = str_replace_all(text, "multiple scientists", "scientists")) %>% # Replace term mutate(text = str_replace_all(text, "it", "paper")) %>% # Replace term mutate(text = str_replace_all(text, "that", "paper")) %>% # Replace term * mutate(text = lemmatize_strings(text)) # Lemmatize term ``` ] .panel2-sw1b-auto[ ``` ## # A tibble: 7 × 1 ## text ## <chr> ## 1 "paper peer review" ## 2 "scientist look over paper and approve of paper" ## 3 "no some drink guy on the pier check paper out he love paper" ## 4 "paper be technically peer review i think be good" ## 5 "" ## 6 "" ## 7 "" ``` ] --- count: false .panel1-sw1b-auto[ ```r example_text %>% read_lines() %>% # Parse text into individual lines as_tibble_col("text") %>% # Create a single tidy column slice(4:n()) %>% # Remove unnecessary text mutate(text = textclean::replace_non_ascii(text)) %>% # Convert to a standard format mutate(text = str_to_lower(text)) %>% # Convert all words to lower case mutate(text = str_remove_all(text, "'s")) %>% # Remove all cases of `s mutate(text = str_remove_all(text, "[[:digit:]]")) %>% # Remove all numbers mutate(text = str_remove_all(text, "[[:punct:]]")) 
%>% # Remove all punctuation mutate(text = str_remove_all(text, "the end")) %>% # Remove term mutate(text = str_replace_all(text, "multiple scientists", "scientists")) %>% # Replace term mutate(text = str_replace_all(text, "it", "paper")) %>% # Replace term mutate(text = str_replace_all(text, "that", "paper")) %>% # Replace term mutate(text = lemmatize_strings(text)) %>% # Lemmatize term * mutate(text = str_remove_all(text, c("paper"))) # Remove term ``` ] .panel2-sw1b-auto[ ``` ## # A tibble: 7 × 1 ## text ## <chr> ## 1 " peer review" ## 2 "scientist look over and approve of " ## 3 "no some drink guy on the pier check out he love " ## 4 " be technically peer review i think be good" ## 5 "" ## 6 "" ## 7 "" ``` ] --- count: false .panel1-sw1b-auto[ ```r example_text %>% read_lines() %>% # Parse text into individual lines as_tibble_col("text") %>% # Create a single tidy column slice(4:n()) %>% # Remove unnecessary text mutate(text = textclean::replace_non_ascii(text)) %>% # Convert to a standard format mutate(text = str_to_lower(text)) %>% # Convert all words to lower case mutate(text = str_remove_all(text, "'s")) %>% # Remove all cases of `s mutate(text = str_remove_all(text, "[[:digit:]]")) %>% # Remove all numbers mutate(text = str_remove_all(text, "[[:punct:]]")) %>% # Remove all punctuation mutate(text = str_remove_all(text, "the end")) %>% # Remove term mutate(text = str_replace_all(text, "multiple scientists", "scientists")) %>% # Replace term mutate(text = str_replace_all(text, "it", "paper")) %>% # Replace term mutate(text = str_replace_all(text, "that", "paper")) %>% # Replace term mutate(text = lemmatize_strings(text)) %>% # Lemmatize term mutate(text = str_remove_all(text, c("paper"))) %>% # Remove term * mutate(text = str_squish(text)) # Remove whitespace ``` ] .panel2-sw1b-auto[ ``` ## # A tibble: 7 × 1 ## text ## <chr> ## 1 "peer review" ## 2 "scientist look over and approve of" ## 3 "no some drink guy on the pier check out he love" ## 4 "be 
technically peer review i think be good" ## 5 "" ## 6 "" ## 7 "" ``` ] --- count: false .panel1-sw1b-auto[ ```r example_text %>% read_lines() %>% # Parse text into individual lines as_tibble_col("text") %>% # Create a single tidy column slice(4:n()) %>% # Remove unnecessary text mutate(text = textclean::replace_non_ascii(text)) %>% # Convert to a standard format mutate(text = str_to_lower(text)) %>% # Convert all words to lower case mutate(text = str_remove_all(text, "'s")) %>% # Remove all cases of `s mutate(text = str_remove_all(text, "[[:digit:]]")) %>% # Remove all numbers mutate(text = str_remove_all(text, "[[:punct:]]")) %>% # Remove all punctuation mutate(text = str_remove_all(text, "the end")) %>% # Remove term mutate(text = str_replace_all(text, "multiple scientists", "scientists")) %>% # Replace term mutate(text = str_replace_all(text, "it", "paper")) %>% # Replace term mutate(text = str_replace_all(text, "that", "paper")) %>% # Replace term mutate(text = lemmatize_strings(text)) %>% # Lemmatize term mutate(text = str_remove_all(text, c("paper"))) %>% # Remove term mutate(text = str_squish(text)) %>% # Remove whitespace * mutate(text = na_if(text, "")) # Replace blanks with NA ``` ] .panel2-sw1b-auto[ ``` ## # A tibble: 7 × 1 ## text ## <chr> ## 1 peer review ## 2 scientist look over and approve of ## 3 no some drink guy on the pier check out he love ## 4 be technically peer review i think be good ## 5 <NA> ## 6 <NA> ## 7 <NA> ``` ] --- count: false .panel1-sw1b-auto[ ```r example_text %>% read_lines() %>% # Parse text into individual lines as_tibble_col("text") %>% # Create a single tidy column slice(4:n()) %>% # Remove unnecessary text mutate(text = textclean::replace_non_ascii(text)) %>% # Convert to a standard format mutate(text = str_to_lower(text)) %>% # Convert all words to lower case mutate(text = str_remove_all(text, "'s")) %>% # Remove all cases of `s mutate(text = str_remove_all(text, "[[:digit:]]")) %>% # Remove all numbers mutate(text = 
str_remove_all(text, "[[:punct:]]")) %>% # Remove all punctuation mutate(text = str_remove_all(text, "the end")) %>% # Remove term mutate(text = str_replace_all(text, "multiple scientists", "scientists")) %>% # Replace term mutate(text = str_replace_all(text, "it", "paper")) %>% # Replace term mutate(text = str_replace_all(text, "that", "paper")) %>% # Replace term mutate(text = lemmatize_strings(text)) %>% # Lemmatize term mutate(text = str_remove_all(text, c("paper"))) %>% # Remove term mutate(text = str_squish(text)) %>% # Remove whitespace mutate(text = na_if(text, "")) %>% # Replace blanks with NA * drop_na() # Drop all rows with NA ``` ] .panel2-sw1b-auto[ ``` ## # A tibble: 4 × 1 ## text ## <chr> ## 1 peer review ## 2 scientist look over and approve of ## 3 no some drink guy on the pier check out he love ## 4 be technically peer review i think be good ``` ] <style> .panel1-sw1b-auto { color: white; width: 98%; height: 32%; float: top; padding-left: 1%; font-size: 80% } .panel2-sw1b-auto { color: white; width: 0%; height: 32%; float: top; padding-left: 1%; font-size: 80% } .panel3-sw1b-auto { color: white; width: NA%; height: 33%; float: top; padding-left: 1%; font-size: 80% } </style> --- <span style = "font-size:1.75rem"><b>Normalization</b> of Remaining Wording</span> -- > is used to reduce randomness in wording, providing a level of standardization that reduces how much distinct information a computer has to process and thereby improves efficiency -- > the overall goal is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form -- > two popular normalization techniques are <span style="color:#f5ebd9; font-weight:bold; font-style:italic;">lemmatization</span> and <span style="color:#d9e3f5; font-weight:bold; font-style:italic;">stemming</span> --- ## <span style="color:#f5ebd9; font-weight:bold; font-style:italic;">Lemmatization</span> vs.
<span style="color:#d9e3f5; font-weight:bold; font-style:italic;">Stemming</span> -- <br> <br> .pull-left[ <p id="center" style="color:#f5ebd9; border:1px; border-style:solid; border-color:#f5ebd9; border-radius: 25px; padding: 0.3em; margin-top: -6px"> <span style = "font-weight:bold; font-style:italic;">Lemmatization</span><br><br> the process of reducing words to their base form<br><br> (takes more time) </p> ] -- .pull-right[ <p id="center" style="color:#d9e3f5; border:1px; border-style:solid; border-color:#d9e3f5; border-radius: 25px; padding: 0.3em; margin-top: -6px"> <span style = "font-weight:bold; font-style:italic;">Stemming</span><br><br> the process of reducing words to their word stem or root form by removing word endings or other affixes<br><br> (takes less time) </p> ] -- <br> .pull-left[ <p id="center" style="color:#f5ebd9; border:1px; border-style:solid; border-color:#f5ebd9; border-radius: 25px; padding: 0.3em; margin-top: -6px"> <i>Example</i><br><br> the term <i>better</i> has the lemma <i>good</i> </p> ] -- .pull-right[ <p id="center" style="color:#d9e3f5; border:1px; border-style:solid; border-color:#d9e3f5; border-radius: 25px; padding: 0.3em; margin-top: -6px"> <i>Example</i><br><br> the term <i>flooding</i> has the stem <i>flood</i> </p> ] -- .footnote[For a great rundown of this topic, avoid the syntax and read over [Text Normalization for Natural Language Processing (NLP)](https://towardsdatascience.com/text-normalization-for-natural-language-processing-nlp-70a314bfa646)] --- <br> <br> <br> <br> <br> <br> <br> <br> <center> <img src="img/lemmastem.png" alt="Lemmatization v Stemming Table Example" width='700'> </center> --- <span style = "font-size:1.75rem"><b>Tokenizing</b> the Cleaned Data</span> -- .center2[<i>A process of distinguishing and classifying sections of a string of input characters</i>] -- <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> <br> What you should take from this is that data must be successfully **unnested** before it can be **tokenized**. While the next set of commands should look familiar, please consider taking a bit of time to really see what occurs in each step --- <span style = "font-size:1.75rem"><b>Filtering Stopwords</b></span> --- count: false .panel1-sw2-auto[ ```r *responses_cleaned ``` ] .panel2-sw2-auto[ ``` ## # A tibble: 88 × 1 ## feedback ## <chr> ## 1 require additional scopee willingness to self start outperform task motivate… ## 2 basically have high potential his work experience but recent day he be do pe… ## 3 be a fairly average worker put out satisfactory work and show potential i wo… ## 4 i be write a review for while he be a person unfortunately that doesnt show … ## 5 perform at a high level this past year be one of much reliable and solid per… ## 6 be a follower rather a leader while do complete task a timely manner do with… ## 7 be a very capable worker and always produce excellent work always finish wor… ## 8 be a consistently reliable as employee his work product be always above and … ## 9 be a linchpin of team and a dependable worker trust with task of moderate co… ## 10 while i have find some issue with performance past specifically a lack of at… ## # … with 78 more rows ``` ] --- count: false .panel1-sw2-auto[ ```r responses_cleaned %>% * unnest_tokens(word, feedback) ``` ] .panel2-sw2-auto[ ``` ## # A tibble: 4,107 × 1 ## word ## <chr> ## 1 require ## 2 additional ## 3 scopee ## 4 willingness ## 5 to ## 6 self ## 7 start ## 8 outperform ## 9 task ## 10 motivate ## # … with 4,097 more rows ``` ] --- count: false .panel1-sw2-auto[ ```r responses_cleaned %>% unnest_tokens(word, feedback) %>% * anti_join(stop_words) ``` ] .panel2-sw2-auto[ ``` ## Joining, by = "word" ``` ``` ## # A tibble: 1,401 × 1 ## word ## <chr> ## 1 require ## 2 additional ## 3 scopee ## 4 willingness ## 5 start ## 6 outperform ## 7 task ## 8 motivate ## 9 exceed ## 10 basically ## # … with 1,391 more rows ``` ]
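---
count: false

Before moving on, it may help to see the set logic behind `anti_join(stop_words)` in miniature: it keeps only the rows whose `word` is absent from the stop-word lexicon. Here is a minimal base R sketch of the same idea, using a made-up token vector and stop list purely for illustration

```r
# Hypothetical tokens and a tiny stop-word list (illustration only)
tokens   <- c("require", "additional", "to", "self", "start", "the")
stoplist <- c("to", "the", "a", "and")

# Keep only tokens absent from the stop list --
# the same row-wise set difference that anti_join(stop_words) performs
kept <- tokens[!tokens %in% stoplist]
kept # c("require", "additional", "self", "start")
```

Note that `tidytext::stop_words` bundles several lexicons (onix, SMART, and snowball), so the number of rows removed depends on which lexicon you filter against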
--- count: false .panel1-sw2-auto[ ```r responses_cleaned %>% unnest_tokens(word, feedback) %>% anti_join(stop_words) %>% * count(word, sort = TRUE) ``` ] .panel2-sw2-auto[ ``` ## Joining, by = "word" ``` ``` ## # A tibble: 579 × 2 ## word n ## <chr> <int> ## 1 time 40 ## 2 improve 38 ## 3 team 31 ## 4 performance 27 ## 5 potential 26 ## 6 company 23 ## 7 task 23 ## 8 skill 18 ## 9 worker 17 ## 10 complete 15 ## # … with 569 more rows ``` ] --- count: false .panel1-sw2-auto[ ```r responses_cleaned %>% unnest_tokens(word, feedback) %>% anti_join(stop_words) %>% count(word, sort = TRUE) %>% * add_column(document = 1) ``` ] .panel2-sw2-auto[ ``` ## Joining, by = "word" ``` ``` ## # A tibble: 579 × 3 ## word n document ## <chr> <int> <dbl> ## 1 time 40 1 ## 2 improve 38 1 ## 3 team 31 1 ## 4 performance 27 1 ## 5 potential 26 1 ## 6 company 23 1 ## 7 task 23 1 ## 8 skill 18 1 ## 9 worker 17 1 ## 10 complete 15 1 ## # … with 569 more rows ``` ] <style> .panel1-sw2-auto { color: white; width: 49%; height: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-sw2-auto { color: white; width: 49%; height: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-sw2-auto { color: white; width: NA%; height: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ### Assigning a Variable Let's save the tokenized data frame ```r responses_tokens <- responses_cleaned %>% unnest_tokens(word, feedback) %>% anti_join(stop_words) %>% count(word, sort = TRUE) %>% add_column(document = 1) ``` ``` ## Joining, by = "word" ``` --- .center2[<b><span style = "font-size:2.75rem">Step 3: Statistical Classification and Modeling</span></b>] --- <span style = "font-size:1.75rem">Creating a <b>Term Document Matrix</b></span> --- count: false .panel1-sw3-auto[ ```r *responses_tokens ``` ] .panel2-sw3-auto[ ``` ## # A tibble: 579 × 3 ## word n document ## <chr> <int> <dbl> ## 1 time 40 1 ## 2 improve 38 1 ## 3 team 31 1 ## 4 performance 27 1 ## 5 potential 26 1 ## 6 company 23 1 ##
7 task 23 1 ## 8 skill 18 1 ## 9 worker 17 1 ## 10 complete 15 1 ## # … with 569 more rows ``` ] --- count: false .panel1-sw3-auto[ ```r responses_tokens %>% * cast_dtm(document, word, n) ``` ] .panel2-sw3-auto[ ``` ## <<DocumentTermMatrix (documents: 1, terms: 579)>> ## Non-/sparse entries: 579/0 ## Sparsity : 0% ## Maximal term length: 16 ## Weighting : term frequency (tf) ``` ] <style> .panel1-sw3-auto { color: white; width: 49%; height: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-sw3-auto { color: white; width: 49%; height: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-sw3-auto { color: white; width: NA%; height: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ### Assigning a Variable ```r responses_dtm <- responses_tokens %>% cast_dtm(document, word, n) ``` --- <span style = "font-size:1.75rem">Calculating <b>Coherence Scores</b></span> -- .center2[<i>A measure of the degree of semantic similarity between the high-scoring words in a topic. These measurements help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference</i>] --- count: false .panel1-sw4-auto[ ```r *FindTopicsNumber( * responses_dtm, * topics = seq(from = 2, to = 20, by = 1), * metrics = c("Griffiths2004", * "CaoJuan2009", * "Arun2010", * "Deveaud2014"), * method = "Gibbs", * control = list(seed = 77), * mc.cores = 2L, * verbose = TRUE * ) ``` ] .panel2-sw4-auto[ ``` ## fit models... done. ## calculate metrics: ## Griffiths2004... done. ## CaoJuan2009... done. ## Arun2010... done. ## Deveaud2014... done.
``` ``` ## topics Griffiths2004 CaoJuan2009 Arun2010 Deveaud2014 ## 1 20 -7911.127 0.2062964 1.51890626 0.6168028 ## 2 19 -7904.556 0.1967236 1.37525414 0.6470573 ## 3 18 -7901.925 0.1997969 1.32153681 0.6535171 ## 4 17 -7926.187 0.1930143 1.17535430 0.6768608 ## 5 16 -7896.660 0.1816866 1.01856724 0.7397206 ## 6 15 -7909.777 0.1830048 0.92347327 0.7523600 ## 7 14 -7882.658 0.1677419 0.76471092 0.8092577 ## 8 13 -7879.697 0.1625309 0.59618494 0.8577134 ## 9 12 -7910.403 0.1656523 0.55674661 0.8742824 ## 10 11 -7940.711 0.1386042 0.41817512 0.9698708 ## 11 10 -7957.690 0.1624183 0.43879594 0.9227275 ## 12 9 -7950.341 0.1318182 0.29442123 1.0585181 ## 13 8 -7949.358 0.1298083 0.08828607 1.0980934 ## 14 7 -7964.580 0.1212509 0.12133408 1.1361188 ## 15 6 -8001.784 0.1161049 0.18287414 1.1955308 ## 16 5 -8062.347 0.1152498 0.26711663 1.2464718 ## 17 4 -8167.468 0.1084004 0.38522266 1.3356483 ## 18 3 -8330.388 0.1203959 0.75165432 1.3910375 ## 19 2 -8631.373 0.1558351 1.33879013 1.3753280 ``` ] --- count: false .panel1-sw4-auto[ ```r FindTopicsNumber( responses_dtm, topics = seq(from = 2, to = 20, by = 1), metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"), method = "Gibbs", control = list(seed = 77), mc.cores = 2L, verbose = TRUE ) %>% * FindTopicsNumber_plot() ``` ] .panel2-sw4-auto[ ``` ## fit models... done. ## calculate metrics: ## Griffiths2004... done. ## CaoJuan2009... done. ## Arun2010... done. ## Deveaud2014... done. 
``` <img src="topic-modeling-pres_files/figure-html/sw4_auto_02_output-1.png" width="80%" /> ] <style> .panel1-sw4-auto { color: white; width: 44.1%; height: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-sw4-auto { color: white; width: 53.9%; height: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-sw4-auto { color: white; width: NA%; height: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ### Assigning a Variable .left-code[ ```r responses_topic_est <- FindTopicsNumber( responses_dtm, topics = seq(from = 2, to = 20, by = 1), # amend these metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"), method = "Gibbs", control = list(seed = 77), mc.cores = 2L, verbose = TRUE ) %>% FindTopicsNumber_plot() ``` ] -- .right-plot[ <img src="topic-modeling-pres_files/figure-html/topic-est-out-1.png" width="90%" /> ] --- .pull-left[ The estimated number of topics can be a single lowest value or a range of values. We find it by observing where the metric curves tend to plateau and draw as close to one another as possible along the horizontal axis. This is known as a limit<br><br> From the plot, the metric symbolized by + is diverging away from the rest. While it may head back towards the horizontal axis further along, the metrics symbolized by ◻, ○, and △ look to be closest between 5 and 6<br><br> We want to use the lowest possible count so let's start by modeling 5 topics! ] .pull-right[ <img src="topic-modeling-pres_files/figure-html/unnamed-chunk-13-1.png" width="85%" /> ] --- <span style = "font-size:1.75rem">Applying a <b>Generative Model</b></span> --- count: false .panel1-sw5-auto[ ```r *LDA(responses_dtm, * k = 5, # Number of topics * control = list(seed = 1234)) ``` ] .panel2-sw5-auto[ ``` ## A LDA_VEM topic model with 5 topics.
``` ] --- count: false .panel1-sw5-auto[ ```r LDA(responses_dtm, k = 5, # Number of topics control = list(seed = 1234)) %>% * tidy(matrix = "beta") ``` ] .panel2-sw5-auto[ ``` ## # A tibble: 2,895 × 3 ## topic term beta ## <int> <chr> <dbl> ## 1 1 time 0.0189 ## 2 2 time 0.0241 ## 3 3 time 0.0401 ## 4 4 time 0.0237 ## 5 5 time 0.0361 ## 6 1 improve 0.0292 ## 7 2 improve 0.00549 ## 8 3 improve 0.0190 ## 9 4 improve 0.0323 ## 10 5 improve 0.0495 ## # … with 2,885 more rows ``` ] <style> .panel1-sw5-auto { color: white; width: 49%; height: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-sw5-auto { color: white; width: 49%; height: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-sw5-auto { color: white; width: NA%; height: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ### Assigning a Variable Let's save the topics list ```r responses_topics <- LDA(responses_dtm, k = 5, # Amend this to test a certain number of topics control = list(seed = 1234)) %>% tidy(matrix = "beta") ``` --- .center2[<b><span style = "font-size:2.75rem">Step 4: Visualization and Interpretation</span></b>] --- ### Plot the Topics We'll use the 10 highest-weighted terms in each topic to characterize each potential topic <br> <br> <br> --- count: false .panel1-sw6-auto[ ```r *responses_topics ``` ] .panel2-sw6-auto[ ``` ## # A tibble: 2,895 × 3 ## topic term beta ## <int> <chr> <dbl> ## 1 1 time 0.0189 ## 2 2 time 0.0241 ## 3 3 time 0.0401 ## 4 4 time 0.0237 ## 5 5 time 0.0361 ## 6 1 improve 0.0292 ## 7 2 improve 0.00549 ## 8 3 improve 0.0190 ## 9 4 improve 0.0323 ## 10 5 improve 0.0495 ## # … with 2,885 more rows ``` ] --- count: false .panel1-sw6-auto[ ```r responses_topics %>% *group_by(topic) ``` ] .panel2-sw6-auto[ ``` ## # A tibble: 2,895 × 3 ## # Groups: topic [5] ## topic term beta ## <int> <chr> <dbl> ## 1 1 time 0.0189 ## 2 2 time 0.0241 ## 3 3 time 0.0401 ## 4 4 time 0.0237 ## 5 5 time 0.0361 ## 6 1 improve 0.0292 ## 7 2 improve 0.00549 ## 8 3 improve
0.0190 ## 9 4 improve 0.0323 ## 10 5 improve 0.0495 ## # … with 2,885 more rows ``` ] --- count: false .panel1-sw6-auto[ ```r responses_topics %>% group_by(topic) %>% *slice_max(beta, n = 10) ``` ] .panel2-sw6-auto[ ``` ## # A tibble: 50 × 3 ## # Groups: topic [5] ## topic term beta ## <int> <chr> <dbl> ## 1 1 performance 0.0314 ## 2 1 improve 0.0292 ## 3 1 team 0.0254 ## 4 1 worker 0.0247 ## 5 1 time 0.0189 ## 6 1 potential 0.0187 ## 7 1 skill 0.0152 ## 8 1 company 0.0148 ## 9 1 quality 0.0147 ## 10 1 task 0.0138 ## # … with 40 more rows ``` ] --- count: false .panel1-sw6-auto[ ```r responses_topics %>% group_by(topic) %>% slice_max(beta, n = 10) %>% *ungroup() ``` ] .panel2-sw6-auto[ ``` ## # A tibble: 50 × 3 ## topic term beta ## <int> <chr> <dbl> ## 1 1 performance 0.0314 ## 2 1 improve 0.0292 ## 3 1 team 0.0254 ## 4 1 worker 0.0247 ## 5 1 time 0.0189 ## 6 1 potential 0.0187 ## 7 1 skill 0.0152 ## 8 1 company 0.0148 ## 9 1 quality 0.0147 ## 10 1 task 0.0138 ## # … with 40 more rows ``` ] --- count: false .panel1-sw6-auto[ ```r responses_topics %>% group_by(topic) %>% slice_max(beta, n = 10) %>% ungroup() %>% *arrange(topic, -beta) ``` ] .panel2-sw6-auto[ ``` ## # A tibble: 50 × 3 ## topic term beta ## <int> <chr> <dbl> ## 1 1 performance 0.0314 ## 2 1 improve 0.0292 ## 3 1 team 0.0254 ## 4 1 worker 0.0247 ## 5 1 time 0.0189 ## 6 1 potential 0.0187 ## 7 1 skill 0.0152 ## 8 1 company 0.0148 ## 9 1 quality 0.0147 ## 10 1 task 0.0138 ## # … with 40 more rows ``` ] --- count: false .panel1-sw6-auto[ ```r responses_topics %>% group_by(topic) %>% slice_max(beta, n = 10) %>% ungroup() %>% arrange(topic, -beta) %>% *mutate(term = reorder_within(term, beta, topic)) ``` ] .panel2-sw6-auto[ ``` ## # A tibble: 50 × 3 ## topic term beta ## <int> <fct> <dbl> ## 1 1 performance___1 0.0314 ## 2 1 improve___1 0.0292 ## 3 1 team___1 0.0254 ## 4 1 worker___1 0.0247 ## 5 1 time___1 0.0189 ## 6 1 potential___1 0.0187 ## 7 1 skill___1 0.0152 ## 8 1 company___1 0.0148 ## 9 1 
quality___1 0.0147 ## 10 1 task___1 0.0138 ## # … with 40 more rows ``` ] --- count: false .panel1-sw6-auto[ ```r responses_topics %>% group_by(topic) %>% slice_max(beta, n = 10) %>% ungroup() %>% arrange(topic, -beta) %>% mutate(term = reorder_within(term, beta, topic)) %>% *ggplot(aes(beta, term, fill = factor(topic))) ``` ] .panel2-sw6-auto[ <img src="topic-modeling-pres_files/figure-html/sw6_auto_07_output-1.png" width="80%" /> ] --- count: false .panel1-sw6-auto[ ```r responses_topics %>% group_by(topic) %>% slice_max(beta, n = 10) %>% ungroup() %>% arrange(topic, -beta) %>% mutate(term = reorder_within(term, beta, topic)) %>% ggplot(aes(beta, term, fill = factor(topic))) + *geom_col(show.legend = FALSE) ``` ] .panel2-sw6-auto[ <img src="topic-modeling-pres_files/figure-html/sw6_auto_08_output-1.png" width="80%" /> ] --- count: false .panel1-sw6-auto[ ```r responses_topics %>% group_by(topic) %>% slice_max(beta, n = 10) %>% ungroup() %>% arrange(topic, -beta) %>% mutate(term = reorder_within(term, beta, topic)) %>% ggplot(aes(beta, term, fill = factor(topic))) + geom_col(show.legend = FALSE) + *scale_fill_viridis_d() ``` ] .panel2-sw6-auto[ <img src="topic-modeling-pres_files/figure-html/sw6_auto_09_output-1.png" width="80%" /> ] --- count: false .panel1-sw6-auto[ ```r responses_topics %>% group_by(topic) %>% slice_max(beta, n = 10) %>% ungroup() %>% arrange(topic, -beta) %>% mutate(term = reorder_within(term, beta, topic)) %>% ggplot(aes(beta, term, fill = factor(topic))) + geom_col(show.legend = FALSE) + scale_fill_viridis_d() + *facet_wrap(~ topic, scales = "free") ``` ] .panel2-sw6-auto[ <img src="topic-modeling-pres_files/figure-html/sw6_auto_10_output-1.png" width="80%" /> ] --- count: false .panel1-sw6-auto[ ```r responses_topics %>% group_by(topic) %>% slice_max(beta, n = 10) %>% ungroup() %>% arrange(topic, -beta) %>% mutate(term = reorder_within(term, beta, topic)) %>% ggplot(aes(beta, term, fill = factor(topic))) + geom_col(show.legend = FALSE) + 
scale_fill_viridis_d() + facet_wrap(~ topic, scales = "free") + *scale_y_reordered() ``` ] .panel2-sw6-auto[ <img src="topic-modeling-pres_files/figure-html/sw6_auto_11_output-1.png" width="80%" /> ] --- count: false .panel1-sw6-auto[ ```r responses_topics %>% group_by(topic) %>% slice_max(beta, n = 10) %>% ungroup() %>% arrange(topic, -beta) %>% mutate(term = reorder_within(term, beta, topic)) %>% ggplot(aes(beta, term, fill = factor(topic))) + geom_col(show.legend = FALSE) + scale_fill_viridis_d() + facet_wrap(~ topic, scales = "free") + scale_y_reordered() + *theme_minimal() ``` ] .panel2-sw6-auto[ <img src="topic-modeling-pres_files/figure-html/sw6_auto_12_output-1.png" width="80%" /> ] <style> .panel1-sw6-auto { color: white; width: 53.9%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-sw6-auto { color: white; width: 44.1%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-sw6-auto { color: white; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- ### Assigning a Variable Let's save the plot ```r responses_top_terms <- responses_topics %>% group_by(topic) %>% slice_max(beta, n = 10) %>% ungroup() %>% arrange(topic, -beta) %>% mutate(term = reorder_within(term, beta, topic)) %>% ggplot(aes(beta, term, fill = factor(topic))) + geom_col(show.legend = FALSE) + scale_fill_viridis_d() + facet_wrap(~ topic, scales = "free") + scale_y_reordered() + theme_minimal() ``` --- <br> <br> <img src="topic-modeling-pres_files/figure-html/unnamed-chunk-16-1.png" width="864" style="display: block; margin: auto;" /> -- .footnote[Tip: you can save high (or really any) resolution visuals easily using [`ggsave`](https://sscc.wisc.edu/sscc/pubs/using-r-plots/saving-plots.html)] --- ### What Just Happened? -- > LDA is a form of (unsupervised) learning that views documents as bags-of-words (BoW) where order does not matter. 
Not having to track the placement of every term saves a lot of time and computational energy -- > LDA works by first making a key assumption: each document was generated by picking a set of topics and then, for each topic, picking a set of words --- ### Steps to Finding Topics In a nutshell, for each document `\(m\)` -- 1. Assume there are `\(k\)` topics across all of the documents -- 2. Create a distribution `\(\alpha\)` where the `\(k\)` topics are symmetrically or asymmetrically spread across each document `\(m\)` by assigning each word a topic -- 3. For each word `\(w\)` in every document `\(m\)`, assume its topic is assigned incorrectly but that every other word is assigned the correct topic -- 4. Probabilistically reassign word `\(w\)` a topic based on two things: - which topics are present in document `\(m\)` - a distribution `\(\beta\)` that tracks how many times word `\(w\)` has been assigned a particular topic across all of the documents -- 5. Repeat this process for each document until the assignments stabilize (saturation) --- ## Interpret -- > Much like you would assess a factor or component, the topics are unlabeled and it is up to you to figure out what they could mean. Not every topic may be directly applicable, but each should still be interpreted and reported.
Discarding topics means that you are removing potentially relevant information --- Here is a brief assessment of some possible topics that are represented in the topic model with reference to *employee responses* <br> <br> <br> <br> <center> <table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:center;color: #ffffff !important;vertical-align: middle !important;"> Topic </th> <th style="text-align:left;color: #ffffff !important;vertical-align: middle !important;"> Label </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;width: 20em; color: #ffffff !important;vertical-align: middle !important;"> 1 </td> <td style="text-align:left;width: 30em; color: #ffffff !important;vertical-align: middle !important;"> production and gains reliance on employee abilities </td> </tr> <tr> <td style="text-align:center;width: 20em; color: #ffffff !important;vertical-align: middle !important;"> 2 </td> <td style="text-align:left;width: 30em; color: #ffffff !important;vertical-align: middle !important;"> teams' abilities to tackle client needs affect on-time completion </td> </tr> <tr> <td style="text-align:center;width: 20em; color: #ffffff !important;vertical-align: middle !important;"> 3 </td> <td style="text-align:left;width: 30em; color: #ffffff !important;vertical-align: middle !important;"> achievement varies by timeframe and workers' talent </td> </tr> <tr> <td style="text-align:center;width: 20em; color: #ffffff !important;vertical-align: middle !important;"> 4 </td> <td style="text-align:left;width: 30em; color: #ffffff !important;vertical-align: middle !important;"> possible growth is helped or hindered by worker characteristics </td> </tr> <tr> <td style="text-align:center;width: 20em; color: #ffffff !important;vertical-align: middle !important;"> 5 </td> <td style="text-align:left;width: 30em; color: #ffffff 
!important;vertical-align: middle !important;"> increases tied to employees' capacity and capabilities </td> </tr> </tbody> </table> </center> -- .footnote[Your assessment would likely differ to some degree, and that is the point: qualitative concepts such as triangulation and saturation still play a large and impactful role in the interpretation phase. Note that with a much larger text data set, this task could be significantly easier] --- # That’s It! Any questions? -- <br> <br> <br> <br> <br> <br> <br> <br> <center> <br><br> <div class="fade_rule"></div> <br><br> </center> <center> <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br /><br />This work is licensed under a <br /><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>
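---

### Appendix: Saving the Plot

The earlier tip mentioned `ggsave` for exporting high-resolution visuals. A minimal sketch using the `responses_top_terms` object saved earlier; the filename, dimensions, and resolution below are illustrative choices, not prescribed values.

```r
library(ggplot2)

# Write the faceted topic plot to a high-resolution PNG;
# width and height are in inches, dpi controls the resolution
ggsave("responses_top_terms.png",
       plot   = responses_top_terms,
       width  = 9,
       height = 6,
       dpi    = 300)
```

If `plot` is omitted, `ggsave` defaults to the most recently displayed plot, and the output format is inferred from the file extension (e.g. `.png`, `.pdf`, `.svg`).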