`install.packages(c("dplyr", "gutenbergr", "stringr", "tidytext", "tidyr", "stopwords", "wordcloud", "rsample", "glmnet", "forcats", "broom", "igraph", "ggraph", "kableExtra", "yardstick")) `

# Introduction to Textmining in R

This post demonstrates how various R packages can be used for text mining in R. In particular, we start with common text transformations, perform various data explorations with term frequency (tf) and inverse document frequency (idf) and build a supervised classifiaction model that learns the difference between texts of different authors.

The content of this tutorial is based on the excellent book “Textmining with R (2019)” from Julia Silge and David Robinson and the blog post “Text classification with tidy data principles (2018)” from Julia Silges.

## Installation of R packages

If you like to install all packages at once, use the code below.

## Data import

We can access the full texts of various books from “Project Gutenberg” via the `gutenbergr`

package. We can look up certain authors or titles with a regular expression using the `stringr`

package. All functions in `stringr`

start with `str_`

and take a vector of strings as the first argument. To learn more about stringr, visit the stringr documentation.

```
library(gutenbergr)
library(stringr)
<- gutenberg_works(str_detect(author, "Doyle")) doyle
```

```
library(kableExtra)
::kable(head(doyle, 4)) |>
knitrkable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```

gutenberg_id | title | author | gutenberg_author_id | language | gutenberg_bookshelf | rights | has_text |
---|---|---|---|---|---|---|---|

108 | The Return of Sherlock Holmes | Doyle, Arthur Conan | 69 | en | Detective Fiction | Public domain in the USA. | TRUE |

126 | The Poison Belt | Doyle, Arthur Conan | 69 | en | Science Fiction | Public domain in the USA. | TRUE |

139 | The Lost World | Doyle, Arthur Conan | 69 | en | Science Fiction | Public domain in the USA. | TRUE |

244 | A Study in Scarlet | Doyle, Arthur Conan | 69 | en | Detective Fiction | Public domain in the USA. | TRUE |

We obtain *“Relativity: The Special and General Theory”* by Albert Einstein (gutenberg_id: 30155) and *“Experiments with Alternate Currents of High Potential and High Frequency”* by Nikola Tesla (gutenberg_id: 13476) from gutenberg and add the column “author” to the result.

`<- gutenberg_download(c(30155, 13476), meta_fields = "author") books `

Furthermore, we transfrom the data to a tibble (tibbles are a modern take on data frames), add the row number with the column name `document`

to the tibble and drop the column `gutenberg_id`

. We will use the information in column `document`

to train a model that can take an individual line (row) and give us a probability that the text in this particular line comes from a certain author.

```
library(dplyr)
<- as_tibble(books) |>
books mutate(document = row_number()) |>
select(-gutenberg_id)
```

```
kable(head(books, 8)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```

text | author | document |
---|---|---|

EXPERIMENTS WITH ALTERNATE CURRENTS OF HIGH POTENTIAL AND HIGH FREQUENCY | Tesla, Nikola | 1 |

Tesla, Nikola | 2 | |

A Lecture Delivered before the Institution of Electrical Engineers, London | Tesla, Nikola | 3 |

Tesla, Nikola | 4 | |

by | Tesla, Nikola | 5 |

Tesla, Nikola | 6 | |

NIKOLA TESLA | Tesla, Nikola | 7 |

Tesla, Nikola | 8 |

# Data transformation

## Tokenization

First of all, we need to both break the text into individual tokens (a process called **tokenization**) and transform it to a tidy data structure (i.e. each variable must have its own column, each observation must have its own row and each value must have its own cell). To do this, we use tidytext’s `unnest_tokens()`

function. We also remove the *rarest words* in that step, keeping only words in our dataset that occur more than 10 times.

```
library(tidytext)
<- books |>
tidy_books unnest_tokens(word, text) |>
group_by(word) |>
filter(n() > 10) |>
ungroup()
```

```
kable(head(tidy_books, 8)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```

author | document | word |
---|---|---|

Tesla, Nikola | 1 | experiments |

Tesla, Nikola | 1 | with |

Tesla, Nikola | 1 | alternate |

Tesla, Nikola | 1 | currents |

Tesla, Nikola | 1 | of |

Tesla, Nikola | 1 | high |

Tesla, Nikola | 1 | potential |

Tesla, Nikola | 1 | and |

## Stop words

Now that the data is in a tidy “one-word-per-row” format, we can manipulate it with packages like `dplyr`

. Often in text analysis, we will want to remove **stop words**: Stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth. We can remove stop words in our data by using the stop words provided in the package `stopwords`

with an `anti_join()`

from the package `dplyr`

.

```
library(stopwords)
library(tibble)
<- as_tibble(stopwords::stopwords("en"))
stopword <- rename(stopword, word=value)
stopword <- anti_join(tidy_books, stopword, by = 'word') tb
```

```
kable(head(tb, 8)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```

author | document | word |
---|---|---|

Tesla, Nikola | 1 | experiments |

Tesla, Nikola | 1 | alternate |

Tesla, Nikola | 1 | currents |

Tesla, Nikola | 1 | high |

Tesla, Nikola | 1 | potential |

Tesla, Nikola | 1 | high |

Tesla, Nikola | 1 | frequency |

Tesla, Nikola | 3 | lecture |

The tidy data structure allows different types of exploratory data analysis (EDA), which we turn to next.

# Exploratory data analysis

## Term frequency (tf)

An important question in text mining is how to quantify what a document is about. One measure of how important a word may be is its **term frequency** (tf), i.e. how frequently a word occurs in a document.

We can start by using `dplyr`

to explore the most commonly used words.

`<- count(tb, word, sort = TRUE) word_count `

```
kable(head(word_count, 5)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```

word | n |
---|---|

one | 239 |

body | 230 |

may | 224 |

can | 194 |

relativity | 193 |

Term frequency by author:

```
<- tb |>
author_count count(author, word, sort = TRUE)
```

```
kable(head(author_count, 10)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```

author | word | n |
---|---|---|

Einstein, Albert | relativity | 193 |

Tesla, Nikola | may | 184 |

Einstein, Albert | theory | 181 |

Tesla, Nikola | bulb | 171 |

Tesla, Nikola | coil | 166 |

Tesla, Nikola | high | 166 |

Einstein, Albert | body | 156 |

Tesla, Nikola | one | 156 |

Einstein, Albert | reference | 150 |

Tesla, Nikola | tube | 147 |

Plot terms with a frequency greater than 100:

```
library(ggplot2)
|>
tb count(author, word, sort = TRUE) |>
filter(n > 100) |>
mutate(word = reorder(word, n)) |>
ggplot(aes(word, n)) +
geom_col(aes(fill=author)) +
xlab(NULL) +
scale_y_continuous(expand = c(0, 0)) +
coord_flip() +
theme_classic(base_size = 12) +
labs(fill= "Author", title="Word frequency", subtitle="n > 100")+
theme(plot.title = element_text(lineheight=.8, face="bold")) +
scale_fill_brewer()
```

Plot top 20 terms by author:

```
|>
tb count(author, word, sort = TRUE) |>
group_by(author) |>
top_n(20) |>
ungroup() |>
ggplot(aes(reorder_within(word, n, author), n,
fill = author)) +
geom_col(alpha = 0.8, show.legend = FALSE) +
scale_x_reordered() +
coord_flip() +
facet_wrap(~author, scales = "free") +
scale_y_continuous(expand = c(0, 0)) +
theme_classic(base_size = 12) +
labs(fill= "Author",
title="Most frequent words",
subtitle="Top 20 words by book",
x= NULL,
y= "Word Count")+
theme(plot.title = element_text(lineheight=.8, face="bold")) +
scale_fill_brewer()
```

You may notice expressions like “_k”, “co” in the Einstein text and “fig” in the Tesla text. Let’s remove these and other less meaningful words with a custom list of stop words and use anti_join() to remove them.

```
<- tibble(word = c("eq", "co", "rc", "ac", "ak", "bn", "fig", "file", "cg", "cb", "cm", "ab", "_k", "_k_", "_x"))
newstopwords
<- anti_join(tb, newstopwords, by = "word") tb
```

Now we plot the data again without the new stopwords:

```
|>
tb count(author, word, sort = TRUE) |>
group_by(author) |>
top_n(20) |>
ungroup() |>
ggplot(aes(reorder_within(word, n, author), n,
fill = author)) +
geom_col(alpha = 0.8, show.legend = FALSE) +
scale_x_reordered() +
coord_flip() +
facet_wrap(~author, scales = "free") +
scale_y_continuous(expand = c(0, 0)) +
theme_classic(base_size = 12) +
labs(fill= "Author",
title="Most frequent words after removing stop words",
subtitle="Top 20 words by book",
x= NULL,
y= "Word Count")+
theme(plot.title = element_text(lineheight=.8, face="bold")) +
scale_fill_brewer()
```

You also may want to visualize the most frequent terms as a simple word cloud:

```
library(wordcloud)
|>
tb count(word) |>
with(wordcloud(word, n, max.words = 15))
```

## Term frequency and inverse document frequency (tf-idf)

Term frequency is a useful measure to determine how frequently a word occurs in a document. There are words in a document, however, that occur many times but may not be important.

Another approach is to look at a term’s **inverse document frequency (idf)**, which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. This can be combined with term frequency to calculate a term’s tf-idf (the two quantities multiplied together), the frequency of a term adjusted for how rarely it is used.

The inverse document frequency for any given term is defined as:

\[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\]

Hence, term frequency and inverse document frequency allows us to find words that are characteristic for one document within a collection of documents. The `tidytext`

package uses an implementation of tf-idf consistent with tidy data principles that enables us to see how different words are important in documents within a collection or corpus of documents.

```
library(forcats)
<- tb |>
plot_tb count(author, word, sort = TRUE) |>
bind_tf_idf(word, author, n) |>
mutate(word = fct_reorder(word, tf_idf)) |>
mutate(author = factor(author,
levels = c("Tesla, Nikola",
"Einstein, Albert")))
|>
plot_tb group_by(author) |>
top_n(15, tf_idf) |>
ungroup() |>
mutate(word = reorder(word, tf_idf)) |>
ggplot(aes(word, tf_idf, fill = author)) +
scale_y_continuous(expand = c(0, 0)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~author, ncol = 2, scales = "free") +
coord_flip() +
theme_classic(base_size = 12) +
labs(fill= "Author",
title="Term frequency and inverse document frequency (tf-idf)",
subtitle="Top 20 words by book",
x= NULL,
y= "tf-idf") +
theme(plot.title = element_text(lineheight=.8, face="bold")) +
scale_fill_brewer()
```

In particular, the `bind_tf_idf`

function in the `tidytext`

package takes a tidy text dataset as input with one row per token (term), per document. One column (word here) contains the terms/tokens, one column contains the documents (authors in this case), and the last necessary column contains the counts, how many times each document contains each term (n in this example).

```
<- tb |>
tf_idf count(author, word, sort = TRUE) |>
bind_tf_idf(word, author, n)
```

```
kable(head(tf_idf, 10)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```

author | word | n | tf | idf | tf_idf |
---|---|---|---|---|---|

Einstein, Albert | relativity | 193 | 0.0177618 | 0.6931472 | 0.0123116 |

Tesla, Nikola | may | 184 | 0.0139436 | 0.0000000 | 0.0000000 |

Einstein, Albert | theory | 181 | 0.0166575 | 0.6931472 | 0.0115461 |

Tesla, Nikola | bulb | 171 | 0.0129585 | 0.6931472 | 0.0089821 |

Tesla, Nikola | coil | 166 | 0.0125796 | 0.6931472 | 0.0087195 |

Tesla, Nikola | high | 166 | 0.0125796 | 0.0000000 | 0.0000000 |

Einstein, Albert | body | 156 | 0.0143567 | 0.0000000 | 0.0000000 |

Tesla, Nikola | one | 156 | 0.0118218 | 0.0000000 | 0.0000000 |

Einstein, Albert | reference | 150 | 0.0138045 | 0.0000000 | 0.0000000 |

Tesla, Nikola | tube | 147 | 0.0111397 | 0.0000000 | 0.0000000 |

Notice that *idf* and thus *tf-idf* are zero for extremely common words (like “may”). These are all words that appear in both documents, so the idf term (which will then be the natural log of 1) is zero. The inverse document frequency (and thus tf-idf) is very low (near zero) for words that occur in many of the documents in a collection; this is how this approach decreases the weight for common words. The inverse document frequency will be a higher number for words that occur in fewer of the documents in the collection.

## Tokenizing by n-gram

We’ve been using the `unnest_tokens`

function to tokenize by word, or sometimes by sentence, which is useful for the kinds of frequency analyses we’ve been doing so far. But we can also use the function to tokenize into consecutive sequences of words, called **n-grams**. By seeing how often word X is followed by word Y, we can then build a model of the relationships between them.

```
<- books |>
einstein_bigrams filter(author == "Einstein, Albert") |>
unnest_tokens(bigram, text, token = "ngrams", n = 2)
```

```
kable(head(einstein_bigrams, 10)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```

author | document | bigram |
---|---|---|

Einstein, Albert | 3797 | NA |

Einstein, Albert | 3798 | NA |

Einstein, Albert | 3799 | NA |

Einstein, Albert | 3800 | NA |

Einstein, Albert | 3801 | NA |

Einstein, Albert | 3802 | relativity the |

Einstein, Albert | 3802 | the special |

Einstein, Albert | 3802 | special and |

Einstein, Albert | 3802 | and general |

Einstein, Albert | 3802 | general theory |

We can examine the most common bigrams using dplyr’s `count()`

:

```
<- einstein_bigrams |>
einstein_bigrams_count count(bigram, sort = TRUE)
```

```
kable(head(einstein_bigrams_count, 10)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```

bigram | n |
---|---|

NA | 921 |

of the | 613 |

to the | 247 |

in the | 197 |

of relativity | 164 |

theory of | 121 |

with the | 119 |

on the | 111 |

that the | 110 |

of a | 98 |

Now we use tidyr’s `separate()`

, which splits a column into multiple columns based on a delimiter. This lets us separate it into two columns, “word1” and “word2”, at which point we can remove cases where either is a stop-word. This time, we use the stopwords from the package `tidyr`

:

```
library(tidyr)
# seperate words
<- einstein_bigrams |>
bigrams_separated separate(bigram, c("word1", "word2"), sep = " ")
# filter stop words and NA
<- bigrams_separated |>
bigrams_filtered filter(!word1 %in% stop_words$word) |>
filter(!word2 %in% stop_words$word) |>
filter(!is.na(word1))
# new bigram counts:
<- bigrams_filtered |>
bigram_counts count(word1, word2, sort = TRUE)
```

```
kable(head(bigram_counts, 10)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```

word1 | word2 | n |
---|---|---|

reference | body | 56 |

gravitational | field | 53 |

special | theory | 35 |

ordinate | system | 34 |

space | time | 27 |

classical | mechanics | 26 |

lorentz | transformation | 23 |

measuring | rods | 22 |

straight | line | 17 |

rigid | body | 16 |

This one-bigram-per-row format is helpful for exploratory analyses of the text. As a simple example, we might be interested in the most often mentioned “theory”:

```
<- bigrams_filtered |>
bigram_theory filter(word2 == "theory") |>
count(word1, sort = TRUE)
```

```
kable(head(bigram_theory, 7)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```

word1 | n |
---|---|

special | 35 |

lorentz | 4 |

newton’s | 4 |

_special | 1 |

comprehensive | 1 |

electrodynamic | 1 |

electromagnetic | 1 |

In other analyses you may be interested in the most common trigrams, which are consecutive sequences of 3 words. We can find this by setting n = 3:

```
<- books |>
trigram unnest_tokens(trigram, text, token = "ngrams", n = 3) |>
separate(trigram, c("word1", "word2", "word3"), sep = " ") |>
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!word3 %in% stop_words$word,
!is.na(word1)) |>
count(word1, word2, word3, sort = TRUE)
```

```
kable(head(trigram, 7)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```

word1 | word2 | word3 | n |
---|---|---|---|

_x_1 | _x_2 | _x_3 | 12 |

light | _in | vacuo_ | 10 |

reference | body | _k_ | 10 |

space | time | continuum | 9 |

_x_2 | _x_3 | _x_4 | 8 |

reference | body | _k | 8 |

disruptive | discharge | coil | 6 |

## Network analysis

We may be interested in visualizing all of the relationships among words simultaneously, rather than just the top few at a time. As one common visualization, we can arrange the words into a network, or “graph.” Here we’ll be referring to a “graph” not in the sense of a visualization, but as a combination of connected nodes. A graph can be constructed from a tidy object since it has three variables:

- from: the node an edge is coming from
- to: the node an edge is going towards
- weight: A numeric value associated with each edge

The `igraph`

package has many functions for manipulating and analyzing networks. One way to create an igraph object from tidy data is the `graph_from_data_frame()`

function, which takes a data frame of edges with columns for “from”, “to”, and edge attributes (in this case n):

```
library(igraph)
# filter for only relatively common combinations
<- bigram_counts |>
bigram_graph filter(n > 5) |>
graph_from_data_frame()
```

We use the `ggraph`

package to convert the igraph object into a `ggraph`

with the ggraph function, after which we add layers to it, much like layers are added in ggplot2. For example, for a basic graph we need to add three layers: nodes, edges, and text:

```
library(ggraph)
set.seed(123)
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
```

Finally, we will change some settings to obtain to a better looking graph:

We add the

`edge_alpha`

aesthetic to the link layer to make links transparent based on how common or rare the bigram is.We add directionality with an arrow, constructed using

`grid::arrow()`

, including an`end_cap`

option that tells the arrow to end before touching the node.We tinker with the options to the node layer to make the nodes more attractive (larger, blue points).

We add a theme that’s useful for plotting networks,

`theme_void()`

.

```
set.seed(123)
<- grid::arrow(type = "closed", length = unit(.15, "inches"))
a
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
```

# Classification with logistic regression

In the first part we will build a statistical learning model. In the second part we will want to test it and assess its quality. Without dividing the dataset we would test the model on the data which the algorithm have already seen, which is why we start by splitting the data.

## Train test split

Let’s go back to the original `books`

dataset (not the `tidy_books`

dataset) because the lines of text are our individual observations.

We could use functions from the `rsample`

package to generate resampled datasets, but the specific modeling approach we’re going to use will do that for us so we only need a simple train/test split.

```
library(rsample)
<- books |>
books_split select(document) |>
initial_split(prop = 3/4)
<- training(books_split)
train_data <- testing(books_split) test_data
```

Notice that we just select specific text rows (column `document`

) for training and others for our test data (we set the proportion of data to be retained for modeling/analysis to 3/4) without selecting the actual text lines at this point.

## Training data (sparse matrix)

Now we want to transform our training data from a tidy data structure to a “sparse matrix” (these objects can be treated as though they were matrices, for example accessing particular rows and columns, but are stored in a more efficient format) to use for our classification algorithm.

```
<- tidy_books |>
sparse_words count(document, word) |>
inner_join(train_data, by = "document") |>
cast_sparse(document, word, n)
```

`dim(sparse_words)`

`[1] 4769 893`

We have over 4,700 training observations and almost 900 features. Text feature space handled in this way is very high dimensional, so we need to take that into account when considering our modeling approach.

One reason this overall approach is flexible is that you could at this point `cbind()`

other columns, such as non-text numeric data, onto this sparse matrix. Then you can use this combination of text and non-text data as your predictors in the classifiaction algorithm, and the regularized regression algorithm we are going to use will find which are important for your problem space.

## Response variable

We also need to build a tibble with a **response variable** to associate each of the `rownames()`

of the sparse matrix with an author, to use as the quantity we will predict in the model.

`<- as.integer(rownames(sparse_words)) word_rownames `

```
<- tibble(document = word_rownames) |>
books_joined left_join(books |>
select(document, author))
```

```
kable(head(books_joined, 7)) |>
kable_styling(bootstrap_options = "striped", "hover", "condensed", "responsive", full_width = F, position = "center")
```

document | author |
---|---|

1 | Tesla, Nikola |

3 | Tesla, Nikola |

5 | Tesla, Nikola |

11 | Tesla, Nikola |

21 | Tesla, Nikola |

24 | Tesla, Nikola |

25 | Tesla, Nikola |

## Logistic regression model

Now it’s time to train our classification model. Let’s use the `glmnet`

package to fit a logistic regression model with *lasso* (least absolute shrinkage and selection operator; also Lasso or LASSO) regularization. This regression analysis method performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces.

`Glmnet`

is a package that fits lasso models via penalized maximum likelihood. We do not cover the method and glmnet package in detail at this point, but if you want to learn more about glmnet and lasso regression, review the following resources:

The package is very useful for text classification because the variable selection that lasso regularization performs can tell you which words are important for your prediction problem. The glmnet package also supports parallel processing, so we can train on multiple cores with cross-validation on the training set using `cv.glmnet()`

.

```
library(glmnet)
<- books_joined$author == "Einstein, Albert"
is_einstein
<- cv.glmnet(sparse_words,
model
is_einstein,family = "binomial",
parallel = TRUE,
keep = TRUE)
```

Let’s use the package `broom`

(the broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames) to check out the coefficients of the model, for the largest value of lambda with error within 1 standard error of the minimum (`lambda.1se`

).

```
library(broom)
<- model$glmnet.fit |>
coefs tidy() |>
filter(lambda == model$lambda.1se)
```

Which coefficents are the largest in size, in each direction:

```
library(forcats)
|>
coefs group_by(estimate > 0) |>
top_n(10, abs(estimate)) |>
ungroup() |>
ggplot(aes(fct_reorder(term, estimate), estimate, fill = estimate > 0)) +
geom_col(alpha = 0.8, show.legend = FALSE) +
coord_flip() +
labs(
x = NULL,
title = "Coefficients that increase/decrease probability the most",
subtitle = "A document mentioning lecture or probably is unlikely to be written by Albert Einstein"
+
) theme_classic(base_size = 12) +
theme(plot.title = element_text(lineheight=.8, face="bold")) +
scale_fill_brewer()
```

## Model evaluation with test data

Now we want to evaluate how well this model is doing using the test data that we held out and did not use for training the model. Let’s create a dataframe that tells us, for each document in the test set, the probability of being written by Albert Einstein.

```
<- coefs |>
intercept filter(term == "(Intercept)") |>
pull(estimate)
<- tidy_books |>
classifications inner_join(test_data) |>
inner_join(coefs, by = c("word" = "term")) |>
group_by(document) |>
summarize(score = sum(estimate)) |>
mutate(probability = plogis(intercept + score))
```

```
kable(head(classifications, 7)) |>
kable_styling(bootstrap_options = "striped", "hover", "condensed", "responsive", full_width = F, position = "center")
```

document | score | probability |
---|---|---|

7 | -1.6569483 | 0.1715378 |

9 | -0.5226801 | 0.3916220 |

27 | -1.2047108 | 0.2455423 |

32 | -1.0465089 | 0.2760125 |

36 | -0.5824899 | 0.3774680 |

38 | -0.1416163 | 0.4851455 |

40 | -4.7306180 | 0.0094857 |

Now let’s use the `yardstick`

package (yardstick is a package to estimate how well models are working using tidy data principles) to calculate some model performance metrics. For example, what does the ROC curve (receiver operating characteristic curve - a graph showing the performance of a classification model at all classification thresholds) look like:

```
library(yardstick)
<- classifications |>
comment_classes left_join(books |>
select(author, document), by = "document") |>
mutate(author = as.factor(author))
comment_classes
```

```
# A tibble: 1,553 × 4
document score probability author
<int> <dbl> <dbl> <fct>
1 7 -1.66 0.172 Tesla, Nikola
2 9 -0.523 0.392 Tesla, Nikola
3 27 -1.20 0.246 Tesla, Nikola
4 32 -1.05 0.276 Tesla, Nikola
5 36 -0.582 0.377 Tesla, Nikola
6 38 -0.142 0.485 Tesla, Nikola
7 40 -4.73 0.00949 Tesla, Nikola
8 41 -3.18 0.0434 Tesla, Nikola
9 42 -0.209 0.468 Tesla, Nikola
10 46 -0.433 0.413 Tesla, Nikola
# ℹ 1,543 more rows
```

```
<- comment_classes |>
roc roc_curve(author, probability)
|>
roc ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_line(
color = "midnightblue",
size = 1.5
+
) geom_abline(
lty = 2, alpha = 0.5,
color = "gray50",
size = 1.2
+
) labs(
title = "ROC curve for text classification using regularized regression",
subtitle = "Predicting whether text was written by Albert Einstein or Nikola Tesla"
+
) theme_classic(base_size = 12) +
theme(plot.title = element_text(lineheight=.8, face="bold"))
```

Let’s obtain the accuracy (AUC - the fraction of predictions that a classification model got right) on the test data:

```
<- comment_classes |>
auc roc_auc(author, probability)
```

```
kable(auc) |>
kable_styling(bootstrap_options = "striped", "hover", "condensed", "responsive", full_width = F, position = "center")
```

.metric | .estimator | .estimate |
---|---|---|

roc_auc | binary | 0.9741431 |

Next we turn to the **confusion matrix**. Let’s make the following definitions:

- “Einstein, Albert” is a positive class.
- “Tesla, Nikola” is a negative class.

True Positive (TP): |
False Positive (FP): |
---|---|

Reality: Text is from Einstein |
Reality: Text is from Tesla |

Model: Text is from Einstein |
Model: Text is from Einstein |

False Negative (FN): |
True Negative (TN): |
---|---|

Reality: Text is from Einstein |
Reality: Text is from Tesla |

Model: Text is from Tesla |
Model: Text is from Tesla |

We can summarize our “einstein-text-prediction” model using a 2x2 confusion matrix that depicts all four possible outcomes:

A

*true positive*is an outcome where the model correctly predicts the positive class (Einstein). Similarly, a*true negative*is an outcome where the model correctly predicts the negative class (Tesla).A

*false positive*is an outcome where the model incorrectly predicts the positive class. And a*false negative*is an outcome where the model incorrectly predicts the negative class.

Let’s use a probability of 0.5 as our threshold. That means all model predictions with a probability greater than 50% get labeld as beeing text from Einstein:

```
|>
comment_classes mutate(prediction = case_when(
> 0.5 ~ "Einstein, Albert",
probability TRUE ~ "Tesla, Nikola"),
prediction = as.factor(prediction)) |>
conf_mat(author, prediction)
```

```
Truth
Prediction Einstein, Albert Tesla, Nikola
Einstein, Albert 646 82
Tesla, Nikola 68 757
```

Let’s take a closer look at these misclassifications: false negatives (FN) and false positives (FP). Which documents here were incorrectly predicted to be written by Albert Einstein, at the extreme probability end of greater than 80% (false positive)?

```
<- comment_classes |>
FPfilter(probability > .8,
== "Tesla, Nikola") |>
author sample_n(10) |>
inner_join(books |>
select(document, text)) |>
select(probability, text)
```

```
kable(FP) |>
kable_styling(bootstrap_options = "striped", "hover", "condensed", "responsive", full_width = F, position = "center")
```

probability | text |
---|---|

0.8592571 | From experiences of this kind I am led to infer that, in order to be |

0.9115352 | already described, except with the view of completing, or more clearly |

0.8991634 | there is any motion which is measurable going on in the space, such a |

0.8161307 | [Transcriber's note: Corrected the following typesetting errors: |

0.8024522 | principle of the vacuum pump of the future. For the present, we must |

0.8213111 | shown by the following experiment: |

0.8963556 | brilliancy. Next, suppose we diminish to any degree we choose the |

0.8449774 | covered with a milky film, which is separated by a dark space from the |

0.8447254 | medium surely must exist, and I am convinced that, for instance, even |

0.8027681 | velocity--the energy associated with the moving body--is another, and |

These documents were incorrectly predicted to be written by Albert Einstein. However, they were written by Nikola Tesla.

Finally, let’s take a look at the texts which are from Albert Einstein that the model did not correctly identify (false negative):

```
<- comment_classes |>
FN filter(probability < .3,
== "Einstein, Albert") |>
author sample_n(10) |>
inner_join(books |>
select(document, text)) |>
select(probability, text)
```

```
kable(FN) |>
kable_styling(bootstrap_options = "striped", "hover", "condensed", "responsive", full_width = F, position = "center")
```

probability | text |
---|---|

0.2932465 | another, shall indicate position and time directly. Such was the |

0.2447074 | This conception is in itself not very satisfactory. It is still less |

0.1485998 | was necessary to surmount a serious difficulty, and as this lies deep |

0.2958725 | potential φ, hence the result we have obtained will hold quite |

0.2572766 | [20] Mathematicians have been confronted with our problem in the |

0.2032739 | any doubt to arise as to the prime importance of the Galileian |

0.0453919 | bring some one a few happy hours of suggestive thought! |

0.1163048 | ideas among themselves. |

0.2196137 | connection in which they actually originated. In the interest of |

0.2389984 | strings to the floor, otherwise the slightest impact against the floor |

We can conclude that the model did a very good job in predicting the authors of the texts. Furthermore, the texts of the misclassifications are quite short and we can imagine, that even a human reader who is familiar with the work of Einstein and Tesla would have difficulties to classify them correctly.