# Chapter 6 Discovering Behaviors with Unsupervised Learning

So far, we have been working with supervised learning methods, that is, models for which the training instances have two elements: (1) a set of input values (features) and (2) the expected output (label). As mentioned in chapter 1, there are other types of machine learning methods and one of those is **unsupervised learning** which is the topic of this chapter. In unsupervised learning, the training instances do not have a response variable (e.g., a label). Thus, the objective is to extract knowledge from the available data without any type of guidance (supervision). For example, given a set of variables that characterize a person, we would like to find groups of people with similar behaviors. For physical activity behaviors, this could be done by finding groups of very active people versus finding groups of people with low physical activity. Those groups can be useful for delivering targeted suggestions or services thus, enhancing and personalizing the user experience.

This chapter starts with one of the most popular unsupervised learning algorithms: **\(k\)-means clustering**. Next, an example of how this technique can be applied to find groups of students with similar characteristics is presented. Then, **association rules mining** is presented, which is another type of unsupervised learning method. Finally, association rules are used to find criminal patterns from a homicide database.

## 6.1 \(k\)-means Clustering

`kmeans_steps.R`

This is one of the most commonly used unsupervised methods due to its simplicity and efficacy. Its objective is to find groups of points such that points in the same group are similar and points from different groups are as dissimilar as possible. The number of groups \(k\) needs to be defined a priori. The method is based on computing distances to **centroids**. The centroid of a set of points is computed by taking the mean of each of their features. The \(k\)-means algorithm is as follows:

```
Generate k centroids at random.
Repeat until no change or max iterations:
Assign each data point to the closest centroid.
Update centroids.
```

To measure the distance between a data point and a centroid, the Euclidean distance is typically used, but other distances can be used as well depending on the application. As an example, let’s cluster user responses from the *STUDENTS’ MENTAL HEALTH* dataset. This database contains questionnaire responses about depression, acculturative stress, social connectedness, and help-seeking behaviors from students at a University in Japan. To demonstrate how \(k\)-means work, we will only choose two variables so we can plot the results. The variables are *ToAS* (Total Acculturative Stress) and *ToSC* (Total Social Connectedness). The *ToAS* measures the emotional challenges when adapting to a new culture while *ToSC* measures emotional distance with oneself and other people. For the clustering, the parameter \(k\) will be set to \(3\), that is, we want to group the points into \(3\) disjoint groups. The code that implements the \(k\)-means algorithm can be found in the script `kmeans_steps.R`

. The algorithm begins by selecting \(3\) centroids at random. Figure 6.1 shows a scatterplot of the variables *ToAS* and *ToSC* along with the random centroids.

Next, at the first iteration, each point is assigned to the closest centroid. This is depicted in Figure 6.2 (top left). Then, the centroids are updated (moved) based on the new assignments. In the next iteration, the points are reassigned to the closest centroids and so on. Figure 6.2 shows the first \(4\) iterations of the algorithm.

From iteration \(1\) to \(2\) the centroids moved considerably. After that, they began to stabilize. Formally, the algorithm tries to minimize the total within cluster variation of all clusters. The cluster variation of a single cluster \(C_k\) is defined as:

\[\begin{equation} W(C_k) = \sum_{x_i \in C_k}{(x_i - \mu_k)^2} \tag{6.1} \end{equation}\]

where \(x_i\) is a data point and \(\mu_k\) is the centroid of cluster \(C_k\). Thus, the total within cluster variation \(TWCV\) is:

\[\begin{equation} TWCV = \sum_{i=1}^k{W(C_i)} \tag{6.2} \end{equation}\]

that is, the sum of all within-cluster variations across all clusters. The objective is to find the \(\mu_k\) centroids that make \(TWCV\) minimal. Finding the global optimum is a difficult problem. However, the iterative algorithm described above often produces good approximations.

### 6.1.1 Grouping Student Responses

`group_students.R`

In the previous example, we only used two variables to perform the clustering. Let’s now use more variables from the *STUDENTS’ MENTAL HEALTH* dataset to find groups. The full code can be found in `group_students.R`

. After removing missing values, one-hot encoding categorical variables, and some additional cleaning, the following \(10\) variables were selected:

```
# Select which variables are going to be used for clustering.
<- c("Stay","English_cate","Intimate"
selvars "APD","AHome","APH","Afear",
,"ACS","AGuilt","ToAS")
```

Additionally, it is advisable to normalize the data between \(0\) and \(1\) since we are dealing with distance computations and we want to put the same weight on each variable. To plot the \(10\) variables, we can use **MDS** (described in chapter 4) to project the data into \(2\) dimensions (Figure 6.3).

Visually, it seems that there are \(4\) distinct groups of points. Based on this initial guess, we can set \(k=4\) and use the `kmeans()`

function included in base R to find the groups automatically.

`<- kmeans(normdf, 4) clusters `

The first argument of `kmeans()`

is a data frame or a matrix and the second argument the number of clusters. Figure 6.4 shows the resulting clustering. The `kmeans()`

method returns an object that contains several components including `cluster`

that stores the assigned cluster for each data point and `centers`

that stores the centroids.

The \(k\)-means algorithm found the same clusters as we would intuitively expect. We can check how different the groups are by inspecting some of the variables. For example, by plotting a boxplot of the *Intimate* variable (Figure 6.5). This variable is \(1\) if the student has an intimate partner or \(0\) otherwise. Since there are only two possible values the boxplot looks flat. This shows that *cluster_1* and *cluster_3* are different from *cluster_2* and *cluster_4*.

Additionally, let’s plot the *ACS* variable which represents the total score of culture shock (see Figure 6.6). This one has a minimum value of \(3\) and a max value of \(13\).

*cluster_2* and *cluster_4* were similar based on the *Intimate* variable, but if we take a look at the difference in medians based on *ACS*, *cluster_2* and *cluster_4* are the most dissimilar clusters which gives an intuitive idea of why those two clusters were split into two different ones by the algorithm.

So far, the number of groups \(k\) has been chosen arbitrarily or by visual inspection. But, *is there an automatic way to select the best k?* As always… this depends on the task at hand but there is a method called **Silhouette index** that can be used to select the optimal \(k\) based on an optimality criterion. This index is presented in the next section.

## 6.2 The Silhouette Index

As opposed to supervised learning, in unsupervised learning there is **no ground truth** to validate the results. In clustering, one way to validate the resulting groups is to plot them and manually explore the clusters’ data points and look for similarities and/or differences. But sometimes we may also want to automate the process and have a quantitative way to measure how well the clustering algorithm grouped the points with the given set of parameters. If we had such a method we could do parameter optimization, for example, to find the best \(k\). Well, there is something called *the silhouette index* (Rousseeuw 1987) and it can be used to measure the correctness of the clusters.

This index is computed for each data point and tells us how well they are clustered. The total silhouette index is the mean of all points’ indices and gives us an idea of how well the points were clustered overall. This index goes from \(-1\) to \(1\) and I’ll explain in a moment how to interpret it, but first let’s see how it is computed.

To compute the silhouette index two things are needed: the already created groups and the distances between points. Let:

\(a(i)=\) average dissimilarity (distance) of point \(i\) to all other points in \(A\), where \(A\) is the cluster to which \(i\) has been assigned to (Figure 6.7).

\(d(i,C)=\) average dissimilarity between \(i\) and all points in some cluster \(C\).

\(b(i)=\min_{C \neq A}d(i,C)\). The cluster \(B\) for which the minimum is obtained is the neighbor of point \(i\). (The second best choice for \(i\)).

Thus, \(s(i)\) (the silhouette index of point \(i\)) is obtained with the formula:

\[\begin{equation} s(i) = \frac{{b(i) - a(i)}}{{\max \{ a(i),b(i)\} }} \tag{6.3} \end{equation}\]

When \(s(i)\) is close to \(1\), it means that the *within* dissimilarity \(a(i)\) is much smaller than the smallest *between* dissimilarity \(b(i)\) thus, \(i\) can be considered to be well clustered. When \(s(i)\) is close to \(0\) it is not clear whether \(i\) belongs to \(A\) or \(B\). When \(s(i)\) is close to \(-1\), \(a(i)\) is larger than \(b(i)\) meaning that \(i\) may have been misgrouped. The total silhouette index \(S\) is the average of all indices \(s(i)\) of all points.

In R, the `cluster`

package has the function `silhouette()`

that computes the silhouette index. The following code snippet clusters the student responses into \(4\) groups and computes the index of each point with the `silhouette()`

function. Its first argument is the cluster assignments as returned by `kmeans()`

, and the second argument is a `dist`

object that contains the distances between each pair of points. We can compute this information from our data frame with the `dist()`

function. The `silhouette()`

function returns an object with the silhouette index of each data point. We can compute the total index by taking the average which in this case was \(0.346\).

```
library(cluster) # Load the required package.
set.seed(1234)
<- kmeans(normdf, 4) # Try with k=4
clusters
# Compute silhouette indices for all points.
<- silhouette(clusters$cluster, dist(normdf))
si
# Print first rows.
head(si)
#> cluster neighbor sil_width
#> [1,] 1 4 0.3482364
#> [2,] 2 4 0.3718735
#> [3,] 3 1 0.3322198
#> [4,] 1 4 0.3998996
#> [5,] 1 4 0.3662811
#> [6,] 3 1 0.1463607
# Compute total Silhouette index by averaging the individual indices.
mean(si[,3])
# [1] 0.3466427
```

One nice thing about this index is that it can be presented visually. To generate a silhouette plot, use the generic `plot()`

function and pass the object returned by `silhouette()`

.

```
plot(si, cex.names=0.6, col = 1:4,
main = "Silhouette plot, k=4",
border=NA)
```

Figure 6.8 shows the silhouette plot when \(k=4\). The horizontal lines represent the individual silhouette indices. In this plot, all of them are positive. The height of each cluster gives a visual idea of the number of data points contained in it with respect to other clusters. We can see for example that cluster \(2\) is the smallest one. On the right side, is the number of points in each cluster and their average silhouette index. At the bottom, the total silhouette index is printed (\(0.35\)). We can try to cluster the points into \(7\) groups instead of \(4\) and see what happens.

```
set.seed(1234)
<- kmeans(normdf, 7)
clusters <- silhouette(clusters$cluster, dist(normdf))
si plot(si, cex.names=0.6, col = 1:7,
main = "Silhouette plot, k=7",
border=NA)
```

Here, cluster \(2\) and \(4\) have data points with negative indices and the overall score is \(0.26\). This suggests that \(k=4\) produces more coherent clusters as compared to \(k=7\).

## 6.3 Mining Association Rules

Association rule mining consists of a set of methods to extract patterns (rules) from transactional data. For example, shopping behavior can be analyzed by finding rules from customers’ shopping transactions. A **transaction** is an event that involves a set of items. For example, when someone buys a soda, a bag of chips, and a chocolate bar the purchase is registered as *one* transaction containing \(3\) items. I apologize for using this example for those of you with healthy diets. Based on a database that contains many transactions, it is possible to uncover item relationships. Those relationships are usually expressed as implication rules of the form \(X \implies Y\) where \(X\) and \(Y\) are sets of items. Both sets are disjoint, this means that items in \(X\) are not in \(Y\) and vice-versa which can be formally represented as \(X \cap Y = \emptyset\). That is, the intersection of the two sets is the empty set. \(X \implies Y\) is read as: if \(X\) then \(Y\). The left-hand-side (lhs) \(X\) is called the **antecedent** and the right-hand-side (rhs) \(Y\) is called the **consequent**.

In the unhealthy supermarket example, a rule like \(\{chips, chocolate\} \implies \{soda\}\) can be interpreted as *if someone buys chips and chocolate then, it is likely that this same person will also buy soda*. These types of rules can be used for targeted advertisements, product placement decisions, etc.

The possible number of rules that can be generated grows exponentially as the number of items increases. Furthermore, not all rules may be interesting. The most well-known algorithm to find interesting association rules is called **Apriori** (Agrawal and Srikant 1994). To quantify if a rule is interesting or not, this algorithm uses two importance measures: **support** and **confidence**.

**Support.**The support \(\text{supp}(X)\) of an itemset \(X\) is the proportion of transactions that contain all the items in \(X\). This quantifies how frequent the itemset is.**Confidence.**The confidence of a rule \(\text{conf}(X \implies Y)=\text{supp}(X \cup Y)/\text{supp}(X)\) and can be interpreted as the conditional probability that \(Y\) occurs given that \(X\) is present. This can also be thought of as the probability that a transaction that contains \(X\) also contains \(Y\). The \(\cup\) operator is the union of two sets. This means taking all elements from both sets and removing repeated elements.

Now that there is a way to measure the importance of the rules, the Apriori algorithm first finds itemsets that satisfy a minimum support and generates rules from those itemsets that satisfy a minimum confidence. Those minimum thresholds are set by the user. The lower the thresholds, the more rules returned by the algorithm. One thing to note is that Apriori only generates rules with itemsets of size \(1\) on the right-hand side. Another common metric to measure the importance of a rule is the **lift**. Lift is typically used after Apriori has generated the rules to further filter and/or rank the results.

**Lift.**The lift of a rule \(\text{lift}(X \implies Y) = \text{supp}(X \cup Y) / (\text{supp}(X)\text{supp}(Y))\) is similar to the confidence but it also takes into consideration the frequency of \(Y\). A lift of \(1\) means that there is no association between \(X\) and \(Y\). A lift greater than \(1\) means that \(Y\) is likely to occur if \(X\) occurs and a lift less than \(1\) means that \(Y\) is unlikely to occur when \(X\) occurs.

Let’s compute all those metrics using an example. The following table shows a synthetic example database of transactions from shoppers with unhealthy behaviors.

The support of the itemset consisting of a single item *‘chocolate’* is \(\text{supp}(\{chocolate\}) = 5/10 = 0.5\) because *‘chocolate’* appears in \(5\) out of the \(10\) transactions. The support of \(\{chips, soda\}\) is \(3/10 = 0.3\).

The confidence of the rule \(\{chocolate, chips\} \implies \{soda\}\) is:

\[\begin{align*} \text{conf}(\{chocolate, chips\} \implies \{soda\})&=\frac{\text{supp}(\{chocolate, chips, soda\})}{\text{supp}(\{chocolate,chips\})} \\ &=(2/10) / (3/10) \\ &=0.66 \end{align*}\]

The lift of \(\{soda\} \implies \{ice cream\}\) is:

\[\begin{align*} \text{lift}(\{soda\} \implies \{ice cream\})&=\frac{\text{supp}(\{soda, ice cream\})}{\text{supp}(\{soda\})\text{supp}(\{ice cream\})} \\ &=(2/10) / ((7/10)(3/10)) \\ &=0.95. \end{align*}\]

Association rules mining is unsupervised in the sense that there are no labels or ground truth. Many applications of association rules are targeted to market basket analysis to gain insights into shoppers’ behavior and take actions to increase sales. To find such rules it is necessary to have ‘transactions’ (sets of items), for example, supermarket products. However, this is not the only application of association rules. There are other problems that can be structured as transactions of items. For example in medicine, *diseases* can be seen as transactions and *symptoms* as items. Thus, one can apply association rules algorithms to find symptoms and disease relationships. Another application is in recommender systems. Take, for example, movies. Transactions can be the set of movies watched by every user. If you watched a movie \(m\) then, the recommender system can suggest another movie that co-occurred frequently with \(m\) and that you have not watched yet. Furthermore, other types of relational data can be transformed into transaction-like structures to find patterns and this is precisely what we are going to do in the next section to mine criminal patterns.

### 6.3.1 Finding Rules for Criminal Behavior

`crimes_process.R`

`crimes_rules.R`

In this section, we will use association rules mining to find patterns in the *HOMICIDE REPORTS*^{14} dataset. This database contains homicide reports from 1980 to 2014 in the United States. The database is structured as a table with \(24\) columns and \(638,454\) rows. Each row corresponds to a homicide report that includes city, state, year, month, sex of victim, sex of perpetrator, if the crime was solved or not, weapon used, age of the victim and perpetrator, the relationship type between the victim and the perpetrator, and some other information.

Before trying to find rules, the data needs to be preprocessed and converted into transactions. Each homicide report will be a transaction and the items are the possible values of \(3\) of the columns: **Relationship**, **Weapon**, and **Perpetrator.Age**. The **Relationship** variable can take values like *Stranger*, *Neighbor*, *Friend*, etc. In total, there are \(28\) possible relationship values including *Unknown*. For the purpose of our analysis, we will remove rows with unknown values in **Relationship** and **Weapon**. Since **Perpetrator.Age** is an integer, we need to convert it into categories. The following age groups are created: child (< \(13\) years), teen (\(13\) to \(17\) years), adult (\(18\) to \(45\) years), and lateAdulthood (> \(45\) years). After these cleaning and preprocessing steps, the dataset has \(3\) columns and \(328,238\) rows (see Figure 6.11). The script used to perform the preprocessing is `crimes_process.R`

.

Now, we have a data frame that contains only the relevant information. Each row will be used to generate one transaction. An example transaction may be *{R.Wife, Knife, Adult}*. This one represents the case where the perpetrator is an *adult* who used a *knife* to kill his *wife*. Note the ‘R.’ at the beginning of ‘Wife.’ This ‘R.’ was added for clarity in order to identify that this item is a relationship. One thing to note is that every transaction will consist of exactly \(3\) items. This is a bit different than the market basket case in which every transaction can include a varying number of products. Although this item-size constraint was a design decision based on the structure of the original data, this will not prevent us from performing the analysis to find interesting rules.

To find the association rules, the `arules`

package (Hahsler et al. 2019) will be used. This package has an interface to an efficient implementation in C of the Apriori algorithm. This package needs the transactions to be stored as an object of type ‘transactions.’ One way to create this object is to use a **logical matrix** and cast it into a transactions object. The rows of the logical matrix represent transactions and columns represent items. The number of columns equals the total number of possible items. A `TRUE`

value indicates that the item is present in the transaction and `FALSE`

otherwise. In our case, the matrix has \(46\) columns. The `crimes_process.R`

script has code to generate this matrix `M`

. The \(46\) items (columns of `M`

) are:

```
as.character(colnames(M))
#> [1] "R.Acquaintance" "R.Wife" "R.Stranger"
#> [4] "R.Girlfriend" "R.Ex-Husband" "R.Brother"
#> [7] "R.Stepdaughter" "R.Husband" "R.Friend"
#> [10] "R.Family" "R.Neighbor" "R.Father"
#> [13] "R.In-Law" "R.Son" "R.Ex-Wife"
#> [16] "R.Boyfriend" "R.Mother" "R.Sister"
#> [19] "R.Common-Law Husband" "R.Common-Law Wife" "R.Stepfather"
#> [22] "R.Stepson" "R.Stepmother" "R.Daughter"
#> [25] "R.Boyfriend/Girlfriend" "R.Employer" "R.Employee"
#> [28] "Blunt Object" "Strangulation" "Rifle"
#> [31] "Knife" "Shotgun" "Handgun"
#> [34] "Drowning" "Firearm" "Suffocation"
#> [37] "Fire" "Drugs" "Explosives"
#> [40] "Fall" "Gun" "Poison"
#> [43] "teen" "adult" "lateAdulthood"
#> [46] "child"
```

The following snippet shows how to convert the matrix into an `arules`

transactions object. Before the conversion, the package `arules`

needs to be loaded. For convenience, the transactions are saved in a file `transactions.RData`

.

```
library(arules)
# Convert into a transactions object.
<- as(M, "transactions")
transactions
# Save transactions file.
save(transactions, file="transactions.RData")
```

Now that the database is in the required format we can start the analysis. The `crimes_rules.R`

script has the code to perform the analysis. First, the transactions file that we generated before is loaded:

```
library(arules)
library(arulesViz)
# Load preprocessed data.
load("transactions.RData")
```

Note that additionally to the `arules`

package, we also loaded the `arulesViz`

package (Hahsler 2019). This package has several functions to generate cool plots of the learned rules! A summary of the transactions can be printed with the `summary()`

function:

```
# Print summary.
summary(transactions)
#> transactions as itemMatrix in sparse format with
#> 328238 rows (elements/itemsets/transactions) and
#> 46 columns (items) and a density of 0.06521739
#>
#> most frequent items:
#> adult Handgun R.Acquaintance R.Stranger
#> 257026 160586 117305 77725
#> Knife (Other)
#> 61936 310136
#>
#> element (itemset/transaction) length distribution:
#> sizes
#> 3
#> 328238
#>
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 3 3 3 3 3 3
#>
#> includes extended item information - examples:
#> labels
#> Relationship1 R.Acquaintance
#> Relationship2 R.Wife
#> Relationship3 R.Stranger
```

The summary shows the total number of rows (transactions) and the number of columns. It also prints the most frequent items, in this case, *adult* with \(257026\) occurrences, *Handgun* with \(160586\), and so on. The itemset sizes are also displayed. Here, all itemsets have a size of \(3\) (by design). Some other summary statistics are also printed.

We can use the `itemFrequencyPlot()`

function from the `arulesViz`

package to plot the frequency of items.

```
itemFrequencyPlot(transactions,
type = "relative",
topN = 15,
main = 'Item frequecies')
```

The `type`

argument specifies that we want to plot the relative frequencies. Use `"absolute"`

instead to plot the total counts. `topN`

is used to select how many items are plotted. Figure 6.12 shows the output.

Now it is time to find some interesting rules! This can be done with the `apriori()`

function as follows:

```
# Run apriori algorithm.
<- apriori(transactions,
resrules parameter = list(support = 0.001,
confidence = 0.5,
# Find rules with at least 2 items.
minlen = 2,
target = 'rules'))
```

The first argument is the transactions object. The second argument `parameter`

specifies a list of algorithm parameters. In this case we want rules with a minimum support of \(0.001\) and a confidence of at least \(0.5\). The `minlen`

argument specifies the minimum number of allowed items in a rule (antecedent + consequent). We set it to \(2\) since we want rules with at least one element in the antecedent and one element in the consequent, for example, *{item1 => item2}*. This Apriori algorithm creates rules with only one item in the consequent. Finally, the `target`

parameter is used to specify that we want to find rules because the function can also return item sets of different types (see the documentation for more details). The returned rules are saved in the `resrules`

variable that can be used later to explore the results. We can also print a summary of the returned results.

```
# Print a summary of the results.
summary(resrules)
#> set of 141 rules
#>
#> rule length distribution (lhs + rhs):sizes
#> 2 3
#> 45 96
#>
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 2.000 2.000 3.000 2.681 3.000 3.000
#>
#> summary of quality measures:
#> support confidence lift count
#> Min. :0.001030 Min. :0.5045 Min. :0.6535 Min. : 338
#> 1st Qu.:0.001767 1st Qu.:0.6478 1st Qu.:0.9158 1st Qu.: 580
#> Median :0.004424 Median :0.7577 Median :1.0139 Median : 1452
#> Mean :0.021271 Mean :0.7269 Mean :1.0906 Mean : 6982
#> 3rd Qu.:0.012960 3rd Qu.:0.8131 3rd Qu.:1.0933 3rd Qu.: 4254
#> Max. :0.376836 Max. :0.9539 Max. :4.2777 Max. :123692
#>
#> mining info:
#> data ntransactions support confidence
#> transactions 328238 0.001 0.5
```

By looking at the summary, we see that the algorithm found \(141\) rules that satisfy the support and confidence thresholds. The rule length distribution is also printed. Here, \(45\) rules are of size \(2\) and \(96\) rules are of size \(3\). Then, some standard statistics are shown for support, confidence, and lift. The `inspect()`

function can be used to print the actual rules. Rules can be sorted by one of the importance measures. The following code sorts by lift and prints the first \(20\) rules. Figure 6.13 shows the output.

```
# Print the first n (20) rules with highest lift in decreasing order.
inspect(sort(resrules, by='lift', decreasing = T)[1:20])
```

The first rule with a lift of \(4.27\) says that if a homicide was committed by an *adult* and the victim was the *stepson*, then is it likely that a *blunt object* was used for the crime. By looking at the rules, one can also note that whenever *blunt object* appears either in the lhs or rhs, the victim was most likely an infant. Another thing to note is that when the victim was *boyfriend*, the crime was likely committed with a *knife*. This is also mentioned in the reports ‘Homicide trends in the United States’ (Cooper, Smith, and others 2012):

From 1980 through 2008 ‘Boyfriends were more likely to be killed by knives than any other group of intimates.’

According to rule \(20\), crimes involving *girlfriend* have a strong relationship with *strangulation*. This can also be confirmed in (Cooper, Smith, and others 2012):

From 1980 through 2008 ‘Girlfriends were more likely to be killed by force….’

The resulting rules can be plotted with the `plot()`

function (see Figure 6.14). By default, it generates a scatterplot with the *support* in the \(x\) axis and *confidence* in the \(y\) axis colored by *lift*.

```
# Plot a default scatterplot of support vs. confidence colored by lift.
plot(resrules)
```

The plot shows that rules with a high *lift* also have a low *support* and *confidence*. Hahsler (2017) mentioned that rules with high *lift* typically have low *support*. The plot can be customized for example to show the *support* and *lift* in the axes and color them by confidence. The axes can be set with the `measure`

parameter and the coloring with the `shading`

parameter. The function also supports different plotting engines including static and interactive. The following code generates a customized interactive plot by setting `engine = "htmlwidget"`

. This is very handy if you want to know which points correspond to which rules. By hovering the mouse on the desired point the corresponding rule is shown as a tooltip box (Figure 6.15). The interactive plots also allow to zoom in regions by clicking and dragging.

```
# Customize scatterplot to make it interactive
# and plot support vs. lift colored by confidence.
plot(resrules, engine = "htmlwidget",
measure = c("support", "lift"), shading = "confidence")
```

The `arulesViz`

package has a nice option to plot rules as a graph. This is done by setting `method = "graph"`

. We can also make the graph interactive for easier exploration by setting `engine="htmlwidget"`

. For clarity, the font size is reduced with `cex=0.9`

. Here we plot the first \(25\) rules.

```
# Plot rules as a graph.
plot(head(sort(resrules, by = "lift"), n=25),
method = "graph",
control=list(cex=.9),
engine="htmlwidget")
```

Figure 6.16 shows a zoomed-in portion of the entire graph. Circles represent rules and rounded squares items. The size of the circle is relative to the *support* and color relative to the *lift*. Incoming arrows represent the items in the antecedent and the outgoing arrow of a circle points to the item in the consequent part of the rule. From this graph, some interesting patterns can be seen. First, when the age category of the perpetrator is *lateAdulthood*, the victims were the *husband* or *ex-wife*. When the perpetrator is a *teen*, the victim was likely a *friend* or *stranger*.

The `arulesViz`

package has a cool function `ruleExplorer()`

that generates a shiny app with interactive controls and several plot types. When running the following code (output not shown) you may be asked to install additional shiny related packages.

```
# Opens a shiny app with several interactive plots.
ruleExplorer(resrules)
```

Sometimes Apriori returns thousands of rules. There is a convenient `subset()`

function to extract rules of interest. For example, we can select only the rules that contain *R.Girlfriend* in the antecedent (lhs) and print the top three with highest lift (Figure 6.17 shows the result):

```
# Subset transactions.
<- subset(resrules, subset = lhs %in% "R.Girlfriend")
rulesGirlfriend
# Print rules with highest lift.
inspect(head(rulesGirlfriend, n = 3, by = "lift"))
```

In this section, we showed how interesting rules can be extracted from a crimes database. Several preprocessing steps were required to transform the tabular data into transactional data. This example already demonstrated how the same data can be represented in different ways (tabular and transactional). The next chapter will cover more details about how data can be transformed into **different representations** suitable for different types of learning algorithms.

## 6.4 Summary

One of the types of machine learning is **unsupervised learning** in which there are no labels. This chapter introduced some unsupervised methods such as clustering and association rules.

- The objective of
**\(k\)-means clustering**is to find groups of points such that points in the same group are similar and points from different groups are as dissimilar as possible. - The
**centroid**of a group is calculated by taking the mean value of each feature. - In
**\(k\)-means**, one needs to specify the number of groups \(k\) before running the algorithm. - The
**Silhouette Index**is a measure that tells us how well a set of points were clustered. This measure can be used to find the optimal number of groups \(k\). **Association rules**can find patterns in an unsupervised manner.- The
**Apriori algorithm**is the most well-known method for finding association rules. - Before using the
*Apriori algorithm*, one needs to format the data as**transactions**. - A
**transaction**is an event that involves a set of items.

### References

*Proc. 20th Int. Conf. Very Large Data Bases, VLDB*, 1215:487–99.

*Homicide Trends in the United States, 1980-2008*. BiblioGov.

*R Journal*9 (2): 163–75. https://journal.r-project.org/archive/2017/RJ-2017-047/RJ-2017-047.pdf.

*arulesViz: Visualizing Association Rules and Frequent Itemsets*. https://CRAN.R-project.org/package=arulesViz.

*Arules: Mining Association Rules and Frequent Itemsets*. https://CRAN.R-project.org/package=arules.

*Journal of Intelligent Information Systems*17 (2): 107–45.

*Journal of Computational and Applied Mathematics*20: 53–65.