- exploratory data analysis (EDA)
- exploring data depends on context (domain knowledge)
- creating features (predictors)
- review of `group_by()`, `summarize()`, `inner_join()`
- dealing with `NA`s in `summarize()` using `na.rm=TRUE`
- avoiding over-plotting with `alpha=` and `geom_jitter()`
This lab and the next one are adapted from chapter 4 in *Data Mining with R* by Luis Torgo. The data contains over 400 thousand sales reports with the following variables:

- ID - ID of the sales person
- Prod - ID of the product sold
- Quant - number of reported units sold
- Val - reported total value of the sale
- Insp - this variable has three possible values: `ok` if the report was inspected and found valid, `fraud` if the report was inspected but found fraudulent, and `unkn` if the report was not inspected.

```
library(tidyverse)
library(stargazer)
data <- read_csv("sales.csv")
glimpse(data)
```

```
## Observations: 401,146
## Variables: 5
## $ ID <chr> "v1", "v2", "v3", "v4", "v3", "v5", "v6", "v7", "v8", "v...
## $ Prod <chr> "p1", "p1", "p1", "p1", "p1", "p2", "p2", "p2", "p2", "p...
## $ Quant <dbl> 182, 3072, 20393, 112, 6164, 104, 350, 200, 233, 118, 23...
## $ Val <dbl> 1665, 8780, 76990, 1100, 20260, 1155, 5680, 4010, 2855, ...
## $ Insp <chr> "unkn", "unkn", "unkn", "unkn", "unkn", "unkn", "unkn", ...
```

We see that there are over 400 thousand sales reports. Let's first look at the quantitative variables: `Val` and `Quant`.

```
data <- as.data.frame(data)
stargazer(select(data,Val, Quant), type="text", median = TRUE, digits = 0)
```

```
##
## ============================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Median Pctl(75) Max
## ----------------------------------------------------------------------------
## Val 399,964 14,617 69,713 1,005 1,345 2,675 8,680 4,642,955
## Quant 387,304 8,442 918,351 100 107 168 738 473,883,883
## ----------------------------------------------------------------------------
```

There is a pretty wide range in the quantity variable, but it appears that the minimum sale is 100 units. The range in the total value of the sales report is, not surprisingly, also wide and skewed. Also, there are missing values (`NA`s) in both quantity and value.
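A quick way to count the missing values per column is `summarise()` with `sum(is.na(...))`. On the full data this should report the 1,182 missing `Val` and 13,842 missing `Quant` implied by the N column in the table above (401,146 minus each N); here is a minimal sketch on a toy tibble with the same column names and hypothetical values:

```r
library(dplyr)

# Toy data mimicking the sales columns (hypothetical values)
toy <- tibble(Val   = c(1665, NA, 76990, 1100),
              Quant = c(182, 3072, NA, NA))

# Count the NAs in each quantitative column
toy %>% summarise(na_Val = sum(is.na(Val)),
                  na_Quant = sum(is.na(Quant)))
# na_Val = 1, na_Quant = 2
```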

Let's take a look at the character variables. First, let's see how many different values each takes on.

```
n_distinct(data$Prod)
```

```
## [1] 4548
```

```
n_distinct(data$ID)
```

```
## [1] 6016
```

```
table(data$Insp)
```

```
##
## fraud ok unkn
## 1270 14462 385414
```

We see that there are lots of different sales people (6,016 of them) and lots of different products (4,548 of them). Importantly, it appears that only a small fraction of reports has been inspected, as over 385 thousand reports have unknown status. Of the roughly 15 thousand inspected reports, only a small fraction, about 8%, is fraudulent. Thus, there is significant class imbalance in the variable we are trying to predict. Let's compute a few central-tendency statistics by `Insp`.
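The 8% figure comes directly from the counts in the `table(data$Insp)` output above: fraudulent reports as a share of all *inspected* (fraud + ok) reports. A quick check:

```r
# Counts copied from the table(data$Insp) output above
insp_counts <- c(fraud = 1270, ok = 14462, unkn = 385414)

# Fraction of inspected reports that turned out fraudulent
frac_fraud <- insp_counts["fraud"] / (insp_counts["fraud"] + insp_counts["ok"])
round(unname(frac_fraud), 2)  # 0.08
```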

IN-CLASS EXERCISE 1: Calculate the median of value and quantity for 'ok', 'fraud', and 'unkn' transactions.

Hopefully, the above exercise showed that we need to be careful about missing values. Since there are missing values in both `Val` and `Quant`, we will use the option `na.rm=TRUE` inside the `mean()` and `median()` functions. This option (`rm` stands for remove) tells `mean()` and `median()` to ignore missing values. Otherwise, if there were any missing value within a group, `mean()` and `median()` would return `NA`. In other words, we calculate the mean and median over all the non-missing values within each group.
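A minimal illustration of what `na.rm = TRUE` does, on a hypothetical vector:

```r
x <- c(100, 200, NA, 400)

mean(x)                  # NA: a single missing value poisons the result
mean(x, na.rm = TRUE)    # 233.3333: average of the non-missing values
median(x, na.rm = TRUE)  # 200
```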

```
sum <- data %>% group_by(Insp) %>%
summarize(av_Val=mean(Val,na.rm=TRUE), av_Quant=mean(Quant,na.rm=TRUE),
med_Val=median(Val,na.rm=TRUE), med_Quant=median(Quant,na.rm=TRUE))
sum
```

```
## # A tibble: 3 x 5
## Insp av_Val av_Quant med_Val med_Quant
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 fraud 93200. 945504. 6790 737
## 2 ok 60797. 35784. 13635 432.
## 3 unkn 12629. 4260. 2620 166
```

We see that inspected transactions are decidedly larger in terms of value and quantity than uninspected ones (status `unkn`). In terms of *average* value, fraudulent transactions are bigger, but in terms of *median* value they appear smaller. Quantity is bigger for fraudulent reports by both average and median. Clearly, there are some complex relationships among value, quantity, and fraudulent/ok status. Let's plot the data to see if we can shed some light.

We have two quantitative variables, `Val` and `Quant`, which we can plot on the x and y axes. We also have the qualitative variable `Insp`, which we can map to color. Since we saw a pretty big range for both value and quantity, we will use log scales. Also, since we have about 400 thousand observations, we should reduce over-plotting by making the points transparent using the `alpha=` aesthetic.

```
ggplot(data, aes(x=Quant, y=Val, color=Insp)) + geom_point(alpha=0.25) +
scale_x_continuous(trans="log", breaks=c(1000,10000,100000,1000000,10000000,100000000))+
scale_y_continuous(trans="log", breaks=c(1000,10000,100000,1000000,10000000))
```