Data Wrangling
Why data-wrangling
If you can wrangle data into the proper form you can do anything with it…
Data-wrangling is absolutely essential for every data science task where we need to work with collected data.
A recent article from the New York Times said “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing data, before it can be explored for useful information.”
An excelent talk on data wrangling by Jenny Bryan:
https://www.youtube.com/watch?v=4MfUCX_KpdE
It is important to highlight the following facts;
- Excellent data wrangling skills will allow you to make R tools “flow” into one another.
- Different functions require different input formats and so do different solutions!
- The correct shape for the data will greatly simplify the stats needed to solve a problem.
- For example where you would have looped over data you can now simply aggregate
- For example where you would have repeated a model or visualization you now apply it over a hierarchy!
- Being data fluent will allow you to come up with innovative solutions!
The basics - tabular data examples
Here we will use an example of grocery store data.
Import data
What does the data look like?:
dummy_data <-
readRDS("../../static/data/data-wrangling/dummy_data.rds")
dummy_data %>% glimpse
## Observations: 1,000
## Variables: 10
## $ customer_no <chr> "49ef700455a56debe3214598663330f6", "210d4179edb2...
## $ spend_month <dttm> 2012-12-01, 2012-12-01, 2012-12-01, 2012-12-01, ...
## $ store_counts <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ visits_count <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1...
## $ item_sum <int> 1, NA, 6, 4, 1, 1, 1, NA, 4, 1, 1, 1, 1, 1, NA, 2...
## $ discount_sum <dbl> 0.00, 0.00, 4.05, 0.00, 0.00, 6.90, NA, 0.00, 10....
## $ spend_sum <dbl> 40.00, 15.90, 41.90, 117.96, NA, 43.25, 31.90, 18...
## $ spend_day <chr> "Fri", "Wed", "Wed", "Sun", "Fri", "Wed", "Wed", ...
## $ spend_time <chr> "afternoon", "afternoon", "evening", "lunch", "ev...
## $ spend_type <chr> "protein", "non_cash", "prepared_and_deli", "dair...
Reshaping to answer a question
The basics
Widening
What if I asked you; “Show me the spend in each spend_time on a customer level”.
For a person and also for some models you will need to widen the data. By ‘widen’ we mean adding columns. Humans read better when you describe an obeservation using columns after all…
To widen this data we use the function tidyr::spread
dummy_data %>%
select(customer_no,spend_time,spend_sum) %>%
tidyr::spread(key = spend_time, value = spend_sum, fill = 0)
## # A tibble: 998 x 6
## customer_no afternoon evening lunch morning `<NA>`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 001dc4f17ad2f7999c215ac147ad9a13 0 0 0 93.0 0
## 2 0020cab9597c12589249daa13a1427a8 30.8 0 0 0 0
## 3 0068525ef1d0d392ad3e97602511551a 59.9 0 0 0 0
## 4 00d75bffda7ea469a511c891a28b2df2 0 0 25.0 0 0
## 5 011aec649b655991c2835f151188c54e 0 0 0 0 0
## 6 014dd528202b032a46564f63100ced5c 0 0 0 422 0
## 7 015392d5b1b6e0b63a4ca5ce1139d4ac 0 30.0 0 0 0
## 8 01948501615939ecb5a8c49af2ab45dd 0 0 0 13.0 0
## 9 01a86d2dba1514e5a92ecb38b408bbe8 43.8 0 0 0 0
## 10 0210fe573d0e13b2cd68cabe31d3a94c 0 0 7.90 0 0
## # ... with 988 more rows
What were the averages for these metrics?
wide_summarised <-
dummy_data %>%
select(customer_no,spend_time,spend_sum) %>%
tidyr::spread(key = spend_time, value = spend_sum, fill = 0) %>%
summarise_if(is.numeric,mean,na.rm=TRUE)
wide_summarised
## # A tibble: 1 x 5
## afternoon evening lunch morning `<NA>`
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19.8 13.5 21.1 20.9 0.271
Moving back to long format
This was very basic, but what if we wanted to do this in reverse?
Of course we can do the exact opposite using tidyr::gather
Let’s visualize the spend in each time category by moving back to the less natural ‘long’ format. Functions generally become more powerful the longer we can shape the data since they naturally scan vectors of columns
wide_summarised %>%
tidyr::gather(key = "time_category", value = "average_spend") %>%
ggplot(aes( x = time_category, y = average_spend))+
geom_bar(stat = "identity")
With the wide format we would need to plot each column individually but now we can do them altogether.
Bit of both
What if I asked you “Show me all the time interactions for numeric variables on a customer level”
Let’s look at the releavant data for spend again:
dummy_data %>%
select_if(is.numeric)
## # A tibble: 1,000 x 5
## store_counts visits_count item_sum discount_sum spend_sum
## <int> <int> <int> <dbl> <dbl>
## 1 1 1 1 0 40.0
## 2 1 1 NA 0 15.9
## 3 1 1 6 4.05 41.9
## 4 1 1 4 0 118
## 5 1 1 1 0 NA
## 6 1 1 1 6.90 43.2
## 7 1 1 1 NA 31.9
## 8 1 1 NA 0 186
## 9 1 1 4 10.0 14.0
## 10 1 1 1 NA 15.0
## # ... with 990 more rows
How would I wrangle out the interactions?
Well one way is to name the interactions by gathering the relevant metrics together before we widen them
dummy_data %>%
select(store_counts,item_sum,discount_sum,spend_sum,customer_no,spend_time) %>%
gather(key = "metric", value = "value",-customer_no, -spend_time) %>%
mutate(interaction = paste0(spend_time,"_",metric)) %>%
spread(key = interaction,value = value)
## # A tibble: 4,000 x 23
## customer_no spend_time metric afternoon_discou… afternoon_item_…
## <chr> <chr> <chr> <dbl> <dbl>
## 1 001dc4f17ad2f79… morning discoun… NA NA
## 2 001dc4f17ad2f79… morning item_sum NA NA
## 3 001dc4f17ad2f79… morning spend_s… NA NA
## 4 001dc4f17ad2f79… morning store_c… NA NA
## 5 0020cab9597c125… afternoon discoun… 15.0 NA
## 6 0020cab9597c125… afternoon item_sum NA 3.00
## 7 0020cab9597c125… afternoon spend_s… NA NA
## 8 0020cab9597c125… afternoon store_c… NA NA
## 9 0068525ef1d0d39… afternoon discoun… NA NA
## 10 0068525ef1d0d39… afternoon item_sum NA 5.00
## # ... with 3,990 more rows, and 18 more variables:
## # afternoon_spend_sum <dbl>, afternoon_store_counts <dbl>,
## # evening_discount_sum <dbl>, evening_item_sum <dbl>,
## # evening_spend_sum <dbl>, evening_store_counts <dbl>,
## # lunch_discount_sum <dbl>, lunch_item_sum <dbl>, lunch_spend_sum <dbl>,
## # lunch_store_counts <dbl>, morning_discount_sum <dbl>,
## # morning_item_sum <dbl>, morning_spend_sum <dbl>,
## # morning_store_counts <dbl>, NA_discount_sum <dbl>, NA_item_sum <dbl>,
## # NA_spend_sum <dbl>, NA_store_counts <dbl>
# arrange(-afternoon_spend_sum)
Changing a problem
Let’s say for instance we have the following data:
dummy_data_2 <-
tibble(shop = LETTERS,
churn_rate = runif(26, 0, 1),
attrition = runif(26, 0, 1),
join_rate = runif(26, 0, 1),
sales = rnorm(1,mean = 5, sd = 3)*runif(26,10^2,10^4))
dummy_data_2
## # A tibble: 26 x 5
## shop churn_rate attrition join_rate sales
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A 0.986 0.506 0.447 3409
## 2 B 0.588 0.0882 0.637 42124
## 3 C 0.804 0.0999 0.0627 9262
## 4 D 0.233 0.192 0.469 35910
## 5 E 0.0441 0.292 0.0621 13394
## 6 F 0.0300 0.675 0.656 32413
## 7 G 0.991 0.152 0.826 20847
## 8 H 0.141 0.895 0.395 51524
## 9 I 0.970 0.220 0.00620 9599
## 10 J 0.834 0.587 0.182 4008
## # ... with 16 more rows
If we wanted to take the each shop row and multiply the columns together to find the multiplicative affect how would we do it?
One way to do this is to use some form of apply function:
dummy_data_2 %>%
select(-1) %>%
apply(function(x) prod(x),MARGIN = 1)
## [1] 758.97678 1391.16293 46.65377 755.62063 10.71378
## [6] 430.42521 2589.78481 2566.70058 12.69749 357.94699
## [11] 11380.89775 2404.21539 7869.61525 141.64731 364.92837
## [16] 14655.70645 517.01111 1075.52391 2197.33395 120.62184
## [21] 246.22134 120.96861 42.18102 2226.17664 43.17457
## [26] 5226.73185
This feels sort of ugly and returns a vector… We would prefer to remain type stable here inside a dataframe.
We could simply reshape this data so that we are infact summarizing over rows not columns:
dummy_data_2 %>%
gather(key = "metric", value = "value",-shop) %>%
group_by(shop) %>%
summarise(prod = value %>% prod)
## # A tibble: 26 x 2
## shop prod
## <chr> <dbl>
## 1 A 759
## 2 B 1391
## 3 C 46.7
## 4 D 756
## 5 E 10.7
## 6 F 430
## 7 G 2590
## 8 H 2567
## 9 I 12.7
## 10 J 358
## # ... with 16 more rows
By reshaping the data we remain inside of a tidy context while leveraging our sql like backend dplyr
to solve the problem
Advanced wrangling
How do we use wrangling when the data and the question become more complex?
Let’s follow through the examples given by Jenny Brian on the Game Of Thrones data
About the data
Here we have a dataset api that we can use to pull data out about any Game Of Thrones character.
The api can be found via:
https://anapioficeandfire.come
Let’s look at what we get if we ask for some jason data on all the characters (1 entry):
data_characters_request <-
# httr::GET(url = "https://anapioficeandfire.com/api/characters?pageSize=1000")
httr::GET(url = "https://www.anapioficeandfire.com/api/characters")
data_characters <-
data_characters_request %>%
content()
data_characters[[1]]
## $url
## [1] "https://www.anapioficeandfire.com/api/characters/1"
##
## $name
## [1] ""
##
## $gender
## [1] "Female"
##
## $culture
## [1] "Braavosi"
##
## $born
## [1] ""
##
## $died
## [1] ""
##
## $titles
## $titles[[1]]
## [1] ""
##
##
## $aliases
## $aliases[[1]]
## [1] "The Daughter of the Dusk"
##
##
## $father
## [1] ""
##
## $mother
## [1] ""
##
## $spouse
## [1] ""
##
## $allegiances
## list()
##
## $books
## $books[[1]]
## [1] "https://www.anapioficeandfire.com/api/books/5"
##
##
## $povBooks
## list()
##
## $tvSeries
## $tvSeries[[1]]
## [1] ""
##
##
## $playedBy
## $playedBy[[1]]
## [1] ""
What info did we recieve from the api?
data_characters[[1]] %>% names
## [1] "url" "name" "gender" "culture" "born"
## [6] "died" "titles" "aliases" "father" "mother"
## [11] "spouse" "allegiances" "books" "povBooks" "tvSeries"
## [16] "playedBy"
Let’s get all the data from the api:
extract_api_data <- function(page = 1) {
data_characters_request <-
httr::GET(url = paste0("https://anapioficeandfire.com/api/characters?page=",page)) %>%
content()
}
all_pages <-
map(1:214,extract_api_data)
all_pages %>%
saveRDS("../../static/data/data-wrangling/all_pages.rds")
OK, so we have a nested structure of data describing each GOT character…
Let’s see which characters we have in here…
# data_characters <-
# all_pages %>%
# unlist(recursive = FALSE)
data_characters %>%
map("name") %>%
tail(20)
## [[1]]
## [1] "Wyl"
##
## [[2]]
## [1] "Wyl"
##
## [[3]]
## [1] "Wylla"
##
## [[4]]
## [1] "Xaro Xhoan Daxos"
##
## [[5]]
## [1] "Xhondo"
##
## [[6]]
## [1] "Yandry"
##
## [[7]]
## [1] "Ygon"
##
## [[8]]
## [1] "Ygritte"
##
## [[9]]
## [1] "Yna"
##
## [[10]]
## [1] "Yorkel"
##
## [[11]]
## [1] "Yorko Terys"
##
## [[12]]
## [1] "Yormwell"
##
## [[13]]
## [1] "Young Henly"
##
## [[14]]
## [1] "Ysilla"
##
## [[15]]
## [1] "Zarabelo"
##
## [[16]]
## [1] "Zei"
##
## [[17]]
## [1] "Zekko"
##
## [[18]]
## [1] "Zharaq zo Loraq"
##
## [[19]]
## [1] "Zollo"
##
## [[20]]
## [1] "Criston Cole"
Put this into a dataframe:
GOT_df <-
map_df(data_characters,`[`, c("name","gender","culture","born","died")) %>%
mutate(titles = data_characters %>% map("titles"))
GOT_df %>% sample_n(20)
## # A tibble: 20 x 6
## name gender culture born died titles
## <chr> <chr> <chr> <chr> <chr> <list>
## 1 Mezzara Female Ghiscari At Meereen "" <list …
## 2 Daeron Targaryen Male "" In 114 AC In 130 AC, a… <list …
## 3 Dake Male "" "" At Coldmoat <list …
## 4 Robin Hill Male "" "" "" <list …
## 5 Steffon Swyft Male "" "" "" <list …
## 6 Tyrion Tanner Male "" In 300 AC, at … "" <list …
## 7 Drennan Male Ironborn "" In 299 AC, a… <list …
## 8 Dacey Mormont Female Northmen In or between … In 299 AC, a… <list …
## 9 Wat Male "" "" "" <list …
## 10 Agrivane Male "" "" "" <list …
## 11 Jeyne Swann Female "" "" "" <list …
## 12 Bronn Male "" In or between … "" <list …
## 13 Tregar Male "" "" In 298 AC, a… <list …
## 14 Luthor Tyrell Male "" "" "" <list …
## 15 Layna Female "" In or between … "" <list …
## 16 Edwyd Fossoway Male "" "" In 299 AC, a… <list …
## 17 Elys Westerling Male "" "" "" <list …
## 18 Kyleg Male Free Folk "" "" <list …
## 19 Aggo Male Dothraki "" "" <list …
## 20 Sarra Frey Female "" In 285 AC "" <list …
So we can pull out the data from these nested lists into other columns or even apply models to them…
Let’s visualize the cultures of all the characters:
GOT_df %>%
filter(culture != "") %>%
# select(-data_char) %>%
ggplot()+
geom_bar(aes(x = culture, y = ..count.., fill = culture))+
ggplot2::theme(axis.text.x = element_text(angle = 90, hjust = 1))+
coord_flip()
OK, so in this sample sent back from the api we have this distribution of characters by cultures.
How many people died?
GOT_df %>%
filter(culture != "") %>%
mutate(died = ifelse(died == "",0,1)) %>%
# select(-data_char) %>%
ggplot()+
geom_bar(aes(y = died, x = culture, fill = culture), stat = "identity")+
ggplot2::theme(axis.text.x = element_text(angle = 90, hjust = 1))+
coord_flip()
What’s the probability of dying if you have some sort of title:
GOT_df %>%
mutate(died = ifelse(died == "",FALSE,TRUE)) %>%
mutate(has_title = titles %>% map_lgl(~ifelse(any(.x != ""), TRUE, FALSE))) %>%
group_by(has_title,died) %>%
tally %>%
spread(key = died, value = n, sep = "_") %>%
mutate(prob_of_dying = died_TRUE/(died_TRUE+died_FALSE))
## # A tibble: 2 x 4
## # Groups: has_title [2]
## has_title died_FALSE died_TRUE prob_of_dying
## <lgl> <int> <int> <dbl>
## 1 F 797 251 0.240
## 2 T 784 302 0.278