Important Predictors of Spend Behaviour

Introduction: A retailer wishes to understand a variety of shopping behaviours at its stores. For example, how do its existing customers compare to its new customers? What about its lost customers compared to its existing ones? Do its customers look different from its competitors'? To answer these questions we set out to establish which customer characteristics are the most important when categorising customers into the above spend behaviour groups.

Intro to Pytorch for Deeplearning

This notebook is designed to give a friendly introduction to PyTorch. There are many different deep learning tools currently out there; my personal favourites are Keras and PyTorch. Why PyTorch over TensorFlow? TensorFlow is way too low level for most people… And why go low level if the library itself already abstracts the math away from us? Well, because some people want to write their own tensor operations. But don't worry: for most of us, who want 80% of the results for 20% of the effort, that complexity is not worth it.
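As a taste of what the notebook covers, here is a minimal sketch of PyTorch's core appeal: plain tensor operations with automatic differentiation (the data and step size are illustrative):

```python
import torch

# A toy linear-regression step to show autograd in action.
x = torch.randn(10, 3)
true_w = torch.tensor([[1.0], [2.0], [3.0]])
y = x @ true_w

w = torch.zeros(3, 1, requires_grad=True)
loss = ((x @ w - y) ** 2).mean()
loss.backward()            # gradients land in w.grad
with torch.no_grad():
    w -= 0.1 * w.grad      # one manual gradient-descent step
```

That is essentially the whole API surface a beginner needs: tensors, `backward()`, and `.grad`.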

Using GNU SCREEN when working in Terminal

What is GNU SCREEN? GNU SCREEN (screen) is a text-based program usually described as a window manager or terminal multiplexer. While it does a great many things, its two biggest features are detachability and multiplexing. Detachability means that you can run programs from within screen, detach and log out, then log in later, reattach, and the programs will still be there. Multiplexing means that you can have multiple programs running within a single screen session, each within its own window.
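The detach/reattach workflow described above looks roughly like this in a terminal (the session name is arbitrary):

```
screen -S myjob        # start a named session and launch your program inside it
# press Ctrl-a d to detach, then log out freely
screen -ls             # later, after logging back in: list running sessions
screen -r myjob        # reattach; the program is still running
```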

Customer matching using random forest

Introduction: Measuring treatment effects where the response was already measured without an experimental design usually requires matching to control for confounding. Here I outline matching using random forests, which improves on the performance of genetic matching while maintaining reasonable matching quality. Why? In most cases the first line of attack is matching on propensity scores, easily done with a GLM logit model and a package like MatchIt in R.
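For reference, the propensity-score baseline can be sketched in a few lines; this is a hedged Python/scikit-learn analogue of the GLM-logit-plus-MatchIt approach (the data and variable names are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))             # confounders
treated = rng.integers(0, 2, size=200)    # treatment flag

# Propensity score: P(treated | X) from a logit model
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Greedy 1:1 nearest-neighbour matching on the propensity score
controls = np.where(treated == 0)[0]
matches = {}
for i in np.where(treated == 1)[0]:
    j = controls[np.argmin(np.abs(ps[controls] - ps[i]))]
    matches[i] = j
```

The random-forest variant swaps the logit model for forest proximities, but the pairing logic stays the same.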

Installing and connecting to an Ingres database

Install Ingres: You can find the installation manual for Ingres at: The easiest method is probably to download the tar file onto the Ubuntu machine. Unzip the tar file using the usual commands. Inside the unzipped folder, run the bash script: provide the -user xxx parameter to set up the environment variables more easily for a specific user on the Ubuntu machine, and run it only from a bash shell (not fish or another shell).

Benchmark your API

Introduction: If you have ever created an API endpoint using plumber or flask, you may have wondered how to do some simple load testing… Here is a simple way to test this using an R script. Define a function that calls your API with some parameters. Depending on what your API does you may want to test different parameters altogether; in our case we will just make a simple call using fixed parameters:
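The same idea translates directly to Python; here is a minimal, hedged sketch with a stand-in for the actual API call (the endpoint and parameters are hypothetical):

```python
import time
import statistics

def call_api():
    # Hypothetical endpoint -- replace with your own, e.g. using urllib or requests:
    # urlopen("http://localhost:8000/predict?x=1").read()
    time.sleep(0.001)  # stand-in for the network round trip

def benchmark(fn, n=50):
    """Call fn n times and return per-call latencies in seconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies

lat = benchmark(call_api, n=20)
print(f"median latency: {statistics.median(lat) * 1000:.1f} ms")
```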

Validating customer matching in uplift analysis

Introduction: What is matching? Statistical matching is the process where we pair up responses in the treatment group with their respective doppelgangers from the untreated group. Once matched, these groups can be compared much like the treatment and control groups of a designed experiment. Problem statement: Often in performance analytics we have to use statistical matching to construct a control group after a program has launched. In this case we may want to validate the quality of the matching.
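One common way to validate matching quality is the standardized mean difference of each covariate between the matched groups; a small sketch (the covariate and threshold are illustrative, not from the post):

```python
import numpy as np

def smd(treated, control):
    """Standardized mean difference for one covariate."""
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    return (treated.mean() - control.mean()) / pooled_sd

rng = np.random.default_rng(1)
age_treated = rng.normal(40, 5, 100)
age_control = rng.normal(42, 5, 100)
# |SMD| < 0.1 is a commonly used balance threshold
print(round(float(smd(age_treated, age_control)), 3))
```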

A Survey of Uplift at 8020

Introduction: In consumer analytics it is very important to measure the effectiveness of your marketing and campaign management. Unfortunately, each company may have a different opportunity-infrastructure proposition. Some run product campaigns; others run loyalty programs for customers. Some programs have already been run, or outbound communication has already been sent, before it was decided to run performance analytics. All of these variations make measuring uplift quite complicated.

Using sequence caching in Vertica

Motivation The Vertica database uses sequences to keep track of auto-incremented columns. So for example, if you have a table with an ID column, and you want Vertica to automatically generate an ID for every new row that you add, Vertica needs a way of knowing what the next number should be. It does this by creating a sequence associated with the auto-incremented field, and when new rows are added, it checks for the next value in the sequence, and uses that.
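A hedged sketch of what this looks like in SQL (the sequence and table names are made up; cache sizes are illustrative):

```sql
-- Create a sequence with a session cache of 1000 values
CREATE SEQUENCE order_id_seq CACHE 1000;

-- Or adjust an existing sequence's cache size
ALTER SEQUENCE order_id_seq CACHE 5000;

-- Use the sequence when inserting new rows
INSERT INTO orders (id, amount) VALUES (order_id_seq.NEXTVAL, 99.50);
```

Each session grabs a block of cached values at once, which is fast but can leave gaps in the ID sequence when sessions end.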

A Survey of Machine Learning in Industry

Introduction: This notebook explores the research and papers available on applying machine learning in manufacturing plants and large-scale industrial applications. A great summary journal: here is a great summary paper written by Thorsten Wuest, Daniel Weimer, Christopher Irgens & Klaus-Dieter Thoben. Known applications of ML in manufacturing, from the journal, are tabulated as manufacturing requirements against the theoretical ability of ML to meet them, e.g. the ability to handle high-dimensional problems and data-sets with reasonable effort: Certain ML techniques (e.

Two Functions for Working with Amps

Contents: Working with the AMPS dataset; Understanding the data; Working with AMPS in R; Read in Libraries; Read in Data; Transforming Radio-buttons; Transforming Checkboxes; Bringing it all together; Conclusions and Criticisms. On one of my recent projects, I’ve had to do a lot of work with AMPS, drawing the data from the SQL database directly rather than via the portal on our website.

Performing a Bayesian T-Test in Excel

Objectives: In this post we will compare two samples for differences in their means. The catch is that we want to implement this in a spreadsheet programme (MS Excel) so that it can be easily used by people with no background in R or Python. Microsoft has included many functions allowing standard frequentist analyses to be performed in Excel; however, we will use these functions to perform Bayesian inference.
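The underlying calculation can be sketched outside Excel too. Under flat priors and a normal approximation (a deliberate simplification, not necessarily the post's exact method), the posterior of the difference in means is approximately normal, so the probability that one mean exceeds the other is a single CDF evaluation (the sample data below is made up):

```python
import numpy as np
from scipy.stats import norm

a = np.array([5.1, 4.8, 5.4, 5.0, 5.2])
b = np.array([4.6, 4.9, 4.7, 4.5, 4.8])

# Posterior of (mu_a - mu_b) ~ Normal(mean_a - mean_b, se_a^2 + se_b^2)
diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
p_a_greater = norm.cdf(diff / se)  # posterior P(mu_a > mu_b)
print(round(float(p_a_greater), 3))
```

Each piece (means, variances, and the normal CDF) has a direct Excel counterpart (AVERAGE, VAR.S, NORM.S.DIST).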

Setup packrat project for working on remote databases with R

Overview: As a data analyst I often consult for clients who need me to work on their existing infrastructure, which requires me to set up my entire environment on a remote machine. So the question is: “What is the best way to set up an environment for database exploration on a clean machine?” Introducing packrat, R’s package management system. Installation: the first thing you want to do is set up the environment you want to use on your local machine.

Setting up a Web App Dev Environment under Windows

Writing code is hard enough. But it sucks when your desktop OS seems to be actively fighting against you getting anything done. The best-case scenario is that you’re running a Unix/Linux-based OS. If you’re running something fairly mainstream like Ubuntu, chances are most everything you need works out of the box (apart from wireless and graphics card drivers, of course), and most of the documentation, tutorials and resources on the internet assume as much.

Condensed R For Data Science: Data Visualisation

Data Visualisation Chapter. Contents: Aesthetic mapping (exercises); Facets; Geoms (exercises); Statistical Transformations (exercises); Position Adjustments (position = identity, position = fill, position = dodge, and something interesting: position = jitter; exercises); Coordinate Systems (coord_flip(), coord_quickmap(), coord_polar(); exercises); Layered Grammar of Graphics. This piece is part of a series that serves as a condensed help guide that I use to explore R and the tidyverse packages as I work through R for Data Science, available here

Predicting cloud movements

In this notebook we will attempt to predict or forecast the movement of rain clouds over North America! Of course, the weather is a known chaotic system, so we will start by creating an initial benchmark: throwing a deep learning model at the problem. The images can be downloaded from AWS S3 using the following link. Import some libraries: import os; from os import listdir; from PIL import Image as PImage; import matplotlib.

Test multiple sklearn models with LIME

In this blog I load the toy dataset for detecting breast cancer. Often, before you can fine-tune and productionise any model, you will have to play with and test a wide range of models. Luckily we have frameworks like scikit-learn and caret to train many different models! Prepare yourself: import numpy; import pandas; import matplotlib.pyplot as plt; from sklearn import model_selection; from sklearn.linear_model import LogisticRegression; from sklearn.tree import DecisionTreeClassifier; from sklearn.
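A minimal sketch of the model-sweep idea with scikit-learn (the particular model choices are illustrative; the post also folds in LIME on top of this):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "logistic": LogisticRegression(max_iter=5000),
    "tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}

# 5-fold cross-validated accuracy for each candidate model
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.3f}")
```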

8020 Hogwarts Sorting Hat!

Contents: Back story; Load up the data; Summarise data; Add some useful labels; Distribution of affinity to each house; Number of people in each house; Cross-tabulation of each person’s top house vs desired house; Cross-tabulation of each person’s top house vs expected house; Exploration; Correlations between houses predicted; Correlations between top house and desired house; Correlations between top house and expected house; Visualize similarity of people; Artificially intelligent sorting hat!

Explaining machine learning models

Contents: Overview; The data; The problem; Benchmark many models with caret; Set cross-validation parameters; Build model data framework; Train models; Visualize the residuals; Introducing DALEX explainers!; Model performance; Variable importance; Variable response; Prediction breakdown. Packages: library(tidyverse); library(caret); library(magrittr); library(DALEX). Overview: This blog will cover DALEX explainers. These are very useful when we need to validate a model, or to explain on a per-observation basis why a model made the prediction it did.

Cats vs Dogs classifier

Contents: Overview; Download data; Build network; Data preprocessing; Image data augmentation; How it works; Create new network; Train new network with augmentation generators; Further optimization; Transfer learning - VGG16. Overview: Deep neural networks using convolutional layers are currently (2018) the best image classification algorithms out there… Let’s build one to see what it’s all about. How about identifying cats and dogs? This post follows through the example in the book “Deep Learning with R” by François Chollet with J.

IMDB movie classification

Contents: Overview; Load the IMDB data; View data; Link the original data; Prepare data as tensors; One-hot encode; Set outcome data types; Build network; Split test/train; Train; If we train for only 4 epochs; If we use dropout; Investigate the best-predicted movie. Overview: This post follows through the example in the book “Deep Learning with R” by François Chollet with J. J. Allaire.

Finding geographical points of interest using Python

This blog will take a look at scraping the TomTom and Google Places APIs to get all the points of interest in an area. A recursive grid-search algorithm is discussed that efficiently identifies all of the POIs in a large area where there is a limit on the number of results the API returns. TomTom vs Google: first, let’s compare each API:

                           TomTom    Google Places
  Max free daily requests  2500      2500
  Max results returned     100       20
  Point search             Yes       Yes
  Max point search radius  none      50km
  Rectangle search         Yes       No
  Up-to-date               No        Yes

In each case, you need to register for your own API key, which you include as a parameter in the search.
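The recursive grid-search idea can be sketched with a mock API standing in for TomTom/Google (the result cap, coordinates, and function names are illustrative):

```python
import random

MAX_RESULTS = 20  # e.g. Google Places' per-request cap

def mock_api(box, pois):
    """Stand-in for a rectangle-search API call: returns POIs inside box."""
    x0, y0, x1, y1 = box
    return [(x, y) for (x, y) in pois if x0 <= x < x1 and y0 <= y < y1]

def grid_search(box, pois):
    hits = mock_api(box, pois)
    if len(hits) < MAX_RESULTS:
        return hits            # cell not saturated: we got everything
    # Cell hit the cap, so results may be truncated: split into 4 quadrants and recurse
    x0, y0, x1, y1 = box
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    quads = [(x0, y0, mx, my), (mx, y0, x1, my), (x0, my, mx, y1), (mx, my, x1, y1)]
    return [p for q in quads for p in grid_search(q, pois)]

random.seed(0)
pois = [(random.random(), random.random()) for _ in range(500)]
found = grid_search((0.0, 0.0, 1.0, 1.0), pois)
print(len(found))  # all 500 recovered
```

The half-open cell boundaries ensure each point lands in exactly one quadrant, so nothing is double-counted.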

Office R Blog

Contents: Prelude and set up; Pre-reqs; Stuff for the deck; Actual OfficeR. Prelude and set up: In this markdown I will explain how to use OfficeR to generate PowerPoint presentations. The rmsfuns package was useful for its load_pkg function, which makes it easier to load multiple packages. Firstly, you need the OfficeR package. I am also using the extrafont package in order to use Tahoma in the PowerPoint, per the Eighty20 template.

Predict house prices - deep learning, keras

Contents: Overview; Naive model (no time index); Load the data; Scale the variables; Define the model; Measuring over-fit using k-fold cross-validation; Get results; Benchmark vs gradient boosting machines; Time series models using LSTM together with an inference network; Read in the data; Process data; Design inference model; Design LSTM model; Test a LSTM model; Everything set… time to get started!; Back-test LSTM model; Combine LSTM and inference networks into one deep neural network?

Let's play with autoencoders (Keras, R)

What are autoencoders? How do we build them? Build it: Step 1 - load and prepare the data; Step 2 - define the encoder and decoder; Step 3 - compile and train the autoencoder; Step 4 - extract the weights of the encoder; Step 5 - load the weights into an encoder model and predict; Conclusion. Packages: library(ggplot2); library(keras); library(tidyverse).

Serving a machine learning model via API

About the plumber package: In order to serve our API we will make use of the great plumber package in R. To read more about this package go to: Setup: Load in some packages. If you are going to host the API on a SUSE or Red Hat Linux server, make sure you have all the dependencies as well as the packages installed to follow through this example yourself.
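For comparison, the same pattern in Python with Flask (a hedged analogue of the plumber setup; the model, route, and port are made up):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# A stand-in "model" -- in practice you would load a trained one from disk.
def predict(x):
    return 2.0 * x + 1.0

@app.route("/predict")
def predict_endpoint():
    x = float(request.args.get("x", 0))
    return jsonify({"prediction": predict(x)})

# app.run(port=8000)  # uncomment to serve locally
```

Like plumber, the framework maps a decorated function to a URL route and handles the JSON serialisation for you.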

Blogging from a Python Jupyter Notebook

This is a follow-on post to Stefan’s original, showing how to generate a blog post from a Jupyter Notebook instead of an R markdown. This post itself started life as a Jupyter Notebook which lives in the same content/posts folder as the other Rmd files used for the site. We’ll walk through how it became a blog post. The process is a little more complicated than for the Rmd files (since that’s what Blogdown was built for), but we can still get it to work relatively easily.

How to add a blog to Blogdown

Contents: Pre-requisites; Create from scratch; Clone or open the blog repository from GitHub; Create a post using the “Addins” dropdown; Change/check your date format; Write a kickass post; Compile your new work using Blogdown; Push your post to the website; Create from existing Rmd. Pre-requisites: Install blogdown and Hugo from the R console. Create from scratch: Creating it from scratch is probably the easiest, since you can run and test your code as you type it up.

Dealing with nested data

Dealing with nested data can be really frustrating… especially if you want to keep your workspace nice and tidy with all your data in tables! Without prior experience, trying to get at these nested tibbles can seem almost impossible. Downloading data from an API created by Blizzard: to illustrate how you would deal with nested data, I found an API that lets you download all kinds of data on the e-sport/game Overwatch.
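In Python, pandas offers a similar escape hatch for nested records; a sketch with made-up Overwatch-flavoured data (not the real API schema):

```python
import pandas as pd

# Nested API-style records (structure is illustrative)
records = [
    {"player": "ana", "stats": {"wins": 10, "losses": 4},
     "heroes": [{"name": "Mercy", "hours": 30}]},
    {"player": "bob", "stats": {"wins": 7, "losses": 9},
     "heroes": [{"name": "Genji", "hours": 12}, {"name": "Hanzo", "hours": 5}]},
]

# Flatten the nested dicts into dotted columns...
flat = pd.json_normalize(records)
# ...and unnest the list-column into one row per hero, keeping the player key
heroes = pd.json_normalize(records, record_path="heroes", meta="player")
print(heroes)
```

This is the pandas counterpart of tidyr's unnest() on a list-column of tibbles.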

Data Wrangling

Why data-wrangling? If you can wrangle data into the proper form you can do anything with it… Data-wrangling is absolutely essential for every data science task where we need to work with collected data. A recent article from the New York Times said: “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing data, before it can be explored for useful information.”

Data Science Workflow

Why it matters: When working with recurring clients and projects, together with tight deadlines, it is easy to cut corners or forget good practice. Here are the top reasons why a data science workflow is vital: work is well documented; folders and files are logically structured; projects remain version controlled; data output is tracked; logical separation of scripts makes productising easy; good naming conventions make tracking of the project flow easy.

Benchmarking machine learning models in parallel

Overview: Having just started playing with deep learning models in R, I wanted to visually compare them to other, more traditional ML workflows. Of course, deep learning is generally used where other models fail; but with no need for feature selection, and rapidly increasing power and ease of use, it may yet evolve into a general learning paradigm. Meanwhile, with tabular data and packages like caret, traditional machine learning methods have become so streamlined that minimal user input is required at all.
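The benchmarking harness can be sketched like this in Python (the models and data are stand-ins; the post itself works in R with caret):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
models = {
    "logit": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
}

def score(item):
    name, model = item
    t0 = time.perf_counter()
    acc = cross_val_score(model, X, y, cv=5).mean()
    return name, acc, time.perf_counter() - t0

# Benchmark all models concurrently rather than one after the other
with ThreadPoolExecutor() as pool:
    results = list(pool.map(score, models.items()))

for name, acc, secs in results:
    print(f"{name}: {acc:.3f} ({secs:.2f}s)")
```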

Design and Analysis of Experiments with R

Introduction: This document will cover in depth the use of DoE experiments in R! We will focus our attention on measuring treatment effects, not variances. Based on the work: Lawson, John. Design and Analysis of Experiments with R. Chapman and Hall/CRC, 2014-12-17. VitalBook file. Definitions: an experiment (also called a run) is an action where the experimenter changes at least one of the variables being studied and then observes the effect of his or her action(s).

Single Word Analysis of Early 19th Century Poetry Using tidytext

When reading poetry, it feels as if you are reading emotion. A good poet can use words to evoke in their reader whichever emotions they choose. To do this, poets often have to look deep within themselves, using their own lives as fuel. For this reason you would expect that poets, and as a result their poetry, would be influenced by the events happening around them. If certain events caused the poet’s view of the world to be an unhappy one, you’d expect their poetry to be unhappy, and of course the same for the opposite.

SatRday and visual inference of vine copulas

SatRday: From the 16th to the 18th of February, satRday was held in Cape Town, South Africa. The programme kicked off with two days of workshops, followed by the conference on Saturday. The workshops were divided into three large sections: R and Git (Jennifer Bryan); Shiny, flexdashboard and (Julia Silge); Building and validating logistic regression models (Steph Locke). R and Git: integration of version control through git and RStudio has never been this easy.

Rewiring replyr with dplyr

Introduction to parameterized dplyr expressions: The usefulness of any small function you write will eventually be judged on its ability to be applied generically across arbitrary data. As I explored a blog post from Dec 2016, I became a lot more interested in writing dynamic code with dplyr functions, which form the data-wrangling silo of my analytical flow. This ability came with the new replyr package: no longer will I need to break up my data processing when columns change, just because my code depends on certain column names in the dataset currently in use.

A Soft Introduction to Machine Learning

Contents: Important machine learning libraries; Future additions; Part 1 - Data Preprocessing; Part 2 - Regression (simple linear regression, multiple linear regression, polynomial regression, support vector machine regression, regression trees, random forest regression, a more robust application of machine learning regressions (random forest)); Part 3 - Clustering (k-means); Part 4 - Dimensionality Reduction (create PCA); Part 5 - Reinforcement Learning (multi-armed bandit problem, upper confidence bound (UCB) method, improve results using UCB, visualize the model add selection); Part 6 - Parameter Grid Search, Cross-validation and Boosting (grid search and parameter tuning, XGBoost). Important machine learning libraries: I created this document to serve as an easy introduction to basic machine learning methods.

Mirror, mirror on the wall

Introduction Saving your R dataframe to a .csv can be useful; being able to view the data all at once can help to see the bigger picture. Often though, multiple dataframes, all pieces of the same project, need to be viewed this way and related back to one another. In this case viewing becomes far easier when these dataframes are written to .xlsx across multiple sheets in a single workbook. Not to mention the time and energy saved when you no longer have to find and open multiple files.
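The Python equivalent of the multi-sheet trick, for reference (the filename and data are illustrative; the post itself does this from R):

```python
import pandas as pd

sales = pd.DataFrame({"month": ["Jan", "Feb"], "revenue": [100, 120]})
costs = pd.DataFrame({"month": ["Jan", "Feb"], "spend": [60, 70]})

# One workbook, one sheet per dataframe
with pd.ExcelWriter("project.xlsx") as writer:
    sales.to_excel(writer, sheet_name="sales", index=False)
    costs.to_excel(writer, sheet_name="costs", index=False)
```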

Data Scientist with a wine hobby

After high school I made my way from Johannesburg, in the northern part of South Africa, to the famous wine country of Stellenbosch. Here, for the first time, I got a ton of exposure to wine and the wonderful myriad of varietals that make up this “drink of the gods”. The one trick to wine tasting and exploring vini- and viticulture is that the best way to learn about it all is to go out to the farm and drink the wine for yourself.

Gotta catch them all

Introduction: When data becomes high-dimensional, the inherent relational structure between the variables can become unclear or indistinct. One might want to find clusters for any number of reasons; me, I want to use them to better understand my childhood. To be more specific, I will be using clustering to highlight different groupings of pokemon. The results of this analysis can then retrospectively be applied to a younger me having to choose which pokemon to catch and keep, or which to use in battle to gain experience points.
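A toy sketch of the clustering step (the "stats" are simulated, not real pokemon data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two obvious groups in a made-up (attack, defence) space
stats = np.vstack([rng.normal([50, 50], 5, size=(30, 2)),
                   rng.normal([90, 90], 5, size=(30, 2))])

# k-means assigns each pokemon to its nearest cluster centroid
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(stats)
print(np.bincount(km.labels_))
```

In practice you would choose k by inspecting something like the within-cluster sum of squares across candidate values.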

Untangling overlapping cellphone usage segments with Latent Class Analysis

Our behaviour is often highly variable, and reducing it to a single number such as an average might be comforting but is ultimately misleading. For instance, I generally use 500MB of data on my cellphone per month, but this can be as little as 200MB and on occasion well beyond 1GB. This line of thinking suggests my behaviour is best modelled as a distribution with a peak around 500MB and a mean just above 500MB.
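That distributional view can be made concrete with a mixture model; a hedged sketch on simulated usage data (a Gaussian mixture as a stand-in for the latent class analysis the post uses):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulated monthly data usage (MB): a light-usage mode and a heavy one,
# rather than a single "average user"
usage = np.concatenate([rng.normal(250, 50, 300),
                        rng.normal(900, 150, 100)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(usage)
means = sorted(gm.means_.ravel())
print([round(float(m)) for m in means])  # approximately [250, 900]
```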

Automated parameter selection for LOESS regression

Typically, when we want to understand the relationship between two variables, we simply regress one on the other, plot the points and fit a linear trend line. To illustrate this, we use the EuStockMarkets dataset pre-loaded in R. The dataset contains the daily closing prices of major European stock indices. We will focus on the FTSE. Below, we regress the FTSE daily closing price on time (what we will call an “Index”) and plot a linear trend.
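The linear-trend baseline the post starts from can be sketched as follows (the series is simulated, not the actual FTSE data):

```python
import numpy as np

# Illustrative stand-in for a daily closing-price series
t = np.arange(100)
price = 3000 + 2.5 * t + np.random.default_rng(0).normal(0, 20, 100)

# Fit and evaluate a linear trend: price ~ slope * t + intercept
slope, intercept = np.polyfit(t, price, 1)
trend = slope * t + intercept
residual_sd = (price - trend).std()
print(round(float(slope), 2), round(float(residual_sd), 1))
```

LOESS replaces the single global line with many local weighted fits, which is where the span parameter (and the need to select it automatically) comes in.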

Cleaning up messy R code

Are you the type of person that likes your code to be identically indented, spacing to be consistent throughout your script, everything to be clear and aligned, easy to read and just nice to look at? Well, I am. Often I get tremendously untidy code from other coders (those bandits), and it takes me way too much time to understand where this for loop starts, where that function ends, etc. Reading such code gets especially confusing when the person uses nested for loops (for-loop inception: a for in a for in a for…).

An Introduction to Generalized Linear Models

I recently had a chunk of leave, and I thought that a good use of my time would be to read “An Introduction to Generalized Linear Models”, by Annette J. Dobson and Adrian G. Barnett (2008). My statistical background is somewhat haphazard, so this book really filled in some of the cracks in my foundation. It provides an overview of the theory, illustrated with examples, and includes code to implement the methods in both R and Stata.

Body Temperature Series of Two Beavers

This short post will explore a funny dataset that comes as part of R’s datasets library. The dataset was chosen out of interest in what it contains, as well as to engage with Hadley Wickham’s ggplot2 package. Reynolds (1994) describes a small part of a study of the long-term temperature dynamics of the beaver Castor canadensis in north-central Wisconsin. Body temperature was measured by telemetry every 10 minutes for four females, but data from one period of less than a day for each of two animals is used here.
