Anri, Wed, Sep 1, 2021
Introduction: A retailer wishes to understand the variety of shopping behaviour at its stores. For example, how do its existing customers compare to its new customers? What about its lost customers compared to its existing customers? Do its customers look different from its competitors'? To answer these questions we look at establishing which customer characteristics are the most important when categorising customers into the spend-behaviour groups above.
Stefan, Sun, Feb 17, 2019
This notebook is designed to give a friendly introduction to PyTorch. There are many different deep learning tools currently out there.
My personal favourites are Keras and PyTorch.
Why PyTorch over TensorFlow? TensorFlow is way too low level for most people… And why go low level if the library itself already hides the maths from us? Well, because some people want to write their own tensor operations.
But don’t worry: the complexity is not worth it for most of us, who want 80% of the results for 20% of the effort.
Pieter, Fri, Feb 8, 2019
What is GNU Screen? GNU Screen (screen) is a text-based program usually described as a window manager or terminal multiplexer. While it does a great many things, its two biggest features are its detachability and its multiplexing. Detachability means that you can run programs from within screen, detach and log out, then log in later, reattach, and the programs will still be there. Multiplexing means that you can have multiple programs running within a single screen session, each within its own window.
Stefan, Thu, Jan 31, 2019
Introduction Measuring treatment effect in data contexts where the response was already measured without an experimental design usually requires matching to control for confounding effects.
Here I outline matching using random forests to improve performance over genetic matching while maintaining reasonable matching quality.
Why? In most cases the first line of attack would be matching using propensity scoring. This is easily done using a GLM logit model with a package like MatchIt in R.
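As a hedged illustration of that first line of attack (not code from the post; the data frame and column names below are hypothetical):

    library(MatchIt)

    # Hypothetical data: 'treated' is a 0/1 indicator, 'age' and 'spend' are covariates
    m_out <- matchit(treated ~ age + spend,
                     data     = df,
                     method   = "nearest",  # nearest-neighbour matching on the propensity score
                     distance = "glm")      # logistic-regression propensity scores ("logit" on older MatchIt versions)

    summary(m_out)                   # covariate balance before and after matching
    matched_df <- match.data(m_out)  # matched sample for the outcome analysis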
Stefan, Wed, Jan 30, 2019
Install Ingres You can find the installation manual for Ingres at:
http://docs.actian.com/ingres/11.0/index.html#page/QuickStart_Linux/Installing_Ingres_for_Linux.htm
The easiest method is probably to download the tar file onto the Ubuntu machine.
Extract the tar file using the usual commands
Inside the extracted folder, run the express_install.sh bash script
- Provide the -user xxx parameter to set up the environment variables more easily for a specific user on the Ubuntu machine
- Run it only in a bash shell (not fish or another shell)
Stefan, Wed, Oct 31, 2018
Introduction If you have ever created an API endpoint using plumber or Flask you may have wondered how to do some simple load testing…
Here is a simple way to test this using an R script.
Define a function that calls your API with some parameters. Depending on what your API does you may want to test different parameters altogether; for our case we will just make a simple call using fixed parameters:
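A minimal sketch of such a function (the endpoint URL and parameters are hypothetical, not from the post):

    library(httr)

    # Call a hypothetical local endpoint with fixed parameters and return the status code
    call_api <- function() {
      res <- GET("http://localhost:8000/predict",
                 query = list(id = 123, n = 10))
      status_code(res)
    }

    # A crude load test: fire off 100 calls and time them
    system.time(replicate(100, call_api()))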
Stefan, Wed, Oct 31, 2018
Introduction What is matching? Statistical matching is the process where we pair up responses in the treatment group with their respective doppelgängers from the untreated group.
Once matched, these groups can be compared much like the treatment and control groups of a designed experiment (DoE).
Problem statement Often during performance analytics we have to use statistical matching to construct a control group post program launch.
In this case we may want to validate the quality of the matching.
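One common check, sketched here under assumed column names rather than taken from the post, is the standardized mean difference of each covariate between the matched groups:

    # Standardized mean difference of a covariate between treated and control
    smd <- function(x, treated) {
      m1 <- mean(x[treated == 1]); m0 <- mean(x[treated == 0])
      s  <- sqrt((var(x[treated == 1]) + var(x[treated == 0])) / 2)  # pooled SD
      (m1 - m0) / s
    }

    # Applied to every covariate in a hypothetical matched data frame 'matched_df'
    sapply(matched_df[, c("age", "spend")], smd, treated = matched_df$treated)

Values close to zero (commonly below 0.1 in absolute terms) suggest the matched groups are well balanced on that covariate.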
Stefan, Wed, Oct 24, 2018
Introduction In consumer analytics it is very important to measure the effectiveness of your marketing and campaign management.
Unfortunately each company may have a different opportunity-infrastructure proposition. Some may run product campaigns. Others run loyalty programs for customers. Some programs have already been run, or outbound communication has already been sent out, before it was decided to run performance analytics.
All of these variations make measuring uplift quite complicated.
Gayle, Tue, Oct 23, 2018
Motivation The Vertica database uses sequences to keep track of auto-incremented columns. So for example, if you have a table with an ID column, and you want Vertica to automatically generate an ID for every new row that you add, Vertica needs a way of knowing what the next number should be. It does this by creating a sequence associated with the auto-incremented field, and when new rows are added, it checks for the next value in the sequence, and uses that.
Stefan, Wed, Oct 17, 2018
Introduction This notebook explores the research and papers available on applying machine learning in manufacturing plants or large scale industrial applications.
Great summary paper Here is a great summary paper written by:
Thorsten Wuest, Daniel Weimer, Christopher Irgens & Klaus-Dieter Thoben
https://www.tandfonline.com/doi/full/10.1080/21693277.2016.1192517
Known applications of ML in manufacturing From the journal:
Manufacturing requirement: Ability to handle high-dimensional problems and data-sets with reasonable effort
Theoretical ability of ML to meet the requirement: Certain ML techniques (e.g. …)
Peter Smith, Tue, Sep 18, 2018
Contents: Working with the AMPS dataset; Understanding the data; Working with AMPS in R; Read in Libraries; Read in Data; Transforming Radio-buttons; Transforming Checkboxes; Bringing it all together; Conclusions and Criticisms.
Working with the AMPS dataset On one of my recent projects, I’ve had to do a lot of work with AMPS, drawing the data from the SQL database directly rather than via the portal on our website.
Kylen, Wed, Aug 22, 2018
Objectives In this post we will compare two samples for differences in their means. The catch is that we want to implement this in a spreadsheet programme (MS Excel) so that it can be easily used by people with no background in R/Python. Microsoft has included many functions allowing standard frequentist analyses to be performed in Excel; however, we will use these functions to perform Bayesian inference.
Stefan, Fri, Aug 10, 2018
Overview As a data analyst I often consult for clients who need me to work on their existing infrastructure. But this requires me to set up my entire environment on a remote machine.
So the question is: “What is the best way for me to set up an environment for database exploration on a clean machine?”
Introducing packrat - R’s package management system.
Installation The first thing you want to do is set up the environment you want to use on your local machine.
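A minimal sketch of that workflow (the project path and package names are illustrative):

    install.packages("packrat")

    packrat::init("~/projects/db-exploration")  # give the project its own private library

    install.packages(c("DBI", "dplyr"))  # these now install into the project library
    packrat::snapshot()                  # record the exact package versions used

    # later, on the remote machine, after copying the project across:
    packrat::restore()                   # reinstall the recorded versions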
Marlan, Fri, Jul 27, 2018
Writing code is hard enough. But it sucks when your desktop OS seems to be actively fighting against you actually getting anything done. The best-case scenario is that you’re running a Unix/Linux-based OS. If you’re running something fairly mainstream like Ubuntu, chances are almost everything you need works out of the box (apart from wireless and graphics card drivers, of course) and most of the documentation, tutorials and resources on the internet assume that.
Louis Becker, Thu, Jul 19, 2018
Contents: Data Visualisation Chapter; Aesthetic mapping; Exercises; Facets; Geoms; Geom Exercises; Statistical Transformations; Transformation Exercises; Position Adjustments (position = identity, position = fill, position = dodge, something interesting: position = jitter); Position Adjustment Exercises; Coordinate Systems (coord_flip(), coord_quickmap(), coord_polar()); Coordinate Systems Exercises; Layered Grammar of Graphics.
This piece is part of a series that serves as a condensed help guide that I use to explore R and the tidyverse packages as I work through R for Data Science, available here.
Stefan, Mon, Jul 16, 2018
In this notebook we will attempt to predict or forecast the movement of rain/clouds over the area of North America!
Of course the weather is a known chaotic system, so we will try to create an initial benchmark by throwing a deep learning model at the problem.
The images can be downloaded from AWS S3 using the following link https://s3-eu-west-1.amazonaws.com/data-problems/precipitation_data.zip
Import some libraries
    import os
    from os import listdir
    from PIL import Image as PImage
    import matplotlib…
Stefan, Wed, Jul 11, 2018
In this blog I load the toy dataset for detecting breast cancer. Often, before you can fine-tune and productionize any model, you will have to play with and test a wide range of models.
Luckily we have frameworks like scikit-learn and caret to train many different models! Prepare yourself…
    import numpy
    import pandas
    import matplotlib.pyplot as plt
    from sklearn import model_selection
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn…
stefan, Thu, Jul 5, 2018
Contents: Back story; Load up the data; Summarise data; Add some useful labels; Distribution of affinity to each house; Number of people in each house; Cross tabulation of each person’s top house vs desired house; Cross tabulation of each person’s top house vs expected house; Exploration; Correlations between houses predicted; Correlations between top house and desired house; Correlations between top house and expected house; Visualize similarity of people; Artificial intelligent sorting hat!
Stefan Fouche, Thu, Jul 5, 2018
Contents: Overview; The data; The problem; Benchmark many models with caret; Set crossvalidation parameters; Build model data framework; Train models; Visualize the residuals; Introducing DALEX explainers!; Model performance; Variable Importance; Variable response; Prediction breakdown.
Packages: library(tidyverse), library(caret), library(magrittr), library(DALEX).
Overview This blog will cover DALEX explainers. These are very useful when we need to validate a model or explain why a model made the prediction it made on an observation basis.
Stefan Fouche, Tue, Jul 3, 2018
Contents: Overview; Download data; Build network; Data preprocessing; Image data augmentation; How it works; Create new network; Train new network with augmentation generators; Further optimization; Transfer learning - VGG16.
Overview Deep neural networks using convolutional layers are currently (2018) the best image classification algorithms out there… Let’s build one to see what it’s all about.
How about identifying cats and dogs?
This post follows through the example in the book “Deep Learning with R” by François Chollet with J. J. Allaire.
Stefan Fouche, Tue, Jul 3, 2018
Contents: Overview; Load the IMDB data; View data; Link the original data; Prepare data as tensors; One-hot encode; Set outcome data types; Build network; Split test/train; Train; If we train for only 4 epochs; If we use dropout; Investigate the best predicted movie.
Overview This post follows through the example in the book “Deep Learning with R” by François Chollet with J. J. Allaire.
Kylie, Wed, Jun 27, 2018
This blog will take a look at scraping the TomTom and Google Places APIs to get all the points of interest in an area. A recursive grid search algorithm is discussed that efficiently identifies all of the POIs in a large area where there is a limit on the number of results the API returns.
TomTom vs Google First, let’s compare each API:
                          TomTom    Google Places
Max free daily requests   2500      2500
Max results returned      100       20
Point search              Yes       Yes
Max point search radius   none      50 km
Rectangle search          Yes       No
Up-to-date                No        Yes
In each case, you need to register for your own API key, which you include as a parameter in the search.
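A hedged sketch of the recursive idea (the query_api() helper is hypothetical and the post's implementation may differ): query a rectangle, and if the API returns its maximum number of results, split the rectangle into four quadrants and recurse.

    # Recursively search a bounding box, splitting it whenever the result cap is hit.
    # query_api(lat_min, lat_max, lon_min, lon_max) is a hypothetical wrapper that
    # returns a data frame of POIs found inside the rectangle.
    grid_search <- function(lat_min, lat_max, lon_min, lon_max, max_results = 100) {
      pois <- query_api(lat_min, lat_max, lon_min, lon_max)
      if (nrow(pois) < max_results) return(pois)  # rectangle fully covered, stop here

      lat_mid <- (lat_min + lat_max) / 2
      lon_mid <- (lon_min + lon_max) / 2
      rbind(
        grid_search(lat_min, lat_mid, lon_min, lon_mid, max_results),
        grid_search(lat_min, lat_mid, lon_mid, lon_max, max_results),
        grid_search(lat_mid, lat_max, lon_min, lon_mid, max_results),
        grid_search(lat_mid, lat_max, lon_mid, lon_max, max_results)
      )
    }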
Kieron, Mon, Apr 30, 2018
Contents: Prelude and set up; Pre-reqs; Stuff for the deck; Actual OfficeR.
Prelude and set up
In this markdown I will explain how to use OfficeR to generate PowerPoint presentations.
The rmsfuns package was useful just for its ‘load_pkg’ function, which makes it easier to load multiple packages. Firstly, you need the OfficeR package. I am also using the extrafont package in order to use Tahoma in the PowerPoint, according to the Eighty20 template.
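As a rough sketch of what that generation step can look like (a generic officer example built on the default template, not the Eighty20 one):

    library(officer)
    library(magrittr)

    read_pptx() %>%                                  # start from the default template
      add_slide(layout = "Title and Content", master = "Office Theme") %>%
      ph_with(value = "Monthly results", location = ph_location_type(type = "title")) %>%
      ph_with(value = head(mtcars), location = ph_location_type(type = "body")) %>%
      print(target = "example_deck.pptx")            # write the deck to disk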
Stefan Fouche, Sun, Apr 22, 2018
Contents: Overview; Naive model (no time index); Load the data; Scale the variables; Define the model; Measuring over-fit using k-fold cross-validation; Get results; Benchmark vs gradient boosting machines; Time series models using LSTM together with an inference network; Read in the data; Process data; Design inference model; Design LSTM model; Test a LSTM model; Everything set… time to get started!; Back-test LSTM model; Combine LSTM and inference networks into one deep neural network?
stefan, Tue, Apr 17, 2018
Contents: What are autoencoders?; How do we build them?; Build it; Step 1 - load and prepare the data; Step 2 - define the encoder and decoder; Step 3 - compile and train the autoencoder; Step 4 - extract the weights of the encoder; Step 5 - load up the weights into an encoder model and predict; Conclusion.
Packages: library(ggplot2), library(keras), library(tidyverse).
Stefan, Wed, Apr 4, 2018
About the plumber package In order to serve our API we will make use of the great plumber package in R.
To read more about this package go to:
https://www.rplumber.io/docs/
Setup Load in some packages.
If you are going to host the API on a SUSE or Red Hat Linux server, make sure you have all the dependencies as well as the packages installed so you can follow through this example yourself.
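As a hedged sketch of the pattern (a simple echo endpoint for illustration, not the API built in the post), the route annotations live in an api.R file and plumb() serves them:

    # api.R ----------------------------------------------------------
    #* Echo back a message
    #* @param msg The message to echo
    #* @get /echo
    function(msg = "") {
      list(message = paste0("The message is: '", msg, "'"))
    }

    # run.R ----------------------------------------------------------
    library(plumber)
    plumb("api.R")$run(port = 8000)  # then try http://localhost:8000/echo?msg=hello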
Marlan, Fri, Mar 30, 2018
This is a follow on post to Stefan’s original to show how to generate a blog post from a Jupyter Notebook instead of an R markdown. This post itself started off life as a Jupyter Notebook which lives in the same content/posts folder as the other Rmd files used for the site. We’ll walk through how it became a blog post.
The process is a little more complicated than for the Rmd files (since that’s what blogdown was built for), but we can still get it to work relatively easily.
stefan, Thu, Mar 29, 2018
Contents: Pre-requisites; Create from scratch; Clone or open the blog repository from github; Create a post using the “Addins” dropdown; Change/check your date format; Write a kickass post; Compile your new work using Blogdown; Push your post to the website; Create from existing Rmd.
Pre-requisites Install blogdown and Hugo in the R console: https://bookdown.org/yihui/blogdown/installation.html
Create from scratch Creating it from scratch is probably the easiest since you can run and test your code as you type it up.
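Roughly, the from-scratch workflow looks like this (the post title and options are illustrative):

    install.packages("blogdown")
    blogdown::install_hugo()

    # inside the cloned blog repository
    blogdown::new_post(title = "my-kickass-post", ext = ".Rmd")  # or use the Addins dropdown
    blogdown::serve_site()  # preview the site locally while you write
    blogdown::build_site()  # compile everything before pushing to the website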
Stefan Fouche, Wed, Mar 28, 2018
Dealing with nested data can be really frustrating…
Especially if you want to keep your workspace nice and tidy with all your data in tables!
With no prior experience, trying to get at these nested tibbles can seem almost impossible.
Downloading data from an API created by Blizzard To illustrate how you would deal with nested data, I found an API that lets you download all kinds of data on the e-sport/game called Overwatch.
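The general pattern the post builds towards (a toy sketch, not the actual Overwatch API response) is a tibble with list-columns that tidyr can unnest back into a flat table:

    library(tidyverse)

    # Toy nested tibble: one row per player, with a nested table of stats
    players <- tibble(
      player = c("Ana", "Genji"),
      stats  = list(
        tibble(metric = c("wins", "losses"), value = c(10, 4)),
        tibble(metric = c("wins", "losses"), value = c(7, 9))
      )
    )

    players %>% unnest(cols = stats)  # back to one row per player-metric combination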
Stefan, Tue, Mar 27, 2018
Why data-wrangling? If you can wrangle data into the proper form you can do anything with it…
Data-wrangling is absolutely essential for every data science task where we need to work with collected data.
A recent article from the New York Times said: “Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing data, before it can be explored for useful information.”
stefan, Mon, Mar 26, 2018
Why it matters When working with recurring clients and projects together with tight deadlines it is easy to cut corners or forget good practice.
Here are the top reasons why a data science workflow is vital:
Work is well documented
Folders and files are logically structured
Projects remain version controlled
Data output is tracked
Logical separation of scripts makes productizing easy
Good naming conventions make tracking of the project flow easy
Stefan Fouche, Wed, Nov 15, 2017
Overview Having just started playing with deep learning models in R, I wanted to visually compare them to other, more traditional ML workflows. Of course, deep learning is generally used where other models fail, but with no need for feature selection and rapidly increasing power and ease of use, it may just evolve into a general learning paradigm.
However, with tabular data and packages like caret the machine learning methods have become so streamlined that minimal user input is required at all.
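For example, a cross-validated random forest takes only a few lines in caret (a generic sketch on a toy dataset, not the comparison run in the post):

    library(caret)

    fit <- train(Species ~ ., data = iris,
                 method    = "rf",  # random forest
                 trControl = trainControl(method = "cv", number = 5))
    fit$results  # resampled accuracy for each tuning value of mtry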
Stefan Fouchè, Tue, Nov 7, 2017
Introduction This document will cover in depth the use of DoE experiments in R! We will focus our attention on measuring treatment effects, not variances.
Based on the work:
Lawson, John. Design and Analysis of Experiments with R. Chapman and Hall/CRC, 2014. VitalBook file.
Definitions An experiment (also called a run) is an action where the experimenter changes at least one of the variables being studied and then observes the effect of his or her action(s).
Laurence Sonnenberg, Mon, Jun 12, 2017
When reading poetry, it feels as if you are reading emotion. A good poet can use words to bring to their reader whichever emotions they choose. To do this, these poets often have to look deep within themselves, using their own lives as fuel. For this reason you would expect that poets, and as a result their poetry, would be influenced by the events happening around them. If certain events caused the poet’s view of the world to be an unhappy one, you’d expect their poetry to be unhappy, and of course the same for the opposite.
Hanjo Odendaal, Sun, Feb 19, 2017
SatRday From the 16th to the 18th of February, satRday was held in the City of Cape Town in South Africa. The programme kicked off with two days of workshops and then the conference on Saturday. The workshops were divided up into three large sections:
- R and Git (Jennifer Bryan)
- Shiny, flexdashboard and Shinyapps.io (Julia Silge)
- Building and validating logistic regression models (Steph Locke)
R and Git
Integrating version control through Git and RStudio has never been this easy.
Hanjo Odendaal, Tue, Feb 14, 2017
Introduction to parameterized dplyr expressions The usefulness of any small function you write will eventually be judged on its ability to be applied generically across any arbitrary data. As I explored a blog post from Dec 2016, I became a lot more interested in writing dynamic code with the dplyr functions that form part of the data-wrangling silo in my analytical flow. This ability came with the new replyr package: no longer do I need to break up my data processing when columns have to be changed because my code depends on certain column names in the dataset currently in use.
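For context, the same goal can be reached today with dplyr's tidy evaluation rather than replyr; a minimal sketch of a column-agnostic summary function:

    library(dplyr)

    # Summarise any numeric column of any data frame, with the columns chosen by the caller
    summarise_by <- function(df, group_col, value_col) {
      df %>%
        group_by({{ group_col }}) %>%
        summarise(mean_value = mean({{ value_col }}, na.rm = TRUE))
    }

    summarise_by(mtcars, cyl, mpg)  # mean mpg per number of cylinders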
Stefan Fouche, Fri, Jan 6, 2017
Contents: Important machine learning libraries; Future additions; Part 1 - Data Preprocessing; Part 2 - Regression (Simple linear regression, Multiple linear regression, Polynomial regression, Support vector machine regression, Regression trees, Random forest regression, A more robust application of machine learning regressions (random forest)); Part 3 - Clustering (K-means); Part 4 - Dimensionality Reduction (PCA); Part 5 - Reinforcement Learning (Multi-armed bandit problem, Upper Confidence Bound (UCB) method, Improve results using UCB, Visualize the model's ad selection); Part 6 - Parameter Grid Search, Cross-validation and Boosting (Grid search and parameter tuning, XGBoost).
Important machine learning libraries I created this document to serve as an easy introduction to basic machine learning methods.
Laurence Sonnenberg, Tue, Aug 30, 2016
Introduction Saving your R dataframe to a .csv can be useful; being able to view the data all at once can help to see the bigger picture. Often though, multiple dataframes, all pieces of the same project, need to be viewed this way and related back to one another. In this case viewing becomes far easier when these dataframes are written to .xlsx across multiple sheets in a single workbook. Not to mention the time and energy saved when you no longer have to find and open multiple files.
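A minimal sketch of the idea, here using the writexl package (the post itself may use a different package such as openxlsx): each named list element becomes its own sheet in the workbook.

    library(writexl)

    # Two related data frames written to one workbook, one sheet each
    write_xlsx(list(cars = mtcars, flowers = iris), path = "project_data.xlsx")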
Hanjo Odendaal, Tue, Aug 9, 2016
After high school I made my way from Johannesburg, situated in the northern part of South Africa, to the famous wine country known as Stellenbosch. Here for the first time I got a ton of exposure to wine and the wonderful myriad of varietals that make up this “drink of the gods”.
The one trick to wine tasting and exploring vini- and viticulture is that the best way to learn about it all is to go out to the farm and drink the wine for yourself.
Hanjo Odendaal, Thu, Feb 18, 2016
Introduction When data becomes high-dimensional, the inherent relational structure between the variables can sometimes become unclear or indistinct. One might want to find clusters for any number of reasons; me, I want to use it to better understand my childhood. To be more specific, I will be using clustering to highlight different groupings of pokemon. The results of this analysis can then retrospectively be applied to a younger me having to choose which pokemon to catch and keep, or perhaps which to use in battle to gain experience points.
Vulindlela Ndiweni, Wed, Feb 17, 2016
Our behaviour is often very variable, and reducing it to a single number such as an average might be comforting but ultimately misleading. For instance, I generally use 500 MB of data on my cellphone on a monthly basis, but this can be as little as 200 MB and on occasion well beyond 1 GB. This line of thinking would suggest my behaviour is best modelled as a distribution with a peak around 500 MB and a mean just above 500 MB.
Allan Davids, Thu, Feb 11, 2016
Typically, when we want to understand the relationship between two variables we simply regress one on the other, plot the points and fit a linear trend line. To illustrate this, we use the EuStockMarkets dataset pre-loaded in R. The dataset contains the daily closing prices of major European stock indices. We will focus on the FTSE. Below, we regress the FTSE daily closing price data on time (what we will call an “Index”) and plot a linear trend.
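A rough sketch of that first step:

    # EuStockMarkets ships with R; the "FTSE" column holds the daily closing prices
    ftse <- as.numeric(EuStockMarkets[, "FTSE"])
    df   <- data.frame(Index = seq_along(ftse), FTSE = ftse)

    fit <- lm(FTSE ~ Index, data = df)  # regress the closing price on a simple time index

    plot(df$Index, df$FTSE, type = "l", xlab = "Index", ylab = "FTSE close")
    abline(fit, col = "red")            # linear trend line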
Allan Davids, Thu, Feb 4, 2016
Are you the type of person who likes your code to be identically indented, for spacing to be consistent throughout your script, everything to be clear and aligned, to read easily and just look nice? Well, I am. Often I get tremendously untidy code from other coders (those bandits), and it takes me way too much time to understand where this for loop starts, where that function ends, etc. Reading this code gets especially confusing if the person, for example, uses nested for loops (for loop inception: a for in a for in a for…).
Gayle Apfel, Tue, Dec 15, 2015
I recently had a chunk of leave, and I thought that a good use of my time would be to read “An Introduction to Generalized Linear Models”, by Annette J. Dobson and Adrian G. Barnett (2008). My statistical background is somewhat haphazard, so this book really filled in some of the cracks in my foundation. It provides an overview of the theory, illustrated with examples, and includes code to implement the methods in both R and Stata.
Hanjo Odendaal, Tue, Nov 24, 2015
This short post will explore a funny dataset that comes as part of R’s datasets library. The dataset was called up out of interest in what it contains, as well as to engage with Hadley Wickham’s ggplot2 package.
Reynolds (1994) describes a small part of a study of the long-term temperature dynamics of the beaver Castor canadensis in north-central Wisconsin. Body temperature was measured by telemetry every 10 minutes for four females, but data from one period of less than a day for each of two animals are used here.
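A quick sketch of the kind of ggplot2 exploration this allows (beaver1 is one of the two bundled data frames):

    library(ggplot2)

    # beaver1: body temperature of one animal, measured every 10 minutes
    ggplot(beaver1, aes(x = time, y = temp, colour = factor(activ))) +
      geom_point() +
      labs(x = "Time of day (hhmm)", y = "Body temperature (°C)", colour = "Active")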