Todd enjoys his job as a Statistical Analyst for Sierra Trading Post. He has been working there for over 5 years. Todd completed his M.S. in Applied Statistics and a B.A. in Economics at the University of Northern Colorado. After spending the day sorting through data issues, writing SQL and fitting models in R, Todd comes home to his loving Wife and children. His research interests include Reproducible Research, Missing Data, Logistic Regression and Issues Related to Big Data.

## Analyzing Lyft Data - Part I

Ride sharing is a popular way to get around and much cheaper than some alternatives.  How much do the people who choose to participate in ride sharing actually make? We will start this series of examining Lyft driving data by getting the data loaded and trying to understand which hours are the best to drive.

The first step is to get our data loaded. The data is contained in a Github repository . You can clone the repo to get started or just directly read in the csv file. We will be just read in the data file directly. If you want to see the whole script in one place, try LyftData.R.

# Load libraries ----
library(lubridate)
library(sqldf)
library(skimr)

stringsAsFactors = FALSE)

# Examine Data ----
dim(lyft)
str(lyft)

# My new favorite function
skim(lyft)

After getting the data loaded there is some cleaning that is needed. After cleaning the data we will be looking at which hours are the best to drive. It will be useful to create a variable to indicate which rides are part of the same driving session. This will allow us to have a nice level to analyze the ride data.

# Calculate StartTime and EndTime ----
lyft$StartTime <- as.POSIXct( paste( strptime(lyft$Date,format = "%m/%d/%Y"),
format(strptime(lyft$Time, "%I:%M %p"), format="%H:%M:%S") ),format="%Y-%m-%d %H:%M:%OS") lyft$EndTime <-lyft$StartTime+(lyft$Time_Min*60+lyft$Time_Sec) ################################################################################ # Calculate a driving session ---- # If more than 4 hours since last ride a new session # is assumed to have started ################################################################################ lyft$rideSession <- 0
for(i in 1:nrow(lyft) ){
#First Row - no comparison
if(i==1) {lyft$rideSession[1] <- 1 } else if(i== nrow(lyft) ){ #Last Row lyft$rideSession[i] <- lyft$rideSession[i-1] } else { timedifference <- as.numeric(difftime( lyft[i+1,]$StartTime, lyft[i,]$EndTime,units = "mins")) if(timedifference <= (60*4) ) { lyft$rideSession[i] <-lyft$rideSession[i-1] } else { lyft$rideSession[i] <- max(lyft$rideSession)+1 } } } lyft$TotalMoney <- lyft$Amount + lyft$Tip
lyft$starthour <- hour(lyft$StartTime)

# Night time driving 5pm to 3AM
night <- lyft[lyft$starthour %in% c(17:23,(0:3)) , ] After we have the data loaded and cleaned, we can now attempt visualization. We want to see the variability of money earned by hour. A boxplot will work nicely for our purpose. # Used for formatting the plot HourLabels <- data.frame(hour = 0L:23L, label = c(paste0(c(12,1:11),"AM") , paste0(c(12,1:11),"PM")) ) #Distribution of Money Made by start hour ---- plotData <- night myLabels <- HourLabels[HourLabels$hour %in% unique(plotData$starthour),]$label
myColors <- c("lightblue","wheat","lightgreen","gray")
totalrides <- nrow(night)

# Make Boxplot To Show Money Made by Hour ----
png(filename = "MoneyByHour.png")
with(
plotData,
boxplot(
TotalMoney ~ starthour ,
col= myColors,
las=1,
main=paste0("Distribution of Money Made by Start Hour\n",
"(n = ",totalrides,")") ,
names=myLabels,
yaxt="n"
)
)#End With Statement

axis(side=2,
at = axTicks(2),
labels =paste0("$",axTicks(2)), las=1) dev.off()  We end up with the plot below which shows us the variability by hour. The most profitable hour is 2AM with a median of$11.58.

Our data is still fairly sparse. A quick look with the table command will show this.

# Examine Frequency
table(lyft[c("day","starthour")])

This is due to the lower number of rides given. As this dataset grows we will be able to perform more interesting analysis, such as predicting the estimated earning for a given hour and day. Stay tuned for Part II where we look to answer some more questions and begin the modeling process.

Filed under Uncategorized

## Making a code book in sql server - Part 2

We have already covered how to add a data dictionary to a sql server table by using extended properties (see here). We can go one step further and making a simple stored procedure that will give us quick access to the code book. After building the stored procedure below we can now double click a table and use a keyboard shortcut. I have mine shortcut as Ctrl+5.

Create Procedure [dbo].[GetDataDictionary]

@TableName varchar(200)

as
Begin
SELECT
[major_id],
[minor_id],
t.name AS [Table Name],
c.name AS [Column Name],
[value] AS [Extended Property],
infos.[Data_Type],
infos.[is_nullable],
infos.[Numeric_Precision],
infos.[Numeric_Scale]
FROM
sys.extended_properties AS ep
inner join
sys.tables AS t
ON ep.major_id = t.object_id
inner join
sys.columns AS c
ON ep.major_id = c.object_id
and ep.minor_id = c.column_id
inner join
INFORMATION_SCHEMA.COLUMNS infos
on infos.table_name = t.name
and infos.column_name = c.name
Where
class = 1
and t.name =@TableName /* This is our Table name */

End


You can see what my shortcuts look like in the picture below.

If we follow the example in the article for building a code book, we can create a table with the iris data set and add a quick code book. Then when we highlight the table [iris] and use our keyboard shortcut of Ctrl+5, you should see this:

Now we can quickly check the definition of a column without having to leave our SSMS window. If you are interested in the other helper stored procedures besides GetDataDictionary, then take a look at this post.

Filed under Productivity, Sql Server

## Quick Bar Graph in Sql Server

Have you ever needed a bar graph and didn't want to leave Sql Server? Well if you don't have Sql Server 2016 yet I have some code for you. The code below creates a table with the numbers 1-10. We then sample repeatedly from that table and make a histogram of our sample. We should have something approaching a uniform distribution when we are done.

If Object_id('tempdb..#Test','U') is not null
Drop Table #Test
Create Table #Test ( numbers float )

If Object_id('tempdb..#RandomNumbers','U') is not null
Drop Table #RandomNumbers
Create Table #RandomNumbers ( numbers float )

-- Create Table with numbers 1 -10
Declare @Counter int
Set @Counter=1
While @Counter <=10
Begin
Insert Into #Test Select @counter
Set @Counter=@Counter+1
End

-- Repeated Sample with replacement from table
Declare @Counter2 int
Set @Counter2=1
While @Counter2 <=1000
Begin
Insert Into #RandomNumbers Select Top 1 Numbers From #test Order By newid()
Set @Counter2=@Counter2+1
End

-- Make histogram
Select
numbers,
replicate('>',count(*)) [Levels]
From
#RandomNumbers
Group By
numbers
Order By
count(*) desc

You should now have a nice little histogram showing you the distribution of your data. Another useful thing is to approach it like a pareto chart and order the data descending.

Select
numbers,
replicate('>',count(*))[Levels]
From
#RandomNumbers
Group By
numbers
Order By
count(*) desc

Filed under Sql Server

## Vectorize a Function

I was recently working and decided to write a function to assist in the process. It assigns a label to a number based upon the value. My first attempt worked, but only for one value at a time.

KMO_adequacy2 <- function(x){
if ( x > .90 ) {result <- "Marvelous"}
if (x <=.90 & x >.80 ) { result <-"Meritorious" }
if (x <= .8 & x >.70 ) { result <- "Middling" }
if (x <=.7 & x>.60)    { result <-"Mediocre" }
if (x <=.6 & x>.50)    {result <-"Miserable" }
if (x <=.5)            { result <- "Unacceptable" }
return(result)
}


This works.

 KMO_adequacy2(.95)
#[1] "Marvelous"

This does not work.

 KMO_adequacy2(c(.95,.50))
#[1] "Marvelous"
# Warning messages:
#   1: In if (x > 0.9) { :
#       the condition has length > 1 and only the first element will be used
#     2: In if (x <= 0.9 & x > 0.8) { :
#         the condition has length > 1 and only the first element will be used
#       3: In if (x <= 0.8 & x > 0.7) { :
#           the condition has length > 1 and only the first element will be used
#         4: In if (x <= 0.7 & x > 0.6) { :
#             the condition has length > 1 and only the first element will be used
#           5: In if (x <= 0.6 & x > 0.5) { :
#               the condition has length > 1 and only the first element will be used
#             6: In if (x <= 0.5) { :
#                 the condition has length > 1 and only the first element will be used

I should have thought more about the end goal of the function before I started coding, but I didn't. I started searching for good ways to vectorize a function in R. I found there is a function in base R called 'Vectorize'. All I needed was to create a new function using 'Vectorize' and I was done.

KMO_adequacy <- Vectorize(KMO_adequacy2)
#[1] "Marvelous"    "Unacceptable"

This allowed me to easily add a column to my data frame containing individual KMO measures and associate a label. I reached out to my fellow blogger Jeremy and he gave me a quick re-write of my original function. Here is his approach.

# Jeremy's re-write
KMO_J <- function(x){
names(x) <- x
names(x) <- ifelse(x > .90, "Marvelous", names(x))
names(x) <- ifelse(x <=.9 & x >.80, "Meritorious", names(x))
names(x) <- ifelse(x <=.8 & x >.70, "Middling", names(x))
names(x) <- ifelse(x <=.7 & x>.60, "Mediocre", names(x))
names(x) <- ifelse(x <=.6 & x>.50, "Miserable", names(x))
names(x) <- ifelse(x <=.5, "Unacceptable", names(x))
return(names(x))
}

KMO_J(c(.95,.50))
#[1] "Marvelous"    "Unacceptable"



We can do a quick check to make sure that we are getting the same output.

test <- seq(0,1,.1)
all.equal ( KMO_adequacy(test) , KMO_J(test) )
#[1] TRUE

My next question, is there a major performance difference between the two?  I ran a short simulation which is summarized in the plot below and shows that there is not a large difference in performance for the samples tested.

Have a better way to solve this problem? Post it in the comments below. If you are wondering what this KMO thing is all about, it is a measure of sampling accuracy (MSA) for conducting exploratory factor analysis (EFA). The cutoffs and names were taken from:

1. Barbara A. Cerny , Henry F. Kaiser
Multivariate Behavioral Research
Vol. 12, Iss. 1, 1977

Filed under R_local, Statistics

## Tips For Dealing with Large Datasets in Sql Server

Are you dealing with large (100 million row +) datasets that live in a sql database? Have you found your old methods not to be satisfactory? Here are a couple of tips from my own experience.

1. Do not use count(*) to figure out how big of a table you have.
2.  Do not use the max function to figure out when the last record was inserted

## 1.

I found this out the hard way when I actually got an error when I tried to run
Select Count(*) From Table  and received an arithmetic overflow error. I was puzzled at first and after some  searching I found the problem. A 'count' in sql server returns the datatype int, which means it can only be 2^31-1 or 2,147,483,647. The table that I was working with had more than 2.1 billion rows, so that caused a problem. Now, you may be thinking that you could just use a Count_Big instead, but that is probably not the right answer. Try using sp_spaceused instead. If you are interested in turning this into a shortcut for SSMS, look here.

## 2.

Sometimes I need to figure out when the latest record was inserted. Instead of taking a max on a datetime field. I can often get to my answer by using information about an index. Hopefully your table has a unique auto incrementing primary key that can aid you in finding the last record inserted. Make sure you understand the process that builds or alters your table. It could be that the maximum value of your primary key is not related to the most recent records.

Select max(LocalTimeStamp) From Table ( Slow )

vs

Select LocalTimeStamp From Table Where PrimaryKey = ( Select Max(PrimaryKey) From Table ) ( Quick )

Filed under Sql Server

## Conference on Statistical Practice 2015 - Day 1

The first day started off with a great lecture on basic software engineering principles that all Statisticians should know. Paul Teetor, gave the talk "What Can We Learn from Software Engineers?". He covered some basic but very important principles including:

• Coding Standards
• Defensive Programming
• Version Control
• Unit Testing

I appreciated him introducing me to the difference between "programming in the small and programming in the large". Often I find myself writing code "in the small" and in hindsight, that is not great. Paul walked our group through building a basic R package and putting it under version control in less than 20 minutes! So far, this has been my favorite session. The only other thing that is worth mentioning was a conversation that I had with a representative from Wolfram Alpha. This gentleman was explaining the features of their products and how great their software is. I was listening to his pitch when he caught me off guard with an odd statistic about how many lines of code their software had. It went something like this:

Sales_Rep : We have more code than the human genome!

Me: You guys should really write more efficient code

He quickly explained that their code is efficient, but I had to leave shortly after that to avoid laughing out loud. We ended the first day with a poster session and some socializing. Overall, it was a good first day.

Filed under CSP2015

## Checking For Duplicate Records in Sql Server Table

I often find myself needing to check if a table is unique with respect to a particular column. For instance, if I have a table of customer data I want to ensure that a customer only exists once. There are several ways that you can do this.  If I want to write a quick select if usually looks like this.

-- Method 1
Select count(*),Count(Distinct ColumnofInterest)
From Table

A while back I got tired of writing the above query over and over, so I created a stored procedure (SP). The SP below take two arguments, a table name and a column name and will determine if your column is unique. Currently it is written using a  count(*)  , which is not meant for large tables, i.e. > 2,000,000,000 rows .

-- Method 2
USE [YourDataBase]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

Create Procedure [dbo].[DupeChecker]
@Table nvarchar(128),
@Column nvarchar(128)
as

-- Construct SQL Statement Using Name Provided
Declare @MyQuery as nvarchar(max)='
Select 	@DupeFlag =case when count(*) = count(Distinct ' + @Column + ') then 0 else 1 end
From ' + @Table + ' with(nolock)
'

Declare @MyNumber smallint
Set @MyNumber = 0

exec sp_executesql @MyQuery, N'@DupeFlag smallint out', @MyNumber out

-- Raise Error if Table is not uqniue
Declare @ErrorText Varchar(100)
Set @ErrorText = 'Your column ' + @Column + ' on table ' + @Table + ' is not unique.'
If (@MyNumber =1)
Begin
RAISERROR(@ErrorText,16,1)
RETURN
End

-- If unique print a message to confirm
Declare @ErrorText2 Varchar(100)
Set @ErrorText2 = 'Your column ' + @Column + ' on table ' + @Table + ' is unique.'
If (@MyNumber =0)
Begin
Print (@ErrorText2)
End

GO


There is one more method that I commonly use to determine if my table is unique and that is an index.

-- Method 3
Create Unique NonClustered Index [NCI_Dupe_Checker] on TableName
( MyColumn )

So, now you have three different ways to check your table to make sure that your column is unique. Keep in mind that adding an index will take up additional space, where methods 1 and 2 will not.

Filed under Sql Server

## Keeping rows containing particular strings in R

I was recently presented with the need to filter out certain rows in my dataset based upon them containing the desired strings. I needed to retain any row that had a "utm_source" and "utm_medium" and "utm_campaign". Each row in my dataset was a single string. The idea is to parse the strings of interest. My approach was to use grep and check each string for each condition that I needed it to satisfy. I consulted with my co-blogger to see if he had a more intelligent way of approaching this problem. He tackled it with a regular expression using a look-ahead. You can see my 'checker' function below and Jeremy's function 'checker2'. Both seem to perform the required task correctly. So now it is simply a matter of performance.

#Sample Data
)

mydf <- as.data.frame(querystrings)

# This Should return TRUE when all conditions have been satisfied
checker <-function(foo){
grepl(pattern="utm_source", x=foo) &
grepl(pattern="utm_medium", x=foo)&
grepl(pattern="utm_campaign", x=foo)

}

checker2 <- function(foo){
grepl(pattern="^(?=.*utm_source)(?=.*utm_medium)(?=.*utm_campaign).*\$",
x=foo, perl=TRUE)
}
# This is the loop that was run to repeatedly test each function with a much larger dataset
# Yes, I know this is not an efficient way to do this but it is easy to read.
ttime=c()
for( i in 1:100){
tt <- system.time(  tresult <-mydf[checker(mydf[, 1]), ]  )
ttime =rbind(ttime,tt[3])
}

jtime=c()
for( i in 1:100){
jt <- system.time(  jresult<-mydf[checker2(mydf[, 1]), ] )
jtime =rbind(jtime,jt[3])
}

mean(ttime)
mean(jtime)

I am not able to share the full dataset that I was using, due to privacy concerns. The dataset that I tested both functions against had 26,746 rows. The 'checker' function which I wrote took  on average 0.0801 seconds and Jeremy's approach took  0.1488 seconds. I decided to stick with my checker function, but that was not because of speed. I would have happily accepted the increased computation time for mine if the times had been reversed. The reason for this is that I find mine easier to read. This means that there is a chance that I could come back to this code in 6 months and have a clue about what it is suppose to be doing. Regular Expressions can sometimes be quite hard to come back to and say, " oh yeah, I wanted to check if all the characters that occupy prime digits in my string are vowels!". I think that my simplistic grep statement will be easier to change if that becomes needed in the future and so I will stick with the 'checker' approach. Do you have a better way to approach this using R? If so, make sure to post a comment.

Filed under R

## Summary Function that is compatible with xtable

If you like to make nice looking documents using Latex, I highly recommend using the 'xtable' package. In most instances, it works quite well for producing a reasonable looking table from an R object. I however recently wanted a LaTeX table from the 'summary' function in base R. So naturally I tried:

library(xtable)
set.seed(123)
foo = rnorm(10)
summary(foo)
xtable(summary(foo))

Which gave me the following error:

Error in xtable.table(summary(foo)) :
xtable.table is not implemented for tables of > 2 dimensions

So I decided to create a simple function that would return a dataframe which is easy to use with xtable. Here is what I came up with.

summaryfunction= function (x){
if( is.numeric(x)!=TRUE) {stop("Supplied X is not numeric")}
mysummary = data.frame(
"Min." =as.numeric( min(x)),
"1st Qu." = quantile(x)[2],
"Median" = median(x),
"Mean" = mean(x),
"3rd Qu." = quantile(x)[4],
"Max." = max(x),
row.names=""

)
names(mysummary) = c("Min.","1st Qu.","Median","Mean","3rd Qu.","Max.")
return( mysummary )
}


Now, when I try to use xtable I get the following output:

% latex table generated in R 3.1.1 by xtable 1.7-4 package
% Tue Nov 04 21:58:07 2014
\begin{table}[ht]
\centering
\begin{tabular}{rrrrrrr}
\hline
& Min. & 1st Qu. & Median & Mean & 3rd Qu. & Max. \\
\hline
& -1.27 & -0.53 & -0.08 & 0.07 & 0.38 & 1.72 \\
\hline
\end{tabular}
\end{table}

This should lead to an easier way to incorporate more summaries when you are writing your paper, using R and Knitr of course. If you do use knitr, make sure to try the results = 'asis' option with xtable from R.

Filed under R

## Sending Email From R

When I am am working in Sql Server and need to send an email I use "sp_send_dbmail". So when I am working in R, I didn't know how to send an email. I often use this as notification that a process has finished. It also works nicely as a text to your cell phone. I had one additional reason why I wanted to be able to email from R. I wanted to send an email to my evernote account with just a few key strokes. The goal was to accomplish this by writing a simple wrapper function. Below is the solution that I came up with. It works, but there are serious security implications. I offer this merely as a proof of concept. Hopefully someone can show me a better way to handle passing your email password to the R function.

# install.packages("mailR")
library(mailR)
en <- function(s="Message From R",tags="#todo",b=" ",r = c("myevernoteaddress@m.evernote.com")){
sender <- "myemail@mydomain.com"
if(r=="me") r <- "myemail@mydomain.com"
send.mail(
from = sender,
to = r,
subject=paste(s,tags),
body =b,
smtp = list(host.name = "mail.mydomain.com", port = 465, user.name = sender, passwd = password, ssl = TRUE),
authenticate = TRUE,
send = TRUE
)
}

I often have an R terminal open, so when I have a great idea for a research project I can add a note to my evernote account simply. For instance:

en(s="Read up on current imputation methods",tags="#Research")

Then a new entry is added for me in evernote with the tag 'Research'. ( I have noticed that the tagging seems to only work if the tag previously exists in my evernote account.)  Often I have a task that just needs done that I don't want to forget about. I can issue a quick command and then I will have a record of it.