Analyzing Lyft Data - Part I

Reading Time: 3 minutes

Ride sharing is a popular way to get around and much cheaper than some alternatives. How much do the people who choose to drive for these services actually make? We will start this series examining Lyft driving data by getting the data loaded and trying to understand which hours are the best to drive.

The first step is to get our data loaded. The data is contained in a GitHub repository. You can clone the repo to get started or just read the csv file in directly, which is what we will do here. If you want to see the whole script in one place, try LyftData.R.

# Load libraries ----
library(lubridate) 
library(sqldf)
library(skimr)

# Read in Data ----
lyft <- read.csv(file = "https://raw.githubusercontent.com/Spoted21/lyft/master/lyft2.csv",
                 stringsAsFactors = FALSE)


# Examine Data ----
head(lyft)
dim(lyft)
str(lyft)

# My new favorite function
skim(lyft)

After getting the data loaded, there is some cleaning to do. Once the data is clean, we will look at which hours are the best to drive. It will be useful to create a variable indicating which rides are part of the same driving session; this gives us a convenient level at which to analyze the ride data.

# Calculate StartTime and EndTime ----
lyft$StartTime <- as.POSIXct(
  paste(
    strptime(lyft$Date,format = "%m/%d/%Y"),
    format(strptime(lyft$Time, "%I:%M %p"), format="%H:%M:%S")
  ),format="%Y-%m-%d %H:%M:%OS")

lyft$EndTime <- lyft$StartTime + (lyft$Time_Min*60 + lyft$Time_Sec)

################################################################################
# Calculate a driving session ----
# If more than 4 hours since last ride a new session 
# is assumed to have started
################################################################################


lyft$rideSession <- 1
for(i in 2:nrow(lyft)){
  # Minutes between the start of this ride and the end of the previous ride
  timedifference <- as.numeric(difftime(lyft$StartTime[i], lyft$EndTime[i-1], units = "mins"))
  if(timedifference <= 60*4){
    # Same session as the previous ride
    lyft$rideSession[i] <- lyft$rideSession[i-1]
  } else {
    # More than 4 hours since the last ride, so a new session starts
    lyft$rideSession[i] <- lyft$rideSession[i-1] + 1
  }
}

lyft$TotalMoney <- lyft$Amount + lyft$Tip
lyft$starthour <- hour(lyft$StartTime)

# Night time driving 5pm to 3AM
night <- lyft[lyft$starthour %in% c(17:23,(0:3)) , ]
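
Since sqldf is loaded above but not otherwise used here, a quick session-level summary is one way to sanity-check the new rideSession variable. This is only a sketch built from the columns created in the script; the original analysis may summarize sessions differently.

# Sketch: rides and money earned per driving session
sqldf("Select rideSession, count(*) as rides, sum(Amount + Tip) as totalMoney
       From lyft
       Group By rideSession")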

With the data loaded and cleaned, we can attempt some visualization. We want to see the variability of money earned by hour, and a boxplot will work nicely for our purpose.

# Used for formatting the plot
HourLabels <- data.frame(hour = 0L:23L, label =
                           c(paste0(c(12,1:11),"AM") ,
                             paste0(c(12,1:11),"PM"))
)

#Distribution of Money Made by start hour ----
plotData <- night
myLabels <- HourLabels[HourLabels$hour %in% unique(plotData$starthour),]$label 
myColors <- c("lightblue","wheat","lightgreen","gray")
totalrides <- nrow(night)

# Make Boxplot To Show Money Made by Hour ----
png(filename = "MoneyByHour.png")
with(
  plotData,
  boxplot(
    TotalMoney ~ starthour ,
    col= myColors,
    las=1,
    main=paste0("Distribution of Money Made by Start Hour\n",
                "(n = ",totalrides,")") ,
    names=myLabels,
    yaxt="n"
  )
)#End With Statement

# Add formatted axis 
axis(side=2,
     at = axTicks(2),
     labels =paste0("$ ",axTicks(2)),
     las=1)
dev.off()

 

We end up with the plot below, which shows us the variability by hour. The most profitable hour is 2AM, with a median of $11.58.
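
If you want the numbers behind that claim, a quick aggregate over the night-time subset (a sketch using the columns created earlier) pulls the median earnings for each start hour:

# Median money made per ride, by start hour, sorted from best to worst
medByHour <- aggregate(TotalMoney ~ starthour, data = night, FUN = median)
medByHour[order(-medByHour$TotalMoney), ]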

Our data is still fairly sparse. A quick look with the table command will show this.

# Examine Frequency
table(lyft[c("day","starthour")])

This is due to the low number of rides given so far. As this dataset grows, we will be able to perform more interesting analyses, such as predicting the estimated earnings for a given hour and day. Stay tuned for Part II, where we look to answer some more questions and begin the modeling process.

 

 

 


Filed under Uncategorized

Making a code book in sql server - Part 2

Reading Time: 2 minutes

We have already covered how to add a data dictionary to a sql server table by using extended properties (see here). We can go one step further and make a simple stored procedure that gives us quick access to the code book. After building the stored procedure below, we can double-click a table name and use a keyboard shortcut; I have mine set to Ctrl+5.

Create Procedure [dbo].[GetDataDictionary]

@TableName varchar(200)

as
Begin
SELECT 
	[major_id], 
	[minor_id], 
	t.name AS [Table Name], 
	c.name AS [Column Name], 
	[value] AS [Extended Property],
	infos.[Data_Type],
	infos.[is_nullable],
	infos.[Numeric_Precision],
	infos.[Numeric_Scale]
FROM 
	sys.extended_properties AS ep
	inner join
	sys.tables AS t 
	ON ep.major_id = t.object_id 
	inner join
	sys.columns AS c 
	ON ep.major_id = c.object_id 
	  and ep.minor_id = c.column_id
	inner join 
	INFORMATION_SCHEMA.COLUMNS infos
	on infos.table_name = t.name 
	  and infos.column_name = c.name
Where 
	class = 1
	and t.name =@TableName /* This is our Table name */

End

You can see what my shortcuts look like in the picture below.

If we follow the example in the article for building a code book, we can create a table with the iris data set and add a quick code book. Then, when we highlight the table name [iris] and use our keyboard shortcut of Ctrl+5, the full code book for the table appears.
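
As a rough sketch (the property and column names here are only examples; the earlier article covers the details), adding a definition and then pulling the code book looks something like this:

-- Add a definition for one column of the iris table (example property and column names)
Exec sys.sp_addextendedproperty
    @name = N'MS_Description',
    @value = N'Sepal length in centimeters',
    @level0type = N'SCHEMA', @level0name = N'dbo',
    @level1type = N'TABLE',  @level1name = N'iris',
    @level2type = N'COLUMN', @level2name = N'Sepal_Length'

-- What the Ctrl+5 shortcut effectively runs against the highlighted table name
Exec dbo.GetDataDictionary @TableName = 'iris'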

Now we can quickly check the definition of a column without having to leave our SSMS window. If you are interested in the other helper stored procedures besides GetDataDictionary, then take a look at this post.


Filed under Productivity, Sql Server

Correctly Reporting P-Values in Summary Tables Reported with xtable

Reading Time: 3 minutes

Often when writing a manuscript using knitr and xtable, I am flustered by my p-values. In simple summary tables, R conveniently rounds my p-values to 0: a mathematically inappropriate result. A colleague recently commented on the poor reporting in my table (shown below using print.xtable with the type="html" argument), inspiring a much-needed change.

Estimate Std.err Wald Pr(>|W|)
(Intercept) 0.001704 0.000005 100409.770956 0.000000
sizemedium 0.000046 0.000005 90.534705 0.000000
sizesmall 0.000003 0.000005 0.294331 0.587458
time -0.000004 0.000001 11.614917 0.000654

 

The fix is actually fairly straightforward and can be summarized in a simple function, "fixp", with the code shown below:

fixp <- function(x, dig=3){
  x <- as.data.frame(x)
  
  if(substr(names(x)[ncol(x)],1,2) != "Pr")
    warning("The name of the last column didn't start with Pr. This may indicate that the p-values aren't in the last column, and thus that this function is inappropriate.")
  x[,ncol(x)] <- round(x[,ncol(x)], dig)
  for(i in 1:nrow(x)){
    if(x[i,ncol(x)] == 0)
      x[i,ncol(x)] <- paste0("< .", paste0(rep(0,dig-1), collapse=""), "1")
  }
  
  x
}

All that's going on: the function pulls in the summary table (usually through a $coef), tries to turn it into a data frame (some already are, though some tables are numeric matrices, e.g. from lm), throws a warning if the last heading doesn't begin with "Pr" (since it may not be the column that contains p-values), and edits any values that were rounded to 0 (at the user-specified rounding point) to read as less than the smallest number that could be rounded to (e.g. < .01). Then we output the edited table, all ready for reporting! To mimic what was above, we set dig equal to 6 (so the p-values go out 6 decimal places) and re-run:

Estimate Std.err Wald Pr(>|W|)
(Intercept) 0.001704 0.000005 100409.770956 < .000001
sizemedium 0.000046 0.000005 90.534705 < .000001
sizesmall 0.000003 0.000005 0.294331 0.587458
time -0.000004 0.000001 11.614917 0.000654

 

Much better! Also, to report a two digit p-value (for some writing styles), we simply set dig = 2:

Estimate Std.err Wald Pr(>|W|)
(Intercept) 0.001704 0.000005 100409.770956 < .01
sizemedium 0.000046 0.000005 90.534705 < .01
sizesmall 0.000003 0.000005 0.294331 0.59
time -0.000004 0.000001 11.614917 < .01

 

By design, the p-values can be manipulated independent of the estimates. This allows reporting of the estimated coefficients in meaningful units (in the above example, very small units), while reporting the p-values on a scale that many writing styles request.

Want to try this yourself? Here's an example that you can try with just a built in dataset in R:

#this gives a summary table with a small p-value. Trying to report this with xtable would result in an R rounding issue!
(mod <- coef(summary(lm(uptake ~ conc + Treatment + Type + Plant, data=CO2))))

#this fixes the p-value to 2 digits, correctly reporting p-values that would have been rounded to 0
fixp(mod,dig=2)
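
The xtable step itself might look something like the following; this is only a sketch, and the exact xtable/print options used for the tables in this post may differ.

#a sketch of the reporting step (argument choices here are illustrative)
library(xtable)
print(xtable(fixp(mod, dig = 2), digits = 2), type = "html")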

Here's the final output via print.xtable (dig=2 for fixp and xtable):

Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.42 4.67 8.00 < .01
conc 0.02 0.00 7.96 < .01
Treatmentchilled -12.50 5.10 -2.45 0.02
TypeMississippi -23.33 6.01 -3.88 < .01
Plant.L 21.58 11.14 1.94 0.06
Plant.Q -4.62 2.27 -2.03 0.05
Plant.C 1.46 5.10 0.29 0.78
Plant^4 2.34 2.27 1.03 0.31
Plant^5 -0.48 5.77 -0.08 0.93
Plant^6 -0.04 2.27 -0.02 0.99
Plant^7 -1.91 3.64 -0.53 0.6
Plant^8 -3.28 2.27 -1.44 0.15
Plant^10 0.55 2.27 0.24 0.81

 

Limitations (ish):

  1. Again, this assumes that the last column is the one to be transformed. This is by design, though may be inconvenient in some situations. If needed, the change is easily made through the definition of the function.
  2. When the last column is manipulated, it becomes a character column in the dataframe. Alternatively, when it is rounded but no entry rounds to 0, it is numeric.
  3. This assumes a dataframe-style format of your table. Thus, this method will NOT be effective at correcting reported p-values for an individual test: say a t-test, where only the statistic is reported (and not a table). Personally this is not a concern, as I deal with these situations in other ways, but for some users seeking an overall "p-value fixing" method this may not be the answer.

As with other functions I write posts on, this function is available in my package (creatively named "myStuff") via GitHub. If you'd like to play with the most current version of the function, I'd encourage you to check it out here. Alternatively, to have access to other fun functions, install the package directly from GitHub with the code below (requires devtools):

devtools::install_github("flor3652/myStuff")


Filed under R

Quick Bar Graph in Sql Server

Reading Time: < 1 minute

Have you ever needed a bar graph and didn't want to leave Sql Server? Well, if you don't have Sql Server 2016 yet, I have some code for you. The code below creates a table with the numbers 1-10, then samples repeatedly from that table and makes a histogram of the sample. We should end up with something approaching a uniform distribution when we are done.

 

If Object_id('tempdb..#Test','U') is not null 
Drop Table #Test
Create Table #Test ( numbers float )

If Object_id('tempdb..#RandomNumbers','U') is not null 
Drop Table #RandomNumbers
Create Table #RandomNumbers ( numbers float )


-- Create Table with numbers 1 -10
Declare @Counter int
Set @Counter=1
While @Counter <=10
Begin
    Insert Into #Test Select @counter
    Set @Counter=@Counter+1
End

-- Repeated Sample with replacement from table
Declare @Counter2 int
Set @Counter2=1
While @Counter2 <=1000
Begin
    Insert Into #RandomNumbers Select Top 1 Numbers From #test Order By newid()
    Set @Counter2=@Counter2+1
End

-- Make histogram
Select
    numbers,
    replicate('>',count(*)) [Levels]
From
    #RandomNumbers
Group By
    numbers
Order By
    numbers

You should now have a nice little histogram showing you the distribution of your data. Another useful approach is to treat it like a Pareto chart and order the data by count, descending.

Select
    numbers,
    replicate('>',count(*))[Levels]
From
    #RandomNumbers
Group By
    numbers
Order By
    count(*) desc

[Figure: sqlhist - the text histogram produced by the query above]


Filed under Sql Server

Vectorize a Function

Reading Time: 3 minutes

I was recently working on a project and decided to write a function to assist in the process. It assigns a label to a number based on its value. My first attempt worked, but only for one value at a time.

KMO_adequacy2 <- function(x){
  if ( x > .90 ) {result <- "Marvelous"}
  if (x <=.90 & x >.80 ) { result <-"Meritorious" }
  if (x <= .8 & x >.70 ) { result <- "Middling" }
  if (x <=.7 & x>.60)    { result <-"Mediocre" }
  if (x <=.6 & x>.50)    {result <-"Miserable" }
  if (x <=.5)            { result <- "Unacceptable" }
  return(result)
}

This works.

 KMO_adequacy2(.95)
#[1] "Marvelous"

This does not work.

KMO_adequacy2(c(.95,.50))
#[1] "Marvelous"
# Warning messages:
# 1: In if (x > 0.9) { :
#   the condition has length > 1 and only the first element will be used
# 2: In if (x <= 0.9 & x > 0.8) { :
#   the condition has length > 1 and only the first element will be used
# 3: In if (x <= 0.8 & x > 0.7) { :
#   the condition has length > 1 and only the first element will be used
# 4: In if (x <= 0.7 & x > 0.6) { :
#   the condition has length > 1 and only the first element will be used
# 5: In if (x <= 0.6 & x > 0.5) { :
#   the condition has length > 1 and only the first element will be used
# 6: In if (x <= 0.5) { :
#   the condition has length > 1 and only the first element will be used

I should have thought more about the end goal of the function before I started coding, but I didn't. I started searching for good ways to vectorize a function in R. I found there is a function in base R called 'Vectorize'. All I needed was to create a new function using 'Vectorize' and I was done.

KMO_adequacy <- Vectorize(KMO_adequacy2) 
KMO_adequacy(c(.95,.50))
#[1] "Marvelous"    "Unacceptable"

This allowed me to easily add a column to my data frame containing individual KMO measures and associate a label. I reached out to my fellow blogger Jeremy and he gave me a quick re-write of my original function. Here is his approach.

# Jeremy's re-write
KMO_J <- function(x){
   names(x) <- x
   names(x) <- ifelse(x > .90, "Marvelous", names(x))
   names(x) <- ifelse(x <=.9 & x >.80, "Meritorious", names(x))
   names(x) <- ifelse(x <=.8 & x >.70, "Middling", names(x))
   names(x) <- ifelse(x <=.7 & x>.60, "Mediocre", names(x))
   names(x) <- ifelse(x <=.6 & x>.50, "Miserable", names(x))
   names(x) <- ifelse(x <=.5, "Unacceptable", names(x))
   return(names(x))
} 

KMO_J(c(.95,.50))
#[1] "Marvelous"    "Unacceptable"

We can do a quick check to make sure that we are getting the same output.

test <- seq(0,1,.1)
all.equal ( KMO_adequacy(test) , KMO_J(test) )
#[1] TRUE
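
As an aside, the same labeling can also be done without Vectorize or ifelse by using base R's cut(). This sketch is not part of the comparison below; it is just another option.

# Alternative sketch using cut(); right-closed intervals match the cutoffs above
KMO_cut <- function(x){
  as.character(cut(x,
                   breaks = c(-Inf, .5, .6, .7, .8, .9, Inf),
                   labels = c("Unacceptable","Miserable","Mediocre",
                              "Middling","Meritorious","Marvelous")))
}
KMO_cut(c(.95,.50))
#[1] "Marvelous"    "Unacceptable"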

My next question: is there a major performance difference between the two? I ran a short simulation, summarized in the plot below, which shows that there is not a large difference in performance for the samples tested.

[Figure: Rplot - timing comparison of KMO_adequacy and KMO_J]

Have a better way to solve this problem? Post it in the comments below. If you are wondering what this KMO thing is all about, it is a measure of sampling adequacy (MSA) used when conducting exploratory factor analysis (EFA). The cutoffs and names were taken from:

  1. Barbara A. Cerny , Henry F. Kaiser
    Multivariate Behavioral Research
    Vol. 12, Iss. 1, 1977


Filed under R_local, Statistics

Tips For Dealing with Large Datasets in Sql Server

Reading Time: 2 minutes

Are you dealing with large (100 million row+) datasets that live in a sql database? Have you found your old methods to be unsatisfactory? Here are a couple of tips from my own experience.

  1. Do not use count(*) to figure out how big a table is.
  2. Do not use the max function to figure out when the last record was inserted.

1.

I found this out the hard way when I ran Select Count(*) From Table and received an arithmetic overflow error. I was puzzled at first, but after some searching I found the problem. A count in sql server returns the datatype int, which means it can only go up to 2^31-1, or 2,147,483,647. The table that I was working with had more than 2.1 billion rows, so that caused a problem. Now, you may be thinking that you could just use Count_Big instead, but that is probably not the right answer. Try using sp_spaceused instead. If you are interested in turning this into a shortcut for SSMS, look here.
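
For example, with a placeholder table name:

-- Returns the row count and space usage from metadata, without scanning the table
Exec sp_spaceused 'dbo.MyLargeTable'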

2.

Sometimes I need to figure out when the latest record was inserted. Instead of taking a max on a datetime field, I can often get to my answer by using information about an index. Hopefully your table has a unique, auto-incrementing primary key that can aid you in finding the last record inserted. Make sure you understand the process that builds or alters your table: it could be that the maximum value of your primary key is not related to the most recent records.

-- Slow
Select max(LocalTimeStamp) From Table

-- Quick
Select LocalTimeStamp From Table Where PrimaryKey = ( Select Max(PrimaryKey) From Table )

Have some useful tips to add? Please post them in the comments.


Filed under Sql Server

The Keep Function

Reading Time: 2 minutes

Occasionally when I am jotting some code I find myself creating several temporary variables with the intention of later getting rid of them. These variables involve quick names that are defined in a local scope and get quite confusing out of their context. If the project expands, however, I find myself with three options for playing with the next level of the project: remember what I've already used and try to avoid it, rename all of the temporary variables (e.g. rewrite the code of the base level of the project), or wipe the variables for use later. This decision is usually made based on how the project is going. If the project is going well, I'll go back and dutifully rewrite the initial code to track variables in a more unique way. If the project is still a bit shaky, I will clear the variable names that I tend to use as temporary variables and keep exploring. Remembering variable names never turns out well for me; I inevitably forget that a variable was defined in a previous section, use it thinking I had redefined it (when I didn't), and wonder at the strange results I get.

Clearing all variable names can be a task, though. The problem is that, in order to move on to the next stage, I actually wish to keep a few of the variables and get rid of all of the rest. After playing with a few ideas (removing the unwanted variables one at a time, writing out the variables that I wanted to keep, etc.), I decided on writing a keep function. The keep function does just what it says: given a list of variables as arguments, it keeps those variables and removes the rest. For an example, consider the vectors "x" and "y", which are combined to give a matrix "initMod". Using keep(initMod) keeps the matrix and eliminates all other objects in the global environment (including "x" and "y"), allowing me to reuse the variable names "x" and "y" as temporary variables again (say, for "modifiedMod").
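
A minimal sketch of the idea looks like the following; the packaged version on GitHub handles more cases, so treat this only as an illustration.

# Minimal sketch: keep only the named objects (plus keep itself) in the global environment
keep <- function(...){
  keepNames <- c(sapply(as.list(substitute(list(...)))[-1], deparse), "keep")
  rm(list = setdiff(ls(envir = .GlobalEnv), keepNames), envir = .GlobalEnv)
}

x <- 1:10
y <- rnorm(10)
initMod <- cbind(x, y)
keep(initMod)
ls()
#[1] "initMod" "keep"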

Code for keep can be found on GitHub, here. Note that the function will delete itself if it is defined in the global environment, so add a segment that always keeps keep itself (as the sketch above does). You could also simply grab the containing personal package (myStuff) off of GitHub using the code below.

if(!require(devtools)) install.packages("devtools", dependencies=TRUE)
devtools::install_github("flor3652/myStuff")
library(myStuff)

Note that when using the function from the package you don't have to worry about it being deleted as it isn't in the global environment.


Filed under R

Text bashing in R for SQL

Reading Time: < 1 minute

Fairly often, a coworker who is strong in Excel but weak in writing code will come to me for help getting special details about customers in their datasets. Sometimes the reason is to call, email, or snail-mail a survey; other times it is to do some classification grouping on the customers. Whatever the reason, the coworker has a list of ID numbers and needs help getting something out of a SQL database.

When it isn't as simple as just adding quotes and commas to the cells in Excel before copying all the ID's into the WHERE clause of a very basic SELECT statement, I often fall back to R and let it do the work of putting together the SELECT statement and querying the data.

Suppose that you're given an Excel file with 1.2 million ID's and there's some transformation that you need to do first. Obviously, you first read the file in using your protocol and package of choice. Since we're ultimately doing SQL, let's take advantage of the RODBC package's cool features.

library(RODBC)

xlsxFile <- file.path("C:", "MyFile.xlsx")
xlsxChan <- odbcConnectExcel2007(xlsxFile)
xlsxSheets <- sqlTables(xlsxChan)
View(xlsxSheets)

# From here you can choose the table (worksheet) you need to pull the data from.

xlsxData <- sqlFetch(xlsxChan, 'Sheet1$')
odbcClose(xlsxChan)

Now that the data have been pulled into R any manipulations that need to be made can be done until you end up with a list of ID's that you need to query the SQL database for. Let's start by writing the basis of the query.

myQuery <- "SELECT ID, Name, Street, City, State, Zip, Phone, Email FROM CustomerTable WHERE ID in (%s)"

Notice the use of the %s placeholder. We're ultimately going to be dropping this into sprintf() with the IDs so that we don't have to clutter up the script with a stupidly-long list of hard-coded values. Next we need to make R build that stupidly-long list of values to put into the query.

id <- as.character(xlsxData$ID)
id <- sprintf("'%s'", unique(id))
id <- toString(id)

# Now we can put the two together and send the whole query off.
myQuery <- sprintf(myQuery, id)

# Here do what is appropriate for your database flavor. I'm pretending to use MySQL on Windows.
dbChan <- odbcConnect("CustDb", uid="jeremy", case="tolower")
dbData <- sqlQuery(dbChan, myQuery, stringsAsFactors=FALSE)
odbcCloseAll()

What you should end up with in dbData is the list of things that you are after and a happy colleague.


Filed under R, Sql Server

Conference on Statistical Practice 2015 - Day 1

Reading Time: < 1 minute

The first day started off with a great lecture on basic software engineering principles that all statisticians should know. Paul Teetor gave the talk "What Can We Learn from Software Engineers?" He covered some basic but very important principles, including:

  • Coding Standards
  • Defensive Programming
  • Version Control
  • Unit Testing

I appreciated him introducing me to the difference between "programming in the small" and "programming in the large". Often I find myself writing code "in the small", and in hindsight, that is not great. Paul walked our group through building a basic R package and putting it under version control in less than 20 minutes! So far, this has been my favorite session.

The only other thing worth mentioning was a conversation I had with a representative from Wolfram Alpha. This gentleman was explaining the features of their products and how great their software is. I was listening to his pitch when he caught me off guard with an odd statistic about how many lines of code their software has. It went something like this:

Sales_Rep : We have more code than the human genome!

Me: You guys should really write more efficient code

He quickly explained that their code is efficient, but I had to leave shortly after that to avoid laughing out loud. We ended the first day with a poster session and some socializing. Overall, it was a good first day.


Filed under CSP2015

Checking For Duplicate Records in Sql Server Table

Reading Time: 2 minutes

I often find myself needing to check whether a table is unique with respect to a particular column. For instance, if I have a table of customer data, I want to ensure that each customer only exists once. There are several ways that you can do this. If I want to write a quick select, it usually looks like this.

-- Method 1
Select count(*),Count(Distinct ColumnofInterest)
From Table

A while back I got tired of writing the above query over and over, so I created a stored procedure (SP). The SP below takes two arguments, a table name and a column name, and will determine if your column is unique. Currently it is written using count(*), which is not meant for very large tables, i.e. more than 2,000,000,000 rows.

-- Method 2
USE [YourDataBase]
GO


SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO



Create Procedure [dbo].[DupeChecker]
@Table nvarchar(128),
@Column nvarchar(128)
as



-- Construct SQL Statement Using Name Provided
Declare @MyQuery as nvarchar(max)='
Select 	@DupeFlag =case when count(*) = count(Distinct ' + @Column + ') then 0 else 1 end 
From ' + @Table + ' with(nolock)
'


Declare @MyNumber smallint
Set @MyNumber = 0

exec sp_executesql @MyQuery, N'@DupeFlag smallint out', @MyNumber out



-- Raise Error if Table is not unique
Declare @ErrorText Varchar(100)
Set @ErrorText = 'Your column ' + @Column + ' on table ' + @Table + ' is not unique.'
If (@MyNumber =1)
	Begin
		RAISERROR(@ErrorText,16,1)
	  RETURN
	End

-- If unique print a message to confirm
Declare @ErrorText2 Varchar(100)
Set @ErrorText2 = 'Your column ' + @Column + ' on table ' + @Table + ' is unique.'
If (@MyNumber =0)
	Begin
	Print (@ErrorText2)
	End
		
GO
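
Usage is straightforward; the table and column names below are only placeholders:

-- Raises an error if CustomerId repeats in dbo.Customers, prints a confirmation otherwise
Exec dbo.DupeChecker @Table = 'dbo.Customers', @Column = 'CustomerId'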

There is one more method that I commonly use to determine if my table is unique, and that is a unique index.

-- Method 3
Create Unique NonClustered Index [NCI_Dupe_Checker] on TableName
( MyColumn )

So, now you have three different ways to check your table to make sure that your column is unique. Keep in mind that adding an index will take up additional space, whereas methods 1 and 2 will not.


Filed under Sql Server