Monthly Archives: April 2018

Analyzing Lyft Data - Part I

Ride sharing is a popular way to get around and is often much cheaper than some alternatives. But how much do the people who drive for ride-sharing services actually make? We will start this series on Lyft driving data by loading the data and trying to understand which hours are the best to drive.

The first step is to get our data loaded. The data is contained in a GitHub repository. You can clone the repo to get started or read in the csv file directly. We will read in the data file directly. If you want to see the whole script in one place, try LyftData.R.

# Load libraries ----
library(lubridate) 
library(sqldf)
library(skimr)

# Read in Data ----
lyft <- read.csv(file = "https://raw.githubusercontent.com/Spoted21/lyft/master/lyft2.csv",
                 stringsAsFactors = FALSE)


# Examine Data ----
head(lyft)
dim(lyft)
str(lyft)

# My new favorite function
skim(lyft)

After loading the data, some cleaning is needed before we can look at which hours are the best to drive. It will also be useful to create a variable indicating which rides are part of the same driving session, which gives us a convenient level at which to analyze the ride data.

# Calculate StartTime and EndTime ----
# Date is month/day/year and Time is a 12-hour clock time with AM/PM
lyft$StartTime <- as.POSIXct(
  paste(lyft$Date, lyft$Time),
  format = "%m/%d/%Y %I:%M %p")

# Ride duration is stored as minutes plus seconds; POSIXct arithmetic is in seconds
lyft$EndTime <- lyft$StartTime + (lyft$Time_Min * 60 + lyft$Time_Sec)

################################################################################
# Calculate a driving session ----
# If more than 4 hours since last ride a new session 
# is assumed to have started
################################################################################


lyft$rideSession <- 0
for (i in seq_len(nrow(lyft))) {
  if (i == 1) {
    # First row: nothing to compare against
    lyft$rideSession[i] <- 1
  } else {
    # Minutes between the end of the previous ride and the start of this one
    timedifference <- as.numeric(difftime(lyft$StartTime[i], lyft$EndTime[i - 1], units = "mins"))
    if (timedifference <= (60 * 4)) {
      # Same session as the previous ride
      lyft$rideSession[i] <- lyft$rideSession[i - 1]
    } else {
      # More than 4 hours since the last ride: start a new session
      lyft$rideSession[i] <- lyft$rideSession[i - 1] + 1
    }
  }
}

lyft$TotalMoney <- lyft$Amount + lyft$Tip
lyft$starthour <- hour(lyft$StartTime)

# Night time driving 5pm to 3AM
night <- lyft[lyft$starthour %in% c(17:23,(0:3)) , ]
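The four-hour rule described in the comments above can also be written without a loop. The following is only a sketch on made-up data, assuming the rides are already sorted chronologically and reading the rule as "a new session starts when a ride begins more than four hours after the previous ride ended". The `rides` data frame and its values are hypothetical and are not taken from the Lyft file.

```r
# Toy data (hypothetical values, not taken from the Lyft file)
rides <- data.frame(
  StartTime = as.POSIXct(c("2018-04-01 18:00:00", "2018-04-01 18:30:00",
                           "2018-04-02 01:00:00", "2018-04-02 19:00:00"),
                         tz = "UTC"),
  EndTime   = as.POSIXct(c("2018-04-01 18:20:00", "2018-04-01 18:45:00",
                           "2018-04-02 01:15:00", "2018-04-02 19:10:00"),
                         tz = "UTC")
)

# Minutes between the end of the previous ride and the start of the current one
gap_mins <- as.numeric(difftime(rides$StartTime[-1],
                                rides$EndTime[-nrow(rides)],
                                units = "mins"))

# A gap of more than 4 hours starts a new session; the first ride opens session 1
rides$rideSession <- cumsum(c(TRUE, gap_mins > 60 * 4))
rides$rideSession
# 1 1 2 3
```

Each `TRUE` in the logical vector marks the first ride of a session, so the running sum gives the session number for every ride.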

With the data loaded and cleaned, we can move on to visualization. We want to see how the money earned per ride varies by start hour. A boxplot will work nicely for this purpose.

# Used for formatting the plot
HourLabels <- data.frame(hour = 0L:23L, label =
                           c(paste0(c(12,1:11),"AM") ,
                             paste0(c(12,1:11),"PM"))
)

#Distribution of Money Made by start hour ----
plotData <- night
myLabels <- HourLabels[HourLabels$hour %in% unique(plotData$starthour),]$label 
myColors <- c("lightblue","wheat","lightgreen","gray")
totalrides <- nrow(night)

# Make Boxplot To Show Money Made by Hour ----
png(filename = "MoneyByHour.png")
with(
  plotData,
  boxplot(
    TotalMoney ~ starthour ,
    col= myColors,
    las=1,
    main=paste0("Distribution of Money Made by Start Hour\n",
                "(n = ",totalrides,")") ,
    names=myLabels,
    yaxt="n"
  )
)#End With Statement

# Add formatted axis 
axis(side=2,
     at = axTicks(2),
     labels =paste0("$ ",axTicks(2)),
     las=1)
dev.off()

 

We end up with the plot below, which shows the variability by hour. The most profitable hour is 2AM, with a median of $11.58 per ride.
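The thick line in each box is the median, so a figure like the one quoted above can be checked numerically with `tapply`. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical) that has the same `starthour` and `TotalMoney` columns as our data.

```r
# Hypothetical per-ride earnings; the values are made up for illustration
toy <- data.frame(
  starthour  = c(17, 17, 17, 2, 2, 2, 23, 23),
  TotalMoney = c(6.50, 8.00, 7.25, 12.00, 9.50, 11.00, 5.00, 10.00)
)

# Median earnings per ride for each start hour
medians <- tapply(toy$TotalMoney, toy$starthour, median)
medians

# Start hour with the highest median
names(medians)[which.max(medians)]
```

On the real data the same idea would be `tapply(night$TotalMoney, night$starthour, median)`.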

Our data is still fairly sparse. A quick look with the table function shows this.

# Examine Frequency
table(lyft[c("day","starthour")])
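One way to put a number on that sparsity is the share of day and hour combinations that contain no rides at all. This is a sketch on made-up data; the `toy` data frame and its values are hypothetical and only mimic the `day` and `starthour` columns used above.

```r
# Hypothetical rides; the day and hour values are made up for illustration
toy <- data.frame(
  day       = c("Fri", "Fri", "Sat", "Sat", "Sat"),
  starthour = c(18, 22, 18, 23, 23)
)

# Cross-tabulate rides by day and start hour
counts <- table(toy[c("day", "starthour")])
counts

# Share of day/hour combinations with no rides at all
mean(counts == 0)
```

In this toy table two of the six cells are empty, so the share is one third.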

The table is sparse because relatively few rides have been recorded so far. As this dataset grows we will be able to perform more interesting analysis, such as predicting the expected earnings for a given hour and day. Stay tuned for Part II, where we look to answer some more questions and begin the modeling process.

 

 

 
