Ride sharing is a popular way to get around and much cheaper than some alternatives. But how much do the drivers actually make? We will start this series on Lyft driving data by getting the data loaded and trying to understand which hours are the best to drive.
The first step is to get our data loaded. The data is contained in a GitHub repository. You can clone the repo to get started or read the csv file in directly. We will just read the data file in directly. If you want to see the whole script in one place, try LyftData.R.
    # Load libraries ----
    library(lubridate)
    library(sqldf)
    library(skimr)

    # Read in Data ---
    lyft <- read.csv(file="https://raw.githubusercontent.com/Spoted21/lyft/master/lyft2.csv",
                     stringsAsFactors = FALSE)

    # Examine Data ----
    head(lyft)
    dim(lyft)
    str(lyft)
    # My new favorite function
    skim(lyft)
After loading the data, some cleaning is needed before we can look at which hours are the best to drive. It is also useful to create a variable that indicates which rides are part of the same driving session. This gives us a convenient level at which to analyze the ride data.
    # Calculate StartTime and EndTime ----
    # Date is parsed as month/day/year and Time as a 12-hour clock with AM/PM
    lyft$StartTime <- as.POSIXct(
      paste(lyft$Date, lyft$Time),
      format = "%m/%d/%Y %I:%M %p")
    # Ride duration is stored as minutes plus seconds
    lyft$EndTime <- lyft$StartTime + (lyft$Time_Min * 60 + lyft$Time_Sec)

    ################################################################################
    # Calculate a driving session ----
    # If more than 4 hours since last ride a new session
    # is assumed to have started
    # (assumes the rows are in chronological order)
    ################################################################################
    lyft$rideSession <- 1
    for (i in seq_len(nrow(lyft))[-1]) {
      # Minutes between the end of the previous ride and the start of this one
      timedifference <- as.numeric(difftime(
        lyft$StartTime[i], lyft$EndTime[i - 1], units = "mins"))
      if (timedifference <= (60 * 4)) {
        # Same session as the previous ride
        lyft$rideSession[i] <- lyft$rideSession[i - 1]
      } else {
        # Gap of more than 4 hours: start a new session
        lyft$rideSession[i] <- lyft$rideSession[i - 1] + 1
      }
    }

    lyft$TotalMoney <- lyft$Amount + lyft$Tip
    lyft$starthour <- hour(lyft$StartTime)

    # Night time driving 5pm to 3AM
    night <- lyft[lyft$starthour %in% c(17:23, 0:3), ]
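The session rule (a gap of more than 4 hours since the last ride starts a new session) can also be written without a loop. The following is only a minimal sketch on a small made-up data frame: the `toy` rows and their timestamps are hypothetical and are not taken from lyft2.csv, and the rides are assumed to be sorted by start time.

```r
# Hypothetical toy rides, sorted by start time (not from lyft2.csv)
toy <- data.frame(
  StartTime = as.POSIXct(c("2018-01-05 18:00:00", "2018-01-05 18:40:00",
                           "2018-01-06 17:30:00", "2018-01-06 20:00:00",
                           "2018-01-07 19:00:00"), tz = "UTC"),
  EndTime   = as.POSIXct(c("2018-01-05 18:25:00", "2018-01-05 19:10:00",
                           "2018-01-06 17:55:00", "2018-01-06 20:25:00",
                           "2018-01-07 19:20:00"), tz = "UTC")
)

# Minutes between each ride's start and the previous ride's end
gapMins <- as.numeric(difftime(toy$StartTime[-1],
                               toy$EndTime[-nrow(toy)],
                               units = "mins"))

# The first ride opens session 1; every gap over 4 hours opens a new session
toy$rideSession <- cumsum(c(TRUE, gapMins > 60 * 4))
toy$rideSession
```

Because `cumsum` counts how many session breaks have occurred up to each row, the first two toy rides share a session, the next two share another, and the last ride stands alone.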
With the data loaded and cleaned, we can move on to visualization. We want to see the variability of money earned by hour. A boxplot will work nicely for our purpose.
    # Used for formatting the plot
    HourLabels <- data.frame(hour = 0L:23L,
                             label = c(paste0(c(12,1:11),"AM"),
                                       paste0(c(12,1:11),"PM")) )

    # Distribution of Money Made by start hour ----
    plotData <- night
    myLabels <- HourLabels[HourLabels$hour %in% unique(plotData$starthour),]$label
    myColors <- c("lightblue","wheat","lightgreen","gray")
    totalrides <- nrow(night)

    # Make Boxplot To Show Money Made by Hour ----
    png(filename = "MoneyByHour.png")
    with(
      plotData,
      boxplot(
        TotalMoney ~ starthour,
        col = myColors,
        las = 1,
        main = paste0("Distribution of Money Made by Start Hour\n",
                      "(n = ", totalrides, ")"),
        names = myLabels,
        yaxt = "n"
      )
    ) # End With Statement

    # Add formatted axis
    axis(side = 2,
         at = axTicks(2),
         labels = paste0("$ ", axTicks(2)),
         las = 1)
    dev.off()
We end up with the plot below, which shows us the variability by hour. The most profitable hour is 2AM, with a median of $11.58.
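The medians drawn in a boxplot can also be computed directly, which makes it easy to see how many rides sit behind each one. This is a minimal sketch on made-up numbers: the `toy` data frame is hypothetical and only reuses the `starthour` and `TotalMoney` column names from the code above.

```r
# Hypothetical toy rides (values are made up for illustration)
toy <- data.frame(
  starthour  = c(17, 17, 18, 18, 18, 2, 2),
  TotalMoney = c(6.50, 8.00, 5.25, 9.75, 7.00, 12.00, 10.50)
)

# Median money and number of rides for each start hour
medByHour   <- tapply(toy$TotalMoney, toy$starthour, median)
ridesByHour <- table(toy$starthour)

# Start hour with the highest median
bestHour <- names(medByHour)[which.max(medByHour)]

medByHour
ridesByHour
bestHour
```

Looking at `ridesByHour` next to `medByHour` is a quick check on whether a high median rests on only a handful of rides.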
Our data is still fairly sparse. A quick look with the table function will show this.

    # Examine Frequency
    table(lyft[c("day","starthour")])
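To put a number on the sparsity, one option is to measure the share of day-by-hour cells that contain no rides at all. The sketch below uses a hypothetical `toy` data frame, and it assumes a `day` column can be derived from the start time with `weekdays()`; how the `day` column in lyft2.csv is actually defined is not shown here.

```r
# Hypothetical toy rides (timestamps are made up for illustration)
toy <- data.frame(
  StartTime = as.POSIXct(c("2018-01-05 18:00:00", "2018-01-05 18:40:00",
                           "2018-01-06 17:30:00", "2018-01-06 23:10:00"),
                         tz = "UTC")
)

# Assumption: derive day of week and start hour from the timestamp
toy$day       <- weekdays(toy$StartTime)   # day names depend on the locale
toy$starthour <- as.integer(format(toy$StartTime, "%H"))

# Rides per day and start hour, and the share of cells with no rides
counts     <- table(toy[c("day", "starthour")])
emptyShare <- mean(counts == 0)

counts
emptyShare
```

A large `emptyShare` means many day and hour combinations have no observations yet.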
The sparsity is due to the low number of rides given so far. As this dataset grows we will be able to perform more interesting analysis, such as predicting the expected earnings for a given hour and day. Stay tuned for Part II, where we look to answer some more questions and begin the modeling process.