Analyzing Lyft Data - Part I

Ride sharing is a popular way to get around and is much cheaper than some alternatives. But how much do the people who choose to drive for a ride sharing service actually make? We will start this series on Lyft driving data by getting the data loaded and trying to understand which hours are the best to drive.

The first step is to get our data loaded. The data is contained in a GitHub repository. You can clone the repo to get started or read the csv file in directly. We will just read the data file in directly. If you want to see the whole script in one place, try LyftData.R.
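As a rough sketch of the loading step, the snippet below reads a csv file with read.csv(). The file contents and the column names start_time and amount are made up for illustration and are not taken from the actual repository. A temporary local file stands in for the real one so the example runs on its own.

```r
# Hypothetical stand-in for the real file: the column names and values
# below are invented for illustration only.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "start_time,amount",
  "2016-03-05 01:10:00,7.25",
  "2016-03-05 01:40:00,12.10",
  "2016-03-05 02:15:00,9.80"
), csv_path)

# read.csv() accepts a local path or a URL, so the raw URL of the csv
# file could be passed here in place of csv_path.
rides <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the column types before doing any cleaning.
str(rides)
```

Reading straight from a URL avoids cloning the repo, at the cost of needing a network connection every time the script runs.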

Once the data is loaded, it needs some cleaning. Since we will be looking at which hours are the best to drive, it is also useful to create a variable that indicates which rides are part of the same driving session. This gives us a convenient level at which to analyze the ride data.
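One possible way to build such a session variable is sketched below. It assumes hypothetical columns start_time and amount, and it assumes that a gap of more than two hours between consecutive rides starts a new session. Both the column names and the two-hour threshold are illustrative choices, not necessarily what LyftData.R does.

```r
# Made-up rides for illustration; column names are assumptions.
rides <- data.frame(
  start_time = c("2016-03-05 01:10:00", "2016-03-05 01:40:00",
                 "2016-03-05 02:15:00", "2016-03-06 22:05:00",
                 "2016-03-06 22:50:00"),
  amount = c(7.25, 12.10, 9.80, 6.40, 15.00),
  stringsAsFactors = FALSE
)

# Cleaning: parse the timestamps and sort the rides chronologically.
rides$start_time <- as.POSIXct(rides$start_time, tz = "UTC")
rides <- rides[order(rides$start_time), ]

# Hour of day (0-23) in which each ride started.
rides$hour <- as.integer(format(rides$start_time, "%H", tz = "UTC"))

# Hours elapsed since the previous ride; the first ride gets Inf so it
# always opens a session.
gap_hours <- c(Inf, diff(as.numeric(rides$start_time)) / 3600)

# Assumed rule: a gap longer than 2 hours starts a new session.
rides$session <- cumsum(gap_hours > 2)
```

The cumulative sum turns each "new session" flag into a running session number, so every ride can later be grouped by session.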

With the data loaded and cleaned, we can attempt a visualization. We want to see the variability of money earned by hour, and a boxplot works nicely for that purpose.
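A minimal sketch of such a boxplot with base R follows. The data frame is made up, and the columns hour and amount are assumed names rather than the ones in the real dataset.

```r
# Made-up rides for illustration; column names are assumptions.
rides <- data.frame(
  hour   = c(1, 1, 2, 2, 22, 22),
  amount = c(7.25, 12.10, 9.80, 11.00, 6.40, 15.00)
)

# One box per hour of day, showing the spread of earnings in that hour.
bp <- boxplot(amount ~ hour, data = rides,
              xlab = "Hour of day", ylab = "Amount earned ($)",
              main = "Earnings by hour")

# The medians drawn in the plot can also be computed directly.
hourly_median <- tapply(rides$amount, rides$hour, median)
print(hourly_median)
```

Computing the medians alongside the plot makes it easy to report the most profitable hour as a number instead of reading it off the chart.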


We end up with the plot below, which shows us the variability by hour. The most profitable hour is 2 AM, with a median of $11.58.

Our data is still fairly sparse. A quick look with the table function will show this.
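As an illustration, table() counts how many rides fall into each hour. The data and the column name hour below are made up for the example.

```r
# Made-up rides for illustration; the column name is an assumption.
rides <- data.frame(
  hour = c(1, 1, 2, 2, 22, 22, 23)
)

# Number of rides observed in each hour of the day.
rides_per_hour <- table(rides$hour)
print(rides_per_hour)
```

Hours with only a handful of rides produce boxes that should not be read too closely, which is why the counts are worth checking next to the plot.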

The sparsity is due to the low number of rides given so far. As this dataset grows we will be able to perform more interesting analysis, such as predicting the estimated earnings for a given hour and day. Stay tuned for Part II, where we look to answer some more questions and begin the modeling process.



