Keeping rows containing particular strings in R

Reading Time: 2 minutes

I was recently presented with the need to filter out certain rows in my dataset based upon them containing the desired strings. I needed to retain any row that had a "utm_source" and "utm_medium" and "utm_campaign". Each row in my dataset was a single string. The idea is to parse the strings of interest. My approach was to use grep and check each string for each condition that I needed it to satisfy. I consulted with my co-blogger to see if he had a more intelligent way of approaching this problem. He tackled it with a regular expression using a look-ahead. You can see my 'checker' function below and Jeremy's function 'checker2'. Both seem to perform the required task correctly. So now it is simply a matter of performance.

I am not able to share the full dataset that I was using, due to privacy concerns. The dataset that I tested both functions against had 26,746 rows. The 'checker' function which I wrote took  on average 0.0801 seconds and Jeremy's approach took  0.1488 seconds. I decided to stick with my checker function, but that was not because of speed. I would have happily accepted the increased computation time for mine if the times had been reversed. The reason for this is that I find mine easier to read. This means that there is a chance that I could come back to this code in 6 months and have a clue about what it is suppose to be doing. Regular Expressions can sometimes be quite hard to come back to and say, " oh yeah, I wanted to check if all the characters that occupy prime digits in my string are vowels!". I think that my simplistic grep statement will be easier to change if that becomes needed in the future and so I will stick with the 'checker' approach. Do you have a better way to approach this using R? If so, make sure to post a comment.

Leave a Comment

Filed under R

Leave a Reply