Unexpected behavior with summary function in R

I often find myself working with data that includes dates and times. Sometimes I am interested in looking at what happened on on a particular calendar day. I usually avoid dealing with actual datetime formats when working at the day level and prefer to use an integer for representing a day. For instance the first day of each quarter can be represented as integers (numeric also works for this example). For instance if I wanted to know the oldest date in my dataset I can just take the minimum since I am using a numerical data structure.

[1] 20140101

For some reason which I don't recall, I tried using the summary() function on my dates. The only values that would be valid are the minimum and the maximum, or so I thought.

The output:
# Min.            1st Qu.       Median      Mean          3rd Qu.       Max.
# 20140000 20140000 20140000 20140000 20140000 20140000

This behavior which seemed odd to me, is caused by the way summary() deals with numerical data. So I decided to look at what summary is actually doing. To view the code behind the summary function type:

The portion of code that we are interested in is this:

This code chunk gets executed when the object that we pass to summary is numeric.  If we substitute in our object 'mydates' we get the following code.

If you step through the code line by line, you will notice that after line 4, summary produces what we would expect to see for a min and max value. However, after you execute line 5, the numbers are changed because they are not the actual numbers but they are changed to be significant figures. For example try:

[1] 20140000

So be careful when using generic functions if you don't know what they are doing. I would encourage to take a look at the code behind some of the R functions you use the most. For instance using the fivenum() function does not change my min and max values the same way summary did.

Leave a Comment

Filed under R

Leave a Reply