This vignette introduces the data.table syntax, its general form, how to subset rows, select and compute on columns, and perform aggregations by group. Familiarity with the data.frame data structure from base R is useful, but not essential to follow this vignette.
Data manipulation operations such as subset, group, update, join, etc. are all inherently related. Keeping these related operations together allows for:
Briefly, if you are interested in reducing programming and compute time tremendously, then this package is for you. The philosophy that data.table adheres to makes this possible. Our goal is to illustrate it through this series of vignettes.
In this vignette, we will use NYC-flights14 data obtained from the flights package (available on GitHub only). It contains On-Time flights data from the Bureau of Transportation Statistics for all the flights that departed from New York City airports in 2014 (inspired by nycflights13). The data is available only for Jan-Oct’14.
We can use data.table ’s fast-and-friendly file reader fread to load flights directly as follows:
input else < "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv" >flights # 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 # 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11 # 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19 # 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7 # 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13 # --- # 253312: 2014 10 31 1 -30 UA LGA IAH 201 1416 14 # 253313: 2014 10 31 -5 -14 UA EWR IAH 189 1400 8 # 253314: 2014 10 31 -8 16 MQ LGA RDU 83 431 11 # 253315: 2014 10 31 -4 15 MQ LGA DTW 75 502 11 # 253316: 2014 10 31 -5 1 MQ LGA SDF 110 659 8 dim(flights) # [1] 253316 11
Aside: fread accepts http and https URLs directly, as well as operating system commands such as sed and awk output. See ?fread for examples.
In this vignette, we will
data.table is an R package that provides an enhanced version of a data.frame , the standard data structure for storing data in base R. In the Data section above, we saw how to create a data.table using fread() , but alternatively we can also create one using the data.table() function. Here is an example:
DT = data.table( a = 1:6, b = 7:12, c = 13:18 ) DT # ID a b c # # 1: b 1 7 13 # 2: b 2 8 14 # 3: b 3 9 15 # 4: a 4 10 16 # 5: a 5 11 17 # 6: c 6 12 18 class(DT$ID) # [1] "character"
You can also convert existing objects to a data.table using setDT() (for data.frame and list structures) or as.data.table() (for other structures). For more details pertaining to the difference (goes beyond the scope of this vignette), please see ?setDT and ?as.data.table .
getOption("datatable.print.nrows")
In contrast to a data.frame , you can do a lot more than just subsetting rows and selecting columns within the frame of a data.table , i.e., within [ . ] (NB: we might also refer to writing things inside DT[. ] as “querying DT ”, as an analogy or in relevance to SQL). To understand it we will have to first look at the general form of the data.table syntax, as shown below:
DT[i, j, by] ## R: i j by ## SQL: where | order by select | update group by
Users with an SQL background might perhaps immediately relate to this syntax.
Take DT , subset/reorder rows using i , then calculate j , grouped by by .
Let’s begin by looking at i and j first - subsetting rows and operating on columns.
ans # 1: 2014 6 1 -9 -5 AA JFK LAX 324 2475 8 # 2: 2014 6 1 -10 -13 AA JFK LAX 329 2475 12 # 3: 2014 6 1 18 -1 AA JFK LAX 326 2475 7 # 4: 2014 6 1 -6 -16 AA JFK LAX 320 2475 10 # 5: 2014 6 1 -4 -45 AA JFK LAX 326 2475 18 # 6: 2014 6 1 -6 -23 AA JFK LAX 329 2475 14
ans # 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9 # 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
We can use the R function order() to accomplish this.
ans # 1: 2014 1 5 6 49 EV EWR XNA 195 1131 8 # 2: 2014 1 6 7 13 EV EWR XNA 190 1131 8 # 3: 2014 1 7 -6 -13 EV EWR XNA 179 1131 8 # 4: 2014 1 8 -7 -12 EV EWR XNA 184 1131 8 # 5: 2014 1 9 16 7 EV EWR XNA 181 1131 8 # 6: 2014 1 13 66 66 EV EWR XNA 188 1131 9
We will discuss data.table ’s fast order in more detail in the data.table internals vignette.
ans # 1: 13 # 2: 13 # 3: 9 # 4: -26 # 5: 1 # 6: 0
A data.table (and a data.frame too) is internally a list as well, with the stipulation that each element has the same length and the list has a class attribute. Allowing j to return a list enables converting and returning data.table very efficiently.
As long as j-expression returns a list , each element of the list will be converted to a column in the resulting data.table . This makes j quite powerful, as we will see shortly. It is also very important to understand this for when you’d like to make more complicated queries!!
ans # 1: 13 14 # 2: 13 -3 # 3: 9 2 # 4: -26 -8 # 5: 1 2 # 6: 0 4 ## alternatively # ans
Since .() is just an alias for list() , we can name columns as we would while creating a list .
ans # 1: 13 14 # 2: 13 -3 # 3: 9 2 # 4: -26 -8 # 5: 1 2 # 6: 0 4
ans # 1: 5.839349 9.807884
Because the three main components of the query ( i , j and by ) are together inside [. ] , data.table can see all three and optimise the query altogether before evaluation, rather than optimizing each separately. We are able to therefore avoid the entire subset (i.e., subsetting the columns besides arr_delay and dep_delay ), for both speed and memory efficiency.
The function length() requires an input argument. We just need to compute the number of rows in the subset. We could have used any other column as the input argument to length() . This approach is reminiscent of SELECT COUNT(dest) FROM flights WHERE origin = 'JFK' AND month = 6 in SQL.
This type of operation occurs quite frequently, especially while grouping (as we will see in the next section), to the point where data.table provides a special symbol .N for it.
When querying a data.table for elements that do not exist, the behavior differs based on the method used.
setkeyv(flights, "origin")
flights["XYZ"] # Returns: # origin year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum . # 1: XYZ NA NA NA NA NA NA NA NA NA NA NA NA .
flights[origin == "XYZ"] # Returns: # Empty data.table (0 rows and 19 cols): year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay.
flights["XYZ", nomatch=NULL] # Returns: # Empty data.table (0 rows and 19 cols): year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay.
Understanding these behaviors can help prevent confusion when dealing with non-existing elements in your data.
.N is a special built-in variable that holds the number of observations in the current group. It is particularly useful when combined with by as we’ll see in the next section. In the absence of group by operations, it simply returns the number of rows in the subset.
Now that we now, we can now accomplish the same task by using .N as follows:
We could have accomplished the same operation by doing nrow(flights[origin == "JFK" & month == 6L]) . However, it would have to subset the entire data.table first corresponding to the row indices in i and then return the rows using nrow() , which is unnecessary and inefficient. We will cover this and other optimisation aspects in detail under the data.table design vignette.
If you’re writing out the column names explicitly, there’s no difference compared to a data.frame (since v1.9.8).
ans # 1: 13 14 # 2: 13 -3 # 3: 9 2 # 4: -26 -8 # 5: 1 2 # 6: 0 4
If you’ve stored the desired columns in a character vector, there are two options: Using the .. prefix, or using the with argument.
select_cols = c("arr_delay", "dep_delay") flights[ , ..select_cols] # arr_delay dep_delay # # 1: 13 14 # 2: 13 -3 # 3: 9 2 # 4: -26 -8 # 5: 1 2 # --- # 253312: -30 1 # 253313: -14 -5 # 253314: 16 -8 # 253315: 15 -4 # 253316: 1 -5
For those familiar with the Unix terminal, the .. prefix should be reminiscent of the “up-one-level” command, which is analogous to what’s happening here – the .. signals to data.table to look for the select_cols variable “up-one-level”, i.e., within the global environment in this case.
flights[ , select_cols, with = FALSE] # arr_delay dep_delay # # 1: 13 14 # 2: 13 -3 # 3: 9 2 # 4: -26 -8 # 5: 1 2 # --- # 253312: -30 1 # 253313: -14 -5 # 253314: 16 -8 # 253315: 15 -4 # 253316: 1 -5
The argument is named with after the R function with() because of similar functionality. Suppose you have a data.frame DF and you’d like to subset all rows where x > 1 . In base R you can do the following:
DF = data.frame(x = c(1,1,1,2,2,3,3,3), y = 1:8) ## (1) normal way DF[DF$x > 1, ] # data.frame needs that ',' as well # x y # 4 2 4 # 5 2 5 # 6 3 6 # 7 3 7 # 8 3 8 ## (2) using with DF[with(DF, x > 1), ] # x y # 4 2 4 # 5 2 5 # 6 3 6 # 7 3 7 # 8 3 8
## not run # returns all columns except arr_delay and dep_delay ans
## not run # returns year,month and day ans
with = TRUE is the default in data.table because we can do much more by allowing j to handle expressions - especially when combined with by , as we’ll see in a moment.
We’ve already seen i and j from data.table ‘s general form in the previous section. In this section, we’ll see how they can be combined together with by to perform operations by group. Let’s look at some examples.
ans # 1: JFK 81483 # 2: LGA 84433 # 3: EWR 87400 ## or equivalently using a character vector in 'by' # ans
ans # 1: JFK 81483 # 2: LGA 84433 # 3: EWR 87400
The unique carrier code "AA" corresponds to American Airlines Inc.
ans # 1: JFK 11923 # 2: LGA 11730 # 3: EWR 2649
ans # 1: JFK LAX 3387 # 2: LGA PBI 245 # 3: EWR LAX 62 # 4: JFK MIA 1876 # 5: JFK SEA 298 # 6: EWR MIA 848 ## or equivalently using a character vector in 'by' # ans
ans # 1: JFK LAX 1 6.590361 14.2289157 # 2: LGA PBI 1 -7.758621 0.3103448 # 3: EWR LAX 1 1.366667 7.5000000 # 4: JFK MIA 1 15.720670 18.7430168 # 5: JFK SEA 1 14.357143 30.7500000 # --- # 196: LGA MIA 10 -6.251799 -1.4208633 # 197: JFK MIA 10 -1.880184 6.6774194 # 198: EWR PHX 10 -3.032258 -4.2903226 # 199: JFK MCO 10 -10.048387 -1.6129032 # 200: JFK DCA 10 16.483871 15.5161290
Now what if we would like to order the result by those grouping columns origin , dest and month ?
data.table retaining the original order of groups is intentional and by design. There are cases when preserving the original order is essential. But at times we would like to automatically sort by the variables in our grouping.
ans # origin dest month V1 V2 # # 1: EWR DFW 1 6.427673 10.0125786 # 2: EWR DFW 2 10.536765 11.3455882 # 3: EWR DFW 3 12.865031 8.0797546 # 4: EWR DFW 4 17.792683 12.9207317 # 5: EWR DFW 5 18.487805 18.6829268 # --- # 196: LGA PBI 1 -7.758621 0.3103448 # 197: LGA PBI 2 -7.865385 2.4038462 # 198: LGA PBI 3 -5.754098 3.0327869 # 199: LGA PBI 4 -13.966667 -4.7333333 # 200: LGA PBI 5 -10.357143 -6.8571429
Keys: Actually keyby does a little more than just ordering. It also sets a key after ordering by setting an attribute called sorted .
We’ll learn more about keys in the Keys and fast binary search based subset vignette; for now, all you have to know is that you can use keyby to automatically order the result by the columns specified in by .
We can store the intermediate result in a variable, and then use order(origin, -dest) on that variable. It seems fairly straightforward.
ans # 1: EWR PHX 121 # 2: EWR MIA 848 # 3: EWR LAX 62 # 4: EWR DFW 1618 # 5: JFK STT 229 # 6: JFK SJU 690
But this requires having to assign the intermediate result and then overwriting that result. We can do one better and avoid this intermediate assignment to a temporary variable altogether by chaining expressions.
ans # 1: EWR PHX 121 # 2: EWR MIA 848 # 3: EWR LAX 62 # 4: EWR DFW 1618 # 5: JFK STT 229 # 6: JFK SJU 690 # 7: JFK SFO 1312 # 8: JFK SEA 298 # 9: JFK SAN 299 # 10: JFK ORD 432
Yes it does. As an example, if we would like to find out how many flights started late but arrived early (or on time), started and arrived late etc…
ans 0, arr_delay>0)] ans # dep_delay arr_delay N # # 1: TRUE TRUE 72836 # 2: FALSE TRUE 34583 # 3: FALSE FALSE 119304 # 4: TRUE FALSE 26593
It is of course not practical to have to type mean(myCol) for every column one by one. What if you had 100 columns to average mean() ?
How can we do this efficiently and concisely? To get there, refresh on this tip - “As long as the j -expression returns a list , each element of the list will be converted to a column in the resulting data.table ”. If we can refer to the data subset for each group as a variable while grouping, we can then loop through all the columns of that variable using the already- or soon-to-be-familiar base function lapply() . No new names to learn specific to data.table .
data.table provides a special symbol called .SD . It stands for Subset of Data. It by itself is a data.table that holds the data for the current group defined using by .
Recall that a data.table is internally a list as well with all its columns of equal length.
Let’s use the data.table DT from before to get a glimpse of what .SD looks like.
DT # ID a b c # # 1: b 1 7 13 # 2: b 2 8 14 # 3: b 3 9 15 # 4: a 4 10 16 # 5: a 5 11 17 # 6: c 6 12 18 DT[, print(.SD), by = ID] # a b c # # 1: 1 7 13 # 2: 2 8 14 # 3: 3 9 15 # a b c # # 1: 4 10 16 # 2: 5 11 17 # a b c # # 1: 6 12 18 # Empty data.table (0 rows and 1 cols): ID
To compute on (multiple) columns, we can then simply use the base R function lapply() .
DT[, lapply(.SD, mean), by = ID] # ID a b c # # 1: b 2.0 8.0 14.0 # 2: a 4.5 10.5 16.5 # 3: c 6.0 12.0 18.0
We are almost there. There is one little thing left to address. In our flights data.table , we only wanted to calculate the mean() of the two columns arr_delay and dep_delay . But .SD would contain all the columns other than the grouping variables by default.
Using the argument .SDcols . It accepts either column names or column indices. For example, .SDcols = c("arr_delay", "dep_delay") ensures that .SD contains only these two columns for each group.
Similar to part g), you can also specify the columns to remove instead of columns to keep using - or ! . Additionally, you can select consecutive columns as colA:colB and deselect them as !(colA:colB) or -(colA:colB) .
Now let us try to use .SD along with .SDcols to get the mean() of arr_delay and dep_delay columns grouped by origin , dest and month .
flights[carrier == "AA", ## Only on trips with carrier "AA" lapply(.SD, mean), ## compute the mean by = .(origin, dest, month), ## for every 'origin,dest,month' .SDcols = c("arr_delay", "dep_delay")] ## for just those specified in .SDcols # origin dest month arr_delay dep_delay # # 1: JFK LAX 1 6.590361 14.2289157 # 2: LGA PBI 1 -7.758621 0.3103448 # 3: EWR LAX 1 1.366667 7.5000000 # 4: JFK MIA 1 15.720670 18.7430168 # 5: JFK SEA 1 14.357143 30.7500000 # --- # 196: LGA MIA 10 -6.251799 -1.4208633 # 197: JFK MIA 10 -1.880184 6.6774194 # 198: EWR PHX 10 -3.032258 -4.2903226 # 199: JFK MCO 10 -10.048387 -1.6129032 # 200: JFK DCA 10 16.483871 15.5161290
ans # 1: 1 2014 1 14 13 AA JFK LAX 359 2475 9 # 2: 1 2014 1 -3 13 AA JFK LAX 363 2475 11 # 3: 2 2014 1 -1 1 AA JFK LAX 358 2475 8 # 4: 2 2014 1 -5 3 AA JFK LAX 358 2475 11 # 5: 3 2014 1 -11 36 AA JFK LAX 375 2475 8 # 6: 3 2014 1 -3 14 AA JFK LAX 368 2475 11
So that we have a consistent syntax and keep using already existing (and familiar) base functions instead of learning new functions. To illustrate, let us use the data.table DT that we created at the very beginning under the section What is a data.table?.
DT[, .(val = c(a,b)), by = ID] # ID val # # 1: b 1 # 2: b 2 # 3: b 3 # 4: b 7 # 5: b 8 # 6: b 9 # 7: a 4 # 8: a 5 # 9: a 10 # 10: a 11 # 11: c 6 # 12: c 12
DT[, .(val = list(c(a,b))), by = ID] # ID val # # 1: b 1,2,3,7,8,9 # 2: a 4, 5,10,11 # 3: c 6,12
Once you start internalising usage in j , you will realise how powerful the syntax can be. A very useful way to understand it is by playing around, with the help of print() .
## look at the difference between DT[, print(c(a,b)), by = ID] # (1) # [1] 1 2 3 7 8 9 # [1] 4 5 10 11 # [1] 6 12 # Empty data.table (0 rows and 1 cols): ID ## and DT[, print(list(c(a,b))), by = ID] # (2) # [[1]] # [1] 1 2 3 7 8 9 # # [[1]] # [1] 4 5 10 11 # # [[1]] # [1] 6 12 # Empty data.table (0 rows and 1 cols): ID
In (1), for each group, a vector is returned, with length = 6,4,2 here. However, (2) returns a list of length 1 for each group, with its first element holding vectors of length 6,4,2. Therefore, (1) results in a length of 6+4+2 = 12 , whereas (2) returns 1+1+1=3 .
The general form of data.table syntax is:
DT[i, j, by]
We have seen so far that,
We can do much more in i by keying a data.table , which allows for blazing fast subsets and joins. We will see this in the “Keys and fast binary search based subsets” and “Joins and rolling joins” vignette.
As long as j returns a list , each element of the list will become a column in the resulting data.table .
We will see how to add/update/delete columns by reference and how to combine them with i and by in the next vignette.