In one of my recent projects I had to group data from a large sparse matrix. This was mainly to speed up the model fitting process.
The story in short: I couldn’t find a decent solution, since most approaches at some point convert the sparse matrix into a dense form to group over. That is fine for a small matrix, but not for one that explodes into gigabytes in its dense form…
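For illustration, the naive pattern looks something like this (a sketch only; group_dense is a made-up name, and the as.matrix() call is exactly the step that explodes memory):
group_dense <- function(x) {
  d <- as.matrix(x)                         # densify: the memory killer
  key <- apply(d, 1, paste, collapse = ',') # one string key per row
  keep <- !duplicated(key)                  # first occurrence of each row
  structure(d[keep, , drop = FALSE],
            'N' = as.integer(table(key)[key[keep]]), # occurrences per kept row
            'I' = which(keep))
}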
So here is my solution, which exploits the sparse matrix’s internal structure (compressed sparse column, easily expanded into triplets) to efficiently group it using data.table. First, let’s start with a small matrix to show how it works. We will work with a binary matrix (but any integer matrix should work as well; even floats work, though one has to be more careful with those).
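To make the float caveat concrete: paste()/as.character() format doubles to about 15 significant digits, so two distinct floats can collapse into the same group key. A tiny illustration:
x1 <- 1/3
x2 <- 1/3 + 1e-16
x1 == x2                              # FALSE: the doubles differ
as.character(x1) == as.character(x2)  # TRUE: but their string keys collide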
set.seed(2020)
Ms <- Matrix::rsparsematrix(10, 10, density = 0.1, rand.x = function(n) 1L)
Ms
## 10 x 10 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . . . . . . . .
## [2,] . . 1 . 1 . . . . .
## [3,] . . . . . . . . . .
## [4,] . . . . . . . . . .
## [5,] . . . . . . 1 . . .
## [6,] . . . 1 . . . . . .
## [7,] . 1 . . . . . . 1 .
## [8,] . . 1 . . . . . 1 .
## [9,] . . . . 1 . . . . .
## [10,] . . . . . . 1 . . .
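Before writing the group function, it helps to peek at the three dgCMatrix slots it will exploit (these are standard slots of the Matrix package’s compressed sparse column format):
Ms@i  # 0-based row indices of the nonzero entries
Ms@p  # cumulative nonzero count per column; diff(Ms@p) = nonzeros per column
Ms@x  # the nonzero values, in the same order as Ms@i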
And here is the group function:
group.dgCMatrix <- function(x) {
  stopifnot(inherits(x, 'dgCMatrix'))
  # exploit the compressed sparse column structure: expand it into
  # (i, j, v) triplets to group quickly
  df <- data.table::data.table(
    i = x@i + 1L,                              # 0-based row indices -> 1-based
    j = rep(seq(x@Dim[2]), times = diff(x@p)), # uncompress j
    v = x@x,
    key = c('i', 'j')
  )
  # When grouping by i, the key becomes i. Then join on i, so that all-zero rows
  # (which produce no triplets) reappear with jcode == NA and group together.
  # Instead of paste one could use digest to compute hashes, but it is slower (2x - 10x).
  # Note also that paste() has limited precision if you plan to use a floating point matrix.
  df <- df[v != 0, .(jcode = paste0(v, '@', j, collapse = ',')), by = i][J(1L:x@Dim[1]), ]
  # make groups: N = group size, I = first row index of the group in x
  agg <- df[, .(.N, I = .I[1]), by = jcode]
  cw <- x[agg$I, , drop = FALSE]
  stopifnot(nrow(cw) == nrow(agg))
  structure(cw, 'N' = agg$N, 'I' = agg$I)
}
Now let us use it:
GMs <- group.dgCMatrix(Ms)
GMs
## 7 x 10 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . . . . . . . .
## [2,] . . 1 . 1 . . . . .
## [3,] . . . . . . 1 . . .
## [4,] . . . 1 . . . . . .
## [5,] . 1 . . . . . . 1 .
## [6,] . . 1 . . . . . 1 .
## [7,] . . . . 1 . . . . .
Note that the returned matrix has two attributes:
- N: the weight of each row (i.e. how many times the row occurs)
- I: the index of the row’s first occurrence in the original matrix
attr(GMs, 'N')
## [1] 3 1 2 1 1 1 1
attr(GMs, 'I')
## [1] 1 2 5 6 7 8 9
To clarify: the first row of GMs is an all-zeros row, occurring N == 3 times with its first occurrence at row-index I == 1. The third row occurs N == 2 times and has its first occurrence at row-index I == 5. Etc.
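As a quick sanity check on this small example (illustrative only, since it densifies the matrices): the weights should add back up to the original row count, and every original row should match some grouped row.
stopifnot(sum(attr(GMs, 'N')) == nrow(Ms))   # 3+1+2+1+1+1+1 == 10
rowkey <- function(m) apply(as.matrix(m), 1, paste, collapse = ',')
stopifnot(all(rowkey(Ms) %in% rowkey(GMs)))  # all original rows represented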
Now let’s try this on a matrix whose dense form stands no chance of fitting into memory.
M <- Matrix::rsparsematrix(1e6, 275e3, density = 1/1e5, rand.x = function(n) 1L)
In dense form, this matrix is … big. Assuming integers occupy 4 bytes:
sprintf('%.2f GB', prod(dim(M))*4/1024^3)
## [1] "1024.45 GB"
No way that this 1 TB dense matrix will fit into memory. But in its current sparse form it is relatively small:
sprintf('%.2f GB', object.size(M)/1024^3)
## [1] "0.03 GB"
About 30 MB.
print(microbenchmark::microbenchmark(GM <- group.dgCMatrix(M), times = 1))
## Unit: seconds
##                      expr      min       lq     mean   median       uq      max neval
##  GM <- group.dgCMatrix(M) 22.65951 22.65951 22.65951 22.65951 22.65951 22.65951     1
Grouping took about 23 seconds.
dim(GM)
## [1] 890366 275000
So we went from 1M rows down to 890k. And here is how the weights are distributed:
table(attr(GM, 'N'))
##
##      1     2    3   4   5 6 7 63717
## 853256 29626 6326 998 150 8 1     1
The 63,717 all-zero rows all collapse into a single group. The rows that do contain ones split into 853k unique rows plus some duplicate rows occurring 2 up to 7 times.
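A quick check (sketched here, not run above) that the single N == 63717 group is indeed the all-zeros row:
zero_row <- which(Matrix::rowSums(GM) == 0)  # index of the all-zeros group
attr(GM, 'N')[zero_row]                      # should print 63717, per the table
And to close the loop on the original motivation: a fitter that accepts case weights can now run on 890k weighted rows instead of 1M. As a hedged sketch, glmnet is one routine that accepts sparse matrices and a weights argument; y_grp below is a hypothetical response already aggregated to the grouped rows (in practice that means grouping on cbind(y, X) rather than on X alone, so identical feature rows also share their response):
# fit <- glmnet::glmnet(x = GM, y = y_grp, weights = attr(GM, 'N'))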