In one of my recent projects I had to group identical rows of a large sparse matrix, mainly to speed up the model fitting process.

The story in short: I couldn’t find a decent solution, since most at some point convert the sparse matrix into a dense form to group over. That is fine for a small matrix, but not for one that explodes into gigabytes in its dense form…

So here is my solution, which exploits the sparse matrix’s compressed column structure to group it efficiently using data.table. First, let’s start with a small matrix to show how it works. We will work with a binary matrix (any integer matrix should work as well, and even floats, though one has to be more careful with those).
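A quick illustration of the float caveat (my own example, not part of the grouping code): paste() formats doubles via as.character(), which keeps only 15 significant digits, so two distinct values can collapse to the same string key and end up in the same group.

```r
x <- 1
y <- 1 + 1e-15          # distinct from x as a double (the difference exceeds machine epsilon)
x == y                  # FALSE: the doubles differ
## [1] FALSE
paste0(x) == paste0(y)  # TRUE: both format as "1", so they would share a group key
## [1] TRUE
```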

Small matrix

set.seed(2020)
Ms <- Matrix::rsparsematrix(10, 10, density = 0.1, rand.x = function(n) 1L)
Ms
## 10 x 10 sparse Matrix of class "dgCMatrix"
##
##  [1,] . . . . . . . . . .
##  [2,] . . 1 . 1 . . . . .
##  [3,] . . . . . . . . . .
##  [4,] . . . . . . . . . .
##  [5,] . . . . . . 1 . . .
##  [6,] . . . 1 . . . . . .
##  [7,] . 1 . . . . . . 1 .
##  [8,] . . 1 . . . . . 1 .
##  [9,] . . . . 1 . . . . .
## [10,] . . . . . . 1 . . .
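For reference, here is the internal structure the function below will exploit. A dgCMatrix is stored in compressed sparse column (CSC) form: slot @i holds the zero-based row indices of the nonzeros (column by column), @p holds the cumulative count of nonzeros per column, and @x holds the values. The (i, j, v) triplets can be recovered like this (the slot values shown follow from the matrix printed above):

```r
Ms@i  # zero-based row indices of the nonzeros, column by column
## [1] 6 1 7 5 1 8 4 9 6 7
Ms@p  # column pointers: diff(Ms@p) gives the number of nonzeros per column
##  [1]  0  0  1  3  4  6  6  8  8 10 10

# recover the triplet (i, j, v) representation:
i <- Ms@i + 1L                                   # 1-based row indices
j <- rep(seq_len(ncol(Ms)), times = diff(Ms@p))  # uncompress the column indices
v <- Ms@x                                        # the nonzero values (all 1 here)
```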

And here is the group function:

group.dgCMatrix <- function(x) {
  stopifnot(inherits(x, 'dgCMatrix'))

  # exploit the compressed sparse column structure to build (i, j, v) triplets quickly
  df <- data.table::data.table(
    i = x@i + 1L,
    j = rep(seq(x@Dim[2]), times = diff(x@p)), # uncompress j
    v = x@x,
    key = c('i', 'j')
  )

  # Grouping by the key column i keeps the result keyed on i, so we can join on i.
  # The join restores all-zero rows: they get jcode = NA and thus form one group.
  # Instead of paste one could use digest to compute hashes, but it is slower (2x - 10x).
  # Note also that paste() has limited precision if you plan to use a floating point matrix.
  df <- df[v != 0, .(jcode = paste0(v, '@', j, collapse = ',')), by = i][J(1L:x@Dim[1]), ]

  # make groups: one representative row (the first occurrence) per unique jcode
  agg <- df[, .(.N, I = .I[1]), by = jcode]

  cw <- x[agg$I, , drop = FALSE]
  stopifnot(nrow(cw) == nrow(agg))
  structure(cw, 'N' = agg$N, 'I' = agg$I)
}

Now let us use it:

GMs <- group.dgCMatrix(Ms)
GMs
## 7 x 10 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . . . . . . . .
## [2,] . . 1 . 1 . . . . .
## [3,] . . . . . . 1 . . .
## [4,] . . . 1 . . . . . .
## [5,] . 1 . . . . . . 1 .
## [6,] . . 1 . . . . . 1 .
## [7,] . . . . 1 . . . . .

Note that the returned matrix carries two attributes:

• N: the weight of each row, i.e. how many times that row occurs in the original matrix
• I: the row index of the group’s first occurrence in the original matrix

attr(GMs, 'N')
## [1] 3 1 2 1 1 1 1
attr(GMs, 'I')
## [1] 1 2 5 6 7 8 9

To clarify: the first row of GMs is the all-zeros row, which occurs N == 3 times with its first occurrence at row index I == 1 of Ms. The third row occurs N == 2 times, first occurring at row index I == 5. And so on.
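Two quick sanity checks (my own additions, not part of the original code): every row of GMs should equal its first-occurrence row in Ms, and the weights should account for all original rows.

```r
# each grouped row equals the first occurrence it represents
stopifnot(all(as.matrix(GMs) == as.matrix(Ms[attr(GMs, 'I'), , drop = FALSE])))

# the weights sum back to the original number of rows
stopifnot(sum(attr(GMs, 'N')) == nrow(Ms))
```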

Large matrix

Now let’s try this on a matrix whose dense form cannot possibly fit into memory.

M <- Matrix::rsparsematrix(1e6, 275e3, density = 1/1e5, rand.x = function(n) 1L)

In dense form, this matrix is … big. Assuming integers occupy 4 bytes:

sprintf('%.2f GB', prod(dim(M))*4/1024^3)
## [1] "1024.45 GB"

No way that this 1 TB dense matrix will fit into memory. But in its sparse form it is tiny:

sprintf('%.2f GB', object.size(M)/1024^3)
## [1] "0.03 GB"

print(microbenchmark::microbenchmark(GM <- group.dgCMatrix(M), times = 1))
## Unit: seconds
##                      expr      min       lq     mean   median       uq      max
##  GM <- group.dgCMatrix(M) 22.65951 22.65951 22.65951 22.65951 22.65951 22.65951
##  neval
##      1

dim(GM)
## [1] 890366 275000
table(attr(GM, 'N'))
##
## 853256  29626   6326    998    150      8      1      1