In one of my recent projects I had to group identical rows of a large sparse matrix, mainly to speed up the model fitting process.

The story in short: I couldn’t find a decent solution, since most approaches at some point convert the sparse matrix into dense form to group over it. That is OK for a small matrix, but not for one that explodes into gigabytes in its dense form…

So here is my solution, which exploits the internal structure of a dgCMatrix (its compressed column pointers, expanded back into triplet form) to efficiently group a sparse matrix using data.table. First, let’s start with a small matrix to show how it works. We will work with a binary matrix (but any integer matrix should also work, and even floats, though one has to be more careful with those).

Small matrix

set.seed(2020)
Ms <- Matrix::rsparsematrix(10, 10, density = 0.1, rand.x = function(n) 1L)
Ms
## 10 x 10 sparse Matrix of class "dgCMatrix"
##                          
##  [1,] . . . . . . . . . .
##  [2,] . . 1 . 1 . . . . .
##  [3,] . . . . . . . . . .
##  [4,] . . . . . . . . . .
##  [5,] . . . . . . 1 . . .
##  [6,] . . . 1 . . . . . .
##  [7,] . 1 . . . . . . 1 .
##  [8,] . . 1 . . . . . 1 .
##  [9,] . . . . 1 . . . . .
## [10,] . . . . . . 1 . . .
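
Before writing the grouping function, it helps to see what a dgCMatrix actually stores: 0-based row indices in @i, column "pointers" in @p and the nonzero values in @x. A quick illustration (not part of the grouping code itself):

str(Ms@i) # 0-based row indices of the nonzero entries, ordered column by column
str(Ms@p) # column pointers: p[k + 1] - p[k] is the number of nonzeros in column k
str(Ms@x) # the nonzero values themselves
# The (1-based) column index of every entry can be recovered with
# rep(seq(ncol(Ms)), times = diff(Ms@p)), which is exactly what the function below does.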

And here is the group function:

group.dgCMatrix <- function(x) {
  stopifnot(inherits(x, 'dgCMatrix'))

  # exploit the dgCMatrix (compressed sparse column) internals to build the triplet (i, j, v) form quickly
  df <- data.table::data.table(
    i = x@i + 1L,
    j = rep(seq(x@Dim[2]), times = diff(x@p)), # uncompress j
    v = x@x,
    key = c('i', 'j')
  )

  # Build one key string per row: grouping by i keeps i as the key, and the
  # subsequent join on 1:nrow(x) brings back the all-zero rows (their jcode becomes NA).
  # Instead of paste one could use digest to compute hashes, but it is slower (2x-10x).
  # Note also that paste() has limited precision if you plan to use a floating-point matrix.
  df <- df[v != 0, .(jcode = paste0(v, '@', j, collapse = ',')), by = i][J(1L:x@Dim[1]), ]

  # group rows by their key: N = group size, I = row index of the first occurrence
  agg <- df[, .(.N, I = .I[1]), by = jcode]

  # keep one representative row per group and attach the group sizes and first occurrences
  cw <- x[agg$I, , drop = FALSE]
  stopifnot(nrow(cw) == nrow(agg))
  structure(cw, 'N' = agg$N, 'I' = agg$I)
}
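
To make the jcode encoding concrete, we can rebuild the per-row keys for the small matrix outside the function (purely illustrative, using the same expressions as the function body):

dt <- data.table::data.table(
  i = Ms@i + 1L,
  j = rep(seq(ncol(Ms)), times = diff(Ms@p)),
  v = Ms@x,
  key = c('i', 'j')
)
dt[, .(jcode = paste0(v, '@', j, collapse = ',')), by = i]

Rows 5 and 10, for example, both reduce to the key "1@7" (a single one in column 7), so they fall into the same group. The three all-zero rows do not appear in this table at all; they only come back as NA keys after the join on 1:nrow(x), which is what collapses them into one group.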

Now let us use it:

GMs <- group.dgCMatrix(Ms)
GMs
## 7 x 10 sparse Matrix of class "dgCMatrix"
##                         
## [1,] . . . . . . . . . .
## [2,] . . 1 . 1 . . . . .
## [3,] . . . . . . 1 . . .
## [4,] . . . 1 . . . . . .
## [5,] . 1 . . . . . . 1 .
## [6,] . . 1 . . . . . 1 .
## [7,] . . . . 1 . . . . .

Note that the returned matrix has two attributes:

attr(GMs, 'N')
## [1] 3 1 2 1 1 1 1
attr(GMs, 'I')
## [1] 1 2 5 6 7 8 9

To clarify: the first row of GMs is an all-zeros row, which occurs N == 3 times in Ms, with its first occurrence at row index I == 1. The third row occurs N == 2 times, with its first occurrence at row index I == 5. And so on.
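
Two quick sanity checks (illustrative only) confirm that nothing was lost: the group sizes add up to the number of original rows, and every representative row is taken straight from Ms:

sum(attr(GMs, 'N')) == nrow(Ms)   # TRUE: every original row belongs to exactly one group
all(GMs == Ms[attr(GMs, 'I'), ])  # TRUE: the representatives are literally rows of Ms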

Large matrix

Now let’s try this on a matrix whose dense form has no chance of fitting into memory.

M <- Matrix::rsparsematrix(1e6, 275e3, density = 1/1e5, rand.x = function(n) 1L)

In dense form, this matrix is … big. Assuming integers occupy 4 bytes:

sprintf('%.2f GB', prod(dim(M))*4/1024^3)
## [1] "1024.45 GB"

No way that this 1 TB dense matrix will fit into memory. But in its current sparse form it is relatively small:

sprintf('%.2f GB', object.size(M)/1024^3)
## [1] "0.03 GB"

About 30 MB.

print(microbenchmark::microbenchmark(GM <- group.dgCMatrix(M), times = 1))
## Unit: seconds
##                      expr      min       lq     mean   median       uq      max
##  GM <- group.dgCMatrix(M) 22.65951 22.65951 22.65951 22.65951 22.65951 22.65951
##  neval
##      1

Grouping took about 23 seconds.

dim(GM)
## [1] 890366 275000

So we went from 1M rows down to about 890k. And here is how the group sizes (the weights) are distributed:

table(attr(GM, 'N'))
## 
##      1      2      3      4      5      6      7  63717 
## 853256  29626   6326    998    150      8      1      1

The 63,717 all-zero rows are collapsed into a single group. The rows where ones occur are mostly unique (853k of them); the rest are duplicated between 2 and 7 times.
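
As a final sanity check (again just illustrative), the number of all-zero rows in M can be counted directly and should match the size of that largest group:

sum(Matrix::rowSums(M != 0) == 0)  # should equal 63717, the size of the largest group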