In one of my recent projects I had to group the rows of a large sparse matrix, mainly to speed up the model fitting process.

The story in short: I couldn’t find a decent solution, since most approaches at some point convert the sparse matrix into dense form to group over. That is fine for a small matrix, but not for one that explodes into gigabytes in its dense form…

So here is my solution, which exploits the sparse triplet structure to efficiently group a sparse matrix using `data.table`.

First, let’s start with a small matrix to show how it works. We will work with a binary matrix (any integer matrix should work as well, and even floats, though one has to be more careful with those).

```
set.seed(2020)
# a 10 x 10 binary sparse matrix with ~10% of the entries set to 1
Ms <- Matrix::rsparsematrix(10, 10, density = 0.1, rand.x = function(n) 1L)
Ms
```

```
## 10 x 10 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . . . . . . . .
## [2,] . . 1 . 1 . . . . .
## [3,] . . . . . . . . . .
## [4,] . . . . . . . . . .
## [5,] . . . . . . 1 . . .
## [6,] . . . 1 . . . . . .
## [7,] . 1 . . . . . . 1 .
## [8,] . . 1 . . . . . 1 .
## [9,] . . . . 1 . . . . .
## [10,] . . . . . . 1 . . .
```
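To see what the grouping function below will exploit, here is a quick look at the relevant slots of a `dgCMatrix` (the expected values are written out from the matrix printed above):

```
Ms@i  # 0-based row indices of the non-zeros, column by column
## [1] 6 1 7 5 1 8 4 9 6 7
Ms@p  # column pointers: diff(Ms@p) gives the number of non-zeros per column
##  [1]  0  0  1  3  4  6  6  8  8 10 10
Ms@x  # the stored non-zero values (all ones here)
##  [1] 1 1 1 1 1 1 1 1 1 1
```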

And here is the group function:

```
group.dgCMatrix <- function(x) {
  stopifnot(inherits(x, 'dgCMatrix'))
  # here we exploit the sparse triplet structure to group quickly:
  # x@i holds 0-based row indices, x@p the column pointers, x@x the values
  df <- data.table::data.table(
    i = x@i + 1L,
    j = rep(seq(x@Dim[2]), times = diff(x@p)), # uncompress j from the column pointers
    v = x@x,
    key = c('i', 'j')
  )
  # Build one signature string per row. When grouping by i, the key becomes i;
  # then join on i to bring back the all-zero rows (their jcode is NA).
  # Instead of paste one could use digest to compute hashes, but it is slower (2x - 10x).
  # Note also that paste() has limited precision if you plan to use a floating point matrix.
  df <- df[v != 0, .(jcode = paste0(v, '@', j, collapse = ',')), by = i][J(1L:x@Dim[1]), ]
  # make groups: keep the count (N) and first occurrence (I) of each signature
  agg <- df[, .(.N, I = .I[1]), by = jcode]
  cw <- x[agg$I, , drop = FALSE]
  stopifnot(nrow(cw) == nrow(agg))
  structure(cw, 'N' = agg$N, 'I' = agg$I)
}
```
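To make the signature encoding concrete: row 7 of `Ms` has ones in columns 2 and 9, so its `jcode` comes out as:

```
paste0(c(1, 1), '@', c(2, 9), collapse = ',')
```

```
## [1] "1@2,1@9"
```

All-zero rows never enter the signature computation; they come back with an `NA` signature through the join and thus fall into a single group.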

Now let us use it:

```
GMs <- group.dgCMatrix(Ms)
GMs
```

```
## 7 x 10 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . . . . . . . .
## [2,] . . 1 . 1 . . . . .
## [3,] . . . . . . 1 . . .
## [4,] . . . 1 . . . . . .
## [5,] . 1 . . . . . . 1 .
## [6,] . . 1 . . . . . 1 .
## [7,] . . . . 1 . . . . .
```

Note that the returned matrix has 2 attributes:

- `N`: the weight of each row (i.e. how many times the row occurs in the original matrix)
- `I`: the index position of the row’s first occurrence in the original matrix

```
attr(GMs, 'N')
```

```
## [1] 3 1 2 1 1 1 1
```

```
attr(GMs, 'I')
```

```
## [1] 1 2 5 6 7 8 9
```

To clarify: the first row of `GMs` is an all-zeros row, occurring `N == 3` times with its first occurrence at row index `I == 1`. The third row occurs `N == 2` times and has its first occurrence at row index `I == 5`. Etc.
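We can verify these attributes on the small example (a quick sketch; it goes through the dense form, so it is only sensible for small matrices):

```
# every original row must be accounted for by the group weights
stopifnot(sum(attr(GMs, 'N')) == nrow(Ms))
# each group's weight must equal how often its row pattern occurs in Ms
sig <- function(m) apply(as.matrix(m), 1, paste, collapse = ',')
stopifnot(all(table(sig(Ms))[sig(GMs)] == attr(GMs, 'N')))
```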

Now let’s try this on a matrix whose dense form is unlikely to fit into memory.

```
M <- Matrix::rsparsematrix(1e6, 275e3, density = 1/1e5, rand.x = function(n) 1L)
```

In dense form, this matrix is … big. Assuming integers occupy 4 bytes:

```
sprintf('%.2f GB', prod(dim(M))*4/1024^3)
```

```
## [1] "1024.45 GB"
```

No way that this 1 TB dense matrix will fit into memory. But in its current sparse form it is relatively small:

```
sprintf('%.2f GB', object.size(M)/1024^3)
```

```
## [1] "0.03 GB"
```

About 30 MB.
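That figure is easy to account for with a back-of-envelope calculation: at this density we expect about 2.75 million non-zeros, each costing a 4-byte integer in `@i` and an 8-byte double in `@x`, plus the integer column pointers in `@p`:

```
nnz <- 1e6 * 275e3 * 1e-5  # expected non-zeros: 2.75 million
sprintf('%.2f GB', (nnz * (4 + 8) + (275e3 + 1) * 4) / 1024^3)
```

```
## [1] "0.03 GB"
```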

```
print(microbenchmark::microbenchmark(GM <- group.dgCMatrix(M), times = 1))
```

```
## Unit: seconds
##                      expr      min       lq     mean   median       uq      max neval
##  GM <- group.dgCMatrix(M) 22.65951 22.65951 22.65951 22.65951 22.65951 22.65951     1
```

Grouping took about 23 seconds.

```
dim(GM)
```

```
## [1] 890366 275000
```

So we went from 1M rows down to about 890k. Here is how the weights are distributed:

```
table(attr(GM, 'N'))
```

```
##
##      1      2      3      4      5      6      7  63717
## 853256  29626   6326    998    150      8      1      1
```

The 63,717 all-zero rows are all grouped into a single group. The rows where ones occur split into 853,256 unique rows, plus some duplicate rows that occur between 2 and 7 times.
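As a final sanity check, the grouped matrix together with its weights should carry the same information as the original; for instance, the weighted column sums must match. A sketch using a diagonal weight matrix to scale each grouped row by its weight:

```
W <- Matrix::Diagonal(x = attr(GM, 'N'))  # row weights on the diagonal
stopifnot(all(Matrix::colSums(M) == Matrix::colSums(W %*% GM)))
```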