Davor Josipovic Just another WordPress blog – rather tryout


Grouping large sparse matrix

Filed under: R — Tags: — Davor @ 23:21

In one of my recent projects I had to group data from a large sparse matrix. This was mainly to speed up the model fitting process.

The story in short: I couldn’t find a decent solution since most at some point converted the sparse matrix into a dense form, to group over. This is OK for a small matrix, but not for those that explode into gigabytes in their dense form…

So I wrote a function to exploit the sparse triplet structure to efficiently group a sparse matrix. Here it is with explanation.


Transform any (binary) function to an aggregate in pure SQL

Filed under: Oracle — Tags: , — Davor @ 10:44

Few days back I needed an aggregate counterpart of a BITOR function. Unfortunately the Oracle database doesn’t have one.

So how do I make one? There are three options here:

  1. Write your own aggregate function. Here is one way to do it.
  2. Rewrite it in function of an other aggregate function. For example PRODUCT() can be rewritten as EXP(SUM(LN())) (cf. infra). But there is no obvious way for writing an aggregate BITOR() in function of existing Oracle functions.
  3. Simulate aggregation in pure SQL. If you for example have 4 elements {a, b, c, d}, you know that their BITOR aggregate is: BITOR(a,BITOR(b,BITOR(c,d))). Since SQL:1999 we have recursion in SQL. So why write an aggregate function if you can compute it within SQL?

This blogpost is all about the third option. I wanted to see (I) whether it is possible and (II) whether I can generalize it in a concise manner for all binary functions. I also couldn’t find any information about this on the Internet, so that is why I am writing this. It turns out (I) is true, and (II) also, albeit with complications. What I didn’t expect is very bad performance. So the solution below is only for educational use.

Aggregation with + operator

Let’s start with summation. Summation (+) is a binary operator/function that is easy to understand and easy to verify with the aggregate SUM() function.

Suppose this is your table:

  gr NUMBER(10), -- group
  nr NUMBER(10)  -- number

Insert some values in it


What we have is this:

        GR         NR
---------- ----------
         1         10 
         1         20 
         1         30 
         1         40 
         1         50 
         2         20 
         2         30 

The way to sum-aggregate these numbers is by adding a new column which will compute the sum of the current number and the previous sum:

        GR         NR RECURSIVE_SUM
---------- ---------- -------------
         1         10            10 
         1         20            30 
         1         30            60 
         1         40           100 
         1         50           150 
         2         20            20 
         2         30            50 

Finally I just have to take the last step in each group: 150 for gr = 1 and 50 for gr = 2. The recursive query that does all the above and uses only the (+) operator is the following:

WITH sel AS (
  SELECT t4.nr, t4.gr, ROW_NUMBER() OVER (partition BY gr ORDER BY nr) AS rn
  FROM t4
), rec(gr, r_out, rn, nr) AS (
    SELECT gr, nr, rn, nr FROM sel WHERE rn = 1
    SELECT sel.gr, rec.r_out + sel.nr, rec.rn + 1, sel.nr FROM rec, sel WHERE sel.rn = rec.rn + 1 AND sel.gr = rec.gr
), gr AS (
SELECT gr.gr, r_out SUM FROM gr INNER JOIN rec ON (gr.max = rec.rn AND gr.gr = rec.gr) ORDER BY gr.gr;
        GR        SUM
---------- ----------
         1        150 
         2         50 

This is equivalent to:

        GR        SUM
---------- ----------
         1        150 
         2         50 

Aggregation with * operator

Now let’s try multiplication. Just like summation, a product (*) is a binary operator/function. Now, to simulate the PRODUCT() function with the binary * function, you only have to change the r_out field in the recursive query:

WITH sel AS (
  SELECT t4.nr, t4.gr, ROW_NUMBER() OVER (partition BY gr ORDER BY nr) AS rn
  FROM t4
), rec(gr, r_out, rn, nr) AS (
    SELECT gr, nr, rn, nr FROM sel WHERE rn = 1
    SELECT sel.gr, rec.r_out * sel.nr, rec.rn + 1, sel.nr FROM rec, sel WHERE sel.rn = rec.rn + 1 AND sel.gr = rec.gr
), gr AS (
SELECT gr.gr, r_out product FROM gr INNER JOIN rec ON (gr.max = rec.rn AND gr.gr = rec.gr) ORDER BY gr.gr;
        GR    PRODUCT
---------- ----------
         1   12000000 
         2        600 

We can verify the above outcome with the aggregate PRODUCT() function rewritten as EXP(SUM(LN())). With a little algebra you can figure out why the equation holds. Here is the statement:

SELECT gr, EXP(SUM(LN(nr))) AS product FROM t4 GROUP BY gr ORDER BY gr;
        GR    PRODUCT
---------- ----------
         1   12000000 
         2        600 

Aggregation with BITOR function

By now it should be clear how to adjust the recursive query to simulate aggregation for any binary function: We only have to adjust the r_out field. Now let’s try to simulate the aggregated BITOR function. Because there is no BITOR we can rewrite it as BITOR(x,y) = (x+y)-BITAND(x,y);

WITH sel AS (
  SELECT t4.nr, t4.gr, ROW_NUMBER() OVER (partition BY gr ORDER BY nr) AS rn
  FROM t4
), rec(gr, r_out, rn, nr) AS (
    SELECT gr, nr, rn, nr FROM sel WHERE rn = 1
    SELECT sel.gr, (rec.r_out + sel.nr) - BITAND(rec.r_out, sel.nr), rec.rn + 1, sel.nr FROM rec, sel WHERE sel.rn = rec.rn + 1 AND sel.gr = rec.gr
), gr AS (
SELECT gr.gr, r_out BITOR FROM gr INNER JOIN rec ON (gr.max = rec.rn AND gr.gr = rec.gr) ORDER BY gr.gr;
        GR      BITOR
---------- ----------
         1         62 
         2         30 

Note 1: all the above recursive code is very inefficient. Tables exceeding 100 values will have a large performance impact. The problem is the statement following the UNION ALL: it has to select the right value(s) for each recursive iteration step. When I find more time I’ll try to optimize it. In the meantime, here is a query to fill the table with some dummy values for testing:

SELECT round(dbms_random.value(1,5)), round(dbms_random.value(1,99)) FROM DUAL
CONNECT BY level <= 100;


Versioned hardlinked shadowed backup solution with PowerShell

Filed under: PowerShell,Programming — Tags: , — Davor @ 15:36

I don’t like the fact that there are so many backup programs, most of which are expensive closed source solutions. All have their own (mostly) undocumented and proprietary archive systems, that are hardly usable without the software that made them. All this… while Windows 7 offers enough technology under the hood to make an open source scripted solution possible, where the files are versioned, incremental and hardlinked (thus saving space), while the versioned backup contents can be viewed in Explorer. No extra software is needed. The current version of the script you can find on GitHub.

Here is an overview:

Root of the backup folder.

Root of the backup folder. The contents are hardlinked, which means that 10 identical files backed up at different times take only the space of 1 file.

The includelist file for the backup might look like this:


And this is an example of how to run the backup script:

.\ps-backup.ps1 -Backup -BackupRoot "W:\Backups\Archive" -SourcePath "W:\Scripts\ps-backup\include_list.txt" -ExclusionList "W:\Scripts\ps-backup\exclude_list.txt"


Ignoring device I/O errors during copy with PowerShell

Filed under: PowerShell,Programming — Tags: , , — Davor @ 19:14

I had a failing hard drive with lots of RAR files with recovery records. Problem was that I couldn’t get the files off the hard drive with tools like robocopy and xcopy because the drive had many bad sectors resulting in CRC and I/O device errors. I also couldn’t repair the drive with chkdsk /R because the bad sectors kept reappearing.

I also tried Unstoppable Copier, but on best settings it seems to use 0,1% of the file size per read operation, which results in large data corruption when that read operation fails even if there are only a few bad sectors.

So I wrote a little PowerShell script which will copy the source files and replace the unreadable data with zero’s, as accurately as the partition’s cluster size allows. It will also make a XML file per file where it will store bad sector data positions for further reference. I have placed the script code on GitHub.

Note on reading bad sector data

When a hard drives fails to read a sector due to a CRC error, it doesn’t give back any data. Instead it raises an error: I/O device error. With old hard drives it was possible to issue the READ LONG command to skip the error correction part, and simply give back the data. Some programs like Spinrite use this to recover data based on re-read statistics. (If on hundred consecutive re-read operations for sector X your starting bits are “1011…”, then it is reasonably safe to assume those bits are not corrupted.) But the READ LONG command doesn’t work with modern hard drives. Modern hard drives either succeed or fail a read operation. There is no third option. What I have noticed during my own recovery is that even though the hard drive fails reading a sector the first time, there is some chance it will succeed on a retry. I have seen it succeed even after 100 fails! That’s why the -MaxRetries is so important in this script. But setting it too high will greatly slow down the recovery process in case a sector is truly unreadable.

Notes on script usage

  1. First you need to download the script. Here is a link to the file. Save it as Force-Copy.ps1 to, for example, your desktop. (Make sure the extension is correct: ps1!) (There is also a more simple script available at stackoverflow.)
  2. This is a PowerShell script, so it has to run under the PowerShell environment. Run PowerShell and navigate it to your desktop directory. For example, run cmd, type cd desktop and then type powershell -ExecutionPolicy bypass. (Execution policy needs to be changed because PowerShell will not allow scripts to be run by default. With bypass, the script is not blocked and there are no warnings or prompts. This is because the script is unsigned. Read more about PowerShell execution policy and signing here.)
  3. Now, to see some examples of use, type Get-Help .\Force-Copy.ps1 -Examples. The most simple command to copy a file from one to an other location would be:
    .\Force-Copy.ps1 -SourceFilePath "C:\bad_file.txt" -DestinationFilePath "W:\recovered_file.txt"


Circumvent the 260 MAX_LENGTH path in PowerShell with junctions

Filed under: PowerShell,Programming — Tags: , — Davor @ 18:21

This is a function I wrote to circumvent the 260 char path MAX_LENGTH in Win32 API on which all of the cmdlets in PowerShell are based.

Usage: $short_path = Shorten-Path “some-long-path” “temporary-directory-path”

What the function will do is simply make a junction point in the temporary directory, and return a shorter path that points to the same place as the original long path, but through the junction point. This is only if the path is too long: if not, then it will return the original path.

# Returns a shortened path made with junctions to circumvent 260 path length in win32 API and so PowerShell
function Shorten-Path {
				   HelpMessage="Path to shorten.")]
				   HelpMessage="Path to existing temp directory.")]
		[string][ValidateScript({Test-Path -LiteralPath $_ -PathType Container})]$TempPath    
	begin {
		# Requirements check
		if (-not $script:junction) {$script:junction = @{};}
		$max_length = 248; # this is directory max length; for files it is 260.
	process {
		# First check whether the path must be shortened.
		# Write-Warning "$($path.length): $path"
		if ($Path.length -lt $max_length) {
			Write-Debug "Path length: $($Path.length) chars."; 
			return $Path;
		# Check if there is allready a suitable symlink	
		$path_sub = $junction.keys | foreach { if ($Path -Like "$_*") {$_} } | Sort-Object -Descending -Property length | Select-Object -First 1;
		if ($path_sub) {
			$path_proposed = $Path -Replace [Regex]::Escape($path_sub), $junction[$path_sub];
			if ($path_proposed.length -lt $max_length) {
				# assert { Test-Path $junction[$path_sub] } "Assertion failed in junction path check $($junction[$path_sub]) for path $path_sub.";
				return $path_proposed;
		# No suitable symlink so make new one and update junction
		$path_symlink_length = ($TempPath + '\' + "xxxxxxxx").length;
		$path_sub = ""; # Because it is allready used in the upper half, and if it is not empty, we get nice errors...
		$path_relative = $Path;
		# Explanation: the whole directory ($Path) is taken, and with each iteration, a directory node is taken from
		# $path_relative and put in $path_sub. This is done until there is nothing left in $path_relative.
		while ($path_relative -Match '([\\]{0,2}[^\\]{1,})(\\.{1,})') {
			$path_sub += $matches[1];
			$path_relative = $matches[2];
			if ( ($path_symlink_length + $path_relative.length) -lt $max_length ) {
				$tmp_junction_name = $TempPath + '\' + [Convert]::ToString($path_sub.gethashcode(), 16);
				# $path_sub might be very large. We can not link to a too long path. So we also need to shorten it (i.e. recurse).
				$mklink_output = cmd /c mklink /D """$tmp_junction_name""" """$(Shorten-Path $path_sub $TempPath)""" 2>&1;
				$junction[$path_sub] = $tmp_junction_name;
				# assert { $LASTEXITCODE -eq 0 } "Making link $($junction[$path_sub]) for long path $path_sub failed with ERROR: $mklink_output.";
				return $junction[$path_sub] + $path_relative;
		# Path can not be shortened...
		# assert $False "Path $path_relative could not be shortened. Check code!"
	end {}

In $junction variable, all the junction points are stored, so you can remove them afterwards with:

foreach ($link in $junction.values) {
	$rmdir_error = cmd /c rmdir /q """$link""" 2>&1;
	if ( $LASTEXITCODE -ne 0 ) { Write-Warning "Removing link $link failed with ERROR: $rmdir_error." };

I am also using the assert function which you can find here.


Search for methods and properties in WMI with PowerShell

Filed under: PowerShell,Programming — Tags: , — Davor @ 10:14

Here is some simple recursive code to search for methods/properties in the WMI.

Much can be improved, but for now, it is a start.

## v0.1
## - first version
filter seek_properties {
	$_ | Get-Member | select-object name | property $_;
filter property ($obj) {			
			if ((++$global:i % 100000) -eq 0) { echo $i;}
			if ($obj.($_.name) -match 'YOUR SEARCH STRING') {
				echo "Object $($obj.__path) contains in property $($_.name): $($obj.$($_.name))";
				echo $obj;
				echo '-----------------------------------------------';
function start_search ($namespace) {
	Get-WmiObject -Namespace $namespace -list * | foreach {
		if ($_.__CLASS -eq "__NAMESPACE") {
			Get-WmiObject -Namespace $namespace $_.__CLASS | foreach { start_search "$($_.__NAMESPACE)\$($_.Name)"; } 		
		} else {
			echo "$namespace`: $($_.__CLASS)";
			Get-WmiObject -Namespace $namespace $_.__CLASS | seek_properties; 
$global:i = 0;
start_search "ROOT";

This script can take quite some time to finish…


Transcode mkv – batch file for windows

Filed under: batch,Programming — Tags: , , , , , — Davor @ 10:47

What it does: transcode your mkv to x264 with a specified file size. You drag & drop your mkv-file on the batch file, chose some options like target size, and it starts the transcoding process. It re-encodes the video and leaves all other streams intact (unless you exclude them). It uses handbreakcli and mkvtoolnix.

The batch file can be found on GitHub. I based it mainly on this fine article (for linux).

It’s still a work in progress, but currently, it does the job.

What you need is:

Make sure you set the correct paths in the transcode_settings.ini file. The rest should work.

Here is how it works. After you drag a file on transcode.bat, a command line window opens:

There is only one destination path set, so I press “0”. I can also fill in a new destination path. This new destination path will then be saved in transcode_settings.ini and will be available next time under option [1]. Now I have the option to remove some audio and subtitle tracks.

Since I don’t want to remove any audio tracks I press “Enter”. Then I press “1” to remove the English subtitle stream.

Now I have to set some general transcoding options. I want a very good quality to bitrate ratio, so I select the very slow option by pressing “V”. I further select “A” for auto cropping (which is in this case equivalent to no cropping), and I also type “N” to signify that the movie is not black & white. Since I want a fixed size encode, I type “Y”, “N”, “N” to get an impression of what the optimal size would be.

The batch will now transcode a few samples based on the selected settings, and suggest a few optimal transcoding techniques:

For DVD resolution, if I transcode the audio stream to ac3 @448kbit/s, then the 700MiB size is 10% within the optimal range of predefined RF 20. Now I can select the handbrake transcode setting. I type “S” for a fixed size 2-pass transcode, “D” for AC3 448kbit/s audio transcode, “D” for DVD resolution, and twice “N” since I don’t want any noise reduction or decomb filter. Since I selected “S” for Size in the first option, I am now aked what my target size is. I type “1” for 700MiB and the transcoding process starts.

Since this is an ordinary batch file, all the options are easily adjustable to one’s own needs.


Select screen for XBMC without 3th party tools

Filed under: batch — Tags: , — Davor @ 20:43

This is usually done by changing the default display with some tool (DisplaySwitch.exe, UltraMon,…) Here we simply use XBMC’s advanced settings and a simple batch file which will allow to select the startup options (in this case the screen), copy the corresponding advancedsettings.xml to the \userdata\ folder and start XBMC on the correct screen with custom properties.
For this we use XBMC in portable mode (with command line option –p), but it could also work without. Make this folder structure in your XBMC directory:


In adv_settings folder we put 2 files from which we will chose during startup:




These values are taken from guisettings.xml in the \userdata\ folder. So it’s best you set up XBMC and take the settings you need from guisettings.xml and put them into advancedsettings.xml. Apparently, in my case 10128000720050.00000 corresponds to full screen on the TV.
Finally this simple batch file will let you choose between the two:

@set batch_path=%~dp0

:: Choose Monitor-------------------------------------
@choice /C 12 /M "Start on TV[1], Monitor[2]?" /N /T 3 /D 1
@copy "%batch_path%adv_settings\advancedsettings_tv.xml" "%batch_path%..\userdata\advancedsettings.xml" /Y
@GOTO exit
@copy "%batch_path%adv_settings\advancedsettings_monitor.xml" "%batch_path%..\userdata\advancedsettings.xml" /Y
@GOTO exit
@echo "Wrong batch execution... check batch"
@start /D "%batch_path%..\..\" XBMC.exe -p

Now all you need to do is run the batch file, type 1 or 2 and XBMC will start on the correct screen. It is also obvious that these advanced settings allow much more customization than is shown here.

An other interesting way would be to use the API called SetwindowsPos… and even more interesting would be to expose the XMBC API to the command line prompt as suggested here.


Google Analytics on-the-fly insertion with Apache

Filed under: php,Programming — Tags: , — Davor @ 22:46

I recently needed to insert the Google Analytics script on the fly on each html page that an Apache server was sending. Kirill Minkovich has made a good overview of the standard ways for doing so. Particularly the second and third solutions are interesting because they use the ext_filter_module which is meant for such things.

External filters are slow. Since Apache 2.2.7 there is a internal SUBSTITUTE filter.

AddOutputFilterByType SUBSTITUTE text/html
Substitute "s#<head(.*)>#<head$1>\
    <!-- Global site tag (gtag.js) - Google Analytics -->\
    <script async src=\"https://www.googletagmanager.com/gtag/js?id=UA-19913980-1\"></script>\
       window.dataLayer = window.dataLayer || [];\
       function gtag(){dataLayer.push(arguments);}\
       gtag('js', new Date());\
       gtag('config', 'UA-XXXXXXX-X');\

That’s it! Make sure that DEFLATE filter is not run before, otherwise the SUBSTITUTE filter will get a gzipped stream and thus will not be able to insert the code.

Note: Aaron Gloege also has a very interesting php-only solution for this problem, but I, IMHO, thought it a bit farfetched. Nonetheless, it is probably (much) faster than the one used below.

Here is a simpler version for the filter.google-analytics.php file:

$ga_script =
"<script type=\"text/javascript\">
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-XXXXXXXX-X']);
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
$handle = fopen ("php://stdin","r");
// Will read the whole stream. 
// Use $sHtmlFile = fgets($handle); for testing purposes from command line.
$sHtmlFile = stream_get_contents($handle);
// We allow only one replacement before one of the 3 tags.
// From testing it appears that php PCRE engine replaces the
// first match from the stream. So the ga_script will probably
// be inserted before </head> as Google recommends.
$sHtmlFile = preg_replace(
if ($replacements == 0)
	$sHtmlFile = $sHtmlFile . $ga_script;
echo $sHtmlFile;

And of course, add the following to your apache.conf file:

LoadModule ext_filter_module modules/mod_ext_filter.so
ExtFilterDefine insert-google-analytics cmd="/bin/php /cgi-bin/filter.google-analytics.php" mode=output
AddOutputFilter insert-google-analytics htm html


System-wide events and Qt

Filed under: C++,Programming,Qt — Tags: , , — Davor @ 23:02

Few days back I wanted to create a macro recorder for the desktop. I though I could do this from within the Qt framework which has a neat high-level platform-independent event system that I thought I could incorporate with widgets to make a nice user interface. Well, that’s what I though at least… Truth is, Qt doesn’t offer any means for catching system-wide events. Eventually, I managed to put together a little command line application that can record and replay/loop the mouse and keyboard input messages based on the Windows API. What you will read next is my explanation why catching system-wide events is impossible from within the Qt framework, what the Windows API offers for this purpose, and a proposal on how system-wide events could be incorporated in the Qt event system.


Windows and Qt applications are event driven, which means that they wait for the system to pass events to them upon which they can act. Simply put, those events are passed as “messages” to application windows through a callback function named the “window procedure”.

For example, when a user moves a mouse, the mouse device driver places the recorded movement in the system’s message queue. The system then determines the destination window and “posts” this message to the correct “thread message queue” (i.e. the message queue of the thread which created the destination window). Our Qt application has only access to these messages from its thread message queue. These messages are called “spontaneous events” in the Qt documentation. (Note: there exist also “sent” messages which bypass the message queue, but which we do not consider here.)

Now as one can see, the system is very selective on which mouse and keyboard events our Qt application will get. For example, our window will not get any keyboard messages if it is not in focus, nor will it get any mouse events if the mouse is not inside the window borders. So how can we access those events from the Qt framework? To my knowledge Qt doesn’t offer any ready made functions for this, so we will have to find some other way: directly accessing the Windows API.


The Windows API offers a mechanism called a “hook” which we can hook somewhere in the message-handling mechanism and which will intercept and reroute the messages to some “hook procedure” we define. We define a hook with the API SetWindowsHookEx procedure. This procedure allows us to specify the “hook type” which defines the kind of messages to be intercepted (like WH_MOUSE_LL, WH_KEYBOARD_LL, WH_SHELL), the “hook scope” which can be thread or global (i.e. system), and the hook procedure with the following predefined signature:

 LRESULT CALLBACK HookProc(int nCode, WPARAM wParam, LPARAM lParam);

So to get you started here is a very basic application which uses the above described mechanism an which can be used as a starting point for building applications which are able to monitor system events.

#include <QtGui/QApplication>
#include <QDebug>
#include <windows.h>
LRESULT CALLBACK HookProc(int nCode, WPARAM wParam, LPARAM lParam)
    // process event...
    qDebug() << nCode << wParam << lParam;
    return CallNextHookEx(NULL, nCode, wParam, lParam);
int main(int argc, char *argv[])
    QApplication a(argc, argv);
    if (SetWindowsHookEx(WH_KEYBOARD_LL, HookProc, qWinAppInst(), NULL) == 0)
        qDebug() << "Hook failed for application instance" << qWinAppInst() << "with error:" << GetLastError();
    return a.exec();

For more information, the following links might be interesting:

I’ve put together a little command line utility that is able to record and replay/loop the recording based on the above explanation. Sourcecode is available here. To record: “winrec -record /f “recording.txt””. To play: “winrec -play /f “recording.txt” /repeat 0″. To stop any of these, just press Crtl+Esc.

I think integration with an Qt would probably be straightforward when used with QAbstractEventDispatcher, although I didn’t have time to test it.

Older Posts »

Powered by WordPress