Fast lookup in sorted array

The matlab code for this post is available at https://github.com/jvoigts/find_halfspace_matlab

A lot of the first-order data analysis I’m running amounts to taking lots of big time series, aligning them to triggers (time of stimulus onset, response times etc) and running statistics on the resulting aligned samples.

Because I store triggers as time stamps in seconds and not just as sample numbers (so that re-sampling the data does not become an issue, and for ease of use when defining time windows etc.), I often need to turn a time into an index/sample number.

Finding the time stamp closest to a given time in a huge vector of time stamps (in my case, one time stamp per sample) requires considerable cpu time – here’s a method that practically eliminates this overhead.

Find method

ts=linspace(0,20,500000);
tic;
for i=1:1000
<%%KEEPWHITESPACE%%>   f=min(find(ts&gt;10));
end;
toc 

This method is simple, but find scales with array size as o(n), and you’ll be doing one of these for each trigger (the min() doesn’t add significant overhead). Even the above toy example runs in ~1.3sec on my machine, that’s too slow for running big analyses.

Halfspace search (slow, recursive matlab)

Timestamps are if not linear, so at least sorted, so instead of the linear search with find, we could just do a half space search: Divide the array in half, and check which half the desired element will be in, divide that half again etc until the element is found. This scales with o(log n). In contrast to a straight binary search however, we want to stop not when we find the element that equals the time we’re looking for, but the closest smaller one, because in all likelihood there is no timestamp with that exact value. A quick matlab implementation could look like this:

(UPDATE: Brian Lau pointed out that this implementation is slower than it needs to be because it relies on recursive function calls, which are slow in matlab – scroll down for a faster implementation.)

 function [data,x,lower,upper]= find_halfspace_rec(data,x,lower,upper);
mp=floor((lower+upper)/2);
if mp~=lower
<%%KEEPWHITESPACE%%>    if data(mp)&lt;x
<%%KEEPWHITESPACE%%>        [data,x,lower,upper]= find_halfspace_rec(data,x,mp,upper);
<%%KEEPWHITESPACE%%>    else
<%%KEEPWHITESPACE%%>        [data,x,lower,upper]= find_halfspace_rec(data,x,lower,mp);
<%%KEEPWHITESPACE%%>    end;
end;

With a nice wrapper:

 function l = find_halfspace(data,x);
% returns the number of the last sample in the ASCENDING array data that is
% &lt; x using a simple half space search

upper = numel(data); lower = 1;
[data,x,l,u] = find_halfspace_rec(data,x,lower,upper);

now we can rewrite the example like this:

 ts=linspace(0,20,500000);
tic;
for i=1:1000
<%%KEEPWHITESPACE%%>   f=find_halfspace(ts,10);
end;
toc

This now runs at ~0.17s, so ~13% of the time for the find based method.

Faster halfspace search in Matlab

This implementation does the same half-space search as the above one, but avoids calling a new function for each iteration, making it much faster. This is the version you should generally use.

 function middle = find_halfspace(x,target)
<%%KEEPWHITESPACE%%>    % returns the number of the last sample
<%%KEEPWHITESPACE%%>    % in the ASCENDING array data that is
<%%KEEPWHITESPACE%%>    % &lt; x using a simple half space search
<%%KEEPWHITESPACE%%>    first = 1;
<%%KEEPWHITESPACE%%>    last = length(x);
<%%KEEPWHITESPACE%%>    while first &lt;= last
<%%KEEPWHITESPACE%%>        middle = floor((first + last)/2);
<%%KEEPWHITESPACE%%>        if target &lt; x(middle)
<%%KEEPWHITESPACE%%>            last = middle - 1;
<%%KEEPWHITESPACE%%>        elseif target &gt; x(middle)
<%%KEEPWHITESPACE%%>            first = middle + 1;
<%%KEEPWHITESPACE%%>        else
<%%KEEPWHITESPACE%%>            break
<%%KEEPWHITESPACE%%>        end
<%%KEEPWHITESPACE%%>    end
end

This code runs at ~0.004s, so we’re down to ~0.35% of the find method, and the code is now so fast that there’s no reason to use sample number calculations instead of this time based lookup.

Halfspace search (outdated – the iterative matlab implementation is about as fast) (c/mex)

Now, to really make this fast, we can write the method in c and compile a mex file.
This removes the matlab overhead of looping and should cut times down another notch.
Compile this with mex find_halfspace_mex.c 

 #include "mex.h"
/*
<%%KEEPWHITESPACE%%> * find_halfspace_mex.c
<%%KEEPWHITESPACE%%> * same as find_halfspace.m
<%%KEEPWHITESPACE%%> */

int binary_search(double a[], int low, int high, double target[]) {

<%%KEEPWHITESPACE%%>    int result=-1;
<%%KEEPWHITESPACE%%>    while (low = a[middle])
<%%KEEPWHITESPACE%%>            low = middle + 1;
<%%KEEPWHITESPACE%%>        else
<%%KEEPWHITESPACE%%>            return middle;
<%%KEEPWHITESPACE%%>        result=middle;
<%%KEEPWHITESPACE%%>    }
<%%KEEPWHITESPACE%%>    return result+1;
}

void find_halfspace_mex(double y[], double x[],double f[],int low, int high)
{
<%%KEEPWHITESPACE%%>        y[0]=binary_search(x, low, high, f);

}

void mexFunction( int nlhs, mxArray *plhs[],
<%%KEEPWHITESPACE%%>        int nrhs, const mxArray *prhs[] )
{
<%%KEEPWHITESPACE%%>    double *x,*y, *f;
<%%KEEPWHITESPACE%%>    size_t mrows,ncols,maxsize;
<%%KEEPWHITESPACE%%>    /* Check for proper number of arguments. */
<%%KEEPWHITESPACE%%>    if(nrhs!=2) {
<%%KEEPWHITESPACE%%>        mexErrMsgIdAndTxt( "MATLAB:find_halfspace_mex:invalidNumInputs", "two inputs required.");
<%%KEEPWHITESPACE%%>    } else if(nlhs&amp;gt;1) {
<%%KEEPWHITESPACE%%>        mexErrMsgIdAndTxt( "MATLAB:timestwo:maxlhs", "Too many output arguments.");
<%%KEEPWHITESPACE%%>    }

<%%KEEPWHITESPACE%%>    /* The input must be a noncomplex scalar double.*/
<%%KEEPWHITESPACE%%>    mrows = mxGetM(prhs[0]);
<%%KEEPWHITESPACE%%>    ncols = mxGetN(prhs[0]);

<%%KEEPWHITESPACE%%>    if (mrows&amp;gt;ncols)
<%%KEEPWHITESPACE%%>        maxsize = mrows;
<%%KEEPWHITESPACE%%>    else
<%%KEEPWHITESPACE%%>        maxsize = ncols;

<%%KEEPWHITESPACE%%>    /* Create matrix for the return argument. */
<%%KEEPWHITESPACE%%>    plhs[0] = mxCreateDoubleMatrix(1, 1, mxREAL);

<%%KEEPWHITESPACE%%>    /* Assign pointers to each input and output. */
<%%KEEPWHITESPACE%%>    x = mxGetPr(prhs[0]);
<%%KEEPWHITESPACE%%>    f = mxGetPr(prhs[1]);

<%%KEEPWHITESPACE%%>    y = mxGetPr(plhs[0]);

<%%KEEPWHITESPACE%%>    if ((mrows==0)||(ncols==0))
<%%KEEPWHITESPACE%%>        y[0]=-1;
<%%KEEPWHITESPACE%%>    else
<%%KEEPWHITESPACE%%>        find_halfspace_mex(y,x,f,0,maxsize);
}

This makes a find_halfspace_mex.mexa64 (or .mexw or .dll depending on your system) and we can replace the matlab method in the example like this:

 ts=linspace(0,20,500000);
tic;
for i=1:1000
<%%KEEPWHITESPACE%%>   f=find_halfspace_mex(ts,10);
end;
toc

This code too runs at <0.05s, in some cases this might be somewhat faster than the matlab based implementations, or maybe there are other reasons why going to c might be useful.

This entry was posted in Matlab. Bookmark the permalink.

Comments are closed.