After finding the mouse’s nose position (see post on Tracking mouse position in the gap-crossing task), I wanted to get a fast, robust estimate of the basic whisking pattern, together with approximate times when whiskers could have been in contact with a target. Specifically, I wanted:
- Approximate times and numbers of whiskers that appear to be <0.5 mm from the target platform edge.
- Approximate whisker angle for at least a majority of whiskers in most frames. This should be good enough to compute a rough whisking pattern (frequency, phase, amplitude).
The problems with the dataset are:
- Size: we have ~100 videos, containing 20-100k frames each that need tracking (mouse is in the right position in these frames).
- Image heterogeneity: There are 6 mice with different whisker thicknesses & colors, somewhat different light/background noise conditions (dust on the back light etc.).
- Obscured features: The target and home platforms are darker than the back light, which makes whiskers intersecting them appear different, and there is an optical fiber intersecting the whiskers in many frames.
- Full whiskers: The mice have (unilateral) untrimmed whiskers, which means there are tons of overlapping lines and generally not enough information to even attempt to maintain any whisker identity over pixels or frames.
This is fairly different from ‘real’ whisker tracking, where the goal is usually to get precise contact times (for electrophysiology etc.) and contact parameters such as whisker bending over time to estimate torques. For those cases, you’d usually use ~1kHz imaging and clipped whiskers, where only one row (usually the C row) or even a single whisker is left, and possibly head-fixed preps (O’Connor et al. 2010). The convnet step detailed here should still be useful in these cases, but you’d use a more sophisticated method to track parametric whisker shapes. Here’s our older paper from 2008 on this; better methods have been published since, such as Clack et al. 2012 (well documented code available), the Knutsen et al. tracker (initial paper in 2005, on github), or the BIOTACT Whisker Tracking Tool (software & docs, paper).
Locating the head
To start out, we first determine whether there’s a mouse in each frame (to avoid tracking empty images), the position of the mouse’s head, and the position of the target platform. See the earlier post on this for details.
Pixel-wise whisker labeling with a convolutional neural network
Next, we want to label all pixels that represent whiskers, ideally independently of light conditions, background noise etc. If this labeling is sufficiently clean, a relatively simple method can be used later to get the location and orientation of individual whisker segments. Here, I’m using a very small convolutional neural network (tutorial) to identify whisker pixels. This code uses ConvNet code by Sergey Demyanov (github page).
We’ll need a training set of raw and binary label images in which all whiskers are manually annotated, using Photoshop or some other tool. It is crucial to get all whiskers in these images with very high precision, and to paint over or mark to ignore (and then exclude from the training set) all non-labelled whiskers so that the training can run on a clean training set. Also make sure that the training set includes enough negative examples, including pixels from all possible occluders such as optical cables, recording tethers etc. Here, I used just 4 images; a few more would probably have been better.
The network for this example is pretty simple:
- Input radius of 5px, so we’re feeding the network 11x11px tiles.
- First layer with 8 outputs, second layer with 4 outputs, softmax for output.
The input radius / size of the image tiles around the pixel that is to be identified should be as small as possible while getting the job done. Large radii mean more parameters to learn and slow down processing later. We need around 5 pixels to do a proper line/ridge detection, and maybe a few more in order to train the NN to avoid labeling whisker-like structures that are part of the target platform etc.
In order to avoid accidentally tracking pieces of fur that are too close to the head but locally look like whiskers, we’d need a fairly large input radius for the CNN so it could be trained to label every hair that is too close to the head as negative. Instead, because locating pixels that are part of the head is dead simple via smoothing and thresholding (the head is the only big, very dark object in the images), we can just accept that the CNN will give a few ‘false positives’ here, and run a very fast cleanup pass with a much simpler convolution with a larger kernel (20px diameter circle). This way the CNN can run on small, easy to train 11×11 tiles and we still avoid labeling fur.
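The cleanup pass could look roughly like the sketch below. This is not the code used for the post: the synthetic frame, the box filter size, the darkness threshold of 0.3 and all variable names are assumptions for illustration. Only the idea (smooth, threshold, widen the head mask with a 20px diameter disk, discard labels inside it) comes from the text.

```matlab
% Synthetic stand-ins (assumptions): a frame in [0,1] and a CNN label map.
frame = ones(120, 160);                        % bright back light
[X, Y] = meshgrid(1:160, 1:120);               % X = column, Y = row
frame((X-40).^2 + (Y-60).^2 < 25^2) = 0.05;    % dark blob standing in for the head
cnn_out = zeros(120, 160);
cnn_out(60, 50:140) = 1;                       % a 'whisker' leaving the head
cnn_out(55, 30:45)  = 1;                       % 'fur' falsely labeled on the head

% 1) Smooth with a box filter (normalized so image borders are not
%    darkened by zero padding) and threshold to find the head.
k = ones(9);
smoothed = conv2(frame, k, 'same') ./ conv2(ones(size(frame)), k, 'same');
head = smoothed < 0.3;                         % assumed darkness threshold

% 2) Widen the head mask with a 20px-diameter disk via convolution.
r = 10;
[dx, dy] = meshgrid(-r:r);
disk = double(dx.^2 + dy.^2 <= r^2);
near_head = conv2(double(head), disk, 'same') > 0.5;

% 3) Discard CNN labels on or near the head.
cleaned = cnn_out;
cleaned(near_head) = 0;
```

In this toy example the fur pixels and the whisker base inside the head region are removed, while the distal part of the whisker is kept.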
To make the training set I’m picking all positive examples, plus rotated copies, plus a large number of negative ones picked from random image locations. Further, to avoid over-training to the specific image brightness levels of the training set, I’m adding random offsets to each training sample. Because we’re just using a small number of training images, I’m not using a separate test set to track convergence for now.
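As a rough illustration of how such a training set could be assembled, here is a minimal sketch on a synthetic image. The 90 degree rotation steps, the 4:1 ratio of negatives to positive pixels, the offset range and all variable names are assumptions, not the values used for the post.

```matlab
inradius = 5;  tsz = 2*inradius + 1;       % 11x11 tiles, as in the text

% Synthetic stand-ins (assumptions) for one raw image and its binary labels.
raw = 0.8 * ones(60, 80);                  % bright background
lab = false(60, 80);
lab(30, 20:60) = true;                     % one annotated 'whisker'
raw(lab) = 0.2;

% Only use center pixels that leave room for a full tile.
valid = false(size(lab));
valid(inradius+1:end-inradius, inradius+1:end-inradius) = true;
[pr, pc] = find(lab & valid);              % all positive pixels
[nr, nc] = find(~lab & valid);             % candidate negative pixels
pick = randperm(numel(nr), 4*numel(pr));   % random negatives (assumed ratio)
nr = nr(pick);  nc = nc(pick);

npos = numel(pr);  nneg = numel(nr);
n = 4*npos + nneg;
tiles  = zeros(tsz, tsz, n);
labels = [ones(4*npos, 1); zeros(nneg, 1)];
x = 0;
for p = 1:npos                             % positives plus rotated copies
    t = raw(pr(p)-inradius:pr(p)+inradius, pc(p)-inradius:pc(p)+inradius);
    for q = 0:3
        x = x + 1;
        tiles(:,:,x) = rot90(t, q);        % rotation keeps the center pixel
    end
end
for p = 1:nneg                             % negatives from random locations
    x = x + 1;
    tiles(:,:,x) = raw(nr(p)-inradius:nr(p)+inradius, nc(p)-inradius:nc(p)+inradius);
end

% One random brightness offset per sample (assumed range of +-0.1).
offs  = 0.2 * (rand(1, 1, n) - 0.5);
tiles = tiles + repmat(offs, [tsz tsz 1]);
```

The label of each tile refers to its center pixel only, so negative tiles are allowed to contain whisker pixels off-center.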
Training is then run to convergence, which takes around 4 hrs on a 2 year old Core i7 system.
Now that the whisker labels look OK, I’m estimating approximate whisker angles with a Hough transform. Of course the labeled image would also make a good input for a proper vibrissa tracking tool, like the ones listed above, that can track a proper parametric whisker shape and even attempt to establish and maintain whisker identity over frames.
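For readers unfamiliar with the Hough transform, here is a minimal hand-rolled version on a synthetic label image. It is only a sketch of the general technique, not the code used for the post: the 1 degree angle resolution, the 1px distance bins and the variable names are assumptions, and it only reads out the single strongest line.

```matlab
% Synthetic label image (assumption): one straight 'whisker' at 45 degrees.
bw = false(80, 80);
for t = 0:40
    bw(20 + t, 10 + t) = true;            % row = 20+t, col = 10+t
end

[r, c] = find(bw);                        % coordinates of labeled pixels
thetas  = 0:179;                          % angle of the line normal, in degrees
rho_max = ceil(hypot(size(bw,1), size(bw,2)));
acc = zeros(2*rho_max + 1, numel(thetas));

% Each pixel votes for all (rho, theta) lines passing through it,
% with rho = col*cos(theta) + row*sin(theta).
for k = 1:numel(thetas)
    rho = round(c * cosd(thetas(k)) + r * sind(thetas(k)));
    acc(:, k) = accumarray(rho + rho_max + 1, 1, [2*rho_max + 1, 1]);
end

% The strongest peak gives the dominant line; its direction is
% perpendicular to the normal angle theta.
[~, imax]  = max(acc(:));
[~, kbest] = ind2sub(size(acc), imax);
line_angle = mod(thetas(kbest) + 90, 180);   % degrees, in (col,row) image coordinates
```

With many whiskers in a frame, one would read out several peaks of `acc` instead of only the maximum and summarize their angles.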
Running the tracking on large datasets
Now that the method works in principle, there are still a few small tricks needed to make it run at decent speeds. First off, given that the nose position is already known, we can restrict the NN to run only on a circular area of the image around it. Given that the direction the animal is headed is also known, and the whiskers are clipped on one side, we only need to track one side of the face and can cut the circle in half.
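A half-circle region of interest like this can be built in a few lines. The sketch below is an illustration under assumed values: the nose position, heading vector, radius and the `side` flag are hypothetical inputs that would come from the head-tracking step, and which image side matches the intact whiskers depends on the camera geometry.

```matlab
imsize  = [200 300];
nose    = [100 150];      % [row col] of the nose (hypothetical)
heading = [0 1];          % unit vector [drow dcol], here: heading toward +col
side    = 1;              % +1 or -1, selects which half of the circle to keep
radius  = 60;             % assumed radius in pixels

[C, R] = meshgrid(1:imsize(2), 1:imsize(1));
dr = R - nose(1);
dc = C - nose(2);
in_circle = dr.^2 + dc.^2 <= radius^2;

% Vector perpendicular to the heading; the sign of the projection
% onto it tells us on which side of the body axis a pixel lies.
perp    = [-heading(2), heading(1)];
in_half = side * (dr * perp(1) + dc * perp(2)) >= 0;

roi = in_circle & in_half;
idx = find(roi);          % linear indices of the pixels to classify
```

Only the pixels in `idx` then need to be passed to the network.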
This leaves one major avoidable time sink in this implementation: arranging the image data so it can be fed into the neural network. The implementation I use here expects an inputsize X inputsize X outputsize array, with one inputsize X inputsize tile per desired output. This is just a consequence of using a general purpose implementation of a convolutional NN. The simple solution of looping over output pixels and copying a section of the input image into an array takes ~2 seconds per frame in my dataset, way longer than the NN run, and gives me below 0.5 fps, which means I can only get through a few videos a day.
The solution in MATLAB is to just pre-compute a mapping of indices from each desired output pixel coordinate to the indices of the input pixels corresponding to the inputsize X inputsize tile for that pixel.
```matlab
% steps for tiling the image, cutting off an additional 10px on each side
isteps = inradius+10 : size(uim,1)-inradius-10;
jsteps = inradius+10 : size(uim,2)-inradius-10;

% uim_2_ii is the mapping from each output pixel (one row per pixel) to
% the list of ((inradius*2)+1) X ((inradius*2)+1) linear indices into the
% input image that make up the tile for that pixel.
% Once we have this mapping, we can just feed
%   input_image(uim_2_ii(desired output pixel index,:))
% into the CNN, which is faster than getting the
% -inradius:inradius X -inradius:inradius tile each time.
uim_2_ii = zeros(numel(uim), ((inradius*2)+1).^2);

x = 0; % running index of output pixels
for i = isteps
    for j = jsteps
        x = x+1;
        ii = sub2ind(size(uim), i+meshgrid(-inradius:inradius)', ...
                                j+meshgrid(-inradius:inradius)); % linear indices for that tile
        uim_2_ii(x,:) = ii(:); % shape from matrix into vector
    end;
end;
uim_2_ii = uim_2_ii(1:x,:); % now points to tile for each output/predicted pixel in uim
```
Once that 2d lookup table is done, feeding data to the CNN takes a negligible amount of time.
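To make the per-frame step concrete, here is a small self-contained sketch of how the lookup table can be used: a single indexing operation gathers all tiles at once. The image is random stand-in data, and the table is rebuilt here on a small image (same idea as the loop above) only so the example runs on its own.

```matlab
inradius = 5;  tsz = 2*inradius + 1;
uim = rand(40, 50);                        % stand-in for one video frame
isteps = inradius+1 : size(uim,1)-inradius;
jsteps = inradius+1 : size(uim,2)-inradius;

% Build the lookup table once per image size.
[off_r, off_c] = ndgrid(-inradius:inradius);   % row and column offsets of a tile
uim_2_ii = zeros(numel(isteps)*numel(jsteps), tsz^2);
x = 0;
for i = isteps
    for j = jsteps
        x = x + 1;
        ii = sub2ind(size(uim), i + off_r, j + off_c);
        uim_2_ii(x,:) = ii(:);
    end
end

% Per frame: indexing with the transposed table returns a tsz^2 X N
% matrix (one tile per column), which reshapes into tsz X tsz X N.
tiles = reshape(uim(uim_2_ii'), tsz, tsz, []);
```

Because the table depends only on the image size and the tile size, it can be computed once and reused for every frame of a video.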
Now, we’re limited mostly by the file access and the CNN and can track at ~5-6fps, which is good enough to get through a decent sized dataset in a few days.
Now to get the approximate whisking pattern, a simple median or mean of the angles coming from the Hough transform does a decent job, and simply averaging (and maybe thresholding) the CNN output at the platform edge gives a decent measure of whether vibrissae overlapped the target in any frame. This is of course no direct indicator of whether there was contact between the two, but for many analyses it is a sufficient proxy, and at the very least it gives a clear indication of whisking cycles where there was no contact.
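Both summary measures are simple to compute. The sketch below uses made-up per-frame angles and label maps; the position of the strip along the platform edge, the threshold and all variable names are assumptions for illustration.

```matlab
% Hypothetical Hough angles (degrees) per frame; frame 3 has no detections.
angles  = {[40 42 45], [50 55 52 90], [], [60 61], [48 47 49]};
nframes = numel(angles);

% Whisking pattern: median angle per frame, NaN where nothing was found.
whisk = nan(1, nframes);
for f = 1:nframes
    if ~isempty(angles{f})
        whisk(f) = median(angles{f});
    end
end

% Hypothetical CNN label maps in [0,1], one per frame.
cnn_out = zeros(50, 60, nframes);
cnn_out(20:25, 40, 2) = 0.9;               % whisker pixels at the edge in frame 2

% Overlap proxy: mean CNN output in a thin strip along the platform edge.
edge_rows = 10:40;                         % assumed strip location
edge_cols = 39:41;
strip    = cnn_out(edge_rows, edge_cols, :);
overlap  = squeeze(mean(mean(strip, 1), 2))';
touching = overlap > 0.01;                 % assumed threshold
```

The median is less sensitive than the mean to single outlier angles, such as the 90 degree value in the second frame.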