Deep Learning for Sensor Fusion and Sequence Classification

Deep Learning (DL) has achieved remarkable performance on ‘cognitive’ tasks such as object recognition in images, real-time spoken language translation, etc. These are monumental achievements that have transformed our thinking about what machines could do for us.

DL, it turns out, can also perform well on more mundane machine learning tasks such as sequence classification of high-resolution multivariate sensor data. This post presents such a model, built with the Microsoft Cognitive Toolkit (CNTK).

Sensor data sequence classification has many applications, such as:

Manufacturing: Predict if a machine in a manufacturing plant is about to fail based on the collective ‘hum’ of the relevant sensor data.

Automotive: Detect if the car is about to slide given sensor data – e.g. speed, lateral acceleration, wheel velocities, steering wheel angle, driveline torque, etc. – and take corrective action.

Cyber security: Detect if a hacker is actively trying to penetrate a secure network based on the pattern of network data packets sent to a server.

The problem presented here is similar to the automotive scenario above. Assume a series of high-resolution values (100 times per second) from multiple sensors. Our task is to detect, with some probability, whether the car is about to slide.

First we need good training data, which means a large sample of timed sensor values along with an indication of whether the car is sliding or not, collected under a variety of road and speed conditions. Obtaining good training data is often the hardest problem in machine learning. I don’t want to trivialize this point, but assume that we have this data somehow.

Assume the raw data is as follows (not real data):

seq#  s1   s2   s3  i
----  ---  ---  --  -
1     1.0  0.5  10  0
2     2.0  0.6  11  0
3     1.5  0.8   9  1
4     1.5  0.8  10  1
...

Where:
s1-s3 are sensor values
i is the indicator variable 1=sliding, 0=not sliding
each row is at a constant 10ms increment from the previous
(100 rows will give the sensor values for 1 second)

We need to transform this data into a format that the CNTK reader can consume as multivariate time series data for sequence classification. CNTK documents a special text format for this purpose.

But first we need to segment the data into multiple sets. Positive training samples are sequences of sensor values just before the car starts to slide (i.e. the indicator transitions from 0 to 1 and stays at 1 for several rows). Negative training samples are sequences where the indicator stays 0 for the entire sequence (and for some time beyond).

The sequence length can be determined by experimentation (or it can vary from sequence to sequence). Assume we fix it at 30 rows (0.3 sec, roughly 1/3 of a second) for this task.
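The segmentation rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the real project: the function name segment, the HOLD value (how many rows the indicator must stay at 1) and the GUARD value (how long it must stay 0 after a negative window) are all hypothetical choices.

```python
from typing import List, Tuple

Row = Tuple[float, float, float, int]       # (s1, s2, s3, indicator)
Window = List[Tuple[float, float, float]]   # sensor values only

SEQ_LEN = 30  # rows per sample, as chosen in the text
HOLD = 5      # hypothetical: rows the indicator must stay at 1 after a 0 -> 1 transition
GUARD = 10    # hypothetical: extra rows after a negative window that must also be 0


def segment(rows: List[Row], seq_len: int = SEQ_LEN, hold: int = HOLD,
            guard: int = GUARD) -> List[Tuple[Window, int]]:
    """Return (window, label) pairs; label 1 = a slide follows, 0 = no slide."""
    ind = [r[3] for r in rows]
    feats = [r[:3] for r in rows]
    samples: List[Tuple[Window, int]] = []

    # Positive samples: the seq_len rows right before a sustained 0 -> 1 transition.
    for t in range(seq_len, len(rows) - hold + 1):
        starts_slide = ind[t - 1] == 0 and all(v == 1 for v in ind[t:t + hold])
        if starts_slide and all(v == 0 for v in ind[t - seq_len:t]):
            samples.append((feats[t - seq_len:t], 1))

    # Negative samples: non-overlapping windows where the indicator stays 0
    # for the whole window and for the guard period after it.
    t = 0
    while t + seq_len + guard <= len(rows):
        if all(v == 0 for v in ind[t:t + seq_len + guard]):
            samples.append((feats[t:t + seq_len], 0))
            t += seq_len
        else:
            t += 1
    return samples
```

In practice the hold and guard lengths would need tuning against the real data, just like the sequence length itself.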

After transformation of the raw data, the input training file will look as follows:

1 | F 1.0 0.5 10 | L 1 0
1 | F 1.0 0.5 10 
1 | F 2.0 0.6 11 
...

2 | F 1.0 0.5 10 | L 0 1
2 | F 1.0 0.5 10 
2 | F 2.0 0.6 11 
...

Where:
Vertical bar (|) is a separator
The leftmost value in each row is the record identifier
All rows with the same record number are part of the same input sample
The input values are ordered sequentially by row
For our case there are a total of 30 rows per sample
'F' is the tag for features. Here it is a vector of 3 values (1 per sensor)
'L' is the tag for labels either "0 1" or "1 0" (binary classification)
The label tag only appears on the first row of the input sample
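As a rough illustration of the layout described above, the following Python sketch turns (window, label) pairs into reader lines. The function name to_reader_lines is hypothetical, and the one-hot convention ("1 0" for class 0, "0 1" for class 1) is an assumption, since the post does not say which label vector stands for sliding. The line layout simply mirrors the sample shown above.

```python
from typing import Iterable, List, Sequence, Tuple

Sample = Tuple[Sequence[Sequence[float]], int]  # (rows of sensor values, class index)


def to_reader_lines(samples: Iterable[Sample], num_classes: int = 2) -> List[str]:
    """Format samples as text lines, one line per time step."""
    lines: List[str] = []
    for seq_id, (rows, label) in enumerate(samples, start=1):
        # Assumed one-hot encoding: class k has a 1 in position k.
        one_hot = " ".join("1" if k == label else "0" for k in range(num_classes))
        for i, row in enumerate(rows):
            line = f"{seq_id} | F " + " ".join(format(v, "g") for v in row)
            if i == 0:
                # The label tag appears only on the first row of each sample.
                line += f" | L {one_hot}"
            lines.append(line)
    return lines
```

Writing the result to disk is then just a matter of joining the lines with newlines; every row of a sample carries the same leading identifier, so the reader can group them back into one sequence.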

Given the above training data format, we can construct a CNTK model that can be trained to classify a sequence of sensor values. The core of the model is a recurrent neural network layer built from Gated Recurrent Units (RNN-GRU). The CNTK model is shown below (‘#’ denotes comment start):

FDim=$FDim$            #3 dimensions for 3 sensor values
LDim=$LDim$            #2 for binary classification
HDim = 15              #dimension of the GRU internal state vector

t = DynamicAxis()      #CNTK primitive for variable length sequences
features  = Input(FDim, dynamicAxis=t)
labels    = Input(LDim)

#below is the function to create a GRU RNN layer
R = BS.RNNs.RecurrentGRU(HDim, cellDim=HDim, features, inputDim=FDim).h
L = BS.Sequences.Last(R) #we need the last value of the GRU hidden vector

#below is a macro that defines a sequence of layers (concatenated by :)
M = Sequential (
  LayerNormalizationLayer{} :          #normalization layer
  # BatchNormalizationLayer{} :        #alternative normalization
  DenseLayer {10, activation=ReLU} :   #dense layer with ReLU activation
  DenseLayer {2}                       #output layer
)
Y = M(L)                   #invoke the M macro to obtain un-normalized output
O   = Softmax(Y)                          #normalize output (sum to 1)
CE  = CrossEntropyWithSoftmax(labels, Y)  #error minimized by the optimizer
Err = ClassificationError(labels, Y)      #error to evaluate model performance

# Root Nodes
featureNodes = (features)                 #define the special nodes
labelNodes = (labels)                     #labels
criterionNodes = (CE)                     #SGD optimizer minimizes this loss
evaluationNodes = (Err)                   #evaluation results shown in log
outputNodes = (Y:O)                       #this is the output of the model

Above is partial CNTK code – just the core model. It needs to be wrapped in code that references the training data, sets the parameters for the SGD optimizer, specifies cross validation and testing, etc. Please refer to the CNTK documentation for details.
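For readers unfamiliar with the three nodes at the end of the model (O, CE and Err), here is a plain-Python sketch of what they compute for a single sample. It follows the standard textbook definitions of softmax, cross entropy and classification error; the function names are mine and this is not CNTK's implementation.

```python
import math
from typing import List, Sequence


def softmax(y: Sequence[float]) -> List[float]:
    """Normalize raw scores so they are positive and sum to 1."""
    m = max(y)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_with_softmax(labels: Sequence[float], y: Sequence[float]) -> float:
    """Cross entropy between one-hot labels and softmax(y), computed from raw scores."""
    m = max(y)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in y))
    return sum(l * (log_sum_exp - v) for l, v in zip(labels, y))


def classification_error(labels: Sequence[float], y: Sequence[float]) -> float:
    """1.0 if the highest-scoring class differs from the labeled class, else 0.0."""
    predicted = max(range(len(y)), key=lambda k: y[k])
    actual = max(range(len(labels)), key=lambda k: labels[k])
    return 0.0 if predicted == actual else 1.0
```

Note that both the loss and the error take the un-normalized output Y as input, which matches how the model above passes Y rather than O to them.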

Sensor fusion is reflected in the hidden state of the RNN-GRU layer. After all of the input values have been applied (recursively, each one combined with the previous hidden state), the final hidden state vector represents the ‘essence’ of the combined sequence of sensor values (known as an embedding).
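To make the recursion concrete, below is a small pure-Python sketch of a GRU unrolled over a sequence. It uses one common textbook form of the GRU update (update gate z, reset gate r, candidate state), with biases omitted and small random weights standing in for trained ones. The class name TinyGRU is hypothetical, and CNTK's own GRU may differ in details such as the gate convention.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


class TinyGRU:
    """Untrained GRU with random weights, for illustration only."""

    def __init__(self, input_dim: int, hidden_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def mat(rows: int, cols: int) -> Matrix:
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_dim = hidden_dim
        self.Wz, self.Wr, self.Wh = (mat(hidden_dim, input_dim) for _ in range(3))
        self.Uz, self.Ur, self.Uh = (mat(hidden_dim, hidden_dim) for _ in range(3))

    def step(self, x: Sequence[float], h: Vector) -> Vector:
        """Combine one row of sensor values with the previous hidden state."""
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(matvec(self.Wh, x), matvec(self.Uh, reset_h))]
        # Blend the old state and the candidate state using the update gate.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def embed(self, sequence: Sequence[Sequence[float]]) -> Vector:
        """Return the final hidden state after consuming the whole sequence."""
        h = [0.0] * self.hidden_dim
        for x in sequence:
            h = self.step(x, h)
        return h
```

With 3 sensor inputs and a hidden dimension of 15, as in the model above, TinyGRU(3, 15).embed(rows) returns a 15-value vector regardless of how many rows the sequence has. That fixed-size vector is what the dense layers then classify.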

I constructed a similar model for a real-world problem involving automotive sensor data (different from the example problem discussed here). The results were encouraging: the out-of-sample (test set) classification error rate was 3% for sequences 1/3 sec long, and it dropped to almost zero for longer, 1/2 sec sequences.

Further, I compiled the trained model into a minimal C++ program and tested its runtime performance on a PC. The program was tiny – less than 50 KB in size (including the model). The runtime evaluation rate was 600+ evals per sec – indicating feasibility for use in real-time embedded systems.

In conclusion, while DL is making its name by conquering very hard problems, its applicability to simpler problems should not be overlooked. DL techniques may be as viable as, or better than, the more traditional methods used to tackle such problems.
