Given the abundance of normal driving data, the problem naturally lends itself to an anomaly detection (AD) formulation. Let’s try some well-known off-the-shelf methods, for example Isolation Forest!
In theory, an AD approach isn’t affected by the Cold Start problem, since the training data consists of normal data only; hence we only need labels when evaluating the intrusion detection system.
But will it work accurately enough? Let’s try!
2a. A naive AD approach using IsolationForest
What did you notice about the sequencing of the CAN messages? There seem to be particular rhythms.
Let’s include timing-based features, such as the time difference since the last message of the same type.
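With pandas, this feature is a grouped diff. A minimal sketch on toy data (the column names `t` and `msg_type` and the timestamps are illustrative, not the dataset’s actual schema):

```python
import pandas as pd

# Toy CAN log: one row per message, timestamps in seconds.
df = pd.DataFrame({
    "t":        [0.0000, 0.0031, 0.0125, 0.0160, 0.0250, 0.0255],
    "msg_type": ["YawRate", "CarSpeed", "YawRate", "CarSpeed", "YawRate", "YawRate"],
})

# Time since the previous message of the *same* type.
# The first message of each type has no predecessor, so it gets NaN.
df["time_diff"] = df.groupby("msg_type")["t"].diff()
print(df)
```

Note the last row: a second YawRate arriving only 0.5 ms after the previous one would stand out against the ~12.5 ms rhythm.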
We can clearly see that attack and normal messages have different timing distributions. Normal YawRate messages typically arrive every 12.5 ms, while the attack messages are injected much closer to the previous messages (smaller time diffs).
Can AD make use of this?
Also, we need to handle the NaN values that arise from a quirk of CAN bus data: YawRate/Gx/Gy is a different CAN message than SteeringAngle or CarSpeed, so each row only carries a subset of the signals.
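A minimal sketch of the naive approach, assuming a tiny two-column feature matrix (the feature layout and the constant fill value are illustrative; most versions of sklearn’s IsolationForest reject NaN inputs, so we impute before fitting):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy feature matrix: [signal_value, time_diff]; NaN where a signal
# is absent from a given message type.
X_train = np.array([
    [0.10, 0.0125],
    [0.12, 0.0124],
    [np.nan, 0.0126],
    [0.11, 0.0125],
])

# Simple constant fill; the fill strategy is itself a tuning knob.
X_filled = np.nan_to_num(X_train, nan=0.0)

# Train on normal data only -- no attack labels needed.
clf = IsolationForest(random_state=0).fit(X_filled)

# predict() returns +1 for inliers, -1 for anomalies.
X_new = np.nan_to_num(np.array([[0.11, 0.0005]]), nan=0.0)
pred = clf.predict(X_new)
```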
This is certainly a start. But the accuracy is nowhere near what’s needed for deployment, especially the FPR (false positive rate)! In fact, a majority of predictions are false positives.
While there are many exciting approaches for AD on sequential time-series data, including RNNs/LSTMs/CNNs, autoencoders, and self-supervised learning, the fundamental problem with AD remains: it is hard to achieve a high TPR while simultaneously achieving the very low FPR that we need.
AD is, after all, a harder problem than supervised learning. While these methods are an important part of the toolbox, we need another strategy to tackle the problem.
2b. Machine teaching: leveraging ML to “program” a classifier by specifying human-generated outputs
If we zoom in, it is easier to see the zig-zag patterns of alternating real vs injected messages. Perhaps we can leverage ML to classify these kinds of smooth vs zig-zag patterns.
After all, ML should excel at pattern recognition.
The significance of this approach is that it is much easier for human experts to synthesize the attack data than to write the detection program. Such is the promise of Software 2.0, but will it work?
Let’s inspect one such attack event closely.
Let’s first try gradient-boosted trees, e.g. sklearn’s HistGradientBoostingClassifier, which works well on larger datasets, before bringing out bigger guns.
Impressive as it seems, we must note that the false-positive rate is still a bit high, at FPR = 520/(520+8660) = 5.66%. Since CAN messages are very frequent (100-200 msgs/sec in each car), this is still nowhere near deployment-ready!
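For reference, that FPR is just the false positives over all actual negatives, using the confusion-matrix counts quoted above:

```python
# Counts from the confusion matrix in the text.
fp, tn = 520, 8660
fpr = fp / (fp + tn)
print(f"FPR = {fpr:.2%}")  # FPR = 5.66%
```

At 100-200 msgs/sec, a 5.66% FPR would mean several false alarms per second per car.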
We can certainly improve the results by tuning the model, tuning the fill-NA method, or bringing out bigger guns like bidirectional LSTMs, CNNs, or Transformers, which work well on pattern-recognition problems over sequential data such as this. Powerful deep learning models can recognize these attack patterns well, and can be trained much faster on the full dataset, which is quite large in our case.
However, we must recognize that these models are, after all, learning attack patterns that humans generated and injected artificially. While this makes it convenient to generate outputs and train the detector program a la “Software 2.0”, in our situation the attacks are purely synthetic, so we cannot be sure that the models are learning the right things, will work robustly, and can be trusted for deployment in the field. It’s best to employ them within the right deployment scope, namely as useful pattern recognizers.