3. Encoding Human Insights

3a. Use case analysis: turning on safe-mode vs post-mortem analysis

The H1ST.AI approach to this problem begins by thinking about the end users of the decision system and their use cases.

What are the use cases for such an Automotive Cybersecurity system? We can envision two distinct use cases:

  1. The onboard intrusion detection system (IDS) detects an attack event in real time and puts the car into a safe mode, so that the driver can get to a safe location rather than being stuck on the highway with a malfunctioning car.
  2. A security expert reviews the attack in post-mortem mode, in which the IDS provides a message-by-message attack vs. normal classification.

For use case #1, “safe-mode triggering by attack event detection”, the key ML requirement is a near-zero false-positive rate (FPR).

To give an example, each second might carry 100 CAN messages per car. If we have a fleet of just 1,000 cars, each driven 1 hour per day, then an FPR of 0.001% at the message level still means 0.00001 x 100 msg x 3600 s x 1000 cars = 3600 false-positive events per day that a security operations center would need to handle!
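
A quick back-of-the-envelope check of that arithmetic:

fpr = 0.001 / 100              # 0.001% expressed as a fraction
msgs_per_second = 100          # CAN messages per car per second
seconds_driven = 3600          # 1 hour of driving per car per day
cars = 1000
print(fpr * msgs_per_second * seconds_driven * cars)  # 3600.0 false-positive events per day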

Additionally, for deployment and anticipated regulatory purposes, the system should behave robustly and explainably. Explainability is a complex subject; here we simply mean that one can anticipate the system’s behavior reasonably well, which also matters for legal and regulatory reasons. As we saw with the iForest and GBM ML models, they don’t quite meet this requirement: it is hard to explain precisely how these models classify attacks, even when they achieve good accuracy.

For use case #2, “post-mortem analysis”, it turns out that the requirements are quite different. Some FPR can be traded off for a higher TPR, and the system does not need to be as explainable, since it is, after all, the job of the security experts to analyze the attacks in depth and make the final decisions.

3b. Problem (re)formulation into H1st.AI Graph

We reformulate the problem in the form of a decision graph, where the outermost flow detects attack events and the corresponding yes branch handles message classification. For this tutorial we focus on injection attacks, which are the most common in the wild (we will revisit this later).

The graph looks like this.
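
In code, the structure is roughly the following sketch, a preview of the graph we build step by step in sections 3c-3e (all node classes shown here are defined later in this tutorial):

graph = h1.Graph()
graph.start()\
     .add(WindowGenerator())\
     .add(h1.Decision(MsgFreqEventDetectorModel(),
                      decision_field="WindowInAttack"))\
     .add(yes=GradientBoostingMsgClassifierModel(),  # classify messages on the "yes" branch
          no=NoOp())                                 # nothing to do on the "no" branch
graph.end()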

3c. Encoding human insights for event detection as an H1st.Model

Remember that when we started analyzing the CAN dataset, we remarked that the normal data is highly regular, especially in terms of the message frequency for each CAN ID.

It turns out that using message-frequency statistics for injection event detection is highly accurate for the safe-mode use case (high TPR and low FPR). This surprising fact was first pointed out by the original CAN bus hackers Chris Valasek and Charlie Miller in the seminal white paper Adventures in Automotive Networks and Control Units:

It is pretty straightforward to detect the attacks discussed in this paper. They always involve either sending new, unusual CAN packets or flooding the CAN bus with common packets… Additionally, the frequency of normal CAN packets is very predictable… Therefore we propose that a system can detect CAN anomalies based on the known frequency of certain traffic and can alert a system or user if frequency levels vary drastically from what is well known.

Using H1st, we can encode such human insights as models and use them just like ML models. An h1.Model is essentially anything that can predict. H1st also provides tools to automate their saving and loading, easing the way toward using them in an integrated decision system.
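
As a toy illustration (a hypothetical model, not part of the tutorial code), anything with a predict() method returning a dict can be an h1.Model:

import h1st as h1

class SpeedSanityModel(h1.Model):
    """Toy human-rule model: flags a window if any reported CarSpeed is physically implausible."""
    def predict(self, data):
        df = data["df"]
        return {"implausible_speed": bool((df["CarSpeed"] > 300).any())}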

A data-science project in H1ST.AI is designed to be a Python-importable package. You can create such a project using the h1 command-line tool.

Organizing model code this way makes it easy to use. The Model API is designed so that models can be used interactively in notebooks as well as in more complex projects such as this one.

Note:

The H1st package for the full tutorial is available from the H1st GitHub project at https://github.com/h1st-ai/h1st/tree/master/examples/AutoCyber.

Simply go ahead and clone it, then follow along.

Training the message-frequency statistics is quite simple: loop through a number of files and compute window statistics such as how many messages per CAN ID are found, along with their min, max, and percentile values.

The content of models/msg_freq_event_detector.py should look like the following.

import pandas as pd 
import h1st as h1

import config
import util

class MsgFreqEventDetectorModel(h1.Model):
    def load_data(self, num_files=None):
        return util.load_data(num_files, shuffle=False)
    
    def train(self, prepared_data):
        files = prepared_data["normal_files"]
        
        from collections import defaultdict
        def count_messages(f):
            # count, for each sensor/CAN ID, how many messages fall into each fixed-size window of this file
            df = pd.read_parquet(f)
            counts = defaultdict(list)
            
            for window_start in util.gen_windows(df, window_size=config.WINDOW_SIZE, step_size=config.WINDOW_SIZE):
                w_df = df[(df.Timestamp >= window_start) & (df.Timestamp < window_start + config.WINDOW_SIZE)]
                for sensor in config.SENSORS:
                    counts[sensor].append(len(w_df.dropna(subset=[sensor])))

            return pd.DataFrame(counts)
        
        ret = [count_messages(f) for f in files]
        df = pd.concat(ret)

        self.stats = df.describe()
    
    def predict(self, data):
        df = data['df']
        window_starts = data["window_starts"]
        window_results = []
        for window_start in window_starts:
            w_df = df[(df.Timestamp >= window_start) & (df.Timestamp < window_start + config.WINDOW_SIZE)]
            results = {}
            for sensor in config.SENSORS:
                w_df_sensor = w_df.dropna(subset=[sensor])
                max_normal_message_freq = self.stats.at['max', sensor]
                msg_freq = len(w_df_sensor)
                # flag this sensor if its message count exceeds the normal maximum by more than 10%
                if msg_freq > (max_normal_message_freq * 1.1):
                    results[sensor] = 1
                else:
                    results[sensor] = 0
                # print("%s => %s" % ((window_start, sensor, msg_freq, max_normal_message_freq), results[sensor]))
            # the window is flagged as an attack if any sensor's frequency is anomalous
            results["WindowInAttack"] = any(results.values())
            results["window_start"] = window_start  # information for downstream nodes
            window_results.append(results)
        return {"event_detection_results": window_results}

Now let’s import and train this MsgFreqEventDetectorModel.

Using h1st.Model makes saving and loading easy. By default, the “model”, “stats”, and “metrics” properties are persisted, and they support a variety of flavors and data structures.

Note

We call h1.init() to set up the model repository with the storage location specified in MODEL_REPO_PATH. You can also put MODEL_REPO_PATH in config.py and call h1.init() without any parameters.
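
Concretely, the setup before training looks roughly like the sketch below (the MODEL_REPO_PATH value and num_files are assumptions made for illustration):

import h1st as h1
from msg_freq_event_detector import MsgFreqEventDetectorModel

h1.init(MODEL_REPO_PATH=".models")   # assumed path; or set MODEL_REPO_PATH in config.py and call h1.init()
m = MsgFreqEventDetectorModel()
data = m.load_data(num_files=20)     # num_files chosen arbitrarily for this sketch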

It should take several minutes to compute the regular frequencies, i.e., to “train” this model.

m.train(data)
m.stats
       SteeringAngle      CarSpeed       YawRate            Gx            Gy
count   11084.000000  11084.000000  11084.000000  11084.000000  11084.000000
mean       34.316763     17.158607     34.314778     34.314778     34.314778
std         1.311491      2.121101      1.359257      1.359257      1.359257
min         0.000000      0.000000      0.000000      0.000000      0.000000
25%        33.000000     17.000000     33.000000     33.000000     33.000000
50%        34.000000     17.000000     34.000000     34.000000     34.000000
75%        35.000000     18.000000     35.000000     35.000000     35.000000
max        40.000000     22.000000     41.000000     41.000000     41.000000

Persisting returns a model version ID that you can use to load it back later (or you can also give it a name).

m.persist()
2020-09-17 19:59:38,891 INFO h1st.model_repository.model_repository: Saving stats property...
'01EJFJEB89MNS65B4CK714TT4B'
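
To reload this exact version later, you can pass the version ID to load() (assuming, as the text above suggests, that load() accepts a version ID; calling load() with no argument, as we do below, loads the latest persisted version):

m = MsgFreqEventDetectorModel().load('01EJFJEB89MNS65B4CK714TT4B')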

3d. Working with H1st Graph

Let’s now make some event-level predictions.

Note that since the model was persisted to the H1st model repository, we can easily come back in a notebook or script and load the trained model and its computed statistics.

Importantly, H1st allows much speedier integration into a Graph (and later deployment, too).

data['attack_files'][0]
'data/attack-samples/20181116_Driver1_Trip4-1.parquet'
import pandas as pd

from graph import WindowGenerator
from msg_freq_event_detector import MsgFreqEventDetectorModel

graph = h1.Graph()
graph.start()\
     .add(WindowGenerator())\
     .add(MsgFreqEventDetectorModel().load())
graph.end()

df = pd.read_parquet(data['attack_files'][0])

results = graph.predict({"df": df})
results.keys()
2020-09-17 19:59:38,914 INFO h1st.model_repository.model_repository: Loading version 01EJFJEB89MNS65B4CK714TT4B ....
dict_keys(['window_starts', 'event_detection_results'])

And we should see that we can start detecting attack events. We’ll evaluate this later; for now, let’s finish our detection graph by adding the message classifier.

Note that the graph returns separate output keys, collected from all the nodes’ outputs. Typically each node is expected to return a dict.
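
For reference, a node is simply an h1.Action (or an h1.Model) whose call() or predict() returns a dict, and those keys are collected into the graph output. The actual WindowGenerator lives in graph.py in the tutorial repo; here is a minimal sketch of what such a node could look like (an assumption for illustration, not its real implementation):

import h1st as h1

import config
import util

class WindowGenerator(h1.Action):
    def call(self, command, inputs):
        df = inputs["df"]
        # emit the start timestamp of each fixed-size window covering the data
        window_starts = list(util.gen_windows(df, window_size=config.WINDOW_SIZE,
                                              step_size=config.WINDOW_SIZE))
        return {"window_starts": window_starts}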

3e. Adding a message classifier, harmonizing human + ML models in the graph

For message-level classification we can simply bring back our gradient-boosted trees, which did a decent job of recognizing injection messages. (Integrating a sequence model such as a bidirectional LSTM is left as an exercise for the reader.)

As before, we’ve reorganized it as an H1st.Model in the tutorial folder, ready for use.

The content of models/gradient_boosting_msg_classifier.py looks like this.

import h1st as h1
import pandas as pd

import config
import util

FEATURES = config.SENSORS + ["%s_TimeDiff" % s for s in config.SENSORS]

class GradientBoostingMsgClassifierModel(h1.Model):
    def load_data(self, num_files=None):
        return util.load_data(num_files, shuffle=False)

    def prep(self, data):
        def concat_processed_files(files):
            dfs = []
            for f in files:
                z = pd.read_parquet(f)
                z = util.compute_timediff_fillna(z, dropna_subset=FEATURES)
                dfs.append(z)
            df2 = pd.concat(dfs)
            return df2
        split = int(len(data["attack_files"])*0.5)
        train_files = data["attack_files"][:split]
        test_files = data["attack_files"][split:]
        result = {
            "train_files": train_files,
            "test_files": test_files,
            "train_attack_df": concat_processed_files(train_files),
            "test_attack_df": concat_processed_files(test_files)
        }
        print("len train_attack_df = %s" % len(result["train_attack_df"]))
        print("len test_attack_df = %s" % len(result["test_attack_df"]))
        return result

    def train(self, prepared_data):
        df = prepared_data["train_attack_df"]
        from sklearn.experimental import enable_hist_gradient_boosting  # noqa: only needed for scikit-learn < 1.0
        from sklearn.ensemble import HistGradientBoostingClassifier
        X = df[FEATURES]
        y = df.Label == config.ATTACK_LABEL
        self.model = HistGradientBoostingClassifier(max_iter=500).fit(X, y)

    def evaluate(self, prepared_data):
        df = prepared_data["test_attack_df"]
        ypred = self.model.predict(df[FEATURES])
        import sklearn.metrics
        cf = sklearn.metrics.confusion_matrix(df.Label == config.ATTACK_LABEL, ypred)
        acc = sklearn.metrics.accuracy_score(df.Label == config.ATTACK_LABEL, ypred)
        print(cf)
        print("Accuracy = %.4f" % acc)
        self.metrics = {"confusion_matrix": cf, "accuracy": acc}
    
    def predict(self, data):
        df = data["df"].copy()
        df = util.compute_timediff_fillna(df)
        df['MsgIsAttack'] = 0
        df['WindowInAttack'] = 0
        # only run message-level classification inside windows flagged by the event detector
        for event_result in data["event_detection_results"]:
            if event_result['WindowInAttack']:
                # print("window %s in attack: event_result = %s" % (event_result['window_start'], event_result))
                in_window = (df.Timestamp >= event_result['window_start']) & (df.Timestamp < event_result['window_start'] + config.WINDOW_SIZE)
                w_df = df[in_window]
                if len(w_df) > 0:
                    ypred = self.model.predict(w_df[FEATURES])
                    df.loc[in_window, "WindowInAttack"] = 1
                    df.loc[in_window, "MsgIsAttack"] = ypred.astype(int)
        return {"injection_window_results": df}

Now let’s import, train, and evaluate this classifier.

from gradient_boosting_msg_classifier import GradientBoostingMsgClassifierModel

m2 = GradientBoostingMsgClassifierModel()
data = m2.load_data(num_files=6)
prepared_data = m2.prep(data)
len train_attack_df = 1030994
len test_attack_df = 868436
prepared_data["train_attack_df"]
Timestamp SteeringAngle CarSpeed YawRate Gx Gy Label AttackSensor AttackMethod AttackParams AttackEventIndex SteeringAngle_TimeDiff CarSpeed_TimeDiff YawRate_TimeDiff Gx_TimeDiff Gy_TimeDiff
2 0.024343 67.604385 0.000000 0.189777 0.002458 -0.002173 Normal NA NA 0.0 <NA> -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
3 0.027083 67.608772 0.000000 0.189777 0.002458 -0.002173 Normal NA NA 0.0 <NA> 0.013509 -1.000000 -1.000000 -1.000000 -1.000000
4 0.037508 67.608772 0.000000 0.189665 0.002375 -0.002151 Normal NA NA 0.0 <NA> -1.000000 -1.000000 0.013230 0.013230 0.013230
5 0.038148 67.613159 0.000000 0.189665 0.002375 -0.002151 Normal NA NA 0.0 <NA> 0.011065 -1.000000 -1.000000 -1.000000 -1.000000
6 0.043605 67.617538 0.000000 0.189665 0.002375 -0.002151 Normal NA NA 0.0 <NA> 0.005457 -1.000000 -1.000000 -1.000000 -1.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
369439 1649.991996 7.202400 6.485162 0.998420 0.433515 0.138374 Normal NA NA 0.0 <NA> -1.000000 -1.000000 0.011372 0.011372 0.011372
369440 1649.994320 7.168000 6.485162 0.998420 0.433515 0.138374 Normal NA NA 0.0 <NA> 0.012805 -1.000000 -1.000000 -1.000000 -1.000000
369441 1650.003266 7.168000 6.485162 0.994170 0.437117 0.137005 Normal NA NA 0.0 <NA> -1.000000 -1.000000 0.011269 0.011269 0.011269
369442 1650.007432 7.133600 6.485162 0.994170 0.437117 0.137005 Normal NA NA 0.0 <NA> 0.013111 -1.000000 -1.000000 -1.000000 -1.000000
369443 1650.009595 7.133600 6.422659 0.994170 0.437117 0.137005 Normal NA NA 0.0 <NA> -1.000000 0.024471 -1.000000 -1.000000 -1.000000

1030994 rows × 16 columns

m2.train(prepared_data)
m2.evaluate(prepared_data)
[[831878    603]
 [ 15049  20906]]
Accuracy = 0.9820
m2.persist()
2020-09-17 17:09:34,127 INFO h1st.model_repository.model_repository: Saving metrics property...
2020-09-17 17:09:34,129 INFO h1st.model_repository.model_repository: Saving model property...
'01EJF8PXNE6VT0SGJM0RHF12Y6'

Putting everything together in an h1.Graph and running graph.predict() on a single file looks like this.

class NoOp(h1.Action):
    def call(self, command, inputs):
        pass

graph = h1.Graph()
graph.start()\
     .add(WindowGenerator())\
     .add(h1.Decision(MsgFreqEventDetectorModel().load(),
                      decision_field="WindowInAttack",
                      result_field="event_detection_results"))\
     .add(yes=GradientBoostingMsgClassifierModel().load(),
          no=NoOp())
graph.end()

results = graph.predict({"df": df})
results.keys()
2020-09-17 20:38:30,155 INFO h1st.model_repository.model_repository: Loading version 01EJFJEB89MNS65B4CK714TT4B ....
2020-09-17 20:38:30,160 INFO h1st.model_repository.model_repository: Loading version 01EJF8PXNE6VT0SGJM0RHF12Y6 ....
dict_keys(['window_starts', 'event_detection_results', 'injection_window_results'])

The confusion matrix for message-level classification looks like this.

import sklearn.metrics
print(sklearn.metrics.confusion_matrix(results['injection_window_results']["Label"] == "Attack", 
                                       results['injection_window_results']["MsgIsAttack"]))
[[354997      1]
 [    46  15193]]

Now let’s evaluate the whole graph against the test set, especially focusing on the event-level TPR & FPR since they are crucial in the safe-mode deployment use case.

from util import evaluate_event_graph

evaluate_event_graph(graph, prepared_data['test_files'])
============
Event-level confusion matrix
[[8567    0]
 [  16 1089]]
Event TPR = 0.9855, FPR = 0.0000
(8567, 0, 16, 1089)
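
As a sanity check, the reported TPR and FPR follow directly from that confusion matrix (reading it in scikit-learn’s convention: rows = actual normal/attack, columns = predicted normal/attack):

tn, fp, fn, tp = 8567, 0, 16, 1089
print("TPR = %.4f, FPR = %.4f" % (tp / (tp + fn), fp / (fp + tn)))
# TPR = 0.9855, FPR = 0.0000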

Now that’s something! Zero false positives at the event level, i.e., an event-level FPR of 0.0%!

(Note that the provided attack samples were created from a subset of the driving trips, but you should be able to do a more thorough evaluation by running against synthetic attacks created from the full driving-trips dataset; the results should be the same: zero false positives at the event level.)

The message-level accuracy should be nearly the same, because we used the same classifier. However, the decomposition cleanly separates the concerns and requirements of the two use cases. We’re now much more comfortable with the solution, in terms of accuracy as well as robustness and explainability.

Another point worth highlighting is that we get multiple output streams from the H1st.Graph: event-level outputs and message-level outputs, exactly what we need for the two use cases we identified: safe-mode triggering and post-mortem analysis.