Data: The fuel that powers Machine Learning

June 2nd, 2022

Why is data essential to creating a reliable Machine Learning model, and which data do you need?

Machine Learning Blog Post Series – 4
By Shazia Saqib

 

MACHINE LEARNING BLOG SERIES 

1: Machine Learning & Cybersecurity – An Introduction

2: The main concepts of AI and Machine Learning

3: Why we need Machine Learning in Cybersecurity, and how it can help

 

Data is probably one of the most important and valuable commodities in modern-day society. Data analytics in machine learning is used to capture otherwise unseen patterns in almost every industry, from manufacturing to agriculture, and from scientific institutions to government organizations. It can even be used to predict disasters before they happen.

Companies with a data-first mentality will have the chance to reimagine and reinvent their business. [1]

 

In our last article, “Why do we need Machine Learning in cybersecurity, and how can it help?”, we outlined a general introduction to the state of cybersecurity. We explained why the analytics and automation power of Machine Learning (ML) can help cover the blind spots in cybersecurity by recognizing hidden patterns to identify attacks and automatically mitigate them. We also explored 19 of the most prominent AI use cases in Cybersecurity listed by Gartner.

Download your complimentary Gartner Report

 

Machine learning models mainly rely on four primary data types. These include numerical, categorical, time-series, and text data.

Machine Learning models generally use 4 types of data: numerical data, categorical data, time-series data, text data

Source: The importance of machine learning data

 

Numerical data, or quantitative data, is measurable data such as distance, age, weight, or the cost of an electricity bill.

Text data is simply words, sentences, or paragraphs that can provide some level of insight to machine learning models.

Categorical data is sorted based on shared characteristics. Social class, gender, and hometown are a few examples of categorical data.

Time-series data consists of data points indexed at specific points in time. More often than not, this data is collected at consistent intervals.
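
To make these four types concrete, here is a minimal sketch in Python using pandas. The column names and example values are hypothetical, chosen to echo a security context:

```python
# A minimal, hypothetical illustration of the four data types with pandas.
import pandas as pd

df = pd.DataFrame({
    # Numerical (quantitative): measurable values
    "file_size_kb": [120.5, 88.0, 342.7],
    # Categorical: values grouped by shared characteristics
    "file_type": pd.Categorical(["pdf", "exe", "docx"]),
    # Time-series: data points indexed at (usually consistent) intervals
    "observed_at": pd.to_datetime(["2022-06-01", "2022-06-02", "2022-06-03"]),
    # Text: raw words or sentences a model can draw insight from
    "email_subject": ["Invoice attached", "Password reset", "Meeting notes"],
})

print(df.dtypes)
```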

As cyber-attacks grow exponentially, there is a need to detect the patterns of an attack and to find correlations between the attack volumes reported on consecutive days [2].

How to build and evaluate a Machine Learning model

Data is probably the most critical component of a reliable machine learning model. Everything from the model creation process to the model’s actual predictions depends on the data used as input.

The Machine Learning flow starts with reliable data and continues with feature extraction and model engineering to produce the most reliable output.

 

Feature extraction:

Feature extraction is the process of selecting features and translating them into a mathematical form that the model can understand. When you build a model, you have numerous parameters, or “features”, that can be used to predict the desired outcome. Feature selection identifies the essential features and eliminates the irrelevant or redundant ones, often via dimensionality reduction.

VMRay uses a supervised machine learning model, which means that the most relevant features are picked from among numerous potential indicators, such as URL string entropy and white-space percentage.
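
As a rough illustration (not VMRay’s actual implementation), two of the indicators mentioned above, URL string entropy and white-space percentage, could be computed along these lines in Python:

```python
# A hedged sketch of feature extraction; the feature definitions here are
# common textbook formulations, not VMRay's proprietary implementation.
import math
from collections import Counter

def url_entropy(url: str) -> float:
    """Shannon entropy (bits per character) of a URL string."""
    counts = Counter(url)
    total = len(url)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def whitespace_percentage(text: str) -> float:
    """Share of white-space characters in a text body."""
    return sum(ch.isspace() for ch in text) / len(text) if text else 0.0

# Example: turn one sample into a numeric feature vector.
features = [
    url_entropy("http://xk3j9q.example/payload?id=aZ81"),
    whitespace_percentage("Dear user,\n\nplease open the attachment."),
]
print(features)
```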

 

Model engineering:

A model is a parameterized function through which we can map inputs to outputs. Models are created through a careful, meticulous process of engineering and experimentation: selecting, validating, and evaluating candidate models.

VMRay trains its models through a Machine Learning workflow that ensures explainability for the significant parts of the process contributing to the prediction, such as sample set collection, feature engineering, feature weights, and inference.

In most cases, the success of model engineering is a matter of avoiding overfitting or underfitting and finding the optimal balance between the accuracy and the false positives of the outcome. It’s also essential to create a generalizable model: one that remains almost equally reliable when it encounters a new dataset.

The complexity of the model can be controlled by experimenting with the hyperparameters. In the regression example in the figure below, you can see why an optimal level of model complexity is needed. When the model finds an exact match between input and output in the training dataset (an overfitting example), it is hard to apply that model to external data. On the other hand, if the model is too simple, it will not perform well even on the training data, which is underfitting.

When building a machine learning model, it's important to avoid overfitting and underfitting, and find an optimal point in between.
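
The same effect can be reproduced in a few lines of code. The scikit-learn sketch below is illustrative only; the synthetic data and the polynomial degrees are arbitrary choices, not taken from the figure:

```python
# Under- vs. overfitting with polynomial regression (illustrative sketch).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)  # noisy target

for degree in (1, 4, 15):  # too simple, about right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    mse = mean_squared_error(y, model.predict(X))
    print(f"degree={degree:2d}  training MSE={mse:.3f}")

# The degree-15 model fits the training points almost perfectly
# ("memorizing"), but would generalize poorly to new data; the
# degree-1 model is too simple to fit even the training data.
```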

In short, we need a model that does not “memorize” the mathematical function that links the input data to the prediction; rather, we need it to “learn” the underlying function.

To achieve this, we need a process called “Model Evaluation”: testing and improving the model to see whether it performs well both on the training set and on new data.

This stage is where the quality of the data and the expertise of those who build the model pay off.

For model evaluation, the dataset is divided into two subgroups. The first, generally the bigger group, is the training set; the second is called the validation set. The model is trained to find the function that ties input to output within the training set, and this function is then tested on the validation set, which lies outside the data the model was trained on.

There are different methods to evaluate performance on the validation set and find the optimal balance, such as minimizing the cost function and comparing training accuracy (correct predictions) against validation accuracy. The closer the performance on the training set is to the performance on the validation set, the more generalizable the model.
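
A minimal sketch of this split-train-compare workflow, using scikit-learn with a placeholder dataset and classifier (the sizes and the 80/20 split are arbitrary):

```python
# Evaluating generalization with a train/validation split (sketch).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# A synthetic stand-in for real labeled samples.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# The training set is generally the bigger group (here 80/20).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The closer these two scores, the more generalizable the model.
print(f"training accuracy:   {model.score(X_train, y_train):.3f}")
print(f"validation accuracy: {model.score(X_val, y_val):.3f}")
```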

Accuracy and loss functions show how a model performs on the training set and on the validation set.

Source: An Introduction to Statistical Learning, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani

 

Another very important tool for model evaluation is the Confusion Matrix, which gives both a general overview and a deeper analysis of the balance between accuracy and false positives (FP).

A confusion matrix classifies all outcomes as true positives, false positives, false negatives, and true negatives to evaluate the balance between accuracy and false results.

Confusion Matrix

 

For example, when evaluating a model that predicts whether a sample is malware or not, we can use this tool to calculate:

Recall: How much of the actual malware is caught, calculated by dividing correctly detected malware by all actual malware: Recall = TP / (TP + FN)

Precision: How much of the “positive” predictions are correct, calculated by dividing correctly predicted malware by all predicted malware: Precision = TP / (TP + FP)

Accuracy: How much of all predictions are correct; that is, true predictions divided by all predictions: Accuracy = (TP + TN) / (TP + FP + TN + FN)

Recall, Precision and Accuracy are calculated to evaluate the performance of machine learning models.
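
As a quick worked example, the three formulas above can be computed directly from the four confusion-matrix counts (the counts below are made up for illustration):

```python
# Recall, Precision, and Accuracy from hypothetical confusion-matrix counts.
tp, fp, fn, tn = 90, 5, 10, 895  # made-up malware-detection outcomes

recall = tp / (tp + fn)                      # share of actual malware caught
precision = tp / (tp + fp)                   # share of "malware" verdicts that are right
accuracy = (tp + tn) / (tp + fp + tn + fn)   # share of all predictions that are right

print(f"Recall:    {recall:.3f}")    # 0.900
print(f"Precision: {precision:.3f}") # 0.947
print(f"Accuracy:  {accuracy:.3f}")  # 0.985
```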

 

Output:

Within the VMRay platform, the machine learning module is an additional layer that is fed by the input derived from VMRay’s cutting-edge dynamic analysis technologies. The output of the Machine Learning model then contributes to the unique VTI (VMRay Threat Identifiers) scoring methodology as a separate VTI rule.

Users can view the ML prediction in the detailed VTI list, although the platform does not deliver the ML prediction as a “standalone verdict”. By displaying the ML outcome as one of many identifiers, VMRay balances the ML prediction against the decisions of 20+ unique detection technologies. Users can not only see what the ML engine predicts among the VTIs, but can also adjust the impact of that prediction on the overall verdict.

What it takes to create a reliable model

When we speak about the most important requirement for Artificial Intelligence and Machine Learning, it always comes down to the input data. It’s the data that makes the difference, because data is the raw material of ML.


 

An HFS study shows that 75% of executives do not have a high level of trust in their data [4]. According to another study, from Gartner, 40% of enterprise data is either inaccurate, incomplete, or unavailable [5]. So the first thing needed to create a Machine Learning model is the most trustworthy input, and this is where VMRay excels. VMRay provides the model with the highest quality input in three respects:

 

VERACITY:

“Access to good data is one of the major challenges of AI/ML development,” says Gartner, highlighting the importance of accurate and trustworthy data.

VMRay’s core technologies, and the innovations introduced with every release, enable the platform to see the true face of the enemy. While VMRay Analyzer analyzes a file or URL, the sample displays its genuine behavior, because even the most evasive sample is unaware of being observed. Thus, VMRay brings the most accurate data to the table, which is exactly what is needed to build a reliable Machine Learning model. In short, the input that VMRay uses to feed, train, and validate the model is accurate, noise-free data.

 

SCOPE:

Covid-19 showed us once again that data from the past loses relevance quickly, because the world is changing at an exponential pace. As a result of this ongoing disruptive transformation, a new concept is emerging: “small and wide data”, where “small” refers to the increasing importance of relevant, to-the-point data rather than big volumes of it. As per Gartner, “70% of organizations will shift their focus from big to small and wide data by 2025” [6].

VMRay specializes in what matters most: the types of threats that others miss. These are the unknown, zero-day threats that the cybersecurity world is not yet aware of, and the sophisticated, targeted attacks that use evasive techniques to remain hidden. This means that VMRay’s expertise and data are to-the-point when it comes to detecting “undetectable” threats.

 

RANGE:

To create a reliable and generalizable machine learning model that helps detect new threats, you need data that includes a wide variety of threat types, targets, vectors, and behavior patterns.

VMRay has a diverse client portfolio in terms of verticals, regions, and company sizes. We’re working with top companies: 4 of the top 5 global tech giants, 14 of the Fortune 100 largest companies, and 17 of the world’s most valuable brands. In addition to private companies, VMRay works with more than 50 critical government customers from 17 countries. This adds a huge range to our expertise and know-how.

And VMRay Analyzer logs every necessary detail about the malicious behavior at each step of its execution. This adds enormous breadth to the data.

 

A reliable machine learning model can be created with reliable, relevant, and wide-ranging data.

 

Summary

VMRay Analyzer analyzes samples using cutting-edge advanced detection technologies that observe and report the “actual” behavior of malicious samples and generate accurate, noise-free outputs. This is why leading private and public organizations trust VMRay Analyzer to cover blind spots and to validate the alerts and false positives of their existing security tools and systems.

VMRay’s Machine Learning model works as a module of this already strong technology platform and gets the highest quality input directly from the dynamic analysis: reliable, relevant, and wide-ranging data. These are exactly the qualities needed to create, validate, and evaluate a trustworthy machine learning model.

 

For further information
DOWNLOAD THE WHITEPAPER

Machine Learning with VMRay Analyzer

Machine Learning
in Advanced Threat Detection


Read more about why AI is needed in cyber security, what it takes to create the best machine learning models, and how VMRay’s approach makes a difference.

Download the White Paper

Explore VMRay’s
Cutting-edge Technologies

Explore 20+ unique technologies that enable VMRay to detect unknown threats and sophisticated attacks.

Learn more


Gartner’s Use-Case Prism
for AI in Cybersecurity

Read the blog series exploring how Machine Learning models should be built to bring additional detection capabilities.

View your free Gartner report

Calculate how much malware false positives are costing your organization:
Malware False Positive Cost Calculator