Why do Machine Learning & Big Data play a significant role in modern cybersecurity?

In category CyberSecurity | September 26, 2016

Disclaimer: I’m not an expert in Machine Learning. I’m an absolute beginner who is keen on the field and gradually learning it. This article is just my thoughts on why Machine Learning plays a significant role in modern cybersecurity.

Digital Transformation drives the wave

According to Wikipedia, digital transformation is the change associated with the application of digital technology in all aspects of human society. In other words, it changes how businesses and people act in order to achieve better outcomes. The term is not new; it has been around in information technology for quite a long time. Organizations are embracing digital transformation to change the business landscape, and you hear the related buzzwords every day: BYOD, IoT, Cloud, Big Data & Analytics.


I worked in Singapore for two years and had the chance to see the explosion of BYOD first-hand. On a train, don’t be surprised by how people bow their heads over their mobile devices. They seem not to know each other; even friends seem like strangers. The culture of “Bowing down to the Device” may describe the BYOD trend exactly, in Singapore and in Vietnam as well. One of the world’s most famous slogans, “Connecting People”, may no longer hold true, because nobody actually looks at each other in a natural conversation.


Internet of Things becomes Things of Internet. Everything is going to be connected to the Internet and become much smarter. Imagine that you forget to turn off the air conditioner at home: a smart device can detect this and alert you while triggering a process to turn the air conditioner off. A few car manufacturers are developing smart cars that can drive autonomously even on a zigzag road. A driver now simply needs to keep his hands on the steering wheel and concentrate on driving, while most other actions, such as asking the car to suggest the nearest supermarket or picking up a call, can be handled by smart devices. As IoT arrives, data will grow exponentially, creating a big data revolution. Investing in the infrastructure to handle such so-called big data is extremely hard. Hence, people look at cloud adoption: they effectively rent a provider’s infrastructure to handle big data under a subscription model.

Challenges in cyber-security

The digital era cannot be covered in a single page or a single day; the topic alone could generate a big data set of its own. The wave of digital transformation contributes significantly to a number of different things. First, there is Big Data. Every piece of information is considered data, and data is not just a plain-text file or a structured database of the kind you deal with in your IT life. Data can be a video, a picture, a document or even a voice recording from a conference call. Securing such a diversity of data is a big challenge. Organizations also need to be more open in order to connect with partners and external systems, which inevitably opens more doors to attackers. The working environment is dynamic and becoming unpredictable. To illustrate how challenging modern cybersecurity is, I want to give two examples of common problems today: malnets and Internet fraud.

A malnet (Malware Distribution Network) is a large, dynamic network of malware circulating over the Internet, designed to deliver a massive number of malicious activities to Internet users. The more Internet users there are, the more dynamic the malnet.


The illustration below gives an example of how a typical malnet works. A victim first reaches a website via a search engine result (e.g. Google, Bing). He clicks on an iframe that looks informative (e.g. up-to-date stock information, a weather forecast, an advertisement banner…). The landing page returns an HTTP 3XX response, which performs the first-level referral and redirects the victim to another URL. The victim’s next request hits a group of servers acting as front-end proxies or redirection servers that drive the victim toward the malnet. The redirection may be repeated a few times through several groups of redirection servers before reaching the malnet. Finally, the malnet sends malicious code to the victim, who ends up executing it.
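To make the redirection chain more concrete, here is a minimal sketch in Python. It does not touch the network: the `RESPONSES` table, the `.example` URLs and the `follow_redirects` helper are all hypothetical stand-ins that I made up for illustration, not part of any real malnet or tool.

```python
# Hypothetical redirect table standing in for real HTTP responses.
# Each URL maps to (status_code, redirect_location_or_None).
RESPONSES = {
    "http://landing.example/widget": (302, "http://redir-a.example/r?id=1"),
    "http://redir-a.example/r?id=1": (301, "http://redir-b.example/go"),
    "http://redir-b.example/go": (302, "http://malnet.example/payload"),
    "http://malnet.example/payload": (200, None),
}


def follow_redirects(start_url, responses, max_hops=10):
    """Return the list of URLs visited, starting at start_url."""
    chain = [start_url]
    url = start_url
    for _ in range(max_hops):
        status, location = responses.get(url, (404, None))
        # Stop as soon as the response is not a 3XX redirect.
        if not (300 <= status < 400) or location is None:
            break
        # Stop on redirect loops.
        if location in chain:
            break
        chain.append(location)
        url = location
    return chain


chain = follow_redirects("http://landing.example/widget", RESPONSES)
print(len(chain) - 1, "redirects, final URL:", chain[-1])
```

A security system that only inspects the first URL in such a chain sees nothing suspicious, which is why every hop has to be evaluated.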

Preventing malnet distribution requires the ability to identify polymorphic malware and URLs, because the victim is redirected several times before reaching the malicious code. Today, deciding whether a URL is safe is not easy. A URL reputation system has to be implemented to filter out all the malicious URLs. This sounds easy, doesn’t it? But can you imagine how many safe URLs you would have to add to your URL reputation system?

The next example is online payment with a payment card (mostly credit cards). People shop online and use their credit cards to process payments quickly. The more people pay online, the wider the door opens for attackers who try to grab your card information. The application must be able to verify whether a transaction is fraudulent. In the past, there was not much information available, something like:


Looking at the table above, the rule seems to be: if the transaction amount is larger than 2,000 USD and the name starts with the letter “C”, then set fraudulent = TRUE. But suppose that today the information collected looks like the one below:


The rule for identifying fraud becomes more complicated than in the first sample. The same rule, that anybody whose name starts with “C” and who pays over 2,000 USD is a case of fraud, should no longer apply.
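As a rough sketch of why such a hand-written rule breaks down, here it is in Python, applied to a few transactions. The transactions, the field names and the `is_fraud_static` function are hypothetical and invented purely for illustration; they are not taken from any real data set.

```python
# Hypothetical transactions; names, amounts and labels are made up.
transactions = [
    {"name": "Carol", "amount_usd": 2500, "actually_fraud": False},
    {"name": "Dave", "amount_usd": 150, "actually_fraud": True},
    {"name": "Chris", "amount_usd": 3200, "actually_fraud": True},
]


def is_fraud_static(txn):
    """The hand-written rule: amount over 2,000 USD and name starts with 'C'."""
    return txn["amount_usd"] > 2000 and txn["name"].startswith("C")


for txn in transactions:
    flagged = is_fraud_static(txn)
    verdict = "correct" if flagged == txn["actually_fraud"] else "wrong"
    print(txn["name"], "flagged:", flagged, "->", verdict)
```

In this made-up sample the rule wrongly flags a legitimate large purchase and misses a small fraudulent one, which is the kind of gap a fixed rule cannot close once the data gets richer.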

From these two examples, you can see that the big challenge is to identify and predict patterns (in both malnets and fraud) that have never been seen before. How does your system know whether a request sent to your web application contains malicious code? Signature-based classification is useful, but it is not easy to maintain. Detecting malicious requests sent to your web application from the Internet is another example of the challenge. Everything is dynamic and polymorphic, and payloads keep changing the way they work. In a nutshell, detection and prevention in modern cybersecurity are harder than ever!

Machine Learning approach

Take the first example, the malnet: there should be a mechanism that lets your firewall/security system learn whether a URL is malicious or benign. This is where machine learning comes in. In the case of building a URL reputation system for an application (let’s say one that prevents malware), a pre-classified data set is used to train the system to become smarter at telling whether a URL is malicious or benign. Features are then extracted from this set of pre-classified URLs based on specific characteristics of a URL, for example the number of characters and the number of special (non-alphabetical) characters in it. In theory, it sounds simple, but in fact it is complicated. There are a number of steps to produce and train the detection & rating model with the training set, and the selection of the machine learning algorithm is the key to success.
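A minimal sketch of the two features mentioned above, the number of characters and the number of non-alphabetical characters, could look like this in Python. The function name `url_features`, the extra `special_ratio` field and the sample URL are my own assumptions for illustration; a real reputation system would use many more features.

```python
def url_features(url):
    """Extract simple lexical features from a URL string."""
    length = len(url)
    # Count every character that is not a letter (digits, '/', '.', '?', ...).
    special = sum(1 for ch in url if not ch.isalpha())
    ratio = special / length if length else 0.0
    return {"length": length, "special_chars": special, "special_ratio": ratio}


features = url_features("http://example.com/a")
print(features)
```

Each URL in the pre-classified set would be turned into such a feature vector before being handed to a learning algorithm.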

For online payment, fraud detection is maturely implemented by most big online payment providers such as PayPal, Visa and MasterCard. With Machine Learning, the first step is data selection. The more data you have, the better the result, but selecting the right data to process is very important: for example, the time when someone uses a credit card, the location where it is used, and even the cardholder’s age and sex.


The PROS of Machine Learning in this case are that it can effectively learn complex fraud patterns and can handle large volumes of data with big data technology. Moreover, with machine learning techniques, the application can predict new types of fraud that the traditional expert-driven approach could not. The CONS, however, are that the sample data and training sets needed for effective detection are still insufficient today, which may result in a high rate of false positives.
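To put a number on "false positives", one common measure is the false positive rate: the share of legitimate transactions that the model wrongly flags as fraud. Here is a small Python sketch; the labels and predictions are made up for illustration and `false_positive_rate` is just a helper name I chose.

```python
def false_positive_rate(y_true, y_pred):
    """FPR = false positives / (false positives + true negatives)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return fp / (fp + tn) if (fp + tn) else 0.0


# Made-up example: 1 = fraud, 0 = legitimate.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 1, 0]

rate = false_positive_rate(y_true, y_pred)
print("False positive rate:", rate)
```

Here two of the four legitimate transactions are flagged, so half of the honest customers would be bothered by the system. That is the practical cost of training on too little data.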

Per my research, there are a number of machine learning algorithms recommended for such cybersecurity use cases:

  • Random Forest
  • Expectation-Maximization
  • Naïve Bayes
  • K-Means
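To give a feel for one item on this list, below is a small Gaussian Naïve Bayes classifier written in plain Python and trained on URL features of the form [length, number of special characters]. This is only a sketch under my own assumptions: the training data is made up, the function names are hypothetical, and in practice you would more likely rely on an existing machine learning library than write this by hand.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the prior, feature means and feature variances per class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log-posterior for one sample."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            # Log of the Gaussian density for this feature value.
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training set: [url_length, special_char_count].
X_train = [[20, 5], [25, 6], [18, 4], [30, 7],
           [80, 25], [95, 30], [70, 22], [110, 35]]
y_train = ["benign"] * 4 + ["malicious"] * 4

model = fit_gaussian_nb(X_train, y_train)
print(predict(model, [22, 5]))
print(predict(model, [90, 28]))
```

The toy data is deliberately easy to separate. Real URLs overlap far more, which is where feature selection and the choice of algorithm start to matter.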

Machine Learning often comes with big data & analytics to offer an end-to-end solution. Without big data & analytics, applied machine learning doesn’t make much sense.


Machine Learning does play a significant role in modern cybersecurity. In real-world scenarios, Machine Learning is used to detect fraud, to identify suspicious domains in order to prevent malware, to classify anomalous requests and activities in SIEM (Security Information and Event Management), to model threats, and more. You can read about more ideas here: http://www.mlsecproject.org/#open-source-projects

Today there are a number of big vendors providing Machine Learning and Big Data Analytics products that you can experiment with, including:

  • SAS Analytics Suite
  • RapidMiner Studio
  • Alteryx Analytics
  • SAP Predictive Analysis
  • Oracle Advanced Analytics
  • Microsoft Azure Cortana Intelligence
  • Google Machine Learning
