How Does Data Science Help In Cyber Security?

  • August 6, 2021

Data science brings a logical structure to unstructured data. Data scientists use machine learning or deep learning algorithms to compare normal and abnormal patterns. In cybersecurity, data science helps security teams distinguish potentially malicious network traffic from safe traffic.

Applications of data science in cybersecurity are relatively new. Many companies still rely on traditional measures such as legacy antivirus software and firewalls. This article reviews the relationship between data science and cybersecurity and the most common use cases.

Cybersecurity Before Data Science

Large organizations have a lot of data moving throughout their network. The data can originate from internal computers, IT systems, and security tools. However, these endpoints do not communicate with each other. The security technology responsible for detecting attacks cannot always see the overall picture of threats.

Before the adoption of data science, most large organizations used the Fear, Uncertainty, and Doubt (FUD) approach in cybersecurity. Information security strategy rested on FUD-driven assumptions about where and how attackers might strike.

With the help of data science, security teams can translate technical risk into business risk with data-driven tools and methods. Ultimately, data science has enabled the cybersecurity industry to move from assumptions to facts.

The Relationship Between Data Science and Cybersecurity

The goal of cybersecurity is to stop intrusions and attacks, identify threats like malware, and prevent fraud. Data science uses Machine Learning (ML) to identify and prevent these threats. For instance, security teams can analyze data from a wide range of samples to identify security threats. The purpose of this analysis is to reduce false positives while identifying intrusions and attacks.

Security technologies like User and Entity Behavior Analytics (UEBA) use data science techniques to identify anomalies in user behavior that may be caused by an attacker. Usually, there is a correlation between abnormal user behavior and security attacks. These techniques can paint a bigger picture of what is going on by connecting the dots between these abnormalities. The security team can then take proper preventative measures to stop the intrusion.

The process is the same for preventing fraud. Security teams detect abnormalities in credit card purchases by using statistical data analysis. The analyzed information is then used to identify and prevent fraudulent activity.
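As a rough illustration of the statistical idea, the sketch below flags purchases whose amount lies far from a cardholder's historical mean, measured in standard deviations (a z-score). The function name, the sample amounts, and the threshold of 3 are assumptions made for this example; real fraud systems combine many more signals.

```python
from statistics import mean, stdev


def flag_unusual_purchases(history, new_amounts, threshold=3.0):
    """Return the new amounts whose z-score against the history exceeds threshold."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Every past purchase was identical, so any different amount is unusual.
        return [amount for amount in new_amounts if amount != mu]
    return [amount for amount in new_amounts if abs(amount - mu) / sigma > threshold]


# Hypothetical purchase history for one cardholder.
history = [23.5, 41.0, 18.2, 36.9, 27.4, 30.1, 22.8, 45.3]
print(flag_unusual_purchases(history, [29.99, 950.0]))  # prints [950.0]
```

A fixed z-score threshold is the simplest possible choice; it assumes purchase amounts are roughly stable over time, which is often not true in practice.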

How Data Science Has Changed Cybersecurity

Data science has had a profound effect on cybersecurity. This section explains its key impacts on the field.

Intrusion Detection and Prediction

Security professionals and hackers have always played a game of cat and mouse. Attackers constantly improved their intrusion methods and tools, whereas security teams improved detection systems based on known attacks. Attackers always had the upper hand in this situation.

Data science techniques use both historical and current information to predict future attacks. In addition, machine learning algorithms can improve an organization’s security strategy by spotting vulnerabilities in the information security environment.

Establishing DevSecOps cycles

DevOps pipelines ensure a constant feedback loop by maintaining a culture of collaboration. DevSecOps adds a security element to DevOps teams. A DevSecOps professional will first identify the most critical security challenge and then establish a workflow based on that.

Data scientists are already familiar with DevOps practices because they use automation in their workflows. As a result, DevSecOps can easily be applied to data science in a process called DataSecOps. This type of agile methodology enables data scientists to promote security and privacy continuously.

Behavioral analytics

Traditional antivirus tools and firewalls match signatures from previous attacks to detect intrusions. Attackers can easily evade these legacy technologies by using new types of attacks.

Behavioral analytics tools like User and Entity Behavior Analytics (UEBA) use machine learning to detect anomalies and potential cyberattacks. If, for example, a hacker stole your username and password, they might be able to log into your system. However, it would be much harder for them to mimic your behavior.
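The stolen-password scenario can be sketched with a very small behavioral baseline. The class below remembers the login hours and hosts seen for each user and scores a new login by how many of those features are unfamiliar. The class name, the two features, and the scoring are assumptions chosen for illustration; commercial UEBA products model far richer behavior with learned models.

```python
from collections import defaultdict


class LoginBaseline:
    """Toy per-user baseline of login hours and hosts (hypothetical example)."""

    def __init__(self):
        self.hours = defaultdict(set)
        self.hosts = defaultdict(set)

    def observe(self, user, hour, host):
        """Record one legitimate login for the user."""
        self.hours[user].add(hour)
        self.hosts[user].add(host)

    def anomaly_score(self, user, hour, host):
        """Count how many features of this login were never seen for the user (0 to 2)."""
        score = 0
        if hour not in self.hours.get(user, set()):
            score += 1
        if host not in self.hosts.get(user, set()):
            score += 1
        return score


baseline = LoginBaseline()
for hour in range(9, 18):                      # office-hours logins
    baseline.observe("alice", hour, "laptop-01")

print(baseline.anomaly_score("alice", 10, "laptop-01"))     # prints 0
print(baseline.anomaly_score("alice", 3, "unknown-host"))   # prints 2
```

Even with valid credentials, the second login stands out because neither the hour nor the host matches what was observed before.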

Data protection with Association Rule Learning

Association Rule Learning (ARL) is a machine learning method for discovering relations between items in large databases. The most typical example is market basket analysis, in which ARL reveals which items people most frequently buy together. For example, a combination of onions and meat may be associated with a burger.
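The shopping example above can be made concrete with the two standard measures used to rank such rules: support (how often an item set appears) and confidence (how often the consequent appears when the antecedent does). The baskets below are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical shopping baskets.
baskets = [
    {"onions", "meat", "burger"},
    {"onions", "meat", "burger"},
    {"onions", "meat"},
    {"milk", "bread"},
    {"onions", "bread"},
]

print(support(baskets, {"onions", "meat"}))                  # 3 of 5 baskets
print(confidence(baskets, {"onions", "meat"}, {"burger"}))   # 2 of those 3 baskets
```

Algorithms such as Apriori search the space of item sets efficiently; the functions here only score a rule that has already been proposed.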

ARL techniques may also recommend data protection measures. An ARL system studies the characteristics of existing data and raises an alert automatically when it detects unusual characteristics. The system constantly updates itself so that it can detect even slight deviations in the data.

Backup and data recovery

New backup technologies are leveraging machine learning to automate repetitive backup and recovery tasks. Machine learning algorithms are trained to follow the priorities and requirements of security plans.

Backup and recovery systems based on ML can help incident response teams organize workspaces and resources. For example, ML tools can assess and recommend the necessary equipment and locations for a particular business recovery plan based on the company’s needs.

Static Analyzers

Static analyzers for identifying programming errors have been an area of research since at least the 1970s, when publications such as Johnson 1978 led to the development of Lint, the first C source code checker. Since the creation of Lint, programmers have created a variety of static analysis tools, some of which focus on security bugs, including open source tools like FindBugs and SPLint and commercial tools like Coverity Prevent and Fortify SCA. Compilers and IDEs like clang, gcc, and Visual Studio have begun to incorporate simple static analysis to detect and warn about common defects and vulnerabilities. GitHub recently introduced a static analysis feature that lets developers check code in their repositories for specific defects using a query language called CodeQL.

Vulnerabilities reported by static analysis tools are very often false positives, in which the tool reports a security vulnerability when none exists in the code. False negatives, in which the tool fails to report a security vulnerability that does exist in the code, are of course unbounded in number. The most common complaint developers have about static analysis tools is that they produce too many false positives. Dynamic analysis tools are also problematic, among other reasons because they depend on environmental particulars that change over time, and because they typically reduce performance and add code complexity in the operating environment.

Static analysis tools build a model of the program from the source code, then analyze that model using a database of rules to identify potential defects and vulnerabilities. While the open source tools typically perform local analysis of individual functions, commercial tools frequently use interprocedural analysis of the interaction between functions. To handle software libraries, static analysis tools either need to analyze their source code or have a set of rules specified for the functions in that library, indicating which functions are sources of potentially dangerous input, which filter input to make it safe, and which are potentially dangerous sinks that will result in a vulnerability if they receive dangerous input.
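The source, filter, and sink rules can be sketched as a toy taint checker built on Python's standard `ast` module. It handles only top-level assignments and calls in straight-line code, and the function names `read_request`, `escape`, and `run_query` are hypothetical stand-ins for a library's rule set, not real APIs.

```python
import ast

SOURCES = {"read_request"}   # hypothetical: returns untrusted input
SANITIZERS = {"escape"}      # hypothetical: filters input to make it safe
SINKS = {"run_query"}        # hypothetical: dangerous if given tainted input


def call_name(node):
    """Return the function name for simple calls like f(x), else None."""
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        return node.func.id
    return None


def is_tainted(expr, tainted):
    """Decide whether an expression may carry untrusted data."""
    name = call_name(expr)
    if name in SOURCES:
        return True
    if name in SANITIZERS:
        return False
    if isinstance(expr, ast.Name):
        return expr.id in tainted
    return any(is_tainted(child, tainted) for child in ast.iter_child_nodes(expr))


def find_tainted_sinks(source_code):
    """Return line numbers where a sink receives possibly tainted input."""
    tainted, findings = set(), []
    for stmt in ast.parse(source_code).body:  # top-level statements, in order
        if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            target = stmt.targets[0].id
            if is_tainted(stmt.value, tainted):
                tainted.add(target)
            else:
                tainted.discard(target)
        elif isinstance(stmt, ast.Expr) and call_name(stmt.value) in SINKS:
            if any(is_tainted(arg, tainted) for arg in stmt.value.args):
                findings.append(stmt.lineno)
    return findings


sample = "\n".join([
    "user = read_request()",
    "safe = escape(user)",
    'run_query("SELECT * FROM t WHERE name = " + safe)',
    'run_query("SELECT * FROM t WHERE name = " + user)',
])
print(find_tainted_sinks(sample))  # prints [4]
```

Only the fourth line is reported, because the third passes the input through the sanitizer first. Branches, loops, and calls between functions are ignored here, which is exactly where real tools need the heavier machinery the article goes on to describe.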

Models generated by static analysis tools typically include control flow and data flow graphs. Complex programs often have too many potential control flow paths to analyze in a reasonable amount of time, so these static analyzers use pruning algorithms to limit the number of paths examined. However, the heuristics used by such algorithms are fallible and cause static analysis tools to generate false negative and false positive results. This means that there is a tradeoff between the precision and the scalability of static analysis tools. While small, toy programs could be analyzed precisely, static analysis of large, real world programs must sacrifice precision to analyze the code in a reasonable time.
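To see why the number of paths grows so quickly, consider a control flow graph for a chain of independent if/else statements: each one doubles the path count. The sketch below builds such a graph and enumerates its paths, optionally stopping at a fixed budget. The path budget is a deliberately crude stand-in for the pruning heuristics mentioned above, and all names are invented for the example.

```python
def build_diamond_chain(n):
    """Control flow graph for n (n >= 1) sequential if/else statements."""
    graph = {}
    for i in range(n):
        following = f"cond{i + 1}" if i + 1 < n else "exit"
        graph[f"cond{i}"] = [f"then{i}", f"else{i}"]
        graph[f"then{i}"] = [following]
        graph[f"else{i}"] = [following]
    graph["exit"] = []
    return graph


def enumerate_paths(graph, start, budget=None):
    """Depth-first path enumeration that stops once `budget` paths are found."""
    paths = []
    stack = [(start, [start])]
    while stack:
        if budget is not None and len(paths) >= budget:
            break  # pruning: the remaining paths are never examined
        node, path = stack.pop()
        successors = graph[node]
        if not successors:
            paths.append(path)
        for successor in successors:
            stack.append((successor, path + [successor]))
    return paths


cfg = build_diamond_chain(10)
print(len(enumerate_paths(cfg, "cond0")))              # prints 1024
print(len(enumerate_paths(cfg, "cond0", budget=100)))  # prints 100
```

With the budget in place, 924 of the 1024 paths are skipped, so a defect that only appears on one of them would become a false negative.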

While many tools are available and much research has been applied to this field, the nature of the problem ensures that no static analyzer will be perfectly effective. According to Rice’s theorem, every non-trivial semantic property of programs (that is, of the partial functions they compute) is undecidable. The consequence of this theorem is that all static analysis tools are inherently imperfect in their ability to diagnose vulnerabilities. Because there is a wealth of contextual data built into the code itself, recent advances in machine learning can be applied to improve the classification accuracy of vulnerable software components.

Unresolved Issues and Machine Learning Powered Advancements

The complexity of the problem ensures that unresolved issues remain in the field, and this is where statistical analysis and machine learning will eventually provide improvements.

  1. Ground truth of vulnerability
  2. Effective classification of code elements

The most difficult issue in approaching cybersecurity from the standpoint of data science is framing the problem and having assurance of high quality input data. A model cannot be trained correctly if the Type I (false positive) or Type II (false negative) error rates in its target labels are too high, and the token elements of the code also need to be categorized accurately before they can be used as input. The code inputs need to be structured in a consistent, tabular way to facilitate analysis.

To address the problem of assured ground truth in the data sets, Walden (2014) compiled a high quality data set of vulnerabilities in PHP web applications. This data set includes 223 confirmed vulnerability instances, including multiple examples of code injection, CSRF, XSS, path traversal, and authorization issues. The authors used this data set to test the effectiveness of text analysis as a means of classification, and found it to be more effective than using software metrics to predict which files contained vulnerabilities.
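A minimal sketch of how source files might be turned into tabular text features is shown below: each file becomes a row of token counts over a shared vocabulary, which a classifier could then consume. This is a generic bag-of-tokens illustration and is not claimed to reproduce the cited study's pipeline; the tokenizer pattern and the two PHP snippets are assumptions.

```python
import re
from collections import Counter

# Simplified tokenizer: identifiers and PHP-style $variables only.
TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_]*")


def token_counts(source):
    """Count identifier-like tokens in one source file."""
    return Counter(TOKEN_RE.findall(source))


def to_feature_rows(files):
    """Turn {filename: source} into (vocabulary, {filename: count row})."""
    counts = {name: token_counts(src) for name, src in files.items()}
    vocabulary = sorted(set().union(*counts.values()))
    rows = {name: [c[tok] for tok in vocabulary] for name, c in counts.items()}
    return vocabulary, rows


# Hypothetical files: the first echoes raw input, the second escapes it.
files = {
    "a.php": 'echo $_GET["name"];',
    "b.php": 'echo htmlspecialchars($_GET["name"]);',
}
vocabulary, rows = to_feature_rows(files)
print(vocabulary)  # prints ['$_GET', 'echo', 'htmlspecialchars', 'name']
print(rows)        # prints {'a.php': [1, 1, 0, 1], 'b.php': [1, 1, 1, 1]}
```

The only column that separates the two files is the count for `htmlspecialchars`, which hints at how token frequencies alone can carry a signal about vulnerable code.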

In the closely related realm of program synthesis, Allamanis uses graph neural networks to train a model to identify misused variables. (Miltiadis Allamanis, 2018) describes how to construct graphs from source code and evaluates the method on two tasks: predicting the name of a variable given its usage, and selecting the correct variable for a given token slot in the program. This research represents source code as graphs with different edge types (such as Child, NextToken, LastRead, LastWrite, and ComputedFrom) that describe syntactic and semantic relationships between tokens. The graph representation of the program instantiates conditions based on its Abstract Syntax Tree and updates all edges simultaneously based on the state information of the node feature vectors. Following on this work, Allamanis (R. Shin, 2019) used a similar graph network approach for code hole completion. The model uses a general graph-based generative procedure to construct the Abstract Syntax Tree sequentially, expanding one node at a time as guided by the grammar of the target programming language.
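Two of the edge types named above, Child and NextToken, are simple enough to sketch with Python's standard `ast` and `tokenize` modules. This is a simplified illustration applied to Python source, not the construction used in the cited work; the data flow edges (LastRead, LastWrite, ComputedFrom) are omitted, and the function names are invented.

```python
import ast
import io
import tokenize

KEPT_TOKENS = (tokenize.NAME, tokenize.OP, tokenize.NUMBER, tokenize.STRING)


def child_edges(tree):
    """Return node labels and (parent_index, child_index) Child edges."""
    # Load/Store context markers carry no structure, so they are skipped.
    nodes = [n for n in ast.walk(tree) if not isinstance(n, ast.expr_context)]
    index = {id(n): i for i, n in enumerate(nodes)}
    labels = [type(n).__name__ for n in nodes]
    edges = [(index[id(parent)], index[id(child)])
             for parent in nodes
             for child in ast.iter_child_nodes(parent)
             if id(child) in index]
    return labels, edges


def next_token_edges(source):
    """Return (token, following_token) pairs in source order."""
    reader = io.StringIO(source).readline
    tokens = [t.string for t in tokenize.generate_tokens(reader)
              if t.type in KEPT_TOKENS]
    return list(zip(tokens, tokens[1:]))


def build_graph(source):
    """Combine both edge types into one dictionary."""
    labels, children = child_edges(ast.parse(source))
    return {"labels": labels, "Child": children,
            "NextToken": next_token_edges(source)}


graph = build_graph("x = a + 1\n")
print(graph["labels"][:4])   # prints ['Module', 'Assign', 'Name', 'BinOp']
print(graph["NextToken"])    # prints [('x', '='), ('=', 'a'), ('a', '+'), ('+', '1')]
```

A graph neural network would attach a feature vector to each node and pass messages along these typed edges; that learning step is outside the scope of this sketch.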

The opportunity for machine learning to identify vulnerabilities more consistently lies in deciphering the pruning heuristics currently applied by expert human coders. A machine learning method that reverse engineers these rules and applies them consistently, as discovered over large data sets, should improve on the state of the art, which is too dependent on humans and may incorrectly limit the control flow graph search space. Detailed statistical research on programming element tokens, together with intermediate linguistic structures that abstract away differences in source code language syntax, could perhaps help identify common graph patterns among vulnerabilities.


Cyberattacks are always evolving, and no one knows what form they will take in the future. Data science enables companies to predict possible future threats based on historical data with technologies like UEBA. Intrusion Detection Systems (IDS) use regression models to predict potential malicious attacks. Data science can leverage the power of data to create stronger protection against cyberattacks and data loss.

Happy Learning 🙂