Canadian Institute for Cybersecurity

CIC APT IIoT Dataset 2024

CIC Advanced Persistent Threat Dataset for IIoT 2024 (CICAPT-IIoT2024)

The main goal of this research is to provide cybersecurity researchers working on APT detection with a dataset collected during an APT campaign in an industrial internet of things (IIoT) environment. To achieve this, we designed an attack scenario based on the operations of the APT29 attack group. We implemented the attacks in an IIoT environment and collected both provenance logs and network traffic data.

The main contributions of our research are as follows:

  • Creating a novel and comprehensive APT attack dataset captured within the IIoT environment. This dataset is generated using a hybrid testbed consisting of real and simulated IIoT components to demonstrate the complexity and diversity of modern technology systems;

  • The dataset contains more than 20 distinct attack techniques, grouped into eight main attack tactics, that map onto an APT attack scenario inspired by APT29 campaigns. This APT scenario enhances the dataset’s effectiveness in APT detection research;

  • To evaluate the effectiveness of machine learning (ML) algorithms in APT-detection tasks, we applied several ML models on the CICAPT-IIoT dataset and analyzed their performance using a provenance-based detection framework.

CICAPT-IIoT2024 dataset testbed

We have developed a simulation testbed to create a controlled environment that supports IIoT research and is particularly effective for simulating APT scenarios. This testbed, built on the Brown-IIoTbed framework architecture, incorporates a mix of virtual and physical components to accurately reflect the complexity and dynamics of real-world IIoT systems.

At the core of our testbed is the NS3 network simulator, which operates on an Ubuntu host. The setup includes two Ubuntu virtual machines and two Kali Linux virtual machines, all hosted on a system running NS3. Additionally, the testbed features two Raspberry Pi devices and two IoT sensors. Raspberry Pi1 is equipped with OpenPLC and utilizes the Modbus protocol for communication, while Raspberry Pi2 functions as a WiFi access point to facilitate enhanced connectivity for the IoT sensors.

Dataset description

The dataset contains two data types, provenance data and network logs, and each was collected during two phases of the experiment. The provenance data files are in CSV format and contain the nodes and edges of the provenance graph. Each node in the provenance data is assigned a unique 32-digit ID, which the edge entries use to establish connections between nodes in the graph.

Besides the IDs, the provenance data files comprise 32 features in total. However, due to the heterogeneous nature of nodes and edges that are all in a single file, not all features apply to every node or edge type, resulting in many fields being populated with NaN values. The provenance data includes two main node types: Process and Artifact. The Artifact node type is further categorized into various subtypes such as file, directory, network socket, link, and unknown, the latter being used for provenance node types that do not fit into the existing subtypes. The common edge types in the provenance graph are: "Used" (from Process to Artifact), "WasGeneratedBy" (WGB; from Artifact to Process), "WasTriggeredBy" (WTB; from Process to Process), and "WasDerivedFrom" (WDF; from Artifact to Artifact).
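As an illustration of how a single file holding both nodes and edges might be turned into a graph, the following is a minimal sketch using only the Python standard library. The column names (id, type, from, to), the rule for telling node rows from edge rows, and the short IDs in the sample are assumptions made for this example; they are not taken from the dataset files, so check the real CSV header before adapting it. The last function flags edges whose endpoints do not match the four edge directions listed above.

```python
import csv
import io
from collections import defaultdict

# Expected (source type, destination type) for the four common edge types.
EDGE_ENDPOINTS = {
    "Used": ("Process", "Artifact"),
    "WasGeneratedBy": ("Artifact", "Process"),
    "WasTriggeredBy": ("Process", "Process"),
    "WasDerivedFrom": ("Artifact", "Artifact"),
}


def is_missing(value):
    """Treat empty cells and literal NaN strings as missing values."""
    return value is None or value.strip() in ("", "NaN", "nan")


def load_provenance(handle):
    """Split a mixed node/edge CSV into a node table and an edge list."""
    nodes = {}   # node id -> node type
    edges = []   # (source id, destination id, edge type)
    for row in csv.DictReader(handle):
        # Assumption: only edge rows fill the hypothetical 'from'/'to' columns.
        if is_missing(row.get("from")) or is_missing(row.get("to")):
            nodes[row["id"]] = row["type"]
        else:
            edges.append((row["from"], row["to"], row["type"]))
    return nodes, edges


def build_adjacency(edges):
    """Map each source node to its outgoing (destination, edge type) pairs."""
    adjacency = defaultdict(list)
    for src, dst, etype in edges:
        adjacency[src].append((dst, etype))
    return adjacency


def mismatched_edges(nodes, edges):
    """Return edges whose endpoint types contradict EDGE_ENDPOINTS."""
    bad = []
    for src, dst, etype in edges:
        expected = EDGE_ENDPOINTS.get(etype)
        if expected is not None and (nodes.get(src), nodes.get(dst)) != expected:
            bad.append((src, dst, etype))
    return bad


# Made-up sample; real node IDs are 32 digits long.
SAMPLE = (
    "id,type,from,to\n"
    "a1,Process,,\n"
    "b2,Artifact,,\n"
    ",Used,a1,b2\n"
    ",WasGeneratedBy,b2,a1\n"
    ",WasTriggeredBy,a1,b2\n"
)

nodes, edges = load_provenance(io.StringIO(SAMPLE))
adjacency = build_adjacency(edges)
print(mismatched_edges(nodes, edges))
```

In the sample, the third edge is deliberately wrong (a WasTriggeredBy edge pointing at an Artifact), so it is the only one reported. A check of this kind can be a quick sanity test after loading a phase file.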

The other data type in the dataset is the network logs, captured using NS3 during the experiments and stored in pcap format. These pcap files can be converted to CSV format so that various packet features can be extracted from them. The last file in the dataset is the Attack Information file, which contains all necessary information about the attacks performed during phase 2 of the experiments. This information includes the attack time, the attack PID, and the attack category. This file helps researchers analyze how the dataset behaves during the attacks.
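One way the attack time, PID, and category could be used is to label provenance events. The sketch below assumes hypothetical column names (time, pid, category), ISO 8601 timestamps, and an arbitrary 60-second matching window; none of these come from the dataset itself, and the sample rows are invented for the example.

```python
import csv
import io
from datetime import datetime, timedelta


def load_attacks(handle):
    """Read attack records from a CSV with hypothetical column names."""
    attacks = []
    for row in csv.DictReader(handle):
        attacks.append({
            "time": datetime.fromisoformat(row["time"]),
            "pid": row["pid"],
            "category": row["category"],
        })
    return attacks


def label_event(event_time, event_pid, attacks, window=timedelta(seconds=60)):
    """Return the category of the first matching attack, else 'benign'.

    An event matches when it has the attack's PID and falls within
    `window` after the attack time (the window length is an assumption).
    """
    for attack in attacks:
        same_pid = event_pid == attack["pid"]
        in_window = attack["time"] <= event_time <= attack["time"] + window
        if same_pid and in_window:
            return attack["category"]
    return "benign"


# Invented sample rows, for illustration only.
SAMPLE_ATTACKS = (
    "time,pid,category\n"
    "2024-01-01T10:00:00,4242,Discovery\n"
    "2024-01-01T10:05:00,4300,Collection\n"
)

attacks = load_attacks(io.StringIO(SAMPLE_ATTACKS))
print(label_event(datetime(2024, 1, 1, 10, 0, 30), "4242", attacks))
```

Matching on both PID and time guards against PID reuse over a long capture; whether that matters for this dataset depends on how the real file records its timestamps.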

Dataset directories

  • Provenance logs: This directory contains CSV files of the labelled provenance logs from each experiment phase. Each CSV file includes all provenance entities, both nodes and edges. There is also a sample provenance subgraph, extracted from the phase 2 provenance graph, that gives an overview of what the graphs look like.

  • Network traffic: This directory contains the pcap files documenting the network traffic captured during the experiments. It contains both the individual pcap files captured by NS3 and, for each phase, a merged pcap file that combines all packet logs into a single file. Additionally, there is a CSV file that contains the packet features extracted using the pcap2csv.py script, offering a structured view of network activity.

  • Supplementary material: In this directory we share some of the source code used to generate and process the dataset. pcap2csv.py is the script used to generate the network traffic CSV files. Sample_analysis.ipynb is a notebook showing a simple way to create the provenance graph and use node2vec to create node embeddings. This folder also contains Attack_info.csv, which provides information about the attack steps in the APT campaign. This file is extracted from MITRE Caldera reports.
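To give a sense of what converting a pcap file to CSV involves, here is a minimal standard-library sketch. It is not the pcap2csv.py script from the supplementary material, whose feature set is not described on this page. It reads only the classic pcap container (a 24-byte global header followed by 16-byte record headers) with microsecond timestamps, and writes one row per packet with the timestamp and lengths. The output column names are chosen for this example.

```python
import csv
import io
import struct

PCAP_MAGIC_USEC = 0xA1B2C3D4  # classic pcap with microsecond timestamps


def read_pcap_records(data):
    """Yield (timestamp, captured_len, original_len) for each packet record."""
    if len(data) < 24:
        raise ValueError("too short for a pcap global header")
    # The magic number reveals the byte order the file was written in.
    if struct.unpack("<I", data[:4])[0] == PCAP_MAGIC_USEC:
        endian = "<"
    elif struct.unpack(">I", data[:4])[0] == PCAP_MAGIC_USEC:
        endian = ">"
    else:
        raise ValueError("not a microsecond-resolution classic pcap file")
    offset = 24  # skip the global header
    while offset + 16 <= len(data):
        ts_sec, ts_usec, incl_len, orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16])
        yield ts_sec + ts_usec / 1_000_000, incl_len, orig_len
        offset += 16 + incl_len  # record header plus captured bytes


def records_to_csv(data):
    """Render the per-packet records as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["timestamp", "captured_len", "original_len"])
    for ts, incl_len, orig_len in read_pcap_records(data):
        writer.writerow([f"{ts:.6f}", incl_len, orig_len])
    return out.getvalue()


def make_sample_pcap():
    """Build a tiny in-memory pcap with two dummy packets (made-up data)."""
    header = struct.pack("<IHHiIII", PCAP_MAGIC_USEC, 2, 4, 0, 0, 65535, 1)
    first = struct.pack("<IIII", 10, 500000, 4, 4) + b"\x00" * 4
    second = struct.pack("<IIII", 11, 0, 2, 60) + b"\x00" * 2
    return header + first + second


print(records_to_csv(make_sample_pcap()))
```

Extracting protocol-level features such as addresses, ports, or flags would require decoding the packet bytes as well, which is usually delegated to a packet-parsing library rather than written by hand.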

APT attack phases and techniques used in the dataset

Tactic | Technique ID | Attack Type | APT Group
Collection | T1074 | Data Staged: Local Data Staging | APT28, APT29, APT39, APT3
Collection | T1005 | Data from Local System | Andariel, APT28, APT29
Collection | T1119 | Automated Collection | APT1, APT28, Chimera
Collection | T1113 | Screen Capture | APT28, APT39, Carbanak
Collection | T1115 | Clipboard Data | APT29, APT29, APT38
Exfiltration | T1560 | Archive Collected Data: Archive via Utility | APT28, APT29, APT32
Exfiltration | T1041 | Exfiltration Over C2 Channel | Lazarus, APT3, APT32
Command and Control | T1105 | Ingress Tool Transfer | Lazarus, APT29, APT3
Persistence | T1546 | Event Triggered Execution | APT28, APT29, APT3
Persistence | T1136 | Create Account: Local Account | Dragonfly, FIN13, APT29
Discovery | T1087 | Account Discovery: Local Account | APT1, APT3, Chimera
Discovery | T1016 | System Network Configuration Discovery: Internet Connection Discovery | FIN13, Gamaredon, APT29
Discovery | T1016 | System Network Configuration Discovery: Wi-Fi Discovery | Magic Hound, Wizard Spider
Discovery | T1033 | System Owner/User Discovery | Chimera, Dragonfly, APT3
Discovery | T1518 | Software Discovery | HEXANE, MuddyWater
Discovery | T1069 | Permission Groups Discovery: Local Groups | Chimera, HEXANE, APT29
Discovery | T1082 | System Information Discovery | Chimera, APT3, APT32
Discovery | T1083 | File and Directory Discovery | APT28, APT29, APT32
Discovery | T1018 | Remote System Discovery | Chimera, APT29, APT32
Credential Access | T1552 | Unsecure Credentials: Credentials In Files | APT3, APT33, FIN13
Credential Access | T1552 | Unsecure Credentials: Bash History | -
Credential Access | T1555 | Credentials from Password Stores: Credentials from Web Browsers | APT33, APT39, HEXANE
Lateral Movement | T1021 | Remote Services: SSH | APT29, APT39, Lazarus
Defence Evasion | T1036 | Masquerading: Right-to-Left Override | APT28, APT29, Dragonfly
Defence Evasion | T1485 | Data Destruction | APT38, Gamaredon, Lazarus

Using the dataset

Webinar explanation about CIC IoT datasets: "From Profiling to Protection: Leveraging Datasets for Enhanced IoT Security" by Dr. Sajjad Dadkhah, Assistant Professor and Cybersecurity R&D Team Lead, with Q&A by Sumit Kundu.

YouTube video: CICAPT-IIOT: A Provenance-Based APT Attack Dataset for IIOT Environment by Erfan Ghiasvand, Cybersecurity Software Developer, Canadian Institute for Cybersecurity with introduction and Q&A by Sumit Kundu.

Acknowledgments

The authors would like to thank the Canadian Institute for Cybersecurity (CIC) and the National Research Council Canada (NRC) for their financial and educational support.

Citation

E. Ghiasvand, S. Ray, S. Iqbal, S. Dadkhah, and A. A. Ghorbani. "Resilience Against APTs: A Provenance-Based IIoT Dataset for Cybersecurity Research," submitted to the ESORICS 2024 Conference.

E. Ghiasvand, S. Ray, S. Iqbal, S. Dadkhah, and A. A. Ghorbani. "CICAPT-IIOT: A provenance-based APT attack dataset for IIoT environment," preprint, July 2024.

Download the dataset