CIC APT IIoT Dataset 2024

CIC Advanced Persistent Threat Dataset for IIoT 2024 (CICAPT-IIoT2024)

The main goal of this research is to provide cybersecurity researchers working on APT detection with a dataset collected during an APT campaign in an industrial internet of things (IIoT) environment. To achieve this, we designed an attack scenario based on the operations of the APT29 attack group, implemented the attacks in an IIoT environment, and collected both provenance logs and network traffic data.

The main contributions of our research are as follows:

Creating a novel and comprehensive APT attack dataset captured within an IIoT environment. The dataset is generated using a hybrid testbed consisting of real and simulated IIoT components to reflect the complexity and diversity of modern technology systems;
Covering more than 20 distinct attack techniques, grouped under eight main attack tactics, in an APT attack scenario inspired by the APT29 campaigns. This scenario enhances the dataset’s usefulness for APT detection research;
Evaluating the effectiveness of machine learning (ML) algorithms in APT detection by applying several ML models to the CICAPT-IIoT dataset and analyzing their performance using a provenance-based detection framework.

CICAPT-IIoT2024 dataset testbed

We have developed a simulation testbed to create a controlled environment that supports IIoT research and is particularly effective for simulating APT scenarios. This testbed, built on the Brown-IIoTbed framework architecture, incorporates a mix of virtual and physical components to accurately reflect the complexity and dynamics of real-world IIoT systems.

At the core of our testbed is the NS3 network simulator, which operates on an Ubuntu host. The setup includes two Ubuntu virtual machines and two Kali Linux virtual machines, all hosted on a system running NS3. Additionally, the testbed features two Raspberry Pi devices and two IoT sensors. Raspberry Pi1 is equipped with OpenPLC and utilizes the Modbus protocol for communication, while Raspberry Pi2 functions as a WiFi access point to facilitate enhanced connectivity for the IoT sensors.

Dataset description

The dataset contains two data types, provenance data and network logs, each collected during the two phases of the experiment. The provenance data files are in CSV format and contain the nodes and edges of the provenance graph. Each node in the provenance data is assigned a unique 32-digit ID, which the edge entries use to establish connections between nodes in the graph.

Besides the IDs, the provenance data files comprise 32 features in total. However, because heterogeneous nodes and edges are all stored in a single file, not every feature applies to every node or edge type, so many fields are populated with NaN values. The provenance data includes two main node types: Process and Artifact. The Artifact node type is further categorized into subtypes such as file, directory, network socket, link, and unknown, the last being used for provenance nodes that do not fit any of the other subtypes. The common edge types in the provenance graph are: "Used" (from Process to Artifact), "WasGeneratedBy" (WGB; from Artifact to Process), "WasTriggeredBy" (WTB; from Process to Process), and "WasDerivedFrom" (WDF; from Artifact to Artifact).
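The node and edge typing described above can be checked mechanically. The following is a minimal sketch, not the dataset's own tooling: the column names (record, id, type, from, to) and the short IDs in the sample are hypothetical stand-ins for the real CSV schema, which should be read from the file headers. It splits a mixed node/edge file into a node-type map and an adjacency list, and flags edges whose endpoints do not match the directions listed in the text.

```python
import csv
import io
from collections import defaultdict

# Edge directions as described in the text:
# edge type -> (source node type, destination node type).
EDGE_RULES = {
    "Used": ("Process", "Artifact"),
    "WasGeneratedBy": ("Artifact", "Process"),
    "WasTriggeredBy": ("Process", "Process"),
    "WasDerivedFrom": ("Artifact", "Artifact"),
}

# Hypothetical sample; the real files use 32-digit IDs and many more columns.
SAMPLE = """record,id,type,from,to
node,p1,Process,,
node,p2,Process,,
node,a1,Artifact,,
edge,e1,Used,p1,a1
edge,e2,WasTriggeredBy,p2,p1
edge,e3,WasGeneratedBy,p1,a1
"""


def load_graph(csv_text):
    """Return (node_types, adjacency, violations) for a mixed node/edge CSV."""
    node_types = {}
    edges = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["record"] == "node":
            node_types[row["id"]] = row["type"]
        else:
            edges.append((row["from"], row["to"], row["type"]))

    adjacency = defaultdict(list)
    violations = []
    for src, dst, etype in edges:
        adjacency[src].append((dst, etype))
        expected = EDGE_RULES.get(etype)
        actual = (node_types.get(src), node_types.get(dst))
        # Only edge types with a known rule are checked.
        if expected is not None and actual != expected:
            violations.append((src, dst, etype))
    return node_types, adjacency, violations


if __name__ == "__main__":
    types, adj, bad = load_graph(SAMPLE)
    print(dict(adj))
    print("direction violations:", bad)
```

In the sample, edge e3 is deliberately reversed so that the check has something to report. Edges are collected first and validated afterwards so that the result does not depend on whether nodes appear before the edges that reference them.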

The other data type in the dataset is the network logs, captured using NS3 during the experiments and stored in pcap format. These pcap files can be converted to CSV format, and various packet features can be extracted from them. The last file in the dataset is the Attack Information file, which contains the necessary information about the attacks performed during phase 2 of the experiments: the attack time, the attack PID, and the attack category. This file helps researchers analyze how the dataset behaves during the attacks.
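As an illustration of how a pcap file can be turned into tabular rows, here is a minimal sketch that reads the per-packet record headers of a classic pcap file using only the Python standard library. It is not the pcap2csv.py script shipped with the dataset, and it extracts only the timestamp and lengths; the function name iter_pcap_records is hypothetical. It assumes the classic pcap layout (24-byte global header, 16-byte record headers, microsecond timestamps) and does not handle the pcapng format.

```python
import struct


def iter_pcap_records(data):
    """Yield (timestamp_seconds, captured_length, original_length) per packet."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written by a little-endian host
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written by a big-endian host
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")

    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(data):
        ts_sec, ts_usec, incl_len, orig_len = struct.unpack_from(
            endian + "IIII", data, offset
        )
        # Skip the record header and the captured packet bytes.
        offset += 16 + incl_len
        yield ts_sec + ts_usec / 1_000_000, incl_len, orig_len


def build_demo_pcap():
    """Build a tiny in-memory pcap with two fake packets (for demonstration)."""
    header = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
    pkt1 = struct.pack("<IIII", 10, 500000, 3, 3) + b"abc"
    pkt2 = struct.pack("<IIII", 11, 0, 2, 60) + b"xy"
    return header + pkt1 + pkt2


if __name__ == "__main__":
    for ts, captured, original in iter_pcap_records(build_demo_pcap()):
        print(f"{ts:.6f},{captured},{original}")
```

For a real capture, the bytes would come from reading the file in binary mode. Protocol-level features (addresses, ports, flags) would require decoding the packet bytes themselves, which this sketch deliberately leaves out.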

Dataset directories

Provenance logs: This directory contains CSV files of the labelled provenance logs from each experiment phase. Each CSV file includes all provenance entities, both nodes and edges. There is also a sample provenance subgraph, extracted from the phase 2 provenance graph, to give an overview of what the graphs look like.
Network traffic: This directory contains the pcap files documenting the network traffic captured during the experiments. It contains both the individual pcap files captured by NS3 and, for each phase, a merged pcap file that combines all packet logs into a single file. Additionally, there is a CSV file that contains the packet features extracted using the pcap2csv.py script, offering a structured view of network activity.
Supplementary material: In this directory we share some of the source code used to generate and process the dataset. The pcap2csv script is used to generate the network traffic CSV files. The Sample_analysis.ipynb notebook shows a simple way to create the provenance graph and use node2vec to create node embeddings. This folder also contains Attack_info.csv, which provides information about the attack steps in the APT campaign. This file is extracted from MITRE Caldera reports.
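One way the attack information can be combined with the provenance data is to look up the attack category for each process by PID. The sketch below illustrates this under loudly stated assumptions: the column names (time, pid, category), the sample values, and the dictionary layout of the process rows are all hypothetical and must be replaced with the actual headers in Attack_info.csv and the provenance CSV files.

```python
import csv
import io

# Hypothetical sample; real column names and values come from Attack_info.csv.
ATTACK_INFO = """time,pid,category
2024-01-01T10:00:00,4321,Discovery
2024-01-01T10:05:00,4321,Collection
2024-01-01T10:09:00,5000,Exfiltration
"""


def attack_categories_by_pid(attack_csv_text):
    """Map each attack PID to the set of attack categories recorded for it."""
    categories = {}
    for row in csv.DictReader(io.StringIO(attack_csv_text)):
        categories.setdefault(row["pid"], set()).add(row["category"])
    return categories


def label_processes(process_rows, categories):
    """Attach sorted attack categories (empty list if none) to each process row."""
    return [
        {**proc, "attack_categories": sorted(categories.get(proc["pid"], ()))}
        for proc in process_rows
    ]


if __name__ == "__main__":
    # Hypothetical process nodes, keyed by the same PID field.
    processes = [
        {"id": "n1", "pid": "4321"},
        {"id": "n2", "pid": "77"},
    ]
    for row in label_processes(processes, attack_categories_by_pid(ATTACK_INFO)):
        print(row)
```

Because operating systems can reuse PIDs, a more careful join would also compare the attack time against the lifetime of each process; this sketch matches on PID alone.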

APT attack phases and techniques used in the dataset

Tactic	Technique ID	Attack Type	APT Group
Collection	T1074	Data Staged: Local Data Staging	APT28, APT29, APT39, APT3
	T1005	Data from Local System	Andariel, APT28, APT29
	T1119	Automated Collection	APT1, APT28, Chimera
	T1113	Screen Capture	APT28, APT39, Carbanak
	T1115	Clipboard Data	APT29, APT38
Exfiltration	T1560	Archive Collected Data: Archive via Utility	APT28, APT29, APT32
Exfiltration	T1041	Exfiltration Over C2 Channel	Lazarus, APT3, APT32
Command and Control	T1105	Ingress Tool Transfer	Lazarus, APT29, APT3
Persistence	T1546	Event Triggered Execution	APT28, APT29, APT3
Persistence	T1136	Create Account: Local Account	Dragonfly, FIN13, APT29
Discovery	T1087	Account Discovery: Local Account	APT1, APT3, Chimera
	T1016	System Network Configuration Discovery: Internet Connection Discovery	FIN13, Gamaredon, APT29
	T1016	System Network Configuration Discovery: Wi-Fi Discovery	Magic Hound, Wizard Spider
	T1033	System Owner/User Discovery	Chimera, Dragonfly, APT3
	T1518	Software Discovery	HEXANE, MuddyWater
	T1069	Permission Groups Discovery: Local Groups	Chimera, HEXANE, APT29
	T1082	System Information Discovery	Chimera, APT3, APT32
	T1083	File and Directory Discovery	APT28, APT29, APT32
	T1018	Remote System Discovery	Chimera, APT29, APT32
Credential Access	T1552	Unsecure Credentials: Credentials In Files	APT3, APT33, FIN13
	T1552	Unsecure Credentials: Bash History	-
	T1555	Credentials from Password Stores: Credentials from Web Browsers	APT33, APT39, HEXANE
Lateral Movement	T1021	Remote Services: SSH	APT29, APT39, Lazarus
Defence Evasion	T1036	Masquerading: Right-to-Left Override	APT28, APT29, Dragonfly
Defence Evasion	T1485	Data Destruction	APT38, Gamaredon, Lazarus

Using the dataset

Webinar explanation about CIC IoT datasets: "From Profiling to Protection: Leveraging Datasets for Enhanced IoT Security" by Dr. Sajjad Dadkhah, Assistant Professor and Cybersecurity R&D Team Lead, with Q&A by Sumit Kundu.

YouTube video: CICAPT-IIOT: A Provenance-Based APT Attack Dataset for IIOT Environment by Erfan Ghiasvand, Cybersecurity Software Developer, Canadian Institute for Cybersecurity with introduction and Q&A by Sumit Kundu.

Acknowledgments

The authors would like to thank the Canadian Institute for Cybersecurity (CIC) and the National Research Council Canada (NRC) for their financial and educational support.

Citation

E. Ghiasvand, S. Ray, S. Iqbal, S. Dadkhah, and A. Ghorbani. "Resilience Against APTs: A Provenance-Based IIoT Dataset for Cybersecurity Research," submitted to ESORICS 2024 Conference.

E. Ghiasvand, S. Ray, S. Iqbal, S. Dadkhah, and A. A. Ghorbani. "CICAPT-IIOT: A provenance-based APT attack dataset for IIoT environment," preprint, July 2024.

Download the dataset
