CIC SGG Dataset 2024

CIC Statically Generated Graphs for Malware Analysis (CIC-SGG-2024)

Control flow graphs (CFGs) and function call graphs (FCGs) have become pivotal in providing a detailed understanding of program execution and effectively characterizing the behaviour of malware. These graph-based representations, when combined with graph neural networks, have shown promise in developing high-performance malware detectors.

As part of our work, we generate CFGs and FCGs for the BODMAS, DikeDataset, and pe-machine-learning-dataset datasets using the angr Python library. We also provide embeddings of the graphs and explanations for use in machine learning tasks. Below is an example of the pipeline used to generate the graphs in our work.

Static pipeline image

Furthermore, this work contributes to the task of graph classification on large graphs with many samples in the field of graph learning. Common datasets for this task consist of many samples with a few hundred nodes on average, or of few samples with a few thousand nodes and many classes. This work provides many large graphs, some with hundreds of thousands of nodes and edges, and two classes: malicious and benign.

Dataset directories

We recognize two main audiences for this work: one, researchers in the field of malware detection and analysis, and two, researchers in the field of graph-based machine learning. The former may be interested in the Attribute Graphs whereas the latter may be interested in the Embedded and Explained Graphs, all of which are described below.

Attribute Graphs (cfgs_fcgs): This directory contains the output objects from the angr Python library, where each sample is saved as a pickle file containing both the CFG and FCG of a given binary sample as well as other information output by angr. Samples are grouped into sub-directories based on the dataset they were generated from. A CSV file (cfgs_fcgs_map.csv) is included that contains the label (0 benign and 1 malicious), type (CFG or FCG), dataset (DikeDataset, BODMAS, or pe-machine-learning-dataset), hash (unique sha256 hash of the original binary file), number_nodes, number_edges, number_weakly_connected_components, and file_size (bytes).
Embedded Graphs (ebds): This directory contains the embedded versions of the graphs in the Attribute Graphs directory. It contains the sub-directories AE (Assembly Embedding), generated from CFGs, and FNE (Function Name Embedding), generated from FCGs. Similarly, this directory contains a CSV file (ebds_map.csv) that maps the same attributes listed in the cfgs_fcgs_map.csv file, with the type attribute replaced by ebd (AE or FNE).
Explained Graphs (exps): This directory contains node- and edge-based explanations of the ebd graphs, generated from the models we train in our work, that highlight the important areas contributing to a particular prediction. These graphs have the extension ".exp" instead of ".pkl" to differentiate them from the Attribute Graphs with the same name. However, they are in fact pickle files. This directory also contains a CSV file (exps_map.csv) that maps the same set of attributes listed in ebds_map.csv and adds a predicted attribute (0 benign and 1 malicious). See the src/index.py file for an example of accessing these and other attributes.
Examples (src): This directory contains several example files as well as additional mapping information and installation requirements. We include a simple example in index.py that demonstrates how to load the various sample types (cfg, fcg, ebd, and exp) and how to access some of their attributes. We also include a simple example that trains a GCN on a subset of our dataset. That example references a dataset.py file, also included, that contains a PyTorch Geometric Dataset class that can be used to work with large datasets like ours. We also include an additional CSV file (original_to_map.csv) that maps the original raw source binary paths in their respective datasets to the hash value we use in our dataset. Lastly, we include a requirements file (requirements.txt) for installing with pip. We found that installing angr version 9.2.89 is especially important in order to work with our samples.
Null & Inaccessible Samples (.null): During specific phases of the pipeline (i.e., generation and embedding), a given sample may cause the generation or embedding operation to run out of memory and subsequently be killed by the operating system (OS). When this occurs, no data is written to the file even though the OS has opened a handle, so the file is created with 0 bytes. In a second case, the file may contain data, but loading it raises the exception "EOFError: Ran out of input."; it is still unclear exactly why this occurs. Additionally, samples loaded onto a cpu vs. a cuda device may occasionally fail. We include these samples only for completeness. However, we verify that all samples in the main dataset load successfully using a cpu device, mainly to reduce problems for others during training.
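
The two failure cases described for null and inaccessible samples can be guarded against when loading. The following is a minimal sketch, not the loader used in src/index.py: the helper name load_sample is hypothetical, and it assumes the samples are ordinary pickle files as stated above. Unpickling real samples additionally requires the libraries that produced them (for example angr) to be installed, and pickle files should only be loaded from sources you trust.

```python
import os
import pickle


def load_sample(path):
    """Return the unpickled object, or None for a null/inaccessible sample.

    Hypothetical helper: it is not part of the dataset's src directory.
    """
    # Case 1: the writing process was killed before any data was written,
    # leaving a zero-byte file.
    if os.path.getsize(path) == 0:
        return None
    try:
        with open(path, "rb") as handle:
            return pickle.load(handle)
    except (EOFError, pickle.UnpicklingError):
        # Case 2: the file has data but is incomplete, e.g.
        # "EOFError: Ran out of input" or a truncated pickle stream.
        return None
```

Returning None keeps a batch loading loop running, so callers can skip or log samples that fail instead of aborting.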

Importantly, all samples with the same filename, excluding the file extension, originate from the same binary file. The filename is derived from the sha256 hash, depending on whether the original file required arming or not. However, only files generated from the same underlying lineage are comparable. For example, a file in exps/AE/DikeDataset is only comparable with a file in ebds/AE/DikeDataset and with the corresponding graph type, in this case the CFG, within the file in cfgs_fcgs/DikeDataset.
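
The lineage rule can be sketched as a lookup that collects the files sharing one filename stem within one dataset and one embedding type. This is only an illustration under assumptions: the helper name related_files is hypothetical, the directory layout is inferred from the example paths above, and because the file extension of the embedded graphs is not stated here, the lookup matches on the stem with any extension.

```python
from pathlib import Path


def related_files(root, dataset, stem, ebd="AE"):
    """Find the comparable files for one sample (hypothetical helper).

    root:    top-level dataset directory (assumed to hold cfgs_fcgs, ebds, exps)
    dataset: e.g. "DikeDataset"
    stem:    filename without its extension
    ebd:     "AE" (from CFGs) or "FNE" (from FCGs)
    """
    root = Path(root)
    directories = {
        "attribute": root / "cfgs_fcgs" / dataset,
        "embedded": root / "ebds" / ebd / dataset,
        "explained": root / "exps" / ebd / dataset,
    }
    found = {}
    for kind, directory in directories.items():
        matches = []
        if directory.is_dir():
            # Match the stem exactly, whatever the extension is.
            matches = sorted(p for p in directory.glob(stem + ".*") if p.stem == stem)
        found[kind] = matches[0] if matches else None
    return found
```

A value of None for any key means no counterpart was found in that directory. With ebd="AE" the comparable graph inside the attribute file is the CFG, and with ebd="FNE" it is the FCG.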

Using the dataset

Please refer to the src directory for an example of how to start working with the dataset in index.py. We refer users to angr, PyTorch Geometric, and NetworkX for further information on working with the underlying dataset objects and libraries.

Isomorphic samples

We are fully aware of the presence of isomorphic samples within the dataset. We knowingly include these samples, not only for completeness, but importantly because the sample graphs, while isomorphic, do not originate from "true" duplicates with respect to the original binary samples. We leave it to the end user to decide how to handle such samples. We understand that removing isomorphic graphs may be of particular interest for graph-based machine learning whereas in malware analysis it may not be a concern.

One general approach to identifying samples that are isomorphic with high probability is to first compare the number of nodes, edges, and components of the graphs using the provided CSV files, and then to test the matching candidates for isomorphism.
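
The first step of this approach, pre-filtering by counts, can be sketched with the standard library alone. This is a minimal sketch under assumptions: the function name isomorphism_candidates is hypothetical, and the CSV header is assumed to use exactly the attribute names listed for cfgs_fcgs_map.csv, with integer counts. For ebds_map.csv, pass kind_column="ebd".

```python
import csv
from collections import defaultdict


def isomorphism_candidates(csv_path, kind_column="type"):
    """Group sample hashes whose graphs share the same size signature.

    Hypothetical helper. Graphs with different counts cannot be isomorphic,
    so only groups with more than one member need a full isomorphism test.
    """
    groups = defaultdict(list)
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            signature = (
                row[kind_column],  # never compare a CFG with an FCG
                int(row["number_nodes"]),
                int(row["number_edges"]),
                int(row["number_weakly_connected_components"]),
            )
            groups[signature].append(row["hash"])
    # Keep only signatures shared by at least two samples.
    return {sig: hashes for sig, hashes in groups.items() if len(hashes) > 1}
```

Each returned group is only a set of candidates. Once the graphs are available as NetworkX objects, a pair can be confirmed with a function such as networkx.is_isomorphic, which can be expensive on graphs with hundreds of thousands of nodes, so this pre-filter keeps the number of full tests small.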

There are many speculative reasons for the presence of isomorphic samples. Presumably, malware authors may alter source code, or the binary itself, to perturb its signature and evade detection while also leaving the underlying CFG/FCG intact. Additionally, some samples may originate from the same malware family and thus have the same CFG/FCG.

Acknowledgments

L. Yang, A. Ciptadi, I. Laziuk, A. Ahmadzadeh, and G. Wang, “Bodmas: An open dataset for learning based temporal analysis of pe malware,” in 2021 IEEE Security and Privacy Workshops (SPW), pp. 78–84, IEEE, 2021.

G.-A. Iosif, “Dikedataset,” 2021. Accessed on February 27, 2024.

Practical Security Analytics LLC, “Pe malware machine learning dataset,” 2024. Accessed: 2024-08-06.

License

The CIC-SGG-2024 dataset is publicly available for researchers. If you are using our dataset, you must cite our related research paper that covers important details related to its usage and application.

Citation

H. Mohammadian, G. Higgins, S. Ansong, R. Razavi-Far, A. Ghorbani. "Explainable Malware Detection through Integrated Graph Reduction and Learning Techniques," preprint, October 2024.

Curated by Griffin Higgins, please direct questions to griffin.higgins@unb.ca. Only questions with the heading subject "CIC-SGG-2024" will be guaranteed to receive a reply.

Download the dataset
