The SBAN dataset is designed to support software code mining and security research by aligning multiple representations of software into a unified benchmark. It includes over 3.7 million real-world software samples, each represented across four aligned layers: binary, assembly, source code, and natural language description.
This multi-layered structure enables the effective training of machine learning models for tasks such as code understanding, malware detection, and software summarization.
SBAN was constructed by collecting a diverse range of both benign and malicious software from real-world sources. For each software instance, the dataset provides aligned data at the binary, assembly, source code, and natural language levels. Natural language descriptions were written or verified by domain experts to ensure accuracy and relevance, making the dataset especially useful for training large language models (LLMs) that require rich, aligned multimodal inputs.
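As a concrete illustration, one aligned SBAN sample could be modeled as a record carrying all four layers together. The field names and structure below are illustrative assumptions for exposition, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class SbanSample:
    """Hypothetical container for one SBAN sample with its four aligned
    layers. Field names are assumptions, not the dataset's real schema."""
    binary: bytes       # raw executable bytes
    assembly: str       # disassembled instructions
    source_code: str    # corresponding source code
    description: str    # expert-written/verified natural language summary
    is_malicious: bool  # benign vs. malicious label

# Example instance showing how the four layers stay aligned per sample
sample = SbanSample(
    binary=b"\x7fELF...",
    assembly="mov eax, 1\nret",
    source_code="int main(void) { return 1; }",
    description="Returns the constant 1 and exits.",
    is_malicious=False,
)
```

Keeping all layers in one record is what makes cross-modal training pairs (e.g., assembly-to-description) trivial to extract.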
The SBAN dataset can be used to develop advanced code intelligence systems and robust cybersecurity solutions. It also serves as a scalable and realistic benchmark for evaluating AI models in cross-modal software analysis.
The SBAN dataset is designed for a variety of research and application domains. Below are the primary directions and their use cases:
SBAN supports training and evaluation of LLMs focused on software analysis. Its multi-layered, cross-modal format enables models such as CodeBERT, CodeLlama, and GPT variants to learn from diverse representations of software.
Key use cases:
This direction targets researchers working on low-level software behaviour analysis and malware detection, particularly using the binary and assembly layers.
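For malware-detection work, the assembly layer and the benign/malicious label are the relevant pieces of each record. A minimal sketch of extracting (assembly, label) training pairs follows; the dict keys are assumptions about how SBAN records might be exposed, not the dataset's actual schema:

```python
# Hypothetical records with the assembly layer and a malware label.
samples = [
    {"assembly": "mov eax, 1\nret", "is_malicious": False},
    {"assembly": "call decrypt_payload\njmp eax", "is_malicious": True},
]

def to_training_pairs(records):
    """Keep records that carry the assembly layer, paired with 0/1 labels."""
    return [(r["assembly"], int(r["is_malicious"]))
            for r in records if r.get("assembly")]

pairs = to_training_pairs(samples)
# Each pair is (assembly text, 0/1 label), ready for tokenization
# and classifier training.
```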
The authors gratefully acknowledge the Canadian Institute for Cybersecurity for its financial and academic support. Their resources and expertise were instrumental in the creation of the SBAN dataset.
If you use the SBAN dataset in your research, please cite the following paper:
H. Jelodar, M. Meymani, S. Bai, R. Razavi-Far, and A. A. Ghorbani, “SBAN: A framework & multi-dimensional dataset for large language model pre-training and software code mining,” in Proceedings of the 2025 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, 2025.