SBAN Dataset

The SBAN dataset is designed to support software code mining and security research by aligning multiple representations of software into a unified benchmark. It includes over 3.7 million real-world software samples, each represented across four layers:

Binary code
Assembly code
Source code
Natural language descriptions

This multi-layered structure enables the effective training of machine learning models for tasks such as code understanding, malware detection, and software summarization.

SBAN was constructed by collecting a diverse range of both benign and malicious software from real-world sources. For each software instance, the dataset provides aligned data at the binary, assembly, source code, and natural language levels. Natural language descriptions were written or verified by domain experts to ensure accuracy and relevance, making the dataset especially useful for training large language models (LLMs) that require rich, aligned multimodal inputs.
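To make the four-layer alignment concrete, here is a minimal Python sketch of what one aligned sample could look like in memory. The class name, field names, and label values are assumptions chosen for illustration and do not reflect the dataset's actual file format or schema. The example record is a hand-written toy, not a sample taken from SBAN.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class AlignedSample:
    """One software sample with its four aligned layers (hypothetical schema)."""

    sample_id: str    # hypothetical identifier, not an SBAN field name
    label: str        # assumed values: "benign" or "malicious"
    binary: bytes     # binary code layer
    assembly: str     # assembly code layer
    source: str       # source code layer
    description: str  # natural language description layer

    def layers(self) -> Dict[str, Union[bytes, str]]:
        """Return the four layers keyed by layer name."""
        return {
            "binary": self.binary,
            "assembly": self.assembly,
            "source": self.source,
            "description": self.description,
        }


def is_fully_aligned(sample: AlignedSample) -> bool:
    """Return True when every one of the four layers is non-empty."""
    return all(len(value) > 0 for value in sample.layers().values())


# Toy record written by hand for illustration only.
example = AlignedSample(
    sample_id="demo-0001",
    label="benign",
    binary=bytes([0x55, 0x48, 0x89, 0xE5, 0x5D, 0xC3]),
    assembly="push rbp\nmov rbp, rsp\npop rbp\nret",
    source="void noop(void) {}",
    description="An empty function that returns immediately.",
)
```

Keeping all four layers in one record means a model or preprocessing step can select any subset of layers without a separate join across files.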

The SBAN dataset can be used to develop advanced code intelligence systems and robust cybersecurity solutions. It also serves as a scalable and realistic benchmark for evaluating AI models in cross-modal software analysis.

Key contributions

Introduction of SBAN: A large-scale, multi-layer dataset containing over 3.7 million software samples from both benign and malicious sources.
Unified Multi-Modal Format: SBAN is the first dataset to align binary, assembly, source code, and natural language in a single, coherent structure, which enables cross-modal learning.
LLM-Oriented Design: Tailored for training and evaluating LLMs on tasks such as malware detection, program understanding, and code summarization.
Scalable Benchmark: Offers aligned, real-world software representations to support the development of AI systems capable of understanding code across multiple dimensions.

Dataset use cases

The SBAN dataset is designed for a variety of research and application domains. Below are the primary directions and their use cases:

1. Large language models (LLMs)

SBAN supports training and evaluation of LLMs focused on software analysis. Its multi-layered, cross-modal format enables models such as CodeBERT, CodeLlama, and GPT variants to learn from diverse representations of software.

Key use cases:

Multimodal Pretraining: Train LLMs to understand and integrate multiple layers of software representation.
Cross-Modal Translation: Build models to translate between binary, source code, and natural language.
Explainable Malware Detection: Use aligned layers to detect and explain malware behaviour using both code and textual reasoning.
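The cross-modal translation use case can be sketched as turning one aligned record into (source layer, target layer) training pairs. This is an illustrative sketch only. The layer names, the dictionary layout, and the choice to store the binary layer as a hex string are assumptions, not part of the SBAN release.

```python
from itertools import permutations
from typing import Dict, List, Tuple

# Hypothetical layer names; the binary layer is assumed to be hex-encoded text.
LAYERS: Tuple[str, ...] = ("binary", "assembly", "source", "description")


def make_translation_pairs(
    record: Dict[str, str],
    layers: Tuple[str, ...] = LAYERS,
) -> List[Dict[str, str]]:
    """Build one training pair for every ordered pair of available layers."""
    pairs = []
    for src, dst in permutations(layers, 2):
        # Skip directions where either side is missing or empty.
        if record.get(src) and record.get(dst):
            pairs.append(
                {
                    "direction": f"{src}->{dst}",
                    "input": record[src],
                    "target": record[dst],
                }
            )
    return pairs


# Toy record written by hand for illustration only.
toy_record = {
    "binary": bytes([0x55, 0x48, 0x89, 0xE5, 0x5D, 0xC3]).hex(),
    "assembly": "push rbp\nmov rbp, rsp\npop rbp\nret",
    "source": "void noop(void) {}",
    "description": "An empty function that returns immediately.",
}

toy_pairs = make_translation_pairs(toy_record)
```

With all four layers present, the function yields twelve ordered pairs. Restricting `layers` to two names gives a single task such as assembly to description.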

2. Cybersecurity and malware analysis

This direction targets researchers working on low-level software behaviour and malware detection, particularly those using the binary and assembly layers.
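As one illustration of working with the binary and assembly layers, the sketch below computes two simple, widely used features: a normalized byte histogram and opcode counts. It is not the authors' method. It assumes the assembly text holds one instruction per line with the mnemonic first, and that comment lines start with `;`. Both are assumptions about formatting that the dataset may not follow.

```python
from collections import Counter
from typing import Dict, List


def byte_histogram(binary: bytes) -> List[float]:
    """Return the relative frequency of each of the 256 possible byte values."""
    counts = [0] * 256
    for byte in binary:
        counts[byte] += 1
    total = len(binary)
    if total == 0:
        return [0.0] * 256
    return [count / total for count in counts]


def opcode_counts(assembly: str) -> Dict[str, int]:
    """Count instruction mnemonics, assuming one instruction per line."""
    counter: Counter = Counter()
    for line in assembly.splitlines():
        stripped = line.strip()
        # Skip blank lines and lines assumed to be comments.
        if not stripped or stripped.startswith(";"):
            continue
        mnemonic = stripped.split()[0].lower()
        counter[mnemonic] += 1
    return dict(counter)


# Toy inputs written by hand for illustration only.
toy_hist = byte_histogram(b"\x00\x00\xff")
toy_ops = opcode_counts("push rbp\nmov rbp, rsp\nMOV eax, 0\nret")
```

Feature vectors like these could feed a conventional classifier as a baseline before moving to models that consume the raw layers directly.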

Acknowledgements

The authors gratefully acknowledge the Canadian Institute for Cybersecurity for its financial and academic support. Its resources and expertise were instrumental in the creation of the SBAN dataset.

Citation

If you use the SBAN dataset in your research, please cite the following paper:

H. Jelodar, M. Meymani, S. Bai, R. Razavi-Far, and A. A. Ghorbani, “SBAN: A framework & multi-dimensional dataset for large language model pre-training and software code mining,” in Proceedings of the 2025 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, 2025.
