Assessing the performance of any detection approach requires experimentation with data that is heterogeneous enough to simulate real traffic to an acceptable level. The lack of such datasets for evaluating botnet detection approaches is well known in the field, mostly due to a number of challenges that have been repeatedly emphasized in the literature [1], [2]. We constructed such a dataset, paying close attention to the following challenges:
Generality: Most existing botnet datasets suffer from a generality issue, i.e., they include data from only a few botnets (usually two or three samples). Detectors developed in such environments reflect only a small number of characteristics describing a very specific botnet behaviour, which makes them impractical and ineffective in the face of novel threats.
Realism: The effectiveness of a developed approach in practice is highly dependent on the realism of the botnet traffic traces used for its evaluation. Botnet traffic is usually generated or captured in a controlled environment. Providing a resilient environment (one not detectable by the botnet) in which a botnet performs all its intended malicious functionality is not trivial. In addition to resiliency, the collection period must be long enough to allow dormant bots to exhibit their functionality.
Representativeness: Another problem with generating botnet data is the ability of the collected network traffic traces to reflect the real environment a detector will face during deployment. Due to privacy concerns, gathering background data in a real production environment is not feasible in most cases; as a result, traffic is either simulated or gathered in a controlled environment. To overcome these challenges, we created an evaluation set combining non-overlapping subsets of the following data:
To merge these data traces into one unified dataset, we employed the so-called overlay methodology [1], one of the most popular methods for creating synthetic datasets. Malicious data is usually captured by honeypots or by infecting computers with a given bot binary in a controlled environment [9].
Botnet traces can be merged with benign data by mapping malicious data either to machines existing in the home network or to machines outside of the current network [1]. Considering the wide range of IP addresses in the traces, we mapped botnet IPs to hosts outside of the current network using the Bit-Twist packet generator [10]. Malicious and benign traffic were then replayed using Tcpreplay [11] and captured by tcpdump [12] as a single dataset.
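The address-mapping step of the overlay approach can be illustrated with a minimal sketch. It is not the procedure used to build the dataset (that relied on Bit-Twist, Tcpreplay, and tcpdump on packet captures); it only shows the idea on flow tuples. The function names, the flow tuple layout, and the example address ranges (`192.168.0.0/16` as the home network, `203.0.113.0/24` as the external pool) are all assumptions made for illustration.

```python
import ipaddress


def build_ip_mapping(botnet_ips, home_network, external_pool):
    """Assign each distinct botnet IP a unique address outside the home network.

    All arguments are hypothetical: `home_network` and `external_pool` are
    CIDR strings chosen by the caller, not values taken from the dataset.
    """
    home = ipaddress.ip_network(home_network)
    # Candidate addresses: hosts of the external pool that are not in the home network.
    candidates = (
        ip for ip in ipaddress.ip_network(external_pool).hosts() if ip not in home
    )
    mapping = {}
    for ip in botnet_ips:
        if ip not in mapping:
            # Raises StopIteration if the pool is too small for the trace.
            mapping[ip] = str(next(candidates))
    return mapping


def rewrite_flows(flows, mapping):
    """Rewrite (src, dst, label) tuples, leaving unmapped addresses unchanged."""
    return [
        (mapping.get(src, src), mapping.get(dst, dst), label)
        for src, dst, label in flows
    ]


if __name__ == "__main__":
    bot_flows = [("10.0.0.5", "8.8.8.8", "bot"), ("10.0.0.6", "10.0.0.5", "bot")]
    bot_ips = ["10.0.0.5", "10.0.0.6"]
    mapping = build_ip_mapping(bot_ips, "192.168.0.0/16", "203.0.113.0/24")
    print(rewrite_flows(bot_flows, mapping))
```

Keeping the mapping one-to-one preserves the communication structure among bots, which is the property a detector would rely on after the traces are merged.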
Botnet name | Type | Portion of flows in dataset
The resulting set was divided into training and test datasets that included 7 and 16 types of botnets, respectively. Tables 1 and 2 detail the distribution and type of botnets in each dataset. Our training dataset is 5.3 GB in size, of which 43.92% is malicious and the remainder contains normal flows. The test dataset is 8.5 GB, of which 44.97% is malicious flows. We included a more diverse set of botnet traces in the test dataset than in the training dataset in order to evaluate the novelty detection that a feature subset can provide.
Botnet name | Type | Portion of flows in dataset
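As a rough illustration of how the per-botnet flow portions and the train/test novelty gap could be computed from labeled flows, consider the sketch below. The label strings (`"normal"`, `"bot_a"`, and so on) and the function names are hypothetical and do not correspond to the actual labels or botnet names in the dataset.

```python
from collections import Counter


def flow_portions(labels):
    """Return the fraction of flows carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def malicious_share(labels, benign_label="normal"):
    """Fraction of flows whose label differs from the (assumed) benign label."""
    malicious = sum(1 for label in labels if label != benign_label)
    return malicious / len(labels)


def unseen_botnets(train_labels, test_labels, benign_label="normal"):
    """Botnet labels that occur in the test set but never in the training set."""
    return set(test_labels) - set(train_labels) - {benign_label}


if __name__ == "__main__":
    # Toy labels only; the real datasets contain 7 and 16 botnet types.
    train = ["normal", "normal", "bot_a", "bot_b"]
    test = ["normal", "bot_a", "bot_c", "bot_c"]
    print(flow_portions(train))
    print(malicious_share(test))
    print(unseen_botnets(train, test))
```

The set returned by `unseen_botnets` corresponds to the botnets a detector never saw during training, which is what the more diverse test set is meant to probe.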
The full research paper outlining the details of the dataset and its underlying principles:
E. Biglar Beigi, H. Hadian Jazi, N. Stakhanova and A. A. Ghorbani, "Towards effective feature selection in machine learning-based botnet detection approaches." Communications and Network Security (CNS), 2014 IEEE Conference on. IEEE, 2014.
References
[1] A. J. Aviv and A. Haeberlen, “Challenges in experimenting with botnet detection systems,” in USENIX 4th CSET Workshop, San Francisco, CA, 2011.
[2] M. Tavallaee, N. Stakhanova, and A. A. Ghorbani, “Toward credible evaluation of anomaly-based intrusion-detection methods,” Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 40, no. 5, pp. 516–524, 2010.
[3] D. Zhao, I. Traore, B. Sayed, W. Lu, S. Saad, A. Ghorbani, and D. Garant, “Botnet detection based on traffic behavior analysis and flow intervals,” Computers & Security, 2013.
[4] “The Honeynet Project, French chapter.”
[5] G. Szabó, D. Orincsay, S. Malomsoky, and I. Szabó, “On the validation of traffic classification algorithms,” in Passive and Active Network Measurement. Springer, 2008, pp. 72–81.
[6] Lawrence Berkeley National Laboratory and ICSI, “LBNL/ICSI enterprise tracing project,” LBNL enterprise trace repository, 2005.
[7] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani, “Toward developing a systematic approach to generate benchmark datasets for intrusion detection,” Computers & Security, vol. 31, no. 3, pp. 357–374, 2012.
[8] S. Garcia, “Malware capture facility project,” retrieved July 03, 2013.
[9] M. Stevanovic and J. M. Pedersen, “Machine learning for identifying botnet network traffic,” Networking and Security Section, Department of Electronic Systems, Aalborg University, Tech. Rep., 2013.
[10] Bit-Twist, “Libpcap-based ethernet packet,” retrieved July 10, 2013.
[11] A. Turner and M. Bing, “Tcpreplay: Pcap editing and replay tools for *nix,” sourceforge.net, 2005.
[12] “Tcpdump and libpcap,” retrieved July 23, 2013.