We are providing a new Android malware dataset, namely CICMalDroid 2020, that has the following four properties:
We collected more than 17,341 Android samples from several sources, including the VirusTotal service, the Contagio security blog, AMD, MalDozer, and other datasets used by recent research contributions (the sources are cited in the paper).
The samples were collected from December 2017 to December 2018. Classifying Android apps by malware category is important for cybersecurity researchers, because the category determines which countermeasures and mitigation strategies are appropriate. Hence, our dataset intentionally spans five distinct categories: Adware, Banking malware, SMS malware, Riskware, and Benign. Each malware category is briefly described as follows:
Mobile Adware refers to advertising material (i.e., ads) that typically hides inside legitimate apps that have been infected by malware (available on third-party markets). Because the ad library used by the malware repeats a series of steps to keep the ads running, Adware continuously pops up ads, even if the victim tries to force-close the app. Adware can infect and root-infect a device, forcing it to download specific Adware types and allowing attackers to steal personal information.
Mobile Banking malware is specialized malware designed to gain access to the user’s online banking accounts by mimicking the original banking applications or banking web interface. Most mobile Banking malware is Trojan-based: it is designed to infiltrate devices, steal sensitive details such as bank logins and passwords, and send the stolen information to a command and control (C&C) server.
SMS malware exploits the SMS service as its medium of operation, intercepting the SMS payload to conduct attacks. The attackers first upload malware to their hosting sites so that it can be linked from the SMS. They then use the C&C server to issue attack instructions, such as sending malicious SMS, intercepting SMS, and stealing data.
Riskware refers to legitimate programs that can cause damage if malicious users exploit them. Consequently, it can turn into any other form of malware such as Adware or Ransomware, which extends functionalities by installing newly infected applications. Uniquely, this category only has a single variant, mostly labeled as "Riskware" by VirusTotal.
All other applications that do not fall into the categories above are considered benign, meaning that the application is not malicious. To verify that they are not malicious, we scanned all the benign samples with VirusTotal.
We analyzed our collected data dynamically using CopperDroid, a VMI-based dynamic analysis system, to automatically reconstruct low-level OS-specific and high-level Android-specific behaviors of Android samples. Out of 17,341 samples, 13,077 samples ran successfully while the rest failed due to errors such as time-out, invalid APK files, and memory allocation failures.
All the APK files are first executed in CopperDroid, and the run-time behaviors are recorded in log files. CopperDroid's analysis results are available in JSON format, which is easy to parse and carries additional auxiliary information. The analysis results are classified into three broad groups:
We loaded all 13,077 analysis results; about 12% of the JSON files failed to open, mostly due to “unterminated string” errors. The remaining Android samples in each category are as follows:
Since the categories are not equal in size, we balance the number of samples in each category before splitting them into training and test bins for analysis with AI techniques. To give every sample an equal chance of being selected, we randomly shuffle the samples within each category before balancing.
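As a rough illustration (not the exact pipeline used to build the dataset), the loading and balancing steps described above could be sketched in Python as follows. The directory layout, the function names `load_reports` and `balance_and_split`, the choice of undersampling every category to the size of the smallest one, and the 80/20 train/test split are all assumptions made for this sketch; none of them is specified by the dataset description.

```python
import json
import random
from pathlib import Path


def load_reports(report_dir):
    """Parse every *.json file in report_dir; record files that fail to parse.

    Hypothetical helper: the real layout of the analysis logs may differ.
    """
    reports, failed = {}, []
    for path in sorted(Path(report_dir).glob("*.json")):
        try:
            with open(path, encoding="utf-8") as handle:
                reports[path.name] = json.load(handle)
        except (json.JSONDecodeError, UnicodeDecodeError) as err:
            # Truncated logs surface in Python's json module as
            # "Unterminated string starting at: ..." errors.
            failed.append((path.name, str(err)))
    return reports, failed


def balance_and_split(samples_by_category, test_fraction=0.2, seed=0):
    """Shuffle each category, undersample to the smallest one, then split.

    Assumption: balancing is done by undersampling; test_fraction is illustrative.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    target = min(len(items) for items in samples_by_category.values())
    n_test = int(target * test_fraction)
    train, test = [], []
    for category, items in samples_by_category.items():
        shuffled = list(items)
        rng.shuffle(shuffled)      # random order within the category
        kept = shuffled[:target]   # same number of samples per category
        test += [(sample, category) for sample in kept[:n_test]]
        train += [(sample, category) for sample in kept[n_test:]]
    return train, test
```

Because each category is shuffled before truncation, every sample in a larger category has the same chance of surviving the balancing step. Files listed in `failed` can be inspected separately instead of silently disappearing from the counts.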
The CICMalDroid2020 dataset consists of the following items and is publicly available for researchers.
50,621 extracted features for 11,598 APK files comprising static information, such as intent actions, permissions, intent consts, files, method tags, sensitive APIs, services, packages, receivers, etc.
If you use our dataset, please cite our research papers, which outline the details of the dataset and its underlying principles:
Samaneh Mahdavifar, Andi Fitriah Abdul Kadir, Rasool Fatemi, Dima Alhadidi, Ali A. Ghorbani; Dynamic Android Malware Category Classification using Semi-Supervised Deep Learning, The 18th IEEE International Conference on Dependable, Autonomic, and Secure Computing (DASC), Aug. 17-24, 2020.
Samaneh Mahdavifar, Dima Alhadidi, and Ali A. Ghorbani (2022). Effective and Efficient Hybrid Android Malware Classification Using Pseudo-Label Stacked Auto-Encoder, Journal of Network and Systems Management 30 (1), 1-34.
Acknowledgement
The authors would like to express their gratitude toward Dr. Lorenzo Cavallaro and Feargus Pendlebury (Systems Security Research Lab, King’s College London) for generously analyzing a large number of Android APKs in CopperDroid.