CIRA-CIC-DoHBrw-2020

Canadian Institute for Cybersecurity (CIC) project funded by Canadian Internet Registration Authority (CIRA)

The Domain Name System (DNS) is one of the earliest network protocols, and its security loopholes have been exploited repeatedly over the years. DNS abuse has long been an area of great concern for cybersecurity researchers. However, providing security and privacy for DNS requests and responses remains challenging because attackers use sophisticated methods to steal data in transit.

To overcome some of the DNS vulnerabilities related to privacy and data manipulation, the IETF introduced DNS over HTTPS (DoH) in RFC 8484. DoH enhances privacy and combats eavesdropping and man-in-the-middle attacks by encrypting DNS queries and sending them through a covert channel or tunnel so that the data cannot be tampered with on the way. Nonetheless, the lack of a representative dataset is the key obstacle to evaluating techniques that capture DoH traffic in a network topology.

This research work proposes a systematic approach to generating a representative dataset for analyzing, testing, and evaluating DoH traffic in covert channels and tunnels. The main objective of this project is to deploy DoH within an application and capture benign as well as malicious DoH traffic, then use a two-layered approach to detect and characterize DoH traffic with a time-series classifier.

The final dataset was produced by implementing the DoH protocol within an application using five different browsers and tools and four servers to capture Benign-DoH, Malicious-DoH, and non-DoH traffic. Layer 1 of the proposed two-layered approach distinguishes DoH traffic from non-DoH traffic, and layer 2 distinguishes Benign-DoH from Malicious-DoH traffic. The browsers and tools used to capture traffic are Google Chrome, Mozilla Firefox, dns2tcp, DNSCat2, and Iodine, while the servers used to respond to DoH requests are AdGuard, Cloudflare, Google DNS, and Quad9.

1. Introduction

In the CIRA-CIC-DoHBrw-2020 dataset, a two-layered approach is used to capture benign and malicious DoH traffic along with non-DoH traffic. To generate the representative dataset, HTTPS traffic (benign DoH and non-DoH) is generated by accessing the top 10k Alexa websites, and DoH traffic is generated using browsers and DNS tunneling tools that support the DoH protocol. At the first layer, the captured traffic is classified as DoH or non-DoH using a classifier based on statistical features. At the second layer, DoH traffic is characterized as benign DoH or malicious DoH using a time-series classifier.

Non-DoH: Traffic generated by accessing a website over HTTPS is captured and labeled as non-DoH traffic. To capture enough traffic to balance the dataset, thousands of websites from the Alexa list are browsed.

Benign-DoH: Non-malicious DoH traffic generated with the same technique as non-DoH traffic, using the Mozilla Firefox and Google Chrome web browsers.

Malicious-DoH: DNS tunneling tools such as dns2tcp, DNSCat2, and Iodine are used to generate malicious DoH traffic. These tools can send TCP traffic encapsulated in DNS queries; in other words, they create tunnels of encrypted data. The DNS queries are then sent as TLS-encrypted HTTPS requests to special DoH servers.
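To make the DoH transport concrete, the following minimal sketch shows how a DNS query can be packed into an HTTPS GET URL as described in RFC 8484 (wire-format message, base64url-encoded without padding, passed in the `dns` parameter). It is an illustration only, not code from the dataset tooling; the resolver URL `https://doh.example/dns-query` and the function names are hypothetical, and no request is actually sent.

```python
import base64
import struct


def build_dns_query(hostname: str, qtype: int = 1) -> bytes:
    """Build a minimal DNS query in wire format (one question, class IN)."""
    # Header: ID=0, flags=0x0100 (recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # Question trailer: QTYPE (1 = A record) and QCLASS (1 = IN).
    return header + qname + struct.pack("!HH", qtype, 1)


def doh_get_url(resolver_url: str, hostname: str) -> str:
    """Return the GET URL that carries the query in the 'dns' parameter."""
    wire = build_dns_query(hostname)
    # RFC 8484 uses base64url encoding with the '=' padding removed.
    encoded = base64.urlsafe_b64encode(wire).rstrip(b"=").decode("ascii")
    return f"{resolver_url}?dns={encoded}"


# Hypothetical resolver URL; any DoH endpoint would be addressed the same way.
url = doh_get_url("https://doh.example/dns-query", "www.example.com")
print(url)
```

For `www.example.com`, the encoded parameter is the same string used in the GET example of RFC 8484. On the wire, the whole request is wrapped in TLS, which is why DoH flows look like ordinary HTTPS traffic to an observer.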

A notion of packet clumps is used to reduce the dimensionality of the data and remove insignificant packets such as acknowledgements and packets too small to carry data. A clump is defined as a sequence of one or more consecutive packets of a network flow (having the same source and destination) that travel in the same direction, which gives a new and succinct representation of the data. The rationale for this step is to combine these packets and recover the application traffic that TLS segmentation and IP fragmentation scatter across several packets. A timeout threshold is also applied so that two packets separated by more than the threshold do not end up in the same clump.

Note that malicious traffic can be either tunneled or non-tunneled, but only tunneled malicious traffic is generated for this research.
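The clumping idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the DoHLyzer implementation: packets of one flow are given as `(timestamp, direction, payload_size)` tuples, and the `timeout` and `min_payload` thresholds are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# (timestamp in seconds, direction: +1 client->server / -1 server->client, payload bytes)
Packet = Tuple[float, int, int]


@dataclass
class Clump:
    direction: int
    start_time: float
    end_time: float
    total_bytes: int
    packet_count: int


def build_clumps(packets: Iterable[Packet],
                 timeout: float = 1.0,
                 min_payload: int = 1) -> List[Clump]:
    """Group the packets of a single flow into same-direction clumps."""
    clumps: List[Clump] = []
    for ts, direction, size in sorted(packets):
        # Drop insignificant packets (e.g. bare acknowledgements).
        if size < min_payload:
            continue
        last = clumps[-1] if clumps else None
        # Extend the current clump only if the direction matches and the
        # gap to the previous kept packet does not exceed the timeout.
        if last and last.direction == direction and ts - last.end_time <= timeout:
            last.end_time = ts
            last.total_bytes += size
            last.packet_count += 1
        else:
            clumps.append(Clump(direction, ts, ts, size, 1))
    return clumps


flow = [
    (0.00, +1, 300), (0.01, -1, 0), (0.02, +1, 200),   # request split in two, plus an ACK
    (0.05, -1, 1400), (0.06, -1, 900),                 # response split in two
    (5.00, -1, 100),                                   # late packet, beyond the timeout
]
clumps = build_clumps(flow)
for c in clumps:
    print(c)
```

In this example the empty acknowledgement is discarded, the two request packets merge into one clump, the two response packets merge into another, and the late packet starts a third clump because it arrives after the timeout.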

2. Infrastructure and implementation

The network diagram used to capture the traffic for the dataset is presented in Figure 1. Firstly, normal web browsing activity that includes non-DoH HTTPS and benign DoH is simulated using web browsers. Secondly, malicious DoH is generated by a combination of tools used to create DoH tunnels. Traffic generated by all these tools is captured for pre-processing and training the classifiers.

Figure 1: Network topology used to capture traffic

As shown in Figure 1, we set up web servers to capture layer 1 data, and benign DNS servers, a malicious DNS server, and a DoH server to capture data at layer 2. The web browsers were configured to use various public DoH resolvers. To drive Firefox, we used GeckoDriver, an intermediary between Firefox and the tools that interact with it. Similarly, for Google Chrome, we used ChromeDriver to communicate with the browser. Traffic is captured between the DoH proxy and the DoH server using tcpdump.

We developed DoHLyzer, a DoH traffic flow generator and analyzer for anomaly and attack detection and characterization. DoHLyzer is a script written in Python that uses Scapy to read pcap files or sniff packets live.

A tool named DoH Data Collector was developed to simulate different DoH tunneling scenarios and capture the resulting HTTPS traffic. In each simulation instance, a new DoH tunnel is created over the underlying network according to the parameters of the scenario, such as:

DoH Server: AdGuard, Cloudflare, Google, Quad9
DNS Tunneling Tool: Iodine, dns2tcp, DNSCat2
Tunneling Client and Server Configurations: Settings such as the delay between sending requests and DNS record types used.
Transmission Rate: Random value between 100 B/s and 1100 B/s
Duration

To generate enough data, the clients used in the simulation were run simultaneously on 10 servers, all connecting to a single C2 server posing as a DNS nameserver. A central controller running on another server controls the timing of the simulations. Table 1 lists the IP addresses used to generate non-DoH, benign DoH, and malicious DoH traffic at both layers of the project methodology.
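As a rough illustration of how such scenarios could be drawn, the sketch below samples the three parameters for which the text gives concrete values (DoH server, tunneling tool, transmission rate). It is not the DoH Data Collector itself; the function and key names are hypothetical, and the client/server settings and duration are left out because no value ranges are stated for them.

```python
import random

DOH_SERVERS = ["AdGuard", "Cloudflare", "Google", "Quad9"]
TUNNELING_TOOLS = ["Iodine", "dns2tcp", "DNSCat2"]


def sample_scenario(rng: random.Random) -> dict:
    """Draw one tunneling scenario (hypothetical structure)."""
    return {
        "doh_server": rng.choice(DOH_SERVERS),
        "tool": rng.choice(TUNNELING_TOOLS),
        # Random transmission rate between 100 B/s and 1100 B/s.
        "rate_bytes_per_s": rng.uniform(100, 1100),
    }


rng = random.Random(2020)  # fixed seed so the example is repeatable
scenarios = [sample_scenario(rng) for _ in range(5)]
for s in scenarios:
    print(s)
```

Seeding the generator makes a batch of scenarios reproducible, which helps when a capture needs to be repeated with the same settings.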

Table 1: IP addresses used to generate traffic

Destination IPs used for accessing public DoH servers (all TLS packets to these hosts are DoH packets):	1.1.1.1 8.8.4.4 8.8.8.8 9.9.9.9 9.9.9.10 9.9.9.11 176.103.130.131 176.103.130.130 149.112.112.10 149.112.112.112 104.16.248.249 104.16.249.249
Source IP used to connect to websites (Google Chrome):	192.168.20.191
Source IPs used to connect to websites (Mozilla Firefox):	192.168.20.111 192.168.20.112 192.168.20.113
Source IPs used to create DoH tunnels:	192.168.20.144 192.168.20.204 192.168.20.205 192.168.20.206 192.168.20.207 192.168.20.208 192.168.20.209 192.168.20.210 192.168.20.211 192.168.20.212
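
The addresses in Table 1 are enough to assign labels to flows for both layers. The following sketch shows one way to do so; the labeling rule and the label strings are assumptions made for illustration and may differ from what the dataset tooling does.

```python
# IP addresses copied from Table 1.
DOH_RESOLVER_IPS = {
    "1.1.1.1", "8.8.4.4", "8.8.8.8", "9.9.9.9", "9.9.9.10", "9.9.9.11",
    "176.103.130.131", "176.103.130.130", "149.112.112.10",
    "149.112.112.112", "104.16.248.249", "104.16.249.249",
}
TUNNEL_CLIENT_IPS = {
    "192.168.20.144", "192.168.20.204", "192.168.20.205", "192.168.20.206",
    "192.168.20.207", "192.168.20.208", "192.168.20.209", "192.168.20.210",
    "192.168.20.211", "192.168.20.212",
}


def label_flow(src_ip: str, dst_ip: str):
    """Return (layer-1 label, layer-2 label) for a flow; labels are hypothetical."""
    endpoints = {src_ip, dst_ip}
    # Layer 1: a flow is DoH if one endpoint is a public DoH resolver.
    if not endpoints & DOH_RESOLVER_IPS:
        return ("NonDoH", None)
    # Layer 2: DoH flows from tunneling clients are malicious, the rest benign.
    if endpoints & TUNNEL_CLIENT_IPS:
        return ("DoH", "Malicious")
    return ("DoH", "Benign")


print(label_flow("192.168.20.144", "1.1.1.1"))        # tunneling client to resolver
print(label_flow("192.168.20.191", "8.8.8.8"))        # Chrome client to resolver
print(label_flow("192.168.20.111", "93.184.216.34"))  # Firefox client to a website
```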

3. Capturing data and final dataset

Based on the scenario defined in the previous section, we implemented the infrastructure and captured the traffic. Table 2 presents the details of the packets and flows captured using the browsers/tools and DoH servers.

Table 2: Dataset details

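Browser/tool	DoH Server	Packets	Flows	Type
Google Chrome	AdGuard	5609K	105141	HTTPS (Non-DoH and Benign DoH)
Google Chrome	Cloudflare	6117K	132552	HTTPS (Non-DoH and Benign DoH)
Google Chrome	Google DNS	5878K	108680	HTTPS (Non-DoH and Benign DoH)
Google Chrome	Quad9	10737K	199090	HTTPS (Non-DoH and Benign DoH)
Mozilla Firefox	AdGuard	4943K	50485	HTTPS (Non-DoH and Benign DoH)
Mozilla Firefox	Cloudflare	4299K	90260	HTTPS (Non-DoH and Benign DoH)
Mozilla Firefox	Google DNS	6413K	138422	HTTPS (Non-DoH and Benign DoH)
Mozilla Firefox	Quad9	4956K	92670	HTTPS (Non-DoH and Benign DoH)
dns2tcp	AdGuard	1281K	5459	Malicious DoH
dns2tcp	Cloudflare	3694K	6045	Malicious DoH
dns2tcp	Google DNS	28711K	17423	Malicious DoH
dns2tcp	Quad9	8750K	138588	Malicious DoH
DNSCat2	AdGuard	1301K	5369	Malicious DoH
DNSCat2	Cloudflare	12346K	9230	Malicious DoH
DNSCat2	Google DNS	48069K	11915	Malicious DoH
DNSCat2	Quad9	9309K	9108	Malicious DoH
Iodine	AdGuard	3938K	11336	Malicious DoH
Iodine	Cloudflare	5932K	14110	Malicious DoH
Iodine	Google DNS	73459K	12192	Malicious DoH
Iodine	Quad9	22668K	8975	Malicious DoH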

4. Feature extraction

DoHMeter is a tool developed in Python to extract statistical and time-series features from the captured PCAP files. It produces a CSV file as output. The data in that file is labeled flow-wise based on the IP addresses of the servers used in the network diagram (Figure 1). Table 3 lists the 28 statistical features extracted from captured traffic.

Table 3: List of extracted statistical traffic features

Parameter	Feature
F1	Number of flow bytes sent
F2	Rate of flow bytes sent
F3	Number of flow bytes received
F4	Rate of flow bytes received
F5	Mean Packet Length
F6	Median Packet Length
F7	Mode Packet Length
F8	Variance of Packet Length
F9	Standard Deviation of Packet Length
F10	Coefficient of Variation of Packet Length
F11	Skew from median Packet Length
F12	Skew from mode Packet Length
F13	Mean Packet Time
F14	Median Packet Time
F15	Mode Packet Time
F16	Variance of Packet Time
F17	Standard Deviation of Packet Time
F18	Coefficient of Variation of Packet Time
F19	Skew from median Packet Time
F20	Skew from mode Packet Time
F21	Mean Request/response time difference
F22	Median Request/response time difference
F23	Mode Request/response time difference
F24	Variance of Request/response time difference
F25	Standard Deviation of Request/response time difference
F26	Coefficient of Variation of Request/response time difference
F27	Skew from median Request/response time difference
F28	Skew from mode Request/response time difference
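
Features F5 to F28 apply the same eight summary statistics to three series (packet lengths, packet times, and request/response time differences). The sketch below computes those eight statistics for one series with Python's standard library. The exact definitions are assumptions: population variance is used, and the two skew measures are taken to be Pearson's skewness coefficients, (mean - mode) / std and 3 * (mean - median) / std. The function and key names are hypothetical.

```python
import statistics


def summarize(values):
    """Compute the eight per-series statistics for a non-empty list of numbers."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    mode = statistics.mode(values)           # first mode if there are several (Python 3.8+)
    variance = statistics.pvariance(values)  # population variance (assumption)
    std = variance ** 0.5
    return {
        "mean": mean,
        "median": median,
        "mode": mode,
        "variance": variance,
        "std": std,
        # Guard against division by zero for constant or zero-mean series.
        "coefficient_of_variation": std / mean if mean else 0.0,
        "skew_from_median": 3 * (mean - median) / std if std else 0.0,
        "skew_from_mode": (mean - mode) / std if std else 0.0,
    }


packet_lengths = [100, 100, 200, 400]  # made-up sample values
stats = summarize(packet_lengths)
for name, value in stats.items():
    print(f"{name}: {value}")
```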

5. Using the dataset

To analyze the network traffic with AI techniques, you can download our generated data (CSV) files and work with them directly. To use a different feature extractor, you can extract your own features from the raw captured files (PCAP) and then analyze the resulting data with data mining techniques.
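
As a starting point for working with the CSV data, the sketch below counts how many flow records carry each label. The column names (`SourceIP`, `DestinationIP`, `Label`) and the inline sample rows are hypothetical stand-ins; check the header of the downloaded files and adjust `label_column` accordingly.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file, label_column="Label"):
    """Count flow records per label value in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# Hypothetical three-row sample standing in for a downloaded CSV file.
sample = io.StringIO(
    "SourceIP,DestinationIP,Label\n"
    "192.168.20.191,8.8.8.8,Benign\n"
    "192.168.20.144,1.1.1.1,Malicious\n"
    "192.168.20.111,9.9.9.9,Benign\n"
)
counts = count_labels(sample)
print(counts)
```

With a real file, replace the `io.StringIO` sample with `open(path, newline="")`. Checking the class balance this way is a useful first step before training a classifier.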

YouTube video: DNS over HTTPS by Dr. Gurdip Kaur

6. License

You may redistribute, republish, and mirror the CIRA-CIC-DoHBrw-2020 dataset in any form. However, any use or redistribution of the data must include a citation to DoHMeter and the following research paper outlining the details of captured DoH traffic:

Mohammadreza MontazeriShatoori, Logan Davidson, Gurdip Kaur, and Arash Habibi Lashkari, “Detection of DoH Tunnels using Time-series Classification of Encrypted Traffic”, The 5th IEEE Cyber Science and Technology Congress, Calgary, Canada, August 2020

If you are interested in CIRA-CIC-DoHBrw-2020, you may also be interested in the BCCC-CIRA-CIC-DoHBrw-2020 dataset made available by our colleagues at the Behaviour-Centric Cybersecurity Center, York University.

Download the dataset
