Tunneling for Transparency:
A Large-Scale Analysis of End-to-End Violations in the Internet

Taejoong Chung, Alan Mislove, David Choffnes

Northeastern University

Paper Overview

Detecting violations of application-level end-to-end connectivity on the Internet is of significant interest to researchers and end users; recent studies have revealed cases of HTTP ad injection and HTTPS man-in-the-middle attacks. Unfortunately, detecting such end-to-end violations at scale remains difficult, as it generally requires the cooperation of many nodes spread across the globe. Most successful approaches have relied on dedicated hardware, user-installed software, or privileged access to a popular web site. In this paper, we present an alternate approach for detecting end-to-end violations based on Luminati, an HTTP/S proxy service that routes traffic through millions of end hosts. We develop measurement techniques that allow Luminati to be used to detect end-to-end violations of DNS, HTTP, and HTTPS, and, in many cases, enable us to identify the culprit. We present results from over 1.2M nodes across 14K ASes in 172 countries, finding that up to 4.8% of nodes are subject to some type of end-to-end connectivity violation. Finally, we are able to use Luminati to identify and measure the incidence of content monitoring, where end-host software or ISP middleboxes record users' HTTP requests and later re-download the content to third-party servers.

This paper will be published at IMC'2016 (Internet Measurement Conference) and you can download our paper here

What is Luminati?

Luminati is a paid HTTP/S proxy service that routes traffic via exit nodes, which are Hola users' clients. Clients of Luminati can use an API to automate requests and to express preferences over which Hola client will be selected to route their traffic. Luminati clients are charged on a per-GB basis, and all Luminati traffic is first routed via a Hola server before being forwarded to a Hola user's client.
We used Luminati to explore an alternative approach to detecting end-to-end connectivity violations in edge networks. This approach lets us take measurements from nearly one million end hosts without requiring users to install our software or hardware. By using Luminati, we can route HTTP/S traffic via many of the Hola nodes and gain visibility into their networks.
Using Luminati, we demonstrate how a large-scale HTTP/S proxy service can be used to measure end-to-end connectivity violations in DNS, HTTP, and HTTPS. We develop techniques that allow us, in most cases, to identify the party responsible for the violations (e.g., the user’s DNS resolver, an ISP middlebox, or software on the user’s machine). This allows researchers to conduct measurements at the scale of approaches deployed by popular web sites, and avoids the overhead of having to convince users to install custom software or hardware. (For more details, visit https://luminati.io)
Below, we make our four datasets (NXDOMAIN Hijacking, HTTP Content Modification, SSL Certificate Modification, and Content Monitoring) public. For more details regarding our methodologies and results, please take a look at our paper.

DNS NXDOMAIN Hijacking

To detect NXDOMAIN hijacking, we make the exit nodes issue DNS queries for names served by our own DNS server, which deliberately returns an NXDOMAIN response. We then check whether each exit node receives the NXDOMAIN response or receives content instead.
Using this methodology, we measured a total of 753,111 unique exit nodes from 167 countries and 10,197 ASes. We found that these exit nodes are configured to use a total of 33,446 unique DNS servers. We observed that 717,311 of the exit nodes (95.2%) do not experience NXDOMAIN hijacking, but the other 35,800 exit nodes (4.8%) have their response intercepted.
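The decision rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ProbeResult` and `is_nxdomain_hijacked` are hypothetical names, and the record layout is not the format of our measurement logs. The last line simply recomputes the hijacked share from the counts reported above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """Hypothetical record of one probe sent through one exit node."""
    resolved: bool          # True if the exit node obtained an address for the probe name
    body: Optional[bytes]   # content handed back to the exit node, if any


def is_nxdomain_hijacked(result: ProbeResult) -> bool:
    # Our DNS server always answers NXDOMAIN for the probe name, so a
    # successful resolution or any returned content means that something
    # between the exit node and our server rewrote the response.
    return result.resolved or bool(result.body)


# Share of hijacked exit nodes, recomputed from the counts reported above.
hijacked_percent = round(100 * 35_800 / 753_111, 1)
```

For example, `is_nxdomain_hijacked(ProbeResult(resolved=True, body=b"<html>...</html>"))` flags a node that was served a page for a name that does not exist.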

Name	Type	Size	SHA-256 Hash (Uncompressed)
nonnx-domain-list.txt	txt	65 MB	`ca73419fadb2f483109baaad2c01cf143bc8caa3fadc793531b7f21323d6ed3b`
nx-domain-list.txt	txt	4.7 MB	`36915856aacb670a7fab5090af3238df2c25d16c5ca823e9a207adab8ff3c717`
dataset-description.txt	txt	767 B

/NXDOMAIN-Dataset/

# This example shows 5 exit nodes that use the DNS server "68.105.29.76" and whose NXDOMAIN responses are hijacked.
grep "68.105.29.76" nx-domain-list.txt | head -5
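The SHA-256 hashes listed in the dataset tables can be used to check a download. The following is a minimal sketch, assuming the file has already been downloaded and uncompressed into the working directory; the function names are ours and are not part of the dataset.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hash published in the table above for nx-domain-list.txt (uncompressed).
NX_DOMAIN_LIST_SHA256 = "36915856aacb670a7fab5090af3238df2c25d16c5ca823e9a207adab8ff3c717"


def matches_published_hash(path: str, expected: str = NX_DOMAIN_LIST_SHA256) -> bool:
    # Compare case-insensitively, since hex digests may be written in either case.
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_published_hash("nx-domain-list.txt")` should return `True` for an intact copy; the same helper works for the other datasets when given their hashes.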

HTTP Content Modification

We simply fetch content from our Web server via an exit node, and check whether the content we received is the same as what we sent. For this experiment, we fetch four different pieces of content through each exit node: a 9 KB HTML page, a 39 KB JPEG image, a 258 KB unminified JavaScript library, and a 3 KB unminified CSS file.
Using Luminati, we measured 49,545 exit nodes in 12,658 ASes across 171 countries. We detected HTML content modification for 472 exit nodes (0.95%), image modification for 694 (1.4%), JavaScript modification for 45 (0.09%), and CSS modification for 11 (0.02%).
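The comparison itself is straightforward. The sketch below is one possible way to do it, not our measurement code: it assumes the original objects and the copies received through an exit node are available as byte strings keyed by a content type label, and it compares SHA-256 digests (a direct byte comparison would work equally well). All names are hypothetical.

```python
import hashlib
from typing import Dict, List


def content_modified(sent: bytes, received: bytes) -> bool:
    """True if the bytes received via the exit node differ from the bytes we served."""
    return hashlib.sha256(sent).digest() != hashlib.sha256(received).digest()


def modified_objects(sent: Dict[str, bytes], received: Dict[str, bytes]) -> List[str]:
    # A missing object is treated as modified, since the exit node did not
    # receive what our server sent.
    return [kind for kind, original in sent.items()
            if content_modified(original, received.get(kind, b""))]
```

For instance, if only the image was recompressed in transit, `modified_objects` returns a list containing just the image label.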

Name	Type	Size	SHA-256 Hash (Uncompressed)
content_nonmodification_list.txt	txt	11 MB	`9f68a3f88a7fb5a5898b6ee3f010c0e1579e27841110991235384a61146e7f59`
content_modification_list.txt	txt	403 KB	`1119110d12b0c5264dc4ca1b56cce1cddf5949891cbd04bab55caf083f7ea028`
dataset_description.txt	txt	744 B

/ContentModification-Dataset/

# This example shows 5 exit nodes in AS29180 whose received image content was modified (compressed).
grep img content_modification_list.txt | grep 29180 | head -5
20160504,82.132.244.0/22,29180,gb,great britain,img,13236,444
20160505,82.132.224.0/22,29180,gb,great britain,img,13236,256
20160505,82.132.236.0/22,29180,gb,great britain,img,13236,472
20160505,82.132.224.0/22,29180,gb,great britain,img,13236,396
20160505,82.132.236.0/22,29180,gb,great britain,img,13236,472
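Records like the ones above can also be summarized with a short script. The sketch below counts records per AS and content type. It assumes that the third comma-separated field is the AS number and the sixth is the content type, which is consistent with the grep example above; the authoritative column definitions are in dataset_description.txt, and the remaining fields are deliberately left uninterpreted here.

```python
import csv
from collections import Counter
from typing import Iterable

# The five sample records shown above.
SAMPLE = """\
20160504,82.132.244.0/22,29180,gb,great britain,img,13236,444
20160505,82.132.224.0/22,29180,gb,great britain,img,13236,256
20160505,82.132.236.0/22,29180,gb,great britain,img,13236,472
20160505,82.132.224.0/22,29180,gb,great britain,img,13236,396
20160505,82.132.236.0/22,29180,gb,great britain,img,13236,472
"""


def count_by_asn_and_type(lines: Iterable[str]) -> Counter:
    """Count records per (AS number, content type); field positions are assumed."""
    counts: Counter = Counter()
    for row in csv.reader(lines):
        if len(row) < 6:
            continue  # skip blank or malformed lines
        counts[(row[2], row[5])] += 1
    return counts


sample_counts = count_by_asn_and_type(SAMPLE.splitlines())
```

To process the full dataset, pass an open file object for content_modification_list.txt instead of the sample lines.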

SSL Certificate Replacement

We use the HTTP CONNECT method with the super proxy, which tunnels all TCP port 443 traffic between the exit node and our measurement client, including the TLS handshake. We complete a TLS handshake and record the SSL certificates presented; we then terminate the connection (we do not actually download any content). As certificate replacement may target individual web sites, we chose three different classes of sites to test:

Popular sites: We choose the 20 most popular sites that support HTTPS from each country’s Alexa ranking.
International sites: We choose the web sites of 10 U.S. universities with which IMC’16 PC members are affiliated.
Invalid sites: We choose three sites with intentionally invalid certificates: a self-signed certificate, an expired certificate, and a certificate with an incorrect Common Name.

Using this methodology, we measured a total of 807,910 exit nodes in 10,007 ASes and 115 countries. Among these exit nodes, we find that 4,540 of them (0.56%) received at least one modified certificate. Interestingly, we find that not every certificate is modified, indicating that certificates can be selectively replaced.
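One simple way to spot a replaced certificate is to compare fingerprints. The sketch below is an illustration rather than our exact procedure: it assumes the DER-encoded leaf certificate for each host is available both from a direct connection and from the connection tunneled through an exit node, and it reports the hosts where the two differ. A mismatch alone is not proof of interception, because a site may legitimately serve different certificates from different servers, so flagged hosts still need the certificate chain to be inspected. All names are hypothetical.

```python
import hashlib
from typing import Dict, List


def fingerprint(der_cert: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate, as a hex string."""
    return hashlib.sha256(der_cert).hexdigest()


def replaced_hosts(direct: Dict[str, bytes], via_exit_node: Dict[str, bytes]) -> List[str]:
    # Only hosts observed on both paths can be compared.
    return sorted(
        host for host, der in via_exit_node.items()
        if host in direct and fingerprint(der) != fingerprint(direct[host])
    )
```

Because the result is a per-host list, it also exposes selective replacement: an exit node may show a mismatch for some hosts and none for others.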

Name	Type	Size	SHA-256 Hash (Uncompressed)
certs-test-fail.txt	txt	15 MB	`f8de86cf1835d47d5d8e9b176da106246e344e2dd11514819ad0b5c3ec7b24f1`
certs-test-ok.txt	txt	407 MB	`31999d88d7cad333213c38db60f665058a6f4b237f2cbb5fc883c1c34e6c2955`
dataset-description.txt	txt	622 B

/SSL-Dataset/

# This example shows 5 exit nodes that received modified certificates, together with the certificate chains they received.
grep "expired.badssl.com" certs-test-fail.txt | head -5

Content Monitoring

Another concerning form of end-to-end violation is content monitoring, or cases where middleboxes are silently observing content that users are downloading for the purpose of scanning content or otherwise controlling access. While content modification is easy to detect (e.g., via block pages), content monitoring is significantly more difficult to detect, as there is (by definition) no change to the content itself. However, we discovered we can detect certain types of content monitoring based on unexpected requests arriving at our measurement server.
In this experiment, we measured a total of 747,449 exit nodes, and observed that 11,234 (1.5%) of them resulted in multiple, unexpected requests. These unexpected requests came from 424 unique IP addresses that were different from the exit nodes.
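The idea of flagging unexpected requests can be sketched as follows. The sketch assumes that each exit node is asked to fetch a URL that no other client knows, so that any later request for the same URL from a different address is unexpected; this per-node URL scheme, the log format, and all names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_unexpected_requests(
    exit_ip_by_url: Dict[str, str],
    access_log: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Map each probe URL to the source IPs that requested it but are not its exit node.

    access_log yields (url, source_ip) pairs taken from the measurement server's log.
    """
    unexpected: Dict[str, Set[str]] = defaultdict(set)
    for url, source_ip in access_log:
        exit_ip = exit_ip_by_url.get(url)
        if exit_ip is not None and source_ip != exit_ip:
            unexpected[url].add(source_ip)
    return dict(unexpected)
```

Grouping the flagged source addresses by prefix or AS is then a natural next step for identifying the monitoring party.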

Name	Type	Size	SHA-256 Hash (Uncompressed)
monitoring-webserver.txt	txt	191 MB	`6aefe6ab1425ae0cbefb6542e59a264a08188520536139d2a046d5b7ee142d83`
dataset-description.txt	txt	342 B

/Monitoring-Dataset/

# This example shows 5 exit nodes whose web requests are monitored by TrendMicro.
grep "150.70.176.0/20" monitoring-webserver.txt | head -5

Contact

Do you have any questions, comments, or concerns? Feel free to send an email to Taejoong Chung.
