Tunneling for Transparency:
A Large-Scale Analysis of End-to-End Violations in the Internet

Taejoong Chung, Alan Mislove, David Choffnes

Northeastern University

Paper Overview

Detecting violations of application-level end-to-end connectivity on the Internet is of significant interest to researchers and end users; recent studies have revealed cases of HTTP ad injection and HTTPS man-in-the-middle attacks. Unfortunately, detecting such end-to-end violations at scale remains difficult, as it generally requires having the cooperation of many nodes spread across the globe. Most successful approaches have relied either on dedicated hardware, user-installed software, or privileged access to a popular web site. In this paper, we present an alternate approach for detecting end-to-end violations based on Luminati, an HTTP/S proxy service that routes traffic through millions of end hosts. We develop measurement techniques that allow Luminati to be used to detect end-to-end violations of DNS, HTTP, and HTTPS, and, in many cases, enable us to identify the culprit. We present results from over 1.2M nodes across 14K ASes in 172 countries, finding that up to 4.8% of nodes are subject to some type of end-to-end connectivity violation. Finally, we are able to use Luminati to identify and measure the incidence of content monitoring, where end-host software or ISP middleboxes record users' HTTP requests and later re-download the content to third-party servers.

This paper will be published at IMC 2016 (the Internet Measurement Conference); you can download our paper here.

What is Luminati?

Luminati is a paid HTTP/S proxy service that routes traffic via exit nodes (Hola users' clients). Clients of Luminati can use an API to automate requests, as well as to express preferences over which Hola client will be selected to route their traffic. Luminati clients are charged on a per-GB basis, and all Luminati traffic is first routed via a Hola server before being forwarded to a Hola user's client.
We used Luminati to explore an alternative approach to detecting end-to-end connectivity violations in edge networks, one that allows us to take measurements from nearly one million end hosts simultaneously without requiring users to install our software or hardware. By using Luminati, we can route HTTP/S traffic via many of the Hola nodes and gain visibility into their networks.
Using Luminati, we demonstrate how a large-scale HTTP/S proxy service can be used to measure end-to-end connectivity violations in DNS, HTTP, and HTTPS. We develop techniques that allow us, in most cases, to identify the party responsible for the violations (e.g., the user’s DNS resolver, an ISP middlebox, or software on the user’s machine). This allows researchers to conduct measurements at the scale of approaches deployed by popular web sites, and avoids the overhead of having to convince users to install custom software or hardware. (For more details, visit https://luminati.io)
Below, we make four datasets public: NXDOMAIN Hijacking, HTTP Content Modification, SSL Certificate Replacement, and Content Monitoring. For more details on our methodologies and results, please see our paper.

DNS NXDOMAIN Hijacking

To detect NXDOMAIN hijacking, we have the exit nodes issue DNS resolution queries to our DNS server, which deliberately returns an NXDOMAIN response. We then check whether the exit node receives the NXDOMAIN response or content instead.
Using this methodology, we measured a total of 753,111 unique exit nodes from 167 countries and 10,197 ASes. These exit nodes are configured to use a total of 33,446 unique DNS servers. We observed that 717,311 of the exit nodes (95.2%) do not experience NXDOMAIN hijacking, while the other 35,800 exit nodes (4.8%) have their responses intercepted.
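The check described above can be sketched as follows. This is a minimal illustration, not the measurement code used in the paper: the zone `probe.example.org`, the `ProbeResult` type, and the `classify` function are hypothetical names, and the sketch assumes that each probe uses a fresh random name so that no cache can answer it.

```python
import secrets
from dataclasses import dataclass
from typing import Optional

# Hypothetical zone; we assume its DNS server answers NXDOMAIN for every name.
PROBE_ZONE = "probe.example.org"


def make_probe_name(zone: str = PROBE_ZONE) -> str:
    """Return a unique, never-before-seen name so that caches cannot answer it."""
    return f"{secrets.token_hex(8)}.{zone}"


@dataclass
class ProbeResult:
    """Outcome of asking an exit node to fetch http://<probe name>/ (assumed shape)."""
    dns_error: bool                # True if the exit node reported a resolution failure
    body: Optional[bytes] = None   # content returned instead of an error, if any


def classify(result: ProbeResult) -> str:
    """Label a probe: a name that does not exist should never yield content."""
    if result.dns_error:
        return "not hijacked"
    if result.body is not None:
        return "hijacked"
    return "inconclusive"
```

A node that returns a page for a nonexistent name is labeled "hijacked"; identifying which party rewrote the response requires the further analysis described in the paper.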

Name                     Type  Size    SHA-256 Hash (Uncompressed)
nonnx-domain-list.txt    txt   65 MB   ca73419fadb2f483109baaad2c01cf143bc8caa3fadc793531b7f21323d6ed3b
nx-domain-list.txt       txt   4.7 MB  36915856aacb670a7fab5090af3238df2c25d16c5ca823e9a207adab8ff3c717
dataset-description.txt  txt   767 B

/NXDOMAIN-Dataset/

  • # This example shows five exit nodes that use the DNS server "68.105.29.76" and whose responses are hijacked.
  • grep "68.105.29.76" nxdomain_list.txt | head -5
  • 20160413,72.209.128.0/18,22773,us,united states,68.105.29.76,22773,usnx14387334,data/0413/us/usNX14387334,360
    20160413,72.204.100.0/22,22773,us,united states,68.105.29.76,22773,usnx14114461,data/0413/us/usNX14114461,316
    20160413,24.255.128.0/17,22773,us,united states,68.105.29.76,22773,usnx14075223,data/0413/us/usNX14075223,336
    20160414,68.0.64.0/18,22773,us,united states,68.105.29.76,22773,usnx145852847,data/0414/us/usNX145852847,280
    20160414,72.204.68.0/22,22773,us,united states,68.105.29.76,22773,usnx142195077,data/0414/us/usNX142195077,312
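The SHA-256 hashes published in the dataset tables can be used to verify a file after downloading and uncompressing it. The following is a minimal sketch; the function names are illustrative, and the local file path is whatever name the file was saved under.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare against a hash copied from the table, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash(Path("nx-domain-list.txt"), "36915856aacb670a7fab5090af3238df2c25d16c5ca823e9a207adab8ff3c717")` should return `True` for an intact copy of that file, assuming it was saved under that name.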

HTTP Content Modification

We fetch content from our web server via an exit node and check whether the content we receive is the same as what we sent. For this experiment, we fetch four different pieces of content through each exit node: a 9 KB HTML page, a 39 KB JPEG image, a 258 KB unminified JavaScript library, and a 3 KB unminified CSS file.
Using Luminati, we measured 49,545 exit nodes in 12,658 ASes across 171 countries. We detected HTML content modification for 472 exit nodes (0.95%), image modification for 694 (1.4%), JavaScript modification for 45 (0.09%), and CSS modification for 11 (0.02%).
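The comparison step can be sketched as below. This is an illustrative sketch rather than the paper's implementation: it assumes the bytes served and the bytes received through an exit node are both available, keyed by hypothetical object names such as "html" or "img".

```python
import hashlib
from typing import Dict, List


def digest(content: bytes) -> str:
    """SHA-256 of a response body, as a hex string."""
    return hashlib.sha256(content).hexdigest()


def find_modified(sent: Dict[str, bytes], received: Dict[str, bytes]) -> List[str]:
    """Return the names of objects whose received bytes differ from what was served."""
    modified = []
    for name, original in sent.items():
        body = received.get(name)
        if body is None:
            continue  # the fetch failed, which says nothing about modification
        if digest(body) != digest(original):
            modified.append(name)
    return modified
```

A digest mismatch only shows that the bytes changed in transit; classifying the change (for example, injected HTML versus image compression) requires inspecting the received content itself.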

Name                              Type  Size    SHA-256 Hash (Uncompressed)
content_nonmodification_list.txt  txt   11 MB   9f68a3f88a7fb5a5898b6ee3f010c0e1579e27841110991235384a61146e7f59
content_modification_list.txt     txt   403 KB  1119110d12b0c5264dc4ca1b56cce1cddf5949891cbd04bab55caf083f7ea028
dataset_description.txt           txt   744 B

/ContentModification-Dataset/

  • # This example shows five exit nodes in AS29180 whose received image content was modified (compressed).
  • grep img content_modification_list.txt | grep 29180 | head -5
  • 20160504,82.132.244.0/22,29180,gb,great britain,img,13236,444
    20160505,82.132.224.0/22,29180,gb,great britain,img,13236,256
    20160505,82.132.236.0/22,29180,gb,great britain,img,13236,472
    20160505,82.132.224.0/22,29180,gb,great britain,img,13236,396
    20160505,82.132.236.0/22,29180,gb,great britain,img,13236,472

SSL Certificate Replacement

We used the HTTP CONNECT method with the super proxy, which tunnels all TCP port 443 traffic between the exit node and our measurement client, including the TLS handshake. We completed a TLS handshake and recorded the SSL certificates presented; we then terminated the connection (we did not actually download any content). As certificate replacement may target individual web sites, we chose three different classes of sites to test:

  1. Popular sites: We chose the 20 most popular sites that support HTTPS from each country’s Alexa Ranking.
  2. International sites: We chose the web sites of 10 U.S. universities with which IMC’16 PC members are affiliated.
  3. Invalid sites: We chose three sites with intentionally invalid certificates: a self-signed certificate, an expired certificate, and a certificate with an incorrect Common Name.
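The handshake-only probe described above could be sketched with the Python standard library as follows. This is an assumption-laden illustration, not the authors' tooling: the proxy host and port are placeholders, proxy authentication is omitted, and only the leaf certificate is captured (the measurement itself recorded the certificates presented). The helper names are hypothetical.

```python
import hashlib
import http.client
import ssl


def fetch_leaf_certificate(proxy_host: str, proxy_port: int, target_host: str,
                           timeout: float = 10.0) -> bytes:
    """Open a CONNECT tunnel to target_host:443, finish the TLS handshake, and
    return the server's leaf certificate (DER) without requesting any content."""
    context = ssl.create_default_context()
    # Validation is disabled on purpose: invalid or replaced certificates must
    # be recorded rather than causing the handshake to abort.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    conn = http.client.HTTPSConnection(proxy_host, proxy_port,
                                       timeout=timeout, context=context)
    conn.set_tunnel(target_host, 443)
    try:
        conn.connect()  # sends CONNECT to the proxy, then performs the TLS handshake
        return conn.sock.getpeercert(binary_form=True)
    finally:
        conn.close()


def fingerprint(der_cert: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate."""
    return hashlib.sha256(der_cert).hexdigest()


def differs_from_reference(reference_fingerprint: str, observed_der: bytes) -> bool:
    """True if the certificate seen via the exit node is not the reference one."""
    return fingerprint(observed_der) != reference_fingerprint.strip().lower()
```

A mismatch is only a candidate for replacement, since a site may legitimately serve different certificates from different servers; further validation of the observed certificate is needed before drawing a conclusion.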

Using this methodology, we measured a total of 807,910 exit nodes in 10,007 ASes and 115 countries. Among these exit nodes, we find that 4,540 (0.56%) received at least one modified certificate. Interestingly, we find that not every certificate is modified, indicating that certificates can be selectively replaced.

Name                     Type  Size    SHA-256 Hash (Uncompressed)
certs-test-fail.txt      txt   15 MB   f8de86cf1835d47d5d8e9b176da106246e344e2dd11514819ad0b5c3ec7b24f1
certs-test-ok.txt        txt   407 MB  31999d88d7cad333213c38db60f665058a6f4b237f2cbb5fc883c1c34e6c2955
dataset-description.txt  txt   622 B

/SSL-Dataset/

  • # This example shows five exit nodes that received modified certificates, along with their chains of certificates.
  • grep "expired.badssl.com" certs-test-fail.txt | head -5
  • 20160414,216.152.160.0/20,11081,cw,curacao,expired.badssl.com,0
    20160414,90.184.0.0/15,39554,dk,denmark,expired.badssl.com,0
    20160414,90.184.0.0/15,39554,dk,denmark,expired.badssl.com,1
    20160414,86.52.0.0/16,197288,dk,denmark,expired.badssl.com,0
    20160414,2.104.0.0/13,3292,dk,denmark,expired.badssl.com,0

Content Monitoring

Another concerning form of end-to-end violation is content monitoring, or cases where middleboxes are silently observing content that users are downloading for the purpose of scanning content or otherwise controlling access. While content modification is easy to detect (e.g., via block pages), content monitoring is significantly more difficult to detect, as there is (by definition) no change to the content itself. However, we discovered we can detect certain types of content monitoring based on unexpected requests arriving at our measurement server.
We measured a total of 747,449 exit nodes and observed that requests through 11,234 of them (1.5%) resulted in multiple, unexpected requests. These unexpected requests came from 424 unique IP addresses that were different from those of the exit nodes.
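The detection idea can be sketched as follows. The sketch assumes that each exit node is asked to fetch a URL containing a unique token, so that any later request for the same token from a different address stands out in the server log; the `LogEntry` type and the token scheme are hypothetical and shown only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class LogEntry:
    token: str        # unique per-exit-node identifier embedded in the requested URL
    source_ip: str    # address the request arrived from
    timestamp: float  # arrival time at the measurement server


def unexpected_requests(log: Iterable[LogEntry],
                        exit_node_ip: Dict[str, str]) -> List[LogEntry]:
    """Return log entries for a known token that did not come from its exit node."""
    hits = []
    for entry in log:
        expected_ip = exit_node_ip.get(entry.token)
        if expected_ip is not None and entry.source_ip != expected_ip:
            hits.append(entry)
    return hits
```

Entries returned by `unexpected_requests` indicate that some other party learned the URL and fetched it again, which is the signal of content monitoring described above.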

Name                      Type  Size    SHA-256 Hash (Uncompressed)
monitoring-webserver.txt  txt   191 MB  6aefe6ab1425ae0cbefb6542e59a264a08188520536139d2a046d5b7ee142d83
dataset-description.txt   txt   342 B

/Monitoring-Dataset/

  • # This example shows five exit nodes whose web behavior is monitored by TrendMicro.
  • grep "150.70.176.0/20" monitoring-webserver.txt | head -5
  • 66.55.112.0/20,13802,1460899911.677191,bm,bermuda,150.70.176.0/20,16880,1460900006.318172
    83.141.64.0/19,25441,1460647563.867181,ie,ireland,150.70.176.0/20,16880,1460647649.073324
    62.152.29.0/24,8544,1460860325.104590,cy,cyprus,150.70.176.0/20,16880,1460860385.051470
    180.94.69.0/24,55330,1460895286.689661,af,afghanistan,150.70.176.0/20,16880,1460895379.816939
    82.72.0.0/15,9143,1460766651.884828,nl,netherlands,150.70.176.0/20,16880,1460766675.049640
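Records like those above can be processed with a short script. The sketch below computes the gap between the two timestamps in each record. It assumes, based only on the sample output, that the columns are: exit-node prefix, exit-node AS, first timestamp, country code, country name, requester prefix, requester AS, second timestamp. The authoritative column definitions are in dataset-description.txt, so treat this layout as a guess.

```python
import csv
from typing import Iterable, List, Tuple


def refetch_delays(lines: Iterable[str]) -> List[Tuple[str, float]]:
    """Return (exit-node prefix, seconds between the two timestamps) per record.

    Column positions are an assumption inferred from the sample records.
    """
    delays = []
    for row in csv.reader(lines):
        if len(row) != 8:
            continue  # skip blank lines or rows that do not fit the assumed layout
        exit_prefix = row[0]
        first_ts, second_ts = float(row[2]), float(row[7])
        delays.append((exit_prefix, second_ts - first_ts))
    return delays
```

Under the assumed layout, the first sample record yields a gap of roughly 95 seconds between the two timestamps.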

Contact

Do you have any questions, comments, or concerns? Feel free to send an email to Taejoong Chung.