tstearns.com/notes

Where Does My Data Go?

January 2018

Where does my data go when making requests across the Internet? That is a difficult but often-asked question.

One approach to visualizing geographic packet paths made its way onto Hacker News recently: https://stefansundin.github.io/traceroute-mapper/. It is disappointingly inaccurate. Let's look at the example traceroute output given by the above website, which allegedly travels from Sweden through Switzerland:

 7  se-lla.nordu.net (109.105.102.93)  0.574 ms
 8  se-fre.nordu.net (109.105.97.113)  13.105 ms
 9  s-b3-link.telia.net (213.248.97.17)  16.265 ms
10  s-bb4-link.telia.net (80.91.253.226)  13.546 ms
11  kbn-bb3-link.telia.net (62.115.139.167)  22.249 ms
12  nyk-bb1-link.telia.net (62.115.141.99)  115.487 ms
13  ash-bb4-link.telia.net (213.155.133.9)  126.166 ms
14  las-b3-link.telia.net (213.155.137.59)  194.745 ms

Getting Started

In order to determine where the data actually goes, we need to examine three properties:

  1. The network that carries the traffic.
  2. The routers within that network that reveal points along the path of the data.
  3. The undersea cables and underground fiber that connect the routers.

With some human intuition, #1 and #2 are not too difficult to interpret in the above traceroute. NORDUnet takes the traffic from lla (Luleå) to fre (Stockholm) and hands it off to Telia, which takes the traffic through kbn (København, aka Copenhagen), then crosses into the USA through nyk (New York City), ash (Ashburn, Virginia), and finally las (Los Angeles). Here's a plot using Karl Swartz's Great Circle Mapper, which is a handy tool for all manner of global plotting (though note that the black lines connecting the cities are rough estimates, as we don't really know which cable and fiber routes were used):

Path from gcmap.com

An astute reader might question how these associations are made. Some are obvious, such as lla, which is the IATA airport code for Luleå (using airport codes is a common pattern for router names). But others are not so clear. For example:

  1. How do we know that fre is in Stockholm? Perhaps it refers to the Frescati neighborhood where Stockholm University is located, but that's just a guess.
  2. Why does Telia use las for Los Angeles, which not only does not seem like an obvious abbreviation, but is the airport code for nearby Las Vegas? I have no idea.

But I do know that NORDUnet looking glass and Telia looking glass claim that fre is Stockholm and las is Los Angeles.
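
To make the name-reading step concrete, here is a minimal Python sketch that pulls the site code out of each hostname in the traceroute above and looks it up in a small table. The table holds only the associations discussed in this article. The token positions per operator are a reading of these eight hostnames rather than documented naming rules, and the function names (parse_hops, site_code, locate) are made up for illustration. Codes not discussed above, such as the s in s-b3-link.telia.net, are deliberately left unmapped.

```python
import re

TRACEROUTE = """\
 7  se-lla.nordu.net (109.105.102.93)  0.574 ms
 8  se-fre.nordu.net (109.105.97.113)  13.105 ms
 9  s-b3-link.telia.net (213.248.97.17)  16.265 ms
10  s-bb4-link.telia.net (80.91.253.226)  13.546 ms
11  kbn-bb3-link.telia.net (62.115.139.167)  22.249 ms
12  nyk-bb1-link.telia.net (62.115.141.99)  115.487 ms
13  ash-bb4-link.telia.net (213.155.133.9)  126.166 ms
14  las-b3-link.telia.net (213.155.137.59)  194.745 ms
"""

# (operator domain, site code) -> city, as interpreted in this article.
# Illustrative only: this is not an official registry of router names.
SITE_CODES = {
    ("nordu.net", "lla"): "Luleå",
    ("nordu.net", "fre"): "Stockholm",
    ("telia.net", "kbn"): "Copenhagen",
    ("telia.net", "nyk"): "New York City",
    ("telia.net", "ash"): "Ashburn, Virginia",
    ("telia.net", "las"): "Los Angeles",
}

# hop number, hostname, IP address, round-trip time in ms
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\(([\d.]+)\)\s+([\d.]+)\s+ms")


def parse_hops(text):
    """Return (hop, hostname, ip, rtt_ms) tuples for each matching line."""
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if m:
            hops.append((int(m.group(1)), m.group(2), m.group(3), float(m.group(4))))
    return hops


def site_code(hostname):
    """Guess (operator domain, site code) from a router hostname.

    Assumption: NORDUnet names look like <country>-<site>, and Telia names
    look like <site>-<role>-link. Other operators are not handled.
    """
    labels = hostname.split(".")
    domain = ".".join(labels[-2:])
    tokens = labels[0].split("-")
    if domain == "nordu.net" and len(tokens) >= 2:
        return domain, tokens[1]
    if domain == "telia.net":
        return domain, tokens[0]
    return domain, None


def locate(hostname):
    """Return the city for a hostname, or None if the code is not in the table."""
    return SITE_CODES.get(site_code(hostname))


for hop, host, ip, rtt in parse_hops(TRACEROUTE):
    print(hop, host, locate(host) or "unknown")
```

The unknown rows are the point: the lookup only works for names someone has already interpreted by hand.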

Geolocation Strategies

Using looking glass servers is not a practical solution for automated discovery of geolocation. So let's dig a little deeper into the ways we can trace the geography of our network traffic:

  1. Reading router names as described above is easy. But as mentioned, it requires human intuition, and doesn't work when network operators don't tag their routers with geographic labels. For more background on interpreting traceroutes, see Richard Steenbergen's NANOG 62 presentation.
  2. Measuring ping latencies to triangulate a router's location may be possible. But latency is determined by many factors, and can rarely give precise results, as shown in a recent research paper.
  3. Alternatively, IP addresses can simply be looked up in public registries (for example, Regional Internet Registries or BGP route servers). This is tempting to automate, but very error-prone, as registries typically list the location of the organization that registers the address, rather than the location of the network equipment. It appears to be the method used by the tool linked from Hacker News. The idea is simple: Telia is a European company, or more precisely, the IP addresses used in that particular route are registered via Europe's RIPE registry. One could argue that Switzerland is roughly the geographic center of Europe, so Switzerland is used as a placeholder for all of Telia's traffic, even traffic in the USA, simply because the routers belong to a network registered in Europe.
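
The latency idea in item 2 can be illustrated with numbers from the traceroute above. This is a rough sketch under two assumptions that are not in the original data: light in fiber covers roughly 200 km per millisecond (about two thirds of its speed in vacuum), and the Copenhagen and New York coordinates below are approximate city-center values chosen for illustration. The RTT difference between hops 11 and 12 gives a loose upper bound on the distance between them, and the haversine formula gives the great-circle distance for comparison.

```python
import math

FIBER_KM_PER_MS = 200.0   # assumption: common rule of thumb, ~2/3 of c
EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def max_distance_km(rtt_near_ms, rtt_far_ms):
    """Loose upper bound on the one-way distance between two hops.

    Half of the RTT difference is treated as one-way propagation time.
    A negative difference (which happens in real traceroutes) yields 0.
    """
    one_way_ms = (rtt_far_ms - rtt_near_ms) / 2.0
    return max(one_way_ms, 0.0) * FIBER_KM_PER_MS


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# RTTs for hop 11 (kbn) and hop 12 (nyk) from the traceroute above.
bound = max_distance_km(22.249, 115.487)

# Approximate coordinates for Copenhagen and New York (assumed values).
actual = great_circle_km(55.68, 12.57, 40.71, -74.01)

print(f"latency bound: {bound:.0f} km, great circle: {actual:.0f} km")
```

The bound comes out around 9,300 km against a great-circle distance of roughly 6,200 km. That is consistent with a transatlantic hop, but far too loose to pin down a city. Per-hop RTTs also include router processing time and possibly asymmetric return paths, so even this bound should be read as approximate.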

Imperfect Registries

Despite being the least accurate of the above three methods for geographic purposes, public registry information can still be useful for understanding network ownership and relationships. Hurricane Electric (HE), a large global network operator, runs bgp.he.net, a very useful tool for looking up network information using HE's BGP tables. For more automated tooling, the Prefix Whois Project runs a WHOIS server that returns information about IP addresses based on its BGP tables. Just be aware of the limits of those databases when using them, particularly with respect to geolocation data.
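
For the automated route, a WHOIS lookup is just a short TCP exchange: connect to port 43, send the query followed by CRLF, and read until the server closes the connection (RFC 3912). Below is a minimal sketch using only Python's standard library. The server name whois.pwhois.org is my assumption for the Prefix Whois Project's public server, the "Key: value" response layout is also an assumption, and the sample response at the bottom is made up (it uses documentation-only addresses and AS numbers) purely to show the parsing step.

```python
import socket


def whois_query(query, server="whois.pwhois.org", port=43, timeout=10.0):
    """Send one WHOIS query (RFC 3912) and return the raw text response.

    The default server name is an assumption; substitute any WHOIS server.
    """
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(query.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection: response complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_fields(response):
    """Parse 'Key: value' lines into a dict, keeping the first value per key.

    Assumes the server answers in 'Key: value' lines; other lines are skipped.
    """
    fields = {}
    for line in response.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields.setdefault(key.strip(), value.strip())
    return fields


# Made-up sample response (hypothetical field names and values).
SAMPLE = "IP: 192.0.2.1\nOrigin-AS: 64500\nPrefix: 192.0.2.0/24\n"
print(parse_fields(SAMPLE))

# A live lookup would look like this (requires network access):
#   print(parse_fields(whois_query("213.155.133.9")))
```

Whatever location fields such a server returns inherit the caveat above: they describe the registration, not necessarily where the router sits.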

Unfortunately, misinterpretation of geographic granularity can lead to unintended consequences. Another popular IP geolocation service, MaxMind, uses the center of the USA to represent locations belonging to American carriers that it can't identify, and the center is in Kansas (in other words, Kansas is America's Switzerland, but without the Alps). The unlucky Kansas residents of the latitude/longitude given by MaxMind as a placeholder for the USA received so many threats from misguided vigilantes that they sued MaxMind in 2016, and MaxMind subsequently moved their lat/long pair to a nearby lake.

Invisible Hops

All of the above discussion assumes knowledge of the IP addresses of routers that carry our traffic. Venerable traceroute is the most commonly used tool to gather these addresses, but it can only show routes in one direction: from source to destination. Packets run in a circuit, so the response from destination back to source is also relevant. The only way to determine that reverse path is to use a looking glass server. The NANOG presentation linked above has more notes on reverse path difficulties.

To take that difficulty a step further, sometimes both forward and reverse path IP addresses are hidden because of a technology called Multiprotocol Label Switching (MPLS), which many carriers use for long-haul traffic. It severely complicates traceroute interpretations.

Unstable Routes

Modern traffic-routing techniques introduce ambiguity into geolocation analysis. For example, "virtual IPs" are assigned to different machines at different times. Even more elusive are Anycast IPs that simultaneously belong to multiple machines in multiple locations. The use of Anycast has risen as CDNs such as Cloudflare become more popular, and large websites such as Google run their own Anycast networks, so that IP addresses can no longer be thought of as the addresses of machines, but instead as abstract identifiers that indicate which global network our traffic should be delivered to.

Routes can change even without Anycast. Network operators make and break relationships between one another, and occasionally misconfigure their routers, causing traffic to take different paths at different times. In extreme cases, BGP hijacking (either accidental or malicious) can cause highly suboptimal routes to be chosen, like recent incidents that sent US- and EU-bound traffic through Russia first. In other words: running traceroute today won't reveal where our traffic will travel tomorrow.

Conclusion

Finally, even if we are certain about the location of the routers used to send and receive our data, we haven't addressed the question of which cables are used to carry the data between routers. For undersea routes, TeleGeography maintains a Submarine Cable Map, and for terrestrial fiber routes, individual carriers often provide maps of their own circuits ... but making use of that information requires a lot of manual sleuthing.

Understanding the geographic path that data takes is fraught with peril. The above techniques are usually sufficient to give a rough sense of where the packets flow, but more important than knowing how to use those techniques is knowing when and how they may deceive us. Happy tracerouting!