Thinking About Internet History

I could have happily been a librarian.

I spent the glorious summer of 1988 as an intern in my university’s library system, learning what it was like to work in the various departments. I got to sample it all: the patient craft of the book restorers, the voices of the dead in Manuscripts and Archives, the exotic tastes of the special collections managers, the ever-present fear of halon suffocation in the sealed glass cages of the rare book collection.

But what I remember most was the impossible challenge of the acquisitions department. In this pre-Internet era, a major research library might aspire to acquire a percent or two of the world’s printed output each year, using paper and pencils, phone calls and fax machines.

Figuring out what to acquire, and how to integrate it into the collection, was a largely manual process that relied on the personal global networks of dozens of professional librarians, driven by the bibliographic hunger of thousands of affiliated academics. And this was just one university library out of thousands in the world, all working toward that same goal, which would outlive any of the people who worked on it: to sample the most important parts of the world’s recorded knowledge, index and organize it, and make it available to researchers, forever.

In short, a research librarian’s task was (and is) both gloriously inspiring, and hilariously impossible. If you stopped to think about it for even a moment, you’d give up and choose another profession that paid better. The people who successfully dedicated long careers to the library system were therefore very special people.

An Argument for Internet History

I thought about these librarians recently as I was contemplating what should be a much easier challenge: piecing together the recorded history of the Internet as a resource for future historians.

The Internet has now grown to something like a billion connected hosts, and is inextricably woven together with politics, economics, public health, education – in short, civilization. An artifact that valuable must continually fend off capture by moneyed and political interests. To avoid accidentally breaking what made it great, we need to understand and communicate about that greatness using more than anecdotes – we’re going to need to build an evidence-based case for preserving the Internet.

Historians who study our times will desperately want to quantify the effects the Internet had on all aspects of society, in all the places on earth, as the network of networks came ashore and became central to every aspect of everyone’s life. But what primary sources will they use to understand the Internet’s expansion during our lifetimes?

Three Steps

If we want to make sure the Internet’s story is preserved for future scholars in a quantifiable way, and pull together the data to defend it against irreversible damage, we basically have three big collective tasks to undertake before we all forget how it worked:

  • Preserve the history, by gathering the irreplaceable records of how the Internet grew

  • Curate the history to interpret it and make it accessible and meaningful for future scholars

  • Explore the history, creating tools and visualizations that everyone can enjoy and celebrate

What follows is a brief overview of what I think that process might entail, mostly in the form of notes to myself to help me figure out what I should be working on in 2024. If any of this strikes a chord with you, drop me a note and I’ll keep you in the loop.

Step 1: Preservation

Today’s Internet consists of about a billion communicating hosts (things that have their own IP address), arranged into about a million routed networks (groups of IP addresses with common reachability), collectively managed by voluntary interconnection among about a hundred thousand autonomous systems (organizations that are responsible for their own Internet routing policy).

Fortunately, the Internet is somewhat self-documenting, because it can’t help talking about itself constantly. BGP (Border Gateway Protocol) encourages all of those autonomous systems to continuously whisper and compare notes with each other about the best ways packets should traverse the global network to reach their destinations.

A timestamped recording of this “BGP whispering” provides what human historians always hope to find when they visit Manuscripts and Archives: a contemporaneous record of exactly what the participants in history were experiencing, in their own voice, at the moment when it happened.

That means that if you play a set of BGP update streams back today, you can reconstruct a version of the “state of the Internet” as it appeared at any second in history – at least, to a level of fidelity constrained by the number of independent BGP perspectives you managed to record from around the world on that day. At least in terms of reconstructing the interprovider relationships, which document how all the IP addresses are connected to each other administratively, this job is, dare I say, moderately straightforward.
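To make that replay idea concrete, here’s a minimal sketch in Python. The record layout is invented for illustration (the real archives store MRT files that you’d parse first, and you’d normally start from a collector’s periodic RIB dump rather than from time zero), but the core bookkeeping is just this:

```python
from collections import namedtuple

# Illustrative update record. In practice these would be parsed from the MRT
# archives published by collectors such as RIPE RIS or Oregon Routeviews.
Update = namedtuple("Update", ["timestamp", "peer", "prefix", "kind", "as_path"])
# kind is "A" (announce) or "W" (withdraw); as_path is a tuple of ASNs.

def rib_at(updates, when):
    """Replay a stream of BGP updates in time order and return, for each
    (peer, prefix), the AS path that peer was advertising as of `when`."""
    rib = {}  # (peer, prefix) -> as_path
    for u in sorted(updates, key=lambda u: u.timestamp):
        if u.timestamp > when:
            break
        if u.kind == "A":
            rib[(u.peer, u.prefix)] = u.as_path
        elif u.kind == "W":
            rib.pop((u.peer, u.prefix), None)
    return rib

# Reconstruct what one (hypothetical) peer saw at two moments in time.
updates = [
    Update(1000, "peer1", "192.0.2.0/24", "A", (64496, 64511)),
    Update(2000, "peer1", "192.0.2.0/24", "W", ()),
    Update(3000, "peer1", "192.0.2.0/24", "A", (64496, 64499, 64511)),
]
print(rib_at(updates, 2500))  # {} -- the prefix was withdrawn at t=2000
print(rib_at(updates, 3500))  # the prefix is back, via a longer AS path
```

Everything beyond this – merging hundreds of peers, handling collector restarts, scaling to decades of updates – is engineering detail, but this is the sense in which the preserved streams are replayable.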

Some Good News and Bad News about BGP Preservation

The good news is that there are at least two major surviving repositories of historical BGP data that we can combine to get the best understanding of Internet history: the Routing Information Service (maintained by the RIPE NCC in Amsterdam, starting in 1999) and the Oregon Routeviews project (from 2001, with some data back to 1997). But if anything were to happen to these projects, due to disaster or financial constraints, their data would be literally irreplaceable.

The existential threat to our common history is not hypothetical. Renesys, the company I cofounded in 2000, once managed a third significant repository with more than a decade of historical BGP data, including hundreds of BGP peers hand-selected to offer complementary perspectives to the RIPE and Oregon peer sets. Sadly, those datasets did not survive multiple corporate acquisitions and are now believed to have been lost.

In 2024, I’d like to find additional homes for the repos of BGP data that survive, for the sake of preservation. In the wise words of the librarians, Lots Of Copies Keep Stuff Safe.

What Else Should We Preserve?

Having preserved the basic shape of interdomain routing, there are plenty of other historical Internet datasets that we’d like to have in order to put flesh on the bones, particularly when it comes to interpreting what all those billion hosts were actually doing within society as the years ticked by.

Every time someone on the Internet records a measurement (resolves a DNS domain to an IP address, runs a ping to see whether a given host is alive, runs a traceroute to see what path the packets are taking to reach a given host, retrieves a web page to see how long content takes to arrive), they’ve performed a natural experiment that can never be run again.

Several foundational repositories of active measurements exist, including CAIDA’s Ark project (since 2007), RIPE’s Atlas (since 2010), and the M-Lab trace set (since 2013). There may be earlier sets of measurements, collected by the public or by individual network operators, that could be used to push the horizon of active performance measurement back before 2007. If you have any of these dusty tarballs of data lurking in your backup tapes, please consider their preservation – you almost certainly have unique observations of how packets actually crossed the historical Internet over time.

Besides active measurements, of course, we’ll also need to preserve the records of registry data — who each of these network resources was assigned to on each day in history, from regional registries like ARIN, the RIPE NCC, and APNIC — as well as anything we can find about the DNS names that were associated with each IP address on a given day. Together, these are the clues to what all these Internet hosts were up to, and to where on earth they were likely located.

Reconstructing the Internet as a Point-in-Time Database

Finally, all this DNS and registry data is strongly ephemeral, meaning that it can change from day to day without warning. That makes it imperative that we keep track of the time of each of our ephemeral observations, if we want to later build credible metrics for things like the density of Internet hosts within a given region.

Recall that in the 2010s, IPv4 exhaustion triggered waves of sales and international reassignments of network address blocks, so that (for example) a block of network addresses that had been hosting DSL customers in Romania might vanish from the Internet for a while, and then reappear serving web pages in a datacenter in Saudi Arabia. Internet geography changes quickly, so we don’t just need a geolocation map of all the IP addresses, and some sense of what each IP address was being used for. We also need to know what that map looked like on each day in history over decades, as the hosts and resources associated with each IP address moved around and changed their functions.
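One way to keep that time dimension honest is to treat every registry or DNS observation as a dated fact, and always ask questions “as of” a date, rather than maintaining a single current map. Here’s a minimal sketch of that idea; the class name and record fields are mine, purely for illustration:

```python
import bisect
from collections import defaultdict

class PointInTimeMap:
    """Answer 'what did we believe about this prefix on this date?' from a
    pile of dated, ephemeral observations (registry assignments, DNS names,
    rough geolocations). The record contents here are illustrative."""

    def __init__(self):
        # prefix -> parallel, date-sorted lists of observation dates and records
        self._dates = defaultdict(list)
        self._records = defaultdict(list)

    def observe(self, prefix, date, record):
        """Insert one dated observation, keeping the history sorted by date."""
        i = bisect.bisect_right(self._dates[prefix], date)
        self._dates[prefix].insert(i, date)
        self._records[prefix].insert(i, record)

    def as_of(self, prefix, date):
        """Return the most recent observation at or before `date`, or None."""
        i = bisect.bisect_right(self._dates[prefix], date)
        return self._records[prefix][i - 1] if i else None

# A toy version of the Romania-to-Saudi-Arabia example above:
m = PointInTimeMap()
m.observe("203.0.113.0/24", "2012-03-01", {"org": "DSL ISP", "country": "RO"})
m.observe("203.0.113.0/24", "2016-09-01", {"org": "hosting co.", "country": "SA"})
print(m.as_of("203.0.113.0/24", "2014-01-01"))  # the Romanian DSL era
print(m.as_of("203.0.113.0/24", "2020-01-01"))  # the Saudi datacenter era
```

The important design choice is that nothing is ever overwritten: a later observation never replaces an earlier one, it just opens a new interval in the prefix’s history.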

Are you disheartened yet? No? Are you excited about scouring the world to find and fit together all the pieces of this enormous space-and-time puzzle? Excellent, you’re showing signs of “librarian spirit.” Let’s keep going!

Step 2: Expose the Narrative

Once we’ve succeeded in preserving all of our endangered digital datasets, we can get down to the business of curation and interpretation. Most Internet measurement research has focused on the operational questions of the here and now: monitoring for slowdowns and shutdowns within and between providers, and figuring out how the Internet routes traffic around that damage. Questions of historical evolution have tended to be secondary.

I predict that we can find new ways to look at the Internet through historical eyes, to get past this “operations trap.” If we regard the Internet as just another complex process that was informing, and being informed by, everything else that was going on in society at the time, we’re going to need some stable metrics and metaphors that we can carry forward through time for a couple decades, as input to all the other models of what was going on simultaneously.

If we do it right, we might even be able to perform statistical tests for information transfer among our time series, and we might finally be able to answer some tantalizing social science questions: does the evolving structure of your Internet environment exert a quantifiable influence on the growth of your economy, or the probability of violence against civilians, or levels of voter participation, or secondary school graduation rates, or life expectancy?

Rethinking Measurement and Starting Over

Here’s the fun part: I don’t know with certainty what metrics-and-metaphors I would choose to extract from the historical data to characterize Internet structure, if I were starting over today with the raw stuff. Everything we’ve done to date was either operationally oriented (“the Internet is broken! now it’s fixed!”) or focused on geopolitically reductive metrics that don’t really describe how the Internet works (“the Russian Internet grew by 12% in the last decade!”).

I am personally to blame for promulgating many of the latter sorts of statistics. Over and over and over in my career, I’ve made presentations that purported to compare one national Internet with another, to see who was “growing faster,” and who was “lagging behind.” We did this in part to exhort slow-growing, low-diversity parts of the Internet to grow faster, and it’s true that national regulatory environments (and the central role of national providers in many countries) do induce some parts of the Internet to behave in ways that are country-specific. But I hope that for future historians’ sake, we can find better ways to preserve geographic intuition without falling into the cognitive trap of somehow regarding national Internet footprints as just another sovereign border to be defended.

Documenting Historical “Slices” of Internet Activity

My guess about where this is going is that historians will show up with questions about what I call workload slices of the Internet: what was it like in a given time and place for a specific set of users having a specific client-server experience on the Internet? This is similar in spirit to the way, for example, the OONI team categorizes Internet impairment according to the category of website or communications protocol that becomes unreachable to users in a particular part of the Internet, due to someone’s censorship.

Again, we return to the idea of careful curation of all the ‘flesh on the bones’ of interdomain routing: the idea that we will have preserved a lot of the clues about what all those Internet hosts were being used for on a particular day. Registry data gives us a first clue about the organizations to which resources were assigned; DNS data gives added clues about what roles the individual hosts were performing, how networks were organized functionally, and perhaps even where the hosts were physically located. We’ll have lots of contemporaneous evidence from people who made, or retained, maps of their organizations’ networks, or who crawled the content available at the time. It may be possible to reconstruct the functional host-level Internet footprints of particular companies, or regions, or industries, at specific points in time.

Some of these “workload slices” will be very specific in time and place for people who want to understand the Internet connectivity coincident with historical events. What was it like for Chinese academic users in 2009 to use Google search? What was it like for mobile users in Cairo who tried to get to Wikipedia in 2011? What was it like for the financial sector in South America to connect to Bloomberg and Reuters throughout the 2000s? How diversely hosted were Ethereum nodes in 2020, or Mastodon servers in 2023, relative to Internet consumers around the world? Some of these slices are a bit on the nose – we may be able to map the embedding of hosts in the Internet, and visualize the interprovider connectivity that would have supported a given workload slice, but without a lot of related active measurement data from the period, detailed answers about very narrow slices of user experience may never be knowable.

This will be hard curation work, not research that is easy or (often) automatable. So we’ll need to let potential users of this data guide us to the worthwhile problems to study, and show them where to dig if they want to help excavate patterns of interest from the data we’ve managed to preserve.

Regional Internet Connectivity

To generate time series of wider interest from our historians’ perspective, perhaps we could open the lens a bit on these slices and consider broader trends in regional interprovider connectivity. In particular, we might construct random-workload slices that could usefully approximate actual regional experiences, while also being consistently computable over time given our available measurements.

For example, we might ask: what would typical paths look like for a randomly selected end-user in the Middle East, connecting to a randomly selected server hosted somewhere else within the same region? If this random selection were run over and over, the statistics about the set of available paths might converge to something approximating the actual consumer experience years ago (which we can no longer measure directly).

Between our random client and server, was there a good supply of relatively dense, direct, low-latency local connections through Internet Exchange Points, or would they be subjected to long, roundabout paths through exchange points in Western Europe or Singapore? We might begin by approximating these sorts of in-region and cross-region distributions of potential connectivity on each day throughout history, based on randomly sliced workload models. Then we could validate those models against the vastly smaller set of actual contemporaneous measurements that might exist from the period.
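Here’s a sketch of what that random sampling might look like, assuming we’ve already built helpers on top of the preserved data; the `path_lookup` function and the network lists are assumptions, not existing tools:

```python
import random
from statistics import mean

def sample_regional_slice(day, eyeball_nets, hosting_nets, path_lookup,
                          n=1000, seed=0):
    """Monte Carlo sketch of a random regional workload slice: pick a random
    in-region client network and server network, look up the AS path that the
    reconstructed routing data implies for that day, and summarize how often
    traffic would have stayed inside the region.

    `path_lookup(day, src, dst)` is an assumed helper returning a list of
    (asn, region) hops derived from the preserved BGP data."""
    random.seed(seed)
    lengths, stayed_local = [], 0
    for _ in range(n):
        client = random.choice(eyeball_nets)
        server = random.choice(hosting_nets)
        path = path_lookup(day, client, server)
        if not path:
            continue  # no reconstructed route for this pair on this day
        lengths.append(len(path))
        if all(region == "local" for _asn, region in path):
            stayed_local += 1
    resolved = len(lengths)
    return {
        "mean_path_length": mean(lengths) if lengths else None,
        "fraction_kept_in_region": stayed_local / resolved if resolved else None,
    }
```

Run daily over decades of reconstructed routing tables, a summary like this becomes exactly the kind of stable, consistently computable time series a historian could set alongside economic or social indicators.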

Regional Internet Stability

Once we have the daily snapshots of how the Internet generally works to connect each region with itself, and with other regions, we could start computing longer-term metrics representing things like the diversity and stability of the available distribution of paths for a given Internet workload slice.

In terms of the latency between two regions, or within a region, we could then ask: how many “stable modes” are there for daily Internet experience, looking back over the trailing twelve months? Does the Internet experience change frequently, or is it more or less stable? How often do new modes appear, based on the addition of new kinds of connectivity serving the region? These might be the sort of ‘process metrics’ that end up having predictive power outside the technical domain, rather than the simpler structural metrics. But I mistrust my intuition here, because I’m thinking like a technologist, rather than being guided by the more interesting questions that might be posed by researchers outside the Internet community, looking in.
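Purely to illustrate how crude a first cut at “stable modes” could be, here’s a sketch that buckets a trailing year of daily values of some slice metric and counts the buckets that persisted; the bucket width and persistence threshold are arbitrary knobs, not recommendations:

```python
from collections import Counter

def count_stable_modes(daily_values, bucket_width=5.0, min_days=14):
    """One crude way to ask 'how many stable modes did this region's daily
    experience have over the trailing window?': bucket each day's metric
    (say, median in-region latency in ms, taken from the reconstructed
    slices) and count buckets that persisted for at least `min_days` days."""
    buckets = Counter(round(v / bucket_width) for v in daily_values)
    return sum(1 for days in buckets.values() if days >= min_days)

# Example: ~8 months near 40 ms, then new local interconnection shifts the
# region to ~25 ms for the remaining months -> two stable modes.
year = [40.0] * 240 + [25.0] * 125
print(count_stable_modes(year))  # 2
```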

Remember, no single model, metric, or metaphor is going to work consistently over a period of decades to characterize the growth of the Internet. Workloads on today’s Internet are vastly different from workloads in 2000, as we’ve lived through everything from the runout of IPv4, to the growth of IPv6, the rise of cloud computing, the growth of content distribution networks at the edge, and the centrality of specific content megaproviders like Google and Facebook. How we summarize that change into eras, how we add new metrics and metaphors to our descriptive blend, or retire old ones as less meaningful – these are great long-term curation questions to engage in with historians. For now, we can afford to be agnostic.

Step 3: Explore and Celebrate the Internet’s History

This is the payoff for the librarian’s work. The reason we fight to preserve and curate the history of the Internet as a technological artifact is to help make the case for its preservation to a public that (to be fair) barely understands how the Internet works its magic. Today’s Internet works unbelievably well in no small part because of the specific conditions under which it grew and evolved, under multistakeholder governance rather than a multilateral treaty system, often valuing decentralized openness and innovation where a centralized set of authorities might have preferred to prioritize safety, predictability, and control.

Once we’ve preserved the history of the Internet, and we’ve enlisted thoughtful scientists who can help us quantify some of the Internet’s social benefits (net of social costs), we’ll need tools to help tell those stories. Visualizations mostly, perhaps immersive walkthroughs, certainly the kind of interactive exhibits that data journalists use to inform and entertain. Our investments in making these datasets available will open the door to a vastly larger collaboration with artists, journalists, and visual storytellers.

That’s as far as I’d like to speculate about the work that I’d like to get underway for the coming year. We can confidently predict that just as the Internet has changed society, society will surely continue to change the Internet, through some competing combination of top-down regulation and bottom-up innovation and popular demand.

For those who care about the Internet’s future, the race is now on to be better librarians of its history, so that we can preserve and tell the story of what made it great.


To stay in touch about these or other research ideas, find me on Mastodon or drop me a line.