Internet History: Next Steps
Thanks to everyone who responded to my earlier post about the need to preserve and curate the history of Internet measurement for future historians.
Based on the interesting conversations so far, I thought I would go a step further and collect some thoughts on how this might actually be made actionable, in the form of something that could evolve into an Internet History Initiative. You can now:
- register your interest as a potential supporter or contributor: Google Form
- visit the status page: https://internethistoryinitiative.org
- follow on Mastodon: https://cooperate.social/@IHI
Our goal is to collectively figure out how to index and curate the history of bulk Internet measurement datasets, preserve them against loss, and interpret their collective legacy for future generations of historians.
Three common responses to my first piece were:
Isn’t this already the job of (existing repository)? Let’s just support them.
Yes, the first job for an IHI (and for everyone in the Internet measurement community) should be supporting and celebrating the projects that have preserved this data since the day they collected it.
Routeviews and RIPE, in particular, need a lot of love from their respective communities to keep doing what they do, and the job of an IHI would be to make sure everyone understands the value of the irreplaceable datasets they maintain for all of us to study.
That would be really expensive; who’s going to pay for it?
(“That” being the costs of making really large datasets more persistent and usable.)
Ask any librarian: persistence is expensive, and it requires continuous investment.
A core goal of the IHI would be advocacy for measurement — making sure the originating institutions receive sufficient collection management resources to cement and extend their historical legacy. On top of that, we’d need to find enough resources to improve survivability, availability, and integrated access across multiple collections, beyond any single project.
I’ve heard from many of you already, offering to point the way toward resources to support these goals, so thank you. Let me circle back after we’ve refined the ideas a bit in collective conversation.
Don’t forget to include (this other fantastic, large dataset)!
Ah, here are my people. We haven’t even started yet and there’s more to do! It’s a good reminder that whatever we build together needs to scale to include lots of measurement products, not just the historical BGP and traceroute datasets. We need to capture the emergence of things like the RPKI, which have changed the way the Internet operates. And we need the ‘meat on the bones’ that will help historians interpret what hosts were doing with all those IP addresses, and where those hosts were located throughout history.
Let’s take a quick look at the potential order of operations for getting an Internet History Initiative off the ground, starting with the challenges I mentioned in my previous post.
Challenge #1: Preservation
Hold onto this one for a moment, and we’ll come back to it. You’ll see why.
Challenge #2: Curation
A reasonably small number of older organizations currently have large amounts of original Internet measurement data on hand. They make it available at old, well-known URLs that tend not to change for many years, so those well-known URLs are the most persistent identifiers available. The content is too large to trivially mirror, and since the data are used by a relatively small research community, there often aren’t many (or any) mirror sites available.
There’s also very little metadata available, separate from the data artifacts themselves; generally, the URL encodings capture the most important aspects, which are things like collector locations and the time range (yyyymmdd.hh), through some combination of yyyy/mm/dd paths and filename encodings.
To pick an example at random, here’s a set of BGP updates from the route-views2.oregon-ix.net BGP collector, generated 22 years ago, with 15 minutes of data starting at 23:15 PST on 11 January 2002.
At Routeviews, the URL encodes collection time in the local timezone of the collector (here, Pacific Standard Time, UTC-8, since Routeviews didn’t change to UTC for most of its collectors until 2003). That means that the data encoded in the file cover (approximately) the 15-minute UTC interval starting at 07:15 on 12 January 2002.
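To make that conversion concrete, here’s a minimal sketch in Python of the local-to-UTC translation an automated indexer would need to apply to pre-2003 Routeviews filename timestamps (the yyyymmdd.hhmm encoding is as described above; the function name is my own):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

def routeviews_stamp_to_utc(stamp: str, tz: str = "America/Los_Angeles") -> datetime:
    """Convert a pre-2003 Routeviews filename timestamp (yyyymmdd.hhmm,
    expressed in the collector's local timezone) to a UTC datetime."""
    local = datetime.strptime(stamp, "%Y%m%d.%H%M").replace(tzinfo=ZoneInfo(tz))
    return local.astimezone(timezone.utc)

print(routeviews_stamp_to_utc("20020111.2315"))
# 2002-01-12 07:15:00+00:00
```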
Here’s a roughly contemporaneous set of observations from the original rrc00 route collector within RIPE RIS.
In this case, if someone wanted to study (for example) an 8 hour window of Internet history from all available sources, starting at midnight UTC on 12 January 2002, they’d have to do some curatorial work up front, before any data is retrieved:
- identify the available products (here, BGP RIBs and update dumps)
- track them down to their originating institutions (here, RIS and Routeviews)
- identify all of the subproducts within each institution (e.g., the BGP collectors that were live at the time)
- synthesize sets of institution-specific URLs that correspond to the study window
- identify tools to uncompress and unpack the data formats represented by all those files
This is work that can’t be avoided, but if we’re smart, we can do it once, automate it, and let future researchers skip these steps and get right to work on the data.
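As a sketch of what “doing it once” could look like, here is a small generator that enumerates candidate per-collector URLs for a study window. The hostnames and URL templates are placeholders of my own, not the real archive layouts, which differ by institution and by era and would live in curated metadata:

```python
from datetime import datetime, timedelta

# Placeholder URL templates; the real archive layouts differ per
# institution, per collector, and per era.
TEMPLATES = {
    "routeviews/route-views2":
        "https://archive.example.net/bgpdata/{t:%Y.%m}/UPDATES/updates.{t:%Y%m%d.%H%M}.bz2",
    "ris/rrc00":
        "https://data.example.net/rrc00/{t:%Y.%m}/updates.{t:%Y%m%d.%H%M}.gz",
}

def candidate_urls(start_utc: datetime, hours: int, step_minutes: int = 15):
    """Enumerate institution-specific URLs whose encoded timestamps fall
    inside the study window (ignoring, for brevity, the local-time
    filename quirk handled in the previous sketch)."""
    end = start_utc + timedelta(hours=hours)
    t = start_utc
    while t < end:
        for collector, template in TEMPLATES.items():
            yield collector, template.format(t=t)
        t += timedelta(minutes=step_minutes)

for collector, url in candidate_urls(datetime(2002, 1, 12), hours=8):
    print(collector, url)
```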
Persistent digital identifiers for measurement artifacts
One stepping stone to making this work automated and repeatable (and, over time, making the underlying artifacts more available) would be the introduction of persistent digital identifiers for each of our historical artifacts. There are many competing identifier standards, but for a variety of reasons, I think Archival Resource Keys (ARKs) are a reasonable fit.
“All problems in computer science can be solved by another level of indirection, except for the problem of too many layers of indirection.” –David Wheeler, by way of Butler Lampson
The basic idea behind an ARK is to put a layer of indirection between the retrieval resource (the old familiar URL that you’d use to grab the gigabytes) and its ‘permanent identity’ within our global collection of Internet measurement datasets. That opens the door to associating metadata with each accessioned artifact to support efficient search and filtering, and to pointing the user toward different homes for the same historical artifact, depending on their location (for example, the original source, or a cached replica preserved somewhere else).
There’s nothing particularly magic about an ARK, or indeed about any form of digital identifier. At best, it represents an institutional promise to invest the energy to maintain the persistent connection between an identifier and the underlying artifact, and to keep the associated metadata current and correct.
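To illustrate (not prescribe) what an accession record might hold, here is a sketch of minimal metadata attached to one ARK. The field names are my own, and the example ARK uses a placeholder NAAN; a real IHI would register its own naming authority and conventions:

```python
from dataclasses import dataclass, field

@dataclass
class Accession:
    """Illustrative metadata for one accessioned measurement artifact."""
    ark: str            # e.g. "ark:/12345/rv2.updates.20020112T0715Z" (placeholder NAAN)
    institution: str    # originating collection, e.g. "Routeviews" or "RIPE RIS"
    collector: str      # e.g. "route-views2" or "rrc00"
    product: str        # e.g. "bgp-updates" or "bgp-rib"
    start_utc: str      # ISO 8601 start of the covered interval (with UTC offset)
    duration_s: int     # length of the covered interval, in seconds
    format: str         # e.g. "MRT"
    compression: str    # e.g. "bz2"
    md5: str            # checksum of the reference instance
    locations: list = field(default_factory=list)  # original URL plus any mirror URLs
```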
Resolving digital identifiers to underlying digital resources
The magic of ARKs is in the resolver step, and for the IHI we might construct a time-aware resolver library, to make the common case easy: generating and retrieving a stream of artifacts that represent all of the available BGP updates collected during a given time window, across all collector locations, across all contributing institutions.
A good resolver would probably also support suffix-based content negotiation; for example, a researcher might prefer to retrieve a BGPDump text file generated from the original binary MRT format, if both were available as alternative resolutions of a single ARK. That single ARK maintains the connection between the original and alternative formats of “the same” measurement set, giving researchers confidence that the underlying data remain intact and unchanged.
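Here is a rough sketch of that resolver’s core, assuming accession records shaped like the dataclass above; the window query and the suffix convention are illustrative only:

```python
from datetime import datetime, timedelta

def arks_for_window(accessions, start_utc: datetime, end_utc: datetime,
                    product: str = "bgp-updates"):
    """Yield the ARKs of all artifacts of the given product that overlap
    the [start_utc, end_utc) window, across all collectors and institutions.
    Assumes the window bounds and recorded timestamps are timezone-aware."""
    for a in accessions:
        a_start = datetime.fromisoformat(a.start_utc)
        a_end = a_start + timedelta(seconds=a.duration_s)
        if a.product == product and a_start < end_utc and a_end > start_utc:
            yield a.ark

def resolve(accessions, ark: str, suffix: str = ""):
    """Resolve an ARK to its retrieval URLs; an optional suffix expresses a
    format preference (e.g. a bgpdump text rendering instead of raw MRT)."""
    for a in accessions:
        if a.ark == ark:
            return [u for u in a.locations if u.endswith(suffix)] if suffix else a.locations
    raise KeyError(f"unknown ARK: {ark}")
```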
Decentralizing the resolution step (back to Challenge #1: Preservation)
So the first step for an Internet History Initiative would be to formulate the namespace for the standard sorts of archival products that are out there, and then work with each hosting institution to create ARKs and associated metadata for the objects in their collection, while pulling a reference instance of each artifact in order to create an offsite copy, along with its verifying metadata (data format, MD5 checksum).
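A sketch of that pull-and-verify step, assuming nothing more than a retrievable URL (a real pipeline would also write the bytes to offsite storage rather than discarding them):

```python
import hashlib
import urllib.request

def verify_reference_instance(url: str, chunk_size: int = 1 << 20) -> dict:
    """Stream one reference instance of an artifact and compute the
    verifying metadata (size and MD5) to record alongside its ARK."""
    md5 = hashlib.md5()
    size = 0
    with urllib.request.urlopen(url) as response:
        while chunk := response.read(chunk_size):
            md5.update(chunk)
            size += len(chunk)
    return {"url": url, "bytes": size, "md5": md5.hexdigest()}
```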
Without pulling down the reference instance, archivists might see this as violating an important rule: only create persistent identifiers for things whose persistence you are responsible for (a guiding principle sometimes stated as “curate your own stuff, not other people’s stuff”).
But one of the overriding goals of getting all of these already-public collections into one curatorial namespace will be to let the community do the curatorial work, and avoid imposing extra work onto the individual collections. Another will be to set up a path to improving the persistence of the collections through replication.
As already mentioned, for example, we can add content-distribution functionality to the ARK resolution layer: based on preference or user location, allow a single ARK to resolve to any of a number of URLs, each guaranteed to return an identical copy of the same historical digital artifact, thus opening the door to higher availability.
Finally, we’d probably want to decentralize and replicate the resolver layer itself, to make sure that the death of a single Internet History project wouldn’t somehow spell the end of Internet History. The map of ARKs to artifact URLs should be something that anyone can pull from a public repository and rebuild for themselves. We could publish updates to the accession map over ActivityPub. We could push the resolver map into IPFS. Perhaps we could federate the storage of the artifacts themselves to improve their multisite persistence. The topic needs some energetic exploration.
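As one possible shape for that, here is a sketch of rebuilding a local resolver from a published accession map. The URL and the JSON layout are placeholders; the same idea carries over to an ActivityPub feed of updates or an IPFS-pinned snapshot:

```python
import json
import urllib.request

# Placeholder location for the published ARK-to-locations map; the point is
# only that anyone can fetch it and stand up an independent resolver.
MAP_URL = "https://example.org/ihi/accession-map.json"

def rebuild_local_resolver(map_url: str = MAP_URL) -> dict:
    """Fetch the published accession map and index it as {ark: [urls]},
    so resolution survives the loss of any single hosted resolver."""
    with urllib.request.urlopen(map_url) as response:
        records = json.load(response)
    return {record["ark"]: record["locations"] for record in records}
```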
Challenge #3: Interpretation
So, returning to the original motivation: say that we want to study an 8 hour period of Internet history beginning at midnight UTC on 12 January 2002. We’d want our system to generate ARKs representing all of the known historical data products that cover that window, of all different flavors. The associated metadata would then let us select tools (or spin up servers with the right toolchains in place) to process those datasets.
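To illustrate the metadata-to-toolchain step, here is a small sketch that maps an artifact’s recorded format and compression onto an unpack/parse command; the table entries are examples, not a registry:

```python
# Example mapping from (format, compression) metadata to shell commands;
# the entries are illustrative, not a complete or authoritative registry.
TOOLCHAINS = {
    ("MRT", "bz2"): "bgpdump -m {path}",    # libbgpdump reads compressed MRT directly
    ("bgpdump-text", "gz"): "zcat {path}",  # already-converted text just needs uncompressing
}

def command_for(metadata: dict, path: str) -> str:
    """Choose an unpack/parse command from an artifact's format metadata."""
    key = (metadata["format"], metadata["compression"])
    return TOOLCHAINS[key].format(path=path)

print(command_for({"format": "MRT", "compression": "bz2"},
                  "updates.20020112.0000.bz2"))
```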
If the data in question have been mirrored, there may be many independent but identical archival copies of what we need, letting us spread demand across all the available copies, or pick a mirror that’s ‘close’ to where we’ll be processing the data.
Researchers will then run their code over the artifacts to generate their own derived data products. Eventually, if they are useful and stable, those data products will in turn need to find a home. Perhaps the original institutions provide some space for that as well; perhaps these products live on in the researchers’ own repositories (along with their local copies of the data they pulled to do the reduction), all accessioned with their own ARKs, all made discoverable and replicable by other researchers within the IHI. The metadata of derived copies can include the toolchain information you’d need to replicate the reduction process yourself, as well as the ARKs the reduction was sourced from. Warm fuzzy feelings all around.
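As a sketch of the provenance a derived product might carry (field names and ARKs are placeholders of my own choosing, not a standard):

```python
# Illustrative provenance metadata for a derived data product; the ARKs use
# a placeholder NAAN and the field names are invented for this example.
derived_product = {
    "ark": "ark:/12345/ihi.derived.as-graph.20020112",
    "description": "AS-level connectivity graph, 2002-01-12 00:00-08:00 UTC",
    "source_arks": [
        "ark:/12345/rv2.updates.20020112T0000Z",
        "ark:/12345/rrc00.updates.20020112T0000Z",
    ],
    "toolchain": {"bgpdump": "<version>", "reduce.py": "<git commit>"},
    "md5": "<checksum of the derived file>",
}
```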
In a perfect world, the repositories for the data would also offer compute services very close to the data, to make working with it easier. In some scenarios, it may actually be cheaper for institutions to allow researchers to run their extraction/reduction functions close to the data, than to pay the bandwidth costs to push copies of the data to researchers over the Internet. There’s room for innovation here. But then we’re back to “who pays.”
Conclusion: Next Steps
At this point, people from various CDNs and cloud platforms will be lining up to fill my inbox with helpful outrage. There’s nothing new under the sun when it comes to content distribution strategies, high-availability storage, or cloud computing. Everything on the wish list here could be accomplished by buying into one centralized platform or another, and they’d probably be glad to sponsor the effort.
It’s true, commercial CDN and cloud platform offerings are almost certain to be among the availability solutions that we draw on at some point to preserve these datasets. But at the end of the day, I think it would be unwise (and expensive, and not in the Internet spirit) to put any single one of them at the heart of the Internet History challenge from the start. (And yes, we keep circling back to Who Pays For It, don’t we?)
So there you have it. I’d like to invite the community to step up and share ideas for decentralized or federated approaches to the curation and high availability of large scientific datasets. I’d really like to find a distributed, self-hosted solution that doesn’t make any single person or institution the Single Point of Failure for preserving the critical datasets that make up the Internet’s history.