18 October 2022

The Frisian web archived

At a seminar on the new round of internet extensions, I struck up a conversation with Kees Teszelszky, curator at the KB, the National Library of the Netherlands in The Hague. He was not the most obvious person to run into at such an event, so we got talking. Kees, as “curator of digital collections”, turns out to be more involved with domain extensions than most. More specifically: with .frl, the extension by and for Frisians.

This is because the KB has started a project to map and archive the Dutch web. Unlike “offline media” such as books and newspapers, information on websites is ephemeral and there is no obligation to archive it. As an increasing part of our lives takes place online, it is important to preserve these digital sources as well, for future research into our present day. Or, as Kees himself says: “What you and I publish online, that is the digital heritage of the future.”

The Frisian domain as a pilot

However, the entire Dutch web is extremely large. To gain experience and work out the best approach, the KB decided to start with a pilot on a part of the Dutch internet that is fairly clearly demarcated, both geographically and culturally: the Frisian domain. Here, the word “domain” should be read as “area” rather than “domain name”.

Initially, the aim is not to archive everything. The aim is a collection of scientific value, which need not be exhaustive. On the other hand, whatever is archived must be archived in full. So not a few pages of a website that are crawled again at some later, convenient moment (as, for instance, The Internet Archive does), but a complete crawl of the website, repeated at regular intervals.

How do you decide which websites are interesting? That starts with an overview of all Frisian websites. The existence of a dedicated internet extension, .frl, is invaluable here: the list of all registered .frl domains can simply be retrieved. We looked into whether Frisian domains under other extensions (e.g. .nl and .com) could also be included in an unambiguous way. However, this proved to be so complex that the pilot was narrowed down to .frl.

The list of websites to be archived was then compiled algorithmically. In addition, the public was asked to nominate domain names that, in their opinion, are indispensable in an archive of the Frisian web. The starting point is that these sites should give an insight into language, culture, geography or history.
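To make the selection step a little more concrete, here is a minimal sketch in Python of how a list of registered domains and a list of public nominations could be merged into one de-duplicated .frl seed list. It is an illustration only and not the KB's actual tooling: the function names, the normalisation rules and the example domains are all assumptions made for this sketch.

```python
from typing import Iterable, List, Optional
from urllib.parse import urlsplit


def normalize_domain(entry: str) -> Optional[str]:
    """Reduce a URL or bare host name to a lowercase host name (hypothetical helper)."""
    entry = entry.strip().lower()
    if not entry:
        return None
    # urlsplit only recognises the host part when a "//" separator is present.
    if "://" not in entry:
        entry = "//" + entry
    host = urlsplit(entry).hostname
    if not host:
        return None
    host = host.rstrip(".")  # drop a trailing root dot, e.g. "example.frl."
    if host.startswith("www."):
        host = host[4:]  # assumption: treat www.example.frl and example.frl as one site
    return host or None


def build_seed_list(
    registered: Iterable[str],
    nominated: Iterable[str],
    suffix: str = ".frl",
) -> List[str]:
    """Merge both sources, keep only hosts under the given extension, remove duplicates."""
    seeds = set()
    for entry in list(registered) + list(nominated):
        host = normalize_domain(entry)
        if host and host.endswith(suffix):
            seeds.add(host)
    return sorted(seeds)


if __name__ == "__main__":
    # Hypothetical input: these domain names are placeholders, not real archive entries.
    registered = ["Example.frl", "https://www.voorbeeld.frl/pagina"]
    nominated = ["example.frl.", "example.nl", ""]
    print(build_seed_list(registered, nominated))
```

Nominations under other extensions, such as example.nl here, are simply dropped, which mirrors the decision to narrow the pilot down to .frl. Judging whether a site actually says something about language, culture, geography or history remains a curatorial step that this sketch does not attempt.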

A typically Dutch difficulty is that the KB is not allowed to archive just any website: the owner must first be notified and must not object. The KB therefore advocates more up-to-date legislation on digital archiving, of the kind almost all other countries already have.

The results

The .frl project is not yet finished. Not all websites have been “crawled” yet and, partly due to the COVID-19 restrictions of recent years, a collection description and a final database have not yet been produced. Currently, the counter stands at almost 10,000 archived websites.

When the crawling is complete, two main data sets will be available: the archived websites themselves and a large collection of metadata. These two sources allow scientists to select the data relevant to their own research. Given the restrictive Dutch legislation, however, this is only possible within the walls of the KB.

One example of research already carried out while the archive was being built is linguistic in nature: it turns out that there is a huge variation in language use within Friesland itself. Whether this would also have been found in research based on traditional media remains to be seen!

Want to know more?

Want to know more about the KB’s web archiving project? On its website, there are three very accessible videos (in Dutch) about the past, present and future of web archiving and, of course, the web collections themselves.

Translated with DeepL