Open source code’s thousand-year Arctic digs

The world’s open source code finally arrived at the Arctic Circle last July 8, despite delays caused by the new coronavirus pandemic.

Last February 2, GitHub took a snapshot of all active public repositories on its platform to archive in the GitHub Arctic Code Vault, which was introduced in 2019 along with the GitHub Archive Program. The program’s mission is to preserve open source software for future generations by storing code in an archive built to last a thousand years.

Over the last several months, archive partner Piql wrote 21TB of repository data to 186 reels of piqlFilm (digital photosensitive archival film). The original plan was for GitHub’s team to fly to Norway and personally escort the world’s open source code to the Arctic, but as the world continued to endure a global pandemic, the team had to adjust its plans. Team members stayed in close contact with their partners, waiting for the time when it was safe to travel to Svalbard.

The code’s journey began at Piql’s facility in Drammen, Norway, from where the boxes holding the 186 film reels were shipped to Oslo Airport and then loaded into the belly of the plane that provides passenger service to Svalbard. Svalbard, roughly 600 miles (1,000 km) north of the European mainland, had only recently opened up to visitors from countries within the Schengen Area and the European Economic Area.

From coal to code

The code landed in Longyearbyen, a town of a few thousand people on Svalbard, where the boxes were met by a local logistics company and taken into intermediate secure storage overnight. The next morning, the boxes traveled to the decommissioned coal mine set in the mountain, and then to a chamber deep inside hundreds of meters of permafrost, where the code now resides, fulfilling the program’s mission of preserving the world’s open source code for over 1,000 years.

Millions of developers around the world contributed to the open source software now stored in the Arctic Code Vault. To recognise and celebrate these contributions, GitHub designed the Arctic Code Vault Badge, which is shown in the highlights section of a developer’s profile on the platform.

For this project, one partner is the Internet Archive (IA), a non-profit digital library which provides free public access to collections of digitised materials. In partnership with the GitHub Archive Program, the IA started its ongoing archive of GitHub public repositories last April 13.

At present, the IA is using a two-pronged approach. First, its well-known Wayback Machine is accessing and archiving raw GitHub data as WARCs, or Web ARChive files. As of mid-July, it had archived some 55TB of data.

Second, it aims to make entire archived GitHub repositories available via “git clone,” while also keeping repo comments, issues, and other metadata easily accessible on the web. This second initiative is well underway, and initial archiving is expected to commence this month.

Another partner is the Software Heritage Foundation, a non-profit, multi-stakeholder initiative launched by Inria in collaboration with UNESCO with the goal of collecting, preserving and sharing the source code of our software commons. It already archives more than 130 million projects, with their full development history. Of these, 100 million are from GitHub.

Thanks to the collaboration announced at GitHub Universe 2019, the archival engine is being improved with the goal of keeping pace with GitHub’s growth. If a project a developer is interested in, or its latest version, is not yet archived, they can trigger its archival immediately on

Still another partner is Project Silica, which is developing the first storage technology designed and built from the media up for cloud-scale storage of long-lived data. Drawing on recent discoveries in ultrafast laser optics, it stores data in quartz glass through a process that permanently changes the physical structure of the glass.

Quartz glass is a durable storage medium that offers data lifetimes of tens of thousands of years. It is resilient to electromagnetic interference, water, and heat, making it an ideal storage medium for ensuring the world’s open source software is preserved for future generations.

As a partner in the GitHub Archive Program, Project Silica is committed to driving storage innovation and to developing a sustainable and reliable storage technology for the world’s long-lived data. The team has archived 6,000 of the world’s most popular repositories as a proof of concept for future archives.

The Tech Tree

Every reel of the archive includes a copy of the “Guide to the GitHub Code Vault” in five languages, written with input from GitHub’s community and available at the Archive Program’s own GitHub repository. In addition, the archive will include a separate human-readable reel, called the Tech Tree, which documents the technical history and cultural context of the archive’s contents.

Inspired by the Long Now’s Manual for Civilization, the Tech Tree will consist primarily of existing works, selected to provide a detailed understanding of modern computing, open source and its applications, modern software development, popular programming languages, etc. It will also include works which explain the many layers of technical foundations that make software possible — microprocessors, networking, electronics, semiconductors, and even pre-industrial technologies. This will allow the archive’s inheritors to better understand today’s world and its technologies, and may even help them recreate computers to use the archived software.

Encapsulating the world’s cultural context and technical history is a challenging prospect, and the Tech Tree is expected to evolve over time. An initial draft list of works selected for the Tech Tree will soon be published to the Archive Program’s GitHub repository, along with, importantly, a request for community input.