Long-term archiving of web files

HighWire Press is developing an intriguing distributed system for online scientific journals that is meant to preserve content for hundreds of years or more.

HighWire Press, run by the Stanford University Libraries, has an ambitious goal. It is already a platform that makes more than 170 scientific journals accessible. In mid-February, 134,143 of the 615,602 articles in the archive were freely accessible. HighWire Press connects articles, authors, and citations across journals through links, offers a search engine, multimedia displays, supplementary information, and all archived issues, and also manages the many different access arrangements for content that requires a subscription. To strengthen the appeal of online journals, HighWire Press has now made a plan: to ensure the long-term continuity of scholarly articles and to guarantee, at least from the perspective of the 22nd century, that all content will be kept current and remain accessible.

For publishers, the internet is an unprecedented opportunity to publish documents of all kinds relatively cheaply, quickly, and easily, to link them together, to supplement or update them with other documents, and to keep them accessible in an archive. The web is still growing explosively and is becoming a decentralized, unorganized, and still largely uncensored world library in which documents of every possible origin can be viewed from anywhere in the world.

What seems to be the final realization of the dream of a library where all the world’s knowledge can be found, however, has its downsides. When Ptolemy founded the Library of Alexandria, his goal was to acquire all the books then in the world and bring them into the library. Measured against today’s situation, that was still a manageable task. The books and scrolls that arrived from all over the world were catalogued and systematically classified. But, as we know, the library burned down and nothing was left of the hundreds of thousands of documents, a trauma that, in the age of printing, spurred new projects such as the encyclopedias, intended to keep humanity from falling back into darkness. Printing itself opened up the possibility of preserving knowledge better, because many copies now existed at the same time on relatively durable material. Even if many or almost all of the copies were destroyed, a single surviving copy, accessible to a scribe, was enough to preserve the knowledge stored in it.

Nevertheless, all carrier materials have one thing in common: they are perishable; they decay, disintegrate and become illegible. Strangely enough, this decline in durability has intensified just as ever larger mountains of information are being pushed towards people. "Modern" paper lasts perhaps 80 years before acidity decomposes it; photographs, properly stored, may last 100 years; audio, video or data tapes only 20. Digitization has only exacerbated the problem, because digital data depends not only on the carrier material on which it is stored, but also on the software and hardware needed to read it. Whereas printed publications relied, to a certain extent, on a single standard, in information technology innovations chase each other at the same rate as the quantity of archived material grows, thanks to easy capture and cheap storage. After only a short time, computers and peripheral devices become not merely obsolete but, like the drives for 5¼-inch floppy disks, may no longer be available at all. High "agility" is required to keep the information accessible: migration, moving files from one carrier to the next, has to happen every few years, and with ever-growing mountains of data it ties up available resources despite increasing storage capacity.

The web is no exception to the Alzheimer’s disease of digital media, which may only be just beginning. In itself an exemplary universal library, permanently updated and made searchable by search engines, the internet is nevertheless particularly perishable. The average life span of a web page is just a few weeks; then it has disappeared, been replaced or updated. That may not always, or even often, be a great loss, but information decay is a pressing problem for institutions that want to make archives available not just in the short term but for as long as possible. There are, of course, already services such as www.alexa.com that store copies of web pages, but it is by no means guaranteed that Alexa actually stores all documents, and no one knows how long such a service will itself reliably exist. In addition, "external storage" of this kind raises copyright problems.

Libraries have the task of collecting and archiving publications independently of publishers and making them available to the public. Until now, this has mostly meant physically archiving documents in one place. Online publications, however, must not only be stored in digital form; their original functionality, such as cross-references to other pages and content, should also be preserved. The quality of online publications lies, for example as with HighWire, in the automatic forward and backward linking of articles, something that cannot be captured by "passive" storage on media such as CDs, tapes or microfilm. John Sack, director of HighWire, describes the difficulties that have to be solved, especially for online publications, if the information is to remain available and accessible via the internet for a long time: "online journals are particularly complicated because they involve temporary access or subscription terms, a continuously growing collection of articles and links, and the ‘traditional’ problems of changing technology for user devices, servers, programs, and storage."

There are trivial problems to be solved, for example when someone has subscribed to an online journal for a certain period and still wants, and needs, access after the subscription has ended to the articles covered by the payment. The greater difficulties, however, lie elsewhere, according to Sack: "there is a huge problem with the evolving technology. It is necessary to archive the past editions as they were, and they must remain compatible with the editions that follow them, with the server systems and the software, as well as with the software of the users. To ensure continuity despite unpredictable change over short and long periods of time, redundant procedures are essential." Sometimes scientific journals are discontinued, the website disappears and the servers are taken offline. Websites may be temporarily inaccessible, files may be lost or damaged. Journals may be bought by another publisher or completely restructured. Online journals that are currently free of charge may become subject to a fee. Copyright regulations may change …

The ambition, as Michael Keller, publisher of HighWire Press, describes it, is high: "stanford libraries is truly committed to the continuity of scholarly archiving. At a minimum, we look at the archival program from the perspective of the 22nd century, not only in terms of storing bits and bytes, but also in terms of access. Once a publisher has commissioned HighWire to provide access to a particular article or document, the scholarly community should be able to trust that the article or document will remain freely accessible as is for at least a decade or a generation." A prerequisite for this is the continued migration of formats, standards, and media, but Stanford University Libraries has also developed a new model for internet archiving, funded by the NSF and in collaboration with Sun Microsystems: "irreversible publishing", or LOCKSS (Lots Of Copies Keep Stuff Safe). A first test is to begin as early as summer 2000 and ultimately demonstrate that it is sufficient for individual customers, universities and libraries simply to subscribe to the online edition of a journal.

To enable the desired continuity and make it largely independent of any single institution, LOCKSS works as a self-organized, platform-independent, open-source system with no central control that stores online documents on local, interconnected servers. Content is stored on a local PC running Linux using the common Squid caching technique, which is said to be relatively inexpensive because it requires only a slow processor and little memory, but large hard disks. For example, a computer storing five years of the Journal of Biological Chemistry, the most comprehensive of the journals offered by HighWire, would require a 100 MHz Pentium processor, 32 MB of RAM, and two 16 GB hard disks.
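
To make this concrete, here is a minimal sketch, not the actual LOCKSS software, of what such a local cache node does at its simplest: fetch pages of a subscribed journal and keep copies on cheap local disk. The journal URL and the storage directory are hypothetical, and the real system relies on Squid-style caching rather than a hand-rolled script.

    # Minimal sketch of a local journal cache (illustration only, not LOCKSS code).
    import hashlib
    import os
    import urllib.request

    ARCHIVE_ROOT = "/var/lockss/cache"            # assumed local storage path
    JOURNAL_BASE = "https://example-journal.org"  # hypothetical journal site

    def cache_page(url: str) -> str:
        """Fetch one journal page and store it under a URL-derived name."""
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        # Name the copy after a hash of the URL so a later crawl of the
        # same page overwrites the old copy instead of duplicating it.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()
        os.makedirs(ARCHIVE_ROOT, exist_ok=True)
        path = os.path.join(ARCHIVE_ROOT, name)
        with open(path, "wb") as f:
            f.write(body)
        return path

    if __name__ == "__main__":
        # Mirror a (hypothetical) table-of-contents page.
        print(cache_page(JOURNAL_BASE + "/current-issue"))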

Journals to which a library subscribes, for example, are stored and updated automatically, with the appropriate access rights. Users only retrieve the cached files from these computers, however, if the publisher no longer makes them available on its own server. Since content is ideally stored on many computers, there should always be multiple copies from which files that are damaged, lost, or otherwise inaccessible via the internet can be restored and made accessible. The more libraries use LOCKSS, the more securely files can be archived, following the same principle that letterpress printing relies on. The cache of a journal on each of the networked LOCKSS computers checks whether a minimum number of copies exists in the overall system by querying the other LOCKSS computers. If a cache detects that there are not enough, it requests additional caching. Ten percent of the storage is meant to be reserved for this purpose. If no space is available anywhere, the journal is considered "endangered" and humans must intervene.
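
The replica-count check just described can be sketched roughly as follows. The peer list, the query mechanism, and the minimum-copy threshold are invented for illustration; the actual LOCKSS polling protocol is considerably more involved.

    # Rough sketch of the "are there enough copies?" check (assumptions noted inline).
    from dataclasses import dataclass

    MIN_COPIES = 3           # assumed minimum number of copies in the network
    RESERVED_FRACTION = 0.1  # roughly the 10% of storage kept free, per the article

    @dataclass
    class Peer:
        name: str
        has_copy: bool
        free_fraction: float  # share of this peer's disk still unused

    def check_replication(journal: str, peers: list) -> str:
        copies = sum(1 for p in peers if p.has_copy) + 1  # +1 for the local copy
        if copies >= MIN_COPIES:
            return f"{journal}: {copies} copies, OK"
        # Too few copies: ask a peer with reserved space to cache the journal too.
        for p in peers:
            if not p.has_copy and p.free_fraction >= RESERVED_FRACTION:
                return f"{journal}: only {copies} copies, asked {p.name} to cache it"
        # No peer has room left: the journal is endangered and humans must step in.
        return f"{journal}: only {copies} copies and no free space, ENDANGERED"

    if __name__ == "__main__":
        peers = [Peer("library-a", True, 0.40),
                 Peer("library-b", False, 0.05),
                 Peer("library-c", False, 0.20)]
        print(check_replication("Journal of Biological Chemistry", peers))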

Of course, LOCKSS is only supposed to step in with its cache when files on a publisher’s website are no longer accessible, since publishers naturally want readers to come to their own sites. It also remains possible to include HTTP directives that prevent caching, if publishers so desire. The LOCKSS computers, of course, have to be replaced over time and their storage expanded. The system requires that journals largely retain their URLs, have a logical structure, and that the files are based on HTML and related formats such as GIF or JPEG. New HTML versions or entirely different future formats, however, should also be accommodated automatically.
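
As an illustration of how a publisher can opt out of caching, the sketch below checks the standard HTTP Cache-Control and Pragma response headers before a page is stored. The URL is hypothetical and the surrounding logic is not taken from LOCKSS.

    # Honour standard no-cache directives before storing a page (illustration only).
    import urllib.request

    def may_cache(url: str) -> bool:
        """Return False if the publisher's response forbids caching."""
        with urllib.request.urlopen(url) as resp:
            cache_control = resp.headers.get("Cache-Control", "").lower()
            pragma = resp.headers.get("Pragma", "").lower()
        if "no-store" in cache_control or "no-cache" in cache_control:
            return False
        if "no-cache" in pragma:  # older HTTP/1.0-style directive
            return False
        return True

    if __name__ == "__main__":
        print(may_cache("https://example-journal.org/current-issue"))  # hypothetical URL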

And because it is an open-source project in which everyone has access to the source code, the hope is that programmers will suggest improvements and bug fixes during the testing phase and afterwards. A newsgroup will be set up for this purpose, moderated by David Rosenthal of HighWire, who will also control the official changes to the software. Later, if LOCKSS is successful, a group is to take over this task. The system could, of course, also be used for other purposes, such as the digital archives of the dot-coms, which could then make LOCKSS really profitable.
