
= Archiving the Web =

The Internet is one of the most difficult collections to archive. The main difficulty is that the web is constantly in flux: pages created today may not be available next week, yet their content often has historical value. Web archiving is therefore an ongoing process, and it is not always successful. Current methods are imperfect. Most rely on web crawlers to capture pages, and a site owner can prevent capture with a robots.txt file. A lack of understanding of the importance of web preservation also means that many web sites and online articles are taken down without any copy being preserved. This raises concern about a potential “digital dark age”: a period that future historians will have difficulty understanding because too little of its information was preserved. The Internet Archive and other information centers are working to preserve the Internet with the technology currently available. However, much of the digital information created is not housed on the web, so organizations will need to preserve those documents either in house or by sending them to an archive for preservation.
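To illustrate how a robots.txt file can prevent capture, here is a minimal sketch using Python's standard `urllib.robotparser` module. The crawler name `ExampleArchiveBot` and the sample rules are hypothetical, chosen only for illustration. Real archive crawlers use their own user-agent strings and may apply their own policies.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines; no network access needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: this file blocks every crawler from the whole site.
BLOCK_ALL = """User-agent: *
Disallow: /
"""

# Hypothetical rules: an empty Disallow line permits everything.
ALLOW_ALL = """User-agent: *
Disallow:
"""

# Hypothetical rules: only one directory is closed to one named crawler.
BLOCK_PRIVATE = """User-agent: ExampleArchiveBot
Disallow: /private/
"""

if __name__ == "__main__":
    page = "http://example.com/articles/today.html"
    print(crawler_allowed(BLOCK_ALL, "ExampleArchiveBot", page))
    print(crawler_allowed(ALLOW_ALL, "ExampleArchiveBot", page))
```

A site owner who wants pages preserved could run a check like this against the site's own robots.txt before assuming that crawlers are able to capture it.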

The Internet Archive and the Wayback Machine
The Internet Archive is a non-profit organization founded in 1996 to build an Internet library. Its collections include text, images, video, software and archived web pages from all over the Internet. The site is subdivided into various topics and gives users access to digital information that might otherwise have been lost. Among its collections is the Wayback Machine, one of several projects that archive web pages. Users enter a URL and find snapshots of that page on various dates. The Internet Archive has also curated several collections of pages captured by the Wayback Machine, grouped around specific topics or organizations. Because the Wayback Machine currently relies on web crawlers to capture websites, the Internet Archive encourages web builders and site owners who want a website preserved to verify that any robots.txt files permit web crawlers. The Wayback Machine itself can only be queried by URL. The curated collections, however, are available through the site and can be searched by keyword.
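Because lookup is by URL and date, a snapshot request can be sketched as simple string construction. The example below assumes the publicly used `web.archive.org/web/<timestamp>/<url>` address layout and the `archive.org/wayback/available` lookup endpoint. Neither is described in this article, so treat both as assumptions to check against the Internet Archive's own documentation. The function names are hypothetical, and the code only builds addresses; it does not contact any server.

```python
from datetime import datetime
from urllib.parse import urlencode

# Assumed address patterns (not taken from this article); verify before relying on them.
WAYBACK_PREFIX = "https://web.archive.org/web"
AVAILABILITY_API = "https://archive.org/wayback/available"

# Timestamps are written as 14 digits: year, month, day, hour, minute, second.
STAMP_FORMAT = "%Y%m%d%H%M%S"


def snapshot_url(page_url: str, when: datetime) -> str:
    """Build the address of the snapshot of page_url closest to the given date."""
    return f"{WAYBACK_PREFIX}/{when.strftime(STAMP_FORMAT)}/{page_url}"


def availability_query(page_url: str, when: datetime) -> str:
    """Build a query asking whether a snapshot near the given date exists."""
    params = {"url": page_url, "timestamp": when.strftime(STAMP_FORMAT)}
    return f"{AVAILABILITY_API}?{urlencode(params)}"


if __name__ == "__main__":
    date = datetime(2012, 3, 1)
    print(snapshot_url("http://example.com/", date))
    print(availability_query("http://example.com/", date))
```

Note that both functions take a URL as their only search key, which mirrors the limitation described above: there is no keyword parameter to pass.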

Archive-It[[image:archive-it.png align="right" caption="Image Captured Using Jing at Archive-It.org" link="@http://www.archive-it.org/"]]
Founded by the Internet Archive, Archive-It provides other archives and information centers with the information and software needed to create their own Internet archives. It is a subscription service that lets these institutions archive born-digital information and websites. The Internet Archive stores the captured information, so institutions can begin using the program without making large, expensive hardware purchases. Archive-It creates public collections by default and, unlike the Internet Archive's Wayback Machine, allows users to search them by keyword. Any number of organizations or individuals can use the software to build collections on specific subjects, to further research, or to give researchers access to information on individual topics, something not currently available through the Wayback Machine. This makes Archive-It a useful tool for researchers as well as for web archiving.

Bibliotheca Alexandrina
The Library of Alexandria’s Internet Archive serves in part as a back-up system for the Internet Archive, which is based in San Francisco, CA. Like the Internet Archive, the Bibliotheca Alexandrina uses the Wayback Machine to locate snapshots of websites, both current and no longer available. Although the system mirrors the original Internet Archive’s information, the focus of the Bibliotheca Alexandrina’s new collections is Africa and the Middle East. These collections include video, photos, web pages and other born-digital content. In addition to the Bibliotheca Alexandrina, a number of archives, libraries and information centers use Archive-It or similar software to archive portions of the web and feed that information to the Internet Archive. Because it is a mirror, the Bibliotheca Alexandrina can be used to find the same web page information when the Internet Archive is running slowly, and it can serve as a backup if the Internet Archive were ever to crash.