CA1930e - A Snapshot Model for Web Archiving: Stanford East Asia Library’s Japanese Web Archive / Regan Murphy Kao

The article translated into Japanese ( http://current.ndl.go.jp/ca1930 )

 

Current Awareness
No.336 June 20, 2018


CA1930

A Snapshot Model for Web Archiving:
Stanford East Asia Library’s Japanese Web Archive


ゆく河の流れは絶えずして、しかも、もとの水にあらず。淀みに浮かぶうたかたは、かつ消えかつ結びて、久しくとゞまりたる例なし。世中にある、人と栖と、またかくのごとし(1)

While the river’s current is ceaseless, the passing water constantly changes. The bubbles that form on its surface, form and burst, never lasting long. People and their dwellings are the same(2).

When Kamo no Chōmei (1155?-1216) in the Hōjōki described the products of human effort as being as ephemeral as forming and bursting river foam, he hit on a theme that will resonate with most scholars. Historians attempt to paint a picture of past time based on the residue of that constantly moving flow. Librarians attempt to preserve critical writings and artifacts that will give a multivalent glimpse of that moment in time. If the process of selection and preservation from an ever-moving target was never easy, with the emergence of the web, the river has become a torrent. The necessity to preserve – and to preserve a broad spectrum of conversations and perspectives – is all the more urgent.

Such was the genesis of the Japanese web archive program at Stanford University’s East Asia Library. We hoped to develop a model for web archiving that would ensure a glimpse of a wide range of topics that were perceived as important at the moment of archiving. It would aim to complement (and try not to duplicate) the web archiving work being performed by the National Diet Library and others(3). It seemed important to capture these critical conversations and to preserve some of the voices that would otherwise be lost to history. It was an experimental project. Here, I will describe the process we followed and some of the things we learned during these two years(4).

Attending to current trends in contemporary Japanese studies and to recurring topics in a variety of news media outlets, our web archiving team(5) first identified key topics in current society, such as child care, employment, energy and robotics. We built long lists of websites on each topic and then whittled down the lists to ensure that we had a variety of actively updated sites from across the archipelago. We then sent notification letters to site owners informing them of our desire to archive their website and giving them the option to opt out(6). We responded to innumerable inquiries about the project and in the process honed the explanations that we posted on the project website(7). Then we determined whether the sites should be crawled monthly, quarterly or just once. We ran test crawls in Archive-It, the web archiving service from the non-profit Internet Archive, and performed reviews for quality assurance so that the real crawl size would be manageable. The process of building the collection continued throughout the two years. As new key topics arose, we added more sites. We tried to collect at least ten sites on each topic to ensure a variety of voices were preserved. Below are the topics and subtopics; the number of sites archived is in parentheses(8).

  • Politics & economics (54): political parties, financial institutions, political leaders, TPP, security
  • Employment & labor issues (46): labor unions, industrial cooperatives, irregular employment for youth, at-home work
  • Social issues (156): aging society, Okinawa, support for women, separate surnames for married couples, food destruction, baby drop box, social withdrawal, international help, LGBTQ, influential social figures, revitalization of rural areas, preservation of old towns, 2020 Tokyo Olympics, blog rankings (politics)
  • Child care & pregnancy (29): wait lists for child care, authorized nursery schools, egg freezing services
  • Academic & humanities (41): higher education, religious organizations, university departments (law, economics, sociology)
  • Environment & nature (75): resources, natural energy, anti-nuclear power generation, global warming, nuclear energy-related institutes
  • Science & technology (41): robots, advanced materials, research institutes
  • Internet & computers (30): net harassment, cyber security
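As a rough illustration of how the collection is distributed, the counts in the list above can be tallied as follows. This is only a sketch: the names `SITE_COUNTS`, `total_sites` and `topic_shares` are hypothetical, and it assumes each site is counted under exactly one topic, which the list itself does not state.

```python
# Site counts per topic, copied from the list above (accurate as of May 1, 2018).
# Assumption: no site is counted under more than one topic.
SITE_COUNTS = {
    "Politics & economics": 54,
    "Employment & labor issues": 46,
    "Social issues": 156,
    "Child care & pregnancy": 29,
    "Academic & humanities": 41,
    "Environment & nature": 75,
    "Science & technology": 41,
    "Internet & computers": 30,
}


def total_sites(counts):
    """Return the total number of archived sites across all topics."""
    return sum(counts.values())


def topic_shares(counts):
    """Return each topic's share of the collection as a percentage (one decimal)."""
    total = total_sites(counts)
    return {topic: round(100 * n / total, 1) for topic, n in counts.items()}


if __name__ == "__main__":
    shares = topic_shares(SITE_COUNTS)
    print(f"Total archived sites: {total_sites(SITE_COUNTS)}")
    # List topics from largest to smallest share.
    for topic in sorted(SITE_COUNTS, key=SITE_COUNTS.get, reverse=True):
        print(f"{topic}: {SITE_COUNTS[topic]} sites ({shares[topic]}%)")
```

Under that assumption the eight topics sum to 472 sites, with social issues making up roughly a third of the collection.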

To avoid confusion between archived sites and live sites, we enforced a six-month embargo for publishing the archived sites. The next step was to create metadata for each site so that once they were ingested into our digital repository, they would be discoverable from anywhere in the world through our search portal, SearchWorks(9). In late fall of 2018, we also plan to create an online exhibit focusing on the Japanese web archiving project that users will be able to explore.

We learned a lot in the process of working on this web archive. We were touched, truly moved, by the outpouring of gratitude when it was announced that we had archived the blog of Mao Kobayashi(10). We were humbled by the emails from site owners who thanked us and said that they would work even harder on their project. It was like no other collection development project that I have worked on.

The real-time response to our work was at once gratifying and concerning. As librarians, we were attempting to capture artifacts of this historical moment that would give an accurate picture for posterity. However, because of the necessity of notifying website owners, we risked affecting these artifacts, distorting them slightly as site owners became aware of their role in representing this moment. Further, we came to realize that site owners perceived their selection into our archive as an endorsement of their work. We aimed only to give a broad glimpse of Japanese society, focusing on sites that were not preserved in other web archives. We further honed our notification letter to make this aspect of the project clear, but our awareness of this issue led to a degree of self-censorship. Given our position as a prominent academic library based in the United States, we in no way wanted to be perceived as supporting certain more politically heated conversations that were taking place on the web. In these cases, we opted to take advantage of the Internet Archive’s approach of archiving sites without informing site owners, through the use of the Wayback Machine(11). While those sites will not show up in our archive, our archive does indirectly capture some of these heated conversations so that reference to major issues will be available to future scholars. The web archive, Snapshot of Japan, 2016-2018, was an experiment in a “snapshot” model for web archiving(12).

We succeeded in archiving many important topics that would not have otherwise been archived. We succeeded in demonstrating that there are a lot of voices – important human voices – that will be left out of the historical picture without a concerted effort to preserve them. When I think about this project, the words of Kamo no Chōmei always come to mind. Kamo no Chōmei compared people and their dwellings to the forming and bursting bubbles that appear briefly on a river’s surface. If he envisioned such structures as ephemeral, imagine his surprise at the speed with which websites, among the most prolific products of current times, appear and disappear.

As the torrent of time continues, the librarian’s role has become all the more important. The mission: to select, preserve, and make accessible the voices that will otherwise be forgotten. Let’s ensure that history can be told not only from one perspective, but from many.

Regan Murphy Kao, East Asia Library, Stanford University Libraries

 

(1)Kamo no Chōmei, Yoshida Kenkō. Hōjōki. Tsurezuregusa. Edited by Nishio Minoru, Nihon koten bungaku taikei, vol. 30. (Tokyo: Iwanami Shoten, 1957), p. 23.

(2)My translation of the opening lines of the Hōjōki.

(3)When we first began the project, we solicited from local scholars suggestions for topics to archive. These suggestions helped create the first batch of websites that were added to the collection. However, many of these topics were already being archived by the National Diet Library. Therefore, as we moved forward, we honed our collection to focus on topics that did not duplicate other collections. It was not always possible, given that archived sites are often embargoed for a period of time.

(4)I wish to express my gratitude to each of the website owners who agreed to allow us to preserve their sites and also to the Freeman Spogli Institute for International Studies (FSI) and Stanford Libraries for providing funds for the project.

(5)Our core team of two was supported strongly by several different groups within the Stanford Libraries, in particular our partners in the Digital Library Systems and Services (DLSS), EAL and central technical services.

(6)We translated into Japanese the standard notification letter used by Stanford.

(7)For more information about the project, see
https://library.stanford.edu/eal/japanese-collections/japanese-collection-news/snapshot-japan-2016-2018, (accessed 2018-04-01).

(8)The number of sites is accurate as of May 1, 2018.

(9)Some examples of archived sites in our digital repository:
http://purl.stanford.edu/hm701kx7587, (accessed 2018-04-01).
http://purl.stanford.edu/cv228rc9730, (accessed 2018-04-01).

(10)See the Stanford Libraries Blog post from August 1, 2017:
https://library.stanford.edu/blogs/stanford-libraries-blog/2017/08/preserving-ephemeral-reflections-archiving-japanese-websites, (accessed 2018-04-01).

(11)The Wayback Machine, developed by the Internet Archive, allows any web user to enter a website in the “Save Page Now” box. Please see
http://archive.org/web/, (accessed 2018-04-01).
Our team used this feature from time to time when a website seemed historically important but not a good fit for our project.

(12)The prevalent model for web archiving projects conducted by subject experts is to select a single theme, event or content type (such as blogs, or news media).
See, for example, Webster (2017), Niu (2012), Brügger (2010, 2017). The “snapshot” model attempts to preserve a wide variety of topics perceived by the web archiving team as historically important and critical for giving future scholars a glimpse of contemporary society.

 

Ref:
Brügger, Niels. Web 25: Histories from the first 25 Years of the World Wide Web. New York: Peter Lang, 2017.

Brügger, Niels. Web History. New York: Peter Lang, 2010.

Lepore, Jill. “The Cobweb: Can the Internet be archived?” New Yorker, January 26, 2015.
https://www.newyorker.com/magazine/2015/01/26/cobweb, (accessed 2018-04-01).

Murphy Kao, Regan. “Preserving the ephemeral: Reflections on archiving Japanese websites.” Stanford Libraries Blog, August 1, 2017.
https://library.stanford.edu/blogs/stanford-libraries-blog/2017/08/preserving-ephemeral-reflections-archiving-japanese-websites, (accessed 2018-04-01).

Niu, Jinfang. “An Overview of Web Archiving.” D-Lib Magazine, vol. 18, no. 3/4, March/April 2012.
https://doi.org/10.1045/march2012-niu1, (accessed 2018-04-01).

Webster, Peter, “Users, technologies, organisations: Towards a cultural history of world web archiving,” in Web 25: Histories from the first 25 Years of the World Wide Web, ed. Niels Brügger (New York: Peter Lang, 2017), 179-99.