The original article was published in Japanese ( https://current.ndl.go.jp/e2533 )
Current Awareness-E No.442
1 September, 2022
Release of the NDL Ngram Viewer: A Service to Visualize Full-Text Data
The National Diet Library (NDL) conducts research and studies to help develop the next generation of library systems. Recently, the NDL has been developing and providing experimental services that envision new uses for digitized materials.
In fiscal year 2021, the NDL undertook an optical character recognition (OCR) text conversion project for digitized materials and created OCR text data for 2.47 million digitized materials (223 million images). This accounts for almost all digitized materials in the National Diet Library Digital Collections as of the end of 2020. As a service that utilizes this OCR text data, the “NDL Ngram Viewer” was released on the NDL Lab website on May 31, 2022. As of August 2022, the service visualizes search results over the text data of approximately 280,000 books whose copyright protection periods have expired.
“Ngram” refers to a method of dividing a string of characters into units of n consecutive characters to facilitate search. In general, “ngram viewer” refers to services that use the full-text data of books to visualize the frequency of specific words or phrases by publication date. This type of service originated with the “Google Books Ngram Viewer” released in 2010. The Google Books Ngram Viewer was designed as one way to utilize the full-text data accumulated through digitization and OCR processing of materials for the “Google Books” service. It allows users to visualize how frequently words or phrases are used, by publication date, in materials in English, French, Chinese, and other languages.
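The idea of splitting text into n-character units and counting them by publication year can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the way the NDL Ngram Viewer is implemented: the function names `char_ngrams` and `yearly_counts` and the three-entry corpus are hypothetical.

```python
from collections import Counter, defaultdict

def char_ngrams(text, n):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def yearly_counts(corpus, n):
    """Map each publication year to a Counter of n-gram occurrences."""
    counts = defaultdict(Counter)
    for year, text in corpus:
        counts[year].update(char_ngrams(text, n))
    return counts

# Hypothetical toy corpus of (publication year, OCR text) pairs.
# The second entry uses the variant spelling with a full-size "ケ".
corpus = [
    (1900, "関ヶ原の戦"),
    (1900, "関ケ原の役"),
    (1910, "関ヶ原合戦"),
]

counts = yearly_counts(corpus, 3)
print(char_ngrams("関ヶ原の戦", 3))   # ['関ヶ原', 'ヶ原の', '原の戦']
print(counts[1900]["関ヶ原"], counts[1910]["関ヶ原"])
```

Plotting the per-year counts for a chosen n-gram gives the kind of frequency curve that an ngram viewer displays. Note that the two spellings are counted as different n-grams here.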
Similar efforts include the “Bookworm” project of HathiTrust (see E1389, CA1760). Another example is the “Gallicagram” project, an experimental service that offers advanced functions for visualizing word frequencies in the full-text data of Gallica, the digital library of the Bibliothèque Nationale de France. As of 2022, however, these earlier Ngram viewers do not support Japanese search queries, with the exception of some collections in the “Bookworm” project. Furthermore, Japanese uses more types of characters than languages written mainly in alphabets. OCR reading errors can occur (for example, “己” and “巳” are easily confused in OCR text when the printing is unclear). There are also inconsistencies in notation, where words that should be treated as identical exist in several different forms (for example, “関ヶ原”, “関ケ原”, and “関が原”). A function that searches for all of these forms at once would be desirable.
The wildcard (*) search function of the Google Books Ngram Viewer is one example of a function that searches for several words at once. It lets users put a placeholder for any string of characters into the search query. For example, by searching “in * to” in English full-text data, one can look for the words that tend to fall between “in” and “to”. In Japanese full-text data, however, searching for phrases that start with “関” and end with “原” returns not only “関が原” and “関ケ原” but also phrases such as “関白太政大臣藤原”. It is undesirable for the keywords being sought to be buried among clearly irrelevant ones. Therefore, in addition to the wildcard function, the NDL Ngram Viewer supports “regular expression” searches, in which users can set detailed conditions, such as character types and character lengths, that are not available in the Google Books Ngram Viewer.
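The over-matching problem can be reproduced with a small Python sketch. It uses the shell-style wildcard matching of Python's standard `fnmatch` module as a stand-in. This is an assumption made for illustration and may not behave exactly like the wildcard of either viewer. The keyword list is hypothetical.

```python
import fnmatch

# Hypothetical keywords extracted from full-text data
keywords = ["関ヶ原", "関ケ原", "関が原", "関白太政大臣藤原", "関東平野"]

# "*" stands for any run of characters, of any length
hits = [k for k in keywords if fnmatch.fnmatchcase(k, "関*原")]
print(hits)
```

Because the wildcard places no limit on the length or type of the characters it stands for, “関白太政大臣藤原” is returned alongside the three intended spellings.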
“Regular expression” is a notation that describes multiple strings sharing the same pattern in a single expression, following fixed rules. In the NDL Ngram Viewer, for example, users who wish to look up keywords that contain any one character between “関” and “原” can search “関.原”. To allow between one and three arbitrary characters, they can search “関.{1,3}原”. Users who know in advance exactly which characters to look up can use “関(ケ|ヶ|が)原” to narrow the search to the three forms “関ヶ原”, “関ケ原”, and “関が原”. Search results are linked to search queries in the “Next Digital Library”, an experimental search service that, like the NDL Ngram Viewer, is available on the NDL Lab website, so users can refer to the full-text search results.
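The three patterns above can be tried out with Python's standard `re` module. This is a minimal sketch under two assumptions: that the viewer's regular expression syntax agrees with Python's for these simple patterns, and that a pattern must match the whole keyword (hence `fullmatch`). The helper `matching` and the keyword list, including “関の原” and “関東平原”, are hypothetical.

```python
import re

# Hypothetical keywords; the last entry has nothing between the two characters
keywords = ["関ヶ原", "関ケ原", "関が原", "関の原", "関東平原", "関白太政大臣藤原", "関原"]

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [w for w in words if compiled.fullmatch(w)]

# Exactly one arbitrary character between the two
print(matching("関.原", keywords))
# Between one and three arbitrary characters
print(matching("関.{1,3}原", keywords))
# Only the three listed characters
print(matching("関(ケ|ヶ|が)原", keywords))
```

Each pattern is narrower or wider than the last: the first keeps the four keywords with a single character in the middle, the second adds “関東平原”, and the third keeps only the three known spellings. None of them returns “関白太政大臣藤原”.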
As of August 2022, the NDL Ngram Viewer covers only about 280,000 books whose copyright protection periods have expired. Users should therefore note that, for materials published after the 1950s, both the population and the yearly frequency of appearance are smaller. Errors also occur because the searchable strings are limited to what OCR was able to read. Users should treat the visualization results as a rough guide and use them together with full-text search results. In December 2022, the National Diet Library Digital Collections will be renewed to expand the scope of materials available for full-text search. Around the same time, the scope of the NDL Ngram Viewer is also planned to be expanded.
Full-text search using OCR text data gives users an advanced information-seeking experience. However, as with bibliographic search, there remains a concern that users may miss materials that do not match their search keywords because of variations in notation. We hope that, by using the new search functions of the NDL Ngram Viewer, users will be able to visually identify effective search keywords and reach more of the materials they need.
Written by Aoike Toru
Digital Information Department
Digital Information Planning Division
Research and Development for Next-Generation Systems Office
Translated by Okada Aya
*Notes are not translated and remain in Japanese language.
Ref:
国立国会図書館デジタルコレクション.
https://dl.ndl.go.jp/
“1 令和3年度デジタル化資料のOCRテキスト化”. NDL Lab.
https://lab.ndl.go.jp/data_set/ocr/r3_line/
“NDL Ngram Viewerの公開について”. NDL Lab. 2022-05-31.
https://lab.ndl.go.jp/news/2022/2022-05-31/
NDL Lab.
https://lab.ndl.go.jp/
NDL Ngram Viewer.
https://lab.ndl.go.jp/ngramviewer/
“NDL Ngram Viewer”. NDL Lab.
https://lab.ndl.go.jp/service/ngramviewer/
Google Books Ngram Viewer.
https://books.google.com/ngrams/
bookworm: HathiTrust.
https://bookworm.htrc.illinois.edu/develop/
Gallicagram.
https://shiny.ens-paris-saclay.fr/app/gallicagram
次世代デジタルライブラリー.
https://lab.ndl.go.jp/dl/
“Next Digital Library (English)”. NDL Lab.
https://lab.ndl.go.jp/service/tsugidigi/tsugidigi_en/
総務部支部図書館・協力課. 講演会「HathiTrustの挑戦」<報告>. カレントアウェアネス-E. 2013, (230), E1389.
https://current.ndl.go.jp/e1389
田中敏. デジタル化資料の共同リポジトリHathiTrust―図書館による協同の取り組み. カレントアウェアネス. 2011, (310), CA1760, p. 14-19.
https://doi.org/10.11501/3485918