The original article was published in Japanese (https://current.ndl.go.jp/e2154).
Current Awareness-E No.372
11 July, 2019
The National Diet Library Releases Next Digital Library
The National Diet Library (NDL) has been engaged in research toward developing a next-generation library system. As one outcome, NDL released a web service called “Next Digital Library” on the NDL Lab website on March 29, 2019.
The purpose of Next Digital Library is to test the technical effectiveness of new functions expected to be implemented in the next generation of library systems: full-text search, automatic image processing based on machine learning (AI technology), and the IIIF API (see E1989). By releasing leading-edge functions experimentally through this web service and obtaining feedback from users and engineers, we believe it will be easier to foresee what to expect when the functions are formally introduced. The service contains a subset of the materials in Nippon Decimal Classification (NDC) Class 6 (Industry) that have been released online in the National Diet Library Digital Collections after their copyright protection periods expired. As of June 20, the service offers about 21,000 OCR-processed materials. NDC Class 6 was chosen as the first set of resources to be released because its diverse materials include illustrations and were considered suitable for demonstrating machine learning technology. The service plans to gradually expand the scope of the materials it covers. The following sections describe the current functions of Next Digital Library, divided into search functions and provision functions, and then discuss future prospects.
The service offers two search functions: full-text search and illustration search.
Full-text search directly uses the text data produced by OCR processing. Although there is still room for improvement in search accuracy, we were able to provide a full-text search function without the cost of human labor. The detailed result page provides a snippet view of the 100 characters before and after the search keyword, as well as links to the image frames of the pages containing the keyword.
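The snippet view described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the service's actual implementation: the function name `snippets_for` and the plain substring matching are hypothetical, and a real full-text engine would normally match through its own index rather than scanning the OCR text.

```python
def snippets_for(text, keyword, window=100):
    """Return one snippet per occurrence of `keyword` in `text`.

    Each snippet holds the keyword plus up to `window` characters of
    context on each side (fewer near the start or end of the text).
    Hypothetical helper; plain substring matching is an assumption.
    """
    if not keyword:
        return []  # an empty keyword would otherwise match everywhere
    snippets = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        left = max(0, start - window)          # clamp at the start of the text
        right = min(len(text), end + window)   # clamp at the end of the text
        snippets.append(text[left:right])
        start = text.find(keyword, end)        # continue after this match
    return snippets
```

Calling `snippets_for(ocr_text, "鉄道")` would return a list of context strings, one for each place the keyword appears in the OCR text of a page.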
Illustration search aims to provide an information-seeking method different from text search. By choosing an illustration, users can search for other materials with similar illustrations. To realize this function, the service applies a machine learning method called semantic segmentation, in which the service automatically recognizes layout elements of materials, such as “paragraphs” or “illustrations,” and colors them in, much like a coloring book. Based on the results of this automatic coloring, the service clips the parts colored as “illustration,” extracts the characteristics of each clipped image, and displays other illustrations with similar characteristics as search results.
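The clip-then-compare flow can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the label id `ILLUSTRATION`, the representation of the segmentation result as a grid of label ids, the use of a single bounding box (which would merge several illustrations on one page), and cosine similarity as the comparison measure. The article does not state which features or similarity measure the service uses.

```python
import math

ILLUSTRATION = 2  # hypothetical label id assigned to "illustration" pixels


def bounding_box(mask, label):
    """Return (top, left, bottom, right), inclusive, enclosing every cell
    of `mask` (a list of rows of label ids) equal to `label`, or None."""
    rows = [r for r, row in enumerate(mask) if label in row]
    if not rows:
        return None
    cols = [c for row in mask for c, value in enumerate(row) if value == label]
    return rows[0], min(cols), rows[-1], max(cols)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, index, top_k=3):
    """Rank the ids in `index` (id -> feature vector) by similarity to `query`."""
    ranked = sorted(index,
                    key=lambda key: cosine_similarity(query, index[key]),
                    reverse=True)
    return ranked[:top_k]
```

In such a design, `bounding_box` would locate the region to clip from a page, a separate feature extractor (not shown) would turn each clipped image into a vector, and `most_similar` would return the ids of the closest illustrations.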
As provision functions, Next Digital Library offers background whitening and automatic processing for vertically oriented viewing.
The background whitening function makes it easier for users to read digitized illustrations in materials whose readability has deteriorated with age. This function uses pix2pix, a machine learning method that learns how to convert one image into another. To enhance readability, discolored images are automatically processed so that only the backgrounds are whitened.
The automatic processing function for vertically oriented viewing makes it easier for users to view horizontal two-page spread images on the vertical screens of smartphones. Most images in the NDL Digital Collections are provided as horizontal two-page spreads. This function also uses machine learning, combining a method that detects the “center line” in the middle of the spread with a method that recognizes the outline of the material. By automatically trimming the parts outside the outline so that only the image of the material remains, and by dividing the spread at the center line, the service is able to provide a one-page view suitable for vertical screens.
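Once the outline and the center line are known, the trimming and splitting step itself is simple geometry. The sketch below assumes the two detectors have already produced an outline box and a center-line column. The function names, the list-of-rows image representation, and the half-open box convention are all hypothetical choices for illustration, and the learned detection itself is not shown.

```python
def crop(image, top, left, bottom, right):
    """Crop a row-major image (a list of pixel rows) to the half-open
    box covering rows top..bottom-1 and columns left..right-1."""
    return [row[left:right] for row in image[top:bottom]]


def split_spread(image, outline, center_x):
    """Trim a two-page spread to `outline` and divide it at `center_x`.

    `outline` is (top, left, bottom, right) and `center_x` is the column
    of the detected center line; both are assumed to come from upstream
    detectors. Returns (left_page, right_page).
    """
    top, left, bottom, right = outline
    if not left < center_x < right:
        raise ValueError("center line must lie inside the outline")
    left_page = crop(image, top, left, bottom, center_x)
    right_page = crop(image, top, center_x, bottom, right)
    return left_page, right_page
```

Which of the two returned pages should be shown first depends on the reading direction of the material, which this sketch leaves to the caller.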
To display images, the service uses the IIIF API, a standard for the international interoperability of images that the NDL Digital Collections adopted in 2018. As its viewer, the service uses a customized version of Leaflet, a lightweight open-source JavaScript library.
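For readers unfamiliar with IIIF, an image request in the IIIF Image API is an ordinary URL whose path segments are the identifier, region, size, rotation, quality, and format, in that order. The Python sketch below builds such a URL. The base URL and identifier in the usage note are placeholders rather than real NDL endpoints, and the default size keyword `max` is an assumption that should be checked against the API version a given server implements.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build a IIIF Image API request URL of the form
    {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}.

    The identifier is percent-encoded so that characters such as "/"
    do not break the path structure.
    """
    encoded = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded}/{region}/{size}/{rotation}/{quality}.{fmt}"
```

For example, `iiif_image_url("https://example.org/iiif", "book/1", region="pct:0,0,50,100")` requests only the left half of an image, since the `pct:` region form gives x, y, width, and height as percentages of the full image.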
We intend to continue our research in order to develop and release potentially effective functions. We will also continue our efforts to improve the performance of the functions already released by incorporating the latest trends in technology and by expanding the datasets used for machine learning.
Our next mission is to release the machine learning datasets used to develop Next Digital Library, as well as the source code used for the experiments, and to encourage their wide use outside NDL. We hope that the current and future functions of Next Digital Library will capture engineers’ attention and inspire them to create new services from the NDL datasets.
We believe that by incorporating insights from machine learning technology, which is making remarkable progress, we will be able to shed new light on the attractive features of the data resources that institutions offering digital archives have organized and provided. We intend to continue our efforts in the spirit of “practice what you preach.”
Written by Aoike Toru
Digital Information Department
Digital Information Services Division
Research and Development for Next-Generation Systems Office
Translated by Okada Aya
Ref:
https://lab.ndl.go.jp/
https://lab.ndl.go.jp/dl/
https://www.ndl.go.jp/jp/news/fy2019/__icsFiles/afieldfile/2019/04/04/pr190405_02.pdf
http://dl.ndl.go.jp/
http://iiif.io/
http://dl.ndl.go.jp/ja/help_iiif.html
https://conf2018.jadh.org/files/Proceedings_JADH2018.pdf
http://id.nii.ac.jp/1001/00192359/
https://arxiv.org/pdf/1802.02611.pdf
https://arxiv.org/pdf/1611.07004v1.pdf
https://leafletjs.com/
E1989
E2117