E2744e – Release of Full Text Data for All Time Periods in the Database System for the Minutes of the Imperial Diet

The original article published in Japanese ( https://current.ndl.go.jp/e2744 )

Current Awareness-E No.490

October 31, 2024



On August 23, 2024, the National Diet Library (NDL) released full text data of the stenographic transcripts of the Imperial Diet for the prewar and wartime periods (November 1890 (Meiji 23) to August 1945 (Showa 20)) in the “Database System for the Minutes of the Imperial Diet.” The Japanese-language text data was created by OCR processing of approximately 270,000 page images of the stenographic transcripts that were already available for viewing in the database. Together with the previously released data for the postwar period (September 1945 (Showa 20) to March 1947 (Showa 22)), it is now possible to run full-text searches and display the text of questions and answers, bills, and other recorded content for the entire period of the Imperial Diet. This article introduces the work involved in the OCR text conversion.

● How the Full Text Data Was Created

When it launched in July 2005, the Database System for the Minutes of the Imperial Diet offered only part of the full text data for the postwar period. The full text data for the entire postwar period was released in March 2006. This postwar text data was created by manual transcription and proofreading.

In contrast, the prewar and wartime text data released this time was created through OCR processing using AI technology, which has advanced rapidly in recent years, with almost no proofreading. The AI-OCR was reused from the project “2021 Text Conversion of Mass Digitalized Materials Using Commercial AI-OCR,” which covered approximately 2.47 million digitized items in the NDL Digital Collection. Although this AI-OCR recognizes printed Japanese text from the Meiji period onwards with high accuracy, it had not been trained specifically on stenographic transcripts of the Imperial Diet. We therefore evaluated its accuracy before loading the full text data into the Database System for the Minutes of the Imperial Diet.

● Evaluation of AI-OCR Recognition Performance

To evaluate the recognition performance of the AI-OCR on stenographic transcripts of the Imperial Diet, 100 pages were randomly selected from the 270,000 pages of prewar and wartime transcripts and compared against correct text data created and proofread by hand. The F-value (an indicator of recognition performance), calculated character by character for each image, had a median of 0.983. This is only a reference value, because the amount of correct text data is too small to serve as an overall evaluation, but it corresponds to a recognition error in about 2 out of every 100 characters.
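As a rough illustration of how a character-based F-value can be computed for each page and then summarized by its median, here is a minimal Python sketch. The article does not describe the alignment method NDL actually used; counting matched characters with `difflib.SequenceMatcher`, and the function names `char_f_value` and `median_f_value`, are assumptions made only for this example.

```python
from difflib import SequenceMatcher
from statistics import median


def char_f_value(ocr_text: str, truth_text: str) -> float:
    """Character-level F-value of one page of OCR output against correct text."""
    if not ocr_text or not truth_text:
        return 0.0
    # Assumption: characters are aligned with difflib; a real evaluation may
    # use a different alignment (for example, edit-distance based).
    matcher = SequenceMatcher(None, ocr_text, truth_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    if matched == 0:
        return 0.0
    precision = matched / len(ocr_text)   # share of OCR characters that are correct
    recall = matched / len(truth_text)    # share of correct characters that were found
    return 2 * precision * recall / (precision + recall)


def median_f_value(pages: list) -> float:
    """Median F-value over (ocr_text, truth_text) pairs, one pair per page image."""
    return median(char_f_value(ocr, truth) for ocr, truth in pages)
```

Reporting the median rather than the mean keeps a few badly recognized pages from dominating the summary figure.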

● Reading Order of Text Data

Most stenographic transcripts of the Imperial Diet are laid out in multiple columns. Because the OCR processing described above did not recognize these columns, the resulting text was not in the correct reading order and did not form coherent sentences. We therefore developed a program that recognizes the layout of lines and columns from the document image and the OCR results. The reading order of the text data was then rearranged according to the recognized layout information.
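The idea of rearranging OCR lines by layout can be sketched as follows. This is not the program NDL developed: it assumes vertically written text arranged in equal-height horizontal tiers stacked from top to bottom, takes the number of tiers as a given input instead of detecting it, and uses a hypothetical `OcrLine` record for the OCR output.

```python
from dataclasses import dataclass


@dataclass
class OcrLine:
    text: str
    x: float       # left edge of the line's bounding box
    y: float       # top edge of the line's bounding box
    width: float
    height: float


def reorder_lines(lines, page_height: float, n_tiers: int):
    """Sort lines tier by tier (top to bottom), right to left within a tier."""
    tier_height = page_height / n_tiers

    def reading_key(line: OcrLine):
        center_y = line.y + line.height / 2
        # Assumption: tiers have equal height; clamp to the last tier.
        tier = min(int(center_y // tier_height), n_tiers - 1)
        # Vertical Japanese text is read from the rightmost line leftwards.
        return (tier, -line.x)

    return sorted(lines, key=reading_key)
```

Sorting only by horizontal position would interleave lines from the upper and lower tiers, which is the kind of broken reading order described above.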

● Segmenting and Matching Speech Units

The Database System for the Minutes of the Imperial Diet has functions specific to stenographic transcripts. Because speaker information is attached to the text (the speaker's name and its reading, affiliated faction, title, and so on), users can limit a full-text search to a particular speaker or download text data in units of individual speeches. This required a mechanism for segmenting the text data into speech units and matching each unit with its speaker information, so we also developed a segmentation and matching program for speech units.

As a result of these efforts, we were able to prepare text data that is sufficiently useful for full-text searches. We decided to release the data after improving the system so that the data can be maintained continuously after release.
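A simplified sketch of segmenting text into speech units and matching them to speaker information is shown below. It is an illustration under assumptions, not the program described above: it assumes each speech begins on a line starting with the marker “○” followed by a speaker label, and the `directory` lookup table, the function names, and the sample names used with it are all hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Speech(NamedTuple):
    speaker_label: str   # label as printed, e.g. title plus name
    text: str            # body of the speech


# Assumption: a speech starts with "○", then a label, then whitespace.
SPEAKER_LINE = re.compile(r"^○(?P<label>\S+)\s*(?P<rest>.*)$")


def segment_speeches(lines) -> list:
    """Group reading-ordered lines into speeches, one per speaker marker."""
    speeches = []
    label, buffer = None, []
    for line in lines:
        m = SPEAKER_LINE.match(line)
        if m:
            if label is not None:
                speeches.append(Speech(label, "".join(buffer)))
            label, buffer = m.group("label"), [m.group("rest")]
        elif label is not None:
            buffer.append(line)
        # Lines before the first speaker marker (front matter) are skipped.
    if label is not None:
        speeches.append(Speech(label, "".join(buffer)))
    return speeches


def match_speaker(label: str, directory: dict) -> Optional[dict]:
    """Return speaker information whose name appears in the printed label."""
    for name, info in directory.items():
        if name in label:
            return info
    return None
```

With OCR text, exact name matching like this would miss labels containing misrecognized characters, so a practical matcher would likely need some tolerance for errors.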

● Features of Released Full-text Data

To make the postwar text data easier to read, katakana was converted to hiragana and old-style characters were converted to JIS Level 1 and Level 2 kanji. The prewar and wartime text data released this time, however, was converted directly from the image data, so katakana remains katakana and the data contains kanji outside the JIS Level 1 and Level 2 ranges.
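For readers who want to apply a similar katakana-to-hiragana conversion to downloaded text themselves, a minimal sketch follows. It relies only on the fact that the Unicode katakana letters U+30A1 to U+30F6 sit exactly 0x60 above their hiragana counterparts; the function name is an assumption of this example, and converting old-style kanji would additionally need a mapping table that is not shown here.

```python
def katakana_to_hiragana(text: str) -> str:
    """Replace katakana letters with the corresponding hiragana letters."""
    converted = []
    for ch in text:
        code = ord(ch)
        if 0x30A1 <= code <= 0x30F6:
            # Katakana letters are offset from hiragana by 0x60 in Unicode.
            converted.append(chr(code - 0x60))
        else:
            # Kanji, punctuation, and the long-vowel mark are left as they are.
            converted.append(ch)
    return "".join(converted)
```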

Even so, in the Database System for the Minutes of the Imperial Diet, users can search without distinguishing between katakana and hiragana or between variant forms of a character, by leaving the “Exact Search” box on the advanced search screen unchecked. Unfortunately, in this round of text conversion, the old-style forms of characters such as “教” and “清,” along with other variant characters outside the JIS Level 1 and Level 2 ranges, could not be read and are often replaced with “〓” (geta), which indicates an unrecognizable character. Users may therefore be able to find the information they are looking for by substituting “〓” for such characters in the search term (for example, searching “〓育” for “教育”).
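The same substitution idea can be applied when searching downloaded text locally. The sketch below treats each “〓” in the text as standing for any single character of the query; it is an illustration only and does not describe how the database's own search works. The function name `find_with_geta` is hypothetical.

```python
GETA = "〓"


def find_with_geta(query: str, text: str) -> list:
    """Return start positions where query matches text, letting 〓 match anything."""
    hits = []
    n = len(query)
    if n == 0:
        return hits
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        pairs = list(zip(query, window))
        # Every position must match literally or be an unrecognized character.
        fits = all(t == q or t == GETA for q, t in pairs)
        # Require at least one literal match so a run of 〓 is not a hit.
        anchored = any(t == q for q, t in pairs)
        if fits and anchored:
            hits.append(start)
    return hits
```

For example, searching for “教育” this way also finds passages where the OCR produced “〓育”, at the cost of some false positives such as other two-character words ending in “育”.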

● Conclusion

Two months have passed since the release of the prewar and wartime full text data, and responses in posts on X (formerly Twitter) have been positive. We were able to release the data, even in incomplete form, thanks to the convenience of AI and to society's growing understanding of its limitations. One such limitation is that AI-OCR sometimes infers plausible Japanese and generates words on its own, so that modern words that did not exist in the Meiji and Taisho periods are mixed into the text. Such cases can be found in the newly released prewar and wartime text data.

We have set up a dedicated form to receive corrections for any errors found in the full text data. Eliminating every error, including typographical errors, omissions, and “〓”, is a very time-consuming and laborious task, like counting grains of sand. We will continue to improve the data and bring it gradually closer to its ideal form.

Written by
Parliamentary Documents and Official Publications Division,
Research and Legislative Reference Bureau
and
Research and Development for Next-Generation Systems Office,
Digital Information Planning Division,
Digital Information Department

Translated by Okada Aya

*References are not translated and remain in Japanese.

Ref:
“帝国議会会議録検索システムで全期間の本文テキストデータが利用できるようになりました(付・プレスリリース)”. NDL. 2024-08-23.
https://www.ndl.go.jp/jp/news/fy2024/240823_01.html
帝国議会会議録検索システム.
https://teikokugikai-i.ndl.go.jp/
“1 令和3年度デジタル化資料のOCRテキスト化”. NDL Lab.
https://lab.ndl.go.jp/data_set/ocr/r3_text/