Data Juice Story #3: National Library’s Search Index Goes on a Data Diet


The following is an excerpt from the book "Data Juice: 101 Real-World Stories of How Organizations Are Squeezing Value From Available Data Assets."

The National Diet Library (NDL) in Tokyo serves as a repository of knowledge, language, and culture, collecting and preserving traditional books and paper materials as well as digital information from Japan and countries around the world. To make this vast body of information accessible to everyone, anywhere and anytime, the NDL embarked on the development of a digital archive. Development of NDL Search began with the open-source framework Apache Hadoop, used to speed up full-text search indexing and automatic bibliographic grouping. The system ran on more than 30 Hadoop nodes and processed around 5TB of data, equivalent to tens of millions of items.
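As a rough illustration of the indexing idea (not the NDL's actual implementation), the sketch below builds a tiny inverted index in Python: each term maps to the set of documents that contain it, which is the core structure a Hadoop job would build in parallel across nodes. The document IDs and text are invented placeholders.

```python
from collections import defaultdict

# Toy corpus standing in for catalog records; IDs and text are invented.
documents = {
    "doc-001": "national diet library digital archive",
    "doc-002": "digital collections of books and periodicals",
    "doc-003": "library catalog and bibliographic records",
}

# Build the inverted index: term -> set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        index[term].add(doc_id)

def search(term):
    """Return the IDs of documents containing the given term."""
    return sorted(index.get(term, set()))

print(search("digital"))   # ['doc-001', 'doc-002']
print(search("library"))   # ['doc-001', 'doc-003']
```

At NDL scale, the same map-and-merge step runs as a distributed job over tens of millions of items rather than an in-memory dictionary, but the lookup principle is identical.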

By leveraging online tools and technology, the NDL was able to capture and manage a huge volume of data, create a search index covering all of its documents, and cut the time spent manually searching for materials and information. The benefits of this digital archive are numerous: library users and readers can access information more efficiently, and existing information can be reused to create new knowledge and build a usable knowledge infrastructure.

Creating a massive online volume of data is laudable, but the greater value lies in the ability to distill large data sets down to their meaningful essence. Any organization with large data sets should expand access using alternative data structures such as graphs to make exploration more fluid. Large data sets also lend themselves to machine learning (e.g., clustering, classification) that guides users to the most significant pieces of information more readily. Furthermore, organizations should capture usage metadata so that, over time, they can highlight the paths through the data that other users take most often. Finally, governance processes and standards will be important to maintaining and increasing the value of the data and insights over time.
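To make the clustering suggestion concrete, here is a minimal sketch using scikit-learn (an assumed dependency, not anything the NDL is known to use) to group a handful of invented catalog titles by TF-IDF similarity; a real system would apply the same idea to millions of bibliographic records and surface each cluster as "related material."

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented catalog titles standing in for bibliographic records.
titles = [
    "Introduction to Japanese history",
    "A short history of Japan",
    "Machine learning with Python",
    "Deep learning and neural networks",
    "Edo period art and culture",
    "Python programming for data analysis",
]

# Represent each title as a TF-IDF vector, then cluster similar titles.
vectors = TfidfVectorizer().fit_transform(titles)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Print each title with its assigned cluster label.
for title, label in zip(titles, labels):
    print(label, title)
```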

The NDL's digital archive serves as an excellent example of the power of data and analytics in managing large volumes of information. By leveraging open-source software and online tools, the NDL was able to create a search index that reduced manual searching and increased the efficiency of access to information. The benefits of this digital archive go beyond efficiency, as it enables the creation of new knowledge by reusing existing information and building a usable knowledge infrastructure.

For organizations with large data sets, putting a massive volume of data online is just the first step. They must also leverage alternative data structures and machine learning, and capture usage metadata, to extract meaningful insights from the data. Finally, governance processes and standards are crucial to maintaining and increasing the value of the data and insights over time. By following these steps, organizations can create a robust digital archive that not only streamlines access to information but also creates new knowledge and value.
