News Article | December 23, 2015
Many cultural institutions have accelerated the development of their digital collections and data sets by allowing citizen volunteers to help with the millions of crucial tasks that archivists, scientists, librarians and curators face. One of the ways institutions are addressing these challenges is through crowdsourcing. In this post, I’ll look at a few sample crowdsourcing projects from libraries and archives in the U.S. and around the world. This is strictly a general overview. For more detailed information, follow the linked examples or search online for crowdsourcing platforms, tools or infrastructures. In general, volunteers help with:
The Library of Congress uses public input for its Flickr project. Visitors analyze and comment on the images in the Library’s general Flickr collection of over 20,000 images and the Library’s Flickr “Civil War Faces” collection. “We make catalog corrections and enhancements based on comments that users contribute,” said Phil Michel, digital conversion coordinator at the Library.
In another type of image analysis, Cancer Research UK’s Cellslider project invites volunteers to analyze and categorize cancer cell cores. Volunteers are not required to have a background in biology or medicine for the simple tasks. They are shown what visual elements to look for and instructed on how to categorize what they see on the Web page. Cancer Research UK states on its Web site that, as of the original publication of this story, 2,571,751 images have been analyzed.
Both of the examples above use descriptive metadata or tagging, which helps make the images more findable by means of the specific keywords associated with, and mapped to, the images. The British National Archives runs a project, titled “Operation War Diary,” in which volunteers help tag and categorize diaries of World War I British soldiers.
The tags come from a controlled vocabulary list, a menu from which volunteers select keywords, which helps avoid the typographical variations and errors that may occur when a crowd of individuals freely type in their own text.
The New York Public Library’s “Community Oral History Project” makes oral history videos searchable by means of topic markers tagged into the slider bar by volunteers; the tags map to time codes in the video. So, for example, instead of sitting through a one-hour interview to find a specific topic, you can click on the tag, as you would select from a menu, and jump to that tagged topic in the video.
The National Archives and Records Administration offers a range of crowdsourcing projects on its Citizen Archivist Dashboard. Volunteers can tag records and subtitle videos to be used for closed captions; they can even translate non-English videos and subtitle them in English. One NARA project enables volunteers to transcribe old handwritten ships’ logs that, among other things, contain weather information for each daily entry. Such historic weather data is an invaluable addition to the growing body of data in climate-change research.
Transcription is one of the most in-demand crowdsourcing tasks. In the Smithsonian’s Transcription Center, volunteers can select transcription projects from at least 10 of the Smithsonian’s 19 museums and archives. The source material consists of handwritten field notes, diaries, botanical specimen sheets, sketches with handwritten notations and more. Transcribers read the handwriting and type into the Web page what they think the handwriting says. The Smithsonian staff then runs the data through a quality control process before finally accepting it.
In all, the process comprises three steps:
Notable transcription projects from other institutions are the British Library’s Card Catalogue project, Europeana’s World War I documents, the Massachusetts Historical Society’s “The Diaries of John Quincy Adams,” the University of Delaware’s “Colored Conventions,” the University of Iowa’s “DIY History,” and the Australian Museum’s Atlas of Living Australia.
Optical Character Recognition is the process of taking text that has been scanned into solid images (sort of a photograph of text) and machine-transforming that text image into text characters and words that can be searched. The process often generates incomplete or mangled text. OCR is often a “best guess” by the software and hardware. Institutions ask for help comparing the source text image with its OCR text-character results and hand-correcting the mistakes. Newspapers make up much of the source material. The Library of Virginia, the Cambridge Public Library, and the California Digital Newspaper Collection are a sampling of OCR-correction sites. Examples outside of the U.S. include the National Library of Australia and the National Library of Finland.
The New York Public Library was featured in the news a few years ago for the overwhelming number of people who volunteered to help with its “What’s on the Menu” crowdsourcing transcription project, where the NYPL asked volunteers to review a collection of scanned historic menus and type the menu contents into a browser form. NYPL Labs has gotten even more creative with map-oriented projects. With “Building Inspector” (whose peppy motto is, “Kill time. Make history.”), it reaches out to citizen cartographers to review scans of very old insurance maps and identify each building, lot by lot and block by block, by its construction material, its address and its spatial footprint; in an OCR-like twist, volunteers are also asked to note the name of the then-existent business that is handwritten on the old city map (e.g. MacNeil’s Blacksmith, The Derby Emporium). Given the population density of New York, and the propensity of most of its citizens to walk almost everywhere, there’s a potential for millions of eyes to look for this information in their daily environment, and go home and record it in the NYPL databases.
Volunteers can also use the NYPL Map Warper to rectify the alignment differences between contemporary maps and digitized historic maps. The British Library has a similar map-rectification crowdsourcing project called Georeferencer. Volunteers are asked to rectify maps scanned from 17th-, 18th- and 19th-century European books. In the course of the project, maps get geospatially enabled and become accessible and searchable through Old Maps Online.
Citizen Science projects range from the cellular level to the astronomical level. The Audubon Society’s Christmas Bird Count asks volunteers to go outside and report on what birds they see. The data goes toward tracking the migratory patterns of bird species. Geo-Wiki is an international platform that crowdsources monitoring of the earth’s environment. Volunteers give feedback about spatial information overlaid on satellite imagery or they can contribute new data.
Gamification makes a game out of potentially tedious tasks. Malariaspot, from the Universidad Politécnica de Madrid, makes a game of identifying the parasites that lead to malaria. Their Web site states, “The analysis of all the games played will allow us to learn (a) how fast and accurate is the parasite counting of non-expert microscopy players, (b) how to combine the analysis of different players to obtain accurate results as good as the ones provided by expert microscopists.” Carnegie Mellon and Stanford collaboratively developed EteRNA, a game in which users solve puzzles to design RNA sequences that fold up into target shapes and contribute to a large-scale library of synthetic RNA designs.
MIT’s “Eyewire” uses gamification to get players to help map the brain. MIT’s “NanoDoc” enables game players to design new nanoparticle strategies for the treatment of cancer. The University of Washington’s Center for Game Science offers “Nanocrafter,” a synthetic biology game that enables players to use pieces of DNA to create new inventions. “Purposeful Gaming,” from the Biodiversity Heritage Library, is a gamified method of cleaning up sloppy OCR. Harvard uses the data from its “Test My Brain” game to test scientific theories about the way the brain works.
Crowdsourcing enables institutions to tap vast resources of volunteer labor and to gather and process information faster than ever, despite the daunting volume of raw data and the limitations of in-house resources. Sometimes the volunteers’ work goes directly into a relational database that maps to target digital objects, and sometimes the work is held until a human can review it and accept or reject it. The process requires institutions to trust “outsiders”: average people, citizen archivists, historians, hobbyists. If a project is well-structured and the user instructions are clear and simple, there is little reason for institutions not to ask the general public for help. It’s a collaborative partnership that benefits everyone.
This blog was originally published on The Signal.
McKinney P.,National Library of New Zealand Te Puna Matauranga O Aotearoa |
Knight S.,National Library of New Zealand Te Puna Matauranga O Aotearoa |
Gattuso J.,National Library of New Zealand Te Puna Matauranga O Aotearoa |
Pearson D.,National Library of Australia |
And 6 more authors.
New Review of Information Networking | Year: 2014
In this article we introduce the work of the National and State Libraries Australasia Digital Preservation Technical Registry project. Any technical registry model must allow digital preservation analysts to understand the technical form of the content they are tasked with preserving, understand the capabilities they have in relation to that content, and reflect on the community position in relation to those capabilities. We believe the solution outlined here is well placed to deliver the information required to answer these questions, and in a manner that makes it easy to understand, reference and augment. The primary focus of this article is to describe the format model, which is the most radical part of the Digital Preservation Technical Registry. The flexibility the model provides delivers on all of the requirements outlined by the NSLA partners and project team members; this includes the ability to reference many layers constituting a format, including relationships between specifications and implementations of real-world formats. We seek input from members of the community on the model and suggestions for use cases and requirements that we have not envisaged. © 2014 Crown Copyright.
del Pozo N.,National Library of Australia |
Long A.S.,National Library of Australia |
Pearson D.,National Library of Australia
Library Hi Tech | Year: 2010
Purpose: The aim of this paper is to assist both the National Library of Australia and other institutions to think about digital objects in ways that will help to identify which preservation actions are most appropriate for a particular circumstance. It seeks to examine the basic nature of digital objects and how users interact with those objects. Design/methodology/approach: This article brings together and clarifies a number of key digital preservation theories. It proposes the concept of preservation intent: a clear articulation of a commitment to preserve an object, the specific elements of the object that should be preserved, and a clear time line for the duration of preservation. It investigates these concepts through simple and practical examples. Findings: The paper presents what the authors believe are some of the essential ideas and thinking about digital preservation. Practical implications: The paper will prove useful in clarifying some of the terminology and concepts to those who are in or are yet to be initiated into the "order". Originality/value: The paper brings together and clarifies some of the core ideas and theories in digital preservation, in order to better facilitate the minimisation of change in the digital objects stored by the National Library of Australia. © Emerald Group Publishing Limited.
News Article | July 21, 2011
The Library of Congress says it was not responsible for categorizing a WikiLeaks-related book as "extremist" and that it has decided to remove that label. A spokesman for the library told CNET today that it adopted that classification in its catalog automatically after another major library system--apparently the National Library of Australia--had applied it to a recent book about the document-leaking Web site. Librarians call this practice "copy cataloging." "Copy-cataloging was the method used for the book in question," Library of Congress spokesman John Sayers said. "With the huge quantity of material it catalogs each year--more than 365,000 books in fiscal 2010--the Library of Congress cannot review each record in advance of adding it to the catalog." About 18 percent of the Library of Congress' listings are copy-cataloged from other libraries, he said. Sayers said only one book, "Inside WikiLeaks," by estranged WikiLeaks spokesman Daniel Domscheit-Berg, had been incorrectly listed under the keywords "extremist Web sites." The same query now returns no books about WikiLeaks. The lone result is a 2008 book by an Italian professor titled "Hate on the Net: Extremist Sites, Neo-fascism On-line, Electronic Jihad." The controversy erupted in the last week after sharp-eyed Twitter users noticed that the National Library of Australia applied the "extremist Web sites" label to Domscheit-Berg's book, Julian Assange's as-yet unpublished autobiography, and another WikiLeaks-related book. After sufficient public pressure, the Australian library abandoned that characterization, as did the Library of Congress. Here's more from what Sayers, the spokesman, told CNET:
To ensure that the Library of Congress Online Catalog is objective and nonjudgmental, all records in it are completed by staff at the Library of Congress or other libraries, not by publishers. Like all libraries, the Library of Congress benefits from sharing catalog records that are prepared in other libraries throughout the world... Both the Library of Congress and other libraries assign subject access points ("subject headings") from the Library of Congress Subject Headings, a database of more than 400,000 standardized headings that are based on "literary warrant," or terms actually occurring in materials received for library collections. We have mechanisms in place for post-load review, including daily reports from other libraries and library consortia, and we devote several professional staff to correcting catalog records. In this case, a conversation thread on Twitter alerted the Library that the record for "Inside Wikileaks" included an access point that the Library of Congress would not have used. Since the Twitter conversation brought the question to the attention of the Library's cataloging quality assurance staff, they corrected the catalog record immediately.
The irony is that the Library of Congress' congressional overseers might not be so sensitive about the language used to describe a Web site that leaked hundreds of thousands of secret government documents. Rep. Peter King (R-N.Y.), the head of the House Intelligence Committee, asked the Justice Department to charge WikiLeaks editor Julian Assange under the Espionage Act, as did Senate Intelligence Committee chiefs Dianne Feinstein (D-Calif.) and Kit Bond (R-Mo.). Senate Homeland Security Chairman Joseph Lieberman (I-Conn.) publicly wondered why an indictment and extradition "hasn't happened yet." King went beyond calling WikiLeaks an "extremist" organization: he wrote a letter to Secretary of State Hillary Clinton saying that "WikiLeaks appears to meet the legal criteria" of a U.S.-designated terrorist organization. He added: "WikiLeaks presents a clear and present danger to the national security of the United States."
The Pentagon's criminal investigation of WikiLeaks--especially Assange, its frontman and spokesman--began last summer after the Web site published thousands of military dispatches from Afghanistan. The military probe continued with the distribution of confidential Iraq and State Department documents, and a federal grand jury is meeting in Alexandria, Va. The Library of Congress blocked access to WikiLeaks from its computers in December 2010, saying "applicable law obligates federal agencies to protect classified information," even though other federal agencies did not.
News Article | November 10, 2000
AUSTRALIA (ZDNet Australia) - News of the attack was announced on the Web site www.attrition.org, which lists Web server defacement statistics. "Fortunately they haven't done any damage, the Web site is still fully accessible through the main address," National Library of Australia director of Web services, Judith Pearce, said. However, Pearce told ZDNet that the library needs to plug the hole as soon as possible as "any server that can be hacked into puts us at risk." The National Library of Australia's main Web site was left undamaged; however, one server, a gateway to a Web-based application, was defaced. The content of the database application server was replaced with the text "H4x0r3d by: thepr0digy". Company sources were unaware of the meaning of this piece of text. Pearce told ZDNet that after the National Library in France was defaced earlier this year, the Australian organisation took a close look at its own service to see if the site was at risk of being hacked. "We were relieved to find the hacker was targeting NT servers," Pearce said. However, although the main part of the Web site is built on a Unix platform, the library overlooked the server that provides access to the application through the Internet, which runs on Windows NT. "The server was susceptible to attack as it sits on the Windows NT platform. The main Web site runs on a Unix system, which is harder to hack in to," Pearce said. According to Pearce, the security of the Windows NT server is "certainly a downfall of the platform." "We're aware of the security weakness of the NT platform. We have to watch for patches making sure they are being applied," Pearce said. Either the hacker couldn't get into the Unix platform that the main Web site sits on or, "[the hacker] wasn't interested in doing significant damage," Pearce said.