Big news for researchers: DocumentCloud, a consortium and repository of primary source documents, has announced a partnership with OpenCalais, Thomson Reuters' high-volume text identification and tagging service.
DocumentCloud makes original source documents from across the web accessible and easy to share. Users can search thousands of documents by nearly any criterion (date, person, location, etc.), and the service facilitates 'document dives,' the collaborative search of a large volume of files by a group of users. Over twenty high-profile news and investigative journalism organizations (listed below) have joined the project, promising to contribute primary source materials so that the public can explore connections across news reporters' source documents.
Aron Pilhofer, a DocumentCloud co-founder, discussed some of the project's collaborative, open-source foundations with Nieman Journalism Lab, stressing the integral role of the participating publishers and organizations as testers and sources of feedback. Their involvement is meant to ensure that news publishers can take full advantage of the service. Co-founder Eric Umansky, meanwhile, assures contributing journalists that they will enjoy a 'period of exclusivity' on the database to protect their work, in exchange for making their sources public after publication. Furthermore, once the documents are made public, organizations will be able to share them while keeping the documents, and their readers, on their own sites. The result is like a card catalog for primary source documents kept all over the globe, accessed and shared through the web.
For a project as ambitious as this, there are considerable technical issues regarding the "ability to ingest, to process, database, index and then republish metadata for what could eventually...amount to tens of thousands, hundreds of thousands, millions, possibly, of pages of printed or textual data," hence the natural partnership with Thomson Reuters' new service. OpenCalais uses natural language processing (NLP) to "read" a document, instantly identifying and tagging the relevant people, places, companies, facts and events for improved search and navigation. This tagging will make it easy for users to explore connections between newsmakers, corporations and events, both across individual documents and across the full collection of source materials.
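As a rough illustration of how entity tags can make connections across documents searchable, the sketch below builds an index from tagged entities to the documents that mention them, then lists the entities each pair of documents shares. It does not call OpenCalais and is not how DocumentCloud is implemented; the documents, entities and function names are all invented stand-ins for whatever a tagging service would return.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical tagger output: document id -> set of (entity type, name) tags.
# Every document and entity here is made up for illustration.
TAGGED_DOCS = {
    "memo-001": {("Person", "Jane Doe"), ("Company", "Acme Corp")},
    "filing-002": {("Company", "Acme Corp"), ("Place", "Springfield")},
    "letter-003": {("Person", "Jane Doe"), ("Place", "Springfield")},
}


def build_entity_index(tagged_docs):
    """Map each entity to the sorted list of document ids that mention it."""
    index = defaultdict(set)
    for doc_id, entities in tagged_docs.items():
        for entity in entities:
            index[entity].add(doc_id)
    return {entity: sorted(doc_ids) for entity, doc_ids in index.items()}


def shared_entities(tagged_docs):
    """For each pair of documents, list the entities they have in common."""
    links = {}
    for doc_a, doc_b in combinations(sorted(tagged_docs), 2):
        common = tagged_docs[doc_a] & tagged_docs[doc_b]
        if common:
            links[(doc_a, doc_b)] = sorted(common)
    return links


if __name__ == "__main__":
    index = build_entity_index(TAGGED_DOCS)
    # Which documents mention a given company?
    print(index[("Company", "Acme Corp")])
    # Which documents are linked by a shared person, place or company?
    for pair, entities in shared_entities(TAGGED_DOCS).items():
        print(pair, entities)
```

Looking up an entity in the index answers "which documents mention this person or company?", while the pairwise view surfaces links between documents that a reader might not spot by reading them one at a time.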
"By using OpenCalais to tag documents, DocumentCloud will enable journalists, researchers and scholars to find otherwise hidden connections between people, companies and concepts across a body of documents." said Barak Pridor, CEO, ClearForest, the Thomson Reuters company that produces the OpenCalais service. "DocumentCloud will also enable the public to get the 'back story' behind the news and see for themselves the information that reporters use to get at the facts. It's an incredibly powerful tool for democracy and an important step in the ongoing evolution of citizen journalism."
This is undoubtedly a big step forward for cooperation among publishers and news sources in the digital realm. It allows for better research and fact-checking, not only among journalists but also by the general public, and thereby raises the standards of accountability and truthful reporting.
DocumentCloud is expected to launch privately at the end of this year, with a testing period open to the contributing organizations: ACLU National Security Project, Arizona Republic, The Atlantic, Center for Democracy and Technology / OpenCRS, Centre for Investigative Journalism (City University London), Center for Investigative Reporting / California Watch, Center for Public Integrity, Chicago Tribune, Dallas Morning News, Gotham Gazette, The Investigative Reporting Workshop at American University, The National Security Archive, The New York Times, The New Yorker, MinnPost, MSNBC, Mother Jones, PBS NewsHour, ProPublica, St. Petersburg Times, Sunlight Foundation, Talking Points Memo, Voice of San Diego, Washington Post, and WNYC.