Google News Archive and the Fragility of Digital Resources

In my last blog post I claimed that Google News Archive “is the gold standard for digital newspaper research in Spokane.” It is the single most valuable resource I have used in my research on Spokane, but unfortunately, as of sometime last week, it is broken. The keyword search function that used to pull relevant articles from the depths of the historical newspaper abyss can no longer save me from the microfilm reader. For example, when I was researching an infamous safe cracksman named Clarence Miles, a simple keyword search for “Clarence Miles Spokane” used to turn up an abundance of articles. Now, just two articles surface. Fortunately, the papers are still browseable on Google News Archive, so if you have a link to a specific Spokane newspaper article on the archive, it should still function.

I was frustrated and saddened when I learned this, and I was not alone. Historian Dr. Larry Cebula reacted to the news by asking Google’s CEO, Sundar Pichai, on Twitter to help solve the problem.

Although this is a major setback, we will certainly survive. A researcher in Wisconsin said the loss of the Google News Archive search function “was like going from playing chess on the internet to playing chess via the U.S. Postal Service.” It will slow us down, but it will not stop us. This is also a valuable learning experience. We, as digital historians, must remember how fragile our digital resources can be. The Utah Digital Newspaper program, one of the most successful newspaper digitization projects in the country, has relied heavily on grants and donations to do its work. In an article about the program, John Herbert and Karen Estlund explain that the UDN has raised “an impressive $2.5 million, but our needs continue to exceed our resources.” This shows just how difficult and expensive it is to digitize newspapers and, most importantly, to make them searchable.

Making newspapers searchable requires good scans of quality newspapers and OCR technology to read the papers. As we know, OCR is not a perfect technology. “It averages 70 percent, according to our own survey,” explain Herbert and Estlund. Furthermore, UDN has “two people separately transcribe the masthead and article headlines and subheadings” to ensure that they are as accurate as possible for good searchability. This is wonderful quality control, but not all institutions have the extensive resources of UDN. So how can smaller institutions, or even individual scholars, complete laborious tasks that do not require a trained expert? The answer is crowdsourcing.


According to the Digital Humanities Network at the University of Cambridge, crowdsourcing is a method of “creating or mobilising online communities of volunteers to assist them in their research.” This “assistance” can take many forms, but volunteers are mostly used for tasks that “computers cannot yet do effectively.” The Washington State Digital Archives has a crowdsourcing project to help the archives index its records. The program is called Scribe, and it has been successful. You can sign up for it here and get started with some indexing.

Unfortunately, not all crowdsourcing projects are successful. Historian Dr. Larry Cebula points out some of the flaws in Flickr’s crowdsourcing effort. He analyzes one photograph in particular that has a lot of metadata added by users, but none of it “contains useful historical information to give context or help us understand the photograph.”

It looks like “the crowd” does not always do tasks the way we had hoped. Nonetheless, we should not be discouraged from initiating well-thought-out crowdsourcing projects in the future. They save time, energy, and resources when they are executed effectively.


One Response to Google News Archive and the Fragility of Digital Resources

  1. I was pretty burned by the Google News changes, too. Good point about remembering how fragile our resources are, especially when those resources aren’t really owned by the public. Google Books might be a good precedent: it was enormously valuable to academics for a short while, then turned into something more valuable to Google itself. Hard not to be cynical about the “big data” companies getting involved in research resources.
