80 terabytes of archived web crawl data available for research
Internet Archive crawls and saves web pages and makes them available for viewing through the Wayback Machine because we believe in the importance of archiving digital artifacts for future generations to learn from. In the process, of course, we accumulate a lot of data.
We are interested in exploring how others might be able to interact with or learn from this content if we make it available in bulk. To that end, we would like to experiment with offering access to one of our crawls from 2011 with about 80 terabytes of WARC files containing captures of about 2.7 billion URIs. The files contain text content and any media that we were able to capture, including images, flash, videos, etc.
The Living Voters Guide, recent winner of the Evergreen Apps Challenge, has released its 2012 update that allows Washington state voters to learn about different ballot measures, compare the pros and cons of each and sound off with fellow contributors.
And if there’s a particular fact in question, they can call on the expertise of a librarian.
Digging through the clutter of the online world: A Q&A with TED Books author Jim Hornthal
The latest TED Book deals with an issue we all can relate to: the difficulty of finding answers to complex questions on the Internet when a simple search can lead you down a rabbit hole of impersonal data. In A Haystack Full of Needles: Cutting Through the Clutter of the Online World to Find a Place, Partner or President, Jim Hornthal explores groundbreaking new approaches to discovering the useful insights buried deep within our complex and noisy datasphere. Hornthal, a venture capitalist in Silicon Valley, introduces us to innovators who are pushing the edges of data science and data visualization by applying the principles of pattern recognition to isolate relevant signals in the noise. Their efforts will have enormous implications for the way we practice medicine, discover music and movies, and even identify our romantic partners.
Curious to hear more about the ideas he explores in his e-book, the TED Blog asked Hornthal a few questions over email.
I've heard people pine for a Spotify for ebooks. This isn’t as kooky as when they pine for Netflix or Blockbuster for physical books. The thing is… Spotify is pretty good for consumers, but sucks for creators. Well… it’s not as rosy as some people want to believe. (I really don’t get why so many librarians absolutely loathe the publishing industry, but give the music industry a pass.)
Our online history is disappearing at an astonishing rate, creating a black hole for future historians.
"Our descendants will surely be grateful for a record that reflects more than marketable data and consumer preferences. As to preservation, though, the problem may be intractable. Between private profits, the privacy of personal histories and our hunger for perpetual renewal, “history” itself may be a concept ripe for rethinking: not so much the objective sifting of sources as a living thing, perpetually remade across networks for which there’s no time but the present."
NPR piece about a professor trying to edit an entry on the 1886 Haymarket Square riot in Chicago.
Here is the Wikipedia entry.
Americans are paying high prices for poor quality Internet speeds — speeds that are now slower than in other countries, according to author David Cay Johnston. He says the U.S. ranks 29th in speed worldwide.
"We're way behind countries like Lithuania, Ukraine and Moldavia. Per bit of information moved, we pay 38 times what the Japanese pay," Johnston tells Fresh Air's Dave Davies. "If you buy one of these triple-play packages that are heavily advertised — where you get Internet, telephone and cable TV together — typically you'll pay what I pay, about $160 a month including fees. The same service in France is $38 a month."
In his new book, The Fine Print: How Big Companies Use "Plain English" to Rob You Blind, Johnston examines the fees that companies, such as cellphone and cable providers, have added over the years and that have made bills incrementally larger.
How hard is it to prove online that you are who you say you are? Author Philip Roth had to publish a letter in The New Yorker to satisfy the editors of Wikipedia.
Wikipedia succeeds by "not doing the things that nobody ever thought of not doing". Specifically, Wikipedia does not verify the identity or credentials of any of its editors. This would be a transcendentally difficult task for a project that is open to any participant, because verifying the identity claims of random strangers sitting at distant keyboards is time-consuming and expensive. If each user has to be vetted and validated, it's not practical to admit anyone who wants to add a few words to a Wikipedia entry.