Exploring a ‘Deep Web’ That Google Can’t Grasp

One day last summer, Google’s search engine trundled quietly past a milestone. It added the one trillionth address to the list of Web pages it knows about. But as impossibly big as that number may seem, it represents only a fraction of the entire Web.

Beyond those trillion pages lies an even vaster Web of hidden data: financial information, shopping catalogs, flight schedules, medical research and all kinds of other material stored in databases that remain largely invisible to search engines.

The challenges that the major search engines face in penetrating this so-called Deep Web go a long way toward explaining why they still can’t provide satisfying answers to questions like “What’s the best fare from New York to London next Thursday?” or “When will the Yankees play the Red Sox this year?” The answers are readily available — if only the search engines knew how to find them.

Full article in the NYT

Comments


What Blocks More, Filters or the ALA

Libraries make the Internet available, and Google is likely the top search engine in most libraries. But Google, as powerful as it is, does not reach into the "Deep Web." The Deep Web is reachable through specialized web sites and databases, yet I'll dare say almost no library helps patrons reach it. I have never noticed the ALA even once provide such guidance to libraries (correct me if I'm wrong by providing a URL), though I know the ALA occasionally discusses the issue.

To me, this is a significant reason why the ALA's complaints about filters blocking a few web sites are beside the point. The ALA itself does little other than talk to help patrons get past the blocking of the seven-eighths of the entire Internet known as the Deep Web. Filters may block a few web sites, but by not providing guidance for navigating the Deep Web, which is navigable, the ALA effectively blocks seven times as many.

-=-=-=-
http://www.SafeLibraries.org
http://safelibraries.blogspot.com/

staggering

What does this even mean?

The Deep Web is accessible

The Deep Web is accessible via Librarians.

That's because

a large amount of the "Deep Web" consists of restricted websites: those you have to pay to access or need authorization to get into, most of which are not registered with any search engine.

Mostly it contains information that would not be considered "reference" worthy, either because the sources are not considered authentic or authoritative, or because they are "fee only" access.

Since the "Deep Web" (a better name is the "Invisible Web") is not indexed and not searchable, access to it is not really practical for public libraries, nor should libraries even consider providing such access.

The "Deep Web" consists of the following:

Deep Web resources may be classified into one or more of the following categories:

Dynamic content – dynamic pages which are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge.

Unlinked content – pages which are not linked to by other pages, which may prevent Web crawling programs from accessing the content. This content is referred to as pages without backlinks (or inlinks).

Private Web – sites that require registration and login (password-protected resources).

Contextual Web – pages with content varying for different access contexts (e.g., ranges of client IP addresses or previous navigation sequence).

Limited access content – sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard, CAPTCHAs, or pragma: no-cache/cache-control: no-cache HTTP headers), prohibiting search engines from browsing them and creating cached copies.

Scripted content – pages that are only accessible through links produced by JavaScript as well as content dynamically downloaded from Web servers via Flash or AJAX solutions.

Non-HTML/text content – textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines.

All of these create problems for the public, and even more for taxpayers, who would have to PAY to maintain the computers that access these sites, which are largely useless or of limited use.
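To illustrate one of the categories above, the Robots Exclusion Standard lets a site tell crawlers which paths to skip; a compliant search engine never indexes those pages, so their content stays in the Deep Web. The sketch below uses Python's standard-library `urllib.robotparser` with a hypothetical robots.txt (the `example.org` URLs and the `/database/` path are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules: all crawlers are told to skip /database/
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /database/",
])

# A compliant crawler checks before fetching; the database pages are off-limits
print(rp.can_fetch("Googlebot", "https://example.org/database/records?id=42"))  # False
print(rp.can_fetch("Googlebot", "https://example.org/index.html"))              # True
```

A crawler that honors these rules never sees the disallowed pages, which is one reason such content never appears in search results.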

It's navigable

But library funding rarely allows training on many of these sites, which basically require individualized instruction to use.
