Diving deep into the web: Glenbrook Networks

You think the Web is big? In truth, it's far bigger than it appears. The Web is made up of hundreds of billions of Web documents -- far more than the 8 billion to 20 billion claimed by Google or Yahoo.

But most of these Web pages are unreachable by search engines because they are stored in databases that Web crawlers cannot access. Glenbrook Networks has been working on reaching these documents with technology that crawls databases, automatically completes online forms, and extracts data. Here's our Merc story today (free registration). The company can do some interesting things, like this map of jobs in Silicon Valley.


Trackbacks

Links to blogs that reference this entry:

From: Software Only
Glenbrook Networks in the San Jose Mercury News - and Search Engine Watch
Excerpt: SiliconBeats Michael Bazeley featured Glenbrook Networks co-founders Julia and Edward Komissarchik, and the Glendor showcase, in a great piece about Deep Web search and information extraction. Michael summarized it quite well: Komissarchik and her fath...
Tracked: August 18, 2005 1:53 AM
From: Glendor.com Blog
Glenbrook Networks in the San Jose Mercury News
Excerpt: SiliconBeat’s Michael Bazeley featured Glenbrook Networks co-founders Julia and Edward Komissarchik, and the Glendor showcase, in a great piece about “Deep Web” search and information extraction. Michael summarized it quite well: ...
Tracked: August 18, 2005 1:56 AM
From: AI3 - Adaptive Information:::
Intellectual Honesty, Attribution, Historical Revisionism, and Truth: The ‘Deep Web’ Example
Excerpt: Last week I came across a reference from Search  Engine Watch – for which I have been a subscriber for many years and have been a speaker at their conferences — that TOTALLY FRIED me.  It’s related to a topic near...
Tracked: August 22, 2005 3:36 PM

Comments

Glenbrook does have some interesting technology, and there's definitely more stuff out there than even the very broad (but thin) crawls that Yahoo and Google do. Most folks aren't aware of what tools like these, or Kapow and Transparensee, can offer, in addition to other in-house proprietary technology. Still, having a particular vertical focus makes it easier to compete with the big guys.

FYI, Glenbrook et al. aren't the only folks doing a job map mashup:
http://tinyurl.com/de8kj

Look for more data visualization coming soon from several corners...

Dave McClure on August 18, 2005 3:05 AM

This is interesting but a bit vexing. The essence of a search engine is its generality. Limiting the 'showcase' to the Bay Area undermines the notion of generality. A more compelling showcase would have shown how the engine works with an arbitrary location. It is not that difficult to hand-code the solution for one limited market. What this showcase shows is just a concept.

Peter Rip on August 18, 2005 7:35 AM

Peter> Glenbrook can turn its web trawlers to any location or set of companies; at the end of the day it depends on the amount of gear thrown at the problem.
The reason for focusing on Bay Area companies was to get a meaningful set of results for the sort of tech jobs users of the showcase might look for, and deliver some value to them whilst putting our technology stack to use in a real context.

For me a concept prototype (like a concept car) is a one-time realization from the labs that will be used one day to build real products. This showcase is leveraging 4 years of R&D in the field of information extraction, and could easily grow to millions of jobs by scaling the back end infrastructure. And since the system is generic, i.e. it does not use a templating system to extract information from web sites, scalability is not an issue.

Feel free to get in touch to discuss further.

Jeff Clavier on August 18, 2005 12:33 PM

Matt: if you try to search with this engine for "CTO", it will return all "DireCTOr" entries as part of the results. This hardly counts as a high level of search sophistication; in fact, failing to distinguish substrings from whole words is an entry-level error, especially with such common job-related words. It is good to have a powerful crawler, but this alone does not deliver user value. The previous venture of this team failed in part due to the same lack of attention to detail; this time, they have got to make sure they have no blunders like this.

Yuri Ammosov on August 20, 2005 5:37 AM
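
The "CTO" versus "DireCTOr" problem described in the comment above is the difference between substring matching and whole-word matching. The sketch below illustrates it in Python; the function names and sample job titles are made up for the example, and nothing here is known about how the Glenbrook showcase actually matches queries.

```python
import re

def substring_match(query, title):
    # Naive approach: case-insensitive substring test.
    # "cto" is found inside "director", so Director titles match too.
    return query.lower() in title.lower()

def whole_word_match(query, title):
    # \b anchors the query at word boundaries, so "CTO" no longer
    # matches the letters in the middle of "Director".
    pattern = r"\b" + re.escape(query) + r"\b"
    return re.search(pattern, title, flags=re.IGNORECASE) is not None

# Hypothetical job titles, for illustration only.
titles = ["CTO", "Director of Engineering", "Acting CTO, Platform"]

substring_hits = [t for t in titles if substring_match("CTO", t)]
whole_word_hits = [t for t in titles if whole_word_match("CTO", t)]
```

Here `substring_hits` contains all three titles, while `whole_word_hits` leaves out "Director of Engineering". A real search engine would more likely tokenize titles at index time than run a regular expression per query, but the distinction is the same.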