This blog is the record of my efforts to design and implement a new search engine.

I will try to discuss and justify each of my design decisions.  Since I am new to this particular area many of my decisions will be naïve and faulty for one reason or another.  You, gentle reader, may comment at will.

Why do we use the net? 

Here are six reasons:

  1. To learn about something.  To educate ourselves generally about a topic;
  2. To get the answer to a question.  To find a specific answer;
  3. To find out what’s new.  To be introduced to a new topic or some new facts we didn’t know we wanted;
  4. To broaden our interests or to be entertained.  To find something different or an unusual opinion or an interesting fact;
  5. To select people or things we want.  To meet people and buy things;
  6. To advertise ourselves and our products.  To help people meet us and buy our stuff. 

Who is the competition?

Who are the current players in the search engine universe, and how do they fit into the taxonomy above?

| Enterprise | Strengths (reasons above) | Comments |
|---|---|---|
| Google, Yahoo!, Microsoft Live, … | 1 | Text Search Engines |
| Powerset, Carrot2, CognitionSearch, SWSE, Wikia, Blekko | 1,2? | 2nd generation text search |
| Wikipedia, WebMD | 1,2 | Organized information w/ human quality control |
| Google Earth, MapQuest | 1,2 | Cartography organized and delivered by the web |
| Assista, TrueKnowledge, ChaCha, … | 2 | ‘Ontologies’, some Q&A |
| Google News, Yahoo News | 3 | News Feed semantic clustering |
| Copernic, awasu | 3 | Passive search agents |
| Silobreaker | 3,4 | News Feed and Blog semantic clustering |
| Drudgereport, Junkscience | 4 | Selected from news feeds |
| Townhall.com, Huffingtonpost | 4 | News & comment |
| Truveo, Pixsy, Blinkx, Podscope | 4 | A/V content search |
| YouTube, Flickr, NetFlix | 4 | Video content providers |
| Hakia | 5 | Meet people w/ same questions |
| Shopping.com, ebay, overstock.com, buy.com, amazon.com | 5,6 | Online department stores and flea markets |
| Match.com, eHarmony, … | 5,6 | Personal ads |
| AdWords, adCenter, Yahoo! Search Marketing | 5,6 | Bringing buyers and sellers together… |
| Facebook, MySpace | 5,6 | Elaborate personal ads |

For completeness’ sake, here’s a list of search engines I cribbed from here. I have visited some of these sites, but I do not claim to have explored them all:
• A9
• AOL
• AURA!
• blinkx
• boing
• bookmach.com
• BOXXET
• ChaCha
• ClipBlast!
• Clusty
• collarity
• CometQ
• CONGOO
• decipho
• del.icio.us
• digg
• digg labs swarm
• Ditto
• dumbfind
• exalead
• factbites
• fazzle
• FEEDS|2.0
• Feedster
• FindSounds
• GIGABLAST
• girafa
• gnod
• GoDefy
• goshme
• GoYams
• grokker
• ICEROCKET
• ixquick
• KartOO
• last.fm
• Lexxealpha
• like
• LiveDeal
• liveplasma
• Local.com
• lurpo
• MetaGlossary
• mnemomap
• Mojeek
• Mooter
• MrSAPO
• MS. DEWEY
• nayio
• Octora
• OiHoi Search
• Pagebull
• PlanetSearch
• pluggd
• PODZINGER
• Previewseek
• pronto.com
• QTsearch
• Quintura
• Releton
• retrevo gamma
• riya
• ROLLYO
• SearchTheWeb2
• SEEQPOD
• sidekiq
• Simply Google
• Singing FISH
• Slideshow
• Slifter
• soople
• Speegle
• Sphider
• SPURL.net
• SRCHR
• SurfWax
• Swoogle
• TagJag!
• thefind.com
• Trexy
• turboscout
• UJIKO
• url.com
• VMGO.com
• Web 2.0
• Webaroo
• WEBBRAIN
• What to RENT?
• whonu?
• WIKIO
• WiseNut
• Yahoo! MINDSET
• yoono
• yoople
• yubnub
• YuFind
• ZABASEARCH
• zapmeta
• Zippy
• ZUULA

The point is that the market is already saturated with wannabes.  It is pretty clear that a new engine is going to have to be

  1. High Concept; and either
  2. High Technology because of a niche focus, or
  3. High Technology because of improved fundamental algorithms of some sort.

Just making a better general-purpose search engine will probably not cut the mustard, because you’d be lost in the crowd.  Consider this: there are many search engines out there that I’d consider better than Google in one domain or another, but I don’t use them because Google is just easier.  In the face of that simple fact, what could a successful strategy look like?

How to break through the logjam is the big question.

Some Observations

Observation 1: The more limited your domain, the more interesting the algorithms you can use.  A second-generation search engine should prove itself first on limited domains.  For example: news feeds, blog feeds, Stuff I’ve Seen, wikis, or the like.

Observation 2: Most people have learned to work with Google and phrase their probes as word-lists.  The computational model is easy to understand.  A successful replacement will also be easy to understand.
 
Observation 3: Nobody can interpret still photos or video.  The technology of still-image search is purely context-driven (anchor text, surrounding text, captions, tags, etc.).  The technology of the video search companies is essentially to use context and speech recognition to tag video streams.  Since speech recognition doesn’t work all that well, the technology doesn’t work all that well.  If you wanted to compete by doing Image Search, you would do it stepwise.  I bet if you could solve 1, 2 & 3 you’d have a business:

  1. Identify humans in images (the human-recognition problem);
  2. Recognize the same human from frame-to-frame in video (the continuity problem for humans);
  3. Identify a given person across different times and pictures (the person-recognition problem);
  4. Then branch out into identifying non-human items and certain recurring situations.

Observation 4: Second-generation search is starting, but it isn’t mature.  Here’s what I’d expect from a second-generation search engine:

  1. Search probes should be expanded where appropriate according to spelling, synonymy, semantic role, part-of-speech ambiguity, etc.
  2. Search results should be clustered where appropriate according to topic, source, level of detail, etc.
  3. Words in context should be disambiguated between senses and (possibly?) part of speech.
  4. A search-engine should be able to answer simple questions on its domain.
  5. People who don’t like reading bad English should be able to filter out the illiterates.
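
To make item 1 concrete, here is a minimal sketch of probe expansion in Python.  The lookup tables `SPELLING` and `SYNONYMS` and the function `expand_probe` are hypothetical names invented for this illustration; a real engine would presumably derive such tables from a corpus or a thesaurus rather than hard-code them, and would also have to handle semantic role and part-of-speech ambiguity, which this sketch ignores.

```python
from itertools import product

# Hypothetical, hand-made tables used only for illustration.
SPELLING = {"serch": "search", "engin": "engine"}
SYNONYMS = {
    "car": ["automobile"],
    "buy": ["purchase"],
}


def expand_probe(probe):
    """Return every variant of a word-list probe after spelling
    correction and synonym substitution."""
    alternatives = []
    for word in probe.lower().split():
        # Correct known misspellings first, then look up synonyms.
        word = SPELLING.get(word, word)
        alternatives.append([word] + SYNONYMS.get(word, []))
    # One variant per combination of alternatives, original words first.
    return [" ".join(choice) for choice in product(*alternatives)]


print(expand_probe("buy car"))
# ['buy car', 'buy automobile', 'purchase car', 'purchase automobile']
```

Note that the number of variants multiplies with every expandable word, which is one reason expansion should happen only “where appropriate.”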

Observation 5: Ranking search results by popularity is a cop-out.

What new ways could we use the net?

  1. More passively.  Passive search agents haven’t really caught on.  (These used to be known generally as avatars.  Now an avatar is more specifically an iconic presence.)  Is it a quality problem?  A user-interface problem?  This bears some investigation.
  2. For automatic generation of reports on subjects with embedded references and degree-of-consensus estimation for sub-topics.
  3. For automatic identification of people with whom we share interests based on browsing patterns, etc.
  4. For real-time sharing of our personal experiences: ‘telepresence.’  We should be able to log on to other people’s lives and ride along with them as they do interesting things.  We could have a new class of ‘experiencer superstars’ who would go out and experience things for us.  Think text-messaging on steroids…

So what is the plan?

  1. Acquire a good text corpus to form a statistical basis for linguistic computations.
  2. Secure methods for acquiring the data streams forming the search domain(s).
  3. Define the database and algorithms for summarizing the search domain(s).
  4. Define the retrieval algorithms for retrieving and presenting elements from the search domain(s).
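
As a rough illustration of how steps 1, 3 and 4 might fit together, here is a minimal Python sketch: an inverted index that records term counts per document, plus a retrieval routine that ranks by TF-IDF.  TF-IDF is a stand-in I chose for the example, not a decision about the final algorithms, and the names `TinyIndex` and `tokenize` are hypothetical.  Step 2, acquiring the data streams, is not covered here.

```python
import math
from collections import Counter, defaultdict

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Lower-case the text and split it into words, trimming punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


class TinyIndex:
    """A toy summary of a search domain: term -> {doc_id: term count}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        """Record how often each term occurs in one document."""
        self.doc_count += 1
        for term, count in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = count

    def idf(self, term):
        """Smoothed inverse document frequency: rarer terms weigh more."""
        df = len(self.postings.get(term, {}))
        return math.log((1 + self.doc_count) / (1 + df)) + 1.0

    def search(self, probe):
        """Return doc ids ranked by the summed TF-IDF of the probe words."""
        scores = Counter()
        for term in tokenize(probe):
            for doc_id, count in self.postings.get(term, {}).items():
                scores[doc_id] += count * self.idf(term)
        return [doc_id for doc_id, _ in scores.most_common()]


index = TinyIndex()
index.add("a", "Search engines rank pages.")
index.add("b", "News feeds and blog feeds.")
index.add("c", "A search engine for news feeds.")
print(index.search("news feeds"))  # ['b', 'c']
```

Document “b” outranks “c” here only because it mentions “feeds” twice; the score depends on the text alone, not on how popular a document is.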
