This blog is the record of my efforts to design and implement a new search engine.
I will try to discuss and justify each of my design decisions. Since I am new to this particular area, many of my decisions will be naïve and faulty for one reason or another. You, gentle reader, may comment at will.
Why do we use the net?
Here are six reasons:
- To learn about something. To educate ourselves generally about a topic;
- To get the answer to a question. To find a specific answer;
- To find out what’s new. To be introduced to a new topic or some new facts we didn’t know we wanted;
- To broaden our interests or to be entertained. To find something different or an unusual opinion or an interesting fact;
- To select people or things we want. To meet people and buy things;
- To advertise ourselves and our products. To help people meet us and buy our stuff.
Who is the competition?
Who are the current players in the search engine universe, and how do they fit into the taxonomy above?
| Enterprise | Strengths (reason numbers from the list above) | Comments |
|---|---|---|
| Google, Yahoo!, Microsoft Live, … | 1 | Text Search Engines |
| Powerset, Carrot2, CognitionSearch, SWSE, Wikia, Blekko | 1,2? | 2nd generation text search |
| Wikipedia, WebMD | 1,2 | Organized information w/ human quality control |
| Google Earth, MapQuest | 1,2 | Cartography organized and delivered by the web |
| Assista, TrueKnowledge, ChaCha, … | 2 | ‘Ontologies’, some Q&A |
| Google News, Yahoo News | 3 | News Feed semantic clustering |
| Copernic, awasu | 3 | Passive search agents |
| Silobreaker | 3,4 | News Feed and Blog semantic clustering |
| Drudgereport, Junkscience | 4 | Selected from news feeds |
| Townhall.com, Huffingtonpost | 4 | News & comment |
| Truveo, Pixsy, Blinkx, Podscope | 4 | A/V content search |
| YouTube, Flickr, NetFlix | 4 | Video content providers |
| Hakia | 5 | Meet people w/ same questions |
| Shopping.com, ebay, overstock.com, buy.com, amazon.com | 5,6 | Online department stores and flea markets |
| Match.com, eHarmony, … | 5,6 | Personal ads |
| AdWords, adCenter, Yahoo! Search Marketing | 5,6 | Bringing buyers and sellers together… |
| Facebook, MySpace | 5,6 | Elaborate personal ads |
For completeness’ sake, here’s a list of search engines I cobbed from here. I do not claim to have explored all these sites, although I have visited some of them:
• A9
• AOL
• AURA!
• blinkx
• boing
• bookmach.com
• BOXXET
• ChaCha
• ClipBlast!
• Clusty
• collarity
• CometQ
• CONGOO
• d e c i p h o
• del.icio.us
• digg
• digg labs swarm
• Ditto
• dumbfind
• exalead
• factbites
• fazzle
• FEEDS|2.0
• Feedster
• FindSounds
• GIGABLAST
• girafa
• gnod
• GoDefy
• goshme
• GoYams
• grokker
• ICEROCKET
• ixquick
• KartOO
• last.fm
• Lexxealpha
• like
• LiveDeal
• liveplasma
• Local.com
• lurpo
• MetaGlossary
• mnemomap
• Mojeek
• Mooter
• MrSAPO
• MS. DEWEY
• nayio
• Octora
• OiHoi Search
• Pagebull
• PlanetSearch
• pluggd
• PODZINGER
• Previewseek
• pronto.com
• QTsearch
• Quintura
• Releton
• retrevo gamma
• riya
• ROLLYO
• SearchTheWeb2
• SEEQPOD
• sidekiq
• Simply Google
• Singing FISH
• Slideshow
• Slifter
• soople
• Speegle
• Sphider
• SPURL.net
• S R C H R
• SurfWax
• Swoogle
• TagJag!
• thefind.com
• Trexy
• turboscout
• UJIKO
• url.com
• VMGO.com
• Web 2.0
• Webaroo
• WEBBRAIN
• What to RENT?
• whonu?
• WIKIO
• WiseNut
• Yahoo! MINDSET
• yoono
• yoople
• yubnub
• YuFind
• ZABASEARCH
• zapmeta
• Zippy
• ZUULA
The point is that the market is already saturated with wannabes. It is pretty clear that a new engine is going to have to be:
- High Concept, and; either
- High Technology because of a niche focus, or
- High Technology because of improved fundamental algorithms of some sort.
Just making a better general-purpose search engine will probably not cut the mustard, because you’d be lost in the crowd. Consider this: there are many search engines out there I’d consider to be better than Google in one domain or another, but I don’t use them because Google is just easier. In the face of that simple fact, what could a successful strategy look like?
How to break through the logjam is the big question.
Some Observations
Observation 1: The more limited your domain, the more interesting the algorithms you can use. A second-generation search engine should prove itself first on limited domains. For example: news feeds, blog feeds, Stuff I’ve Seen, wikis, or the like.
Observation 2: Most people have learned to work with Google and phrase their probes as word-lists. The computational model is easy to understand. A successful replacement will also be easy to understand.
Observation 3: Nobody can interpret still photos or video. The technology of still-image search is purely context-driven (anchor text, surrounding text, captions, tags, etc.). The technology of the video search companies is essentially to use context and speech recognition to tag video streams. Since speech recognition doesn’t work all that well, the technology doesn’t work all that well. If you wanted to compete by doing Image Search, you would do it stepwise. I bet if you could solve 1, 2 & 3 you’d have a business:
1. Identify humans in images (the human-recognition problem);
2. Recognize the same human from frame-to-frame in video (the continuity problem for humans);
3. Identify a given person across different times and pictures (the person-recognition problem);
4. Then branch out into identifying non-human items and certain recurring situations.
Observation 4: Second-generation search is starting, but it isn’t mature. Here’s what I’d expect from a second-generation search engine:
- Search probes should be expanded where appropriate according to spelling, synonymy, semantic role, part-of-speech ambiguity, etc.
- Search results should be clustered where appropriate according to topic, source, level of detail, etc.
- Words in context should be disambiguated between senses and (possibly?) part of speech.
- A search engine should be able to answer simple questions on its domain.
- People who don’t like reading bad English should be able to filter out the illiterates.
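To make the first expectation concrete, here is a minimal sketch of probe expansion by spelling and synonymy. It is only an illustration under loud assumptions: the `VOCABULARY` list, the `SYNONYMS` table and the `expand_probe` function are hypothetical names invented for this example, and the toy data stands in for statistics a real engine would derive from its corpus. Spelling correction here leans on Python's standard `difflib` module, which is a convenient stand-in and not a claim about what the finished engine will use.

```python
import difflib

# Hypothetical toy data: a real engine would derive both of these
# from a text corpus or a lexical resource.
VOCABULARY = ["search", "engine", "image", "video", "news", "car", "automobile"]
SYNONYMS = {
    "car": ["automobile"],
    "automobile": ["car"],
    "image": ["photo", "picture"],
}


def expand_probe(probe):
    """Expand a word-list probe into one list of alternatives per word."""
    expanded = []
    for word in probe.lower().split():
        alternatives = [word]
        if word not in VOCABULARY:
            # Spelling: add the closest known word, if one is similar enough.
            alternatives += difflib.get_close_matches(
                word, VOCABULARY, n=1, cutoff=0.7
            )
        # Synonymy: add synonyms of the word and of its spelling correction.
        for candidate in list(alternatives):
            for synonym in SYNONYMS.get(candidate, []):
                if synonym not in alternatives:
                    alternatives.append(synonym)
        expanded.append(alternatives)
    return expanded


if __name__ == "__main__":
    # Each inner list holds the alternatives for one word of the probe.
    print(expand_probe("imege search"))
```

The probe stays a plain word-list, in keeping with Observation 2: the user types what they always typed, and the expansion happens behind the scenes. Semantic role and part-of-speech handling are deliberately left out of this sketch.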
Observation 5: Ranking search results by popularity is a cop-out.
What new ways could we use the net?
- More passively. Passive search agents haven’t really caught on. (These used to be known generally as avatars. Now an avatar is more specifically an iconic presence.) Is it a quality problem? A user-interface problem? This bears some investigation.
- For automatic generation of reports on subjects with embedded references and degree-of-consensus estimation for sub-topics.
- For automatic identification of people with whom we share interests based on browsing patterns, etc.
- For real-time sharing of our personal experiences: ‘telepresence.’ We should be able to log on to other people’s lives and ride along with them as they do interesting things. We could have a new class of ‘experiencer superstars’ who would go out and experience things for us. Think text-messaging on steroids…
So what is the plan?
- Acquire a good text corpus to form a statistical basis for linguistic computations.
- Secure methods for acquiring the data streams forming the search domain(s).
- Define the database and algorithms for summarizing the search domain(s).
- Define the retrieval algorithms for retrieving and presenting elements from the search domain(s).
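As a rough picture of how the last two steps of the plan might fit together, here is a small sketch of an inverted index with TF-IDF-style scoring. Everything in it is an assumption made for illustration: the `tokenize` helper, the `TinyIndex` class and the scoring formula are hypothetical, and TF-IDF is simply a well-known, content-based placeholder, not the algorithm this project has settled on. The corpus statistics from the first step show up only as the document counts used to weight rare terms.

```python
import math
from collections import Counter, defaultdict

PUNCTUATION = ".,;:!?()'\""


def tokenize(text):
    """Lowercase a text and split it into words, dropping edge punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


class TinyIndex:
    """Hypothetical in-memory inverted index: term -> {doc_id: term count}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        # Summarize one element of the search domain as term counts.
        for term, count in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = count
        self.doc_count += 1

    def search(self, probe):
        # Score each document by term count weighted by term rarity.
        scores = defaultdict(float)
        for term in tokenize(probe):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, count in docs.items():
                scores[doc_id] += count * idf
        # Best score first; ties are broken by document id.
        return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    index = TinyIndex()
    index.add("a", "Search engines rank news feeds")
    index.add("b", "Blog feeds and wikis")
    index.add("c", "Image search is hard")
    print(index.search("news feeds"))
```

Note that nothing here ranks by popularity: the score depends only on the words in the documents. Whether that is good enough for a limited domain such as news or blog feeds is exactly the kind of question later posts will have to answer.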