Definition of Synonymy: Synonyms are words that mean the same as each other.

Brainrack will not understand “meaning”, as such. Brainrack will understand how words are used, not what they mean. So we have to revise this definition for Brainrack.

Definition of Synonymy per Brainrack: Synonyms are words that are used the same as each other.

That is, for Brainrack, two words are synonyms when they tend to occur in the same contexts.

For instance, suppose we encounter the sentence The dynamite exploded. Suppose we also encounter the sentence The dynamite detonated. Then we’ll take that as evidence that exploded and detonated are synonyms.

Let’s ignore whether a word is a noun or verb or pronoun or whatever, and just consider a sample of text as an unordered bag of words, D . Then define a feature vector \underline{d}=\sum_{1 \le i \le W}\left\rvert \hat{w}_i \right\rangle d_i , where w_i is a word in the vocabulary of size W , and d_i = \left\{ \begin{array}{ll} 1 & w_i \in D \\ 0 & w_i \not\in D \end{array} \right. . [See here for a description of the notational conventions I use for matrices and vectors.]

Note: other definitions for d_i can lead to slightly more complicated calculations for the covariance and correlation matrices below.

If we have a collection of N text samples D_n , with corresponding feature vectors \underline{d}_n then we can define the cooccurrence matrix calculated from the collection as \Omega = \sum_n \underline{d}_n \underline{d}_n^t . Write \omega_{ij} for the coefficient of \left\rvert \hat{w}_i \right\rangle \left\langle \hat{w}_j \right\lvert in the matrix.

We may proceed to define the normalized cooccurrence matrix:

\widehat{\Omega} = \sum_{i,j} \left\rvert \hat{w}_i \right\rangle \left\langle \hat{w}_j \right\lvert \widehat{\omega}_{ij} , where \widehat{\omega}_{ij} = \frac{\omega_{ij}}{N} .

The covariance matrix of the data is given by

\mathbf{C} = \sum_{i,j} \left\rvert \hat{w}_i \right\rangle \left\langle \hat{w}_j \right\lvert C_{ij} , where C_{ij} = \widehat{\omega}_{ij} - \widehat{\omega}_{ii} \widehat{\omega}_{jj} .

The correlation matrix of the data is a normalized covariance matrix as follows

\widehat{\mathbf{C}} = \sum_{i,j} \left\rvert \hat{w}_i \right\rangle \left\langle \hat{w}_j\right\lvert \widehat{C}_{ij} , where \widehat{C}_{ij} = \frac{C_{ij}}{\sqrt{C_{ii}-C_{ii}^2}\sqrt{C_{jj}-C_{jj}^2}} .

Consider a vector from \widehat{\mathbf{C}} . For example \underline{\widehat{C}}_v = \widehat{\mathbf{C}} \left\rvert \hat{v} \right\rangle = \sum_{i} \left\rvert \hat{\imath} \right\rangle \widehat{C}_{iv}. The vectors \widehat{\mathbf{C}} \left\rvert \hat{w}_i \right\rangle define the cooccurrence patterns for each word w_i in the vocabulary. According to our definition, similar patterns of cooccurrence belong to synonymous words. Thus, it remains for us to define a similarity metric over \widehat{\mathbf{C}} \left\rvert \hat{w}_i \right\rangle vectors.

Correlation Similarity

Define the correlation similarity:

\mathbb{S}_c(i,j) = \widehat{\mathbf{C}} \left\rvert \hat{w}_i \right\rangle \circ \widehat{\mathbf{C}} \left\rvert \hat{w}_j \right\rangle

Using this definition, define the synonymy matrix

\mathbf{S}_c = \sum_{i,j} \left\rvert \hat{w}_i \right\rangle \left\langle \hat{w}_j \right\lvert \mathbb{S}_c(i,j) .

Entropic Similarity

Use the normalized cooccurrence matrix to estimate p(\widehat{\omega}|j) = p(X \ge \widehat{\omega} | j) , the probability that a value in column j of the normalized cooccurrence matrix equals or exceeds the value \widehat{\omega} .

Now define the entropic similarity:

\mathbb{S}_e(i,j) = \sum_{k} - \lg \max \left(p\left(\widehat{\omega}_{ki} | i\right),p\left(\widehat{\omega}_{kj} | j\right)\right) .

Using this definition, define the synonymy matrix

\mathbf{S}_e = \sum_{i,j} \left\rvert \hat{w}_i \right\rangle \left\langle \hat{w}_j \right\lvert \mathbb{S}_e(i,j) .

For now, let’s consider a search engine to be a mapping from input strings into a document set.  Sometimes I will call the input string a probe .  Sometimes I will call the document set the domain.

The classic search engine uses a text retrieval algorithm that is both syntactically and semantically dumb.  In mapping between search probes and documents, the search engine only considers the lexicography of the search probe:  The search probe is broken down into an unordered bag of individual words and the search engine searches for the subset of documents containing all the individual words in the search probe.

We have all had our expectations of what to expect from a search engine shaped by this model.  It is simple to understand — almost intuitive.  We expect to phrase our probes not as complete sentences or questions or pictures or sounds, but as strings of words.  We expect the result of a probe to be a set of pre-existing documents, not a graph or precis.

Whether the developers of second-generation search engines like it or not, the simple fact is that this model usually works.  And in those few cases where it doesn’t work, we can easily understand why and figure out a way to force it to work.  We computer users have learned to be conservative; if we find a way to do something we do not lightly abandon it for the latest greatest alternative coming down the pike.1

We could pause here to examine the expectations we’ve learned to internalize by using the standard search engine model, and what might be involved in changing them, but first lets accept the general framework and ask how it is that sometimes, even given the limitations of the framework, our expectations are disappointed.

Interpretations of Probe Words

When we talk about a probe being made up of words, we gloss over some distinctions that linguists commonly make.  These distinctions are relevant here and worth reviewing:

Aspect

Definition

Example

Lexicography the written form of the word The word trees in the sentence I like trees.
Pronounciation the spoken form of the word /TREEZ/
Morphology the particular form of the word in its context the plural of tree, formed by adding an s to the stem
Syntax the grammatical role the word plays in its context trees is the noun phrase forming the object of the sentence
Semantics the meaning of the word

#trees# refers to a kind of plant

What we have in our heads when we type in a probe is some semantic notion — a meaning, a real #tree#.  But what we type into the computer is just a string of letters — a lexicographic representation, the word tree .  Between lexicography and semantics lies plenty of room for ambiguity and misinterpretation:

Aspect

Example of Problems

Possible Remedies

Lexicography misspellings, alternate spellings Computer suggests legal words that are close to the probe word
Morphology A word might have many different forms that are all relevant to a given core meaning, e.g. do vs did vs does . Just because we use one form of a word in a probe, doesn’t mean we want to ignore all other forms of the word in the document set. Use stemmers to treat all the different forms of a word as identical by representing words with a single stem in the databases
Syntax A written word might play many different syntactic roles, for example: trees as a [plural] noun, trees as a [present-tense third-person singular] verb.  When we write a probe we usually have one or the other in mind and we’d like a search engine to distinguish between the roles.  Some syntactic relationships can be approximated by taking account of the proximity or the order of words in a probe and in the document set. Require matches to observe proximity constraints or ordering constraints.  Viz: Man bites dog should return a different set of documents from Dog bites man .
Parse the probe and parse the sentences in the document set. Require matches to observe part-of-speech distinctions.
Semantics – Ambiguity Words have multiple meanings. Even if we know the syntactic role of a word, we might not know its intended semantic interpretation, for example:tree could refer to a kind of plant, ortree could refer to a data structure used in programming computers

Cluster the result set of documents according to their similarity and then let the user sort through the clusters to find the semantics of interest.  The similarity metric might be the extent of shared vocabulary in the result set, or might be determined according to any of a number of pattern recognition and clustering algorithms.

Restrict the domain of the search engine so that ambiguities in the meanings of probe words is avoided.  This is the so-called ‘Vertical Search Engine’ strategy.

Use the cooccurrence probabilities of the words in the probe set to decide between different senses of the words in the probe and thus narrow the probe.  Use the cooccurrence probabilities of the words in your document set to decide between different senses of the words in the document set and thus improve the relevance of the matches between probe set and document set.

Semantics – Synonymy Multiple words can carry the same meaning. Just because our probe is phrased in terms of a particular set of words, doesn’t necessarily mean we don’t care about other synonymous ways of writing about the same idea. Use a thesaurus to look up the synonyms for each word in the probe. Expand the probe set by including all the synonyms.
Words that tend to occur in the same contexts, tend to have the same meanings. Determine the coocurrence probabilities of words in the document set and look for similar patterns of cooccurrence — words with similar patterns are semantically related. Use these relations as a kind of thesaurus to expand the probe set.3

As you can see, the solution to some of the problems I’ve listed is to start modifying the bag-of-words interpretation of probes and documents: take account of the order, or the proximity, or the part-of-speech, or the pairwise probabilities of the words.

Beyond the Bag-of-Words

Naturally, we don’t generally use words without order or structure: We use words in sentences. Sentences have syntactic structure and semantic structure.  Thus, naturally, search engines are looking for ways to abandon the bag-of-words interpretation of language and incorporate more and more structure into their operation.  The hope is that thereby search engines will become more and more natural for people to use. 

One ambitious trend in search engine design is to search the domain looking for facts4 : A pinochle deck has cards ranked 9-Ace; Maple syrup comes from maple trees; Kate Hepburn starred in Philadelphia Story…  The idea is that if a person is asking a question, and the search engine has enough information to give the answer, the search engine ought to be able to answer the question.  To this end, the search engine parses the probe string.  If the probe string is a question or a demand for information, the search engine attempts to produce the relevant information — often by turning the probe string into a database query. 

The fact-based search engines are often coupled with natural language processing5 .  They attempt to allow the users to enter probes in everyday parlance, and to respond appropriately.  The approach runs up against at least two major difficulties: people are lazy and generally prefer not to type complete sentences; once you encourage your users to start using natural language for their probes, it is difficult to explain any limitations on your technology to them.  For example, how might you explain to an average user that their probes shouldn’t contain nested appositives because your parsing algorithm can’t handle them?

Try something like this:  Pick a question.  I picked the question “How many cards are there in a deck?”6   Go to a number of search engines and type in your question as a query.  Here’s what happens (01 February 2008):

Search Engine

Response
www.hakia.com A bunch of URL’s.  The answer to my question is not displayed on the page. The answer to my question is not immediately apparent in the first 10 URL’s displayed either.
www.assista.com “Your search did not match any entries”
http://wikipedia.cognition.com My answer is displayed, indirectly, on the first page as the definition of deck-of-cards: 1) n: a pack of 52 playing cards.  It is also displayed in the summary blurb for the first URL: Bicycle is a standard issue deck of cards consisting of 52 red or black colored cards.
http://alpha.search.wikia.com “We do not have a mini article about …”  I am invited to start an article about the subject for everyone to read.  The answer to my question is not displayed on the page.  The answer to my question is not immediately apparent in the first 10 URL’s displayed either.
www.trueknowledge.com “Your request cannot currently be understood…” The answer to my question is not displayed on the page.  The answer to my question is not immediately apparent in the first 10 URL’s displayed either.  A suggested link takes me (within two clicks) to a wikipedia entry that contains an answer to my question.
www.google.com A bunch of URL’s.  The answer to my question is actually displayed, indirectly, in the blurb for the 7th URL. I don’t know how many in a Peck.  But there are 52 in a deck.

What I would normally do in order to find out how many cards there are in a deck is type a probe like “card deck” to Google.  I’d look for a response that isn’t an ad for playing cards, open the URL, and start reading. 

This approach worked pretty quickly; My answer was midway through the first URL I opened (on wikipedia).  It strikes me, however, that the sentence There are 52 cards in a deck must occur at least a few times in the net.  The least probable of all the words in my query must be cards and deck, so why doesn’t my favorite search engine find me one of those sentences containing cards and deck and not much else?  The answer here is that my favorite search engine does not understand questions like How many?.  Nor does it understand the category of number.  (Does it understand the concept of a sentence?)  When I type there are 52 cards in a deck to Google, the blurb to the 4th URL reads There are 52 cards in a standard deck of cards .

Predicate Logic as an Interpretive Model

Let’s step back from the NLP approach to enhanced search engines a bit and ask whether there isn’t a middle ground between parsing sentences and the bag-of-words approach.  It strikes me that there is.  The middle ground is (first-order) predicate logic.

In the vocabulary of predicate logic there are terms and predicates.  Traditionally, predicates are capitalized and terms are in lower-case.  Predicates either apply to terms, or they don’t.  For instance, the predicate APPLE applies to the term macintosh.  Predicates have scope.  Simply, the scope of a predicate is the set of terms to which the predicate applies.

To a first approximation, a predicate in a probe might be intended with a number of different scopes:

Scope Example Probe Discussion Significance
terms contained in the probe Batman’s black car The predicate BATMAN’S applies to car.  The predicate BLACK applies to car. The default mode for most search engines will handle this probe perfectly well.
Predicates applied to terms contained in the probe are discovered by the normal indexing algorithms
document returned by probe Shakespeare’s Hamlet

We might want documents containing the play Hamlet written by Shakespeare — in which case SHAKESPEARE’S is a predicate applying to the document as a whole.
We might want documents containing discussion of Shakespeare’s Hamlet — in which case Shakespeare’s is a term in the probe.
We might want to know where Shakespeare lived — in which case Shakespeare’s is again a term in the probe.

Current search engines call this type of information “metadata”.  Metadata, if it is relevant to a search performed by current search engines at all, must be part of the original document’s mark-up.
Predicates applied to the documents returned by a probe may be discovered by normal search engine algorithms if sufficient document metadata is available.  
terms contained in document returned by probe Number of cards in a deck The predicate NUMBER applies to the terms in the document returned by the probe.  The words cards and deck are terms in the probe. Predicates applied to the terms contained in documents returned by a probe might be discovered by taxonomic or ontological4 systems now under development.

To allay any confusion as to the distinctions here, let me give some more examples:

Probes in which the predicates apply to terms in the probe:

Probe Comment
Britney’s hair result documents should be about Britney’s hair
world’s richest man result documents should be about world’s richest man
semantic ambiguity result documents should be about semantic ambiguity

Probes in which the predicates apply to the documents returned by the probe:

Probe Comment
PhD dissertation on computers learning syntax Result documents should be about computers learning syntax.  The documents should be in the form of PhD dissertations.
philosophy exam Result documents should be about discuss philosophy.  The documents should be in the form of exams.
Shakespeare’s sonnets Result documents should be about … The documents should be in the form of sonnets.  The documents should be written by Shakespeare.
Vegan cooking blog Result documents should be about vegan cooking.  The documents should be in the form of a blog.
Iraq editorial Result documents should be about Iraq.  The documents should be in the form of editorial.
Probes in which the predicates apply to the terms in the documents returned by the probe:
Probe Comment
number of cards in a deck Result documents should be about cards in a deck containing numbers.
author of Little Women Result documents should be about Little Women containing authors.
color of Britney’s hair Result documents should be about Britney’s hair containing colors.
taste of garlic ice cream Result documents should be about garlic ice cream containing tastes.

 

This way of thinking about enhancing search engine capabilities has some nice properties:
  1. It is a natural extension of current technology.  The first class of predicate applications is solved agreeably well by current technology.  The second and third may be omitted with no degradation in search performance;
  2. It offers a means to respond to a large class of probes in a more relevant way;
  3. It doesn’t require anyone to change his standard bag-of-words approach to formulating searches.7  
Of course, now we have some big questions to answer:
  1. How do you decide what predicates are intended to apply to what?
  2. How can a computer automatically determine when a particular predicate applies to a particular document or term?

Good questions.  More to follow…

 


 

  1. A personal note: the two fastest, most streamlined, most comfortable development environments I have ever used are now at least 15 years out-of-date.  One environment was doing C development under MS-DOS using the Norton Commander interface.  The other environment was doing LISP development on the TI Explorer using the integrated editor/compiler/read-eval-print loop.  I still lament their passing.
  2. Luckily, most of us don’t use speech recognition to interact with our computers, so I’m going to omit the pronunciation category from the table.
  3. If you’re interested in this sort of thing, look up Karhunen-Loeve or Principal Component Analysis with your favorite search engine and start reading.  In order to maximize jargon and repel the uninitiated, the techniques have come to be known as Latent Semantic Indexing (LSI) within the computational linguistics community.
  4. For some unknown reason the computational linguistics community has come to call these sets of facts (and their organizational scheme) ontologies — in spite of the fact that this is a complete misnomer according to the historical usage of the word.  You’d think linguists would be more careful…
  5. For a quick tutorial on the state-of-the-art in natural language processing (NLP) as applied to search engines, visit PowerSet or TrueKnowledge , and read some of their background material.
  6. Like [almost?] all queries, this one is ambiguous:  What kind of deck?  Standard, Pinochle, Spanish, … ?  How about the alternate meaning of cards as in rogues or cutups?  Maybe I want to know how many rogues there are on the typical surface of an ocean-going vessel…
  7. Dubbed searchese by some.