Latent Semantic Indexing - An Introduction
Have you heard of Latent Semantic Indexing (or Analysis)? You will have come across the term if you have read about semantics and machine learning. I had too, but I never went beyond the lingo. I suspect some others, like me, have given it a pass because of the overload of Greek statistical symbols, the algebra, the matrices… After all, we are programmers; what do matrices have to do with us? So, here is an attempt to introduce LSI to a semantics programmer who left mathematics behind some decades ago.
What is Latent Semantic Indexing or LSI?
Typically, information is retrieved by literally matching terms in documents with those of a query. However, lexical matching methods can be inaccurate when they are used to match a user’s query. Since there are usually many ways to express a given concept (synonymy), the literal terms in a user’s query may not match those of a relevant document. In addition, most words have multiple meanings (polysemy), so terms in a user’s query will literally match terms in irrelevant documents. A better approach would allow users to retrieve information on the basis of a conceptual topic or meaning of a document.
LSI is a natural language processing technique for analyzing relationships between a set of documents and the terms they contain, by producing a set of concepts related to those documents and terms.
What are the benefits of LSI?
Two common problems in natural language processing are synonymy and polysemy.
Synonymy – the phenomenon of multiple words having similar meanings.
Polysemy – the phenomenon of a single word having more than one meaning.
Synonymy often causes relevant documents to be missed in search results, while polysemy may lead to noise. LSI sees every word in the context of the other words in the document, and this contextual information enables it to deal with word ambiguities like synonymy and polysemy.
LSI has proven very effective at categorizing content into predefined concepts or topics. The concepts contained in the documents being categorized are compared with the concepts contained in a set of example items, and a category (or categories) is assigned to each document based on how similar its concepts are to those of the example documents.
Even without predefined topics, LSI is useful for dynamically clustering content based on conceptual similarity. This is very handy when dealing with an unknown collection of unstructured text. And because it uses a strictly mathematical approach, LSI is inherently independent of language and domain.
The LSI algorithm – a 1,000-foot view
Well, the algorithm is complex and very expensive. According to some of the material I have read, running LSI on tens of thousands of documents can take up to a full day (24 hours). Furthermore, it is not recommended to re-run LSI every time new documents are added; instead, there is a well-established technique called ‘folding in’ for adding new content without re-indexing (a sketch of it follows below).
This complexity gives us all the more reason to keep 1,000 feet away from the algorithmic details.
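For the curious, here is what folding in looks like in code – a minimal sketch in Python, assuming a truncated SVD (the matrix U_k and the singular values s_k, explained in the next section) is already at hand. The function name fold_in and all variable names are mine, made up purely for illustration.

import numpy as np

def fold_in(new_doc_counts, U_k, s_k):
    # Project the new document's term-count vector into the existing
    # concept space: d_hat = inverse(Sigma_k) * transpose(U_k) * d.
    # new_doc_counts must use the same vocabulary (same term order)
    # as the matrix the SVD was originally computed from.
    return np.diag(1.0 / s_k) @ U_k.T @ new_doc_counts

The new document gets a position in concept space without touching the expensive decomposition; the trade-off is that the concepts themselves slowly go stale, so a full re-index is still needed once in a while.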
In a nutshell, LSI is a mathematical algorithm built on matrices, the heart of which is matrix decomposition – Singular Value Decomposition (SVD) – followed by dimensionality reduction.
The algorithm constructs a word-document matrix where each row corresponds to a unique word in the document corpus and each column corresponds to a document. The value at each position is how many times the row’s word occurs in the column’s document. This matrix is called the Term-document matrix (C).
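To make this concrete, here is a minimal sketch in Python using scikit-learn's CountVectorizer; the three toy documents are made up for illustration.

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

vectorizer = CountVectorizer()
# fit_transform returns a document-term matrix (one row per document);
# transpose it to get the term-document matrix C (one row per term).
C = vectorizer.fit_transform(docs).toarray().T
print(C.shape)  # (number of unique terms, number of documents)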
Then the algorithm applies SVD (Singular Value Decomposition) to this term-document matrix. Now, this is the core mathematical stuff… matrix decomposition, eigenvalues, eigenvectors, blah blah… So rather than getting into the steps of matrix decomposition, we move straight to the output of SVD.
SVD yields three matrices:
SVD: C = UΣVᵀ, where U is the word space, Σ is the diagonal matrix of singular values, and Vᵀ is the document space.
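In code, getting those three matrices is a one-liner with NumPy, continuing from the toy matrix C built above (a sketch, not a production LSI pipeline):

import numpy as np

# Reduced SVD of the term-document matrix.
U, s, Vt = np.linalg.svd(C.astype(float), full_matrices=False)
# U  : word space      – one row per term, one column per concept
# s  : singular values – the 'strength' of each concept
# Vt : document space  – one row per concept, one column per document
print(U.shape, s.shape, Vt.shape)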
The word space (U) represents the relatedness of each term to every concept in the document corpus. It has a row for each unique term and a column for each concept.
The document space (Vᵀ) represents the relatedness of each document to every concept in the document corpus. It has a row for each concept and a column for each document.
Together, the word space and the document space can be used to find the relatedness of documents to each other based on the concepts they share. This is where dimensionality reduction comes in: keeping only the k largest singular values (and the corresponding columns of U and rows of Vᵀ) strips away noise and leaves the k strongest ‘concepts’.
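Here is a sketch of that last step, continuing from the SVD above: keep only the top k concepts, then compare documents by cosine similarity in the reduced concept space. The choice k = 2 is arbitrary for this toy corpus; real systems typically keep a few hundred concepts.

import numpy as np

k = 2  # number of concepts to keep (dimensionality reduction)
# Each document becomes a k-dimensional vector in concept space.
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Relatedness of the first two documents in concept space.
print(cosine(doc_vectors[0], doc_vectors[1]))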
Existing implementations of LSI
Now that we know it is complex, common sense tells us there must be existing implementations that can be used almost out of the box. Yes, several universities are pursuing LSI research, and there are existing implementations of SVD and even of full LSI.
Stay tuned for updates around existing implementations.