Understanding word similarity / relatedness using WordNet - Semantic similarity
Semantic similarity measures play an important role in the extraction of semantic relations and are widely used in Natural Language Processing (NLP) and Information Retrieval (IR). Measuring the semantic similarity between words is also central to many research areas such as Artificial Intelligence, Linguistics, Cognitive Science and Knowledge Engineering. The semantic similarity between words can be utilized to disambiguate word senses and to improve the accuracy of ontology mapping.
Measuring the semantic similarity or distance between words is a process of quantifying the relatedness between the words using knowledge obtained from certain information sources:
- lexical resources such as dictionaries, thesauri and semantic networks;
- collections of documents, i.e., corpora; and
- the Internet.
Semantic networks are considered better choices for estimating semantic similarity than other lexical resources. WordNet is one of the most widely used semantic networks for estimating semantic similarity.
Based on the way of utilizing WordNet, the WordNet-based semantic similarity measures can be classified into three categories:
- Node-based methods, which estimate the semantic similarity by computing the amount of information contained by related words in WordNet;
- Edge-based methods, which assess the semantic similarity by calculating the length of the edges on the shortest path between the words in WordNet;
- Hybrid methods, which combine information content theory and the structure information from WordNet to estimate the semantic similarity between words.
WordNet has been commonly used to measure semantic similarity among words because it has the inherent advantage of a structure that simulates human cognitive behavior.
1. Node-based methods
Most node-based methods employ information content to quantify the amount of information that a concept contains. The Information Content (IC) of a concept c can be quantified by IC(c) = -log(P(c)), where P(c) is the probability of c appearing in a corpus. Resnik believed that the similarity of c1 and c2 is determined by the closest common superordinate concept (i.e., hypernym) of c1 and c2 in WordNet. Thus, Resnik proposed to use the IC of the lowest common hypernym of c1 and c2 to calculate the semantic relatedness between c1 and c2.
The relatedness value is equal to the Information Content (IC) of the Least Common Subsumer (LCS), i.e., the most informative subsumer. This means that the value is always greater than or equal to zero. The upper bound on the value is generally quite large and varies with the size of the corpus used to determine the IC values; to be precise, the upper bound is ln(N), where N is the number of words in the corpus.
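Resnik's measure can be sketched in a few lines of Python. The taxonomy and corpus probabilities below are made up purely for illustration; a real system would derive P(c) from corpus counts over WordNet synsets.

```python
import math

# Toy taxonomy with hypothetical corpus probabilities (illustration only):
# entity > animal > {dog, cat}
P = {"entity": 1.0, "animal": 0.2, "dog": 0.05, "cat": 0.04}

# Hypernym (parent) links in the toy taxonomy.
parent = {"animal": "entity", "dog": "animal", "cat": "animal"}

def ic(c):
    """Information Content: IC(c) = -log(P(c))."""
    return -math.log(P[c])

def hypernyms(c):
    """Return the set containing c and all of its ancestors."""
    chain = {c}
    while c in parent:
        c = parent[c]
        chain.add(c)
    return chain

def resnik(c1, c2):
    """Resnik relatedness: the IC of the most informative common subsumer."""
    common = hypernyms(c1) & hypernyms(c2)
    return max(ic(c) for c in common)

print(resnik("dog", "cat"))  # IC("animal") = -ln(0.2) ≈ 1.609
```

Note that the LCS of a concept with itself is the concept itself, so resnik(c, c) = IC(c), which is why the measure's upper bound grows with the corpus size.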
Richardson and Smeaton amended Resnik’s method by using a different equation to calculate the IC of the lowest hypernyms of c1 and c2.
Banerjee and Pedersen developed a scoring scheme that estimates relatedness by cross-comparing the words used in the definitions (glosses) of c1 and c2 and of their hypernyms.
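The core idea of gloss comparison can be illustrated with a deliberately simplified sketch: count the words shared by two definitions. The glosses below are invented for the example; Banerjee and Pedersen's full scheme additionally scores multi-word overlaps quadratically and compares the glosses of related concepts such as hypernyms.

```python
# Hypothetical glosses; a real system would pull these from WordNet definitions.
glosses = {
    "dog": "a domesticated carnivorous mammal kept as a pet",
    "cat": "a small domesticated carnivorous mammal kept as a pet",
}

def gloss_overlap(c1, c2):
    """Simplified relatedness score: the number of distinct words
    shared by the two glosses."""
    w1 = set(glosses[c1].split())
    w2 = set(glosses[c2].split())
    return len(w1 & w2)

print(gloss_overlap("dog", "cat"))  # 7 shared words
```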
The drawbacks of node-based methods include:
- Analyzing the corpora to estimate the IC values is time-consuming;
- Unbalanced contents of the employed corpora may significantly decrease the accuracy of the IC values.
2. Edge-based methods
Leacock and Chodorow suggested that the semantic relatedness between c1 and c2 can be estimated using the number of edges on the shortest path between c1 and c2 and the depth of the involved taxonomy tree in WordNet.
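A commonly cited form of the Leacock–Chodorow measure is sim = -log(len / (2·D)), where len is the shortest-path length and D is the maximum depth of the taxonomy; exact path-counting conventions vary between implementations, so treat this as one plausible formulation.

```python
import math

def lch_similarity(path_length, taxonomy_depth):
    """Leacock-Chodorow: sim = -log(len / (2 * D)).
    path_length: number of edges on the shortest path between the concepts.
    taxonomy_depth: maximum depth D of the taxonomy tree."""
    return -math.log(path_length / (2 * taxonomy_depth))

# e.g., two concepts 2 edges apart in a taxonomy of depth 16:
print(lch_similarity(2, 16))  # -log(2/32) = log(16) ≈ 2.77
```

Because the path length is normalized by twice the taxonomy depth, the ratio stays in (0, 1] and the similarity stays non-negative, with closer concepts scoring higher.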
Hirst and St-Onge utilized edge directions to estimate edge lengths of the shortest path. An edge direction is determined by the relation type of the edge. The directions of edges in a path between concepts are used to determine if the path is allowable, and to define the strength of turns in the allowable path. Hirst and St-Onge used the length of the shortest allowable path between c1 and c2, and the number of the direction turns in the path to estimate the semantic relatedness between c1 and c2.
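The Hirst–St-Onge score for an allowable path is usually given as relatedness = C - path_length - k × turns; the constants C = 8 and k = 1 used below are the values commonly cited in the literature, and determining which paths are allowable (from edge directions) is the harder part that this sketch leaves out.

```python
def hso_relatedness(path_length, direction_turns, C=8, k=1):
    """Hirst-St-Onge scoring for an *allowable* path:
    relatedness = C - path_length - k * turns.
    C = 8 and k = 1 are the commonly cited constants; checking path
    allowability from edge directions is assumed to be done elsewhere."""
    return C - path_length - k * direction_turns

# A path of length 3 with one change of direction:
print(hso_relatedness(3, 1))  # 8 - 3 - 1 = 4
```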
The accuracy of edge-based methods suffers because they do not account for the variation in semantic distance between adjacent words, which is caused by the uneven word density in WordNet. For instance, Leacock and Chodorow did not consider the variation in edge lengths, and Hirst and St-Onge ignored the diversity of semantic distances among edges of the same type.
3. Hybrid methods
Jiang and Conrath accumulate the scaled lengths of all the edges on the shortest path between concepts to estimate their semantic similarity. The edge length between concept c (a node in the shortest path) and concept p (the parent node of c in the shortest path) is calculated as length(c, p) = log(P(p)) - log(P(c)). Structure information from WordNet is used to scale the edge lengths, including:
- the local density of the nodes in the shortest path,
- the depth of the nodes, and
- the relation type of the edges.
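Before scaling, the edge length log(P(p)) - log(P(c)) is exactly IC(c) - IC(p), so summing the unscaled edges along the shortest path gives the familiar Jiang–Conrath distance IC(c1) + IC(c2) - 2·IC(LCS). The sketch below verifies this on a toy taxonomy with made-up probabilities (the scaling factors from density, depth and relation type are omitted).

```python
import math

# Toy taxonomy with hypothetical corpus probabilities (illustration only).
P = {"entity": 1.0, "animal": 0.2, "dog": 0.05, "cat": 0.04}
parent = {"animal": "entity", "dog": "animal", "cat": "animal"}

def edge_length(c, p):
    """Jiang-Conrath edge length: length(c, p) = log(P(p)) - log(P(c))."""
    return math.log(P[p]) - math.log(P[c])

def path_to_root(c):
    """List of concepts from c up to the root."""
    chain = [c]
    while c in parent:
        c = parent[c]
        chain.append(c)
    return chain

def jcn_distance(c1, c2):
    """Sum of (unscaled) edge lengths along the shortest path, which
    equals IC(c1) + IC(c2) - 2*IC(LCS) in a tree-shaped taxonomy."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    lcs = next(c for c in p1 if c in p2)  # lowest common subsumer
    dist = 0.0
    for path in (p1, p2):
        for child, par in zip(path, path[1:]):
            if child == lcs:
                break  # stop climbing once we reach the LCS
            dist += edge_length(child, par)
    return dist

print(jcn_distance("dog", "cat"))  # log(4) + log(5) = log(20) ≈ 3.0
```

Note that this is a distance, not a similarity: larger values mean the concepts are further apart, and identical concepts score zero.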
According to previous research, hybrid methods outperform the other two kinds of methods. However, hybrid methods share a common weakness with all node-based methods: the data sparseness of the employed corpus. This weakness can be caused by two reasons:
- The size of the corpus is too small, and
- The coverage of the corpus is not well balanced.
One possible way to remedy this deficiency is to use a larger, better-balanced corpus.