DiShIn

DiShIn (Disjunctive Shared Information) is a method that exploits multiple inheritance when calculating the shared information content of two ontological concepts compared by node-based semantic similarity measures. DiShIn redefines the shared information content of two concepts as the average information content of all their disjunctive common ancestors. A common ancestor is considered disjunctive if the difference between the numbers of distinct paths from each of the two concepts to it differs from that of every more informative common ancestor. In other words, a disjunctive ancestor is the most informative ancestor representing a given set of parallel interpretations. DiShIn improves on GraSM in computational efficiency and in the handling of parallel interpretations.

Example

For example, palladium, platinum, silver and gold are considered to be precious metals, while silver, gold and copper are considered to be coinage metals. Thus, we have:

                    metal
                   /     \
           precious       coinage
          /    |  \ \     / /  \
         /     |   \  gold /    \
palladium  platinum  silver  copper

When calculating the semantic similarity between platinum and gold, DiShIn starts by calculating the path difference, i.e. the difference between the numbers of distinct paths from each concept, for each of their common ancestors:

gold -> coinage -> metal
gold -> precious -> metal  
platinum -> precious -> metal
gold -> precious
platinum -> precious

For metal we have two paths from gold and one from platinum, so we have a path difference of one. For precious we have one path from each concept, so we have a path difference of zero.

Since their path differences are distinct, both common ancestors, metal and precious, are considered to be disjunctive common ancestors.
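The path counting described above can be sketched in a few lines of Python. This is only an illustration on the example graph, not the authors' implementation: the `PARENTS` dictionary and the function names `count_paths`, `ancestors` and `path_differences` are assumptions introduced here.

```python
from functools import lru_cache

# Child -> parents edges of the example graph (restated from the diagram).
PARENTS = {
    "metal": [],
    "precious": ["metal"],
    "coinage": ["metal"],
    "palladium": ["precious"],
    "platinum": ["precious"],
    "silver": ["precious", "coinage"],
    "gold": ["precious", "coinage"],
    "copper": ["coinage"],
}


@lru_cache(maxsize=None)
def count_paths(concept, ancestor):
    """Number of distinct upward paths from concept to ancestor."""
    if concept == ancestor:
        return 1
    return sum(count_paths(parent, ancestor) for parent in PARENTS[concept])


def ancestors(concept):
    """All proper ancestors of concept."""
    found = set()
    stack = list(PARENTS[concept])
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(PARENTS[node])
    return found


def path_differences(c1, c2):
    """Path difference of every common ancestor of c1 and c2."""
    common = ancestors(c1) & ancestors(c2)
    return {a: abs(count_paths(c1, a) - count_paths(c2, a)) for a in sorted(common)}


print(path_differences("platinum", "gold"))  # {'metal': 1, 'precious': 0}
```

The two path differences are distinct, which is why both metal and precious count as disjunctive common ancestors for this pair.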

When calculating the semantic similarity between platinum and palladium, DiShIn starts by calculating the path difference for each of their common ancestors:

palladium -> precious -> metal  
platinum -> precious -> metal
palladium -> precious
platinum -> precious

For both metal and precious, we have only one path from each concept, so we have a path difference of zero for both common ancestors. Thus, only the common ancestor precious (the most informative) is considered to be a disjunctive common ancestor.

Node-based semantic similarity measures are proportional to the average information content of the disjunctive common ancestors: metal and precious in the case of platinum and gold, and precious alone in the case of platinum and palladium. Because metal is less informative than precious, the average is lower for platinum and gold, which means that for DiShIn palladium and platinum are more similar than platinum and gold.

When calculating the semantic similarity between silver and gold, DiShIn starts by calculating the path difference for each of their common ancestors:

gold -> coinage -> metal
gold -> precious -> metal  
silver -> coinage -> metal
silver -> precious -> metal  
gold -> precious
silver -> precious
gold -> coinage
silver -> coinage

As in the case of platinum and palladium, here all common ancestors have a path difference of zero, since silver and gold share the same relationships and therefore have parallel interpretations. Thus, only the most informative common ancestor, either precious or coinage, is considered to be a disjunctive common ancestor. This means that for DiShIn the similarity between silver and gold is greater than or equal to the similarity between any other pair of the leaf concepts. Thus, DiShIn does not penalize parallel interpretations as GraSM did.
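The three comparisons above can be reproduced with a minimal Python sketch that keeps, for each path difference, the most informative common ancestor and then averages the results. The information content values in `IC` are hypothetical numbers chosen only so that metal is the least informative concept; they do not come from any real corpus, and the function names are assumptions of this sketch.

```python
# Child -> parents edges of the example graph (restated from the diagram).
PARENTS = {
    "metal": [],
    "precious": ["metal"],
    "coinage": ["metal"],
    "palladium": ["precious"],
    "platinum": ["precious"],
    "silver": ["precious", "coinage"],
    "gold": ["precious", "coinage"],
    "copper": ["coinage"],
}

# Hypothetical information content values (assumption, for illustration only).
IC = {"metal": 0.0, "precious": 0.2, "coinage": 0.5}


def count_paths(concept, ancestor):
    """Number of distinct upward paths from concept to ancestor."""
    if concept == ancestor:
        return 1
    return sum(count_paths(parent, ancestor) for parent in PARENTS[concept])


def ancestors(concept):
    """All proper ancestors of concept."""
    result = set()
    for parent in PARENTS[concept]:
        result |= {parent} | ancestors(parent)
    return result


def dishin(c1, c2):
    """Average IC of the disjunctive common ancestors of c1 and c2.

    Assumes the two concepts share at least one ancestor, which holds for
    every pair of leaves in this example because of the root "metal".
    """
    best_per_diff = {}
    for a in ancestors(c1) & ancestors(c2):
        diff = abs(count_paths(c1, a) - count_paths(c2, a))
        # Keep only the most informative ancestor for each path difference.
        if diff not in best_per_diff or IC[a] > best_per_diff[diff]:
            best_per_diff[diff] = IC[a]
    return sum(best_per_diff.values()) / len(best_per_diff)


for pair in [("platinum", "gold"), ("platinum", "palladium"), ("silver", "gold")]:
    print(pair, dishin(*pair))
```

With these assumed values, platinum and gold obtain the lowest shared information, platinum and palladium a higher one, and silver and gold the highest, matching the ordering described in the text.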

Implementation

After estimating the information content of each concept and the number of distinct paths from one concept to another, DiShIn can be implemented as a single SQL query, as described in the authors' publication in the Journal of Biomedical Semantics.

An SQL implementation for the MySQL release of the Gene Ontology computes the semantic similarity of a pair of GO terms on the fly, i.e. without requiring any preliminary calculations.

It can be used in the GO database mirror at the EBI:

mysql -hmysql.ebi.ac.uk -ugo_select -pamigo -P4085 go_latest < DiShIn.sql 

or with a local installation built from the GO pre-built database dumps.

The SQL script starts by defining the input GO terms for which the shared information is calculated:

SET @t1Id = (SELECT id FROM term WHERE acc='GO:0060255'),
    @t2Id = (SELECT id FROM term WHERE acc='GO:0031326');

or for example the terms used in http://dx.doi.org/10.1186/2041-1480-2-5

SET @t1Id = (SELECT id FROM term WHERE acc='GO:0008387'), 
    @t2Id = (SELECT id FROM term WHERE acc='GO:0008396');

or for example the terms used in http://dx.doi.org/10.1016/j.datak.2006.05.003

SET @t1Id = (SELECT id FROM term WHERE acc='GO:0008387'),
    @t2Id = (SELECT id FROM term WHERE acc='GO:0008396');

Calculation of the maximum frequency for a term, assuming the number of gene products as the maximum frequency possible

SET @maxFreq = (SELECT COUNT(*) FROM gene_product);
   

Calculation of the information content of input term @t1Id

SET @t1IC = (
   SELECT -LOG(COUNT(DISTINCT a.gene_product_id)/@maxFreq) as ic
   FROM graph_path gp
       INNER JOIN association a ON (gp.term2_id = a.term_id)
   WHERE gp.term1_id = @t1Id
     AND a.is_not = 0
     AND gp.relationship_type_id IN (SELECT id FROM term WHERE name='part_of' OR name='is_a')
 );

Calculation of the information content of input term @t2Id

SET @t2IC = (
   SELECT -LOG(COUNT(DISTINCT a.gene_product_id)/@maxFreq) as ic
   FROM graph_path gp
       INNER JOIN association a ON (gp.term2_id = a.term_id)
   WHERE gp.term1_id = @t2Id
     AND a.is_not = 0
     AND gp.relationship_type_id IN (SELECT id FROM term WHERE name='part_of' OR name='is_a')
 );    
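As a rough illustration of what the two queries above compute, the information content of a term is the negative natural logarithm of its relative annotation frequency (MySQL's single-argument LOG() is the natural logarithm). The Python sketch below mirrors that formula; the function name `information_content` and the counts used in the example are hypothetical and are not taken from the GO database.

```python
import math


def information_content(annotated, max_freq):
    """IC = -ln(annotated / max_freq).

    annotated: number of distinct gene products annotated to the term
               or to any of its descendants (hypothetical input here).
    max_freq:  total number of gene products, used as the maximum frequency.
    """
    if annotated <= 0 or max_freq <= 0:
        raise ValueError("counts must be positive")
    return -math.log(annotated / max_freq)


# Hypothetical counts: the rarer the term, the higher its information content.
print(information_content(10, 1000))
print(information_content(500, 1000))
```

A term annotated to every gene product has an information content of zero, which is why the root of the ontology is the least informative concept.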


Calculation of the disjunctive shared information (DiShIn) without requiring preliminary calculations. It assumes that the difference of the number of distinct paths can be estimated on-the-fly by the difference of the number of distinct nodes in the paths.

SET @dishin =
 ( SELECT AVG(dishin.ic)
  FROM
    (SELECT MAX(ca_ic.ic) AS ic
     FROM
       ( SELECT ca.term_id, ca.diff, -LOG(COUNT(DISTINCT a.gene_product_id)/@maxFreq) AS ic
        FROM
          ( SELECT ca.term_id,
                   ABS(ca.ca_t1_number - ca.ca_t2_number) AS diff
           FROM
             (SELECT ca.ancestor AS term_id,
                     COUNT(DISTINCT ca_t1_nodes.term2_id) AS ca_t1_number,
                     COUNT(DISTINCT ca_t2_nodes.term2_id) AS ca_t2_number
              FROM
                ( SELECT p1.term1_id AS ancestor
                 FROM graph_path p1,
                                 graph_path p2
                 WHERE p1.term2_id = @t1Id
                   AND p2.term2_id = @t2Id
                   AND p1.term1_id = p2.term1_id
                   AND p1.relationship_type_id IN
                     (SELECT id
                      FROM term
                      WHERE name='part_of'
                        OR name='is_a')
                   AND p2.relationship_type_id IN
                     (SELECT id
                      FROM term
                      WHERE name='part_of'
                        OR name='is_a')) AS ca
              INNER JOIN graph_path ca_t1_nodes ON (ca.ancestor = ca_t1_nodes.term1_id)
              INNER JOIN graph_path ca_t2_nodes ON (ca.ancestor = ca_t2_nodes.term1_id)
              WHERE ca_t1_nodes.term2_id IN
                  ( SELECT p2.term1_id AS ancestor
                   FROM graph_path p2
                   WHERE p2.term2_id = @t1Id)
                AND ca_t2_nodes.term2_id IN
                  ( SELECT p2.term1_id AS ancestor
                   FROM graph_path p2
                   WHERE p2.term2_id = @t2Id)
                AND ca_t1_nodes.relationship_type_id IN
                  (SELECT id
                   FROM term
                   WHERE name='part_of'
                     OR name='is_a')
                AND ca_t2_nodes.relationship_type_id IN
                  (SELECT id
                   FROM term
                   WHERE name='part_of'
                     OR name='is_a')
              GROUP BY ca.ancestor ) AS ca ) AS ca
        INNER JOIN graph_path gp ON (ca.term_id = gp.term1_id)
        INNER JOIN association a ON (gp.term2_id = a.term_id)
        WHERE a.is_not = 0
          AND gp.relationship_type_id IN
            (SELECT id
             FROM term
             WHERE name='part_of'
               OR name='is_a')
        GROUP BY ca.term_id,
                 ca.diff ) AS ca_ic
     GROUP BY ca_ic.diff) AS dishin );
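The node-count approximation used by the query above can be sketched on the metals example. For each common ancestor, the sketch counts the distinct nodes lying between the ancestor and each concept and takes the absolute difference. The `PARENTS` graph and the function names are assumptions introduced for illustration, and the endpoints are included in each count, which does not affect the difference because both counts include them. This is an estimate: in larger graphs the node-count difference need not equal the true path-count difference.

```python
# Child -> parents edges of the example graph (restated from the diagram).
PARENTS = {
    "metal": [],
    "precious": ["metal"],
    "coinage": ["metal"],
    "palladium": ["precious"],
    "platinum": ["precious"],
    "silver": ["precious", "coinage"],
    "gold": ["precious", "coinage"],
    "copper": ["coinage"],
}


def ancestors_or_self(concept):
    """The concept itself plus all of its ancestors."""
    result = {concept}
    for parent in PARENTS[concept]:
        result |= ancestors_or_self(parent)
    return result


def nodes_between(ancestor, concept):
    """Distinct nodes on any upward path from concept to ancestor."""
    return {
        node
        for node in ancestors_or_self(concept)
        if ancestor in ancestors_or_self(node)
    }


def node_count_difference(c1, c2, ancestor):
    """Estimate of the path difference of a common ancestor of c1 and c2."""
    return abs(len(nodes_between(ancestor, c1)) - len(nodes_between(ancestor, c2)))


print(node_count_difference("platinum", "gold", "metal"))     # 1
print(node_count_difference("platinum", "gold", "precious"))  # 0
```

On this small graph the estimate agrees with the path differences worked out in the example section.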
       

Information content normalization to a [0..1] interval

SET @maxIC = ( SELECT -LOG(1/@maxFreq) );
SET @t1IC_norm = ( SELECT @t1IC/@maxIC );
SET @t2IC_norm = ( SELECT @t2IC/@maxIC );
SET @dishin_norm = ( SELECT @dishin/@maxIC );

Calculation of the semantic similarity measures using DiShIn:

SELECT @dishin_norm as Sim_resnik;
SELECT @t1IC_norm + @t2IC_norm - 2*@dishin_norm as Dist_jc;
SELECT (2*@dishin_norm) / (@t1IC_norm + @t2IC_norm)  as Sim_lin;
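The normalization and the three measures above can be mirrored in a short Python sketch. The function name `similarity_measures` and the input values are hypothetical; returning None for the Lin measure when both terms have zero information content is a choice made for this sketch, not something stated in the SQL script.

```python
import math


def similarity_measures(ic1, ic2, shared_ic, max_freq):
    """Resnik similarity, Jiang-Conrath distance and Lin similarity.

    ic1, ic2:  information content of the two terms.
    shared_ic: shared information content (e.g. the DiShIn value).
    max_freq:  maximum frequency, must be greater than 1.
    """
    max_ic = -math.log(1 / max_freq)  # IC of a term annotated exactly once
    n1, n2, ns = ic1 / max_ic, ic2 / max_ic, shared_ic / max_ic
    return {
        "sim_resnik": ns,
        "dist_jc": n1 + n2 - 2 * ns,
        # Undefined when both terms carry no information (assumption: None).
        "sim_lin": (2 * ns) / (n1 + n2) if (n1 + n2) > 0 else None,
    }


# Hypothetical frequencies: 10, 20 and 100 annotations out of 1000 gene products.
max_freq = 1000
result = similarity_measures(
    ic1=-math.log(10 / max_freq),
    ic2=-math.log(20 / max_freq),
    shared_ic=-math.log(100 / max_freq),
    max_freq=max_freq,
)
print(result)
```

Because all three information content values are divided by the same maximum, the Lin similarity is unaffected by the normalization, while the Resnik similarity and the Jiang-Conrath distance are rescaled.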

References

This article is issued from Wikipedia - version of the Monday, December 31, 2012. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.