The Islander algorithm finds genomic islands based on mechanistic consequences of their typical site-specific integration into tRNA/tmRNA genes (tDNAs). Because integration displaces a fragment of the tDNA, islands that target tDNAs carry a replacement fragment that restores the gene. This leaves an integration signature pattern, with the island flanked by the intact tDNA and the displaced fragment.
We search for the above signature and further require that the island contain an integrase gene. Several filters are used to rule out spurious islands, including a CDS filter that rejects candidate islands whose displaced fragment falls within a CDS encoding a Pfam-A domain. This is a substantial relaxation of the previous form of the filter, which rejected a candidate whenever the displaced fragment fell within any CDS. The relaxation admits some false positives, which we have attempted to identify using additional criteria:
- Housekeeping Index: enrichment among the primary island set is calculated for each Pfam-A family, and islands are then scored based on the enrichment factors of their genes.
- G+C Bias
- Integrase-to-End Distance: shortest distance between integrase gene and island end
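
The CDS filter and the Integrase-to-End Distance described above can be illustrated with a minimal Python sketch. It is not the Islander implementation: the `Interval` and `Cds` types, the function names, the 1-based inclusive coordinates, and the reading of "falls in" as any overlap are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Interval:
    """Genomic interval, 1-based and inclusive (an assumed convention)."""
    start: int
    end: int

    def overlaps(self, other: "Interval") -> bool:
        return self.start <= other.end and other.start <= self.end


@dataclass(frozen=True)
class Cds:
    """Hypothetical CDS record; pfam_a is None when no Pfam-A domain was found."""
    span: Interval
    pfam_a: Optional[str] = None


def passes_cds_filter(fragment: Interval, cds_list: Iterable[Cds],
                      strict: bool = False) -> bool:
    """Return False if the displaced fragment overlaps a disqualifying CDS.

    strict=True mimics the earlier filter (any CDS rejects);
    strict=False mimics the relaxed filter (only Pfam-A CDSs reject).
    """
    for cds in cds_list:
        if not fragment.overlaps(cds.span):
            continue
        if strict or cds.pfam_a is not None:
            return False
    return True


def integrase_to_end_distance(island: Interval,
                              integrases: Iterable[Interval]) -> int:
    """Shortest distance from any integrase gene to either island end.

    Assumes every integrase lies inside the island and that at least one
    is present (an empty collection raises ValueError).
    """
    return min(
        min(gene.start - island.start, island.end - gene.end)
        for gene in integrases
    )
```

For example, a displaced fragment that overlaps only a CDS without a Pfam-A domain passes the relaxed filter but would have been rejected by the strict one, which is the difference that lets additional candidates (and some false positives) through.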
Optimal k-means clustering of islands by these features occurred at k=4, and one of the four clusters matched the expectation for false positives on every criterion. Applying the basic algorithm to 2031 whole prokaryotic genomes, we found 4065 islands (this differs from the value reported in the citation because of improved tandem resolution). The false-positive cluster comprised 303 islands, which we removed from the main navigation system. Thus we currently present 3762 islands at the website.
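
The clustering and removal step can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the feature values are invented, the features are assumed to be on comparable scales already, the farthest-first seeding is a choice made here for determinism, and the sketch does not model how the false-positive cluster was recognized or how k=4 was found to be optimal (the flagged label is simply supplied by the caller).

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centroids(points: List[Vector], k: int) -> List[List[float]]:
    """Deterministic seeding (an assumption): start at the first point, then
    repeatedly add the point farthest from all centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(points: List[Vector], k: int, max_iter: int = 100) -> List[int]:
    """Plain Lloyd iterations; returns one cluster label per point."""
    centroids = farthest_first_centroids(points, k)
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def drop_cluster(ids: List[str], labels: List[int], flagged: int) -> List[str]:
    """Keep only the islands whose cluster label is not the flagged one."""
    return [i for i, lab in zip(ids, labels) if lab != flagged]


# Invented toy data: (housekeeping index, G+C bias, integrase-to-end distance),
# two hypothetical islands per well-separated group.
island_ids = ["i1", "i2", "i3", "i4", "i5", "i6", "i7", "i8"]
features = [
    (0.1, 0.0, 0.2), (0.0, 0.1, 0.1),
    (5.0, 5.1, 0.0), (5.1, 5.0, 0.1),
    (0.0, 5.0, 5.0), (0.1, 5.1, 5.1),
    (9.0, 0.0, 9.0), (9.1, 0.1, 9.0),
]
labels = kmeans(features, k=4)
kept = drop_cluster(island_ids, labels, flagged=labels[6])
```

Here the last group plays the role of the flagged cluster, so `kept` retains the other six toy islands, mirroring how the 303 flagged islands were set aside from the 4065 found.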