Java Program to Implement Suffix Tree

«
»
This is a Java Program to implement Suffix Tree. A suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix tree allows a particularly fast implementation of many important string operations. The construction of such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc. Suffix trees also provided one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string’s suffix tree typically requires significantly more space than storing the string itself. This program is based on Mark Nelson’s implementation of Ukkonen’s algorithm.

Here is the source code of the Java program to implement Suffix tree. The Java program is successfully compiled and run on a Windows system. The program output is also shown below.

  1. /*
  2.  *      Java Program to Implement Suffix Tree
  3.  */
  4.  
  5. import java.io.*;
  6.  
  7. /** Class Node **/
  8. class Node     
  9. {    
  10.     public int suffix_node;    
  11.     public static int Count = 1;
  12.  
  13.     /** Constructor **/
  14.     public Node() 
  15.     { 
  16.         suffix_node = -1; 
  17.     }    
  18. }
  19.  
  20. /** Class Suffix Tree **/
  21. class SuffixTree
  22. {
  23.     private static final int MAX_LENGTH = 1000;
  24.     private static final int HASH_TABLE_SIZE = 2179; 
  25.  
  26.     private char[] T = new char[ MAX_LENGTH ];
  27.     private int N;    
  28.     private Edge[] Edges ;    
  29.     private Node[] Nodes ;
  30.  
  31.     private Suffix active;
  32.  
  33.     /** Class Suffix **/
  34.     class Suffix 
  35.     {
  36.         public int origin_node;
  37.         public int first_char_index;
  38.         public int last_char_index;
  39.  
  40.         /** Constructor **/
  41.         public Suffix(int node, int start, int stop )
  42.         {
  43.             origin_node = node ;
  44.             first_char_index = start ;
  45.             last_char_index = stop;
  46.         }
  47.  
  48.         /** Function Implicit  **/
  49.         public boolean Implicit()
  50.         {
  51.             return first_char_index > last_char_index; 
  52.         }
  53.  
  54.         /** Function Explicit  **/
  55.         public boolean Explicit()
  56.         { 
  57.             return first_char_index > last_char_index; 
  58.         }
  59.  
  60.         /** Function Canonize()
  61.          * A suffix in the tree is denoted by a Suffix structure
  62.          * that denotes its last character.  The canonical
  63.          * representation of a suffix for this algorithm requires
  64.          * that the origin_node by the closest node to the end
  65.          * of the tree.  To force this to be true, we have to
  66.          * slide down every edge in our current path until we
  67.          * reach the final node 
  68.          **/
  69.         public void Canonize()
  70.         {
  71.             if (!Explicit() ) 
  72.             {
  73.                 Edge edge = Find( origin_node, T[ first_char_index ] );
  74.                 int edge_span = edge.last_char_index - edge.first_char_index;
  75.  
  76.                 while ( edge_span <= ( last_char_index - first_char_index ) ) 
  77.                 {
  78.                     first_char_index = first_char_index + edge_span + 1;
  79.                     origin_node = edge.end_node;
  80.                     if ( first_char_index <= last_char_index ) 
  81.                     {
  82.                         edge = Find( edge.end_node, T[ first_char_index ] );
  83.                         edge_span = edge.last_char_index - edge.first_char_index;
  84.                     }
  85.                 }
  86.             }
  87.         }
  88.     }
  89.  
  90.     /** Class Edge **/
  91.     class Edge 
  92.     {
  93.         public int first_char_index;
  94.         public int last_char_index;
  95.         public int end_node;
  96.         public int start_node;
  97.  
  98.         /** Constructor **/
  99.         public Edge()
  100.         {
  101.             start_node = -1;
  102.         }
  103.  
  104.         /** Constructor **/
  105.         public Edge( int init_first, int init_last, int parent_node )
  106.         {
  107.             first_char_index = init_first;
  108.             last_char_index = init_last;
  109.             start_node = parent_node;
  110.             end_node = Node.Count++;
  111.         }
  112.  
  113.         /** function Insert ()
  114.          *  A given edge gets a copy of itself inserted into the table
  115.          *  with this function.  It uses a linear probe technique, which
  116.          *  means in the case of a collision, we just step forward through
  117.          *  the table until we find the first unused slot.
  118.          **/
  119.         public void Insert()
  120.         {
  121.             int i = Hash( start_node, T[ first_char_index ] );
  122.             while ( Edges[ i ].start_node != -1 )
  123.                 i = ++i % HASH_TABLE_SIZE;
  124.             Edges[ i ] = this;
  125.         }
  126.  
  127.         /** function SplitEdge ()
  128.          *  This function is called
  129.          *  to split an edge at the point defined by the Suffix argument
  130.          **/
  131.         public int SplitEdge( Suffix s )
  132.         {
  133.             Remove();
  134.             Edge new_edge =  new Edge( first_char_index, first_char_index + s.last_char_index - s.first_char_index, s.origin_node );
  135.             new_edge.Insert();
  136.             Nodes[ new_edge.end_node ].suffix_node = s.origin_node;
  137.             first_char_index += s.last_char_index - s.first_char_index + 1;
  138.             start_node = new_edge.end_node;
  139.             Insert();
  140.             return new_edge.end_node;
  141.         }
  142.  
  143.         /** function Remove ()
  144.          *  This function is called to remove an edge from hash table 
  145.          **/
  146.         public void Remove()
  147.         {
  148.             int i = Hash( start_node, T[ first_char_index ] );
  149.             while ( Edges[ i ].start_node != start_node ||
  150.                     Edges[ i ].first_char_index != first_char_index )
  151.                 i = ++i % HASH_TABLE_SIZE;
  152.             for ( ; ; ) 
  153.             {
  154.                 Edges[ i ].start_node = -1;
  155.                 int j = i;
  156.                 for ( ; ; ) 
  157.                 {
  158.                     i = ++i % HASH_TABLE_SIZE;
  159.                     if ( Edges[ i ].start_node == -1 )
  160.                         return;
  161.                     int r = Hash( Edges[ i ].start_node, T[ Edges[ i ].first_char_index ] );
  162.                     if ( i >= r && r > j )
  163.                         continue;
  164.                     if ( r > j && j > i )
  165.                         continue;
  166.                     if ( j > i && i >= r )
  167.                         continue;
  168.                     break;
  169.                 }
  170.                 Edges[ j ] = Edges[ i ];
  171.             }
  172.         }        
  173.     }   
  174.  
  175.     /** Constructor */
  176.     public SuffixTree()
  177.     {
  178.         Edges = new Edge[ HASH_TABLE_SIZE ];
  179.         for (int i = 0; i < HASH_TABLE_SIZE; i++)
  180.             Edges[i] = new Edge();
  181.         Nodes = new Node[ MAX_LENGTH * 2 ];
  182.         for (int i = 0; i < MAX_LENGTH * 2 ; i++)
  183.             Nodes[i] = new Node();
  184.         active = new Suffix( 0, 0, -1 );   
  185.     }
  186.  
  187.     /** Function Find() - function to find an edge **/
  188.     public Edge Find( int node, int c )
  189.     {
  190.         int i = Hash( node, c );
  191.         for ( ; ; ) 
  192.         {
  193.             if ( Edges[ i ].start_node == node )
  194.                 if ( c == T[ Edges[ i ].first_char_index ] )
  195.                     return Edges[ i ];
  196.                 if ( Edges[ i ].start_node == -1 )
  197.                     return Edges[ i ];
  198.                 i = ++i % HASH_TABLE_SIZE;
  199.             }
  200.     }
  201.  
  202.     /** Function Hash() - edges are inserted into the hash table using this hashing function **/
  203.     public static int Hash( int node, int c )
  204.     {
  205.         return (( node << 8 ) + c ) % HASH_TABLE_SIZE;
  206.     }
  207.  
  208.     /** Function AddPrefix() - called repetitively, once for each of the prefixes of the input string **/
  209.     public void AddPrefix( Suffix active, int last_char_index )
  210.     {
  211.         int parent_node;
  212.         int last_parent_node = -1;
  213.  
  214.         for ( ; ; ) 
  215.         {
  216.             Edge edge;
  217.             parent_node = active.origin_node;
  218.  
  219.             if ( active.Explicit() ) 
  220.             {
  221.                 edge = Find( active.origin_node, T[ last_char_index ] );
  222.                 if ( edge.start_node != -1 )
  223.                     break;
  224.             } 
  225.             else 
  226.             { 
  227.                 edge = Find( active.origin_node, T[ active.first_char_index ] );
  228.                 int span = active.last_char_index - active.first_char_index;
  229.                 if ( T[ edge.first_char_index + span + 1 ] == T[ last_char_index ] )
  230.                     break;
  231.                 parent_node = edge.SplitEdge( active );
  232.             }
  233.  
  234.             Edge new_edge = new Edge( last_char_index, N, parent_node );
  235.             new_edge.Insert();
  236.             if ( last_parent_node > 0 )
  237.                 Nodes[ last_parent_node ].suffix_node = parent_node;
  238.             last_parent_node = parent_node;
  239.  
  240.             if ( active.origin_node == 0 )
  241.                 active.first_char_index++;
  242.             else
  243.                 active.origin_node = Nodes[ active.origin_node ].suffix_node;
  244.             active.Canonize();
  245.         }
  246.         if ( last_parent_node > 0 )
  247.             Nodes[ last_parent_node ].suffix_node = parent_node;
  248.         active.last_char_index++;  
  249.         active.Canonize();
  250.     }
  251.  
  252.     /** Function to print all contents and details of suffix tree **/
  253.     public void dump_edges(int current_n )
  254.     {
  255.         System.out.println(" Start  End  Suf  First Last  String\n");
  256.         for ( int j = 0 ; j < HASH_TABLE_SIZE ; j++ ) 
  257.         {
  258.             Edge s = Edges[j];
  259.             if ( s.start_node == -1 )
  260.                 continue;
  261.             System.out.printf("%5d %5d %3d %5d %6d   ", s.start_node , s.end_node, Nodes[ s.end_node ].suffix_node, s.first_char_index, s.last_char_index);
  262.  
  263.             int top;
  264.             if ( current_n > s.last_char_index )
  265.                 top = s.last_char_index;
  266.             else
  267.                 top = current_n;
  268.             for ( int l = s.first_char_index ; l <= top; l++)
  269.                 System.out.print( T[ l ]);
  270.             System.out.println();
  271.         }
  272.     }    
  273.     /** Main Function **/
  274.     public static void main(String[] args) throws IOException
  275.     { 
  276.         BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
  277.  
  278.         System.out.println("Suffix Tree Test\n");
  279.         System.out.println("Enter string\n"); 
  280.         String str = br.readLine(); 
  281.  
  282.         /** Construct Suffix Tree **/       
  283.         SuffixTree st = new SuffixTree();
  284.         st.T = str.toCharArray();
  285.         st.N = st.T.length - 1;  
  286.  
  287.         for (int i = 0 ; i <= st.N ; i++ )
  288.             st.AddPrefix( st.active, i );
  289.  
  290.         st.dump_edges( st.N );    
  291.     }
  292. }

advertisement
Suffix Tree Test
 
Enter string
 
sanfoundry
 Start  End  Suf  First Last  String
 
    0     2  -1     1      9   anfoundry
    0     9  -1     7      9   dry
    0     4  -1     3      9   foundry
    0     7   0     2      2   n
    0     5  -1     4      9   oundry
    0    10  -1     8      9   ry
    0     1  -1     0      9   sanfoundry
    0     6  -1     5      9   undry
    0    11  -1     9      9   y
    7     8  -1     7      9   dry
    7     3  -1     3      9   foundry

Sanfoundry Global Education & Learning Series – 1000 Java Programs.

Note: Join free Sanfoundry classes at Telegram or Youtube
advertisement
advertisement
If you wish to look at all Java Programming examples, go to Java Programs.

advertisement
advertisement
Subscribe to our Newsletters (Subject-wise). Participate in the Sanfoundry Certification contest to get free Certificate of Merit. Join our social networks below and stay updated with latest contests, videos, internships and jobs!

Youtube | Telegram | LinkedIn | Instagram | Facebook | Twitter | Pinterest
Manish Bhojasia - Founder & CTO at Sanfoundry
Manish Bhojasia, a technology veteran with 20+ years @ Cisco & Wipro, is Founder and CTO at Sanfoundry. He lives in Bangalore, and focuses on development of Linux Kernel, SAN Technologies, Advanced C, Data Structures & Alogrithms. Stay connected with him at LinkedIn.

Subscribe to his free Masterclasses at Youtube & technical discussions at Telegram SanfoundryClasses.