Intelligent Feature Search

An error-tolerant search algorithm for finding features within GenBank files, with optional filtering by feature type.

The Challenge: Finding Data in Dense Files

GenBank files are data-rich, but their dense, text-based format makes specific information hard to find. A plain text search often falls short for two reasons: researchers usually want matches within a particular feature type, such as gene or CDS, and a single spelling error in the query is enough to miss a result.

Our project introduces an intelligent search layer on top of standard GenBank files. This tool allows users to perform targeted, error-tolerant searches to quickly locate the exact features they need.

How to Use Our Search

Let's walk through an example. Imagine you want to find the traN gene within the GenBank record for P. putida.

You provide the searchword (e.g., "traN") and can optionally specify a subclass to narrow your search. If you select "gene" as the subclass, our tool looks for "traN" only within gene annotations and ignores all other sections. If you leave the subclass blank, the tool searches the entire file.

Error Tolerance is Key

Our search isn't just looking for exact matches. It finds words that are a close match, so even minor typos won't prevent you from getting the results you need. The output is a list of matches, ranked by how closely they match your search term.

The Technology Behind It

Our tool operates in a few simple steps. First, it fetches the requested .gb file. Then, it processes the file line by line, identifying the "subclass" of each line and comparing its contents to your search query.

Understanding Subclasses

We use the term "subclass" to refer to the primary feature types in a GenBank file (GenBank itself calls these feature keys). Restricting a search to one subclass allows for highly specific searches. The most common subclasses are:

  • Source: General information about the entire sequence (organism, strain, etc.).
  • Gene: Marks the location of a specific gene.
  • CDS (Coding Sequence): Marks the region that codes for a protein and carries qualifiers describing it, such as /product.

In the file itself, these features look like this:

     source          1..128921
                     /organism="Pseudomonas putida"
     gene            complement(1..867)
                     /locus_tag="HXC77_RS00005"
     CDS             complement(1..867)
                     /product="Replication initiation protein"
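
As a rough illustration of how each line can be assigned to a subclass, here is a minimal Python sketch. It is not our tool's actual code: the names FEATURE_KEY, lines_with_subclass, and filter_subclass are hypothetical, and the sketch assumes the standard GenBank layout in which a feature key starts in column 6 and qualifier lines are indented further.

```python
import re

# Assumption: standard GenBank feature-table layout, where a feature key
# starts in column 6 (after exactly five spaces) and qualifier lines are
# indented further, so they never match this pattern.
FEATURE_KEY = re.compile(r"^ {5}(\S+)\s+\S")

SAMPLE = """\
     source          1..128921
                     /organism="Pseudomonas putida"
     gene            complement(1..867)
                     /locus_tag="HXC77_RS00005"
     CDS             complement(1..867)
                     /product="Replication initiation protein"
"""

def lines_with_subclass(text):
    """Yield (subclass, line) pairs, carrying the current feature key forward."""
    current = None
    for line in text.splitlines():
        match = FEATURE_KEY.match(line)
        if match:
            current = match.group(1)  # a new feature starts on this line
        yield current, line.strip()

def filter_subclass(text, subclass=None):
    """Return the lines of one subclass, or every line when subclass is None."""
    wanted = subclass.lower() if subclass else None
    return [line for key, line in lines_with_subclass(text)
            if wanted is None or (key is not None and key.lower() == wanted)]

print(filter_subclass(SAMPLE, "gene"))
```

With the subclass set to "gene", only the gene line and its /locus_tag qualifier are returned. Leaving the subclass empty returns all six lines.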

Measuring Closeness with Levenshtein Distance

To provide error-tolerant results, we calculate the Levenshtein distance between your searchword and words found in the file. This distance is the minimum number of single-character edits (insertions, deletions, substitutions) needed to change one word into the other.
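
The distance can be computed with the standard dynamic-programming approach. The sketch below is a generic textbook version, not necessarily the code our tool runs. The function names levenshtein and rank_matches and the candidate words are illustrative assumptions, and the comparison here is case-sensitive, which may differ from the tool's behavior.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

def rank_matches(searchword, words):
    """Sort candidate words from closest to farthest (hypothetical helper)."""
    return sorted(words, key=lambda word: levenshtein(searchword, word))

print(levenshtein("kitten", "sitting"))  # 3
print(rank_matches("traN", ["repA", "tran", "traN", "trbN"]))
```

The second call ranks the exact match "traN" first, followed by the two candidates at distance 1, with "repA" last.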

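Our tolerance for errors is dynamic. The maximum allowed distance follows the formula 4 * (1 - 2^(-a/3)), where 'a' is the word length. For example, the formula gives 2 for a three-letter word and 3 for a six-letter word. Longer words are therefore allowed more typos, but the limit levels off as words get longer and never exceeds 4 edits.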
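
To get a feel for how the formula behaves, it can be evaluated for a few word lengths. In this sketch the function name max_edits is our own label, and the value is left unrounded because how the tool converts it into a whole number of edits is not specified here.

```python
def max_edits(word_length):
    """Evaluate 4 * (1 - 2^(-a/3)) for a word of length a."""
    return 4 * (1 - 2 ** (-word_length / 3))

# The threshold rises quickly for short words and flattens out below 4.
for length in (3, 6, 9, 12):
    print(length, round(max_edits(length), 3))
```

This prints 2.0, 3.0, 3.5, and 3.75 for lengths 3, 6, 9, and 12. Each additional three letters closes half of the remaining gap to 4.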

Conclusion

By combining a targeted subclass system with an error-tolerant matching algorithm, our tool gives researchers a fast and flexible way to navigate the dense information within GenBank files, even when a search term contains a typo.