IMAS User Manual
The Interactive Multigenomic Analysis System
Christopher D. Shaw, Ph.D. shaw <at> sfu.ca
Bio/V Lab, School of Interactive Arts & Technology
Simon Fraser University
Copyright 2008
4.4.1. How Codons Are Selected For Highlighting
4.6.1. Find Genes Running Glimmer
4.6.2. Find Genes with long-orfs
4.6.3. Find Genes by Blast Matching Secondary Sequences
5.3. Computing Percent Identity
5.4. Geometry of Blast Results
7.2. Multialignment Source Data
7.3. Clustal-W Output File Storage
This is a user guide for IMAS, the Interactive Multigenomic Analysis System. IMAS contains facilities to load a nucleotide sequence, find genes, search for similar sequences using Blast, multialign similar sequences, and so on. Results of these analyses are presented in a unified graphical view that can be quickly navigated.
When IMAS[1] starts up, you will see the following window:
IMAS presents at least two Tabs that organize the data you are working with into separate themes. The Data Management and NT Sequence Analysis tabs are present in all versions. The Mini Web Browser tab is only present in the Windows version.
This Data Management tab is where you organize the data files for later processing.
The first thing to do is to Open or create a Project, which is a collection of files that contain sequences, databases and the results of Blast analysis, Clustal-W analysis, and so on. In IMAS, you work on only one Project at a time.
The Project file is an XML file that contains a list of other files. The file extension is .pxml, which reminds you that this is an IMAS Project XML file.
The File menu contains commands for creating or opening projects:
If this is the first time you are working on a sequence, choose New Project. This will create a new project internal to IMAS named IMASProject.pxml. IMAS will immediately pop up a file selection box that prompts you to save IMASProject.pxml in your home directory. Feel free to rename this file and store it in any folder you want. The folder that you choose is called the Project Folder.
Choose a project filename that is not identical to one that already exists in the project folder. Once you start doing analysis, IMAS will create files and subfolders in the project folder. Among these automatically created subfolders are: BlastResults, HmmSearchResults, MultiAlignResults, and internal_DB. These hold the results of Blast, hmmpfam, Clustal-W, and secondary sequence analysis runs, respectively.
If you have run IMAS before and want to open a previously saved project, choose Open Project, which pops up a file selection box that allows you to choose a project file that you saved earlier.
Once a project has been opened or created, you can save it.
Save Project saves the current project in the project file that you loaded it from. Save overwrites previous copies.
Save Project As will pop up a file selection box in which you will choose a project name and project folder.
The File->Preferences menu item pops up this dialog box:
The purpose of this dialog box is to set system configuration information. The most important settings are the location of the analysis programs that IMAS uses, and the location of temporary disk storage. The Preferences dialog requires the folders for Glimmer, Blast, Clustal-W, Hmmer, and Primer3. All of these can be found in <IMASInstallFolder>\tools. You would usually only need to set these values once. The Temp directory can be anywhere you want. C:\Temp or C:\WUTemp are reasonable places. Make sure that the temporary directory you choose exists. On MacOS and Linux, /tmp is used, which always exists.
You must set the Temp directory in order to run external analysis programs, because IMAS places input files and parameter files in the Temp directory.
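Before setting the Temp directory in Preferences, you can confirm that a candidate folder exists and is writable. The following is a minimal sketch in Python, independent of IMAS; the function name check_temp_dir is hypothetical and is not part of IMAS.

```python
import os
import tempfile


def check_temp_dir(path):
    """Return a list of problems with a candidate Temp directory.

    An empty list means the folder exists and appears writable.
    This is an illustrative check, not something IMAS itself runs.
    """
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path} does not exist or is not a directory")
    elif not os.access(path, os.W_OK):
        problems.append(f"{path} is not writable")
    return problems


if __name__ == "__main__":
    # Check the operating system's default temporary folder as an example.
    candidate = tempfile.gettempdir()
    print(candidate, check_temp_dir(candidate) or "looks usable")
```

The check does not detect a full disk, which is another cause of failed analysis runs.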
IMAS operates on one Primary Sequence at a time. This is the nucleotide sequence that you are analyzing to find genes, align sequences, and so on. In IMAS, a Primary Sequence contains numerous analyses of the NT sequence, which can be viewed and worked on in the NT Sequence Analysis tab. The purpose of IMAS is to help you add to your knowledge about the primary sequence, and to develop an annotation that represents that knowledge. You build that knowledge by performing various kinds of sequence analysis, interpreting the results, and by recording what you have learned using the IMAS note-taking facility.
An IMAS Project can also contain one or more Secondary Sequences. Each Secondary Sequence contains information about an organism that is typically a close family relative of the Primary Sequence. At the moment, a Secondary Sequence is just a multi-Fasta file that contains all of the NT Open Reading Frames of the given organism. This will eventually be the INSDSeq XML file (or equivalent), but that facility is not yet complete.
The primary purpose of a Secondary Sequence is to identify and package all of the genomic data for a single organism, along with any derived annotation information. Eventually, this will come to represent more biological information than just the multi-Fasta NT ORFs.
To load a Primary Sequence for processing in the project, choose the File ->Load Primary Sequence file menu item. The file must be in FASTA format, containing one NT sequence.
This image shows the results of having loaded “Rric85k.txt”. The Primary Sequences section of the display shows what has been loaded. The first line of the FASTA file was
>Rrickettsii genome 1-85450 bp
Double-click on a Primary Sequence to show it in the NT Sequence Analysis tab of IMAS.
Each sequence that you load is translated by IMAS into an XML file that is saved on disk. Thus, loading a Primary or Secondary sequence reads the Fasta file and creates a new file in the Project Folder with the same base filename as the Fasta file, but with the extension .sxml. This Sequence XML file contains the NT sequence, 6 AA translated reading frames, plus a list of Features (ORFs) in the sequence. The sxml file contains references to all other files that are the result of analysis of this sequence, such as Blast searches, Clustal-W multialignments, and Hmmer searches.
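To make the "6 AA translated reading frames" concrete, here is a minimal Python sketch of a six-frame translation using the standard genetic code. It is not the IMAS implementation. The frame numbering (1 to 3 forward, -1 to -3 on the reverse complement, counted from the end of the sequence) and the use of X for codons containing unknown bases are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(nt):
    """Complement each base, then reverse the string."""
    return nt.translate(COMPLEMENT)[::-1]


def translate(nt):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X")
                   for i in range(0, len(nt) - 2, 3))


def six_frames(nt):
    """Return {frame: AA string} for frames 1, 2, 3, -1, -2, -3."""
    nt = nt.upper()
    rc = reverse_complement(nt)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(nt[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, aa in sorted(six_frames("ATGGCCTAA").items()):
        print(frame, aa)
```

Note that the reverse frames here are returned in their own reading order; IMAS draws them right-to-left so that each residue sits under its codon.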
There are three types of databases that IMAS can use for analysis: Blast NT databases, Blast AA databases, and Hmmer Motif databases, such as the Pfam database. These databases are displayed in the right two columns of the Data Management tab. The image below shows empty lists of databases.
IMAS performs Blast searches against databases that you have identified and formatted in advance, and that are stored in files somewhere on your disks. Typically, you will want to do a Blast search on collections of these files, so IMAS allows you to define Groups of databases. Each group can contain one or more database files. You can select files, and add or remove files to and from groups using the Database Groups dialog. Note that the databases listed here must be files on your computer’s filesystem.
To select databases to use, select Database Files->Database Groups. This pops up the Database Groups Dialog, shown here. This dialog controls a Database Group XML file that lists database files and collects them into groups. The file extension is .dxml, which reminds you that this is an IMAS Database list. If the Database Files->Database Groups menu item is grayed out, this means that you have not yet opened or created a project. You must have a Project open to edit database groups. The Project file stores the name of the database group .dxml file so that you have it available next time you open the project.
The Database Groups dialog lets you edit groups of databases, select or delete lists of databases, and so on. When you are finished with creating groups and assigning database files to them, you can press the Save DXML File button, which saves the chosen .dxml file.
The Save DXML File As.. button pops up a file selection dialog that enables you to select a new .dxml filename.
Hitting Cancel discards all changes you made and dismisses the dialog.
Database Groups are further collected into fixed types: Nucleotide, Amino Acid, and Motif. To start editing groups and selecting files, select a database group file, or create one. In the IMAS install folder, there is a database folder that contains the group file DBGroups1.dxml. Once you have selected a file, you can select Nucleotide, Amino Acid, or Motif. The picture shows that Nucleotide has been selected from DBGroups1.dxml. On the right pane is a list of database files. These files are multi-Fasta files that have been processed by formatdb. Here we select only the base filename, not the .nin, .nhr etc files that formatdb generates.
On the left side is a combobox with the current collection of groups. The currently selected group is shown, and in the pane below is the group’s list of database files.
§ To add files to a group, select the file on the right and press the < button in the center. To delete files from a group, press the > button. More than one file can be selected at a time by holding the shift key or the ctrl (Command on Mac) key while you select files with the mouse.
§ To add a group, press the Add Group button on the lower left. To rename a group, press the Rename Group button. This pops up a dialog for you to type in a new group name. Choose whatever you want.
§ To delete a group, press the Delete Group button. This does not affect the list of files on the right. This only deletes the group name.
§ To add one or more files, press the Add File button. This will pop up a file selection dialog where you can choose database files to run queries on. Make sure that you select the appropriate kind of database file for the current category (eg. NT data for the NT group). Doing NT queries on AA files will yield no results.
§ To delete a selected file from the file list and from all groups, press the Delete File in All Groups button.
§ Press Save DXML File to save the database Group .dxml file.
§ Press Cancel to return to the main window without saving any changes.
The Project file records the name of the DBGroup dxml file internally, so that when you open a project again, the DB group dxml file is also loaded.
When you return to the Data Management tab, you will see that the selected NT group, AA group and Motif group each have their list of files, and that you can select between groups. By default, when you select a group, all files are highlighted, which means that all of these files will be used in Blast/Hmmer runs of this type. You can shift-select or ctrl-select (cmd-select on Mac) which of these files you want to Blast against at any time. If you wish to run against a single file, then select it, and subsequent Blast runs will only use this file.
Double-clicking on a primary sequence opens the NT Sequence Analysis pane, as shown in the image. At the top of this pane is the name of the whole sequence, which came from the >Fasta Comment Line of the sequence file. The short white text box below it contains text information based on user selection of elements in the main analysis area.
The largest area of the screen is the Main Analysis Area. This is where most of your interaction will take place, and is the place where analysis results are visualized. The area’s operations will be explained in greater detail in the following sections.
Below the main analysis area is a log window that displays text in response to user actions such as loading projects, performing analyses, etc. In this window, you can see details of where analysis files have been stored. You can scroll back 3000 or so lines to see what has happened previously during the current session. You can also select text and copy it to the system clipboard.
Below the log window is a status line that displays one-line hints about menu operations as you move the mouse over them.
The primary purpose of the top-level text window is for IMAS to output details of items that have been selected in the Main Analysis Area. The following sections will explain what text will appear. You can select text in this area by pressing the left mouse button in the usual manner. Clicking the right mouse button (ctrl-click on a Mac) pops up a menu with two operations.
§ Copy is the standard operation that copies the highlighted text into the system Clipboard for future paste operations into other applications, like a Web Browser.
§ Search For Selected Text on NCBI (shown in the Mini Web Browser Tab) (Windows Only) takes the selected text and formats a query to the NCBI Entrez search engine. The intent of this is for the user to select something like an accession number in the text window and query Entrez with it. The results show up in the Mini Web Browser tab, which is a version of Microsoft Internet Explorer that has been “swallowed” by IMAS. In this web browser, you can click on hotlinks etc, but you can’t type in URLs, and the navigation bar is missing.
On the Mac or Linux, paste the copied text into the NCBI/EBI/DDBJ text search box.
At the top of the main analysis area is a ruler that shows the nucleotide location, based at 0. Below that is the NT sequence. The NT sequence is shown only as positive strand information. The NT complement is not explicitly shown. Below that is a plot of GC content, and below that are the 3 forward and 3 complement reading frames. The 3 forward frames are read visually left-to-right. The 3 complement reading frames are read right-to-left. This is so that the geometric location of each AA residue is located below its corresponding NT codon.
The NT row, the GC plot row, and the 6 reading frames are always present in the display.
On the right of the main analysis window is a standard linear scrollbar. On the left is a color-coded nonlinear scrollbar that can be used to jump vertically to a particular “channel” of information. The size of each colored region is the logarithm of the vertical size of the corresponding data type.
In IMAS, we use color to distinguish data types.
§ Thus, Blast results are blue.
§ ORFs are green.
§ Multialignments have an orange background.
Press the left mouse button in this region to jump to the corresponding region, and drag the mouse to smoothly scroll vertically. The available channels are NT, GC, 6 AA channels, the ORF channel, the Blast channel, and the Multialignment channel.
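The logarithmic sizing of the colored scrollbar regions can be illustrated with a short Python sketch. The manual states only that each region's size is the logarithm of the channel's vertical size; the use of log(1 + height), the normalization to the scrollbar height, and the function name are assumptions made here.

```python
import math


def channel_bar_heights(channel_heights, bar_height):
    """Split bar_height pixels among channels in proportion to log(1 + height).

    channel_heights maps a channel name to its vertical size in the display.
    The exact formula IMAS uses is not documented; this is one plausible form.
    """
    weights = {name: math.log1p(h) for name, h in channel_heights.items()}
    total = sum(weights.values())
    if total == 0:
        # Every channel is empty: share the bar equally.
        return {name: bar_height / len(weights) for name in weights}
    return {name: bar_height * w / total for name, w in weights.items()}


if __name__ == "__main__":
    sizes = {"NT": 10, "ORF": 40, "Blast": 1000}
    for name, pixels in channel_bar_heights(sizes, 300).items():
        print(f"{name}: {pixels:.1f} px")
```

The point of the logarithm is that a very tall channel, such as a large set of Blast results, does not squeeze the small channels out of the scrollbar.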
The main analysis window can display sequences, ORFs, Blast alignments and multialignments at various levels of horizontal scale. Initially, the displayed scale is fully zoomed in, displaying all information at maximum magnification. The scale at this level is 1 nucleotide to 1 character.
To zoom out (reduce magnification), press the – (minus) key. This will immediately redraw the main analysis window at the next smaller level of magnification. To zoom in, press the = key. At the magnification ratio 3 NT : 1 Char, the NT letters are not shown, and the AA letters are shown tightly spaced. For the ratios beyond (6:1, 15:1, 30:1, 75:1, 225:1, 750:1, 2000:1, 5000:1, 20000:1), AA letters are also not shown, and the visual representation of the various analyses listed below changes to show the most relevant information. Necessary information such as text labels is kept at a readable size, so as the containing item shrinks when the display is zoomed out, some of the label may be truncated.
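The zoom behavior described above can be summarized in a small Python sketch. The list of ratios comes from the text; treating 1:1 and 3:1 as adjacent levels, clamping at both ends, and the function names are assumptions for illustration.

```python
# Nucleotides per character at each zoom level, most magnified first.
ZOOM_RATIOS = [1, 3, 6, 15, 30, 75, 225, 750, 2000, 5000, 20000]


def zoom(index, key):
    """Return the new zoom level index after pressing '-' (out) or '=' (in)."""
    if key == "-":
        index += 1
    elif key == "=":
        index -= 1
    # Stay within the available levels.
    return max(0, min(index, len(ZOOM_RATIOS) - 1))


def visible_letters(ratio):
    """Report which sequence letters are drawn at a given NT:Char ratio."""
    if ratio == 1:
        return {"NT": True, "AA": True}
    if ratio == 3:
        return {"NT": False, "AA": True}
    return {"NT": False, "AA": False}


if __name__ == "__main__":
    level = 0
    for key in "---=":
        level = zoom(level, key)
        ratio = ZOOM_RATIOS[level]
        print(f"{ratio}:1", visible_letters(ratio))
```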
The NT sequence and the AA sequences are selectable. To highlight a section of DNA, click the left mouse button in the NT channel. IMAS will display a cyan arrow indicating one endpoint of a selected region of DNA. You may then navigate horizontally (zooming in and out as necessary) to the other end of the desired segment of DNA, and select it by clicking the left mouse button at the desired point. The selected NT region will contain 3n nucleotides, anchored at the first selection, as shown in the two images on the right. The selection shown there is CAAAGCTAATTA, which is 12 NTs. If you want to start or end at a different nucleotide, you must select again, taking care to precisely locate the mouse pointer.
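One way to picture the 3n rule is the following Python sketch, which keeps the first click fixed and extends the selection away from it until the length is a multiple of 3. Rounding up (rather than down) and the absence of clamping at the sequence ends are assumptions; the function name is hypothetical.

```python
def snap_selection(anchor, other):
    """Return an inclusive, 0-based (start, end) whose length is a multiple of 3.

    anchor is the first clicked position and stays fixed; the other end is
    moved outward as needed. No clamping to the sequence bounds is done here.
    """
    length = abs(other - anchor) + 1
    length += (-length) % 3  # round up to the next multiple of 3
    if other >= anchor:
        return anchor, anchor + length - 1
    return anchor - length + 1, anchor


if __name__ == "__main__":
    print(snap_selection(4, 15))  # already 12 NTs long
    print(snap_selection(4, 13))  # 10 NTs, extended to 12
```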
With the selected region, you can pop up a menu using the right mouse button in the NT channel (ctrl-click on Mac). The choices are as follows:
§ Create Feature, which creates a forward-strand “ORF” (Feature) over the segment you have selected.
§ Create Reverse Complement Feature, which creates a reverse complement “ORF” over the segment you have selected. Each of these “ORFs” will be displayed in the ORF channel, and will be named UserNTSeq.
§ Copy (Fasta Format) copies the selected NT sequence into the system clipboard, which you can then paste into another application. The result of this operation is shown below. The Fasta comment is formatted with the name of the sequence from the Primary Sequence Fasta comment, plus an indicator of which NTs were selected (4..15 inclusive in this case).
>Rrickettsii genome 1-85450 bp:4-15
CAAAGCTAATTA
§ Copy Reverse Complement (Fasta Format) copies the selected NT sequence in reverse complement into the system clipboard. The NT index numbers after the colon indicate complement, and state the index numbers in reverse order. Here is the result of that operation:
>Rrickettsii genome 1-85450 bp:c15-4
TAATTAGCTTTG
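The two copy formats shown above can be reproduced with a short Python sketch. It assumes 0-based, inclusive indices (consistent with the ruler being based at 0) and does no line wrapping; the function name copy_fasta and the padding bases in the example are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def copy_fasta(name, seq, start, end, reverse=False):
    """Format seq[start..end] (inclusive, 0-based) as a two-line Fasta record.

    With reverse=True the segment is reverse-complemented and the header
    uses the 'c<end>-<start>' form shown in the manual.
    """
    segment = seq[start:end + 1]
    if reverse:
        segment = segment.translate(COMPLEMENT)[::-1]
        header = f">{name}:c{end}-{start}"
    else:
        header = f">{name}:{start}-{end}"
    return f"{header}\n{segment}"


if __name__ == "__main__":
    # Four made-up bases on each side of the manual's example selection.
    seq = "GGGG" + "CAAAGCTAATTA" + "GGGG"
    name = "Rrickettsii genome 1-85450 bp"
    print(copy_fasta(name, seq, 4, 15))
    print(copy_fasta(name, seq, 4, 15, reverse=True))
```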
Amino Acid sequence selection works in much the same way as NT selection. Click the left mouse button to set one end of the selected region, and click again to select the other end. Rounding up to a multiple of 3 NTs does not apply in this case.
§ Copy (Fasta Format) copies the AA sequence into the system clipboard. If the selected reading frame is one of the reverse-complement reading frames, then the sequence will be in reverse order, as below for the selected region shown in the image.
Rrickettsii genome 1-85450 bp:AA17-0
NN*LCH
§ Create Feature in this Reading Frame, which creates an “ORF” (Feature) over the segment you have selected in the proper (forward or reverse) reading frame.
IMAS has a facility to highlight Codons in the Main Analysis Area. This is typically used to highlight alternate Start usage, and so on. These highlights can be displayed in the NT channel, the 6 AA channels, and in other related analysis displays. To choose which Codons to highlight, select View->Codon Highlight Controls… from the main menu bar, which will pop up the Codon Highlight Dialog as shown below.
In the bottom half of the dialog is a table of Codons that you can left-click to change colors. Selecting one of these will pop up a color selection dialog, in which you may select a color for the text background. This dialog shows a number of standard colors, and you may design your own custom colors if you wish. Once you click OK in the color selection dialog, the background color of the codon will change to the chosen color. In the image, you can see that ATG has been highlighted orange. Although you may choose any color, it’s wise to select light colors, because dark colors will make the letters difficult to read.
You may select and re-select colors as you see fit. When the OK button is selected, the current color choices are reflected in the Main Analysis Window. To turn off a particular highlight, select the codon to be deactivated, and choose white.
IMAS uses the standard translation table from NT to AA, as shown. For the Codon highlights, only the NT values are used for shading, and the translations shown in the dialog are just a reminder. There will be an option in the future to choose different translation tables.
The top segment of the dialog shows some brief instructions, and a checklist of what types of things in the Main Analysis Area will get Codon Highlights. Check/uncheck these to select what channels of information will be highlighted. Below that is a pair of buttons that toggle whether forward strand codons are used, or both forward and reverse complement.
The algorithm for codon selection is simply that for each NT triplet that does not have a white background, IMAS searches the NT sequence for occurrences of that triplet. Thus, in the NT channel, matching codons in any reading frame will be selected. Any 3-letter sequence that matches will be selected and highlighted. These highlights can be drawn in the AA channels as well, but in this case, only codons that are in frame are selected. Similarly, Features, Blast Hits and Multialignments (which have an assigned reading frame) will be highlighted by in-frame codons only. The drawback of this approach is that there will be many overlapping highlights in the NT channel, making it hard to understand. Display of these highlights in the NT channel can be turned off.
Selecting the Forward and Reverse Complement radio button in the dialog allows string matching on the Forward and the reverse complement sequences. For the NT channel, this will add visual clutter, but the frame selection will be active in the other channels.
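The matching rule described in this section can be sketched in Python as follows. In the NT channel every offset is tested; for a channel with a reading frame only every third position is tested. This is an illustration of the idea, not the IMAS code: the function names, the 0-based frame offsets, and reporting reverse-strand hits as forward-strand start positions are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_codon_hits(seq, codons, frame=None):
    """Return start indices of triplets in seq that belong to codons.

    frame=None tests every offset (NT channel behavior).
    frame=0, 1 or 2 tests only in-frame triplets (AA channel behavior).
    """
    first, step = (0, 1) if frame is None else (frame, 3)
    return [i for i in range(first, len(seq) - 2, step)
            if seq[i:i + 3] in codons]


def find_reverse_hits(seq, codons, frame=None):
    """Match on the reverse complement; report forward-strand start indices."""
    rc = seq.translate(COMPLEMENT)[::-1]
    n = len(seq)
    # A hit at index i on the reverse complement covers forward
    # positions n-i-3 .. n-i-1.
    return sorted(n - i - 3 for i in find_codon_hits(rc, codons, frame))


if __name__ == "__main__":
    seq = "AATGCATG"
    print(find_codon_hits(seq, {"ATG"}))           # any reading frame
    print(find_codon_hits(seq, {"ATG"}, frame=1))  # one frame only
    print(find_reverse_hits(seq, {"ATG"}))         # reverse complement
```

Because these are plain position matches, they behave like the geometric highlights described in the text: gaps in an alignment are not taken into account.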
Please note that these highlights are computed as geometric entities, not as string searches in Blast or Features. For Blast alignments, gaps inserted into the query are not accounted for, so if gaps occur, the highlight will be located in the wrong place.
For Blast Alignments that are drawn in reverse complement order, the highlights are relocated starting from the right instead of the left. Reversing the display (see section 5.2) will also reverse the display of highlights.
Similar to Codon highlighting, IMAS has a facility to highlight selected Amino Acids in the Main Analysis Area. This can be used to infer secondary structure, etc. These highlights can be displayed in the 6 AA channels, and in other related analysis displays. Select View->AA Highlight Controls… from the main menu bar to pop up the AA Highlight Dialog as shown below.
In the bottom half of the dialog is a table of Amino Acids that you can left-click to pop up the color selection dialog. In the left column, you can select the color of a single Amino Acid at a time.
The BLOSUM column allows you to select “similar” Amino Acids according to the values in the BLOSUM62 substitution matrix. Choosing a color for one of these results in a color change for all of the AAs to the right of the arrow. For example, picking the A ->A S in the first row will set the same background color for A and S. The logic behind the BLOSUM62 highlights is that for a given Amino Acid, any BLOSUM62 entry with a positive value is a possible substitute.
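The "positive value" rule can be checked with a few lines of Python. Only the alanine (A) row of BLOSUM62 is included here to keep the sketch short; a complete table would be needed for other residues. The function name is hypothetical.

```python
# Column order of the 20 standard amino acids in BLOSUM62.
AA_ORDER = "ARNDCQEGHILKMFPSTWYV"

# Partial table: only the row for alanine (A) is listed in this sketch.
BLOSUM62_ROWS = {
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1,
          -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
}


def positive_substitutes(aa, rows=BLOSUM62_ROWS):
    """Return the amino acids whose BLOSUM62 score against aa is positive."""
    return [other for other, score in zip(AA_ORDER, rows[aa]) if score > 0]


if __name__ == "__main__":
    print("A ->", " ".join(positive_substitutes("A")))
```

For alanine this yields A and S, which agrees with the A ->A S entry in the first row of the dialog.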
Similarly, in the column headed BioChem is a set of categories of Amino Acids that you can select for the same highlighting. The screenshot shows that the Very Hydrophobic cell has been turned orange, which highlights C, F, I, L, M, V, and W. These categories are here for convenience, and they of course depend on the biochemical environment, so you should select your own sets of Amino Acids as you see fit.
If Codon Highlights and Amino Acid Highlights are selected, they will both appear in the display, possibly causing confusion.
You can display the selected Amino Acid Highlights in Features, Blast Alignments, and so on. Like Codon Highlights, AA Highlights are geometric items, so gaps will cause highlight displacement.
In the future, we will fix the displacement problem, and we will probably add a facility for highlighting more general types of strings, such as restriction enzyme sites.
In this document, we use the more generic term Feature to refer to a segment of DNA in the primary sequence that can be identified, named and displayed in the ORF channel. An ORF obviously fits in this category. In the future, other types of non-coding DNA segments may also be displayed in this channel, likely using a different color.
Currently, when a Feature is created/found/identified in the Primary Sequence, its segment of the NT sequence is marked by a green rectangle in the ORF channel in the main analysis area. The image to the right shows the Feature named Rricke10, which is a positive-strand ORF. Its reading direction is marked on the right by a black direction arrow. The location of this arrow is at the right end of the Feature in this case, because most of the Rricke10 ORF extends to the left of the left boundary in this screenshot. The direction arrow will be in the rightmost pixels in the window when the Feature/ORF extends beyond the right window edge. The reading frame is also marked by the single digit to the left (3 in this case). Negative strand Features are marked -1, -2 or -3. Also, the corresponding region in the corresponding AA channel is highlighted green (the third channel in this case).
A significant use of Features in IMAS is to provide an anchor for further analysis. That is, a Feature is used to package a segment of DNA for analysis, and as the place to geometrically locate the results of that analysis. A Feature also stores a name and textual annotation information. If a Feature is an ORF, the goal of analysis is to determine if it is a gene, find what genes it is similar to, multialign it with similar genes, and so on. For intergenic Features, other types of analysis are envisioned.
There are three methods for finding/creating/identifying ORFs in the NT sequence, all accessible from the main Analyze menu in the menu bar.
The first gene finder is Glimmer3.02. To run Glimmer, select Analyze->Run Glimmer on Selected NT Sequence. This will run the locally-provided Glimmer 3 on the sequence[2]. You will notice an MS-DOS window pop up briefly while IMAS runs the external Glimmer program. (On a Mac, external programs like Glimmer run quietly in the background). Running Glimmer in this way will result in a set of ORFs being displayed in the ORF channel below the 6 AA sequences.
IMAS actually generates a script that trains Glimmer3.02 with the Glimmer3.02 build-icm training program, then runs Glimmer3.02 using this training information. The temporary files for this operation are placed in the Temp directory you set up in the Preferences dialog. You may look at these training files and result files if you wish. The location of the generated ORFs is stored inside the Project file, so the output files in Temp are not used again once IMAS has completed its run of Glimmer. If Glimmer fails for some reason, you will see no addition of Features to the display, and the log window will print out some sort of diagnostic message that can help you figure out the problem. The most common problems are that the location of the Glimmer executables was not set correctly in the Preferences dialog, or that the Temp folder was not set correctly, or perhaps the Temp folder is on a full disk.
Currently, Glimmer cannot be parameterized, and IMAS uses the default Glimmer parameters for overlap and so on.
The second method for finding genes is the somewhat less sophisticated ORF finder included within IMAS. It simply looks for start/stop codon pairs of sufficient length without intervening stops. We have set the minimum length of ORFs reported to be 50 NT, so this will create a large number of junk ORFs.
No files are created externally for this operation, and the results are stored in the Project file.
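The kind of search this ORF finder performs can be sketched in Python. This is not the IMAS code. It assumes ATG as the only start codon, scans the forward strand only, pairs each stop with the first start seen since the previous stop, and counts both the start and stop codons toward the 50 NT minimum. The reverse strand could be handled by running the same function on the reverse complement.

```python
STARTS = {"ATG"}                 # assumption: no alternate start codons
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=50):
    """Return (start, end_exclusive, frame) tuples for forward-strand ORFs.

    frame is the 0-based offset (0, 1 or 2) of the reading frame.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i            # first start since the last stop
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None         # look for the next start/stop pair
    return orfs


if __name__ == "__main__":
    # A made-up 54 NT ORF: start codon, 16 alanine codons, stop codon.
    seq = "ATG" + "GCT" * 16 + "TAA"
    print(find_orfs(seq))
```

With a threshold this low, random sequence will produce many short hits, which is why the text warns about junk ORFs.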
The following IMAS feature does not work at this time and has been turned off.
The third method for finding genes uses the secondary sequences you have loaded. The idea of this “Gene Finder” is to take each of the NT ORFs in the selected secondary sequence and search for the best sequence alignment in the Primary sequence, and mark that as a Feature. Each Feature that is found is named after the ORF in the original secondary sequence that aligns to it. Blast hits that are too short (less than 50 NT) are discarded.
One drawback with this technique is that because the quality of the sequence alignment may be poor near the beginning or the end of a sequence, there is no guarantee that the found Feature will be in the correct reading frame. If the reading frame is ambiguous, IMAS selects the first appropriate (forward or reverse) reading frame that has no stop codons in it. The user must decide if this is the correct reading frame or not.
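The fallback rule of taking the first reading frame with no stop codons can be sketched in Python as below. The details are assumptions: offsets are tried in the order 0, 1, 2 on a segment that is already oriented in its reading direction, and a stop codon anywhere in a frame, including at the very end, disqualifies it.

```python
STOPS = {"TAA", "TAG", "TGA"}


def first_open_frame(segment):
    """Return the first offset (0, 1 or 2) whose codons contain no stop.

    Returns None when every frame contains at least one stop codon.
    """
    for offset in range(3):
        codons = (segment[i:i + 3]
                  for i in range(offset, len(segment) - 2, 3))
        if not any(codon in STOPS for codon in codons):
            return offset
    return None


if __name__ == "__main__":
    print(first_open_frame("ATGTAAGCT"))    # frame 0 has a stop
    print(first_open_frame("TAAATAAATAA"))  # stops in all three frames
```

As the text notes, a frame chosen this way is only a candidate, and the user still has to judge whether it is the correct one.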
Unlike Glimmer and long-orfs, this process takes a long time to execute, since it must run Blast for each of the ORFs in the secondary sequence. Therefore, you should plan to run this against an entire Primary sequence or a big subset.
The mechanics of this command are that IMAS creates a Blast NT database in a folder named internal_DB in the Project folder. Blast is run for each secondary sequence ORF against this database. Each sequence alignment of sufficient quality from each Blast run is used to create a Feature in the Primary sequence, shown in the ORF channel.
The green Feature boxes in the ORF channel can be selected for further operations by clicking the left mouse button (ctrl-click on Mac). Selected Feature boxes are a brighter green. Holding ctrl (cmd on Mac) while clicking a Feature box toggles the selection of it. This clicking method is operational everywhere in IMAS that it makes sense to select a collection of like objects.
Selecting a Feature will display the Feature’s name, reading frame, NT location and notes in the text window above the main pane. In some cases, many lines will be necessary and you can use its vertical scrollbar.
Pressing the right mouse button (ctrl-click on Mac) over a Feature pops up the Feature menu, which will be labeled with the name of the Feature that has been selected. The four chunks of this menu enable Blast search/alignment, Hmmer search, Feature data editing, and copying sequence data to the clipboard. The Blast and Hmmer operations will be covered in the following sections. If you have not explicitly selected a Feature, then you will get a menu for the Feature under the cursor.
§ Rename pops up a dialog box with the current name of the selected feature. Simply type a new name for the Feature. If more than one Feature was selected, you will be presented with a dialog box for each Feature to rename.
§ Add/Edit Note pops up a text dialog for each Feature to add comments. The purpose of this is to allow the user to type in annotations and other information, perhaps copied and pasted from Blast-aligned homologs. We may also include a more structured form-filling interface for Features in the future.
§ Generate HTML Report of This Feature creates an HTML document that lists the name, location, NT Sequence, AA sequence, and notes of this feature. On Windows, the mini web browser loads the resulting HTML file. The file is located in the folder <ProjectFolder>/HTMLReport.
§ Delete Feature removes the selected Features from the ORF channel. This information is still stored in the database. The purpose of this is to throw out junk ORFs. There is currently no method to unhide Features.
§ Copy NT Sequence (FASTA Format) copies the NT sequence of the Feature into the system clipboard suitable for pasting into other documents. The Fasta comment line is in the same format as shown in section 4.3. The NT sequence includes the start and stop codons. The sequence is copied in the appropriate order. That is, if this is a reverse complement strand Feature, you will get the reverse-complement sequence.
§ Copy AA Sequence (FASTA Format) copies the AA sequence of the Feature into the system clipboard. The Fasta comment line is in the same format as shown in section 4.3.1. The AA sequence includes the stop symbol. The result is shown in the log window.
To run Blast on part of the primary NT sequence, you must identify which Feature is going to trigger the results. This facility was designed with the idea that an ORF is the sequence item to query a Blast database with. There are 4 Blast operations, accessed by right-clicking the mouse on the selected Feature.
§ Blast NT Feature against selected NT DBs takes the NT sequence of the selected Feature(s) and queries the NT databases that have been selected in the Data Management tab (section 3.2). The NT sequence of the selected Feature will be used as the query for each NT Blast database that was highlighted in the Nucleotide Sequence Database section of the Data Management tab. For example, if 3 databases were highlighted, Blast will query each of these 3 databases. If more than one Feature was selected for this operation, each Feature NT Sequence will be queried against these 3 databases.
§ Blast AA Feature against selected AA DBs takes the AA sequence of the selected Feature(s) and queries the AA databases that have been selected in the Data Management Tab. Each Feature is queried against each highlighted AA Blast database.
For each Blast query, an MS-DOS window will briefly pop up, and the results will be visually presented below the corresponding Feature. For small databases (1-2Mbase), these local queries are quick, and you will get to see the results right away.
§ Blast NT Feature against nr NT database at NCBI takes the NT sequence of the selected Feature(s) and queries the nr NT databases at NCBI. You need to be online for this to work. The query is made by a web-enabled version of Blast that submits a query to NCBI and collects the results.
§ Blast AA Feature against nr AA database at NCBI takes the AA sequence of the selected Feature(s) and queries the nr AA databases at NCBI. You need to be online for this to work. The query is made by a web-enabled version of Blast that submits a query to NCBI and collects the results.
Each NCBI query will take between 1 and 5 minutes, depending on the response time of the NCBI server. Unfortunately, you will not be able to use IMAS while the query is being processed. We are thinking of enhancing IMAS so that these queries can be queued up as background processes to allow the user to examine results while the NCBI servers work.
For a single Blast run (single query against single DB), the High-Scoring Pair (HSP) results are plotted inside a blue outline box that encloses the maximum extent of all of the results. At maximum detail, the HSPs are presented similarly to standard human-readable Blast, except that the HSP alignment is presented continuously from left to right. Scrolling left or right will allow you to see more of the alignment. Zooming out will allow you to see a percent identity plot of alignment quality, and allow you to see the entire alignment in one image. Instead of presenting index numbers of Query and Subject, each alignment is anchored at the left end, so that it can be located in the context of the query Feature. Gaps in the alignment will cause the results to fall out of alignment with the top channel NT sequence.
To temporarily realign a Blast HSP with the NT or AA channel, hold down the alt key (both Mac and PC) as you move the mouse. Blast runs with Query gaps will move left to realign with the global NT sequence.
Each Blast HSP is presented in a four-row blue box. The top row shows the name of the aligned sequence. The next 3 rows show the query, alignment midline, and subject (hit) sequence.
The image shows the leftmost pieces of 4 such alignments. In this case, the labels are the same for each hit because the queried database is the entire R.akari NT sequence. The dark blue backgrounds of the 3 detail lines of the alignment are a percent identity plot of the alignment. The details of computing this are presented in a following section. IMAS provides two visualizations of percent identity. The first is the background color, where dark blue represents 100% identity and very light blue represents 0% identity. These background colors are shown at all zoom levels so that the user can visually locate an interesting part of the alignment and maintain focus on it while zooming in and out.
The image below shows the same alignments with the display zoomed out. When the zoom level leaves no space to show NT or AA text, the alignment details are replaced with a line plot of percent identity, which visually duplicates the blue background, but allows more accurate visualization. When zooming, the text labels are scaled appropriately to be visible, but they may be cut off on the left due to lack of space, as can be seen in the lower three alignments.
The cyan highlight in the label line of the top alignment indicates that it has been selected. Like Feature selection, corresponding text details are printed in the text window. This includes many of the important details of the selected alignment:
§ Name
§ Accession number
§ Hit length
§ Bit score
§ Expected value
§ Hit ID
The HitID is usually something like gi|28262418|gb|EAA25922.1|, which can be used for data lookup by selecting Search For Selected Text on NCBI in the text window.
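If you want to read these details outside of IMAS, the following Python sketch pulls the same fields from a Blast XML result. It assumes that the file uses the standard NCBI Blast XML element names (Hit_id, Hit_def, Hit_accession, Hit_len, Hsp_bit-score, Hsp_evalue). The function name and the sample values are made up for illustration, apart from the Hit ID quoted above.

```python
import xml.etree.ElementTree as ET

def summarize_hits(xml_text):
    """Return one dictionary per HSP with the details listed above."""
    root = ET.fromstring(xml_text)
    rows = []
    for hit in root.iter("Hit"):
        for hsp in hit.iter("Hsp"):
            rows.append({
                "name": hit.findtext("Hit_def"),
                "accession": hit.findtext("Hit_accession"),
                "hit_length": int(hit.findtext("Hit_len")),
                "bit_score": float(hsp.findtext("Hsp_bit-score")),
                "expected_value": float(hsp.findtext("Hsp_evalue")),
                "hit_id": hit.findtext("Hit_id"),
            })
    return rows

# Made-up sample data in the assumed format.
SAMPLE = """<BlastOutput><BlastOutput_iterations><Iteration><Iteration_hits>
<Hit>
  <Hit_id>gi|28262418|gb|EAA25922.1|</Hit_id>
  <Hit_def>example protein (made-up sample)</Hit_def>
  <Hit_accession>EAA25922</Hit_accession>
  <Hit_len>312</Hit_len>
  <Hit_hsps><Hsp>
    <Hsp_bit-score>152.5</Hsp_bit-score>
    <Hsp_evalue>2e-38</Hsp_evalue>
  </Hsp></Hit_hsps>
</Hit>
</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>"""

for row in summarize_hits(SAMPLE):
    print(row)
```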
You can also copy text for later use in the comment of the corresponding Feature. We may build more direct support for copying this type of info into the Feature that triggered the Blast alignment.
For sequences on the forward strand, rendering the alignment is easy. A Blast alignment is shown in forward order, as in the image below. In this case, the query Feature Rricke31 is on the forward strand. The blue arrow in the upper-right corner of the blue outline box points right to indicate that the query was given in the forward order.
Each individual HSP alignment also has one arrow on the right end of its box for the Query and one arrow for the Subject sequence. In the alignment with RC0022, both arrows point right, indicating that a Forward query aligned with a Forward sequence.
To create an alignment in which a reverse-complemented NT sequence must be used, Blast reverse-complements the Query and searches for this flipped query in the database. Thus, all Blast alignment reports will initially show the Hit in forward order. If the query was reverse-complemented, it is shown in reverse-complement order. An example of this is presented in the lower small alignment in the image below. The reverse complement Query was TACTACAGCTGTAA, and the left-pointing arrow on the (top) Query line indicates that it is in reverse-complement order. Thus, it is in opposite order from the sequence in the NT channel.
As with Features, a menu may be popped up for Blast HSPs that you have selected.
Right click the mouse (ctrl-click on Mac) when the mouse cursor is over a Blast HSP to get the Blast menu. The following command is relevant for reverse-complement rendering:
§ Draw HSP in Reverse Complement Order reverses the order of the Query, the Midline and the Subject of the selected HSP, and complements the Query and Subject letters. Also, each arrow on the right of the HSP display is flipped to point in the opposite direction. Selecting this menu item a second time reverses and complements the selected HSPs again, restoring the original display. This command is useful if the Query sequence is in reverse complement order and you want to put it in the same order as the NT sequence and its corresponding AA reading frame.
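The effect of Draw HSP in Reverse Complement Order on an NT HSP can be sketched in Python as follows. This is an illustration under assumptions, not the IMAS implementation: the function name is hypothetical, and gap characters are assumed to be "-" and are left unchanged.

```python
# Illustrative sketch only; applies to NT HSPs.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def draw_reverse_complement(query, midline, subject):
    """Reverse all three rows and complement the Query and Subject letters."""
    def revcomp(row):
        # Gap characters are not in the translation table, so they survive as-is.
        return row.translate(_COMPLEMENT)[::-1]
    return revcomp(query), midline[::-1], revcomp(subject)

query, midline, subject = "ACG-TT", "||| | ", "ACGATA"
flipped = draw_reverse_complement(query, midline, subject)
print(flipped)
# Applying the operation twice gives back the original rows.
print(draw_reverse_complement(*flipped) == (query, midline, subject))
```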
For Reverse Complement ORFs, having the query in forward order aligns with the forward sequence in the NT channel, but it is reverse-complemented from the transcription order. Flipping the presentation allows you to read the transcribed NT sequence from left to right.
For AA Blast HSPs, the AA query sequence is in transcription order, since querying with a reversed AA sequence makes no sense. For negative-strand ORFs, IMAS creates the Query from right to left, and this is indicated in the HSP display by displaying a left-pointing arrow in the Query line. This is just a reminder that the Query was from a negative strand ORF. In this case, the HSP AA sequence and the corresponding negative-strand AA channel are in opposite order from each other.
For these situations, a hover box that highlights the translation to/from negative strand DNA might be preferable.
The Percent Identity Plot for HSPs is computed by sliding a 10 NT character window over the HSP alignment and scoring each Query-Subject letter pair according to its similarity. The sum of scores per letter is divided by the number of letters, resulting in a value between 0 and 1, which is plotted.
For NT HSPs, we have built two schemes. The first is very simple. Identical NTs score 1, everything else scores 0, and the number of letters is always 10. The problem with this simple scheme is that some NT differences matter more than others, because they code for the same Amino Acid.
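A minimal Python sketch of the simple scheme is shown here. It assumes that the windows are laid end to end without overlapping, that a gap never counts as an identity, and that a trailing partial window is skipped. These details and the function name are assumptions, not a description of the IMAS code.

```python
def simple_identity_plot(query, subject, window=10):
    """Score 1 per identical NT pair and divide by the window size."""
    if len(query) != len(subject):
        raise ValueError("aligned Query and Subject must have the same length")
    values = []
    # Only full windows are scored (assumption).
    for start in range(0, len(query) - window + 1, window):
        pairs = zip(query[start:start + window], subject[start:start + window])
        matches = sum(1 for q, s in pairs if q == s and q != "-")
        values.append(matches / window)
    return values

print(simple_identity_plot("ACGTACGTACGGGGTTTTAA",
                           "ACGTACGTAAGGGGTTTTAA"))
```

Each value in the returned list corresponds to one plotted point between 0 and 1.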
Our more complex NT scoring scheme accounts for codon identity and codon redundancy, as well as for the resulting Amino Acid similarity according to the standard substitution matrices. For this codon-based similarity, IMAS scores each codon according to its level of identity. There are three situations:
§ First, if the Query & Subject codons are identical, the score is 3, which is numerically compatible with the previous scheme of scoring 1 per identical nucleotide.
§ Second, if the codons are different, but code for the same amino acid, the score is 2.
§ Third, if the codons code for different amino acids, then the BLOSUM62 matrix is consulted. Each codon is translated into its Amino Acid, and the BLOSUM62 matrix is used to look up the substitution log-odds value for these two AAs. This value subst is plugged into the formula score = (subst - min) / (max - min). The value min is the minimum log-odds value in the matrix, and max is the maximum for the Query AA, which is on the matrix diagonal. This formula yields a codon score between 0 and 1. The sum of codon scores is divided by 3 times the number of codons used.
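The three situations above can be sketched in Python as follows. This is a sketch under assumptions: the function names are hypothetical, codons are assumed to contain no gap characters, and the substitution matrix is passed in as a dictionary keyed by AA pairs. The small matrix in the example is a toy excerpt for illustration only; a real run would load the complete BLOSUM62 matrix.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_score(q_codon, s_codon, matrix):
    """Score one Query/Subject codon pair using the three cases above."""
    if q_codon == s_codon:
        return 3.0                      # identical codons
    q_aa, s_aa = CODON_TABLE[q_codon], CODON_TABLE[s_codon]
    if q_aa == s_aa:
        return 2.0                      # different codons, same amino acid
    subst = matrix.get((q_aa, s_aa), matrix.get((s_aa, q_aa)))
    lo = min(matrix.values())           # minimum value in the matrix
    hi = matrix[(q_aa, q_aa)]           # diagonal entry for the Query AA
    return (subst - lo) / (hi - lo)

def window_score(q_codons, s_codons, matrix):
    """Sum of codon scores divided by 3 times the number of codons."""
    total = sum(codon_score(q, s, matrix) for q, s in zip(q_codons, s_codons))
    return total / (3 * len(q_codons))

# Toy excerpt of a substitution matrix (illustrative values only).
TOY_MATRIX = {("K", "K"): 5, ("R", "R"): 5, ("K", "R"): 2, ("C", "E"): -4}

print(codon_score("AAA", "AAA", TOY_MATRIX))   # identical
print(codon_score("AAA", "AAG", TOY_MATRIX))   # both code for K
print(codon_score("AAA", "AGA", TOY_MATRIX))   # K versus R
print(window_score(["AAA", "AAA", "AAA"], ["AAA", "AAG", "AGA"], TOY_MATRIX))
```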
Since the size of the sliding window is not necessarily divisible by 3, a codon is included in the current window only if all 3 letters fit. If not, the current window is completed, and the non-fitting codon is used as the first codon in the next window. If necessary, the scoring procedure accounts for slippage by “catching up” with an extra codon every 3 windows. For example, with window size 10, the first window uses 9 NTs, the next uses 9 NTs, and the next uses 12 NTs. 9+9+12 = 30.
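The catch-up behaviour can be reproduced with a few lines of Python. The sketch below assumes that each window ends at a multiple of the window size and takes every whole codon that has not yet been used and that fits before that boundary. The function name is hypothetical, and a trailing partial window is ignored.

```python
def nts_per_window(alignment_length, window=10):
    """Return the number of NTs (whole codons only) scored in each window."""
    used = 0            # NTs already assigned to earlier windows
    sizes = []
    boundary = window   # nominal right edge of the current window
    while boundary <= alignment_length:
        codons = (boundary - used) // 3   # whole codons that fit before the edge
        sizes.append(codons * 3)
        used += codons * 3
        boundary += window
    return sizes

print(nts_per_window(30))   # the example from the text: 9, 9 and 12 NTs
print(nts_per_window(60))
```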
For AA HSPs, the same BLOSUM62-based scoring scheme is used, with the obvious exception that no NT translation needs to take place. Similarly, for window sizes that are not a multiple of 3, the catch-up procedure is used. The score per window is scaled to a value between 0 and 1 according to the number of AAs used.
The presentation of each HSP starts immediately below the ORF channel. Blast reports HSPs in order from most to least significant, and we use this order to lay results out spatially. The first HSP is placed in the top row, and each following HSP is placed in the highest row that has horizontal space available for it. Thus, two non-overlapping alignments along the NT sequence will both be placed in the top row, while overlapping alignments are stacked from top to bottom in order from most to least significant.
Blast runs, each of which is the set of HSPs from a single query against a single database, follow a similar stacking rule: each run is fitted into the first available space. Because alignment gaps tend to make results larger than the query, Blast runs on closely-spaced ORFs will give badly-stacked results with lots of empty space. We are investigating layout techniques to solve this problem.
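This kind of first-fit stacking can be sketched in Python. The sketch assumes that each HSP is described only by its horizontal extent (start, end) and that the list is already in Blast's order of significance. The function name is hypothetical and the code is an illustration, not the IMAS layout routine.

```python
def stack_hsps(extents):
    """Assign each HSP extent to the highest row where it does not overlap.

    extents: list of (start, end) pairs, most significant HSP first.
    Returns the row index chosen for each HSP (0 is the top row).
    """
    rows = []          # rows[i] holds the extents already placed in row i
    assignment = []
    for start, end in extents:
        for index, row in enumerate(rows):
            if all(end <= s or start >= e for s, e in row):
                row.append((start, end))
                assignment.append(index)
                break
        else:
            # No existing row has room, so open a new row at the bottom.
            rows.append([(start, end)])
            assignment.append(len(rows) - 1)
    return assignment

# The third HSP does not overlap the first, so it returns to the top row.
print(stack_hsps([(0, 100), (50, 150), (120, 200)]))
```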
All Blast results are placed within a dedicated rectangle that is as wide as the Primary Sequence, and as tall as is needed to fit all the results. Other analyses, such as multialignments, are placed below this Blast channel.
Many of the HSPs reported by Blast can be of low significance. IMAS currently has no means of calling Blast with cutoff parameters, so IMAS calls Blast with default values and filters out the results. Select View->Blast Display Controls to filter out small or insignificant hits from the display. Note that this filtering operation controls Display only. The Blast HSPs that you hide are still present in the system, and can be un-hidden at any time.
The image below shows the Blast Filter Dialog, which controls the presentation of all blast results. In the upper part is a set of 3 numerical criteria for hiding a Blast HSP. IMAS will apply each of the criteria for hiding Blast hits that you check. You can pop up the Blast Filter Dialog at any time and check/uncheck hiding criteria, change the numbers, and so on.
§ Bit Score of Blast Bit < X allows you to hide HSPs with low bit scores. Type in a numeric value and check the box on the left.
§ Length of Blast Hit < X allows you to drop small HSPs. Around 30 or so will cut off the many small NT HSPs that might otherwise appear.
§ Expected Value of Blast Hit > X allows you to hide HSPs with expected values greater than some limit. Type in something like 0.00001.
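The three hiding criteria amount to a simple display filter. The Python sketch below is an illustration with hypothetical names. Each threshold is optional, which mirrors a checkbox that is either checked or unchecked, and hidden HSPs are merely left out of the returned list, not deleted.

```python
from dataclasses import dataclass

@dataclass
class Hsp:
    name: str
    bit_score: float
    length: int
    evalue: float

def visible_hsps(hsps, min_bits=None, min_length=None, max_evalue=None):
    """Return the HSPs that survive every criterion that is switched on."""
    shown = []
    for hsp in hsps:
        hidden = (
            (min_bits is not None and hsp.bit_score < min_bits)
            or (min_length is not None and hsp.length < min_length)
            or (max_evalue is not None and hsp.evalue > max_evalue)
        )
        if not hidden:
            shown.append(hsp)
    return shown

# Made-up example values.
hits = [Hsp("strong", 250.0, 900, 1e-60),
        Hsp("short", 40.0, 22, 1e-6),
        Hsp("weak", 35.0, 120, 0.5)]
print([h.name for h in visible_hsps(hits, min_length=30, max_evalue=0.00001)])
```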
In the lower section of this dialog is a facility for changing the display of Blast HSPs. Again, these operations apply globally to all Blast HSPs.
§ Hide Text Label of All Blast Hits allows you to show just the three lines of Query-Midline-Subject. This allows you to pack more HSPs in the display. You can still learn the name of a Blast HSP by clicking on it and looking at the resulting text window display.
§ Shrink Height of All Blast Hits By 3 allows the display of just the Midline of the HSP alignment. For NT HSPs, this will look like a sequence of vertical bars and spaces. For AA HSPs, you will see the AA letters and + signs. This option allows even more dense vertical packing.
These last two options can be activated together or separately. Deactivate each option by clearing the appropriate checkbox. The results are displayed when the dialog Close button is selected.
When IMAS runs Blast, it creates a file named <TempFolder>\inputN.tmp for NT queries and <TempFolder>\inputP.tmp for AA queries. IMAS asks Blast to output XML-formatted results with the -m 7 option. The results will be stored in the folder BlastResults in the project folder. Each Blast output file has a unique name that is composed as follows:
BaseName_FeatureName_NTIdx_YYMMDD_HHMM_DBNameN.bxml.
The table indicates what each of these pieces is.
BaseName | Name of the Primary Sequence
FeatureName | Name of the Feature forming the query
NTIdx | NT index of the query
YYMMDD_HHMM | Date and Time of the query
DBNameN | Database name that was queried
At the end of the DBName is either N or P, indicating Nucleotide or Protein query. This very long name can be used to help you understand what the query was, where it was on the sequence, what feature was selected to make the query, when the query was made, and what database was used. The long name helps disambiguate possibly similar queries, and ensure that there are no name collisions with multiple queries on the same feature with the same database. The file extension bxml is chosen to remind you that it is a Blast XML file.
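As an illustration of how such a name is put together, the following Python sketch assembles the pieces in the order given above. The function name, the example values and the exact formatting of the NT index are assumptions; only the overall pattern is taken from this manual.

```python
from datetime import datetime

def blast_output_name(base, feature, nt_idx, when, db_name, protein):
    """Compose BaseName_FeatureName_NTIdx_YYMMDD_HHMM_DBName{N|P}.bxml."""
    stamp = when.strftime("%y%m%d_%H%M")   # YYMMDD_HHMM
    suffix = "P" if protein else "N"       # Protein or Nucleotide query
    return f"{base}_{feature}_{nt_idx}_{stamp}_{db_name}{suffix}.bxml"

# Hypothetical example values.
print(blast_output_name("Rakari", "Rricke31", 1200,
                        datetime(2008, 3, 5, 14, 7), "Rconorii", False))
```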
The Sequence File (.sxml) stores the filename of each Blast run that IMAS has made for the primary sequence. Be sure to select File->Save to save the fact that you have done these analyses. If you perform a Blast analysis and don’t do a Save, then the .sxml file will not have the pointer to the .bxml file, and it will have effectively disappeared from the Project. The actual .bxml file will still be left in the <ProjectFolder>\BlastResults folder, so you can look at it with Internet Explorer, but currently there is no way to reconnect it to a Project.
To run hmmpfam on an AA sequence, you must identify which Feature is going to hold the results. Like Blast, a Feature is the anchor for HMM searches. In the Feature menu is the item:
§ Run HmmPfam for Motifs on this AA Sequence takes the AA sequence of the selected Feature(s) and uses it as the query for each HMMer database that was highlighted in the Motif Database section of the Data Management Tab (section 3.2). For example, if 3 databases were highlighted, hmmpfam will query each of these 3 databases. If more than one Feature was selected for this operation, the AA sequence of each Feature will be queried against these 3 databases.
We use hmmpfam because this program is most suitable for finding all Motifs within a single AA sequence. If you run against the full Pfam database, the search will take a minute or so to complete.
Hmmer results are presented in the same style as Blast results, as outlined above. The sequence alignment is presented as one long horizontal feature which can be examined in detail by zooming in, or viewed overall by zooming out.
Currently, Hmmer results are presented in the Blast channel, stacked below the most recent Blast run. The Hmm results can be visually distinguished from Blast results by the use of a different background color for the Percent Identity Plot. Hmmer uses a greenish blue, while Blast uses blue.
Clicking on the Hmmer results displays text detail in the Text Window. Currently, there is no menu available to pop up on Hmm alignments. We are currently designing operations that would fit in such a menu.
The visual scoring function for Hmmer results is currently the same additive scoring function for AA sequence alignments. Unfortunately, this visualization is a little misleading, because the Position-Specific Scoring Matrix that Hmmer uses for scoring an alignment has a significantly greater range per AA residue than Blast uses. Thus, while the visualization is appropriate for giving a gross overview, the visual display doesn’t correlate well with the expected value or overall score. We are currently developing visualization techniques that display the results more appropriately.
HmmPfam results are stored in the folder <ProjectFolder>\HmmSearchResults. Similar to Blast, the filename for each run is BaseName_FeatureName_NTIdx_YYMMDD_HHMM_DBName.hmmo
The fields are the same as for Blast, with the obvious exception that the Database name is a Hmmer database, and hmmo is the file extension IMAS uses to label Hmmer output. Again, each filename is stored in the .sxml file of the Primary Sequence that was used for the Hmmer search query. To be able to reference this Hmm search again, make sure to Save the project before quitting IMAS.
To run the Clustal-W multialignment program, you can select one or more Blast HSPs in the Main Analysis Area to multialign. You may multialign either NT or AA sequences. IMAS uses the Subject (Hit) sequences and the NT or AA sequence of the related Feature for multialignment. You do not need to explicitly select the Feature, since this will be determined by which Blast HSPs were selected. It only makes sense to multialign the Feature with the HSPs that were derived from that Feature. If you select Blast hits from a different Feature, these will be ignored.
To select more than one Blast HSP, use Ctrl-left mouse button (cmd-click on Mac). On any selected hit, Right-click to pop up the Blast HSP Operations menu. Select MultiAlign Selected Blast Hits to run Clustal-W. The image shows the situation in which the two HSPs RC0001 and RP001 have been selected, as derived from the Feature Rricke104. The Feature and the two Subject sequences of RC0001 and RP001 will be multialigned.
Multialignment results are displayed on an orange background with the letters of the alignment colored according to their similarity to the consensus. The consensus line is computed by IMAS, and is simply the letter that occurs in more than 50% of the rows of the column. Columns of identical letters in the multialignment are colored red, while non-identities are blue. All sequences in multialignments are presented in Forward order. Thus, if a reverse-complement Feature or Blast HSP is to be used, it is transformed into forward order and multialigned.
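The consensus rule and the red/blue column test can be sketched in Python as follows. The function names are hypothetical. What IMAS shows when no letter exceeds 50% is not described here, so the sketch uses "." as a placeholder, and it treats a gap character like any other letter. Both choices are assumptions.

```python
from collections import Counter

def consensus_line(rows, no_majority="."):
    """Pick, per column, the letter found in more than 50% of the rows."""
    out = []
    for column in zip(*rows):
        letter, count = Counter(column).most_common(1)[0]
        out.append(letter if count > len(column) / 2 else no_majority)
    return "".join(out)

def identical_columns(rows):
    """True for columns in which every row has the same letter (drawn red)."""
    return [len(set(column)) == 1 for column in zip(*rows)]

rows = ["ACGT", "ACGA", "ACTA"]
print(consensus_line(rows))
print(identical_columns(rows))
```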
The whole multialignment is labeled with the Feature that was used in the multialignment. Each row of the multialignment is labeled with which sequence was used to create it. Since we allocate a fixed amount of space on the left for the row labels, as IMAS is zoomed out, that space is also shrunk, which leaves less and less room for row labels. We will soon allow more operations on multialignments, such as sorting, selecting, etc.
At less detailed zoom levels, similarity to the consensus is computed row-by-row in the same way as the Percent Identity Plot for Blast HSPs.
In the image below is the same sequence as above zoomed out by 15. In this case, the alignment is very good, so the line graphs are near 100% for each line. In the next image, there are patches of high similarity (dark blue) separated by low similarity (light blue/orange). Each line is compared to the consensus, which would bias this computation a bit, but the visualization gives the general impression of the level of consensus across the multialignment.
The source data used in multialignments comes from databases that were used to generate Blast HSPs. There are two categories of multialignment input data:
Feature data comes from either the NT sequence as derived from the extent of the Feature, or the appropriate reading frame of the AA sequence for this Feature.
HSP Subject sequences are used as a database accession query into the source database that generated the HSP alignment. That is, instead of just copying the sequence from the Subject line in the HSP that IMAS displays, IMAS takes the Accession key from the Blast HSP and searches for the “full-size” sequence in the local Blast database where the HSP was found. The length of this sequence is trimmed to the length of the Feature, so that Blast HSPs that came from whole-genome NT sequences are not multialigned over millions of nucleotides. The result is that Blast is used to identify the corresponding ORF in another organism, and that full ORF is used in multialignments. Thus, when multialigning, you will see a sequence of MS-DOS windows popping up to perform this database lookup with the Blast program fastacmd.
This function does not work for Blast HSPs derived from the network-based NCBI nr database, because fastacmd does not allow this type of retrieval from the NCBI nr database. Of course, you could download the desired database and run queries locally.
Blast Database Formatting must be done correctly for this retrieval of data from local Blast databases to work. Thus, if you are formatting your own databases, be sure to call formatdb with the -o T option so that accession numbers are included in the formatted database. Also, make sure that each Fasta entry in the multi-Fasta file has some kind of accession number or other unique identifier. If this is not present, then the database lookup process for multialignments will fail, and you will be left with just the Subject sequence.
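For reference, the two command lines involved might be assembled as in the Python sketch below. It assumes the legacy NCBI tools, where formatdb takes -i (input file), -p (T for protein, F for nucleotide) and -o T (parse sequence identifiers and build indexes), and fastacmd takes -d (database) and -s (identifier to look up). The function names and file names are hypothetical, and the exact arguments that IMAS itself passes are not documented here.

```python
def formatdb_command(fasta_path, is_protein):
    """Command to format a multi-Fasta file with accession indexes."""
    return ["formatdb", "-i", fasta_path,
            "-p", "T" if is_protein else "F",
            "-o", "T"]

def fastacmd_command(db_name, accession):
    """Command to fetch one full-size sequence by accession."""
    return ["fastacmd", "-d", db_name, "-s", accession]

# Hypothetical file and database names.
print(" ".join(formatdb_command("Rconorii.fasta", False)))
print(" ".join(fastacmd_command("Rconorii.fasta", "RC0001")))
```

Either list could be passed to subprocess.run on a machine where these programs are installed.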
Like Blast and Hmmer, IMAS stores the query sequences to Clustal-W runs in the TEMP folder that you identified in Preferences, and then runs Clustal-W using this input file. The file is named BaseName.clustalInp, and it gets overwritten for each run of Clustal-W.
The output from ClustalW is written into the <ProjectFolder>\MultiAlignResults folder, using a similar naming scheme as other output files:
BaseName_FeatureName_NTIdx_YYMMDD_HHMMma.mxml
Thus, the output file is an XML-ized version of the multi-fasta file with gaps inserted per sequence to form the multialignment. You may view this mxml file outside of IMAS, if you wish. The .sxml file for the Primary Sequence stores the filename for this file, associated with the Feature that was used in the multialignment. Multiple runs of Clustal-W will result in multiple files stored in the MultiAlignResults folder, with multiple links within the .sxml file. You must Save the project to ensure that IMAS has access to this file when you next run IMAS. Files that “lose” their links are not deleted from the disk.