A Knowledge Integration Framework for Information Visualization

Jinwook Seo 1,2 and Ben Shneiderman 1,2,3

1. Department of Computer Science, 2. Human-Computer Interaction Laboratory, Institute for Advanced Computer Studies, and 3. Institute for Systems Research
University of Maryland, College Park, MD 20742
{seo, ben}@cs.umd.edu

Abstract. Users can better understand complex data sets by combining insights from multiple coordinated visual displays that include relevant domain knowledge. When dealing with multidimensional data and clustering results, the most familiar and comprehensible displays are 1- and 2-dimensional projections (histograms and scatterplots). Other easily understood displays of domain knowledge are tabular and hierarchical information for the same or related data sets. The novel parallel coordinates view [6], powered by a direct-manipulation search, offers strong advantages but requires some training for most users. We provide a review of related work in the area of information visualization, and introduce new tools and interaction examples showing how to incorporate users’ domain knowledge for understanding clustering results. Our examples present hierarchical clustering of gene expression data, coordinated with a parallel coordinates view and with the gene annotation and gene ontology.

1 Introduction

Modern information-abundant environments provide access to remarkable collections of richly structured databases, digital libraries, and information spaces. Text searching to locate specific pages and starting points for exploration is enormously successful, but this is only the first generation of knowledge discovery tools.
Future interfaces that balance data mining algorithms with potent information visualizations will enable users to find meaningful clusters of relevant documents, relevant relationships among dimensions, unusual outliers, and surprising gaps [10].

Existing tools for cluster analysis are already used for multidimensional data in many research areas, including financial, economic, sociological, and biological analyses. Finding natural subclasses in a document set not only reveals interesting patterns but also serves as a basis for further analyses. One of the troubles with cluster analysis is that evaluating how interesting a clustering result is to researchers is subjective, application-dependent, and even difficult to measure. This problem generally gets worse as the dimensionality and the number of items grow. The remedy is to enable researchers to apply domain knowledge to facilitate insight about the significance of the clustering result. Strategies that enable exploration of clusters will also support sense-making about outliers, gaps, and correlations.

A cluster is a group of data items that are similar to others within the same group and different from items in other groups. Clustering enables researchers to see overall distribution patterns, identify interesting unusual patterns, and spot potential outliers. Moreover, clusters can serve as effective inputs to other analysis methods such as classification.

Researchers in various areas are still developing their own clustering algorithms even though a large number of general-purpose clustering algorithms already exist. One reason is that it is difficult to understand a clustering algorithm well enough to apply it to a new data set. A more important reason is that it is difficult for researchers to validate or understand the clustering results in relation to their knowledge of the data set.
Even the same clustering algorithm might generate a completely different clustering result when the distance/similarity measure changes. A clustering result could make sense to some researchers but not to others, because the validity of a clustering result depends heavily on users’ interests and is application-dependent. Therefore, researchers’ domain knowledge plays a key role in understanding and evaluating the clustering result.

A large number of clustering algorithms have been developed, but only a small number of cluster visualization tools are available to facilitate researchers’ understanding of the clustering results. Current visual cluster analysis tools can be improved by allowing researchers to incorporate their domain knowledge into visual displays that are well coordinated with the clustering result view.

This paper describes additions to our interactive visual cluster analysis tool, the Hierarchical Clustering Explorer (HCE) [9]. These additions include 1-D histograms and 2-D scatterplots that are accessed through coordinated views. These views are familiar projections that are more comprehensible than higher dimensional presentations. HCE also implements presentations of external domain knowledge. While HCE users appreciate our flexible histogram and scatterplot views, this paper concentrates on novel presentations for high-dimensional data and for domain knowledge:

− a parallel coordinates view enables researchers to search for profiles similar to a candidate pattern, which is specified by direct manipulation
− a tabular or hierarchical view enables researchers to explore relationships that may be found in information that is external to the data set.

Visualization techniques can be used to support semi-automatic information extraction and semantic annotation for domain experts. For example, visual analysis by techniques such as dynamic queries has been successfully used in supporting researchers who are interested in analyses of multidimensional data [5][7].
Well-designed visual coordination with researchers’ domain knowledge facilitates users’ understanding of the analysis result.

This paper briefly explains the interactive exploration of clustering results using our current version, HCE 3.0. Section 3 describes the knowledge integration framework, including the design considerations for direct-manipulation search and dynamic queries. Section 4 presents a tabular view showing gene annotation and the gene ontology browser, and section 5 covers some implementation issues.

2 Interactive Exploration of Clustering Results with HCE 3.0

Some clustering algorithms, such as k-means, require users to specify the number of clusters as an input, but it is hard to know the right number of natural clusters beforehand. Other clustering algorithms automatically determine the number of clusters, but users may not be convinced of the result since they had little or no control over the clustering process. To avoid this dilemma, researchers prefer the hierarchical clustering algorithm, since it does not require users to enter a predetermined number of clusters and it also allows users to control the desired resolution of a clustering result.

Fig. 1. Overall layout of HCE 3.0. The minimum similarity bar was pulled down to get 55 clusters in the dendrogram view. A cluster of 113 genes (highlighted with orange markers below the cluster) is selected in the dendrogram view, and these genes are highlighted in the scatterplots, the detail view, and the parallel coordinates view tab window (see section 3). Users can select a tab among the seven tab windows in the bottom pane to investigate the data set in coordination with different views. Users can see the names of the selected genes and the actual expression values in the detail views.

HCE 3.0 is an interactive knowledge visualization tool for hierarchical clustering results with a rich set of user controls (dendrograms, color mosaic displays, etc.) (Fig. 1).
A hierarchical clustering result is generally represented as a binary tree called a dendrogram, whose subtrees are clusters. HCE 3.0 users can see the overall clustering result in a single screen and zoom in to see more detail. Considering that the lower a subtree is, the more similar the items in the subtree are, we implemented two dynamic controls, the minimum similarity bar and the detail cutoff bar, which are shown over the dendrogram display. Users can control the number of clusters by using the minimum similarity bar, whose y-coordinate determines the minimum similarity threshold. As users pull down the minimum similarity bar, they get tighter clusters (lower subtrees) that satisfy the current minimum similarity threshold. Users can control the level of detail by using the detail cutoff bar. All the subtrees below the detail cutoff bar are rendered using the average intensity of items in the subtree, so that users can see the overall patterns of clusters without distraction by too much detail.

Since a different linkage method or similarity measure yields a different hierarchical clustering result, we need some mechanisms to evaluate clustering results. HCE 3.0 implements three different evaluation mechanisms. Firstly, HCE 3.0 users can compare two dendrograms (or hierarchical clustering results) in the dendrogram view to visually comprehend the effects of different clustering parameters. Two dendrograms are shown face to face, and when users double-click on a cluster in one dendrogram, they can see lines connecting the items in the cluster to the same items in the other dendrogram [9]. Secondly, HCE 3.0 users can compare a hierarchical clustering result and a k-means clustering result. When users click on a cluster in the dendrogram view, the items in the cluster are also highlighted in the k-means clustering result view (the last tab in Fig. 1), so that users can see if the two clustering results are consistent.
Thirdly, HCE 3.0 enables users to evaluate a clustering result using an external evaluation measure (F-measure) when they know the correct clustering result in advance. Through these three mechanisms, HCE 3.0 helps users determine the most appropriate clustering parameters for their data set.

HCE 3.0 was successfully used in two case studies with gene expression data. We proposed a general method of using HCE 3.0 to identify the optimal signal/noise balance in Affymetrix gene chip data analyses. HCE 3.0's interactive features help researchers find the optimal combination of three variables (probe set signal algorithms, noise filtering methods, and clustering linkage methods) to maximize the effect of the desired biological variable on data interpretation [8]. HCE 3.0 was also used to analyze in vivo murine muscle regeneration expression profiling data using Affymetrix U74Av2 (12,488 probe sets) chips measured at 27 time points. HCE 3.0's visual analysis techniques and dynamic query controls played an important role in finding 12 novel downstream targets that are biologically relevant during myoblast differentiation [12]. In sections 3 and 4, we will use this data set to demonstrate how HCE 3.0 combines users’ domain knowledge with other views to facilitate insight about the clustering result and the data set.

Fig. 2 shows four tightly coupled components of HCE and the linkages between them. Updates by each linkage in Fig. 2 are instantaneous (that is, they take less than 100 ms) for most microarray data sets.

Fig. 2. Diagram of interactions between components of HCE 3.0. All interactions are bi-directional. This paper describes coordination between the dendrogram view, scatterplots/histograms views, parallel coordinates view, and knowledge tables/hierarchies views. Knowledge tables/hierarchies incorporate external domain knowledge, while the others show the internal data using different visual representations.
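
The effect of the minimum similarity bar described in this section can be illustrated with a minimal Python sketch. It is not HCE's implementation. The names Node, leaves, and cut, the gene labels g1 to g5, and the similarity values are all hypothetical, and the sketch assumes that similarity never decreases as one moves from a subtree down to its children, which holds for the common single, complete, and average linkage methods.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Node:
    """Internal dendrogram node (hypothetical structure, not HCE's)."""
    similarity: float  # similarity at which the two subtrees were merged
    left: "Tree"
    right: "Tree"


# A tree is either a leaf (an item label) or an internal node.
Tree = Union[str, Node]


def leaves(tree: Tree) -> List[str]:
    """Collect the item labels under a subtree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree.left) + leaves(tree.right)


def cut(tree: Tree, min_similarity: float) -> List[List[str]]:
    """Return the highest subtrees whose merge similarity meets the threshold.

    Assumes similarity does not decrease from a parent to its children,
    so a subtree that meets the threshold is reported as one cluster.
    """
    if isinstance(tree, str):
        return [[tree]]  # a single item is a cluster by itself
    if tree.similarity >= min_similarity:
        return [leaves(tree)]
    return cut(tree.left, min_similarity) + cut(tree.right, min_similarity)


# Toy dendrogram with made-up similarity values.
dendrogram = Node(
    0.2,
    Node(0.9, "g1", "g2"),
    Node(0.6, "g3", Node(0.8, "g4", "g5")),
)

# Raising the threshold corresponds to pulling the bar down.
for threshold in (0.1, 0.5, 0.7, 0.95):
    print(threshold, cut(dendrogram, threshold))
```

With these made-up values, raising the threshold from 0.1 to 0.5, 0.7, and 0.95 yields 1, 2, 3, and 5 clusters respectively, which mirrors how pulling the bar down produces more and tighter clusters.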

3 Combining users’ domain knowledge: Parallel coordinates view

Many microarray experiments measure gene expression over time [2][12]. Researchers would like to group genes with similar expression profiles or find interesting time-varying patterns in the data set by performing cluster analysis. Another way to identify genes with profiles similar to known genes is to search for the genes directly by specifying the expected pattern of a known gene. When researchers have some domain knowledge, such as the expected pattern of a previously characterized gene, they can try to find genes similar to the expected pattern. Since it is not easy to specify the expected pattern in a single try, they have to conduct a series of searches for the expression profiles similar to the expected pattern. Therefore, they need an interactive visual analysis tool that allows easy modification of the expected pattern and rapid update of the search result.

Clustering and direct profile search can complement each other. Since no clustering algorithm is right for all data sets and applications, direct profile search could be used to validate the clustering result by projecting the search result onto the clustering result view. Conversely, a clustering result could be used to validate the profile search by projecting the cluster result onto the profile view. Therefore, coordination between a clustering result and a direct search result makes the identification process more valid and effective.

‘Profile Search’ in the Spotfire DecisionSite (www.spotfire.com) calculates the similarity to a search pattern (the so-called 'master profile') for all genes in the data set and adds the result as a new column to the data set. The built-in profile editor makes it