A kernel-based integration of genome-wide data for clinical decision support

Genome Medicine 2009, 1:39

Research

A kernel-based integration of genome-wide data for clinical decision support

Anneleen Daemen*, Olivier Gevaert*, Fabian Ojeda*, Annelies Debucquoy†, Johan AK Suykens*, Christine Sempoux‡, Jean-Pascal Machiels§, Karin Haustermans† and Bart De Moor*

Addresses: *Department of Electrical Engineering (ESAT-SCD), Katholieke Universiteit Leuven, Kasteelpark Arenberg, 3001 Leuven, Belgium. †Department of Experimental Radiotherapy, Katholieke Universiteit Leuven, UZ Herestraat, 3000 Leuven, Belgium. ‡Department of Pathology, Université Catholique de Louvain, St Luc University Hospital, Avenue Hippocrate, 1200 Brussels, Belgium. §Department of Medical Oncology, Université Catholique de Louvain, St Luc University Hospital, Avenue Hippocrate, 1200 Brussels, Belgium.

Correspondence: Anneleen Daemen.
Email:

Abstract

Background: Although microarray technology allows the investigation of the transcriptomic make-up of a tumor in one experiment, the transcriptome does not completely reflect the underlying biology due to alternative splicing, post-translational modifications, as well as the influence of pathological conditions (for example, cancer) on transcription and translation. This increases the importance of fusing more than one source of genome-wide data, such as the genome, transcriptome, proteome, and epigenome. The current increase in the amount of available omics data emphasizes the need for a methodological integration framework.

Methods: We propose a kernel-based approach for clinical decision support in which many genome-wide data sources are combined. Integration occurs within the patient domain at the level of kernel matrices before building the classifier. As supervised classification algorithm, a weighted least squares support vector machine is used. We apply this framework to two cancer cases, namely, a rectal cancer data set containing microarray and proteomics data and a prostate cancer data set containing microarray and genomics data. For both cases, multiple outcomes are predicted.

Results: For the rectal cancer outcomes, the highest leave-one-out (LOO) areas under the receiver operating characteristic curve (AUC) were obtained when combining microarray and proteomics data gathered during therapy and ranged from 0.927 to 0.987. For prostate cancer, all four outcomes had a better LOO AUC when combining microarray and genomics data, ranging from 0.786 for recurrence to 0.987 for metastasis.

Conclusions: For both cancer sites the prediction of all outcomes improved when more than one genome-wide data set was considered. This suggests that integrating multiple genome-wide data sources increases the predictive performance of clinical decision support models. This emphasizes the need for comprehensive multi-modal data.
We acknowledge that, in a first phase, this will substantially increase costs; however, this is a necessary investment to ultimately obtain cost-efficient models usable in patient-tailored therapy.

Published: 3 April 2009. Genome Medicine 2009, 1:39 (doi:10.1186/gm39). The electronic version of this article is the complete one and can be found online at
Received: 4 November 2008; Revised: 20 March 2009; Accepted: 3 April 2009

© 2009 Daemen et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Kernel methods are a powerful class of methods for pattern analysis. In recent years, they have become a standard tool in data analysis, computational statistics, and machine learning applications [1]. Based on a strong theoretical framework, their rapid uptake in applications such as bioinformatics [2], chemoinformatics, and even computational linguistics is due to their reliability, accuracy, and computational efficiency. In addition, they have the capability to handle a very wide range of data types (for example, kernel methods have been used to analyze sequences, vectors, networks, phylogenetic trees, and so on). The ability of kernel methods to deal with complex structured data makes them ideally positioned for heterogeneous data integration. More specifically, in this study we used a weighted least squares support vector machine (LS-SVM), an extension of the support vector machine (SVM) for supervised classification [3-5]. Compared to the SVM, the LS-SVM is easier and faster for high dimensional data because the quadratic programming problem is converted into a linear problem.
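This conversion can be made concrete: with labels y ∈ {-1, +1}, training a weighted LS-SVM reduces to solving one (n+1)-dimensional linear system for the bias and the support values. The sketch below, in plain numpy, is a minimal illustration rather than the authors' code; the RBF kernel, the hyperparameter values, and the inverse-class-frequency weighting scheme are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gaussian RBF kernel matrix between the rows of A and the rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_weighted_lssvm(X, y, gamma=1.0, sigma=1.0):
    """Fit a weighted LS-SVM classifier by solving one linear system.

    X: (n, d) training data; y: labels in {-1, +1}.
    Each sample is weighted inversely to its class size (an illustrative
    choice) so that the minority class is not overruled.
    """
    n = len(y)
    n_pos, n_neg = (y == 1).sum(), (y == -1).sum()
    w = np.where(y == 1, n / (2.0 * n_pos), n / (2.0 * n_neg))
    Omega = (y[:, None] * y[None, :]) * rbf_kernel(X, X, sigma)
    # KKT conditions of the weighted least-squares problem:
    # [[0, y^T], [y, Omega + diag(1/(gamma*w))]] @ [b; alpha] = [0; 1]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.diag(1.0 / (gamma * w))
    sol = np.linalg.solve(A, np.concatenate(([0.0], np.ones(n))))
    return sol[1:], sol[0]  # alpha, b

def predict_lssvm(X_train, y_train, alpha, b, X_new, sigma=1.0):
    # decision value: sum_i alpha_i * y_i * K(x, x_i) + b
    K = rbf_kernel(X_new, X_train, sigma)
    return np.sign(K @ (alpha * y_train) + b)
```

Training on n patients therefore costs a single dense solve instead of an iterative quadratic program, which keeps leave-one-out evaluation feasible at the cohort sizes considered in this study.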
To account for the unbalancedness in many two-class problems, this linear problem is extended with weights that are different for the positive and negative classes.

The growing amount of data combined with factors such as time, cost, and personalized treatment is complicating clinical decision making. Using advanced mathematical models such as the above-mentioned LS-SVM can aid clinical decision support because information arising from clinical risk factors (for example, tumor size, number of positive lymph nodes) is not accurate enough to reliably predict patient prognoses. Patients with the same clinical and pathological characteristics but different clinical outcomes can potentially be discerned with microarray technology. This technology investigates the transcriptomic make-up of a tumor in one experiment. A decade ago, it was first used in cancer studies to classify tissues as cancerous or non-cancerous [6,7]. Within the domain of cancer, microarray technology has earned a prominent place for its capacity to characterize underlying tumor behavior in detail. Although the first gene expression profile signature is being validated in clinical trials [8-10], microarray technology cannot measure the complete transcription profile due to the limited number of probes per gene on a chip; nor does the transcriptome completely reflect the biology underlying a disease.

Besides transcription, pathological conditions such as cancer also influence alternative splicing, chromosomal aberrations, and methylation [11,12]. For example, chromosomal aberrations have been found in the general population as well as in all major tumor types [13,14]. These regions of increased or decreased DNA copy number can be detected using, for example, array comparative genomic hybridization (CGH) technology. This technique measures copy number variations (CNVs) within the entire genome of a disease sample compared to a normal sample [11]. Many small aberrations have emerged as prognostic and predictive markers.
Numerous aberrations, however, also affect large genomic regions, encompassing multiple genes or whole chromosome arms.

Due to differential splicing or post-translational modifications such as phosphorylation or acetylation, the proteome is many orders of magnitude bigger than the transcriptome. This makes the proteome, which reflects the functional state of the cell, a potentially richer source of data for unraveling diseases [15]. It can be measured using mass spectrometry [16], or protein or antibody microarrays [17]. Additionally, other available omics data, such as epigenomics - the study of epigenetic changes such as DNA methylation and histone modifications [12] - and single nucleotide polymorphism genotyping [18], should be considered as they promise to be useful in unraveling cancer mechanisms and the refinement of their molecular descriptions. Although the technologies are available, joint analysis of multiple hierarchical layers of biological regulation is at a preliminary stage.

In this study we investigate whether the integration of information from multiple layers of biological regulation improves the prediction of cancer outcome.

Related work

Other research groups have already proposed the idea of data integration, but most groups have only investigated the integration of clinical and microarray data. Tibshirani and colleagues [19] proposed such a framework by reducing the microarray data to one variable, addable to models based on clinical characteristics such as age, grade, and size of the tumor. Nevins and colleagues [20] combined clinical risk factors with metagenes (that is, the weighted average expression of a group of genes) in a tree-based classification system. Wang et al. combined microarray data with knowledge on two clinicopathological variables by defining a gene signature only for the subset of patients for whom the clinicopathological variables were not sufficient to predict outcome [21].
A further evolution can be seen in studies in which two omics data sources are simultaneously considered, in most cases microarray data combined with proteomics or array CGH data. Much literature on such studies involving data integration already exists. However, the current definition of the integration of high-throughput data sources as it is used in the literature differs from our point of view.

In a first group of integration studies, heterogeneous data from different sources were analyzed sequentially; that is, one data source was analyzed while the second was used as confirmation of the found results or for further deepening the understanding of the results [22]. Such approaches are used for biological discovery and a better understanding of the development of a disease, but not for predictive purposes. For example, Fridlyand and colleagues [23] found three breast tumor subtypes with a distinct CNV pattern based on array CGH data. Microarray data were subsequently analyzed to identify the functional categories that characterized these subtypes. Tomioka et al. [24] analyzed microarray and array CGH data of patients with neuroblastoma in a similar way. Genomic signatures resulted from the array CGH data, while molecular signatures were found after the microarray analysis. The authors suggested that a combination of these independent prognostic indicators would be clinically useful.

The term data integration has also been used as a synonym for data merging in which different data sets are concatenated at the database level by cross-referencing the sequence identifiers, which requires semantic compatibility among data sets [25,26]. Data merging is a complex task due to, for example, the use of different identifiers, the absence of a 'one gene-one protein' relationship, alternative splicing, and measurement of multiple signals for one gene.
In most studies, the concordance between the merged data sets and their interpretation in the context of biological pathways and regulatory mechanisms are investigated. Analyses of the merged data set by clustering or correlating the protein and microarray data can help identify candidate targets when changes in expression occur at both the gene and protein levels. However, there has been only modest success from correlation studies of gene and protein expression. Bitton et al. [27] combined proteomics data with exon array data, which allowed a much more fine-grained analysis by assigning peptides to their originating exons instead of mapping transcripts and proteins based on their IDs.

Our definition for the combination of heterogeneous biological data is different. We integrate multiple layers of experimental data into one mathematical model for the development of more homogeneous classifiers in clinical decision support. For this purpose, we present a kernel-based integration framework. Integration occurs within the patient domain at a level not so far described in the literature. Instead of merging data sets or analyzing them in turn, the variables from different omics data are treated equally. This leads to the selection of the most relevant features from all available data sources, which are combined in a machine learning-based model. We were inspired by the idea of Lanckriet and colleagues [28]. They presented an integration framework in which each data set is transformed into a kernel matrix. Integration occurs on this kernel level without referring back to the data. They applied their framework to amino acid sequence information, expression data, protein-protein interaction data, and other types of genomic information to solve a single classification problem: the classification of transmembrane versus non-transmembrane proteins. In this study by Lanckriet and colleagues, all considered data sets were publicly available.
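Kernel-level integration can be sketched in a few lines: compute one patient-by-patient kernel matrix per omics source, normalize each so that differently scaled sources are comparable, and take a convex combination, which is again a valid kernel. The sketch below is a minimal illustration under assumptions of ours (linear kernels, cosine normalization, equal weights); Lanckriet et al. instead determine the weights by solving an optimization problem.

```python
import numpy as np

def normalize_kernel(K):
    # cosine normalization: K_ij / sqrt(K_ii * K_jj), so every patient has
    # unit self-similarity and sources on different scales become comparable
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def integrate_kernels(sources, weights=None):
    """Build one patient-by-patient kernel per omics source and combine them.

    sources: list of (n_patients, n_features_k) arrays; the feature count may
    differ per source, but the patient ordering must match across sources.
    A convex combination of valid kernels is itself a valid kernel.
    """
    Ks = [normalize_kernel(X @ X.T) for X in sources]
    if weights is None:  # equal weights by default
        weights = np.full(len(Ks), 1.0 / len(Ks))
    return sum(w * K for w, K in zip(weights, Ks))
```

The combined matrix can then be handed directly to any kernel classifier, such as the LS-SVM used in this study, without the classifier ever referring back to the individual data sets.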
This requires a computationally intensive framework for determining the relevance of each data set by solving an optimization problem. Within our set-up, however, all data sources are derived from the patients themselves. This makes the gathering of these data sets highly costly and limits the number of data sets, but guarantees more relevance for the problem at hand.

We previously investigated whether the prediction of distant metastasis in breast cancer patients could be improved when considering microarray data besides clinical data [29]. In this manuscript, we consider not only microarray data but also high-throughput data from multiple biological levels. Three different strategies for clinical decision support are proposed: the use of individual data sets (referred to as step A); an integration of each data type over time by manually calculating the change in expression (step B); and an approach in which data sets are integrated over multiple layers in the genome (and over time) by treating variables from the different data sets equally (step C).

We apply our framework to two cases, summarized in Table 1. In the first case on rectal cancer, tumor regression grade, lymph node status, and circumferential margin involvement (CRM) are predicted for 36 patients based on microarray and proteomics data, gathered at two time points during therapy. The second case on prostate cancer involves microarray and copy number variation data from 55 patients. Tumor grade, stage, metastasis, and occurrence of recurrence were available for prediction [30,31].

Materials and methods

Data set I: rectal cancer

Patients and treatment

Forty patients with rectal cancer (T3-T4 and/or N+) from seven Belgian centers were enrolled in a phase I/II study investigating the combination of cetuximab, capecitabine, and external beam radiotherapy in the preoperative treatment of patients with rectal cancer [32].
These patients received preoperative radiotherapy (1.8 Gy, 5 days/week for 5 weeks) in combination with cetuximab (initial dose 400 mg/m² intravenous given 1 week before the beginning of radiation followed by 250 mg/m²/week for 5 weeks) and capecitabine for the duration of radiotherapy (first dose level, 650 mg/m² orally twice-daily; second dose level, 825 mg/m² twice-daily; including weekends). Details of the eligibility criteria, pretreatment evaluation, radiotherapy, chemotherapy and cetuximab administration, surgery, follow-up, and histopathological assessment of response to chemoradiation have been published [32].

Data preprocessing

Tissue and plasma samples were gathered at three time points: before treatment (T0); after the first loading dose of cetuximab but before the start of radiotherapy with capecitabine (T1); and at the moment of surgery (T2). All experimental procedures were done following standard laboratory procedures, or following the manufacturers' instructions. Because of the exclusion of some patients due to a missing outcome value, death before surgery, or not having surgery, the data set ultimately contained 36 patients.

The frozen tissue samples were hybridized to Affymetrix human U133 2.0 plus gene chip arrays. The resulting data were first preprocessed for each time point separately using robust multichip analysis [33]. Secondly, the number of features was reduced from 54,613 probe sets to 27,650 genes by taking the median of all probe sets that matched on the same gene. Probe sets that matched on multiple genes were excluded because of the danger of cross-hybridization. Taking into account the low signal-to-noise ratio of microarray data, we finally filtered out genes with low variation across all samples.
Only retaining the genes with a variance in the top 25% reduced the number of features to 6,913 genes.

Ninety-six proteins known to be involved in cancer were measured in the plasma samples using a Luminex 100 instrument. Proteins that had absolute values above the detection limit in less than 20% of the samples were excluded for each time point separately. This resulted in the exclusion of six proteins at T0, four at T1, and six at T2. The proteomics expression values of transforming growth factor alpha, which had too many values below the detection limit, were replaced by the results of ELISA tests performed at the Department of Experimental Oncology in Leuven, Belgium. For the remaining proteins the missing values were replaced by half of the minimum detected for each protein over all samples, and values exceeding the upper limit were replaced by the upper limit value. Because most of the proteins had a positively skewed distribution, a log transformation (base 2) was performed.

In this paper, only the data sets at T0 and T1 were used because our goal is to predict the four different outcomes before therapy or early in therapy.

Response classification

A semiquantitative classification system has been described by Wheeler et al. [34] for determining histopathological tumor regression (that is, the therapy response). There are also two prognostic factors important in rectal cancer: pathologic lymph node involvement and CRM [35]. Because the completeness of tumor resection relies on the assessment of resection margins by the pathologist, knowledge of the CRM before therapy provides important prognostic information for local recurrence and for development of distant metastasis and survival [36].

These three outcomes were registered for 36 patients at the moment of surgery. For all these outcomes, 'responders' are distinguished from 'non-responders'.
The grading of regression established by Wheeler and colleagues [34] (from now on referred to as WHEELER) is a modified pathological staging system for irradiated rectal cancer. It includes a measurement of tumor response after preoperative therapy: grade 1, good responsiveness (tumor is sterilized or only microscopic foci of adenocarcinoma remain); grade 2, moderate responsiveness (marked fibrosis but still with a macroscopic tumor); grade 3, poor responsiveness (little or no fibrosis with abundant macroscopic tumor). Tumors are classified as 'responder' when assigned to grade 1 (26 patients) and 'non-responder' when assigned to grade 2 or 3 (10 patients). Response can also be evaluated with the pathologic lymph node stage at surgery (pN-STAGE). The 'responder' class contains 22 patients with no lymph nodes found at surgery while the 'non-responder' class contains 14 patients with at least 1 regional lymph node. CRM was measured according to the guidelines of Quirke et al. [37]. CRM was considered positive when the distance between the tumor and the mesorectal fascia was ≤2 mm. Tumors with a negative CRM are classified as 'responder' (27 patients), while tumors with a positive CRM belong to the 'non-responder' class (9 patients). Thirteen patients belong to the 'responder' class for all three outcomes, while there is an overlap of two patients between the 'non-responder' classes.

Table 1
Overview of the two case studies on rectal and prostate cancer

                                            Data set I: rectal cancer       Data set II: prostate cancer
Number of samples                           36                              55
Data sources                                Microarray, Proteomics          Microarray, Genomics
Number of features (after preprocessing)    T0: 6,913 genes; 90 proteins    6,974 genes
                                            T1: 6,913 genes; 92 proteins    7,305 CNVs
Outcomes                                    WHEELER, pN-STAGE, CRM          GRADE, STAGE, METASTASIS, RECURRENCE
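The feature filtering and detection-limit handling described above for data set I can be sketched as follows. This is a toy numpy illustration; the function names and the scalar upper-limit argument are ours, not the paper's, and a real pipeline would apply per-protein detection ranges.

```python
import numpy as np

def filter_top_variance(expr, keep_frac=0.25):
    """Keep only the genes whose variance across samples is in the top keep_frac.

    expr: (n_genes, n_samples) array of expression values.
    """
    v = expr.var(axis=1)
    cutoff = np.quantile(v, 1.0 - keep_frac)
    return expr[v >= cutoff]

def preprocess_proteins(prot, upper_limit):
    """Detection-limit handling as described in the text: missing values (NaN)
    are replaced by half the minimum detected value per protein, values above
    the upper limit are clipped to it, and the positively skewed data are then
    log2-transformed."""
    prot = prot.copy()
    for i in range(prot.shape[0]):
        row = prot[i]
        row[np.isnan(row)] = 0.5 * np.nanmin(row)  # half of minimum detected
        prot[i] = np.minimum(row, upper_limit)     # clip at detection ceiling
    return np.log2(prot)
```

With keep_frac=0.25, the variance filter is what reduces the 27,650 genes of the rectal cancer arrays to the 6,913 retained features (27,650 × 0.25 ≈ 6,913).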