Internet

8 pages
3 views

Principles component analysis, fuzzy weighting pre-processing and artificial immune recognition system based diagnostic system for diagnosis of lung cancer

of 8
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Share
Description
Principles component analysis, fuzzy weighting pre-processing and artificial immune recognition system based diagnostic system for diagnosis of lung cancer
Transcript
  Principles component analysis, fuzzy weighting pre-processingand artificial immune recognition system baseddiagnostic system for diagnosis of lung cancer Kemal Polat  * , Salih Gu¨nes  Selcuk University, Department of Electrical and Electronics Engineering, 42035 Konya, Turkey Abstract Lung cancers are cancers that begin in the lungs. Other types of cancers may spread to the lungs from other organs. However, theseare not lung cancers because they did not start in the lungs. It is evident that usage of machine learning methods in disease diagnosis hasbeen increasing gradually. In this study, diagnosis of lung cancer, which is a very common and important disease, was conducted withsuch a machine learning system. In this study, we have detected on lung cancer using principles component analysis (PCA), fuzzy weight-ing pre-processing and artificial immune recognition system (AIRS). The approach system has three stages. First, dimension of lung can-cer dataset that has 57 features is reduced to four features using principles component analysis. Second, a new weighting scheme based onfuzzy weighting pre-processing was utilized as a pre-processing step before the main classifier. Third, artificial immune recognition systemwas our used classifier. We took the lung cancer dataset used in our study from the UCI machine learning database. The obtained clas-sification accuracy of our system was 100% and it was very promising with regard to the other classification applications in literature forthis problem.   2006 Elsevier Ltd. All rights reserved. Keywords:  Principles component analysis; Artificial immune system; AIRS; Fuzzy weighting pre-processing; Lung cancer; Medical diagnosis 1. Introduction Lung cancers are cancers that begin in the lungs. Othertypes of cancers may spread to the lungs from otherorgans. However, these are not lung cancers because theydid not start in the lungs. When cancer cells spread fromone organ to another, they are called metastases. Researchhas found several risk factors for lung cancer. A ‘‘risk fac-tor’’ is anything that changes risk of getting a disease. Dif-ferent risk factors change risk by different amounts.The risk factors for lung cancer include the following(http://www.cdc.gov/lungcancer/basic_info/index.htm): •  smoking and being around others’ smoke, •  things around us at home or work (such as radon gas), •  personal traits (such as having a family history of lungcancer).Having so many factors to analyze to diagnose the lungcancer of a patient makes the physician’s job difficult.A physician usually makes decisions by evaluating the cur-rent test results of a patient and by referring to the previousdecisions she made on other patience with the same condi-tion. The former method depends strongly on the physi-cian’s knowledge. On the other hand, the latter dependson the physician’s experience to compare her patient withher earlier patients. This job is not easy considering thenumber of factors she has to evaluate. In this crucial step,she may need an accurate tool that lists her previous deci-sions on the patient having same (or close to same) factors. 0957-4174/$ - see front matter    2006 Elsevier Ltd. All rights reserved.doi:10.1016/j.eswa.2006.09.001 * Corresponding author. Tel.: +90 332 223 2056; fax: +90 332 241 0635. E-mail addresses:  kpolat@selcuk.edu.tr (K. Polat), sgunes@selcuk.edu.tr (S. Gu¨nes  ). www.elsevier.com/locate/eswa Expert Systems with Applications 34 (2008) 214–221 Expert Systems with Applications  Motivated by the need of such an important classifica-tion method, in this study, we propose a method to diag-nose the lung cancer. The proposed method uses PCA,AIRS classification system and a weighting process, whichis based on fuzzy weighted pre-processing method. First,dimension of lung cancer dataset that has 57 features isreduced to 4 features using principles component analysis.Second, a new weighting scheme based on fuzzy weightingpre-processing was utilized as a pre-processing step beforethe main classifier. Third, artificial immune recognition sys-tem was our used classifier. We took the lung cancer data-set used in our study from the UCI machine learningdatabase. The obtained classification accuracy of our sys-tem was 100% and it was very promising with regard tothe other classification applications in literature for thisproblem.The remaining of the paper is organized as follows. InSection 2, we give a brief background on natural and arti-ficial immune systems. We present the proposed method inSection 3. In Section 4, we give the experimental data to show the effectiveness of our method. Finally, we concludethis paper in Section 5 with future directions. 2. Natural and artificial immune systems The natural immune system is a distributed novel-pat-tern detection system with several functional componentspositioned in strategic locations throughout the body.Immune system regulates defense mechanism of the bodyby means of innate and adaptive immune responses.Between these, adaptive immune response is much moreimportant for us because it contains metaphors like recog-nition, memory acquisition, diversity, self-regulation, etc.The main architects of adaptive immune response are lym-phocytes, which can be divided into two classes as T and BLymphocytes (cells), each having its own function. Espe-cially B cells have a great importance because of theirsecreted antibodies (Abs) that take very critical roles inadaptive immune response.The simplified working procedure of the human immunesystem is illustrated in Fig. 1. Specialized antigen present-ing cells (APCs) called macrophages circulates throughthe body and if they encounter an antigen, they ingestand fragment them into antigenic peptides (I). The piecesof these peptides are displayed on the cell surface by majorhistocompatibility complex (MHC) molecules existing inthe digesting APC. The presented MHC-peptide combina-tion on the cell surface is recognized by the T-cells causingthem to be activated (II). Activated T cells secrete somechemicals as alert signals to other units in response to thisrecognition (III). B cells, one of the units that take thesesignals from the T cells, become activated with the recogni-tion of antigen by their antibodies occurred in the sametime (IV). When activated, B cells turn into plasma cellsthat secrete bound antibodies on their surfaces (V).Secreted antibodies bind the existing antigens and neutral-ize them signaling other components of immune systemto destruct the antigen–antibody complex (VI) (De Castro& Timmis, 2002). For more detailed information aboutimmune system refer to Abbas and Lichtman (2003),S  ahan, Kodaz, Gu¨nes  , and Polat (2004), Perelson andOster (1979).Artificial immune systems emerged in the 1990s as a newcomputational research area. Artificial immune systemslink several emerging computational fields inspired by bio-logical behavior such as artificial neural networks and arti-ficial life (De Castro & Timmis, 2002).In the studies conducted in the field of AIS, B cell mod-eling is the most encountered representation type. Differentrepresentation methods have been proposed in that model-ing. Among these, shape-space representation is the mostcommonly used one (De Castro & Timmis, 2002).The shape-space model (S) aims at quantitativelydescribing the interactions among antigens (Ags), the for-eign elements that enter the body like microbe . . . etc.,and antibodies (Ag–Ab). The set of features that character-ize a molecule is called its  generalized shape . The Ag– Ab representation (binary or real-valued) determines adistance measure to be used to calculate the degree of inter-action between these molecules. Mathematically, the gener-alized shape of a molecule ( m ), either an antibody or anantigen, can be represented by a set of coordinates m  =  h m 1 , m 2 , . . . , m L i , which can be regarded as a point inan  L -dimensional real-valued shape-space ( m  2  S  L ). In thiswork, we used real strings to represent the molecules. Anti-gens and antibodies are considered to be the same length  L . Fig. 1. Immune response. K. Polat, S. Gu¨nes  / Expert Systems with Applications 34 (2008) 214–221  215  The length and cell representation depend upon the givenproblem (Perelson & Oster, 1979). 3. The proposed system 3.1. Overview The proposed medical decision making method consistsof three main parts: PCA, fuzzy weighted pre-processingand AIRS classifier. We apply the former to the given lungcancer dataset to weight the input data. In the first stage,dimension of lung cancer dataset that has 57 features isreduced to four features using principles component analy-sis. In the second stage, a new weighting scheme based onfuzzy weighting pre-processing was utilized as a pre-processing step before the main classifier. Then, in the thirdstage, artificial immune recognition system (AIRS) was ourused classifier. The block diagram of the proposed methodis given in Fig. 2. We explain the details of the dimension-ality reduction, pre-processing and classification steps inthe following sections. 3.1.1. Principles component analysis (PCA) PCA was used to make a classifier system more effective.For this aim, before classifying with fuzzy weighting pre-processing, PCA method was used for dimensionalityreduction of lung cancer dataset. Therefore, lung cancerwas represented a vector consists of 57 attributes. PCA isbased on the assumption that most information about clas-ses is contained in the directions along which the variationsare the largest. The most common derivation of PCA is interms of a standardized linear projection, which maximizesthe variance in the projected space (Wang & Paliwal, 2003).For a given  p -dimensional data set  X  , the  m  principal axes T  1 , T  2 , . . . , T  m , where 1 6 m 6  p , are orthonormal axesonto which the retained variance is maximum in the pro- jected space. Generally,  T  1 , T  2 , . . . , T  m  can be given by the m  leading eigenvectors of the sample covariance matrix S   ¼ ð 1 =  N  Þ P  N i ¼ 1 ð  x i    l Þ T  ð  x i    l Þ , where  x i   2  X  ,  l  is the sam-ple mean and  N   is the number of samples, so that: ST  i  ¼  k i T  i ;  i  2  1 ;  . . .  ; m  ð 1 Þ where  k i   is the  i  th largest eigenvalue of   S  . The  m  principalcomponents of a given observation vector  x i   2  X   are givenby  y   ¼ ½  y  1 ;  y  2 ;  . . .  ;  y  m  ¼ ½ T  T  1  x ; T  T  2  ;  . . .  ; T  T m  ¼  T  T   x  ð 2 Þ The  m  principal components of   x  are decorrelated in theprojected space. In multi-class problems, the variations of data are determined on a global basis, that is, the principalaxes are derived from a global covariance matrix:  b S   ¼  1  N  X  K  j ¼ 1 X  N   j i ¼ 1 ð  x  j    ^ l Þð  x  j    ^ l Þ T  ð 3 Þ where  ^ l  is the global mean of all the samples,  K   is the num-ber of classes,  N   j   is the number of samples in class  j  ;  N   ¼ P  K  j ¼ 1  N   j  and  x  ji   represents the  i  th observation fromclass  j  . The principal axes  T  1 , T  2 , . . . , T  m  are therefore the m  leading eigenvectors of   b S  :  b ST  i  ¼  ^ k i T  i ;  i  2  1 ;  . . .  ; m  ð 4 Þ where  ^ k i  is the  i  th largest eigenvalue of   b S  . An assumptionmade for feature extraction and dimensionality reductionby PCA is that most information of the observation vectorsis contained in the subspace spanned by the first m princi-pal axes, where  m  <  p . Therefore, each srcinal data vectorcan be represented by its principal component vector withdimensionality  m  (Lindsay, 2002). 3.1.2. Fuzzy weighting pre-processing  In the fuzzy weighting procedure, each feature takes newfeature value according to its old value. Two membershipfunctions are defined in this procedure known as inputand output membership functions. These are selected as tri-angular membership functions as shown in Figs. 3 and 4,respectively. Lung Cancer Dataset that has 57 attributesDimensionality Reduction using Principles Component Analysis (PCA)Artificial Immune Recognition System (AIRS) Classification of Lung Cancer Dataset that has 4 attributes Fuzzy Weighted Pre-processing Fig. 2. The block diagram of proposed system. Fig. 3. Input membership functions.216  K. Polat, S. Gu¨nes  / Expert Systems with Applications 34 (2008) 214–221  Firstly, the formation of these membership functions isrealized as follows: As a first step, the mean values of eachfeature are calculated through using all of the samples’ cor-responding feature values in Eq. (3): m i  ¼  1  N  X  N k  ¼ 1  x k  ; i  ð 5 Þ here  x k  , i   represents the  i  th feature value of sample  x k  , k   = 1,2, . . . , N  . After calculation of these sample meansfor each feature, the input membership function is formedby triangles as in Fig. 4. The supports of these triangles aredetermined by  m i  /8,  m i  /4,  m i  /2,  m i  , 2  ·  m i  , 4  ·  m i  , 8  ·  m i   asshown in Fig. 3. The lines of input membership functionsare named as mf  1 ,mf  2 , . . . , mf  8 . For the output member-ship function formation, again eight parts formed member-ship functions (Fig. 4). The interval [0,1] is divided intoeight equal part and the corresponding lines are againnamed but in this case as mf  0 1 ; mf  0 2 ;  . . .  ; mf  0 8 . Before contin-uing it is worth to note that these input and output mem-bership functions are formed for each feature so therewill be exist different input–output membership functionconfiguration for each feature since the sample means of each feature differs.After determination of input and output membershipfunctions for each feature, the weighting procedure comesinto scene. For a feature value, say  x k  , i  , that is for  i  th fea-ture value of   x k   sample, this value is taken as in the  x -axisof input membership function and  y  values of the points atwhich this value cuts the input membership functions aredetermined. For example if this feature value is between0 and  m i  /8, then this point will cut both line mf  1  andmf  2 . The  y  values at these intersection points, say  y 1  and  y 2 , are known as membership values ( l ) and they will thenbe used in a fuzzy rule base in the following manner: firstly,the input membership value,  l ( i  ), is determined by usingthe above intersection points: l ð i Þ ¼  l  A \  B ð  x k  ; i Þ ¼  MIN ð l  A ð  x k  ; i Þ ; l  B ð  x k  ; i ÞÞ ;  x  2  X   ð 6 Þ Here,  l A ( x k  , i  ) and  l B  ( x k  , i  ) membership values correspondto the intersection points as mentioned above. The rulebase for our system is used as presented in Table 1. Afterthis  l ( i  ) value is determined through using Eq. (4) forour  x k  , i   feature value, the output weight value is then deter-mined by using output membership functions and the rulesin Table 1. Here in determining weight as a last step, firstlythe input membership value,  l ( i  ), is presented to outputmembership function to determine the correspondingweighted value of our srcinal feature value. This member-ship value is now taken as a point in  y -axis of the outputmembership functions and again as for the case in inputmembership functions, the intersection points are deter-mined which are cut by this membership value. It is appar-ent from output membership functions that there will bemore than one intersection points. That which of them willbe used is decided through the rules in Table 1. For exam-ple, if input feature value cuts mf  1  and mf  2  lines in inputmembership functions then the output value for this fea-ture will be the mean of two points that  l ( i  ) cuts mf  0 1  andmf  0 2  at the output membership functions (Polat, S  ahan, &Gu¨nes  , 2006). 3.1.3. The parameters in AIRS classifier One of the important advantages of AIRS is that it isnot necessary to know the appropriate settings for the clas-sifier in advance. The most important feature of the classi-fier is its self-determination ability (Watkins, 2001). Theexplanations of each parameter used in AIRS are givenbelow. Table 2 summarizes these parameters. •  Mutation rate : A parameter between 0 and 1 that indi-cates the probability that any given feature (or the out-put) of an  ARB   will be mutated. Fig. 4. Output membership functions.Table 1Fuzzy rule base for our system k   value Rules1 if Input_value cuts mf( k  ) and mf( k   + 1) thenOutput_value = (mf( k  ) 0 (  y ) + mf( k   + 1) 0 (  y ))/22 if Input_value cuts mf( k  ) and mf( k   + 1) thenOutput_value = mf( k  ) 0 (  y ) + mf( k   + 1) 0 (  y ))/23 if Input_value cuts mf( k  ) and mf( k   + 1) thenOutput_value = mf( k  ) 0 (  y ) + mf( k   + 1) 0 (  y ))/24 if Input_value cuts mf( k  ) and mf( k   + 1) thenOutput_value = mf( k  ) 0 (  y ) + mf( k   + 1) 0 (  y ))/25 if Input_value cuts mf( k  ) and mf( k   + 1) thenOutput_value = mf( k  ) 0 (  y ) + mf( k   + 1) 0 (  y ))/26 if Input_value cuts mf( k  ) and mf( k   + 1) thenOutput_value = mf( k  ) 0 (  y ) + mf( k   + 1) 0 (  y ))/27 if Input_value cuts mf( k  ) and mf( k   + 1) thenOutput_value = mf( k  ) 0 (  y ) + mf( k   + 1) 0 (  y ))/2Table 2Used parameter values in AIRS for the lung cancerUsed parameters ValueMutation rate 0.10ATS (affinity threshold scalar) 0.2Stimulation threshold 0.99Clonal rate 10Hyper clonal rate 2.0Number of resources 250Iteration number 30 k   Value for nearest neighbour 1 K. Polat, S. Gu¨nes  / Expert Systems with Applications 34 (2008) 214–221  217  •  Affinity threshold scalar  ( ATS  ): a value between 0 and 1that provides a cut-off value for memory cell replace-ment in the  AIRS   training routine when multiplied bythe  affinity threshold  . •  Stimulation threshold  : a parameter between 0 and 1 usedas a stopping criterion for the training on a specific antigen . •  Clonal rate : an integer value used to determine the num-ber of mutated clones that a given  ARB   is allowed toattempt to produce. •  Number of resources : a parameter that limits the numberof   ARB  s allowed in the system. Each ARB is allocatedto a number of resources based on its  stimulation value and the  clonal rate . •  k nearest neighbor  ( KNN  ): a classification scheme inwhich the response of the classifier to a previouslyunseen item is determined by a majority vote amongthe  k   closest data points. •  k value : the parameter that indicates how many  memorycells  should be used to determine the classification of agiven test item. 3.1.4. AIRS classification algorithm AIRS is a resource limited supervised learning algorithminspired from immune metaphors. In this algorithm, theused immune mechanisms are resource competition, clonalselection, affinity maturation and memory cell formation.The feature vectors presented for training and test arenamed as antigens while the system units are called as Bcells. Similar B cells are represented with artificial recogni-tion balls (ARBs) and these ARBs compete with each otherfor a fixed resource number. This provides ARBs, whichhave higher affinities to the training antigen to improve.The memory cells formed after the whole training antigenswere presented are used to classify test antigens. The algo-rithm is composed of four main stages, which are initializa-tion, memory cell identification and ARB generation,competition for resources and development of a candidatememory cell, and memory cell introduction. Table 3 sum-marizes the mapping between the immune system andAIRS.We give the details of our algorithm below.1.  Initialization : Create a set of cells called the memorypool ( M  ) and the ARB pool ( P  ) from randomly selectedtraining data.2.  Antigenic presentation : For each antigenic pattern do:(a)  Clonal expansion : For each element of   M  , determineits affinity to the antigenic pattern, which resides inthe same class. Select the highest affinity memory cell(mc) and clone mc in proportion to its antigenicaffinity to add to the set of ARBs ( P  ).(b)  Affinity maturation : Mutate each ARB descendant of the highest affinity mc. Place each mutated ARBinto  P  .(c)  Metadynamics of ARBs : Process each ARB using theresource allocation mechanism. This process willresult in some ARB death, and ultimately con-trols the population. Calculate the average stimula-tion for each ARB, and check for terminationcondition.(d)  Clonal expansion and affinity maturation : Clone andmutate the randomly selected subset of the ARBsleft in  P   based on their stimulation level.(e)  Cycle : While the average stimulation value of eachARB class group is less than a given stimulationthreshold go to step 2(c).(f)  Metadynamics of memory cells : Select the highestaffinity ARB of the same class as the antigenfrom the last antigenic interaction. If the affinity of this ARB with the antigenic pattern is better thanthat of the previously identified best memory cellmc then add the candidate (mc-candidate) to mem-ory set  M  . If the affinity of mc and mc-candidateare below the affinity threshold, remove mc from  M  .3.  Classify : Classify data items using the memory set  M  .Classification is performed in a  k  -nearest neighbor fash-ion with a vote being made among the  k   closest memorycells to the given data item being classified.We can characterize AIRS as follows: •  Memory : The memory of the AIRS algorithm is in thepool of memory cells developed through exposure tothe training data (experiences); •  Adaptation : The adaptation occurs primarily in the ARBpool. With each new experience, AIRS evolves a candi-date memory cell in reaction to this experience. If thismemory cell is of sufficient quality, then the memorystructure is adapted to include in it. •  Decision-making  : The initial decision is which memorycell is the most similar to the incoming training antigen.This cell is used as a progenitor for a pool of evolvingcells. During classification, the primary classificationdecision is made based on the  k   most similar memorycells to the data item being classified. Table 3Mapping between the immune system and AIRSImmunesystemAIRSAntibody Feature vectorRecognitionballCombination of feature vector and vector classShape-space Type and possible values of the data vectorClonalexpansionReproduction of ARBs that are well matched antigensAntigens Training dataAffinitymaturationRandom mutation of ARB and removal of the leaststimulated ARBsImmunememoryMemory set of mutated ARBsMetadynamics Continual removal and creation of ARBs and memorycells218  K. Polat, S. Gu¨ nes  / Expert Systems with Applications 34 (2008) 214–221
Related Documents
View more...
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks