iDocSlide.Com

Free Online Documents. Like!

All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.

Share

Description

Principles component analysis, fuzzy weighting pre-processing and artificial immune recognition system based diagnostic system for diagnosis of lung cancer

Tags

Transcript

Principles component analysis, fuzzy weighting pre-processingand artiﬁcial immune recognition system baseddiagnostic system for diagnosis of lung cancer
Kemal Polat
*
, Salih Gu¨nes
Selcuk University, Department of Electrical and Electronics Engineering, 42035 Konya, Turkey
Abstract
Lung cancers are cancers that begin in the lungs. Other types of cancers may spread to the lungs from other organs. However, theseare not lung cancers because they did not start in the lungs. It is evident that usage of machine learning methods in disease diagnosis hasbeen increasing gradually. In this study, diagnosis of lung cancer, which is a very common and important disease, was conducted withsuch a machine learning system. In this study, we have detected on lung cancer using principles component analysis (PCA), fuzzy weight-ing pre-processing and artiﬁcial immune recognition system (AIRS). The approach system has three stages. First, dimension of lung can-cer dataset that has 57 features is reduced to four features using principles component analysis. Second, a new weighting scheme based onfuzzy weighting pre-processing was utilized as a pre-processing step before the main classiﬁer. Third, artiﬁcial immune recognition systemwas our used classiﬁer. We took the lung cancer dataset used in our study from the UCI machine learning database. The obtained clas-siﬁcation accuracy of our system was 100% and it was very promising with regard to the other classiﬁcation applications in literature forthis problem.
2006 Elsevier Ltd. All rights reserved.
Keywords:
Principles component analysis; Artiﬁcial immune system; AIRS; Fuzzy weighting pre-processing; Lung cancer; Medical diagnosis
1. Introduction
Lung cancers are cancers that begin in the lungs. Othertypes of cancers may spread to the lungs from otherorgans. However, these are not lung cancers because theydid not start in the lungs. When cancer cells spread fromone organ to another, they are called metastases. Researchhas found several risk factors for lung cancer. A ‘‘risk fac-tor’’ is anything that changes risk of getting a disease. Dif-ferent risk factors change risk by diﬀerent amounts.The risk factors for lung cancer include the following(http://www.cdc.gov/lungcancer/basic_info/index.htm):
•
smoking and being around others’ smoke,
•
things around us at home or work (such as radon gas),
•
personal traits (such as having a family history of lungcancer).Having so many factors to analyze to diagnose the lungcancer of a patient makes the physician’s job diﬃcult.A physician usually makes decisions by evaluating the cur-rent test results of a patient and by referring to the previousdecisions she made on other patience with the same condi-tion. The former method depends strongly on the physi-cian’s knowledge. On the other hand, the latter dependson the physician’s experience to compare her patient withher earlier patients. This job is not easy considering thenumber of factors she has to evaluate. In this crucial step,she may need an accurate tool that lists her previous deci-sions on the patient having same (or close to same) factors.
0957-4174/$ - see front matter
2006 Elsevier Ltd. All rights reserved.doi:10.1016/j.eswa.2006.09.001
*
Corresponding author. Tel.: +90 332 223 2056; fax: +90 332 241 0635.
E-mail addresses:
kpolat@selcuk.edu.tr (K. Polat), sgunes@selcuk.edu.tr (S. Gu¨nes
).
www.elsevier.com/locate/eswa
Expert Systems with Applications 34 (2008) 214–221
Expert Systems with Applications
Motivated by the need of such an important classiﬁca-tion method, in this study, we propose a method to diag-nose the lung cancer. The proposed method uses PCA,AIRS classiﬁcation system and a weighting process, whichis based on fuzzy weighted pre-processing method. First,dimension of lung cancer dataset that has 57 features isreduced to 4 features using principles component analysis.Second, a new weighting scheme based on fuzzy weightingpre-processing was utilized as a pre-processing step beforethe main classiﬁer. Third, artiﬁcial immune recognition sys-tem was our used classiﬁer. We took the lung cancer data-set used in our study from the UCI machine learningdatabase. The obtained classiﬁcation accuracy of our sys-tem was 100% and it was very promising with regard tothe other classiﬁcation applications in literature for thisproblem.The remaining of the paper is organized as follows. InSection 2, we give a brief background on natural and arti-ﬁcial immune systems. We present the proposed method inSection 3. In Section 4, we give the experimental data to
show the eﬀectiveness of our method. Finally, we concludethis paper in Section 5 with future directions.
2. Natural and artiﬁcial immune systems
The natural immune system is a distributed novel-pat-tern detection system with several functional componentspositioned in strategic locations throughout the body.Immune system regulates defense mechanism of the bodyby means of innate and adaptive immune responses.Between these, adaptive immune response is much moreimportant for us because it contains metaphors like recog-nition, memory acquisition, diversity, self-regulation, etc.The main architects of adaptive immune response are lym-phocytes, which can be divided into two classes as T and BLymphocytes (cells), each having its own function. Espe-cially B cells have a great importance because of theirsecreted antibodies (Abs) that take very critical roles inadaptive immune response.The simpliﬁed working procedure of the human immunesystem is illustrated in Fig. 1. Specialized antigen present-ing cells (APCs) called macrophages circulates throughthe body and if they encounter an antigen, they ingestand fragment them into antigenic peptides (I). The piecesof these peptides are displayed on the cell surface by majorhistocompatibility complex (MHC) molecules existing inthe digesting APC. The presented MHC-peptide combina-tion on the cell surface is recognized by the T-cells causingthem to be activated (II). Activated T cells secrete somechemicals as alert signals to other units in response to thisrecognition (III). B cells, one of the units that take thesesignals from the T cells, become activated with the recogni-tion of antigen by their antibodies occurred in the sametime (IV). When activated, B cells turn into plasma cellsthat secrete bound antibodies on their surfaces (V).Secreted antibodies bind the existing antigens and neutral-ize them signaling other components of immune systemto destruct the antigen–antibody complex (VI) (De Castro& Timmis, 2002). For more detailed information aboutimmune system refer to Abbas and Lichtman (2003),S
ahan, Kodaz, Gu¨nes
, and Polat (2004), Perelson andOster (1979).Artiﬁcial immune systems emerged in the 1990s as a newcomputational research area. Artiﬁcial immune systemslink several emerging computational ﬁelds inspired by bio-logical behavior such as artiﬁcial neural networks and arti-ﬁcial life (De Castro & Timmis, 2002).In the studies conducted in the ﬁeld of AIS, B cell mod-eling is the most encountered representation type. Diﬀerentrepresentation methods have been proposed in that model-ing. Among these, shape-space representation is the mostcommonly used one (De Castro & Timmis, 2002).The shape-space model (S) aims at quantitativelydescribing the interactions among antigens (Ags), the for-eign elements that enter the body like microbe
. . .
etc.,and antibodies (Ag–Ab). The set of features that character-ize a molecule is called its
generalized shape
. The Ag– Ab representation (binary or real-valued) determines adistance measure to be used to calculate the degree of inter-action between these molecules. Mathematically, the gener-alized shape of a molecule (
m
), either an antibody or anantigen, can be represented by a set of coordinates
m
=
h
m
1
,
m
2
,
. . .
,
m
L
i
, which can be regarded as a point inan
L
-dimensional real-valued shape-space (
m
2
S
L
). In thiswork, we used real strings to represent the molecules. Anti-gens and antibodies are considered to be the same length
L
.
Fig. 1. Immune response.
K. Polat, S. Gu¨nes
/ Expert Systems with Applications 34 (2008) 214–221
215
The length and cell representation depend upon the givenproblem (Perelson & Oster, 1979).
3. The proposed system
3.1. Overview
The proposed medical decision making method consistsof three main parts: PCA, fuzzy weighted pre-processingand AIRS classiﬁer. We apply the former to the given lungcancer dataset to weight the input data. In the ﬁrst stage,dimension of lung cancer dataset that has 57 features isreduced to four features using principles component analy-sis. In the second stage, a new weighting scheme based onfuzzy weighting pre-processing was utilized as a pre-processing step before the main classiﬁer. Then, in the thirdstage, artiﬁcial immune recognition system (AIRS) was ourused classiﬁer. The block diagram of the proposed methodis given in Fig. 2. We explain the details of the dimension-ality reduction, pre-processing and classiﬁcation steps inthe following sections.
3.1.1. Principles component analysis (PCA)
PCA was used to make a classiﬁer system more eﬀective.For this aim, before classifying with fuzzy weighting pre-processing, PCA method was used for dimensionalityreduction of lung cancer dataset. Therefore, lung cancerwas represented a vector consists of 57 attributes. PCA isbased on the assumption that most information about clas-ses is contained in the directions along which the variationsare the largest. The most common derivation of PCA is interms of a standardized linear projection, which maximizesthe variance in the projected space (Wang & Paliwal, 2003).For a given
p
-dimensional data set
X
, the
m
principal axes
T
1
,
T
2
,
. . .
,
T
m
, where 1
6
m
6
p
, are orthonormal axesonto which the retained variance is maximum in the pro- jected space. Generally,
T
1
,
T
2
,
. . .
,
T
m
can be given by the
m
leading eigenvectors of the sample covariance matrix
S
¼ ð
1
=
N
Þ
P
N i
¼
1
ð
x
i
l
Þ
T
ð
x
i
l
Þ
, where
x
i
2
X
,
l
is the sam-ple mean and
N
is the number of samples, so that:
ST
i
¼
k
i
T
i
;
i
2
1
;
. . .
;
m
ð
1
Þ
where
k
i
is the
i
th largest eigenvalue of
S
. The
m
principalcomponents of a given observation vector
x
i
2
X
are givenby
y
¼ ½
y
1
;
y
2
;
. . .
;
y
m
¼ ½
T
T
1
x
;
T
T
2
;
. . .
;
T
T m
¼
T
T
x
ð
2
Þ
The
m
principal components of
x
are decorrelated in theprojected space. In multi-class problems, the variations of data are determined on a global basis, that is, the principalaxes are derived from a global covariance matrix:
b
S
¼
1
N
X
K j
¼
1
X
N
j
i
¼
1
ð
x
j
^
l
Þð
x
j
^
l
Þ
T
ð
3
Þ
where
^
l
is the global mean of all the samples,
K
is the num-ber of classes,
N
j
is the number of samples in class
j
;
N
¼
P
K j
¼
1
N
j
and
x
ji
represents the
i
th observation fromclass
j
. The principal axes
T
1
,
T
2
,
. . .
,
T
m
are therefore the
m
leading eigenvectors of
b
S
:
b
ST
i
¼
^
k
i
T
i
;
i
2
1
;
. . .
;
m
ð
4
Þ
where
^
k
i
is the
i
th largest eigenvalue of
b
S
. An assumptionmade for feature extraction and dimensionality reductionby PCA is that most information of the observation vectorsis contained in the subspace spanned by the ﬁrst m princi-pal axes, where
m
<
p
. Therefore, each srcinal data vectorcan be represented by its principal component vector withdimensionality
m
(Lindsay, 2002).
3.1.2. Fuzzy weighting pre-processing
In the fuzzy weighting procedure, each feature takes newfeature value according to its old value. Two membershipfunctions are deﬁned in this procedure known as inputand output membership functions. These are selected as tri-angular membership functions as shown in Figs. 3 and 4,respectively.
Lung Cancer Dataset that has 57 attributesDimensionality Reduction using Principles Component Analysis (PCA)Artificial Immune Recognition System (AIRS) Classification of Lung Cancer Dataset that has 4 attributes Fuzzy Weighted Pre-processing
Fig. 2. The block diagram of proposed system. Fig. 3. Input membership functions.216
K. Polat, S. Gu¨nes
/ Expert Systems with Applications 34 (2008) 214–221
Firstly, the formation of these membership functions isrealized as follows: As a ﬁrst step, the mean values of eachfeature are calculated through using all of the samples’ cor-responding feature values in Eq. (3):
m
i
¼
1
N
X
N k
¼
1
x
k
;
i
ð
5
Þ
here
x
k
,
i
represents the
i
th feature value of sample
x
k
,
k
= 1,2,
. . .
,
N
. After calculation of these sample meansfor each feature, the input membership function is formedby triangles as in Fig. 4. The supports of these triangles aredetermined by
m
i
/8,
m
i
/4,
m
i
/2,
m
i
, 2
·
m
i
, 4
·
m
i
, 8
·
m
i
asshown in Fig. 3. The lines of input membership functionsare named as mf
1
,mf
2
,
. . .
, mf
8
. For the output member-ship function formation, again eight parts formed member-ship functions (Fig. 4). The interval [0,1] is divided intoeight equal part and the corresponding lines are againnamed but in this case as mf
0
1
;
mf
0
2
;
. . .
;
mf
0
8
. Before contin-uing it is worth to note that these input and output mem-bership functions are formed for each feature so therewill be exist diﬀerent input–output membership functionconﬁguration for each feature since the sample means of each feature diﬀers.After determination of input and output membershipfunctions for each feature, the weighting procedure comesinto scene. For a feature value, say
x
k
,
i
, that is for
i
th fea-ture value of
x
k
sample, this value is taken as in the
x
-axisof input membership function and
y
values of the points atwhich this value cuts the input membership functions aredetermined. For example if this feature value is between0 and
m
i
/8, then this point will cut both line mf
1
andmf
2
. The
y
values at these intersection points, say
y
1
and
y
2
, are known as membership values (
l
) and they will thenbe used in a fuzzy rule base in the following manner: ﬁrstly,the input membership value,
l
(
i
), is determined by usingthe above intersection points:
l
ð
i
Þ ¼
l
A
\
B
ð
x
k
;
i
Þ ¼
MIN
ð
l
A
ð
x
k
;
i
Þ
;
l
B
ð
x
k
;
i
ÞÞ
;
x
2
X
ð
6
Þ
Here,
l
A
(
x
k
,
i
) and
l
B
(
x
k
,
i
) membership values correspondto the intersection points as mentioned above. The rulebase for our system is used as presented in Table 1. Afterthis
l
(
i
) value is determined through using Eq. (4) forour
x
k
,
i
feature value, the output weight value is then deter-mined by using output membership functions and the rulesin Table 1. Here in determining weight as a last step, ﬁrstlythe input membership value,
l
(
i
), is presented to outputmembership function to determine the correspondingweighted value of our srcinal feature value. This member-ship value is now taken as a point in
y
-axis of the outputmembership functions and again as for the case in inputmembership functions, the intersection points are deter-mined which are cut by this membership value. It is appar-ent from output membership functions that there will bemore than one intersection points. That which of them willbe used is decided through the rules in Table 1. For exam-ple, if input feature value cuts mf
1
and mf
2
lines in inputmembership functions then the output value for this fea-ture will be the mean of two points that
l
(
i
) cuts mf
0
1
andmf
0
2
at the output membership functions (Polat, S
ahan, &Gu¨nes
, 2006).
3.1.3. The parameters in AIRS classiﬁer
One of the important advantages of AIRS is that it isnot necessary to know the appropriate settings for the clas-siﬁer in advance. The most important feature of the classi-ﬁer is its self-determination ability (Watkins, 2001). Theexplanations of each parameter used in AIRS are givenbelow. Table 2 summarizes these parameters.
•
Mutation rate
: A parameter between 0 and 1 that indi-cates the probability that any given feature (or the out-put) of an
ARB
will be mutated.
Fig. 4. Output membership functions.Table 1Fuzzy rule base for our system
k
value Rules1 if Input_value cuts mf(
k
) and mf(
k
+ 1) thenOutput_value = (mf(
k
)
0
(
y
) + mf(
k
+ 1)
0
(
y
))/22 if Input_value cuts mf(
k
) and mf(
k
+ 1) thenOutput_value = mf(
k
)
0
(
y
) + mf(
k
+ 1)
0
(
y
))/23 if Input_value cuts mf(
k
) and mf(
k
+ 1) thenOutput_value = mf(
k
)
0
(
y
) + mf(
k
+ 1)
0
(
y
))/24 if Input_value cuts mf(
k
) and mf(
k
+ 1) thenOutput_value = mf(
k
)
0
(
y
) + mf(
k
+ 1)
0
(
y
))/25 if Input_value cuts mf(
k
) and mf(
k
+ 1) thenOutput_value = mf(
k
)
0
(
y
) + mf(
k
+ 1)
0
(
y
))/26 if Input_value cuts mf(
k
) and mf(
k
+ 1) thenOutput_value = mf(
k
)
0
(
y
) + mf(
k
+ 1)
0
(
y
))/27 if Input_value cuts mf(
k
) and mf(
k
+ 1) thenOutput_value = mf(
k
)
0
(
y
) + mf(
k
+ 1)
0
(
y
))/2Table 2Used parameter values in AIRS for the lung cancerUsed parameters ValueMutation rate 0.10ATS (aﬃnity threshold scalar) 0.2Stimulation threshold 0.99Clonal rate 10Hyper clonal rate 2.0Number of resources 250Iteration number 30
k
Value for nearest neighbour 1
K. Polat, S. Gu¨nes
/ Expert Systems with Applications 34 (2008) 214–221
217
•
Aﬃnity threshold scalar
(
ATS
): a value between 0 and 1that provides a cut-oﬀ value for memory cell replace-ment in the
AIRS
training routine when multiplied bythe
aﬃnity threshold
.
•
Stimulation threshold
: a parameter between 0 and 1 usedas a stopping criterion for the training on a speciﬁc
antigen
.
•
Clonal rate
: an integer value used to determine the num-ber of mutated clones that a given
ARB
is allowed toattempt to produce.
•
Number of resources
: a parameter that limits the numberof
ARB
s allowed in the system. Each ARB is allocatedto a number of resources based on its
stimulation value
and the
clonal rate
.
•
k nearest neighbor
(
KNN
): a classiﬁcation scheme inwhich the response of the classiﬁer to a previouslyunseen item is determined by a majority vote amongthe
k
closest data points.
•
k value
: the parameter that indicates how many
memorycells
should be used to determine the classiﬁcation of agiven test item.
3.1.4. AIRS classiﬁcation algorithm
AIRS is a resource limited supervised learning algorithminspired from immune metaphors. In this algorithm, theused immune mechanisms are resource competition, clonalselection, aﬃnity maturation and memory cell formation.The feature vectors presented for training and test arenamed as antigens while the system units are called as Bcells. Similar B cells are represented with artiﬁcial recogni-tion balls (ARBs) and these ARBs compete with each otherfor a ﬁxed resource number. This provides ARBs, whichhave higher aﬃnities to the training antigen to improve.The memory cells formed after the whole training antigenswere presented are used to classify test antigens. The algo-rithm is composed of four main stages, which are initializa-tion, memory cell identiﬁcation and ARB generation,competition for resources and development of a candidatememory cell, and memory cell introduction. Table 3 sum-marizes the mapping between the immune system andAIRS.We give the details of our algorithm below.1.
Initialization
: Create a set of cells called the memorypool (
M
) and the ARB pool (
P
) from randomly selectedtraining data.2.
Antigenic presentation
: For each antigenic pattern do:(a)
Clonal expansion
: For each element of
M
, determineits aﬃnity to the antigenic pattern, which resides inthe same class. Select the highest aﬃnity memory cell(mc) and clone mc in proportion to its antigenicaﬃnity to add to the set of ARBs (
P
).(b)
Aﬃnity maturation
: Mutate each ARB descendant of the highest aﬃnity mc. Place each mutated ARBinto
P
.(c)
Metadynamics of ARBs
: Process each ARB using theresource allocation mechanism. This process willresult in some ARB death, and ultimately con-trols the population. Calculate the average stimula-tion for each ARB, and check for terminationcondition.(d)
Clonal expansion and aﬃnity maturation
: Clone andmutate the randomly selected subset of the ARBsleft in
P
based on their stimulation level.(e)
Cycle
: While the average stimulation value of eachARB class group is less than a given stimulationthreshold go to step 2(c).(f)
Metadynamics of memory cells
: Select the highestaﬃnity ARB of the same class as the antigenfrom the last antigenic interaction. If the aﬃnity of this ARB with the antigenic pattern is better thanthat of the previously identiﬁed best memory cellmc then add the candidate (mc-candidate) to mem-ory set
M
. If the aﬃnity of mc and mc-candidateare below the aﬃnity threshold, remove mc from
M
.3.
Classify
: Classify data items using the memory set
M
.Classiﬁcation is performed in a
k
-nearest neighbor fash-ion with a vote being made among the
k
closest memorycells to the given data item being classiﬁed.We can characterize AIRS as follows:
•
Memory
: The memory of the AIRS algorithm is in thepool of memory cells developed through exposure tothe training data (experiences);
•
Adaptation
: The adaptation occurs primarily in the ARBpool. With each new experience, AIRS evolves a candi-date memory cell in reaction to this experience. If thismemory cell is of suﬃcient quality, then the memorystructure is adapted to include in it.
•
Decision-making
: The initial decision is which memorycell is the most similar to the incoming training antigen.This cell is used as a progenitor for a pool of evolvingcells. During classiﬁcation, the primary classiﬁcationdecision is made based on the
k
most similar memorycells to the data item being classiﬁed.
Table 3Mapping between the immune system and AIRSImmunesystemAIRSAntibody Feature vectorRecognitionballCombination of feature vector and vector classShape-space Type and possible values of the data vectorClonalexpansionReproduction of ARBs that are well matched antigensAntigens Training dataAﬃnitymaturationRandom mutation of ARB and removal of the leaststimulated ARBsImmunememoryMemory set of mutated ARBsMetadynamics Continual removal and creation of ARBs and memorycells218
K. Polat, S. Gu¨ nes
/ Expert Systems with Applications 34 (2008) 214–221

We Need Your Support

Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks