Issues in Mining Imbalanced Data Sets - A Review Paper

Sofia Visa
ECECS Department, ML 0030
University of Cincinnati
Cincinnati, OH 45221, USA
svisa@ececs.uc.edu

Anca Ralescu
ECECS Department, ML 0030
University of Cincinnati
Cincinnati, OH 45221, USA
Anca.Ralescu@uc.edu

Abstract

This paper traces some of the recent progress in the field of learning from imbalanced data. It reviews approaches adopted for this problem, identifies challenges, and points out future directions in this relatively new field.

Introduction

Learning with imbalanced class distributions addresses the case when, for a two-class classification problem, the training data for one class (the majority) greatly outnumber the training data for the other class (the minority). The machine learning community has recently acknowledged that current learning methods (e.g. C4.5, neural networks) perform poorly in applications dealing with imbalanced data sets (IDS). At the same time, it has been observed that in many real-world domains the available data sets are imbalanced. In the literature, the IDS problem is also known as dealing with rare classes, or with skewed data.

The poor performance of the classifiers produced by standard machine learning algorithms on IDS is mainly due to the following factors:

- Accuracy: The standard algorithms are driven by accuracy (minimization of the overall error), to which the minority class contributes very little.

- Class distribution: Current classifiers assume that they will operate on data drawn from the same distribution as the training data.

- Error costs: Current classifiers assume that errors coming from different classes have the same costs.

However, upon closer attention to the data, it is observed that real data sets do not respect these assumptions, as the following examples show:

Example 1 (Accuracy-driven classifiers fail for some data sets). For a data set in which the majority class vastly outnumbers the minority class, assigning all data to the majority class achieves near-perfect accuracy. Now, assuming that the minority class represents a rare disease and is the class of interest for the domain expert (here, the health care provider), the classifier is useless for this real-world problem.

Example 2 (Training and testing distributions are rarely the same). Class distribution is an important issue for learning in general. The training data might be imbalanced while the testing data are not, and the other way around. Moreover, experimental studies show that a balanced class distribution is not necessarily the best for learning (Weiss & Provost 2003), (Visa & Ralescu 2005), and the open question for further research is: what is the best class distribution for learning a given task?

Example 3 (In applications, error costs are different). In applications the error costs differ: consider cancer versus non-cancer, fraud versus valid action, or system OK versus system failure. If the error costs and class distribution are known, the correct decision threshold can be computed easily. The difficulty is that error costs are hard to assess even by human experts in the field, and therefore these costs are rarely known. Further, it is important to mention that when the errors coming from different classes have different but unknown costs, classifiers have problems even on balanced data.

In order to give a comprehensive view of the state of the art in this field, we review the most commonly used methods for dealing with IDS, present a chronological review of the events dedicated to IDS and the lessons learned so far, and point out future directions in the field.
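Example 1 can be made concrete with a short sketch (the class counts below are illustrative choices of ours, not figures from any particular study):

```python
def evaluate_majority_baseline(n_majority, n_minority):
    """Accuracy and minority-class recall of a classifier that always
    predicts the majority class."""
    total = n_majority + n_minority
    accuracy = n_majority / total  # every majority example counts as correct
    minority_recall = 0.0          # the minority class is never predicted
    return accuracy, minority_recall

# e.g. 990 majority vs 10 minority examples (illustrative numbers)
acc, recall = evaluate_majority_baseline(990, 10)
print(f"accuracy={acc:.2%}, minority recall={recall:.0%}")
# prints: accuracy=99.00%, minority recall=0%
```

The accuracy looks excellent, yet the class of interest is never detected, which is exactly why accuracy-driven learning fails here.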
Methods in Dealing with IDS

Standard classifiers such as neural networks, support vector machines and C4.5 have been investigated in many research papers for the imbalanced data problem (Fawcett & Provost 1997), (Chan & Stolfo 1998), (Kubat, Holte, & Matwin 1998), (Japkowicz 2000), (Nickerson & Milios 2001), (Japkowicz & Stephen 2002), (Weiss 2003), (Maloof 2003), (Drummond & Holte 2003), (Estabrooks & Japkowicz 2004), (Weiss 2004). It is commonly agreed that they are heavily biased toward recognizing mostly the majority class, since they are built to maximize overall accuracy, to which the minority class contributes very little. Solutions to the class imbalance problem have been proposed at both the data and the algorithmic levels: different forms of re-sampling address the first, while adjusting costs (Pazzani et al. 1994), adjusting decision thresholds, and recognition-based classification ((Kubat, Holte, & Matwin 1998), (Japkowicz, Myers, & Gluch 1995)) address the second.

In the up-sampling approach (Ling & Li 1998), data from the minority class are duplicated until the imbalance is eliminated; likewise, in the down-sampling approach (Kubat & Matwin 1997), data from the majority class are eliminated to rebalance the classes. Combinations of the two methods in an ensemble classifier, and an effective ratio of up/down-sampling, were also investigated in (Estabrooks & Japkowicz 2004) and applied to text categorization.

A guided resampling study is presented in (Nickerson & Milios 2001): the sampling method takes into account within-class imbalance, assuming that elements of the imbalanced class are grouped in subclusters. Up-sampling is then guided toward enlarging the under-represented clusters (also known as small disjuncts). Since the up-sampling step assumes that the number of subclusters is known in advance, unsupervised learning methods (e.g. the k-means algorithm or self-organizing maps) have to be used prior to up-sampling in order to obtain such information about the imbalanced class.

In (Zhang & Mani 2003) five different down-sampling algorithms are presented, each a variation of the k-nearest neighbor technique. (Chawla et al. 2002) propose a new up-sampling method (SMOTE): new artificial examples are created for the minority class by interpolating between minority-class examples. The method is investigated for C4.5 and gives better results than random up-sampling. By interpolating the minority-class examples with new data, the within-class imbalance is reduced (fewer small disjuncts) and C4.5 achieves a better generalization of the minority class, as opposed to the specialization effect obtained by randomly replicating minority-class examples.

(Chan & Stolfo 1998) investigate the best learning class distribution for fraud detection in a large and imbalanced credit card data set. Multiple data sets with the desired class distribution are generated by joining the minority class with subsets of the majority class. The obtained classifiers are aggregated in an ensemble (composite) classifier. A drawback of this approach is that the best learning class distribution must be determined before the subsets are generated. Starting from the original imbalanced training set, (Chan & Stolfo 1998) generate several balanced training sets, each including the minority class and a part of the majority class.

Another common approach to the imbalance problem is to assign different error penalties: an error on a data point belonging to the small class carries a larger penalty than one on the large class. However, it can be argued that the effect of assessing penalties is equivalent to changing the relative data distribution of the two classes, in other words, to re-balancing the data.
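The interpolation idea behind SMOTE can be sketched in a few lines (a simplified illustration of the core idea, not the reference implementation; the function name and defaults are our own):

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority points, each interpolated between
    a randomly chosen minority example and one of its k nearest
    minority-class neighbours (the core idea of SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest other minority points, by Euclidean distance
        neighbours = sorted(
            (p for p in minority if p != x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_like(minority, n_new=4)
# every synthetic point lies on a segment joining two real minority examples
```

Because each synthetic point falls between existing minority examples rather than on top of one, the enlarged class fills out its region of the input space instead of merely repeating itself.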
The experiments carried out in (Maloof 2003) suggest that sampling, as well as adjusting the cost matrix, has the same effect as moving the decision threshold.

(Visa & Ralescu 2003) propose a fuzzy-based classifier for IDS which is meant to be less sensitive to class imbalance by computing frequencies relative to the class size. In the same paper, the sensitivity of the fuzzy classifier to imbalance in the presence of overlap is investigated. The results show that overlap in fact affects the classifier more than imbalance does. Further, in (Visa & Ralescu 2004b) other factors, such as complexity and size in combination with imbalance, are studied. The results on artificial data sets suggest that the fuzzy classifier is affected by imbalance only for a combination of high complexity and small overall data size (in this case the major drawback is the lack of information rather than the majority:minority ratio; the same issue is discussed later in the current paper for C4.5), unlike neural networks, which are consistently biased toward the majority class at any given size (Japkowicz & Stephen 2002). Also, the experiments from (Japkowicz & Stephen 2002) show that C4.5 is affected by high complexity and imbalance at any given size.

Figure 1: Linearly separable data.
Figure 2: Overlapping data.
Figure 3: High complexity data.

Performance Evaluation on IDS

Selecting the best classifier for a given classification problem is the ultimate goal. Accuracy helps in this decision only if the error costs are the same, but this is rarely true in practice, even for balanced data. In the same vein, ROC curves help in deciding the best classifier only if one classifier's ROC curve completely dominates the others, but this scenario is also rare, because many times multiple curves dominate in different parts of the ROC space. For this case, (Provost 2000) recommends the ROC convex hull as a discriminant.
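The ROC convex hull can be computed with a standard upper-hull sweep over the classifiers' operating points; a minimal sketch (the helper name and input convention, (FPR, TPR) pairs, are our own):

```python
def roc_convex_hull(points):
    """Upper convex hull of ROC operating points (FPR, TPR), including the
    trivial classifiers (0, 0) and (1, 1). Points kept on the hull are the
    only ones optimal for some combination of error costs and class priors."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # cross product: drop the last kept point if it lies on or
            # below the chord from hull[-2] to p
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

# one point dominated outright, one below the chord of its neighbours
hull = roc_convex_hull([(0.2, 0.7), (0.4, 0.5), (0.5, 0.8)])
# -> [(0.0, 0.0), (0.2, 0.7), (1.0, 1.0)]
```

Classifiers whose operating points fall strictly below the hull can be discarded: for every cost/prior setting, some point on the hull does at least as well.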
However, as the recent debate within the first workshop dedicated to ROC analysis (held in conjunction with the European Conference on Artificial Intelligence 2004) showed, ROC curves are difficult and computationally expensive to extend to more than two classes.

Issues on IDS

In this section we analyze the imbalance factor from various directions, focusing on questions such as:

- When does the imbalance matter?

- Is data re-balancing necessary? That is, is a balanced (50:50) distribution the desired (best) one?

- Are there learning algorithms insensitive to the imbalance?

The first question is worth investigating since some research papers claim that other factors, in combination with the imbalance, lead to poor performance (Jo & Japkowicz 2004), (Visa & Ralescu 2003), (Visa & Ralescu 2004b). For the second question, initial results are reported in (Weiss & Provost 2003) and (Visa & Ralescu 2005), whereas for the third question there is little work reported (Visa & Ralescu 2004b).

The Nature of IDS

From the application point of view, the nature of the imbalance falls into two cases:

- The data are naturally imbalanced (e.g. credit card fraud and rare diseases), or

- The data are not naturally imbalanced, but it is too expensive to obtain data for learning the minority class (e.g. shuttle failure).

An open question remains whether these two types of imbalance should/can be differentiated, and whether they should be addressed differently: for instance, deciding against (in the first case) or in favor of (in the second case) rebalancing.

Class Distribution in IDS

Another important aspect in learning from IDS is the class distribution, that is, the ratio of minority to majority examples in the training set. (Weiss & Provost 2003) investigate, on several data sets from the UCI Repository, the effect of various class distributions for learning, when testing with the natural class distribution.
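One simple way to realise such varied class distributions is to hold the minority class fixed and sub-sample the majority class to hit a target minority fraction (a sketch of our own, not the exact protocol of the studies cited above):

```python
import random

def make_training_set(majority, minority, minority_fraction, seed=0):
    """Build a training set with a target minority fraction by keeping all
    minority examples and randomly sub-sampling the majority class."""
    rng = random.Random(seed)
    n_majority = round(len(minority) * (1 - minority_fraction) / minority_fraction)
    sample = rng.sample(majority, min(n_majority, len(majority))) + list(minority)
    rng.shuffle(sample)
    return sample

majority = [("maj", i) for i in range(900)]
minority = [("min", i) for i in range(100)]
balanced = make_training_set(majority, minority, minority_fraction=0.5)
# 100 minority + 100 sampled majority examples: a 50:50 training set
```

Sweeping `minority_fraction` over a range of values and training one classifier per resulting set reproduces the shape of such experiments.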
The experimental results show that, for C4.5, neither a balanced (50:50) learning distribution nor the natural distribution is best for the learning task. Similar results are reported in (Visa & Ralescu 2004b) for a fuzzy classifier, where the fuzzy classifier proves less sensitive to the learning class distribution (its output varied less than C4.5's).

Imbalance Ratio versus Lack of Information

We believe that a major issue in the IDS problem is to distinguish between two components of IDS:

(IR) the imbalance, as the minority:majority ratio;

(LI) the lack of information for the minority class.

Both components are present in an IDS learning problem, but each, in combination with other factors (such as overlap, complexity of the function to be learned, size of the data sets, and small disjuncts), affects a particular learning algorithm differently. Of course, all algorithms suffer from lack of representation (what is not present in training cannot be learned), but it is crucial to determine which ones do not suffer from the imbalance ratio (Visa & Ralescu 2004b).

Example 4. Consider two data sets with the same minority:majority ratio, one small and one much larger. Although the imbalance factor IR is the same, in the first case the minority class is poorly represented and suffers more from the LI factor than in the second case.

Other Factors - Size, Complexity, Overlap and Small Disjuncts

Example 4 shows that the overall size of the data set matters to the extent to which it adds more information, and this is crucial when the data to be learned have high complexity (as in Figure 3). If the curve in Figure 3 is the real boundary between classes, some additional data for the minority class might come from the area where currently there are no data for learning (the valley in the middle), and the recognition of the minority class will improve, even though the imbalance ratio IR is the same.

Clearly, for simple data sets (e.g. linearly separable ones) any classifier produces a good discrimination, regardless of the amount of imbalance IR (Figure 1). So, for low complexity, the imbalance seems not to matter.

For overlapping classes (see Figure 2), that is, when some data points belong to both classes, it is harder to analyze the IR and LI components. However, it is obvious that for accuracy-driven algorithms the tendency is to reduce the overall error by assigning the overlapped area to the majority class (since the overlapping area contains more majority-class data than minority-class data). For example, neural networks are biased toward the majority class: the overlapping examples of the minority class are treated as noise and discarded.

To address this issue, we point out that the fuzzy classifier proposed in (Visa & Ralescu 2003) learns each class independently and represents the classes as fuzzy sets, such that each example has a membership degree to its class computed relative to the class size, thus giving a chance to the minority class too, even when the classes overlap. The importance of this feature was also pointed out in (Chawla, Japkowicz, & Kolcz 2004).

It is also expected that SVM classifiers are less affected by the imbalance, since only support vectors are considered in classification. Basically, the idea is that the support vectors are selected only from points around the boundaries, so the imbalance should matter less (or not at all). As in the case of the fuzzy classifier, the imbalance factor affects SVM classifiers only through the lack of information (LI) and not by biasing them toward the majority class (IR).

(Weiss 2003) and (Jo & Japkowicz 2004) identify small disjuncts (sub-clusters of the minority class, very poorly represented in the training phase) as another factor in dealing with IDS.
They claim that up-sampling must be guided so that more of the artificially up-sampled data come from the small disjuncts, in order to overcome the lack of representation of the minority class.

Issues on the Current Approaches

Next we point out problems of standard learning methods when applied to IDS.

Rebalancing - Loss or Gain in Information

The sampling methods have known drawbacks: under-sampling involves a loss of information (by discarding potentially useful data), and over-sampling increases the training size without any gain in information (Provost 2000) or, perhaps even worse, leads to over-fitting by replicating the same examples. Considering this, the best research strategy is to concentrate on how machine learning algorithms can deal most effectively with whatever data they are given. Thus, research must focus on developing new classifiers and learning methods.

NN and C4.5

Neural networks have a large range of applications in inductive learning, but in the case of imbalanced data sets they are prone to treat the minority class as noise and therefore to disregard it. In this context, they are most often used in conjunction with up/down-sampling of the training data, as shown in several research papers.
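That random over-sampling adds copies but no information is easy to see in a sketch (the helper name is our own):

```python
import random

def random_oversample(minority, target_size, seed=0):
    """Random up-sampling: duplicate randomly chosen minority examples
    until the class reaches target_size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return list(minority) + extra

minority = [(0.1, 0.9), (0.4, 0.2), (0.7, 0.5)]
enlarged = random_oversample(minority, target_size=9)
# the class is three times larger, yet contains no point it did not
# already have: exact copies only, hence the over-fitting risk
```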
It has been observed that random down-sampling outperforms random up-sampling (Japkowicz, Myers, & Gluch 1995); up-sampling of the minority class leads to slower convergence of the network and does not bring new information. (Japkowicz, Myers, & Gluch 1995) show experimentally (on three data sets) that a neural network recognition-based system performs better than, or as well as, a discriminating NN approach.

Decision tree algorithms (C4.5), otherwise a very popular choice for learning classification rules, are also limited in the presence of imbalance: a common problem is that they might lose big parts of the minority class in the pruning phase, and therefore, for such problems, C4.5 without pruning performs better than C4.5 with pruning (Weiss 2003). Further, random up-sampling of the minority class leads to trees of larger size and to over-fitting of the minority class (e.g. in the extreme case, each example of the minority class corresponds to a leaf in the tree, in which case no rule is "learned" for the minority class).

Historical Thread of the IDS Problem

The issue of learning from imbalanced classes has recently received increased attention, as can be seen from the growing number of scientific events dedicated to this topic. Conferences gather researchers to discuss new ideas and the results of their work. Usually held in conjunction with conferences, workshops can be seen as offspring of conferences that contribute to the field by guiding research into even newer areas of interest. Several workshops have been dedicated specifically to the IDS problem, as follows.

AAAI'2000 - Workshop on Learning from Imbalanced Data Sets I

The first workshop dedicated to the IDS problem was held in conjunction with the American Association for Artificial Intelligence Conference 2000. The main observation at the time was that many domains deal with imbalanced data sets, such as:

- medical diagnosis (e.g. rare diseases and rare gene mutations);

- network monitoring and behavior profiling - intrusion detection;

- fraud detection (e.g. credit card, phone calls, insurance) (Fawcett & Provost 1997);

- risk management;

- helicopter gear-box fault monitoring;

- shuttle system failure;

- earthquakes and nuclear explosions;

- text classification;

- oil spill detection.

The issues debated at this workshop can be summarized as follows (Japkowicz & Holte 2001):

1. How to evaluate learning algorithms;

2. Evaluation measures: it was commonly agreed that accuracy yields misleading conclusions; instead, ROC and cost curves were proposed;

3. One-class learning versus discriminating methods;

4. Discussions of various data resampling schemes;

5. Discussion of the relation between the class imbalance problem and cost-sensitive learning;

6. The goal of creating classifiers that perform well across a range of costs.

ICML'2003 - Workshop on Learning from Imbalanced Data Sets II

The main issues discussed at the 2000 workshop guided the research on IDS for the second workshop, held as part of the International Conference on Machine Learning 2003. For example, ROC and cost curves, rather than accuracy, were used as evaluation methods. The workshop was followed by an interesting and vivid panel discussion.

(Japkowicz 2003) questioned whether the within-class imbalance is responsible for the poor learning on IDS. The idea is that the within-class imbalance (the small disjuncts) leads to a severe lack of representation of some important aspects of the minority class. (Zhang & Mani 2003) investigate various selection methods for down-sampling the majority class based on k-nearest-neighbor algorithms, and (Nickerson & Milios 2001) investigate the relative advantage, if any, of down/up-sampling techniques. Besides NN and C4.5, other classification methods, such as SVM ((Wu & Chang 2003) and (Raskutti & Kowalczyk 2003)) and, for the first time, a fuzzy classifier (Visa & Ralescu 2003), were investigated for IDS.
(Wu & Chang 2003) point out potential problems of SVM applied to IDS, such as a boundary skewed toward the minority class due to its lack of representation. Despite the fact that one of the conclusions of the first workshop was the need for new classifiers for IDS, the papers presented at the second workshop aimed mainly at tuning existing classifiers, e.g. C4.5, to perform better on IDS. (Visa & Ralescu 2003) proposed a fuzzy classifier based on relative class size that proved to be less sensitive to class imbalance. In the same paper, the effect of imbalance combined with various degrees of overlap was studied, and it was concluded that overlap affects the classification task more than imbalance does.

We now identify two major (and, we think, wrong) directions present in the research papers of the workshop:

1. Many papers reported various tuning methods applied to decision trees in order to perform better on IDS, even though presentations at the previous workshop showed their shortcomings, and it was commonly agreed that new methods/classifiers are needed for IDS;

2. Sampling, under various aspects, was present in half of the papers and was the most debated issue, even though (Maloof 2003) shows that sampling has the same result as moving the decision threshold or adjusting the cost matrix (a result known since 1984 (Breiman & Stone 1984)).

ACM SIGKDD Explorations 2004 - Special Issue on Learning from Imbalanced Data Sets

The sixth issue of SIGKDD Explorations was dedicated entirely to the imbalance problem. (Weiss 2004) presents a very good review of the current research on IDS. The other papers in the volume address mainly the issues of sampling, feature selection, and one-class learning.

(Guo & Herna 2004) investigate a boosting method combined with various up-sampling techniques for the hard-to-classify examples. The method improves the prediction accuracy for both classes and does not sacrifice one class for the other.
Results on 17 data sets are presented. (Batista & Monard 2004) suggest that it is in fact not the imbalance but other factors, such as the overlap between classes, that hinder the learning system; this was first pointed out in (Visa & Ralescu 2003). (Jo & Japkowicz 2004) suggest that the small disjuncts are responsible for the imbalance problem and investigate this idea on artificial and real domains.

Imbalanced Data in the Midwest Artificial Intelligence and Cognitive Science Conference (MAICS)

Learning in its various forms is the major theme of the MAICS conference, and IDS is identified as a learning drawback in some MAICS papers too. For example, (Lei & Cheh 2003) investigate the performance of a rule-based classifier on bankruptcy data. They observe that decision tree learners learn the non-bankruptcy data well but consistently learn the bankruptcy data poorly. We suspect that, since their data sets are highly imbalanced (in one of their data sets the minority class accounts for only a small fraction of all data), the decision trees over-fit the non-bankruptcy class, which is the majority class here.

(Kamei & Ralescu 2003) present a piecewise linearly separable SVM classification. As discussed earlier, SVMs are better suited to IDS since only support vectors are considered in classification (only the LI component affects SVMs). The SVM described in this paper can be applied to IDS: each iteration considers a subset of all the data (containing data from both classes) and selects the optimum hyperplane using SVM. The subsets can be formed such that there is no imbalance in training. Network intrusion is an example of a highly imbalanced domain, and (Miller & Inoue 2003) introduce a framework for intrusion detection, SPIDER.

(Visa & Ralescu 2004a) present initial results for up-sampling methods based on various approaches to aggregating class information, such as spread, imbalance factor, and the distance between the classes.
Artificially generated data are used for the experiments. The performance of each up-sampling method is evaluated with respect to how well the resulting data set reflects the underlying original class distribution, in terms of mean, standard deviation, and (empirical) distribution function.

Some Lessons Learned

Some lessons can be drawn from reviewing the emergence of IDS as a separate and important issue in machine learning:

- Many papers report improvements of one method over another on some real data sets, e.g. from the UCI Repository. On the other hand, the current learning algorithms might "overfit" the UCI Repository (Lavrac & Fawcett 2004). Therefore, papers that show limitations of current learning methods are of interest as well.

- People involved in applications of machine learning have known for some time that IDS create problems. However, the machine learning community started to pay more