[Singh, 3(2): February, 2014]   ISSN: 2277-9655   Impact Factor: 1.852
http://www.ijesrt.com   (C) International Journal of Engineering Sciences & Research Technology

IJESRT
INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY

One Time Mining by Multi-Core Preprocessing on Generalized Dataset
Aviral Kumar Singh*1, S. R. Tondon2, Tarun Dhar Diwan3
aviralkumarsingh@live.com

Abstract
One of the important problems in data mining is discovering association rules from databases of transactions, where each transaction consists of a set of items. Many industries are interested in mining association rules from their databases because of the continuous retrieval and storage of huge amounts of data. The discovery of interesting association relationships among business transaction records supports many business decision-making processes such as catalog design, cross-marketing, and loss-leader analysis. The datasets typically available as input to association rule discovery are enormous and of high dimensionality, and the most time-consuming operation in the discovery process is computing the frequency of interesting subsets of items (called candidates) in the database of transactions. Hence, it has become vital to develop a method that speeds up this computation. In this paper, we propose an integrated approach of parallel computing and ARM for mining association rules in a generalized dataset that is fundamentally different from previous algorithms: preprocessing is performed on multiple cores, and by avoiding recurring scans of the dataset the number of passes required is reduced. The response time is measured on a space-delimited text dataset.

Keywords: Data Mining, Association Rule Mining (ARM), Association Rules, Apriori Algorithm, Frequent Pattern.

Introduction
The rapid development of computer technology, especially the increased capacity and decreased cost of storage media, has led businesses to store huge amounts of external and internal information in large databases at low cost. Mining useful information and helpful knowledge from these large databases has thus evolved into an important research area [3, 2, 1]. Association rule mining (ARM) [18] has become one of the core data mining tasks and has attracted tremendous interest among data mining researchers. ARM is an undirected, unsupervised data mining technique which works on variable-length data and produces clear and understandable results.

Association Rule Mining (ARM) algorithms [17] fall into two categories: algorithms with candidate generation and algorithms without candidate generation. The first category contains algorithms whose candidate generation is similar to that of the Apriori algorithm; Eclat may also be placed in this category [8]. In the second category, FP-Growth is the best-known algorithm. The main drawback of the earlier algorithms is the repeated scans over a large database, which degrade CPU and memory performance and increase I/O overheads. The performance and efficiency of ARM algorithms mainly depend on three factors: the candidate sets generated, the data structure used, and the details of the implementation [8]. In this paper we propose an algorithm that addresses all three factors.

If there are 10^4 frequent 1-itemsets, the Apriori algorithm may produce on the order of 10^7 candidate 2-itemsets, count them, and judge their frequency [11]. Besides, it may produce as many as 2^100 (about 10^30) candidate itemsets in order to find a frequent itemset that includes 100 items. What is more, it may scan the database many times to check a large set of candidates through pattern matching.
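The candidate blow-up quoted above comes from forming every unordered pair of frequent 1-itemsets, and from the fact that a 100-item frequent itemset has 2^100 - 1 non-empty subsets. The short sketch below is ours (not part of the paper) and simply reproduces that arithmetic.

from math import comb

# Number of frequent 1-itemsets quoted in the text above.
n_frequent_1 = 10**4

# Apriori's candidate 2-itemsets are all unordered pairs of frequent 1-itemsets.
print(comb(n_frequent_1, 2))   # 49995000, i.e. on the order of 10^7

# A frequent itemset with 100 items has 2^100 - 1 non-empty subsets,
# every one of which must be generated and counted along the way.
print(2**100 - 1)              # roughly 1.27 * 10^30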
The transactional database is considered as a two-dimensional array over a generalized-value dataset. The main difference between the proposed algorithm and other algorithms is that, instead of using the transactional array in its natural form, our algorithm uses the transpose of the array, i.e. the rows and columns of the array are interchanged, and the transposition is carried out with a parallel matrix transpose algorithm (mesh transpose) [20]. The parallel architecture that lends itself most naturally to matrix operations is the mesh: an n x n mesh of processors can be regarded as a matrix and is therefore perfectly suited to accommodate an n x n data matrix, one element per processor. This is precisely the approach we use to compute the transpose of an n x n matrix initially stored in an n x n mesh of processors. We find that the time taken for the matrix transpose decreases as the number of processors increases, and that the speedup is high for both small and very large matrices as processors are added. The idea of our algorithm is quite simple. Since the diagonal elements are not affected by the transposition, that is, element a_ii of A equals element a_ii of A^T, the data in the diagonal processors stay stationary. The advantage of using the transposed array lies in calculating the support count of a particular item: there is no need to scan the array repeatedly, because the row sum of the array directly gives the required support count for that item, which ultimately increases the efficiency of the algorithm.
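As an illustration of the row-sum idea, assume the transactions have already been binarized into a 0/1 matrix with one row per transaction and one column per item; after transposing, each row corresponds to an item and its sum is that item's support count. The sketch below is our own (the matrix values are invented), not code from the paper.

import numpy as np

# Hypothetical 0/1 transaction matrix: rows = transactions T1..T5, columns = items A1..A5.
D = np.array([
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
])

Dt = D.T                           # transposed view: rows = items, columns = transactions
support_counts = Dt.sum(axis=1)    # one pass of row sums gives every item's support count

for item, count in zip(["A1", "A2", "A3", "A4", "A5"], support_counts):
    print(item, count)             # e.g. A2 appears in 4 of the 5 transactions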
The remainder of this paper is organized as follows. Section 2 provides a brief review of related work. In Section 3, we explain frequent itemsets and association rule mining through the Apriori algorithm. In Section 4, we introduce our approach to frequent itemset generation using parallel preprocessing. An illustration of the algorithm and the experimental analysis are presented in Sections 5 and 6, respectively. Finally, we conclude our work.

Related Work
One of the most well-known and popular data mining techniques is association rule, or frequent itemset, mining. The algorithm was originally proposed by Agrawal et al. [4][5] for market basket analysis. Because of its wide applicability, many revised algorithms have been introduced since then, and association rule mining is still a widely researched area. Agrawal et al. presented the AIS algorithm in [4], which generates candidate itemsets on the fly during each pass of the database scan: large itemsets from the previous pass are checked for presence in the current transaction, and new itemsets are formed by extending existing ones. This algorithm turns out to be ineffective because it generates too many candidate itemsets; it requires more space, needs too many passes over the whole database, and only generates rules with a single consequent item. Agrawal et al. [5] developed various versions of the Apriori algorithm, such as Apriori, AprioriTid, and AprioriHybrid. Apriori and AprioriTid generate itemsets using the large itemsets found in the previous pass, without considering the transactions. AprioriTid improves on Apriori by using the database only in the first pass; counting in subsequent passes is done with encodings created in the first pass, which are much smaller than the database. This leads to a dramatic performance improvement, about three times faster than AIS.

Scalability is another important concern because of the huge size of the data: algorithms must be able to "scale up" to handle large amounts of data. Eui-Hong et al. [16] tried to make data distribution and candidate distribution scalable through the Intelligent Data Distribution (IDD) algorithm and the Hybrid Distribution (HD) algorithm, respectively. IDD addresses the issues of communication overhead and redundant computation by using aggregate memory to partition candidates and move data efficiently. HD improves over IDD by dynamically partitioning the candidate set to maintain good load balance. Several other works reported in the literature modify the Apriori logic to improve the efficiency of rule generation; even though these methods focus on reducing time and space, they still need improvement for real-time use.

Frequent Item Set and Association Rule
The aim of association rule mining is to explore relations and important rules in large datasets. A dataset is considered as a sequence of entries consisting of attribute values, also known as items, and a set of such items is called an itemset. Frequent itemsets are, for example, sets of pages which are visited frequently together in a single server session. Let I = {I1, I2, ..., Im} be a set of items. Let D, the task-relevant data, be a set of database transactions, where each transaction T is a set of items such that T ⊆ I. Each transaction is associated with an identifier, called TID. Let A be a set of items. A transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., the union of sets A and B, or, equivalently, both A and B); this is taken to be the probability P(A ∪ B). The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B; this is taken to be the conditional probability P(B|A). That is,

support(A ⇒ B) = P(A ∪ B) ..................(2.1)
confidence(A ⇒ B) = P(B|A) .................(2.2)

A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset; the set {bread, butter} is a 2-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset; it is also known as the frequency or support count. If the relative support of an itemset I satisfies a pre-specified minimum support threshold, then I is a frequent itemset. The set of frequent k-itemsets is commonly denoted by Lk. From Equation (2.2), we have

confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A) = support_count(A ∪ B) / support_count(A)

Let τ = {I1, I2, ..., Im} be a set of binary attributes, called items. Let T be a database of transactions, where each transaction t is represented as a binary vector, with t[k] = 1 if t bought the item Ik and t[k] = 0 otherwise. There is one tuple in the database for each transaction. Let X be a set of some items in τ. We say that t satisfies X if t[k] = 1 for all items Ik in X. By an association rule, we mean an implication of the form X ⇒ Ij, where X is a set of some items in τ and Ij is a single item in τ that is not present in X. The rule X ⇒ Ij is satisfied in the set of transactions T with confidence factor 0 ≤ c ≤ 1 if at least c% of the transactions in T that satisfy X also satisfy Ij. We use the notation X ⇒ Ij | c to specify that the rule X ⇒ Ij has a confidence factor of c [3].
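To make Equations (2.1) and (2.2) concrete, the following small sketch is ours (the five transactions are invented for illustration); it counts support and confidence for the rule {bread} ⇒ {butter} directly from the definitions above.

# Toy transaction set, invented for illustration only.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

A, B = {"bread"}, {"butter"}

n = len(transactions)
count_A     = sum(1 for t in transactions if A <= t)        # transactions containing A
count_A_u_B = sum(1 for t in transactions if (A | B) <= t)  # transactions containing A union B

support    = count_A_u_B / n          # Eq. (2.1): P(A u B)  -> 3/5 = 0.60
confidence = count_A_u_B / count_A    # Eq. (2.2): P(B|A)    -> 3/4 = 0.75
print(support, confidence)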
A. Apriori Algorithm
The Apriori algorithm is one of the most popular algorithms for mining frequent patterns and association rules [4]. It introduces a method to generate the candidate itemsets Ck in pass k of a transaction database using only the frequent itemsets Lk-1 of the previous pass. The idea rests on the fact that any subset of a frequent itemset must be frequent as well. Hence, Ck can be generated by joining two itemsets in Lk-1 and pruning those that contain any subset that is not frequent, as shown in Fig. 1 (the Apriori algorithm).

Our Approach
Association rule and frequent itemset mining has become a widely researched area, and hence faster and faster algorithms have been presented. Association rule mining algorithms such as Apriori and FP-Growth require repeated scans over the entire database, and the input/output overheads generated by repeatedly scanning the entire database degrade CPU, memory, and I/O performance. Much work has been carried out on improving the efficiency of the Apriori algorithm by reducing the I/O time and minimizing the set of candidate itemsets. However, all these works still suffer from the problem of scanning the database at least once. The efficiency of these algorithms can be improved further by reducing the time required for counting the supports of candidate itemsets, and we aim to obtain an efficient algorithm which reduces exactly this time. The process of finding large itemsets is divided into the following parts:
•   Parallel Data Preprocessing
•   Generating candidate sets

A. Parallel Data Preprocessing
The idea of our algorithm is quite simple. Since the diagonal elements are not affected by the transposition, that is, element a_ii of A equals element a_ii of A^T, the data in the diagonal processors stay stationary. An n x n mesh of processors can be regarded as a matrix and is therefore perfectly suited to accommodate an n x n data matrix, one element per processor. A(i,j) is used to store a_ij initially and a_ji when the algorithm terminates. B(i,j) is used to store data received from P(i,j+1) or P(i-1,j), that is, from the right or top neighbour. C(i,j) is used to store data received from P(i,j-1) or P(i+1,j), that is, from the left or bottom neighbour.

[Figure: the matrix to be transposed, stored one element per processor in an n x n mesh, together with an example execution.]
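The paper assumes a dedicated n x n processor mesh for the transpose. On a commodity multi-core machine the same preprocessing step can be approximated with a shared-memory worker pool; the sketch below is our own stand-in for that step, not the authors' mesh algorithm. It splits the transaction matrix into row blocks, transposes each block in a separate process, and concatenates the transposed blocks column-wise, which yields exactly the transpose of the whole matrix.

import numpy as np
from multiprocessing import Pool

def transpose_block(block):
    # Transpose one horizontal slice (a block of transactions) of the matrix.
    return block.T

def parallel_transpose(D, n_workers=4):
    # Split the transaction matrix into row blocks, transpose each block in a
    # separate worker process, and stitch the results back together side by side.
    blocks = np.array_split(D, n_workers, axis=0)
    with Pool(n_workers) as pool:
        transposed = pool.map(transpose_block, blocks)
    return np.concatenate(transposed, axis=1)

if __name__ == "__main__":
    D = np.random.randint(0, 2, size=(8, 5))   # 8 transactions, 5 items (dummy data)
    assert np.array_equal(parallel_transpose(D), D.T)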
B. Candidate Set Generation
We propose a new algorithm in which the transactional database is considered as a two-dimensional array over a generalized-value dataset. The main difference between the proposed algorithm and other algorithms is that, instead of using the transactional array in its natural form, our algorithm uses the transpose of the array, i.e. the rows and columns of the array are interchanged, and the transposition is achieved using the parallel matrix transpose algorithm.

An Illustration
Suppose we have a transactional database in which the user transactions T1 to T5 and the items A1 to A5 are stored in the form of generalized values, as shown in Table 1 (Fig. 4). The transpose of the transactional database of Table 1, obtained by applying the parallel transposition used in our proposed algorithm, is stored in Table 2. Assume the user-specified minimum support is 40%; the steps for generating all frequent itemsets in the proposed algorithm are then repeated until the NULL set is reached. In our algorithm the transactional dataset is used in the transposed form, so the candidate set and frequent itemset generation process changes compared with the Apriori algorithm. The candidate 2-itemsets are generated by performing a dot-multiplication of the rows of the array: as the array consists of generalized values, the resultant cell is 1 if the corresponding cells of the respective rows are both 1, and 0 otherwise. In this approach we obtain a new array consisting of the candidate 2-itemsets, from which the higher-order itemsets are derived.

Algorithm
// Transpose the transactional database
1. Transpose(Data Set)
2. Read the database and count the support of C1 to determine L1 using the sum of rows
3. L1 := frequent 1-itemsets and k := 2
4. While (Lk-1 ≠ NULL set) do
   Begin
      Ck := Call Gen_candidate_itemsets(Lk-1)
      Call Prune(Ck)
      for all itemsets i in I do
         Calculate the support values using dot-multiplication of the rows of the array;
      Lk := all candidates in Ck with a minimum support;
      k := k + 1
   End
5. End of step 4
End Procedure
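Reading the pseudocode as a whole: transpose once, obtain L1 from row sums, then grow level by level, counting the support of each candidate by AND-ing ("dot-multiplying") the corresponding item rows. The sketch below is our interpretation of those steps, under two assumptions the paper leaves implicit: the generalized values are encoded as 0/1, and the minimum support is 40% as in the illustration. It also folds the Prune call into the counting loop rather than implementing it separately.

import numpy as np
from itertools import combinations

def mine_frequent_itemsets(D, item_names, min_support=0.4):
    # Frequent itemset mining on the transposed 0/1 transaction matrix.
    # D: rows = transactions, columns = items. Returns {itemset (tuple): support count}.
    n_transactions = D.shape[0]
    min_count = int(np.ceil(min_support * n_transactions))

    Dt = D.T                                            # step 1: transpose once
    rows = {name: Dt[i] for i, name in enumerate(item_names)}

    # Steps 2-3: L1 from the row sums of the transposed array.
    frequent = {(name,): int(row.sum())
                for name, row in rows.items() if row.sum() >= min_count}
    result = dict(frequent)

    # Step 4: join the previous level, count supports, keep the frequent ones.
    k = 2
    while frequent:
        candidates = set()
        for a, b in combinations(sorted(frequent), 2):
            union = tuple(sorted(set(a) | set(b)))
            if len(union) == k:
                candidates.add(union)

        frequent = {}
        for cand in candidates:
            # "Dot-multiplication" of item rows: element-wise AND, then a row sum.
            combined = np.ones(n_transactions, dtype=int)
            for name in cand:
                combined &= rows[name]
            count = int(combined.sum())
            if count >= min_count:
                frequent[cand] = count
        result.update(frequent)
        k += 1
    return result

if __name__ == "__main__":
    # Dummy stand-in for Table 1: transactions T1..T5 over items A1..A5.
    D = np.array([
        [1, 1, 0, 1, 0],
        [0, 1, 1, 0, 0],
        [1, 1, 0, 1, 1],
        [1, 0, 1, 0, 0],
        [0, 1, 1, 1, 0],
    ])
    print(mine_frequent_itemsets(D, ["A1", "A2", "A3", "A4", "A5"]))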