
A Design Methodology for a Document Indexing Tool Using Pragmatic Evidence in Text *

Chrysanne DiMarco
David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada
cdimarco@uwaterloo.ca

Robert E. Mercer
Department of Computer Science
The University of Western Ontario
London, Ontario, Canada
mercer@csd.uwo.ca

Victoria L. Rubin
Faculty of Information and Media Studies
The University of Western Ontario
London, Ontario,

Abstract

The huge increase in volume of online literature has led to a parallel surge in research into methods for retrieving meaningful information from this textual data: “content extraction” has emerged as a prominent field in natural language computing. However, little progress has as yet been made in determining the pragmatic content of a document, ‘hidden’ meaning such as the attitudes of the writer toward her audience, the intentions being communicated, the intra-textual relationships between document objects, and so forth. But pragmatic information carries a great deal of the underlying meaning in a document, and the inability to access this information means that current content extraction methods are very uninformed.

Our goal is to develop natural language systems capable of extracting this pragmatic information in text to provide more meaningful document understanding. To this end, we are developing automated methods, both discourse-based and using Machine Learning techniques, to recognize and interpret pragmatic cues in text. This pragmatic evidence may then be used to provide more-sophisticated document indexing to guide information extraction by providing detailed information on the fine-grained nature of the linking relationship between documents.

* Authors are listed in alphabetical order.
An earlier version of this paper was given as a poster at the 2004 Joint Conference on Human Language Technology/North American Association for Computational Linguistics (HLT-NAACL) (BioLink 2004: Workshop on Linking Biological Literature, Ontologies and Databases: Tools for Users), Boston, May 2004.

1 Introduction

Documents contain “text objects” that have many information retrieval uses. These text objects include: textual items, such as noun phrases, and metadata items, such as citations to other articles, hyperlinks to other documents or web pages, and XML attributes. Respective uses of these text objects include: keyword indexing to form links between keywords and documents; citation indexes; and XML attributes as an important metadata search item.

While it is a straightforward task to associate keywords with documents or to build citation indexes that facilitate searches with a high rate of recall, the presence of a keyword or citation link does not necessarily mean a correspondingly high search precision. To improve search precision, each link should ideally be labelled with a domain-specific descriptive category that indicates a likely reason for the link. We propose to develop automated methods of link classification providing such typed links to enable more-effective literature indexing and analysis tools.

Our initial task is to construct an annotation tool for manually classifying rhetorical and other pragmatic cues in online texts to provide a training corpus for developing our automated document-link classification system.

2 The Problem

2.1 Motivation

With the explosion in the amount of online literature, our current techniques for information exploration have been overwhelmed. If we could recognize and use fine-grained relationships among documents to assist navigation through information networks, we could better address this problem.

Suppose that we wish to label a link to the following news article which is cited by a competitor company analysis: “The U.S.
Food and Drug Administration is planning to reverse additional patent protection for Biovail Corp.’s Tiazac, setting the stage for potential generic competition against the Mississauga company’s flagship drug.” (The Globe and Mail, Saturday 5 March 2001, page B2.)

Suppose also that we wish to label this link with either “Favourable development for competitor” or “Unfavourable development for competitor”. If we extract just the positive phrase “additional patent protection for [competitor’s product]” then, without additional information, this article would be labelled as “Favourable”. However, the positive phrase is obviously in the negative context indicated by “reverse”, so it should have been labelled as “Unfavourable”. If the verb had instead been “continue” (a positive context) then the positive sense would again prevail.

It is obvious from this example that an analysis of the text object context is crucial. What is not obvious is that the context could be structurally larger than just the enclosing sentence, even as large as a paragraph, the entire document, or a set of documents. The goal of this project is to develop new methods for discovering contextual information vital to the interpretation of text objects found in documents. This information can then be used to label links to the document that use the textual object. Although deep analysis of text would be required for complete understanding of all the nuanced relationships between documents, it is our contention that surface-cue and stylistic analysis, easier and more tractable than full syntactic and semantic understanding, can provide much of the information that will be needed.

2.2 The Approach

We are bootstrapping the development of a set of methods and software tools for the automated classification of links between documents in online corpora by focusing initially on the problem of automated citation classification in scientific articles.
This is a particularly challenging problem as there can be upwards of 35 citation categories used in scholarly writing, with fine-grained distinctions among the category definitions. Determining the purpose of a citation can involve recognizing linguistic features at all levels of the text: lexical cues, syntactic arrangement, and overall discourse structure. We have demonstrated that automated citation classification is feasible, but to improve the performance of our classifier we need more-sophisticated techniques blending discourse understanding with statistical methods for large-scale corpus analysis.

Once we have determined the purpose of a citation, we can then use this knowledge to group together articles and authors into clusters that will allow better navigation of the literature in a subject domain, and mapping to social networks within a scientific community. We are applying knowledge from Computational Linguistics and Machine Learning to develop methods and software tools for automatically determining the function of citations. It is expected that these results will then be applicable to related problems in classifying other types of links and hyperlinks among documents.

Our resources include specialized repositories of biomedical articles (10,000) and physics articles (30,000), as well as the entire BioMed Central corpus.
Our initial goal is to build a training set of manually classified citations in biomedical articles (using a set of 1000 protein-interaction articles we have curated from the larger biomedical corpus) that we could then use for developing our learning algorithms and for building scientific social networks.

We have developed an initial annotation tool for manually classifying citations in scientific articles and now plan to extend the tool to classify other types of surface pragmatic cues (e.g., hedging cues, indicators of uncertainty). These cues will then provide a training corpus to develop automated methods for classifying the types of links between documents.

Our planned methodology is as follows:

1. Development of Machine Learning algorithms (e.g., using Hidden Markov Models, Conditional Random Fields) for detection of linguistic features in text relevant to citation function (R. Radoulov, Master’s student, Waterloo).
2. Development of Machine Learning methods and software tools for automated classification of citations (J. Taylor, PhD student, UWO; R. Radoulov, Master’s student, Waterloo).
3. Analysis of discourse and argumentation structure (e.g., using lexical chaining, lexical style, classical argumentation models) as cues to citation function and inter-document relations (T. Maynard, Master’s student, UWO; B. White, PhD student, UWO; C. DiMarco; R. Mercer; V. Rubin).
4. Using citation network analysis to map the structure of scientific communities (F. Kroon, PhD student, Waterloo).

3 The Springboard for our Research: Citation Classification

3.1 Our goal: A tool for better document indexing

Indexing tools, such as CiteSeer [3], play an important role in the scientific endeavour by providing researchers with a means to navigate through the network of scholarly scientific papers using the connections provided by citations. Citations relate articles within a research field by linking together works whose methods and results are in some way mutually relevant.
Customarily, authors include citations in their papers to indicate works that are foundational in their field, background for their own work, or representative of complementary or contradictory research. Another researcher may then use the presence of citations to locate articles she needs to know about when entering a new field or to read in order to keep track of progress in a field where she is already well-established. But, with the explosion in the amount of scientific literature, a means to provide more information in order to give more intelligent control to the navigation process is warranted. A user normally wants to navigate more purposefully than “Find all articles citing a source article”. Rather, the user may wish to know whether other experiments have used similar techniques to those used in the source article, or whether other works have reported conflicting experimental results. In order to navigate a citation index in this more-sophisticated manner, the citation index must contain not only the citation-link information, but also must indicate the function of the citation in the citing article.

The near-term goal of our research project is the implementation of an indexing tool for scholarly scientific literature which uses rhetorical and other pragmatic cues in the context surrounding a citation to provide information about the relationship between the two papers connected by the citation. Ultimately, we hope to apply the methods and tools we will develop in classification of more-general kinds of document links to enhance literature indexing schemes, improve document retrieval precision, and advance social network analysis.

3.2 The aim of citation indexing

A citation may be formally defined as a portion of a sentence in a citing document which references another document or a set of other documents collectively.
For example, in sentence 1 below, there are two citations: the first citation is “Although the 3-D structure...progress”, with the set of references (Eger et al., 1994; Kelly, 1994); the second citation is “it was shown...submasses”, with the single reference (Coughlan et al., 1986).

(1) Although the 3-D structure analysis by x-ray crystallography is still in progress (Eger et al., 1994; Kelly, 1994), it was shown by electron microscopy that XO consists of three submasses (Coughlan et al., 1986).

A citation index enables efficient retrieval of documents from a large collection: it consists of source items and their corresponding lists of bibliographic descriptions of citing works. Citation indexing of scientific articles was invented by Dr. Eugene Garfield in the 1950s as a result of studies on problems of medical information retrieval and indexing of biomedical literature. Dr. Garfield later founded the Institute for Scientific Information (ISI), whose Science Citation Index [4] is now one of the most popular citation indexes. Recently, with the advent of digital libraries, Web-based indexing systems have begun to appear (e.g., ISI’s ‘Web of Knowledge’, CiteSeer [3]).

Authors of scientific papers normally include citations in their papers to indicate works that are connected in an important way to their paper. Thus, a citation connecting the source document and a citing document serves one of many functions. For example, one function is that the citing work gives some form of credit to the work reported in the source article. Another function is to criticize previous work. Other functions include identifying works that are foundational in the field, that provide background for the authors’ own work, or that are representative of complementary or contradictory research. Determining the exact nature of the relationship between a citing and a cited paper often requires some level of understanding of the text in which the citation is embedded.
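
The structure of a citation index described above, extended with a label for the function of each citation, can be sketched as a small data structure. The Python fragment below is only an illustrative sketch under stated assumptions, not the implementation of the tool discussed in this paper: the class names, the article identifiers, and the function labels ("credit", "criticism", "background") are hypothetical, with the labels chosen to mirror the citation functions mentioned in the text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """One citation link; all field values used below are hypothetical."""
    citing: str    # identifier of the citing article
    source: str    # identifier of the cited (source) article
    function: str  # illustrative function label, e.g. "credit"


class TypedCitationIndex:
    """Maps each source item to the citations that reference it."""

    def __init__(self):
        self._by_source = defaultdict(list)

    def add(self, citation):
        self._by_source[citation.source].append(citation)

    def citing_works(self, source, function=None):
        """Return citing articles, optionally restricted to one function."""
        return [
            c.citing
            for c in self._by_source.get(source, [])
            if function is None or c.function == function
        ]


# Hypothetical entries, for illustration only.
index = TypedCitationIndex()
index.add(Citation("citing-1", "source-X", "credit"))
index.add(Citation("citing-2", "source-X", "criticism"))
index.add(Citation("citing-1", "source-Y", "background"))

# A plain citation index answers "find all articles citing a source article".
all_citing = index.citing_works("source-X")
# A typed index can also answer "find the articles that criticize it".
critical = index.citing_works("source-X", function="criticism")
```

Storing the function on the link itself, rather than on either article, reflects the point that the same article may cite different sources for different reasons.
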
3.3 Citation indexing in biomedical literature analysis

In the biomedical field, a domain of particular interest to us, we believe that the usefulness of automated citation classification in literature indexing can be found both in the larger context of managing entire databases of scientific articles and in specific information-extraction problems. On the larger scale, database curators need accurate and efficient methods for building new collections by retrieving articles on the same topic from huge general databases. Simple systems (e.g., [1], [13]) consider only keyword frequencies in measuring article similarity. More-sophisticated systems, such as the Neighbors utility [22], may be able to locate articles that appear to be related in some way (e.g., finding related Medline abstracts for a set of protein names [2]), but the lack of specific information about the nature and validity of the relationship between articles may still make the resulting collection a less-than-ideal resource for subsequent analysis. Citation classification to indicate the nature of the relationships between articles in a database would make the task of building collections of related articles both easier and more accurate. And, the existence of additional knowledge about the nature of the linkages between articles would greatly enhance navigation among a space of documents to retrieve meaningful information about the related content.

A specific problem in information extraction that may benefit from the use of citation categorization involves mining the literature for protein-protein interactions (e.g., [2], [13], [21]).
Currently, even the most-sophisticated systems are not yet capable of dealing with all the difficult problems of resolving ambiguities and detecting hidden knowledge. For example, Blaschke et al.’s system [2] is able to handle fairly complex problems in detecting protein-protein interactions, including constructing the network of protein interactions in cell-cycle control, but important implicit knowledge is not recognized. In the case of cell-cycle analysis for Drosophila, their system is able to determine that relationships exist between Cak, Cdk7, CycH, and Cdk2: Cak inhibits/phosphorylates Cdk7, Cak activates/phosphorylates Cdk2, Cdk7 phosphorylates Cdk2, CycH phosphorylates Cak and CycH phosphorylates Cdk2. However, the system is not able to detect that Cak is actually a complex formed by Cdk7 and CycH, and that the Cak complex regulates Cdk2. While the earlier literature describes inter-relationships among these proteins, the recognition of the generalization in their structure, i.e., that these proteins are part of a complex, is contained only in more-recent articles: “There is an element of generalization implicit in later publications, embodying previous, more dispersed findings. A clear improvement here would be the generation of associated weights for texts according to their level of generality” [2]. Citation categorization could provide just these kinds of ‘ancestral’ relationships between articles (whether an article is foundational in the field or builds directly on closely related work) and, if automated, could be used in forming collections of articles for study that are labelled with explicit semantic and rhetorical links to one another. Such collections of semantically linked articles might then be used as ‘thematic’ document clusters (cf.
Wilbur [23]) to elicit much more meaningful information from documents known to be closely related.

An added benefit of having citation categories available in text corpora used for studies such as extracting protein-protein interactions is that more, and more-meaningful, information may be obtained. In a potential application, Blaschke et al. [2] noted that they were able to discover many more protein-protein interactions when including in the corpus those articles found to be related by the Neighbors facility [22] (285 versus only 28 when relevant protein names alone were used in building the corpus). Lastly, very difficult problems in scientific and biomedical information extraction that involve aspects of deep-linguistic meaning may be resolved through the availability of citation categorization in curated texts: synonym detection, for example, may be enhanced if different names for the same entity occur in articles that can be recognized as being closely related in the scientific research process.

4 Our Guiding Principles

4.1 Using the ‘rhetoric of science’

The automated labelling of citations with a specific citation function requires an analysis of the linguistic features in the text surrounding the citation, coupled with a knowledge of the author’s pragmatic intent in placing the citation at that point in the text. The author’s purpose for including citations in a research article reflects the fact that researchers wish to communicate their results to their scientific community in such a way that their results, or knowledge claims, become accepted as part of the body of scientific knowledge. This persuasive nature of the scientific research article, how it contributes to making and justifying a knowledge claim, is recognized as the defining property of scientific writing by rhetoricians of science, e.g., [7], [8], [9], [17].
Style (lexical and syntactic choice), presentation (organization of the text and display of the data), and argumentation structure are noted as the rhetorical means by which authors build a convincing case for their results.

Our approach to automated citation classification is based on the detection of fine-grained linguistic cues in scientific articles that help to communicate these rhetorical stances and thereby map to the pragmatic purpose of citations. As part of our overall research methodology, our goal is to map the various types of pragmatic cues in scientific articles to rhetorical meaning. Our previous work has described the importance of discourse cues in enhancing inter-article cohesion signalled by citation usage [15], [12]. We have also been investigating another class of pragmatic cues, hedging cues [16], that are deeply involved in creating the pragmatic effects that contribute to the author’s knowledge claim by linking together a mutually supportive network of researchers within a scientific community. In extending our work to more-general types of document links, we are exploring other types of pragmatic connotations, including certainty categorization and how explicitly marked certainty can be predictably and dependably identified from newspaper article data. Certainty identification, in particular, can serve as a foundation for a novel type of text analysis that can enhance question answering, search, and information retrieval capabilities ([18], [19]). Certainty identification is a part of the new and exciting direction in information retrieval, natural language processing, and text-mining, concerned with exploration of subjective, attitudinal, and affective aspects of texts [20].

4.2 Results of our previous studies

In our preliminary study [15], we analyzed the frequency of the cue phrases from [14] in a set of scholarly scientific articles.
We reported strong evidence that these cue phrases are used in the citation sentences and the surrounding text with the same frequency as in the article as a whole. In subsequent work [12], we analyzed the same dataset of articles to begin to catalogue the fine-grained discourse cues that exist in citation contexts. This study confirmed that authors do indeed have a rich set of linguistic and non-linguistic methods to establish discourse cues in citation contexts.

Another type of linguistic cue that we are studying is related to hedging effects in scientific writing that are used by an author to modify the affect of a ‘knowledge claim’. Hedging in scientific writing has been extensively studied by Hyland [9], including cataloguing the pragmatic functions of the various types of hedging cues. As Hyland [9] explains, “[Hedging] has subsequently been applied to the linguistic devices used to qualify a speaker’s confidence in the truth of a proposition, the kind of caveats like I think, perhaps, might, and maybe which we routinely add to our statements to avoid commitment to categorical assertions. Hedges therefore express tentativeness and possibility in communication, and their appropriate use in scientific discourse is critical (p. 1)”.

The following examples illustrate some of the ways in which hedging may be used to deliberately convey an attitude of uncertainty or qualification. In the first example, the use of the verb “suggested” hints at the author’s hesitancy to declare the absolute certainty of the claim:

(2) The functional significance of this modulation is suggested by the reported inhibition of MeSo-induced differentiation in mouse erythroleukemia cells constitutively expressing c-myb.

In the second example, the syntactic structure of the sentence, a fronted adverbial clause, emphasizes the effect of qualification through the rhetorical cue “Although”.
The subsequent phrase, “a certain degree”, is a lexical modifier that also serves to limit the scope of the result:

(3) Although many neuroblastoma cell lines show a certain degree of heterogeneity in terms of neurotransmitter expression and differentiative potential, each cell has a prevalent behavior in response to differentiation inducers.

In [16], we showed that the hedging cues proposed by Hyland occur more frequently in citation contexts than in the text as a whole. With this information we conjecture that hedging cues are an important aspect of the rhetorical relations found in citation contexts and that the pragmatics of hedges may help in determining the purpose of citations.

We investigated this hypothesis by doing a frequency analysis of hedging cues in citation contexts in a corpus of 985 biology articles. We obtained statistically significant results (summarized in Table 1) indicating that hedging is used more frequently in citation contexts than in the text as a whole. Given the presumption that writers make stylistic and rhetorical choices purposefully, we propose that we have further evidence that connections between fine-grained linguistic cues and rhetorical relations exist in citation contexts.

Table 1 shows the proportions of the various types of sentences that contain hedging cues, broken down by hedging-cue category (verb or nonverb cues), according to the different sections in the articles (background, methods, results and discussion, conclusions). For all but one combination, citation sentences are more likely to contain hedging cues than would be expected from the overall frequency of hedge sentences (p ≤ .01). Citation ‘window’ sentences (i.e., sentences in the text close to a citation) generally are also significantly (p ≤ .01) more likely to contain hedging cues than expected, though for certain combinations (methods, verbs and nonverbs; res+disc, verbs) the difference was not significant.

Tables 2, 3, and 4 summarize the occurrence of hedging cues in citation ‘contexts’ (a citation sentence and the surrounding citation window). Table 5 shows the proportion of hedge sentences that either contain a citation, or fall within a citation window; Table 5 suggests (last 3-column column) that the proportion of hedge sentences containing citations or being part of citation windows is at least as great as what would be expected just by the distribution of citation sentences and citation windows.

Table 1 indicates, with statistical significance, that in most cases the proportion of hedge sentences in the citation contexts is greater than what would be expected by the distribution of hedge sentences. Taken together, these conditional probabilities support the conjecture that hedging cues and citation contexts correlate strongly. Hyland [9] has catalogued a variety of pragmatic uses of hedging cues, so it is reasonable to speculate that these uses can be mapped to the rhetorical meaning of the text surrounding a citation, and thence to the function of the citation.

5 Our Design Methodology

The indexing tool that we are designing is an enhanced citation index. The feature that we are adding to a standard citation index is the function of each citation, that is, given an agreed-upon set of citation functions, we want our tool to be able to automatically categorize a citation into one of these functional categories. To accomplish this automatic categorization we are using a decision tree; currently, we are building the decision tree by hand, but in future we in-