
Veracity Tools for Information Quality Assessment

Extending Information Quality Assessment Methodology: A New Veracity/Deception Dimension and Its Measures

Victoria L. Rubin and Tatiana Vashchilko
Language and Information Technology Research Lab (LIT.RL), Faculty of Information and Media Studies, University of Western Ontario
North Campus Building, Room 260, London, Ontario, Canada N6A 5B7

ABSTRACT

This paper extends information quality (IQ) assessment methodology by arguing that veracity/deception should be one of the components of intrinsic IQ dimensions. Since veracity/deception differs contextually from accuracy and other well-studied components of intrinsic IQ, including it in the set of IQ dimensions makes its own contribution to the assessment and improvement of IQ. Recently developed software for detecting deception in textual information provides ready-to-use IQ assessment (IQA) instruments. The paper focuses on the specific IQ problem posed by deceptive messages, on the information activities they affect, and on the IQA instruments (or tools) that detect deception in order to improve IQ. In particular, the methodology of automated deception detection in written communication provides the basis for measuring the veracity/deception dimension and demonstrates no overlap with other intrinsic IQ dimensions. Considering several known deception types (such as falsification, concealment and equivocation), we emphasize that the IQA deception tools are primarily suitable for falsification. Certain types of deception strategies cannot be spotted automatically with the existing IQA instruments, which rely on underlying linguistic differences between truth-tellers and liars. We propose potential avenues for the future development of automated deception detection instruments, taking into account theoretical, methodological and practical aspects and needs.
Blending multidisciplinary research on deception detection with research on IQ in Library and Information Science (LIS) and Management Information Systems (MIS), the paper contributes to IQA and its improvement by adding one more dimension, veracity/deception, to intrinsic IQ.

Keywords
Deception detection, lying, veracity, natural language processing, online tools, information quality assessment.

INTRODUCTION

Deception in written communication represents an information quality (IQ) problem: the sender intentionally and knowingly creates a false belief or false conclusion in the mind of the receiver of the information (e.g., Buller and Burgoon, 1996; Zhou et al., 2004). Deception features prominently in several domains (e.g., politics, business, personal relations, science, journalism, per Rubin 2010), and the corresponding user groups (such as news readers, consumers of products, health consumers, voters, or employers) are affected by the decreased information quality. Recent research on deception detection and emerging technologies for identifying the veracity of written messages demonstrate the wide range of problems related to deceptive messages and the importance of deception detection in textual information. “With the massive growth of text-based communication, the potential for people to deceive through computer-mediated communication has also grown and such deception can have disastrous results” (Fuller et al. 2011, p. 8392).

However, IQ research seems to undervalue the role of deception in improving IQ (Lee et al. 2002, Stvilia et al. 2007, Knight and Burn 2005). This paper attempts to fill this gap by blending the multidisciplinary literature on deception detection with the literature on IQ. The goal is to extend the IQA methodology and framework by demonstrating that the veracity/deception dimension is one of the independent components of intrinsic IQ [1] and should be measured separately in IQA for subsequent IQ improvement.
First, the paper reviews recent literature on the methodology of automated deception detection in written communication and on IQ. The second part theorizes the problem of deception and the affected users to argue that veracity/deception of information is a separate dimension of intrinsic IQ and, as such, should have its own IQA. The third part describes the method used to identify the existing deception detection tools and the main features of computational tools for verbal lie detection. The next section compares the use of different tools for deception detection. The final part discusses the implications for future research and practice on improving information quality through deception detection if the veracity/deception dimension is one of the significant components of IQA.

[1] “Intrinsic IQ implies that information has quality in its own right” (Lee et al. 2002, p. 135).

ASIST 2012, October 28-31, 2012, Baltimore, MD, USA. Copyright © 2010 Victoria L. Rubin and Tatiana Vashchilko

LITERATURE REVIEW

The complete automation of deception detection in written communication is mostly based on linguistic cues derived from the word classes of the Linguistic Inquiry and Word Count (LIWC) (Pennebaker et al., 2001). The main idea of LIWC coding is text classification according to truth conditions. LIWC has been extensively employed to study deception detection (Vrij et al., 2007, Hancock et al., 2008, Mihalcea and Strapparava, 2009). Mihalcea and Strapparava (2009) used 100 stories on three controversial topics, with word frequencies used to train Naïve Bayes and Support Vector Machines classification algorithms, and correctly classified stories into deceptive and truthful categories in 70% of the cases. A saliency measure for every word class, based on the dominance score and on word coverage as the weight of the linguistic item in the corpora, is the basis for identifying the distinctive characteristics of deceptive texts.
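To make the word-frequency approach concrete, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. It is not the implementation of Mihalcea and Strapparava (2009): the toy stories, the labels, and the function names (tokenize, train_naive_bayes, classify) are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical toy corpus; real studies use far larger labelled collections.
TOY_STORIES = [
    ("i went to the store and i bought milk", "truthful"),
    ("i saw my friend at the park yesterday", "truthful"),
    ("everyone knows that product is always amazing", "deceptive"),
    ("nobody would ever doubt that amazing offer", "deceptive"),
]

def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    tokens = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [t for t in tokens if t]

def train_naive_bayes(labeled_texts):
    """Collect per-class word frequencies, document counts and the vocabulary."""
    word_counts = {}          # label -> Counter of word frequencies
    doc_counts = Counter()    # label -> number of training texts
    for text, label in labeled_texts:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the label with the highest log posterior (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(doc_counts[label] / total_docs)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                score += math.log((counts[token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_naive_bayes(TOY_STORIES)
print(classify("that offer is always amazing", model))
print(classify("i went to the park", model))
```

With only four training texts the result says nothing about real accuracy; the sketch only shows how word frequencies become class evidence.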
Partial automation of deception detection in textual information appeals to higher levels of linguistic analysis, namely the semantic and discourse levels (Rubin and Vashchilko 2012). The main difference between the two approaches is whether they rely solely on the automated identification of linguistic cues or on a partially automated analysis of the underlying semantic meaning of the entire text under consideration. Researchers and practitioners face trade-offs when choosing complete automation, partial automation, or a combination of both methods. The analysis takes longer with partial automation, and comparing the precision of these methods needs additional research. There are, however, drawbacks in every approach.

The LIWC results serve as the independent variables for subsequent logit analysis and as the input for the classification algorithms that judge the deceptiveness of a text. One advantage of logit analysis based on LIWC coding for deception detection is its applicability across various subject areas, where it achieved a 67% accuracy rate (Newman et al. 2003). However, Vrij et al. (2007) compared the LIWC approach with manual coding for detecting deception and concluded that manual analysis might be better than the LIWC-based computational analysis. The most recent analysis of automated deception detection, with corresponding software for detecting fake online reviews, however, demonstrated a significant improvement of computational approaches over human abilities to detect deception (Ott et al. 2011). The goal of Ott et al. (2011) was to identify fake reviews of products and services on the Internet using four classification methods: two ML classifiers (Naïve Bayes and Support Vector Machines), genre identification through the frequency distribution of part-of-speech (POS) tags, and n-gram-based text categorization.
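The step from word-category statistics to a logit model can be sketched as follows. The example computes the rate of words falling into two word categories and fits a logistic regression by stochastic gradient ascent in plain Python. The category word lists, the training sentences and all function names are hypothetical stand-ins: they are not the LIWC dictionary, and the code is not the model of Newman et al. (2003).

```python
import math

# Hypothetical word categories standing in for dictionary-based word classes.
CATEGORIES = {
    "first_person": {"i", "me", "my", "mine", "myself"},
    "negation": {"no", "not", "never", "nobody"},
}

# Hypothetical training texts; label 1 = deceptive, 0 = truthful.
TOY_TEXTS = [
    ("i walked my dog and i fed my cat", 0),
    ("my sister called me and i answered", 0),
    ("nobody was there and nothing was not fine", 1),
    ("they never said no and did not leave", 1),
]

def category_rates(text):
    """Share of words in each category: the independent variables."""
    words = [w.strip(".,;:!?\"'()") for w in text.lower().split()]
    words = [w for w in words if w]
    total = len(words) or 1
    return [sum(w in members for w in words) / total
            for members in CATEGORIES.values()]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by stochastic gradient ascent on the log-likelihood."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = y - p
            bias += lr * err
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias

def predict_proba(text, weights, bias):
    """Estimated probability that the text is deceptive."""
    x = category_rates(text)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

rows = [category_rates(text) for text, _ in TOY_TEXTS]
labels = [label for _, label in TOY_TEXTS]
weights, bias = fit_logit(rows, labels)
print(round(predict_proba("i saw my friend", weights, bias), 3))
print(round(predict_proba("nobody never said no", weights, bias), 3))
```

The sign and size of each fitted weight play the role that the direction of effect of a linguistic category plays in the studies cited above; on such a tiny, separable sample they are not meaningful estimates.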
Research on IQ defines and assesses information quality based on the usefulness of information, or its “fitness for use”, by delineating various dimensions along which IQ can be measured quantitatively (Juran, 1992, Lee et al. 2002, Stvilia et al. 2007, Knight and Burn 2005). One of the four major dimensions of IQ is intrinsic IQ, to which various authors have assigned such components as accuracy, believability, reputation, objectivity (Wang and Strong 1996); accuracy and factuality (Zmud 1978); believability, accuracy, credibility, consistency and completeness (Jarke and Vassiliou 1997); accuracy, precision, reliability, freedom from bias (Delone and McLean 1992); accuracy and reliability (Goodhue 1995); accuracy and consistency (Ballou and Pazer 1985); and correctness and unambiguity (Wand and Wang 1996). However, almost no literature considers deception as one of the components of intrinsic IQ, despite extensive research on deception detection and its importance for written communication.

METHODOLOGY

To identify the existing computational tools for deception detection in written statements, we conducted a thorough search of [...] and search engines as well as the websites of well-known scholars of deception detection (mainly from the list of scholars who organized or participated in the European Chapter of the Association for Computational Linguistics (EACL) 2012 workshop on Deception Detection). The inclusion criteria for the tools were English language, complete or partial use of computational algorithms to analyze written statements, and online availability of the tools (for free use or purchase). Automated deception detection is a cutting-edge technology emerging from the fields of Natural Language Processing (NLP) and Machine Learning, building on years of research on deception in interpersonal psychology and communication studies.
Much has been written in Library and Information Science (LIS) on credibility assessment and on a variety of ways and checklist schemes to verify the credibility and stated cognitive authority of information providers (e.g., Rieh 2010, Fogg and Tseng 1999). Passing a deception detection test can verify the source’s intention to create a truthful impression in the reader’s mind, supporting the source’s trustworthiness and credibility. On the other hand, failing the test immediately alerts the user to potential alternative motives and intentions and necessitates further fact verification.

The two main reasons for using automation in deception detection are to increase objectivity by decreasing potential human bias (reliability of deception detection) and to improve speed (time needed to process large amounts of text), which is especially valuable in law enforcement [2] (Hauch et al. 2012). However, Hauch et al. (2012) demonstrate that computational tools might provide conflicting findings on the direction of the effect of the same linguistic categories on the level of deception in textual information.

The diversity of target audiences for deception detection does not lead to much variety in the methodology of existing software. Several successful studies have demonstrated the effectiveness of linguistic cue identification, as the language of truth-tellers is known to differ from that of deceivers (e.g., Bachenko, Fitzpatrick, and Schonwetter 2008, Larcker and Zakolyukina 2010). The majority of text-based analysis software uses different types of linguistic cues. Some common linguistic cues are the same across all deception software types, whereas others are derived specifically for specialized topics, which help to generate additional linguistic cues.
For example, Moffit and Giboney’s (2012) software calculates statistics for various linguistic features present in written text (number of words, etc.) independently of its content, and these statistics can subsequently be used to classify the text as deceptive or truthful.

RESULTS

The use of language, as represented by linguistic items, changes under the influence of different situational factors: genre, register, speech community, text and discourse type (Crystal 1969). Therefore, the tools (or more precisely, the verbal cues) for deception detection could differ across knowledge domains even though the computational algorithms might be the same. Thus, it is important to categorize deception tools by the subject area for which the tool was originally developed, and by other main features. The subject area might not be important, though, if the linguistic based cues (LBC) are used on the basis of general linguistic knowledge (Höfer et al. 1996). Nevertheless, if the subject areas are highly specialized, the researcher should account for them (Höfer et al. 1996; Porter and Yuille 1996; Steller and Köhnken 1989).

Zhou and colleagues (2004) developed Text-based Asynchronous Computer-Mediated Communication (TA-CMC) and reviewed five main systems developed for analyzing deception in textual communication: Criteria-Based Content Analysis (CBCA), Reality Monitoring (RM), Scientific Content Analysis (SCAN), Verbal Immediacy (VI) and Interpersonal Deception Theory (IDT). Each of the systems developed criteria for classifying textual information as either deceptive or truthful.

[2] Law enforcement typically has only a few hours to interrogate potential suspects and evaluate the truthfulness of their statements, so the automation of deception detection can be very helpful.
These systems have been developed theoretically and methodologically, with individually written computer programs accompanying each of the systems (Zhou et al. 2004). With advances in deception research and demand from practitioners for stable tools for quick and accurate deception detection, scholars have begun to produce software for deception detection. However, the majority of the software offers online evaluation tools without disclosing the algorithm (Chandramouli and Subbalakshmi 2012), or with an API (Ott et al. 2011, Moffit and Giboney 2012) and customizable dictionaries (Moffit and Giboney 2012).

Three free software tools (Chandramouli and Subbalakshmi 2012, Ott et al. 2011, Moffit and Giboney 2012) were evaluated; the LIWC software was left for future evaluation. Two of the tools (Chandramouli and Subbalakshmi 2012, Ott et al. 2011) not only calculate statistics for various linguistic features but also immediately report whether a text is deceptive, based on classification algorithms. The other two (Moffit and Giboney 2012 and LIWC) calculate the statistics of the linguistic features, but the user then needs to apply their own classification algorithms to the derived statistics to conclude whether the text is deceptive or truthful. Both approaches have advantages and disadvantages. The first two tools do not output the calculated statistics, so users cannot employ their own classification algorithms. The latter two supply statistics but do not offer any options for classifying the analyzed text. The software developed by Rajarathnam Chandramouli’s team at Stevens Institute of Technology gives one of three answers for the analyzed text (deceptive, true, or neutral) based on 40 deceptive cues.
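For the tools that report only feature statistics, the kind of output a user would then pass to their own classifier can be sketched as below. This is an illustration, not the code or API of any of the evaluated tools: the three features, the pronoun list and the function name linguistic_statistics are assumptions chosen for the example.

```python
import re

# Hypothetical, deliberately short pronoun list for illustration only.
PRONOUNS = {
    "i", "me", "my", "we", "us", "our", "you", "your",
    "he", "him", "his", "she", "her", "it", "its",
    "they", "them", "their",
}

def linguistic_statistics(text):
    """Return content-independent surface statistics for one text."""
    words = re.findall(r"[a-z']+", text.lower())
    word_count = len(words)
    long_words = sum(len(w) > 6 for w in words)     # words longer than 6 letters
    pronoun_count = sum(w in PRONOUNS for w in words)
    pct_long = 100.0 * long_words / word_count if word_count else 0.0
    return {
        "word_count": word_count,
        "pct_long_words": pct_long,
        "pronoun_count": pronoun_count,
    }

stats = linguistic_statistics("I told them my story yesterday.")
print(stats)
```

A dictionary of named statistics per text is a convenient shape because it can be turned directly into a feature row for whichever classification algorithm the user prefers.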
In addition to this final conclusion, the types of indicators that point to the text being deceptive or truthful are also of interest. Table 1 compares the software by running four stories, two truthful and two deceptive.

Table 1. Comparison of Deception Detection Software

Story                 Software of Chandramouli and Subbalakshmi (2012)   Review Sceptic (Ott et al. 2011)
Story 1 (truthful)    Deceptive                                          Truthful
Story 2 (truthful)    Deceptive                                          Truthful
Story 3 (deceptive)   Deceptive                                          Deceptive
Story 4 (deceptive)   Deceptive                                          Deceptive

Chandramouli’s software identified all the stories as deceptive: the truthful stories were classified as deceptive based on the percentage of words longer than 6 letters and the total number of pronouns, whereas the classification of the deceptive stories did not identify specific deceptive cues (Appendix 1 lists the stories and the screenshots with the results of the analysis with this software). By contrast, Ott and colleagues’ (2011) software correctly identified all four stories, assigning to every meaningful word in a text a contribution to deception or truthfulness (see Appendix 1 screenshots). The correct identification of truthful and deceptive stories is especially surprising, since the software was originally developed for analyzing reviews of products and services, which might imply the use of specific vocabulary, whereas the stories evaluated in our paper are life-experience stories with a general lexicon (used in Rubin and Conroy 2012). This supports Höfer and colleagues’ (1996) premise that linguistic based cues used on the basis of general linguistic knowledge are independent of the subject area. It implies that Ott and colleagues’ (2011) software might potentially be applied to deception detection beyond online reviews of products and services.
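The verdicts in Table 1 can be tallied with a few lines of Python. The sketch below transcribes the table and computes, for each tool, the share of correct verdicts and the share of truthful stories flagged as deceptive. The variable and function names are illustrative, and with only four stories the figures describe this small comparison rather than the tools' general performance.

```python
# Verdicts transcribed from Table 1, one tuple per story:
# (true label, Chandramouli and Subbalakshmi tool, Review Sceptic)
TABLE_1 = [
    ("truthful", "deceptive", "truthful"),
    ("truthful", "deceptive", "truthful"),
    ("deceptive", "deceptive", "deceptive"),
    ("deceptive", "deceptive", "deceptive"),
]

def accuracy(rows, column):
    """Share of stories whose verdict in `column` matches the true label."""
    hits = sum(row[0] == row[column] for row in rows)
    return hits / len(rows)

def false_alarm_rate(rows, column):
    """Share of truthful stories that the tool in `column` called deceptive."""
    truthful = [row for row in rows if row[0] == "truthful"]
    false_alarms = sum(row[column] == "deceptive" for row in truthful)
    return false_alarms / len(truthful)

for name, column in [("Chandramouli and Subbalakshmi", 1), ("Review Sceptic", 2)]:
    print(name, accuracy(TABLE_1, column), false_alarm_rate(TABLE_1, column))
# Chandramouli and Subbalakshmi 0.5 1.0
# Review Sceptic 1.0 0.0
```

Reporting the false alarm rate next to accuracy makes the first tool's behaviour visible: it is right on half of the stories only because it labels everything deceptive.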
DISCUSSION

Deception Types

Considering several known deception types (such as falsification, concealment and equivocation, per Burgoon and Buller 1994), we emphasize that the tools are primarily suitable for falsification only. For a recent review and unification of five taxonomies into a single feature-based classification of information manipulation varieties, see Rubin and Chen (Forthcoming). Certain types of deception strategies cannot be spotted automatically based on underlying linguistic differences between truth-tellers and liars. For instance, concealment is a deceptive strategy that requires careful fact verification, likely to be performed by humans regardless of the state of the art in automated deception detection. Another example is illustrated by Rubin and Conroy (Forthcoming): “would current methods be able to distinguish mostly genuinely truthful résumés from deceptive, given the fact that the résumé genre already requires persuasive rhetoric” and embellishments?

In the last few years, deception detection tools (or, perhaps a better term, conceptual tools dealing with language accuracy and factuality) have increased in importance in various subject areas due to the rising amount of digital information and the growing number of its users. Journalism, online marketing, proofreading and political science are a few examples. In political science, Politifact (albeit based on man-powered fact-checking) and TruthGoggles sort out the true facts in politics, helping citizens develop a better understanding of politicians’ statements. McManus’s (2009) BS Detector and Sagan’s (1996) Baloney Detection Kit help readers detect fraudulent and fallacious arguments and check the facts in news of various kinds: economic, political, scientific. In proofreading, Stylewriter and AftertheDeadline help users identify stylistic and linguistic problems in their writing.
These tools use not only linguistic cues to resolve truth/falsification problems, but also experts’ opinions and additional necessary sources to establish the factuality of events and statements.

The recently developed software that detects deception in textual information offers potential future avenues for research in information quality assessment, and several tools we have identified can be considered ready-to-use IQA instruments. Since veracity/deception differs contextually from accuracy and other well-studied components of intrinsic information quality, including veracity/deception in the set of IQ dimensions makes its own contribution to the assessment and improvement of IQ. The veracity/deception dimension should be one of the components of intrinsic IQ dimensions.

One question researchers might want to consider when developing a deception detection tool for a certain area of application is whether recent advances in deception detection in other areas could be successfully applied in the new area. For example, could deception detection in opinions on the Internet be applied to deception detection in such areas as law enforcement, especially commercial law enforcement? The very few evaluative texts in the example above demonstrated that this might be possible. However, extensive evaluations of these software tools are required to generalize the findings.

CONCLUSIONS

We are awaiting a major breakthrough in the area to support and enhance human abilities in lie/truth discernment. The value of the tools, though, is in raising awareness of the presence of potential deception to counteract gullibility and truth bias. Little is known about the applicability of various automated deception detection tools for written communication in various subject areas. The tools became publicly available in the last year or two, with the predominant methodology being text classification into deceptive or truthful based on statistics of linguistic based cues.
The precision of classification varies, with some software unable to identify truthful stories in particular contexts. While transparency in methods sounds like a recipe for how to lie, some verifiability of the tool’s performance is needed to establish the basis for a decision: why does this sound like a lie?

ACKNOWLEDGMENTS

This research is funded by the New Research and Scholarly Initiative Award (10-303), entitled Towards Automated Deception Detection: An Ontology of Verbal Deception Cues for Computer-Mediated Communication (Academic Development Fund at the University of Western Ontario).

REFERENCES

J. Bachenko, E. Fitzpatrick and M. Schonwetter. 2008. Verification and Implementation of Language-Based Deception Indicators in Civil and Criminal Narratives. In Proceedings of the 22nd International Conference on Computational Linguistics. Manchester, UK: ACL.

D.P. Ballou, H.L. Pazer, Modeling data and process quality in multi-input, multi-output information systems, Management Science 31 (2), 1985, pp. 150–162.

Burgoon, J. K., & Buller, D. B. (1994). Interpersonal Deception: V. Accuracy In Deception Detection. Communication Monographs, 61(4), 303.

R. Chandramouli and K. Subbalakshmi. 2012. Text Analytics: Deception Detection and Gender Identification from Text.

W.H. Delone, E.R. McLean, Information systems success: the quest for the dependent variable, Information Systems Research 3 (1), 1992, pp. 60–95.

B. J. Fogg and H. Tseng. 1999. The elements of computer credibility. Paper read at SIGCHI conference on Human factors in computing systems: the CHI is the limit, at Pittsburgh, Pennsylvania, United States.

D.L. Goodhue, Understanding user evaluations of information systems, Management Science 41 (12), 1995, pp. 1827–1844.

J.T. Hancock, L.E. Curry, S. Goorha, and M. Woodworth. 2008. On lying and being lied to: A linguistic analysis of deception in computer-mediated communication. Discourse Processes, 45(1):1–23.

V. Hauch, I. Blandón-Gitlin, J. Masip and S. Ludwig Sporer. 2012. Linguistic Cues to Deception Assessed by Computer Programs: A Meta-Analysis.

M. Jarke, Y. Vassiliou, Data warehouse quality: a review of the DWQ project, Proceedings of the Conference on Information Quality, Cambridge, MA, 1997, pp. 299–313.

D.F. Larcker and A.A. Zakolyukina. 2010. Detecting Deceptive Discussions in Conference Calls. Stanford University Rock Center for Corporate Governance Working Paper Series No. 83.

J. H. McManus. (2009). Detecting Bull: How to Identify Bias and Junk Journalism in Print, Broadcast and on the Wild Web. Sunnyvale, CA: Unvarnished Press.

R. Mihalcea and C. Strapparava. 2009. The lie detector: Explorations in the automatic recognition of deceptive language. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 309–312. Association for Computational Linguistics.

K. Moffit and J.S. Giboney. 2012.

M. L. Newman, J. W. Pennebaker, D. S. Berry, and J. M. Richards. 2003. Lying words: Predicting deception from linguistic styles. Personality and Social Psychology Bulletin, 29: 665-675.

M. Ott, Y. Choi, C. Cardie, and J. T. Hancock. 2011. Finding deceptive opinion spam by any stretch of the imagination. In the Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 309–319. Portland, Oregon, June 19-24, 2011.

J. W. Pennebaker, M. E. Francis, and R. J. Booth. 2001. Linguistic Inquiry and Word Count. Erlbaum Publishers, Mahwah, NJ.

S. Y. Rieh. 2010. Credibility and Cognitive Authority of Information. In Encyclopedia of Library and Information Sciences, Third Edition, edited by B. M and M. N. Maack. New York: Taylor and Francis Group.

V. L. Rubin. (2010). On Deception and Deception Detection: Content Analysis of Computer-Mediated Stated Beliefs. In the Proceedings of the American Society for Information Science and Technology Annual Meeting, October 22-27.

V. L. Rubin and Y. Chen (Forthcoming). Information Manipulation Classification Theory for LIS and NLP.

V. L. Rubin & Conroy, N. (2012). Discerning truth from deception: Human judgments and automation efforts. First Monday, 17(3).

V. L. Rubin & Conroy, N. (Forthcoming). Deception in Professional Biographic Documentation: Résumés, Application Letters and Job Descriptions.

V. L. Rubin & Vashchilko, T. (2012). Identification of Truth and Deception in Text: Application of Vector Space Model to Rhetorical Structure Theory. The Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics: Computational Approaches to Deception Detection Workshop (EACL 2012), Avignon, France, April 23, 2012.

Sagan, C. (1996). The demon-haunted world: Science as a candle in the dark. New York: Random House.

A. Vrij, S. Mann, S. Kristen, and R. P. Fisher. 2007. Cues to deception and ability to detect lies as a function of police interview styles. Law and Human Behavior, 31(5), 499-518.

Y. Wand, R.Y. Wang, Anchoring data quality dimensions in ontological foundations, Communications of the ACM 39 (11), 1996, pp. 86–95.

R.Y. Wang, D.M. Strong, Beyond accuracy: what data quality means to data consumers, Journal of Management Information Systems 12 (4), 1996, pp. 5–34.

R. Zmud, Concepts, theories and techniques: an empirical investigation of the dimensionality of the concept of information, Decision Sciences 9 (2), 1978, pp. 187–195.