On the visibility of information on the Web: an exploratory experimental approach

Paul Wouters, Colin Reddy and Isidro Aguillo

Research Evaluation, volume 15, number 2, August 2006, pages 107-115, Beech Tree Publishing.

On the Web, information that is not presented by the search engine in response to a specific query is in fact inaccessible. This problem has been defined as "the invisible Web" or the "deep Web" problem. Specific characteristics of the format of information may make it inaccessible to the crawlers that create the databases of search engines. We explore the dimensions of the invisible Web in the European Research Area. We propose that information visibility is an emergent property of the Web as a complex system. Visibility of information is a highly unstable feature that is determined by a complex interaction between the local structure of the web environment, the search engine, the websites on which the information resides, the format of the information, and the temporal dimensions of the search.

Paul Wouters is at the Virtual Knowledge Studio for the Humanities and Social Sciences, Royal Netherlands Academy of Arts and Sciences, Amsterdam, Netherlands; email: paul.wouters@vks.knaw.nl. Colin Reddy is at the Centre for Society and Genomics, Radboud University of Nijmegen, Nijmegen, Netherlands. Isidro Aguillo is at the Centre for Scientific Information and Documentation (CINDOC), Spanish National Research Council, Madrid, Spain.

Having access to scientific information is a key concern for the research community and those involved in the evaluation of research (OECD, 1998; Wouters and Schröder, 2000; Schroeder, 2001). Since the emergence of the World Wide Web as the main interface to the information available on the Internet, search engines have become the "obligatory point of passage" (Latour, 1988) for information searches. Information on the Internet can be found only through search engines (either commercial or tailor-made) or research crawlers; browsing the Internet is not feasible (Lawrence, 2001). Most users do not have the expertise or the motivation to develop tailor-made research crawlers and rely on search engines. This means that information not presented by the search engine in response to a specific query is, for the majority of users, in fact inaccessible. This is the case even if the website on which the information resides is accessible (Lawrence and Giles, 1999a).

This problem has been defined as "the invisible Web" problem (Sherman and Price, 2001) or the "deep Web" problem (Pedley, 2001). The core idea of this literature is that specific characteristics of information formats may make this information inaccessible to the crawlers that create the databases of the search engines. In the early days of the debate about the invisible Web, search engines could not index certain file types (eg .pdf files). Also, information residing on websites requiring logins and other forms of non-automated interaction does not show up in search engine results and is hence part of the invisible Web.
The invisible Web was therefore seen as a specific part of the Web, which was thought to be four to five times bigger than the visible Web. This approach led to the conclusion that two tasks were urgent:

1. The further development of search engines to enable them to index complex file types such as .pdf and images;
2. The creation of new tools to access the deep or invisible Web.

Work on the first task has changed search engines: they can now index a great variety of file types. The problem of non-automated interaction with databases nevertheless still exists. The invisible Web has therefore far from disappeared. The second task has created a small industry of companies and dedicated websites aiming to uncover specific segments of the invisible Web. Often these websites are basically Yahoo!-like web directories, devoted to a limited number of areas or fields of interest (Lin and Chen, 2002).

Although this line of research has produced important tools and insights, its definition of the invisible Web ignores additional factors that may make information invisible in a specific search. Many of these factors have been recognised in the literature, but they have, as far as we know, not been included in a theory-based redefinition of visibility of information on the Web. In the next section, we review how the invisible Web has been treated in the literature, including the different concepts of the visible, deep, or opaque Web that have been proposed. The third section discusses the limitations and problems of this approach to the invisible Web and proposes a new definition. The fourth section then presents the methods and results of an exploratory experiment on information visibility in the European Research Area (ERA) based on a recall measure, well known in the field of information retrieval (Salton, 1986). Related to this, we conceptualise the Web as a complex system with emergent properties that are not reducible to the properties of the different components of the Web. Lastly, we draw conclusions about the nature of information visibility on the Web.

The invisible Web in the literature

Sherman and Price (2001) define the invisible Web as:

Text pages, files or other often high-quality authoritative information available via the World Wide Web that general-purpose search engines cannot, due to technical limitations, or will not, due to deliberate choice, add to their indices of Web pages. Sometimes also referred to as the "Deep Web" or "Dark Matter".

Thus, the construction of visibility is determined by the search engines. "What may be invisible today may become visible tomorrow, should the engines decide to add the capability to index things that they cannot or will not currently index" (Sherman and Price, 2001). Invisibility is a dynamic attribute, liable to change at the impetus of the search engines. Sherman and Price propose four types of invisibility:

1. The Opaque Web, consisting of "files that can be, but are not, included in search engine indices" (Sherman and Price, 2001). It is argued that "simply because one, fifty or five thousand pages from a site are crawled and made searchable, there is no guarantee that every page from a site will be crawled and indexed". The reasons behind this are in part due to search engines limiting the depth of crawl, on predominantly economic grounds.
The frequency with which crawls are undertaken also plays a part. It is suggested that if a new website has been created between crawls, it will remain 'invisible' until found on the next search engine update crawl. Another factor here is the limit search engines place on the number of results they display for an enquiry: "typically between 200 and 1000 documents" (Sherman and Price, 2001), even though search engines will report many times more documents as "being found". Obviously, those not part of the actual pages displayed will remain invisible, despite being part of the search engine's indexes. The final factor involved is that of "disconnected URLs", because search engines cannot crawl pages that are not linked to by other pages.

2. The Private Web refers to pages that, while being technically indexable, have been deliberately excluded from search engines. This may be due to pages being password-protected, use of a "robots.txt file to disallow a spider from accessing the page" or "use of the 'noindex' meta tag to prevent the spider from reading past the head portion of the page and indexing the body" (Sherman and Price, 2001); both of the latter mechanisms are illustrated in the sketch that follows Table 1.

3. The Proprietary Web is part of the Invisible Web because of the conditions that are put on accessibility of the information. "Search engines cannot for the most part access pages on the Proprietary Web, because they are only accessible to people who have agreed to special terms in exchange for viewing the content" (Sherman and Price, 2001). Users may have to register (sometimes by paying a fee) in order to gain access to the pages. Sherman and Price draw a distinction between these pages and database services such as LexisNexis. Because they use legacy database systems "that existed long before the Web came into being", they are not considered to be web or Internet providers, although they do offer web access to information.

4. The Truly Invisible Web consists, according to Sherman and Price, of those web pages that for technical reasons cannot be indexed or spidered by search engines. "A definition of what constitutes a truly invisible resource must necessarily be somewhat fluid, since the search engines are constantly improving and adapting their methods to embrace new types of content" (Sherman and Price, 2001). There are different information types included in this category: first, file types that cannot be handled by the crawling technology of the day. Dynamically generated information may also be part of the "truly invisible Web", as well as information contained in databases.

The four types of invisibility conceptualised in this way affect information differently. This is summarised in Table 1 (Sherman and Price, 2001).

Table 1. Types of invisible web content

Disconnected page: No links for crawlers to find the page
Page consisting primarily of images, audio, or video: Insufficient text for the search engine to "understand" what the page is about
Pages consisting primarily of PDF or Postscript, Flash, Shockwave, executables (programs) or compressed files (.zip, .tar, etc): Technically indexable, but usually ignored, primarily for business or policy reasons
Content in relational databases: Crawlers can't fill out required fields in interactive forms
Real-time content: Ephemeral data; huge quantities; rapidly changing information
Dynamically generated content: Customised content is irrelevant for most searchers; fear of "spider traps"

Source: Sherman and Price, 2001, Table 4.2

As can be seen from the table, an important focus in Sherman and Price's conceptualisation is the format of the information: the file type and format. A second important element is the way the information is generated or made accessible.
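The publisher-side exclusions that define the Private Web can be made concrete with a small sketch. The code below (a Python illustration of ours, not part of Sherman and Price's account; the URL in the usage note is a placeholder and the helper names are invented) checks whether a page is barred from indexing either by a robots.txt disallow rule or by a 'noindex' meta tag in its head.

# Sketch: the two publisher-side mechanisms behind the "Private Web".
# A compliant spider consults robots.txt before fetching, and honours a
# <meta name="robots" content="noindex"> tag after fetching.
from html.parser import HTMLParser
from urllib import robotparser
from urllib.parse import urljoin
from urllib.request import urlopen


class NoindexDetector(HTMLParser):
    """Flags pages carrying <meta name="robots" content="noindex">."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = (attrs.get("name") or "").lower()
            content = (attrs.get("content") or "").lower()
            if name == "robots" and "noindex" in content:
                self.noindex = True


def may_index(page_url: str, user_agent: str = "*") -> bool:
    """True only if neither robots.txt nor a noindex tag excludes the page."""
    rp = robotparser.RobotFileParser(urljoin(page_url, "/robots.txt"))
    rp.read()
    if not rp.can_fetch(user_agent, page_url):
        return False                     # excluded before the page is even fetched
    detector = NoindexDetector()
    detector.feed(urlopen(page_url, timeout=10).read().decode("utf-8", "replace"))
    return not detector.noindex          # fetched, but the publisher forbids indexing


# Usage (placeholder URL): pages failing either test join the Private Web.
# print(may_index("https://example.org/members/report.html"))

Pages rejected by either test are perfectly reachable for a human visitor who knows the address, yet they never enter the index of a compliant spider, which is precisely the sense of "deliberate exclusion" used above.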
The notion of the invisible Web is actually a metaphor that tries to account for the phenomenon that information residing on websites may not be included in the results of searches for that information. Other metaphors to deal with this phenomenon have been proposed in the literature, although they do not seem to have caught on as much as the metaphor of invisibility.

Pedley has proposed the notion of the "Deep Web" because "whilst the information is somewhat hidden, it is still available if a different technology is employed to access it" (Pedley, 2001). Bergman (2001), from Brightplanet, a software company, has also used the metaphor of the Deep Web. Brightplanet defines it as:

content that resides in searchable databases, the results from which can only be discovered by a direct query. Without the directed query, the database does not publish the result. When queried, deep Web sites post their results as dynamic Web pages in real-time. Though these dynamic pages have a unique URL address that allows them to be retrieved again later, they are not persistent. (Bergman, 2001)

The rationale behind the choice for "deep Web" rather than "invisible Web" is that the latter is seen as "inaccurate":

We avoid the term "invisible Web" because it is inaccurate. The only thing "invisible" about searchable databases is that they are not indexable or queryable by conventional search engines. The real problem is not the "visibility" or "invisibility" of the Web, but the spidering technologies used by conventional search engines to collect their content. For these reasons, we have chosen to call information in searchable databases the Deep-Web. Yes, it is somewhat hidden, but clearly available if different technology such as ours is used to access it. (Brightplanet; see Bergman, 2001)

Another metaphor, offered by Bailey et al, is that of "Dark Matter". This is defined as "information on the Web that is not or cannot be discovered by an individual or a search engine" (Bailey et al, 1999). In a similar fashion to Sherman and Price, a "classification taxonomy" is proposed as to "why material is not discoverable". Bailey et al (1999) define four different categories:

1. Rejected Dark Matter refers to pages that are "rejected". The rejection can be by a human, for example, "due to a lack of interest in the contents indicated by a link", or mechanical, arising from search engines' crawl policies.

2. Restricted Dark Matter refers to material that is "publicly linked to", but requires permission to be accessed, for example, sites requiring registration and password protection.

3. Undiscovered Dark Matter refers to situations where "the necessary links which locate the material are never found", or where "material is not publicly linked to at all". The latter instance is termed "private dark matter".

4. Removed Dark Matter refers to material that was once available, but is no longer. This may be due to material being replaced by more up-to-date information ("ephemeral dark matter"), or to situations where the url is simply no longer available ("dead dark matter").

We would like to note that the empirical material informing this classification is too old to still be useful. Nevertheless, the concept of Dark Matter is interesting. The authors define this by referring to the perspective of the information searcher:
Dark Matter for a person or Web crawler consists of pages that they cannot reach and view, but which another observer can. Dark Matter is important to our understanding of the Web in that the portion of the Web any of us can see depends on our viewpoint. Different observers see different overlapping sections of the Web. However, no one can see all of the Web, even if they want to. (Bailey et al, 1999)

Second, Bailey et al provide an operationalisation of Dark Matter in network analysis terms, as reachability. This makes it possible mathematically to define "shades of darkness" in terms of sets of nodes with a certain amount of reachability from a specific observer. Reachability is defined as the aggregate result of different factors: the set of nodes that are reachable in one step from the location of the observer; the set of access permissions that the observer has; the link extraction policy of the observer; the page-loading policy of the observer; and the capabilities to generate links from visited pages. Thus, reachability is a relationship between a set of web pages and a specific observer. The degree of darkness (or lightness) of a set of web pages is then defined relative to an observer. The concept of Dark Matter also informs the way Bailey et al define the "publicly indexable Web". This is the set of pages that are light for many observers.
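This reachability-based reading of darkness can be made concrete with a small toy model. In the sketch below (a Python illustration of ours, not part of Bailey et al's study; the web graph, page names and observer profiles are invented), the "light" set for an observer is computed from an entry page, an access-permission rule, and a follow-every-extracted-link policy; everything outside that set is dark for that observer.

# Toy model of reachability as "lightness": which pages a given observer
# can reach from an entry point, given that observer's access permissions.
from collections import deque

# A small web graph: page -> pages it links to (illustrative only).
WEB = {
    "home": ["about", "members-only", "papers"],
    "about": ["home"],
    "members-only": ["internal-report"],
    "papers": ["paper-1", "paper-2"],
    "paper-1": [],
    "paper-2": [],
    "internal-report": [],
    "orphan-page": [],        # never linked to: undiscovered dark matter
}


def light_set(start, can_access):
    """Pages reachable from `start` for an observer who follows every
    extracted link but may only load pages that `can_access` permits."""
    reachable, queue = set(), deque([start])
    while queue:
        page = queue.popleft()
        if page in reachable or not can_access(page):
            continue          # pages the observer may not load stay dark
        reachable.add(page)
        queue.extend(WEB.get(page, []))
    return reachable


def anonymous(page):
    """An unregistered visitor cannot open the password-protected pages."""
    return page not in {"members-only", "internal-report"}


def member(page):
    """A registered member may load every page that is linked to."""
    return True


print("dark for the anonymous visitor:", set(WEB) - light_set("home", anonymous))
# -> {'members-only', 'internal-report', 'orphan-page'}
print("dark even for the member:", set(WEB) - light_set("home", member))
# -> {'orphan-page'}

Shades of darkness then correspond to how the light set shrinks as the observer's entry point, permissions or link-following capabilities are restricted.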
There have been various attempts at quantifying the size of the invisible Web. For example, Lawrence and Giles (1999a) state: "Search engines do not index sites equally, may not index new pages for months, and no engine indexes more than about 16% of the web". Brightplanet's research indicated that:

Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web. The deep Web contains 7,500 terabytes of information compared to 19 terabytes of information in the surface Web. More than 20,000 deep Web sites exist. (Bergman, 2001)

Sherman and Price, for their part, estimated that the Invisible Web would be far larger than the Visible Web. As for Dark Matter:

experiments suggest that as little as 12.25% of all existing information on a server may be reachable to the majority of search engines. The remaining 87.75% of pages may thus be dark to many users. As much as 37% of all existing information on a server consists of loadable Web pages which remain as undiscovered dark matter. (Bailey et al, 1999)

Defining invisibility as emergent property

This short overview makes clear how important the operation of search engines is for the nature and scope of the invisible Web. It has also indicated that it is not simply a matter of interaction between a piece of information and a particular search engine; many other factors intervene. It therefore makes sense to develop a slightly more complex definition of invisibility of information.

A first step is to define the way search engine crawlers and indexers may influence the visibility of information. The following limitations and constraints have been reported (several of them are illustrated in the crawler sketch that follows this discussion):

• The depth of crawling may be limited (defined as the distance, in terms of links, from the site of entry into a specific subdomain that will be followed by the spider).
• The time spent on a particular website may be limited.
• The spider may be instructed to ignore links to specific file or MIME types.
• Larger pages are not indexed completely if the amount of content exceeds a specific limit (101 kB in Google).
• The spider may be instructed to ignore links with specific file extensions.
• The spider may or may not be instructed to deduce new links from existing links (either by 'going up' in the tree structure to the base home page, or by deducing new links by analogy).
• The spider may follow different policies with respect to the ethics of Web-crawling.
• The spider may follow different strategies in the order of following links.
• The spider may be instructed to ignore links to mirror sites.
• The spider may be instructed to ignore duplicate or near-duplicate files.
• The spider may be instructed to ignore 'old' information.
• The spider may be instructed to skip urls that refer to scripts (such as cgi-bin) or that anticipate user requests, such as urls containing ?-marks (also to prevent spider traps).
• The spider may be instructed not to query databases.
• The spider may be instructed to evade spider traps, for example by ignoring links if their urls are repetitions of links retrieved from the same page (with or without threshold values).

In addition, the frequency of the crawl in relation to the different updating frequencies of websites will have a large impact on the results. Another major difference between search engines is the file types that they index. As noted, this has changed considerably since the first-generation search engines, and the most frequently used file types can now be indexed by the most popular search engines, such as Google. Search engines do not, of course, present all hits in their databases but make a selection on the basis of a measure of relevance. This means that the way the database is indexed and sorted, and the way hits are sorted and presented, are additional factors influencing the visibility of information. Since most search engines are commercial operations, and their business model depends on advertisements related to queries and search results, these operations are not neutral. Paid placement and paid inclusion clearly influence information visibility (Introna and Nissenbaum, 2000).
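To make a few of these constraints tangible, the following minimal sketch (Python, standard library only; the limits, the excluded extensions and the breadth-first strategy are illustrative assumptions of ours, not the settings of any actual search engine) shows how a depth cap, skipped file extensions, avoidance of ?-urls and cgi-bin scripts, and a page-size cap shape what ends up in an index.

# Minimal breadth-first crawler sketch: links and pages filtered out by the
# policy below never enter the index and are therefore invisible to its users.
from collections import deque
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


@dataclass
class CrawlPolicy:
    max_depth: int = 3                      # crawl depth from the entry page
    max_bytes: int = 101_000                # index only the first ~101 kB of a page
    skip_extensions: tuple = (".zip", ".tar", ".exe", ".swf")
    skip_query_urls: bool = True            # avoid ?-urls and scripts (spider traps)


class LinkExtractor(HTMLParser):
    """Collects href targets of anchor tags; a real indexer would do far more."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def follow(url, policy):
    """Links rejected here never reach the queue, let alone the index."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if policy.skip_query_urls and (parsed.query or "cgi-bin" in parsed.path):
        return False
    return not parsed.path.lower().endswith(policy.skip_extensions)


def crawl(seed, policy):
    """Return the set of URLs this crawler would index: its 'visible Web'."""
    seen, queue, indexed = {seed}, deque([(seed, 0)]), set()
    while queue:
        url, depth = queue.popleft()
        try:
            page = urlopen(url, timeout=10).read(policy.max_bytes)
        except OSError:
            continue                        # unreachable pages stay invisible
        indexed.add(url)
        if depth >= policy.max_depth:
            continue                        # anything deeper stays invisible
        extractor = LinkExtractor()
        extractor.feed(page.decode("utf-8", "replace"))
        for href in extractor.links:
            link = urljoin(url, href)
            if link not in seen and follow(link, policy):
                seen.add(link)
                queue.append((link, depth + 1))
    return indexed

A robots.txt check of the kind sketched earlier would plug into the same filter, as would de-duplication, recrawl scheduling and the other policies listed above. The point is that every page such a crawler never fetches simply never enters the index, however accessible the page itself may be to someone who knows its address.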
For the sake of continuity of discussion, we propose to keep using the concept of "visibility", but relate this, following Bailey et al, to the perspective of the observer. In general one can say that a naive realist perspective on the Web is confounded by the problem that it is utterly unknown "what the Web really is", for both theoretical and practical reasons. What the Web looks like is dependent on the network position, the technologies used to observe it, and the conceptual perspective of the observer. For this reason, one can state that the reality of the Web is actually constructed (making it nonetheless very real) in the interaction between an observer, the search engine, and the raw materials with which the search engine's database has been created (that is, the crawling and indexing practices applied to the unknown graph structure of the Web). This notion is an extended reformulation of the notion of visibility as a relational phenomenon mentioned above.

In other words, information can be called invisible in a certain search context (of a specific search technology) if:

• that information is not part of the results of the search; and
• that information does meet the criteria of relevance as formulated in the search; and
• that information would in principle be retrievable if an observer knew its exact location on the Web.

This means that visibility of information (sometimes resulting in invisibility, sometimes in overexposure) is a highly unstable feature that is determined by a complex interaction between the local structure of the web environment, the search engine, the website on which the information resides, the format of the information, and the temporal dimensions of the search. Interestingly, this leads us back from a "new problem", as the invisible Web has been claimed to be, to a well-known problematic: the balance between precision and recall in the field of information retrieval (Salton, 1986). We are in this discussion mainly interested in recall, defined as the intersection between the set of existing web pages and the set of retrieved web pages.
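This relational definition can be written compactly. In the notation below, which is our own shorthand, E(q) denotes the set of existing web pages that meet the relevance criteria of a query q, and R_{s,t}(q) the set of pages returned for q by search engine s at time t; recall, in its usual information-retrieval form (Salton, 1986), is then the retrieved share of E(q):

\[
\mathrm{recall}_{s,t}(q) = \frac{\lvert E(q) \cap R_{s,t}(q) \rvert}{\lvert E(q) \rvert}
\]

and a page p is invisible in the context of q precisely when

\[
p \in E(q), \qquad p \notin R_{s,t}(q), \qquad p \text{ is retrievable at its known URL}.
\]

Measuring recall against a reference set of pages that are known to exist, as the experiment below does, therefore gives a direct handle on invisibility.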
Taking this definition of invisibility of information, we can determine which factors shape visibility, on the basis of our literature review and the overview of key search engine characteristics. Since we do not know the exact characteristics of most publicly available search engines, we treat these search engines as black boxes and analyse their outcomes with respect to a specific search query. The following characteristics of information on the Web are candidate factors influencing information visibility:

• the number of inlinks to the page containing the information;
• the depth at which the information is located within a domain;
• the file extension and the MIME type of the file containing the information;
• the metatags with which the web page is marked;
• the updating frequency of the website or page;
• the accessibility of the information itself;
• the format of the url at which the information is located; and lastly
• the total of these 'visibility characteristics' of the inlinking pages.

The last item points to the recursive and reflexive nature of the problem of invisibility. The role of these 'visibility characteristics' of information may be made a bit clearer by giving some examples. Information contained in a database that must be queried, but that will be omitted by most search engine crawlers, would become part of the invisible Web as defined above, although the information would, of course, be perfectly visible if the database were queried. The same holds for information on web pages with a question mark, cgi-bin or other script sign in the url. The depth level may be a particularly significant factor for research-related information, because scientific websites can be quite complex and may be located at a deep level of a particular university subdomain. Last but not least, the file extension and the file type can lead to the exclusion of information from the search-engine index database.

Methods

In order to measure information invisibility as defined above, we created a reference data set on the basis of the Science Citation Index. Because of the possible variability of the Web in the countries constituting the ERA, we wished to ensure that a wide range of institutes could be selected. We therefore decided to examine websites of institutions in a specific scientific discipline that was of importance to Europe and also well represented on the Web. In addition, we considered it important that, given the