Longitudinal study of content and elements in the scientific web environment

Journal of Information Science, © CILIP 2005

José Luis Ortega
Internet Lab, Centro de Información y Documentación Científica (CSIC)

Isidro Aguillo
Internet Lab, Centro de Información y Documentación Científica (CSIC)

José Antonio Prieto
Internet Lab, Centro de Información y Documentación Científica (CSIC)

Correspondence to: José Luis Ortega Priego, Internet Lab, Centro de Información y Documentación Científica (CSIC), Joaquín Costa, 22, 28002 Madrid (Spain). E-mail: jortega@cindoc.csic.es

Abstract. This work is a longitudinal study of the state and evolution of 738 web sites at two points in time (1997 and 2004). It seeks to establish the rates of growth and decay of the Web and of its elements. To this end, the structure and contents of these web sites were extracted with a crawler and compared at the two moments in time. The main results confirm a growth of contents and elements on the Web, although there is also a high degree of content decay. The results suggest that, over the seven-year period covered by this study, the Web was characterized by both strong dynamism and instability.

Keywords: Webometrics; Web persistence; Web growth; Web decay; Linkrot

1. Introduction

Since the beginning of the World Wide Web, different growth behaviour patterns have been studied. Pennock et al. [1] found that the number of incoming links of a web site grows with time following a power law. According to the Internet Systems Consortium [2], web domains have been growing at a similar rate since 1994. However, the OCLC Web Characterization Project [3], carried out between 1998 and 2000, warns that although the WWW keeps growing, content contribution rates slowed by 1% in the 2001-2002 period.
Nevertheless, there is a gap in the literature on web decay, that is, the disappearance of pages from the World Wide Web. Harter and Kim [4] were the first to study the ephemeral nature of the Web, finding that a third of the electronic citations in e-journals were not available. Lawrence et al. [5] also studied the problems of electronic citations, with similar results. Koehler [6,7,8], one of the most prolific authors in this field, monitored 360 pages and 343 web sites over several years, finding that in 2001 the operative pages had reduced 34.4% and in 2003, 33.8%. Nelson and Allen [9] tested the contents of different e-libraries for one year, finding only 3% of unavailable objects (linkrot). However, they warn that these media are more stable than the rest of the World Wide Web and that their results have to be interpreted with care. Fetterly et al. [10], continuing the work of Cho and García-Molina [11], studied the evolution and persistence of 150 million pages for 11 weeks and found that larger pages change more often and more extensively than smaller ones. Bar-Ilan and Peritz [12] queried “informetrics” in the most important search engines over 5 years, with the intention of studying the evolution of that discipline on the Web, and found a disappearance rate of 40%. Wouters, Hellsten and Leydesdorff [13] studied the time span features of Google and Altavista and detected great variability, while Ortega et al. [14] found that the query results of Google decayed in a way comparable to radioactive isotope decay.

2. Objectives

The aim of this paper is to study the state and evolution of 738 web sites at two different moments in time, 1997 and 2004. It intends to establish the increase and decrease of several web objects, to detect the different growth patterns in the web sites studied and to describe the persistence of these objects over time.
It also tries to analyse the relationship between several web elements in order to determine their behaviour at these two moments in time. These web sites were crawled in 1997 and 2004, and the results were compared in order to analyse their evolution.

3. Methodology

In 1997, the web sites were analysed by NetCarta.com [15]. This web site gathered the 1000 highest-quality web sites in terms of importance and contents. For this reason, most of these web sites are directories, e-libraries and information resources for scientists. These web sites were analysed with NetCarta's WebMapper 2.0 software, and 921 of them were downloaded for this study.

In 2004, in order to compare against the results obtained in 1997, these web sites were analysed again with Microsoft Site Analyst. This software was used because WebMapper was acquired by Microsoft and merged with Site Analyst, so Microsoft Site Analyst was the only software that could open the reports generated in 1997. For this reason, this study is limited to the features of this software and to the arrangement of elements supplied by this commercial crawler. The software works at different levels and defines a web site according to the URL entered. Thus, a web site can be an institutional domain, a directory or a single page, and the software then extracts information only from these units. Table 1 shows the elements that Site Analyst generates in the crawl process and that are analysed in this study [16].

Element         Description
Images          GIF, JPEG, and other types of images.
Gateways        Representations of CGI scripts.
Internet        Links to FTP, Telnet, Mailto, WAIS, NNTP, Gopher, and all other Internet services (except HTTP).
Applications    Java applets, executable files, PDF files, Microsoft Word documents, PostScript files, and other applications.
Audio           WAV, AIFF, AU, and other audio files.
Video           MPEG and other video file types.
Text            TXT files and other text files (other than HTML pages), including plain text.
Pages           Number of pages in the web site.
Internal Links  Links from the web site that point to its own pages.
Outlinks        Links from the web site that point to pages in other web sites.

Table 1. Elements generated by Site Analyst and their description.

A first inspection showed that just under half of these web sites, 427 (46.3%), had changed their address; 183 (19.8%) had disappeared or had failed during conversion to Microsoft Site Analyst, since comparing both crawls required reopening the WebMapper files in Site Analyst; and only 311 (33.7%) had remained unchanged. After discarding the disappeared and faulty web sites, 738 web sites were analysed. The 921 resources obtained in 1997 and the 738 analysed in 2004 are listed at http://internetlab.cindoc.csic.es/cv/11/listado.htm.

Next, the data for each web site were extracted from the final reports of Microsoft Site Analyst with a small program written in VBS and recorded in a Microsoft Access database. Finally, they were analysed in a Microsoft Excel spreadsheet.

4. Research field

The web sites analysed are significant research web sites that were in operation from 1997 until 2004. They hold a large volume of information and act as information resources for the scientific community. Table 2 shows the distribution of these web sites by institutional domain.
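
Each percentage in Table 2 is a sector's count divided by the 738 analysed web sites. The following is a minimal sketch of that arithmetic in Python, using the counts printed in Table 2. The names SECTOR_COUNTS and shares are illustrative and are not part of the study's own tooling, which relied on VBS, Access and Excel.

```python
# Sector counts as printed in Table 2 (738 analysed web sites in total).
# The names below are illustrative; they do not come from the study.
SECTOR_COUNTS = {
    "University": 420,
    "Government": 137,
    "Organisations": 107,
    "Commercial": 74,
}


def shares(counts):
    """Return each category's share of the total, as a percentage."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    # Print one row per sector: name, count, percentage with two decimals.
    for sector, pct in shares(SECTOR_COUNTS).items():
        print(f"{sector:<14}{SECTOR_COUNTS[sector]:>5}{pct:>8.2f}%")
```

The same function applies unchanged to the country counts in Table 3, since both tables share the total of 738 web sites.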
More than half of the web sites belong to the academic and scholarly domain (56.91%), followed by a considerable government presence (18.56%). The commercial sector, in contrast, represents only 10.03%. The commercial sector was thus hardly present in 1997, when the Web was used almost exclusively by academics, and the non-profit sector dominated the Web.

Sectors         Web sites   Percentage
University      420         56.91%
Government      137         18.56%
Organisations   107         14.50%
Commercial      74          10.03%
TOTAL           738         100.00%

Table 2. Web sites by institutional sector.

Table 3 presents the web sites by country, assigned first from the TLD of each site and then through a heuristic exploration. The web sites of the United States make up more than half of the sites studied (52.85%), followed at a distance by the United Kingdom (7.45%) and Canada (6.23%). By contrast, French (1.22%) and Japanese (0.95%) web sites have only a minor presence. It is understandable that the United States dominates the network; nevertheless, it is surprising that other countries that carry considerable weight in science, such as France and Japan, were poorly represented, which could suggest that the Web was still expanding.

Countries         Web sites   Percentage
USA               390         52.85%
UK                55          7.45%
CA                46          6.23%
DE                32          4.34%
IT                25          3.39%
AU                22          2.98%
FI                10          1.36%
FR                9           1.22%
NL                9           1.22%
JP                7           0.95%
Other countries   133         18.02%
TOTAL             738         100.00%

Table 3. Web sites by country TLD.

5. Results

The results of the crawl carried out in 2004 are discussed next and compared with the initial data from 1997.
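
For each element in Table 1, comparing the two crawls comes down to the change between its 1997 count and its 2004 count. The sketch below shows one way to tabulate that change in Python, assuming the per-element totals are available as dictionaries. The function name compare_crawls and the numbers in the example are placeholders for illustration only; they are not data or results from the study.

```python
def compare_crawls(counts_1997, counts_2004):
    """For each element, return (1997 count, 2004 count, change, % change)."""
    result = {}
    for element in sorted(set(counts_1997) | set(counts_2004)):
        before = counts_1997.get(element, 0)
        after = counts_2004.get(element, 0)
        change = after - before
        # Percentage change is undefined when the 1997 count is zero.
        pct = 100.0 * change / before if before else None
        result[element] = (before, after, change, pct)
    return result


# Placeholder totals for illustration only; NOT data from the study.
example_1997 = {"Pages": 100, "Images": 40, "Audio": 0}
example_2004 = {"Pages": 150, "Images": 30, "Audio": 5}

if __name__ == "__main__":
    for element, row in compare_crawls(example_1997, example_2004).items():
        print(element, row)
```

A positive change indicates growth of an element and a negative change indicates decay. Elements absent in 1997 are reported without a percentage rather than with an arbitrary value.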