09 July 2014

A Tool to Discover Web Content For Archives and Libraries (Version 1.0)

Abstract

A rationale for identifying websites that libraries or archives may want to harvest for their collections.  Also includes Version 1 of code to automate the process.  The code can be run on consumer-grade PCs using software that’s freely available on the web, and is relatively easy for those with limited technical knowledge to install and use.

§ § §

Updated 10 July 2014 to include explanatory comments in the code section.  The Zip file available for download included those comments.

§ § §

The web is a rich source of materials for libraries and archives.  One of the biggest challenges is finding content that falls within the scope of their collecting policies.  Using a search engine is ineffective and unsystematic.  Even a substantial investment of time looking at individual documents may return only a small fraction of what’s available, the results depend on the vagaries of the search engines’ indexing algorithms, and they often include documents already considered.

In the early 2000s, the “Arizona Model for Preservation and Access of Web Documents” described an approach based on my experience at the Arizona State Library, Archives and Public Records.[1]  The Arizona Model is based on two assumptions. 

First, rather than using a bibliographic model that focuses on individual documents, treat websites as archival collections and acquire each site as an aggregate.  A smaller site might be acquired as a whole; time spent weeding out-of-scope materials is not justified by the minuscule gain in storage space.  A larger site might be acquired by analyzing the site’s structure as represented by the subdirectories in the URLs; as is typical in archival practice, appraisal would be done at the series (subdirectory) level. 

Before content could be acquired, a library or archives would have to identify relevant websites.[2]  The Arizona Model’s second assumption is simple: a website that is in scope will likely link to several other websites that are in scope. The site will also include links to sites that are out of scope, but the ratio of in-scope to out-of-scope sites will likely be high. 

A large website may have several thousand URLs.  However, the number of distinct websites referenced is likely much lower.  The trick is to extract a simple list of all websites referenced on a website.  Manually inspecting the pages would be impractical, not to mention tedious and error prone.  Fortunately, that work can be automated.  The process can be implemented and run on consumer-grade computers by individuals with minimal technical skills, using software that is available at no cost.  The code and instructions for use are presented here.

The approach, when first tested in Arizona in the mid 2000s, generated about 10,000 URLs from four websites, an impractical number to review.  However, that list was distilled to about 700 domains, a much more reasonable number.  A librarian or archivist familiar with the subject area could review that list and quickly determine whether many of the websites were in or out of scope.  For example, virtually all Arizona websites included a link to Adobe.com (clearly out of scope), pointing to the Acrobat Reader necessary to read the documents.  By contrast, the list contained many sites reviewers immediately recognized as in scope (az.gov, azcommerce.gov) because they used those sites on a regular basis.  Roughly half were immediately recognized as in or out of scope.  The remainder was a reasonable number to check manually; at five or ten minutes per site, the work would take about forty hours.

While the second assumption is not logically valid (there’s no way to be sure any given site will be listed on at least one other site), it’s adequate.  The Illinois State Library was confident that its list of state websites was complete, but a subsequent analysis based on this approach identified many more.[3]  At the same time, the tool is imperfect because it focuses on domains, and it misses websites hosted with other content under a shared domain, such as a small agency using an ISP without registering its own domain.  (For example, www.comcast.net/~azbarbers returns www.comcast.net, which is out of scope.)
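To illustrate the limitation (and one possible refinement that is not part of the Version 1.0 code), the extraction step could keep a tilde-style user directory so that a site on a shared host remains distinguishable from the host itself.  The snippet below is purely hypothetical and untested against real crawls.

<?php
// Hypothetical refinement, not part of domainid.php: keep the first path
// segment when it begins with a tilde, so a user directory on a shared host
// (e.g., www.comcast.net/~azbarbers) is not collapsed into the host alone.
function extract_domain($url)
{
    $url = preg_replace('#^[a-z]+://#i', '', trim($url));  // drop the protocol
    $parts = explode('/', $url);                           // split on slashes
    $domain = $parts[0];
    if (isset($parts[1]) && substr($parts[1], 0, 1) == '~')
    {
        $domain .= '/' . $parts[1];                        // keep the ~user directory
    }
    return $domain;
}

echo extract_domain('http://www.comcast.net/~azbarbers/') . "\n";
// prints: www.comcast.net/~azbarbers
?>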

As part of the ECHO DEPository Project at the University of Illinois, Urbana-Champaign, the Arizona State Library worked with OCLC to build a suite of tools to help automate identifying domains, appraising subdirectories for acquisition, and describing the collections and series.[4]  Those tools are no longer supported or readily available.

One of those tools, used to discover websites, was particularly useful and directly addresses libraries’ and archives’ need to identify relevant content.  Moreover, recreating it was relatively trivial, and the code and instructions for its use are presented here.  Version 1.0 does a single, simple task: it generates a list of websites referenced on a given website.  Although limited, it is easy for someone with modest technical skills to install and use.

Version 2.0, under development, will store the discovered websites in a simple database.  Rather than reviewing individual reports, which necessarily means redundant review of sites already discovered, users will be able to see a report of all domains and a report of all domains discovered starting with a given date (typically, the last time the list was reviewed).  The installation of the database is relatively simple, and reports will be run from the command line.
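To make that concrete, here is a minimal, purely illustrative sketch of what the Version 2.0 storage and date-based report might look like.  It assumes PHP’s PDO SQLite driver (pdo_sqlite) and invents the database, table, and column names; it is not the actual Version 2.0 design.

<?php
// Illustrative sketch only: store discovered domains with the date first seen,
// then report the domains discovered since a given date.  The database file,
// table, and column names are assumptions, not the real Version 2.0 design.
$db = new PDO('sqlite:domains.sqlite');
$db->exec("CREATE TABLE IF NOT EXISTS domains (
               domain     TEXT PRIMARY KEY,
               discovered TEXT)");                 // date first seen, e.g., 2014-07-09

// Record a newly discovered domain; domains already stored are ignored
$insert = $db->prepare("INSERT OR IGNORE INTO domains (domain, discovered)
                        VALUES (:domain, date('now'))");
$insert->execute(array(':domain' => 'az.gov'));

// Report all domains discovered on or after a given date (the last review)
$report = $db->prepare("SELECT domain FROM domains
                        WHERE discovered >= :since ORDER BY domain");
$report->execute(array(':since' => '2014-07-01'));
foreach ($report->fetchAll(PDO::FETCH_COLUMN) as $domain)
{
    echo $domain . "\n";
}
?>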

Version 3.0, for future development, will create a website that will allow users to track whether sites have been reviewed and to record comments on the sites.  Version 3.0 will require webserver software, and the version of IIS that’s distributed as part of Windows will suffice for most users.  Note: even Version 3.0 should run on a mid-level, consumer-grade computer; it will not require a powerful server.

Note that this tool does nothing beyond identifying potentially relevant sites.  It does not harvest the content for preservation.  Services such as Archive-It and tools such as HTTrack or wget can be used to capture content, using the sites identified by the domain tool as seeds.
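For example, a site from the extracted list might be used as a seed for wget with a command along these lines (a sketch only; adjust the options to local collecting policy and the site’s terms of use):

    wget --mirror --page-requisites --convert-links --no-parent http://az.gov/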

Once installed, using the tool should be straightforward, although some familiarity with the command line is useful.  Installing some of the software requires the ability to create directories and extract the contents of a zip file.  For those who are not familiar with the inner workings of a computer, the optional step to modify the system path is likely more intimidating than difficult.

Requisite Software and Configuration

The instructions that follow are deliberately basic, in the hope that they will be easy for individuals with limited technical skills to follow.  Sophisticated users should recognize other options that will work as well or better for their particular needs.

Caveat emptor: I’ve tested this code on a number of machines with no problems.  No doubt, the first time someone tries it, they’ll find a bug.  Please contact me. 

Standard disclaimers apply:  I offer the code without any warranty.

1.  Xenu’s LinkSleuth was designed to help webmasters check for broken links on their sites.  Fortunately for librarians and archivists, the URLs it finds can be exported to a file. 
          Even though LinkSleuth is a bit dated, it works well for the purpose at hand.  The program runs under Windows 7 and 8 (and possibly earlier versions).  It is available at no charge, although Tilman Hausherr, the author of the software, invites users to send a thank you letter, an XL T-shirt, or other inexpensive token as a gratuity.[5] 
          The program is available at http://home.snafu.de/tilman/xenulink.html, and is a simple install.
          If using another link checking program, the code to parse the exported list will likely need to be modified.  For example, LinkChecker files start with three lines to be skipped (as opposed to one), and the data elements are stored as comma separated values (rather than tabs).[6]  Such changes should be relatively trivial for someone with a basic understanding of PHP string functions; a rough sketch follows.
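          The sketch below only suggests the shape of that change.  It is untested, and it assumes the export has three header lines and comma separated fields with the URL in the first field; check an actual LinkChecker file before relying on it.

<?php
// Sketch only: read a LinkChecker CSV export rather than a LinkSleuth
// tab separated file.  Assumes three header lines and comma separated
// fields, with the URL in the first field (an assumption).
$rawURLS = fopen('php://stdin', 'r');

// Skip three header lines instead of one
for ($i = 0; $i < 3; $i++)
{
    fgets($rawURLS);
}

while ($line = fgets($rawURLS))
{
    // Split on commas instead of tabs and keep only the URL field
    $fields = explode(',', $line);
    $URL = trim($fields[0]);
    // ... then strip the protocol and path exactly as domainid.php does
}
fclose($rawURLS);
?>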

2.      PHP is commonly used to develop webpages, but – as in this case – can also be run from the command line.[7]  Windows versions of PHP are available from http://windows.php.net/.  The VC11 Non Thread Safe version is experimental, but worked with Windows 7 and 8.  Using the VC9 Non Thread Safe version has fewer risks.  Be sure to select the appropriate binary for your system (x86 for 32-bit systems, or x64 for 64-bit systems).
          PHP is a bit trickier to install.  First, use Windows Explorer to locate the zip file you downloaded.  Right click and select Extract ... .  When prompted for a location to extract the files, navigate to Computer, click to highlight “Local Disk C:”, and then click “Make a New Folder.” Enter php and click “Extract.”
          Second, and optionally, you can modify the system path variable so that Windows can easily run PHP from the command line.  Instructions for Windows 7 can be found on Renso Hollumer’s blog; the process is fundamentally the same with Windows 8.[8]
          If you choose not to modify the path, you’ll need to specify the full directory where PHP is installed when you run the command.  The only difference between the two examples below is that the second includes the full path to php.exe.
          modified path
                  > php domainid.php < XenuFile.txt > ExtractURLs.txt

          unmodified path
                  > c:\php\php domainid.php < XenuFile.txt > ExtractURLs.txt
         
3.  Optionally, create a directory to hold the domain identification tool, the file containing raw data from LinkSleuth, and the files containing the extracted domains.  These instructions will assume that you create such a directory, called c:\domainid.  If you create a directory in another location, replace this example with the drive and path you’re using instead.

4.  The Domain Identification Tool is a simple PHP script.  Use a plain text editor (WordPad, Notepad, TextPad, not Word) to copy and paste the code in Appendix 1 into a blank document, and save it in a file called domainid.php in the directory you’ve created for this project.   You can also download and save this file from http://arstweb.clayton.edu/domainid/domainid.zip.  You’ll need to extract the file to the working directory.

Instructions for Use

1.      Xenu’s LinkSleuth

LinkSleuth is sophisticated software that is easy to use at a rudimentary level.  However, it can be configured to address problematic websites.  As you get more experience with the process and the diverse configurations of websites, take time to read the documentation and learn about how to use the tool effectively and efficiently.

Launch the program.  From the menu, select File | Open URL ... and enter the URL for the home page of the website you want to crawl.  Very shortly, you’ll begin to see a list of links, which will change color as the program verifies their status. 

Monitor the program’s progress in the lower right corner of the status bar.  If the program runs for more than five minutes – especially the first few times you harvest URLs from a site – use the pause or emergency stop buttons to halt the process, and check the report for possible problems. 

If LinkSleuth finishes quickly with only one or a few URLs, the website may be rejecting requests from unknown spiders with a robots.txt file.[9]   If you run into this scenario, you’ll likely need to find another website that you can use as your starting point. 
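For reference, a robots.txt file that turns away all compliant crawlers can be as short as the two lines below (a generic example, not taken from any particular site):

    User-agent: *
    Disallow: /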

The most common problem is a “spider trap,” a set of pages that generate an endless number of links for the spider to check.  The classic example is a database-driven calendar page with a link to “next month.”  Following the link to the next month, the spider finds a link to the subsequent month; the spider could follow these links until the end of time. 

After running the report the first few times, browse the log for URLs that suggest a spider trap or other content you know you don’t want or that won’t produce useful results.  For example, a website may have thousands of scanned images.  If you spot problems, you can configure LinkSleuth to skip those URLs. As you become more proficient with LinkSleuth, you’ll know when the program is functioning properly and can run for longer periods of time.

When the crawl is finished (or if you’ve stopped it manually), you’ll be prompted to generate a report.  Viewing this report is optional.  After closing it, from the menu click File | Export to tab separated file . . .  Save the file in an appropriate location (c:\domainid).   Using an appropriate name will make it easier to keep track of your work.  For example, include a reference to the site and date; a report for arstweb.clayton.edu might be arstweb_20140708.txt.  For an organization with several websites, you might use the domain itself: support.microsoft.com_20140708.txt or www.microsoft.com_20140708.txt.

2.      domainid.php

Open a command line window by clicking the Start button, typing command in the search box, and selecting Command Prompt (cmd.exe). 

Change to the working directory that contains the domainid.php file and the tab separated file from LinkSleuth
    cd c:\domainid

Run the command as follows, substituting the name of the LinkSleuth report file and a name for the new file that will contain the extracted domains.  Note: if an existing file has the same name as the one you use for ExtractedDomains.txt, it will be overwritten and lost.
    php domainid.php < LinkSleuthFile.txt > ExtractedDomains.txt
or   c:\php\php domainid.php < LinkSleuthFile.txt > ExtractedDomains.txt

A bit of explanation for those unfamiliar with the command line: the first element calls the PHP interpreter, and the second identifies the program to be run. 
          The third element (< LinkSleuthFile.txt) tells the program the file in which the raw data is stored.  The final element (> ExtractedDomains.txt) tells the program where to store the output.  For more information about this syntax, search for stdin and stdout, standard abbreviations for standard input and standard output.
          If the input or output files are in other directories, include a fully qualified path.  For example, if the input file were on the H: drive in a folder called Raw and the output file were on the E: drive in a folder called Parsed, the command would look like this:
          php domainid.php < H:\Raw\LinkSleuthFile.txt > E:\Parsed\ExtractedDomains.txt
If a file name or path includes spaces or special characters, you may have to enclose it in quotes; the easier solution is to avoid those characters and use only alphanumerics, dashes, underscores, and dots (see the example below).
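For example (the folder name here is hypothetical):
    php domainid.php < "c:\My Data\LinkSleuthFile.txt" > ExtractedDomains.txt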

[This is my first effort to post code. Yep, I'm a little nervous.  Comments on how to make this more accessible for people with limited skills very welcome, either in comments below or by other channels.]

Appendix 1.  DomainID PHP Code

Cut and paste the code below (displayed in Courier) into a plain text editor, and save the file as domainid.php.  Or, download the code from http://arstweb.clayton.edu/domainid/domainid.zip

<?php


/****************************************************************
domainid.php
Richard Pearce-Moses
pearcemoses@gmail.com

Creates a list of domains referenced on a website, allowing identification of other domains that may be of interest.  Intended for use by archives that harvest web content.  See Richard Pearce-Moses and Joanne Kaczmarek, “An Arizona Model for Preservation and Access of Web Documents,” DttP: Documents to the People 33:1 (Spring 2005) at http://home.comcast.net/~pearcemoses/papers/AzModel.pdf.  See also Jackson, Zhang, and Wu, "Hyperlink Extraction Improves State of Illinois Website Identification," Proceedings of the American Society for Information Science and Technology 43:1 (2006) at http://onlinelibrary.wiley.com/doi/10.1002/meet.14504301218/abstract

1. Create a list of URLs from a single site using Xenu's LinkSleuth, available from http://home.snafu.de/tilman/xenulink.html. Note: Versions available through other sources, such as cnet.com, may embed adware in your browser. The software is distributed at no charge. Tilman Hausherr, the author of the software, invites users to send a thank you letter, an XL T-shirt, or other inexpensive token.

Xenu is easy to use and does a very good job.  It may get caught in a "spider trap" -- commonly a series of URLs that generate an endless list.  For example, links on a calendar page that point to "next month" can be followed for years (as it were).  Xenu can be configured to exclude such problem links.  If a crawl is taking exceptionally long, abort it, inspect the results, and make changes as appropriate.

2. From Xenu, export the results as a tab separated file.

3. Open a command line prompt.  For simplicity's sake, change to the directory where you saved the file.

4.  Run the php script, redirecting the file to the program using stdin and exporting the results using stdout. Your command should look something like what follows, although the prompt will be different
     c:\dir\to\file>  php domainid.php < XenuOutput.txt > WebsiteList.txt

If you forget the < before the input file, the program will appear to hang -- it's waiting for input.  Hit CTRL-C to abort the program.

5.  The results of the first file can be used as seeds to discover still more websites of potential interest.

This work is licensed under a Creative Commons Attribution 4.0 International License. http://creativecommons.org/licenses/by/4.0/


Please report bugs and improvements to pearcemoses@gmail.com

*****************************************************************/

// Initialize the file handle for standard input
$rawURLS = fopen('php://stdin', 'r');

// Skip the first line (header info)
$URL = fgets($rawURLS);

// Initialize the list of domains and the most recently added domain
$URLlist = array();
$prevURL = "";

while ($URL = fgets($rawURLS))
{
    // Strip everything after the first tab, keeping only the URL column
    $tabstop = strpos($URL, "\t");
    if ($tabstop !== false)
    {
        $URL = substr($URL, 0, $tabstop);
    }
    $URL = trim($URL);

    // Strip the protocol from the beginning of the URL; skip lines without one
    $start_position = stripos($URL, '://');
    if ($start_position === false)
    {
        continue;
    }
    $URL = substr($URL, $start_position + 3);

    // Strip directory and file information following the website domain
    $end_position = strpos($URL, '/');
    if ($end_position !== false)
    {
        $URL = substr($URL, 0, $end_position);
    }

    // Add the domain to the array, skipping a duplicate of the immediate predecessor
    if ($prevURL != $URL)
    {
        $URLlist[] = $URL;
        $prevURL = $URL;
    }
}

// Sort the array so duplicate domains are adjacent
sort($URLlist);

// Deduplicate the array, printing each unique domain once and counting them
$prevURL = "";
$count = 0;
foreach ($URLlist as $URL)
{
    if ($prevURL != $URL)
    {
        echo $URL . "\n";
        $prevURL = $URL;
        $count = $count + 1;
    }
}
echo "Number of domains referenced on site: " . $count . "\n";
// Note: Output goes to a file by redirecting stdout

fclose($rawURLS);

?>




NOTES (not to be included in the php file above).  Apologies to readers: Blogger makes it look like the notes are linked in the text, but they're not.

[1] See Richard Pearce-Moses and Joanne Kaczmarek, “An Arizona Model for Preservation and Access of Web Documents,” DttP: Documents to the People 33:1 (Spring 2005), p. 17–24.  Preprint available at http://home.comcast.net/~pearcemoses/papers/AzModel.pdf.
[2] In this paper, “website” refers to the whole of an organization’s web presence.  Often that presence is distributed across several webservers, each identified by its own domain, the first part of the URL.  For example, Microsoft’s website would include www.microsoft.com and support.microsoft.com.  The domain discovery tool identifies specific domains.  In some instances, as above, domains are clearly part of the same website because the last two elements are the same.  In other instances the relationship is not readily apparent from the domain; azcleanair.com and azdeq.gov, for example, could both be considered part of the Arizona Department of Environmental Quality’s web presence.
[3] Jackson, Zhang, and Wu, "Hyperlink Extraction Improves State of Illinois Website Identification," Proceedings of the American Society for Information Science and Technology 43:1 (2006), http://onlinelibrary.wiley.com/doi/10.1002/meet.14504301218/abstract.
[4] “Tools Development,” http://www.ndiipp.illinois.edu/index.php?Phase_I_%282004-2007%29:Tools_Development, checked 8 July 2014.  The project was funded by the National Digital Information Infrastructure and Preservation Program of the Library of Congress.
[5] Be aware that versions available through other sources, such as cnet.com, may embed adware in your browser.
[7] http://php.net/.  Windows binaries are available at http://windows.php.net/.
[9] See “The robots.txt file” in the HTML 4.01 Specification (W3C, 1999), http://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1

1 comment:

  1. Since posting, I have figured out how to parse reports generated by LinkChecker (http://wummel.github.io/linkchecker/index.html).

    LinkChecker runs on multiple platforms. As time permits, I'll post a version of the domainid tool that can read these files.
