17 July 2014

A Tool to Discover Web Content For Archives and Libraries (New, Improved Version 2.0)

Abstract

Version 2 of the Domain ID tool improves the process of identifying websites by adding a database that combines the domains identified from several seed sites and avoids repeated review of sites already examined.  Installing and using the software is only slightly more complicated than the previous version, and the tool still runs on consumer-grade PCs using software available at no charge.

§ § §

Version 1 of the Domain ID tool described an approach for archives and libraries to identify websites that might be candidates for acquisition.  Finding potentially relevant websites among the thousands on the web is like searching for the proverbial needle in a haystack.  To increase the odds, the approach behind the Domain ID tool assumes that a relevant website will likely have links to other relevant sites.  Manually searching a site for all other sites is tedious, time consuming, and prone to errors. The complete background and rationale remain the same; rather than repeating them here, please see the previous entry.

Note: The boundaries of a website are somewhat arbitrary.  A simple site may be on a single domain (www.simple.com).  A more complex site may be on several domains (www.complex.com, apps.complex.com, complex.differentdomain.com).  Often the website's components can be identified by the shared second- and top-level domains (complex.com). Sometimes a portion may be on an entirely different domain, and the only way to recognize that connection is to look at that site. This tool identifies specific domains.  The domains are grouped by their second- and top-level domains (rather than strictly alphabetically), to help make such connections obvious.
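
For the technically curious, reducing a full domain to its second- and top-level parts is simple string handling.  The sketch below is purely illustrative (the function name is made up for this example, and country-code endings such as .co.uk would need extra care):

    <?php
    // Minimal sketch: reduce a hostname to its second- and top-level domains
    // (e.g., apps.complex.com becomes complex.com) so related domains group together.
    function grouping_key($host)
    {
      $parts = explode('.', $host);
      $count = count($parts);
      if ($count < 2) { return $host; }                       // nothing to trim
      return $parts[$count - 2] . '.' . $parts[$count - 1];
    }
    echo grouping_key('www.complex.com') . "\n";              // complex.com
    echo grouping_key('apps.complex.com') . "\n";             // complex.com
    echo grouping_key('complex.differentdomain.com') . "\n";  // differentdomain.com
    ?>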

The Domain ID tool automates the process in three steps.  First, it uses a program (Xenu's LinkSleuth) to download all the links on a site.  Second, it uses a second program (domainid2_add.php) to analyze those links, extract a list of all the domains in the list, and save the results to a database.  Third, the list can be viewed using a third program (domainid2_new.php). The list can be reviewed relatively quickly, especially as it will be readily apparent that many sites are in or out of scope.

Version 1 emphasized simplicity over efficiency.  It produced a report based on a single website.  Further, the entire list had to be reviewed each time the tool was run, whether on the same website or a different one.

Version 2 addresses those problems by storing the domains in a database.  Domains extracted from analyzing many websites are integrated into a single list.  The tool also tracks when each domain was added to the database; a simple query can return only newly discovered domains.  Version 2 is designed with the same considerations as Version 1, to keep it accessible to those with limited technical skills or resources.

Version 3 will provide a web interface to the database, allowing users to flag whether a domain is in or out of scope, to make notes about specific domains, and to group domains that share the same second-level domain.  Version 3 should be considered vaporware, with no promise that it will be available REAL SOON NOW.  It will require somewhat greater technical expertise to install, although it will continue to use tools available at no cost that run on consumer-grade equipment, and the web interface should make the tools easier to use.

Requisite Software and Configuration

1-4.  Version 2 requires the same software and configuration as Version 1: Xenu's LinkSleuth and PHP.  The domainid.php tool is not used in Version 2, but it does not have to be deleted.  If you've not already installed Version 1, see the previous entry for installation instructions, then continue here.

5.  MySQL

MySQL is a commercial-grade database management system.  Installing MySQL is a bit tricky, but more intimidating than difficult.  Although unlikely, MySQL might cause an older system with limited resources to run slowly. If things go awry, open the MySQL folder under Start | Programs and look for the MySQL Installer.  Although the name is counterintuitive, launch that program and select Uninstall to remove MySQL from the system.

Download the MySQL Installer 5.6 for Windows from http://dev.mysql.com/downloads/mysql/.  (Note that the MySQL Installer is 32-bit, but is used to install both the 32- and 64-bit versions of the program.) Also note that after clicking the link to the installer, you'll see options for two installers – one small, the other large.  The former downloads the program during installation; the latter downloads it all at once. 

The Installer offers to install all products.  The Server Only option is adequate for the Domain ID programs. You can use the Installer to add additional features at a later date, if you want to invest some time learning this very powerful tool so that you can build your own queries.  
-      Select Development Machine for the Server Configuration Type. 
-      When prompted for a root password for MySQL, be sure to note it as you will need it in subsequent steps.  If this is the first time you've set up MySQL, leave "Current Root Password" blank. 
-      Accept the defaults to start MySQL at Windows Startup and to run the Windows service as a standard system user.
-      Accept the defaults for Config Type (Development Machine), to Enable TCP/IP Networking, and no Advanced Configuration.
-      Restart your computer.

6.      Download the code from http://arstweb.clayton.edu/domainid/domainid2.zip and extract the files. In the instructions for Version 1, I recommended that you create the directory c:\domainid for those files. You may continue to use that directory for these new files.  I also noted that more experienced users may choose to use another directory; in the instructions below they should replace c:\domainid\ with the directory of their choice.

Note:  Several of the programs below create, update, and query the database.  They include the userid 'archivist' and password 'Kaczmarek'.  (The latter in honor of my colleague who helped with the Arizona model years ago.)  In general, it's a bad idea to use userids and passwords you found on the Internet.  You would be well advised to take a few minutes to change the userid and password and save the files on your computer.  Be consistent, using the same userid and password throughout.

At the same time, using the defaults in a local installation on a workstation that's password protected poses relatively little risk.  As configured, the user can access the database only from the local system.  Anyone with that access could view the contents of the programs to discover the userid and password combination. 

If the programs are run from a server – especially if that server has other MySQL databases – you or someone you work with can likely address these security issues.  Take care not to forget this step.
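
For reference, the database connection inside each script amounts to a line or two along these lines (a sketch only; the exact code in the downloaded files may differ).  These are the values to edit if you change the userid and password:

    <?php
    // The userid and password must match the MySQL account the scripts use.
    // If you change them in MySQL, change them here (and in every script) as well.
    $db = mysqli_connect('localhost', 'archivist', 'Kaczmarek', 'domains');
    if (!$db) { die('Could not connect to MySQL: ' . mysqli_connect_error() . "\n"); }
    ?>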

7.      Create the database by opening a window with the command prompt (Windows Key, then type cmd).  The database structure is described in Appendix 1. Change to the directory you've created with the command
                   cd c:\domainid

then issue the command
                   mysql -u root -p < domainid2_createdb.sql

You will be prompted for the MySQL root password you created above.  Type it, and hit enter.

Note: Be aware that this command is intended to be used only for initial configuration of a blank database.  If you run it again after capturing websites, all data will be deleted.  You may want to do that to start from a clean slate after playing with the tools to see how they work.  If you have data that needs to be preserved, see the instructions for backup below.

Test the database by issuing the command
                   php domainid2_testdb.php

You should see the following results
                   C:\Users\rpm\Dropbox\DomainID\v2>php domainid2_testdb.php
          <h2>Found 1 domains</h2>This database currently has 0 rows


Instructions for Use

1.      Xenu's LinkSleuth

Follow the same instructions for step 1 in the instructions for Version 1 to run Xenu's LinkSleuth and save the file as tab separated values.

2.      domainid2_add.php

Open a window with the command prompt and change to c:\domainid (or your preferred working directory) as above.

Run the command as follows, substituting the file name for the LinkSleuth report.  Note that in Version 2, you do not supply a name for a file to store the discovered domains.
    php domainid2_add.php < LinkSleuthFile.txt

If you did not add PHP to your path, use the following form:
    c:\php\php domainid2_add.php < LinkSleuthFile.txt

You'll see a list of newly discovered domains scroll up the screen as they are added to the database.  If you want to capture that list to a file, add > ExtractedDomains.txt to the end of the command.
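
If you're curious about what the script does, the core of the approach is the same URL-to-domain parsing used in Version 1, followed by an insert for any domain not already stored.  The sketch below is illustrative only, not the actual domainid2_add.php; it assumes a table named domains with the fields listed in Appendix 1 (the real script also fills in the TLD and SLD fields):

    <?php
    // Illustrative sketch: read a LinkSleuth tab separated report from stdin,
    // reduce each URL to its domain, and insert any domain new to the database.
    $db = mysqli_connect('localhost', 'archivist', 'Kaczmarek', 'domains');
    $rawURLS = fopen('php://stdin', 'r');
    fgets($rawURLS);                                        // skip the header line
    while ($line = fgets($rawURLS)) {
      $URL = substr($line, 0, stripos($line, "\t"));        // keep the URL column
      $URL = substr($URL, stripos($URL, '://') + 3);        // strip the protocol
      $slash = stripos($URL, '/');
      $domain = ($slash === false) ? $URL : substr($URL, 0, $slash);
      if ($domain == '') { continue; }
      $safe = mysqli_real_escape_string($db, $domain);
      $found = mysqli_query($db, "SELECT domainKey FROM domains WHERE domain = '$safe'");
      if (mysqli_num_rows($found) == 0) {
        mysqli_query($db, "INSERT INTO domains (domain, firstseen) VALUES ('$safe', CURDATE())");
        echo $domain . "\n";                                // report the newly discovered domain
      }
    }
    fclose($rawURLS);
    ?>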

3.    domainid2_new.php

To see a list of all domains in the database, issue the command
          php domainid2_new.php all

To see a list of all domains new to the database since a given date, issue the command
          php domainid2_new.php YYYY-MM-DD

To capture the report to a file, append > newdomains.txt to the end of either command, substituting something appropriate for the filename.

Note: the domains are sorted by the second and top level domains.  It looks a bit weird; en.wikipedia.org might sort right next to my.yahoo.com (W is just before Y), but after www.az.gov (wikipedia.org follows az.gov).  However, the sort groups all sites ending with the same two levels, so az.gov, procurement.az.gov, and revenue.az.gov are together.
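
Under the hood, the date filter and the grouping described in the note above amount to a query along these lines.  This is a sketch of the kind of query the script runs, not necessarily its exact text; it assumes the SLD and TLD fields described in Appendix 1:

    <?php
    // Sketch: list domains first seen on or after a given date,
    // grouped by second- and top-level domain rather than strictly alphabetically.
    $db = mysqli_connect('localhost', 'archivist', 'Kaczmarek', 'domains');
    $since = '2014-07-01';                    // YYYY-MM-DD; drop the WHERE clause for "all"
    $result = mysqli_query($db,
      "SELECT domain, firstseen FROM domains
       WHERE firstseen >= '$since'
       ORDER BY SLD, TLD, domain");
    while ($row = mysqli_fetch_assoc($result)) {
      echo $row['domain'] . "\t" . $row['firstseen'] . "\n";
    }
    ?>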
         
4.      Repeat

Version 2 combines the domains found on multiple sites into a single database.  You can continue to grow the list of domains by running Xenu's LinkSleuth against other websites, then running domainid2_add.php to add the list of URLs on those websites to the database.

5.      Backup

If you find the tool useful and it grows over time, you'll want to back it up to avoid loss in case of a system failure.

From the command prompt, change to a directory where you want to initially store the backup file.  If you have a directory linked to a cloud storage service such as Dropbox, you may want to store the file there so it is automatically copied to another system that's physically remote.

Issue the command below, providing the root password when prompted
    mysqldump -u root -p --quick domains > domains_YYYY-MM-DD.sql

Copy the file domains_YYYY-MM-DD.sql elsewhere for safekeeping.

If you have to restore the database entirely, run the domainid2_createdb.sql command as above, then issue the command below, providing the root password when prompted
          mysql -u root -p domains < domains_YYYY-MM-DD.sql

6.      Optionally: Use Microsoft Access to work with the MySQL database

If you're familiar with Microsoft Access, you can link to the MySQL database and create queries and forms to work with the data.  You'll need to install the MySQL ODBC connector and create a Data Source Name (DSN). Instructions are beyond the scope of this post, but can be easily found on the web.  (A quick search revealed a likely starting point at http://dev.mysql.com/doc/connector-odbc/en/connector-odbc-examples-tools-with-access-linked-tables.html.)


Appendix 1.  Domains Database Structure

The domainid database includes the following fields:

domainKey (int(11)) – The unique identifier for each row in the database.
domain (varchar(250)) – The domain of the website server.
TLD (tinytext) – The top-level domain (.gov, .org, .edu, .com, etc.).
SLD (varchar(150)) – The second-level domain below the TLD.  This field is useful for grouping websites that span several domains, such as www.clayton.edu and faculty.clayton.edu.
firstseen (varchar(25)) – The date the domain was first added to the database.  Note: the site may have existed long before it was seen by the domainid tool.
inscope (tinyint) – A binary field to indicate whether a domain is in or out of scope.  Set to NULL by default.  To be used in Version 3.
notes (longtext) – A field for comments about the domain.  To be used in Version 3.


09 July 2014

A Tool to Discover Web Content For Archives and Libraries (Version 1.0)

Abstract

A rationale for identifying websites that libraries or archives may want to harvest for their collections.  Also includes Version 1 of code to automate the process.  The code can be run on consumer grade PCs using software that’s freely available on the web, and is relatively easy for those with limited technical knowledge to install and use.

§ § §

Updated 10 July 2014 to include explanatory comments in the code section.  The Zip file available for download included those comments.

§ § §

The web is a rich source of materials for libraries and archives.  One of the biggest challenges is finding content that falls within the scope of their collecting policies.  Using a search engine is ineffective and unsystematic.  Even a substantial investment of time looking at individual documents may return only a small fraction of what’s available, the results are dependent on the vagaries of the search engines’ indexing algorithms, and the results often will include documents already considered.

In the early 2000s, the “Arizona Model for Preservation and Access of Web Documents” described an approach based on my experience at the Arizona State Library, Archives and Public Records.[1]  The Arizona Model is based on two assumptions. 

First, rather than using a bibliographic model that focused on individual documents, treat websites as archival collections and acquire sites as aggregates.  A smaller site might be acquired as a whole; time spent weeding out-of-scope materials is not justified by the minuscule gain in storage space.  A larger site might be acquired by analyzing the site’s structure as represented by the subdirectories in the URLs; as is typical in archival practice, appraisal would be done at the series (subdirectory) level. 

Before content could be acquired, a library or archives would have to identify relevant websites.[2]  The Arizona Model’s second assumption is simple: a website that is in scope will likely link to several other websites that are in scope. The site will also include links to sites that are out of scope, but the ratio of in-scope to out-of-scope sites will likely be high. 

A large website may have several thousand URLs.  However, the number of distinct websites referenced is likely much lower.  The trick is to extract a simple list of all websites referenced on a website.  Manually inspecting the pages would be impractical, not to mention tedious and error prone.  Fortunately, that work can be automated.  The process can be implemented and run on consumer-grade computers by individuals with minimal technical skills, using software that is available at no cost.  The code and instructions for use are presented here.

The approach, when first tested in Arizona in the mid 2000s, generated about 10,000 URLs from four websites, an impractical number to review.  However, that list was distilled to about 700 domains, a much more reasonable number.  A librarian or archivist familiar with the subject area could review that list and quickly determine whether many of the websites are in or out of scope.  For example, virtually all Arizona websites included a link to Adobe.com (clearly out of scope), pointing to the Acrobat Reader necessary to read the documents.  By contrast, the list contained many sites immediately recognized as in scope (az.gov, azcommerce.gov) because the reviewers used those sites on a regular basis. Roughly half were immediately recognized as in or out of scope. The remaining few hundred sites are a reasonable number to check manually; at five or ten minutes per site, that review would take about forty hours.

While the second assumption is not logically valid (there’s no way to be sure any given site will be listed on at least one other site), it’s adequate. The Illinois State Library was confident that its list of state websites was complete, but a subsequent analysis based on this approach identified many more.[3]  At the same time, the tool is imperfect because it focuses on domains, and misses websites that are hosted with other content under a shared domain (for example, a small agency using an ISP without registering its own domain).  In that case, a URL such as www.comcast.net/~azbarbers returns www.comcast.net, which is out of scope.

As part of the ECHO DEPository Project at the University of Illinois, Urbana-Champaign, the Arizona State Library worked with OCLC to build a suite of tools to help automate identifying domains, appraising subdirectories for acquisition, and describing the collections and series.[4]  Those tools are no longer supported or readily available.

One of the tools – used to discover websites – was particularly useful and immediately addresses libraries’ and archives’ need to identify relevant content.  Moreover, recreating it was relatively trivial, and the code and instructions for its use are presented here.  Version 1.0 does a single, simple task: it generates a list of websites referenced on a given website.  Although limited, it is easy for someone with limited technical skills to install and use.

Version 2.0, under development, will store the discovered websites in a simple database.  Rather than reviewing individual reports, which necessarily means redundant review of sites already discovered, users will be able to see a report of all domains or a report of all domains discovered since a given date (typically, the last time the list was reviewed).  The installation of the database is relatively simple, and reports will be run from the command line.

Version 3.0, for future development, will create a website that will allow users to track whether sites have been reviewed and to record comments on the sites.  Version 3.0 will require webserver software, and the version of IIS that’s distributed as part of Windows will suffice for most users.  Note: even Version 3.0 should run on a mid-level, consumer-grade computer; it will not require a powerful server.

Note that this tool does nothing beyond identifying potentially relevant sites.  It does not harvest the content for preservation.  Services such as Archive-It and tools such as HTTrack or wget can be used to capture content, using the sites identified by the domain tool as seeds.

Once installed, using the tool should be straightforward, although some familiarity with the command line is useful.  Installing some of the software requires the ability to create directories and extract the contents of a zip file.  For those who are not familiar with the inner workings of a computer, the optional step to modify the system path is likely more intimidating than difficult.

Requisite Software and Configuration

The instructions that follow are simplistic, in hope that they will be easy for individuals with limited technical skills to follow.  Sophisticated users should recognize other options that will work as well or better for their particular needs.

Caveat emptor: I’ve tested this code on a number of machines with no problems.  No doubt, the first time someone tries it, they’ll find a bug.  Please contact me. 

Standard disclaimers apply:  I offer the code without any warranty.

1.  Xenu’s LinkSleuth was designed to help web masters check for broken links on their sites.  Fortunately for librarians and archivists, the URLs can be exported to a file. 
          Even though LinkSleuth is a bit dated, it works well for the purpose at hand.  The code runs under Windows 7 and 8 (and possibly earlier versions).  The code is available at no charge, although Tilman Hausherr, the author of the software, invites users to send a thank you letter, an XL T-shirt, or other inexpensive token as a gratuity.[5] 
          The program is available at http://home.snafu.de/tilman/xenulink.html, and is a simple install.
          If using another link checking program, the code to parse the exported list will likely need to be modified.  For example, LinkChecker files start with three lines to be skipped (as opposed to one), and the data elements are stored as comma separated values (rather than tabs).[6]  Such changes may be relatively trivial for someone with a basic understanding of PHP string functions.
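
          As an illustration, the changes for a report with a three-line header and comma separated values might look something like the sketch below (illustrative only; check the actual layout of the file your link checker produces):

              <?php
              // Sketch: adapt the parsing for a report with a three-line header
              // and comma separated values instead of tabs.
              $rawURLS = fopen('php://stdin', 'r');
              fgets($rawURLS);                 // skip three header lines
              fgets($rawURLS);                 // rather than one
              fgets($rawURLS);
              while ($line = fgets($rawURLS)) {
                $fields = str_getcsv($line);   // split on commas, respecting quotes
                $URL = $fields[0];             // assumes the URL is the first column
                echo $URL . "\n";              // continue as in domainid.php (strip protocol, path, etc.)
              }
              fclose($rawURLS);
              ?>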

2.      PHP is commonly used to develop webpages, but – as in this case – can also be run from the command line.[7]  Windows versions of PHP are available from http://windows.php.net/.  The VC11 Non Thread Safe version is experimental, but worked with Windows 7 and 8.  Using the VC9 Non Thread Safe version has fewer risks.  Be sure to select the appropriate binary for your system (x86 for 32-bit systems, or x64 for 64-bit systems).
          PHP is a bit trickier to install.  First, use Windows Explorer to locate the zip file you downloaded.  Right click and select Extract ... .  When prompted for a location to extract the files, navigate to Computer, click to highlight “Local Disk C:”, and then click “Make a New Folder.” Enter php and click “Extract.”
          Second, and optionally, you can modify the system path variable so that Windows can easily run PHP from the command line.  Instructions for Windows 7 can be found on Renso Hollumer’s blog; the process is fundamentally the same with Windows 8.[8]
          If you chose not to modify the path, you’ll need to fully specify the directory where PHP is housed when you run the command.  For example, the only difference is the need to include the full path to PHP (c:\php\) at the start of the second example.
          modified path
                  > php domainid.php < XenuFile.txt > ExtractURLs.txt

          unmodified path
                  > c:\php\php domainid.php < XenuFile.txt > ExtractURLs.txt
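
          If you want to confirm that PHP is working from the command line before going further, save a one-line script (say, phptest.php; the name is just an example) and run it with php phptest.php, or c:\php\php phptest.php if you did not modify the path:

                  <?php echo "PHP is working from the command line\n"; ?>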
         
3.  Optionally, create a directory to hold the domain identification tool, the file containing raw data from LinkSleuth, and the files containing the extracted domains.  These instructions will assume that you create such a directory, called c:\domainid.  If you create a directory in another location, replace this example with the drive and path you’re using instead.

4.  The Domain Identification Tool is a simple PHP script.  Use a plain text editor (WordPad, Notepad, Textpad, not Word) to copy and paste the code in Appendix 1 into a blank document, and save it in a file called domainid.php in the directory you’ve created for this project.   You can also download and save this file from http://arstweb.clayton.edu/domainid/domainid.zip.  You’ll need to extract the file to the working directory.

Instructions for Use

1.      Xenu’s LinkSleuth

LinkSleuth is sophisticated software that is easy to use at a rudimentary level.  However, it can be configured to address problematic websites.  As you get more experience with the process and the diverse configurations of websites, take time to read the documentation and learn about how to use the tool effectively and efficiently.

Launch the program from the menu with File | Open URL ... and enter the URL for the home page of the website you want to crawl.  Very shortly, you’ll begin to see a list of links, which will change color as the program verifies their status. 

Monitor the program’s progress in the lower right corner of the status bar.  If the program runs for more than five minutes – especially the first few times you harvest URLs from a site – use the pause or emergency stop buttons to halt the process, and check the report for possible problems. 

If LinkSleuth finishes quickly with only one or a few URLs, the website may be rejecting requests from unknown spiders with a robots.txt file.[9]   If you run into this scenario, you’ll likely need to find another website that you can use as your starting point. 

The most common problem is a “spider trap,” a set of pages that generate an endless number of links for the spider to check.  The classic example is a database-driven calendar page with a link to “next month.”  Following the link to the next month, the spider finds a link to the subsequent month; the spider could follow these links until the end of time. 

After running the report the first few times, browse the log for URLs that suggest a spider trap or other content you know you don’t want or that won’t produce useful results.  For example, a website may have thousands of scanned images.  If you spot problems, you can configure LinkSleuth to skip those URLs. As you become more proficient with LinkSleuth, you’ll know when the program is functioning properly and can run for longer periods of time.

When the report is finished (or if you’ve stopped it manually), you’ll be prompted for a report.  Viewing this report is optional.  After closing that report, from the menu click File | Export to tab separated file . . .  Save the file in an appropriate location (c:\domainid).   Using an appropriate name will make it easier to keep track of your work.  For example, include a reference to the site and date; a report for arstweb.clayton.edu might be arstweb_20140708.txt.  For an organization with several websites, you might use the domain itself: support.microsoft.com_20140708.txt or www.microsoft.com_20140708.txt.

2.      domainid.php

Open a window with the command line by clicking the Start button, typing cmd, and selecting Command Prompt (cmd.exe). 

Change to the working directory that contains the domainid.php file and the tab separated file from LinkSleuth
    cd c:\domainid

Run the command as follows, substituting the file names for the LinkSleuth report and a name for the new file containing the extracted domains.  Note: If an existing file has the same name that you use for ExtractedDomains.txt, it will be overwritten and lost.
    php domainid.php < LinkSleuthFile.txt > ExtractedDomains.txt
or   c:\php\php domainid.php < LinkSleuthFile.txt > ExtractedDomains.txt

A bit of explanation for those unfamiliar with the command line.  The first element calls the php interpreter, and the second part calls the particular program to be run. 
          The third element (< LinkSleuthFile.txt) tells the program the file in which the raw data is stored.  The final element (> ExtractedDomains.txt) tells the program where to store the output.  For more information about this syntax, search for stdin and stdout, standard abbreviations for standard input and standard output.
          If the input or output files are in other directories, include a fully qualified path.  For example, if the input file were on the H: drive in a folder called Raw and the output file were on the E: drive in a folder called Parsed, the command would look like this:
          php domainid.php < H:\Raw\LinkSleuthFile.txt > E:\Parsed\ExtractedDomains.txt
If the file name or path includes spaces or special characters, you may have to enclose them in quotes; the easier solution is to avoid those characters, using alphanumerics, dash, underscore, and dot.

[This is my first effort to post code. Yep, I'm a little nervous.  Comments on how to make this more accessible for people with limited skills very welcome, either in comments below or by other channels.]

Appendix 1.  DomainID PHP Code

Cut and paste the text below in Courier into a plain text editor, and save the file as domainid.php.  Or, download the code from http://arstweb.clayton.edu/domainid/domainid.zip

<?php


/****************************************************************
domainid.php
Richard Pearce-Moses
pearcemoses@gmail.com

Creates a list of domains referenced on a website, allowing identification of other domains that may be of interest.  Intended for use by archives that harvest web content.  See Richard Pearce-Moses and Joanne Kaczmarek, “An Arizona Model for Preservation and Access of Web Documents,” DttP: Documents to the People 33:1 (Spring 2005) at http://home.comcast.net/~pearcemoses/papers/AzModel.pdf.  And,  Jackson, Zhang, and Wu, "Hyperlink Extraction Improves State of Illinois Website Identification," Proceedings of the American Society for Information Science and Technology 43:1 (2006) at http://onlinelibrary.wiley.com/doi/10.1002/meet.14504301218/abstract

1. Create a list of URLs from a single site using Xenu's LinkSleuth, available from http://home.snafu.de/tilman/xenulink.html. Note: Versions available through other sources, such as cnet.com, may embed adware in your browser. The software is distributed at no charge. Tilman Hausherr, the author of the software, invites users to send a thank you letter, an XL T-shirt, or other inexpensive token.

Xenu is easy to use and does a very good job.  It may get caught in a "spider trap" -- commonly a series of URLs that generate an endless list.  For example, links on a calendar page that point to "next month" can be followed for years (as it were).  Xenu can exclude such problem links.  If a crawl is taking exceptionally long, abort it, inspect the results, and make changes as appropriate.

2. From Xenu, export the results as a tab separated file.

3. Open a command line prompt.  For simplicity's sake, change to the directory where you saved the file.

4.  Run the php script, redirecting the file to the program using stdin and exporting the results using stdout. Your command should look something like what follows, although the prompt will be different
     c:\dir\to\file>  php domainid.php < XenuOutput.txt > WebsiteList.txt

If you forget the < before the input file, the program will appear to hang -- it's waiting for input.  Hit CTRL-C to abort the program.

5.  The results of the first file can be used as seeds to discover still more websites of potential interest.

This work is licensed under a Creative Commons Attribution 4.0 International License. http://creativecommons.org/licenses/by/4.0/


Please report bugs and improvements to pearcemoses@gmail.com

*****************************************************************/

// Initialize file handle for standard input
$rawURLS = fopen('php://stdin', 'r');

// Skip the first line (header info)
$URL = fgets($rawURLS);

// Initialize the list of domains and the previously seen domain
$URLlist = array();
$prevURL = "";

while ($URL = fgets($rawURLS))
{
  // strip everything after the first tab (keep only the URL column)
  $tabstop = stripos($URL, "\t");
  $URL = substr($URL, 0, $tabstop);

  // strip protocol from beginning of URL
  $start_position = stripos($URL, '://') + 3;
  $URL = substr($URL, $start_position);

  // strip directory and file information following website domain
  $end_position = stripos($URL, '/');
  if ($end_position !== false)
  { $URL = substr($URL, 0, $end_position); }

  // skip blank results (e.g., a line without a URL)
  if ($URL == "") { continue; }

  // Assign domain to the array, skipping any duplicate of the immediate predecessor
  if ($prevURL != $URL)
  {
    $URLlist[] = $URL;
    $prevURL = $URL;
  }
}

// Sort the array
sort($URLlist);

// Print each domain once, skipping duplicates, and count the distinct domains
$prevURL = "";
$i = 0;
$j = count($URLlist);
$domains = 0;
while ($i < $j)
{
  if ($prevURL != $URLlist[$i])
  {
    echo $URLlist[$i] . "\n";
    $prevURL = $URLlist[$i];
    $domains = $domains + 1;
  }
  $i = $i + 1;
}
echo "Number of domains referenced on site: " . $domains . "\n";
// Note: Output to a file using stdout

fclose($rawURLS);

?>




NOTES: (Not to be included in the php file above.)  Apologies to readers: Blogger makes it look like the notes are linked in the text, but they're not.

[1] See Richard Pearce-Moses and Joanne Kaczmarek, “An Arizona Model for Preservation and Access of Web Documents,” DttP: Documents to the People 33:1 (Spring 2005), p. 17–24.  Preprint available at http://home.comcast.net/~pearcemoses/papers/AzModel.pdf.
[2] In this paper, “website” refers to the whole of an organization’s web presence.  Often that presence is distributed across several webservers, each identified by its own domain, the first part of the URL.  For example, Microsoft’s website would include www.microsoft.com and support.microsoft.com.  The domain discovery tool identifies specific domains.  In some instances, as above, domains are clearly part of the website because the last two elements are the same.  In other instances, azcleanair.com and azdeq.gov could both be considered part of the Arizona Department of Environmental Quality’s web presence, but that is not readily apparent from the domain.
[3] Jackson, Zhang, and Wu, "Hyperlink Extraction Improves State of Illinois Website Identification," Proceedings of the American Society for Information Science and Technology 43:1 (2006) at http://onlinelibrary.wiley.com/doi/10.1002/meet.14504301218/abstract
[4] “Tools Development,” http://www.ndiipp.illinois.edu/index.php?Phase_I_%282004-2007%29:Tools_Development, checked 8 July 2014.  The project was funded by the National Digital Information Infrastructure and Preservation Program of the Library of Congress.
[5] Be aware that versions available through other sources, such as cnet.com, may embed adware in your browser.
[7] http://php.net/.  Windows binaries are available at http://windows.php.net/.
[9] See “The robots.txt file” in the HTML 4.01 Specification (W3C, 1999), http://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1