17 July 2014

A Tool to Discover Web Content For Archives and Libraries (New, Improved Version 2.0)

Abstract

Version 2 of the Domain ID tool improves the process of identifying websites by adding a database that combines the domains identified from several seed sites and makes it possible to skip sites already reviewed.  Installing and using the software is only slightly more complicated than with the previous version, and it will still run on consumer-grade PCs using software available at no charge.

§ § §

Version 1 of the Domain ID tool described an approach for archives and libraries to identify websites that might be candidates for acquisition.  Finding potentially relevant websites among the thousands on the web is like searching for the proverbial needle in a haystack.  To increase the odds, the approach behind the Domain ID tool assumes that a relevant website will likely have links to other relevant sites.  Manually searching a site for links to all other sites is tedious, time consuming, and prone to errors.  The complete background and rationale remain the same.  Rather than repeating it here, please see the previous entry.

Note: The boundaries of a website are somewhat arbitrary.  A simple site may be on a single domain (www.simple.com).  A more complex site may be on several domains (www.complex.com, apps.complex.com, complex.differentdomain.com).  Often the website's components can be identified by the shared second- and top-level domains (complex.com).  Sometimes a portion may be on an entirely different domain, and the only way to recognize that connection is to look at that site.  This tool identifies specific domains.  The domains are grouped by their second- and top-level domains (rather than alphabetically) to help make such connections obvious.
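
To make the grouping concrete: it boils down to keeping only the last two labels of a host name.  A minimal sketch in PHP (illustration only, not part of the tool, and it ignores multi-part suffixes such as .co.uk):

<?php
// Illustration only: derive the grouping key (second- and top-level domains) from a host name.
function groupKey($host)
{
  $labels = explode('.', $host);                    // e.g., array('procurement', 'az', 'gov')
  $n = count($labels);
  if ($n < 2) { return $host; }                     // nothing to trim
  return $labels[$n - 2] . '.' . $labels[$n - 1];   // 'az.gov'
}

echo groupKey('procurement.az.gov') . "\n";         // az.gov
echo groupKey('www.complex.com') . "\n";            // complex.com
?>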

The Domain ID tool automates the process in three steps.  First, it uses a program (Xenu's Link Sleuth) to download all the links on a site.  Second, it uses a second program (domainid2_add.php) to analyze those links, extract a list of all the domains in the list, and save the results to a database.  Finally, the list can be viewed using a third program (domainid2_new.php).  The list can be reviewed relatively quickly, especially as it will be readily apparent that many sites are in or out of scope.

Version 1 emphasized simplicity over efficiency.  It produced a report based on a single website.  Further, the entire list would  have to be reviewed each time it was run, on the same website or a different website.

Version 2 addresses those problems by storing the domains in a database.  Domains extracted from analyzing many websites are integrated into a single list.  The tool also tracks when the domain was added to the database; a simple query to the database can return only newly-discovered domains.  Version 2 is also designed with the same considerations to make it accessible to those with limited technical skills or resources.

Version 3 will provide a web interface to the database, allowing users to flag whether a domain is in or out of scope, make notes about specific domains, and group domains that share the same second-level domain.  Version 3 should be considered vaporware, with no promise that it will be available REAL SOON NOW.  It will require greater technical expertise to install, although it will continue to use tools available at no cost that run on consumer-grade equipment, and the web interface should make it easier to use.

Requisite Software and Configuration

1-4.  Version 2 requires the same software and configuration as Version 1: Xenu's LinkSleuth and PHP.  The domainid.php tool is not used in Version 2, but it does not have to be deleted.  See the previous entry for installation instructions.  If you've not already installed Version 1, follow those instructions, then continue here.

5.  MySQL

MySQL is a commercial-grade database management system.  Installing MySQL is a bit tricky, but more intimidating than difficult.  Although unlikely, MySQL might cause an older system with limited resources to run slowly.  If things go awry, open the MySQL folder under Start | Programs and look for the MySQL Installer.  Although the name is counterintuitive, launch that program and select Uninstall to remove MySQL from the system.

Download the MySQL Installer 5.6 for Windows from http://dev.mysql.com/downloads/mysql/.  (Note that the MySQL Installer is 32-bit, but is used to install both the 32- and 64-bit versions of the program.) Also note that after clicking the link to the installer, you'll see options for two installers – one small, the other large.  The former downloads the program during installation; the latter downloads it all at once. 

The Installer offers to install all products.  The Server Only option is adequate for the Domain ID programs. You can use the Installer to add additional features at a later date, if you want to invest some time learning this very powerful tool so that you can build your own queries.  
-      Select Development Machine for the Server Configuration Type. 
-      When prompted for a root password for MySQL, be sure to note it as you will need it in subsequent steps.  If this is the first time you've set up MySQL, leave "Current Root Password" blank. 
-      Accept the defaults to start MySQL at Windows Startup and to run the Windows service as a standard system user.
-      Accept the defaults for Config Type (Development Machine), to Enable TCP/IP Networking, and no Advanced Configuration
-      Restart your computer

6.      Download the code from http://arstweb.clayton.edu/domainid/domainid2.zip and extract the files.  In the instructions for Version 1, I recommended that you create the directory c:\domainid for those files.  You may continue to use that directory for these new files.  I also noted that more experienced users may choose to use another directory; in the instructions below they should replace c:\domainid\ with the directory of their choice.

Note:  Several of the programs below create, update, and query the database.  They include the userid 'archivist' and password 'Kaczmarek'.  (The latter in honor of my colleague who helped with the Arizona Model years ago.)  In general, it's a bad idea to use userids and passwords you found on the Internet.  You would be well advised to take a few minutes to change the userid and password and save the files on your computer.  Be consistent, using the same userid and password throughout.

At the same time, using the defaults in a local installation on a workstation that's password protected poses relatively little risk.  As configured, the user can access the database only from the local system.  Anyone with that access could view the contents of the programs to discover the userid and password combination. 

If the programs are run from a server – especially if that server has other MySQL databases – you or someone you work with can likely address these security issues.  Take care not to forget this step.

7.      Create the database by opening a window with the command prompt (Windows key, then type cmd).  The database structure is described in Appendix 1.  Change to the directory you've created with the command
                   cd c:\domainid

then issue the command
                   mysql -u root -p < domainid2_createdb.sql

You will be prompted for the MySQL root password you created above.  Type it, and hit enter.

Note: Be aware that this command is intended to be used only for initial configuration of a blank database.  If you run it again after capturing websites, all data will be deleted.  You may want to do that to start from a clean slate after playing with the tools to see how they work.  If you have data that needs to be preserved, see the instructions for backup below.

Test the database by issuing the command
                   php domainid2_testdb.php

You should see the following results
                   C:\Users\rpm\Dropbox\DomainID\v2>php domainid2_testdb.php
          <h2>Found 1 domains</h2>This database currently has 0 rows
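
If you're curious, the test amounts to little more than connecting to the database and counting rows.  A minimal sketch of the idea (the shipped domainid2_testdb.php is authoritative; I'm assuming the mysqli extension, the default userid and password noted above, and a database and table both named domains, per Appendix 1):

<?php
// Illustration only: connect and count rows; see domainid2_testdb.php for the real test.
$db = new mysqli('localhost', 'archivist', 'Kaczmarek', 'domains');
if ($db->connect_error) { die('Connection failed: ' . $db->connect_error . "\n"); }

$result = $db->query('SELECT COUNT(*) AS n FROM domains');
$row = $result->fetch_assoc();
echo 'This database currently has ' . $row['n'] . " rows\n";
$db->close();
?>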


Instructions for Use

1.      Xenu's LinkSleuth

Follow the same instructions for step 1 in the instructions for Version 1 to run Xenu's LinkSleuth and save the file as tab separated values.

2.      domainid2_add.php

Open a window with the command prompt and change to c:\domainid (or your preferred working directory) as above.

Run the command as follows, substituting the file name for the LinkSleuth report.  Note that in Version 2, you do not supply a name for a file to store the discovered domains.
    php domainid2_add.php < LinkSleuthFile.txt

If you did not add PHP to your path, use the following form:
    c:\php\php domainid2_add.php < LinkSleuthFile.txt

You'll see a list of newly discovered domains scroll up the screen as they are added to the database.  If you want to capture that list to a file, add > ExtractedDomains.txt to the end of the command.
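
The gist of what domainid2_add.php does with each extracted domain is to check whether it's already in the database and, if not, insert it with today's date.  The sketch below only illustrates that logic; it is not the shipped code.  The table and column names follow Appendix 1, the credentials are the defaults noted above, and exactly how the real script fills the TLD and SLD fields is my guess.

<?php
// Illustration only: insert a newly discovered domain if it is not already in the database.
$db = new mysqli('localhost', 'archivist', 'Kaczmarek', 'domains');
if ($db->connect_error) { die('Connection failed: ' . $db->connect_error . "\n"); }

function addDomain($db, $host)
{
  // Skip hosts already in the database
  $check = $db->prepare('SELECT domainKey FROM domains WHERE domain = ?');
  $check->bind_param('s', $host);
  $check->execute();
  $check->store_result();
  if ($check->num_rows > 0) { return; }

  // Split off the top- and second-level domains for grouping (a guess at how the
  // real script fills these fields), and record when the domain was first seen
  $labels = explode('.', $host);
  $n = count($labels);
  $tld = $labels[$n - 1];
  $sld = ($n > 1) ? $labels[$n - 2] . '.' . $tld : $host;
  $today = date('Y-m-d');

  $insert = $db->prepare('INSERT INTO domains (domain, TLD, SLD, firstseen) VALUES (?, ?, ?, ?)');
  $insert->bind_param('ssss', $host, $tld, $sld, $today);
  $insert->execute();
  echo $host . "\n";                    // newly discovered domains scroll up the screen
}

addDomain($db, 'procurement.az.gov');   // hypothetical example
$db->close();
?>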

3.    domainid2_new.php

To see a list of all domains in the database, issue the command
          php domainid2_new.php all

To see a list of all domains new to the database since a given date, issue the command
          php domainid2_new.php YYYY-MM-DD

To capture the report to a file, append > newdomains.txt to the end of either command, substituting something appropriate for the filename.

Note: the domains are sorted by the second and top level domains.  It looks a bit weird; en.wikipedia.org might sort right next to my.yahoo.com (W is just before Y), but after www.az.gov (wikipedia.org follows az.gov).  However, the sort groups all sites ending with the same two levels, so az.gov, procurement.az.gov, and revenue.az.gov are together.
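
Behind the scenes, the report is just a query against the firstseen column, ordered by the SLD and domain fields.  A sketch of the idea (illustration only; the shipped domainid2_new.php is authoritative, and the database, table, and credentials are the same assumptions and defaults noted above):

<?php
// Illustration only: list domains, optionally limited to those first seen on or after a given date.
// Run as: php sketch.php all   or   php sketch.php YYYY-MM-DD
$db = new mysqli('localhost', 'archivist', 'Kaczmarek', 'domains');
if ($db->connect_error) { die('Connection failed: ' . $db->connect_error . "\n"); }

$since = isset($argv[1]) ? $argv[1] : 'all';
if ($since == 'all')
{ $stmt = $db->prepare('SELECT domain FROM domains ORDER BY SLD, domain'); }
else
{ $stmt = $db->prepare('SELECT domain FROM domains WHERE firstseen >= ? ORDER BY SLD, domain');
  $stmt->bind_param('s', $since);
}
$stmt->execute();
$stmt->bind_result($domain);
while ($stmt->fetch())
{ echo $domain . "\n"; }
$db->close();
?>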
         
4.      Repeat

Version 2 combines the domains found on multiple sites into a single database.  You can continue to grow the list of domains by running Xenu's LinkSleuth against other websites, then running domainid2_add.php to add the list of URLs on those websites to the database.

5.      Backup

If you find the tool useful and it grows over time, you'll want to back it up to avoid loss in case of a system failure.

From the command prompt, change to a directory where you want to initially store the backup file.  If you have a directory linked to a cloud storage service such as Dropbox, you may want to store the file there so it is automatically copied to another system that's physically remote.

Issue the command below, providing the root password when prompted
    mysqldump -u root -p --quick domains > domains_YYYY-MM-DD.sql

Copy the file domains_YYYY-MM-DD.sql elsewhere for safekeeping.

If you have to restore the database entirely, run the domainid2_createdb.sql command as above, then issue the command below, providing the root password when prompted
          mysql -u root -p domains < domains_YYYY-MM-DD.sql

6.      Optionally: Use Microsoft Access to work with the MySQL database

If you're familiar with Microsoft Access, you can link to the MySQL database and create queries and forms to work with the data.  You'll need to install the MySQL ODBC connector and create a Data Source Name (DSN).  Instructions are beyond the scope of this post, but can be easily found on the web.  (A quick search revealed a likely starting point at http://dev.mysql.com/doc/connector-odbc/en/connector-odbc-examples-tools-with-access-linked-tables.html.)


Appendix 1.  Domains Database Structure

The domainid database includes the following fields:

domainKey (int(11)): The unique identifier for each row in the database.
domain (varchar(250)): The domain of the website server.
TLD (tinytext): The top-level domain (.gov, .org, .edu, .com, etc.).
SLD (varchar(150)): The second-level domain below the TLD.  This field is useful for grouping websites that span several domains, such as www.clayton.edu and faculty.clayton.edu.
firstseen (varchar(25)): Date the domain was first added to the database.  Note: the site may have existed long before it was seen by the domainid tool.
inscope (tinyint): A binary field to indicate whether a domain is in or out of scope.  Set to NULL by default.  To be used in Version 3.
notes (longtext): A field for comments about the domain.  To be used in Version 3.
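
For reference, the fields above correspond roughly to a table definition like the one below.  This is an illustrative sketch only; domainid2_createdb.sql in the zip is authoritative, and the database and table names (both domains) and the root password placeholder are my assumptions.

<?php
// Illustration only: create a database and table consistent with the fields described above.
$db = new mysqli('localhost', 'root', 'YourRootPassword');   // placeholder password
if ($db->connect_error) { die('Connection failed: ' . $db->connect_error . "\n"); }

$db->query('CREATE DATABASE IF NOT EXISTS domains');
$db->select_db('domains');
$db->query('CREATE TABLE IF NOT EXISTS domains (
  domainKey INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
  domain    VARCHAR(250),
  TLD       TINYTEXT,
  SLD       VARCHAR(150),
  firstseen VARCHAR(25),
  inscope   TINYINT DEFAULT NULL,
  notes     LONGTEXT
)');
echo "Database and table created\n";
$db->close();
?>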


09 July 2014

A Tool to Discover Web Content For Archives and Libraries (Version 1.0)

Abstract

A rationale for identifying websites that libraries or archives may want to harvest for their collections.  Also includes Version 1 of code to automate the process.  The code can be run on consumer grade PCs using software that’s freely available on the web, and is relatively easy for those with limited technical knowledge to install and use.

§ § §

Updated 10 July 2014 to include explanatory comments in the code section.  The Zip file available for download included those comments.

§ § §

The web is a rich source of materials for libraries and archives.  One of the biggest challenges is finding content that falls within the scope of their collecting policies.  Using a search engine is ineffective and unsystematic.  Even a substantial investment of time looking at individual documents may return only a small fraction of what’s available, the results are dependent on the vagaries of the search engines’ indexing algorithms, and the results often will return documents already considered.

In the early 2000s, the “Arizona Model for Preservation and Access of Web Documents” described an approach based on my experience at the Arizona State Library, Archives and Public Records.[1]  The Arizona Model is based on two assumptions. 

First, rather than using a bibliographic model that focused on individual documents, treat websites as archival collections and acquire the site as aggregates.  A smaller site might be acquired as a whole; time spent weeding out-of-scope materials is not justified by the miniscule gain in storage space.  A larger site might be acquired by analyzing the site’s structure as represented by the subdirectories in the URLs; as is typical in archival practice, appraisal would be done at the series (subdirectory) level. 

Before content could be acquired, a library or archives would have to identify relevant websites.[2]  The Arizona Model’s second assumption is simple: a website that is in scope will likely link to several other websites that are in scope. The site will also include links to sites that are out of scope, but the ratio of in-scope to out-of-scope sites will likely be high. 

A large website may have several thousand URLs.  However, the number of distinct websites referenced is likely much lower.  The trick is to extract a simple list of all websites referenced on a website.  Manually inspecting the pages would be impractical, not to mention tedious and error prone.  Fortunately, that work can be automated.  The process can be implemented and run on consumer-grade computers by individuals with minimal technical skills, using software that is available at no cost.  The code and instructions for use are presented here.

The approach, when first tested in Arizona in the mid 2000s, generated about 10,000 URLs from four websites, an impractical number to review.  However, that list was distilled to about 700 domains, a much more reasonable number.  A librarian or archivist familiar with the subject area could review that list and quickly determine whether many of the websites are in or out of scope.  For example, virtually all Arizona websites included a link to Adobe.com (clearly out of scope), pointing to the Acrobat Reader necessary to read the documents.  By contrast, the list contained many sites they immediately recognized as in scope (az.gov, azcommerce.gov) because they used those sites on a regular basis.  Roughly half were immediately recognized as in or out of scope.  That’s a reasonable number to manually check; five or ten minutes per site would take about forty hours.

While the second assumption is not logically valid (there’s no way to be sure any given site will be listed on at least one other site), it’s adequate.  The Illinois State Library was confident that its list of state websites was complete, but a subsequent analysis based on this approach identified many more.[3]  At the same time, the tool is imperfect because it focuses on domains, and missed websites that were hosted with other content under a shared domain.  (For example, a small agency using an ISP without registering its own domain: www.comcast.net/~azbarbers returns www.comcast.net, which is out of scope.)

As part of the ECHO DEPository Project at the University of Illinois, Urbana-Champaign, the Arizona State Library worked with OCLC to build a suite of tools to help automate identifying domains, appraising subdirectories for acquisition, and describing the collections and series.[4]  Those tools are no longer supported or readily available.

One of the tools – used to discover websites – was particularly useful and immediately addresses libraries’ and archives’ need to identify relevant content.  Moreover, recreating it was relatively trivial, and the code and instructions for its use are presented here.  Version 1.0 does a single, simple task: it generates a list of websites referenced on a given website.  Although limited, it is easy for someone with limited technical skills to install and use.

Version 2.0, under development, will store the discovered websites in a simple database.  Rather than reviewing individual reports, which necessarily means redundant review of sites already discovered, users will be able to see a report of all domains and a report of all domains discovered starting with a given date (typically, the last time the list was reviewed).  The installation of the database is relatively simple, and reports will be run from the command line.

Version 3.0, for future development, will create a website that will allow users to track whether sites have been reviewed and to record comments on the sites.  Version 3.0 will require webserver software, and the version of IIS that’s distributed as part of Windows will suffice for most users.  Note: even Version 3.0 should run on a mid-level, consumer-grade computer; it will not require a powerful server.

Note that this tool does nothing beyond identifying potentially relevant sites.  It does not harvest the content for preservation.  Services such as Archive-It and tools such as HTTrack or wget can be used to capture content, using the sites identified by the domain tool as seeds.

Once installed, using the tool should be straightforward, although some familiarity with the command line is useful.  Installing some of the software requires the ability to create directories and extract the contents of a zip file.  For those who are not familiar with the inner workings of a computer, the optional step to modify the system path is likely more intimidating than difficult.

Requisite Software and Configuration

The instructions that follow are simplistic, in hope that they will be easy for individuals with limited technical skills to follow.  Sophisticated users should recognize other options that will work as well or better for their particular needs.

Caveat emptor: I’ve tested this code on a number of machines with no problems.  No doubt, the first time someone tries it, they’ll find a bug.  Please contact me. 

Standard disclaimers apply:  I offer the code without any warranty.

1.  Xenu’s LinkSleuth was designed to help web masters check for broken links on their sites.  Fortunately for librarians and archivists, the URLs can be exported to a file. 
          Even though LinkSleuth is a bit dated, it works well for the purpose at hand.  The code runs under Windows 7 and 8 (and possibly earlier versions).  The code is available at no charge, although Tilman Hausherr, the author of the software, invites users to send a thank you letter, an XL T-shirt, or other inexpensive token as a gratuity.[5]
          The program is available at http://home.snafu.de/tilman/xenulink.html, and is a simple install.
          If using another link checking program, the code to parse the exported list will likely need to be modified.  For example, LinkChecker files start with three lines to be skipped (as opposed to one), and the data elements are stored as comma separated values (rather than tabs).[6]  Such changes may be relatively trivial for someone with a basic understanding of PHP string functions.
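
For example, adapting the parsing code for LinkChecker output would mean skipping three header lines instead of one and splitting on commas instead of tabs.  A sketch of just that change (it assumes the URL is the first comma-separated value, which you should verify against your own export):

<?php
// Illustration only: read a LinkChecker CSV export rather than a LinkSleuth tab-separated file.
$rawURLS = fopen('php://stdin', 'r');

// Skip three header lines instead of one
for ($i = 0; $i < 3; $i++) { fgets($rawURLS); }

while ($line = fgets($rawURLS))
{
  $fields = str_getcsv($line);   // split on commas rather than tabs
  $URL = $fields[0];             // assumes the URL is the first column
  // ... continue as in domainid.php: strip the protocol, then the path, and collect the domain
}
fclose($rawURLS);
?>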

2.      PHP is commonly used to develop webpages, but – as in this case – can also be run from the command line.[7]  Windows versions of PHP are available from http://windows.php.net/.  The VC11 Non Thread Safe version is experimental, but worked with Windows 7 and 8.  Using the VC9 Non Thread Safe version has fewer risks.  Be sure to select the appropriate binary for your system (x86 for 32-bit systems, or x64 for 64-bit systems).
          PHP is a bit trickier to install.  First, use Windows Explorer to locate the zip file you downloaded.  Right click and select Extract ... .  When prompted for a location to extract the files, navigate to Computer, click to highlight “Local Disk C:”, and then click “Make a New Folder.” Enter php and click “Extract.”
          Second, and optionally, you can modify the system path variable so that Windows can easily run PHP from the command line.  Instructions for Windows 7 can be found on Renso Hollumer’s blog; the process is fundamentally the same with Windows 8.[8]
          If you chose not to modify the path, you’ll need to fully specify the directory where PHP is housed when you run the command.  For example, the only difference is the need to include the c:\php\ prefix in the second example.
          modified path
                  > php domainid.php < XenuFile.txt > ExtractURLs.txt

          unmodified path
                  > c:\php\php domainid.php < XenuFile.txt > ExtractURLs.txt
         
3.  Optionally, create a directory to hold the domain identification tool, the file containing raw data from LinkSleuth, and the files containing the extracted domains.  These instructions will assume that you create such a directory, called c:\domainid.  If you create a directory in another location, replace this example with the drive and path you’re using instead.

4.  The Domain Identification Tool is a simple PHP script.  Use a plain text editor (WordPad, Notepad, Textpad, not Word) to copy and paste the code in Appendix 1 into a blank document, and save it in a file called domainid.php in the directory you’ve created for this project.  You can also download and save this file from http://arstweb.clayton.edu/domainid/domainid.zip.  You’ll need to extract the file to the working directory.

Instructions for Use

1.      Xenu’s LinkSleuth

LinkSleuth is sophisticated software that is easy to use at a rudimentary level.  However, it can be configured to address problematic websites.  As you get more experience with the process and the diverse configurations of websites, take time to read the documentation and learn about how to use the tool effectively and efficiently.

Launch the program from the menu with File | Open URL ... and enter the URL for the home page of the website you want to crawl.  Very shortly, you’ll begin to see a list of links, which will change color as the program verifies their status. 

Monitor the program’s progress in the lower right corner of the status bar.  If the program runs for more than five minutes – especially the first few times you harvest URLs from a site – use the pause or emergency stop buttons to halt the process, and check the report for possible problems. 

If LinkSleuth finishes quickly with only one or a few URLs, the website may be rejecting requests from unknown spiders with a robots.txt file.[9]   If you run into this scenario, you’ll likely need to find another website that you can use as your starting point. 

The most common problem is a “spider trap,” a set of pages that generate an endless number of links for the spider to check.  The classic example is a database-driven calendar page with a link to “next month.”  Following the link to the next month, the spider finds a link to the subsequent month; the spider could follow these links until the end of time. 

After running the report the first few times, browse the log for URLs that suggest a spider trap or other content you know you don’t want or that won’t produce useful results.  For example, a website may have thousands of scanned images.  If you spot problems, you can configure LinkSleuth to skip those URLs. As you become more proficient with LinkSleuth, you’ll know when the program is functioning properly and can run for longer periods of time.

When the report is finished (or if you’ve stopped it manually), you’ll be prompted for a report.  Viewing this report is optional.  After closing that report, from the menu click File | Export to tab separated file . . .  Save the file in an appropriate location (c:\domainid).   Using an appropriate name will make it easier to keep track of your work.  For example, include a reference to the site and date; a report for arstweb.clayton.edu might be arstweb_20140708.txt.  For an organization with several websites, you might use the domain itself support.microsoft.com_20140708.txt or www.microsoft.com_20140708.txt.

2.      domainid.php

Open a window with the command line by clicking the Start button, typing cmd, and selecting Command Prompt (cmd.exe).

Change to the working directory that contains the domainid.php file and the tab separated file from LinkSleuth
    cd c:\domainid

Run the command as follows, substituting the file names for the LinkSleuth report and a name for the new file containing the extracted domains.  Note: If an existing file has the same name that you use for ExtractedDomains.txt, it will be overwritten and lost.
    php domainid.php < LinkSleuthFile.txt > ExtractedDomains.txt
or   c:\php\php domainid.php < LinkSleuthFile.txt > ExtractedDomains.txt

A bit of explanation for those unfamiliar with the command line.  The first element calls the php interpreter, and the second part calls the particular program to be run. 
          The third element (< LinkSleuthFile.txt) tells the program the file in which the raw data is stored.  The final element (> ExtractedDomains.txt) tells the program where to store the output.  For more information about this syntax, search for stdin and stdout, standard abbreviations for standard input and standard output.
          If the input or output files are in other directories, include a fully qualified path.  For example, if the input file were on the H: drive in a folder called Raw and the output file were on the E: drive in a folder called Parsed, the command would look like this:
          php domainid.php < H:\Raw\LinkSleuthFile.txt > E:\Parsed\ExtractedDomains.txt
If the file name or path includes spaces or special characters, you may have to enclose them in quotes; the easier solution is to avoid those characters, using alphanumerics, dash, underscore, and dot.

[This is my first effort to post code. Yep, I'm a little nervous.  Comments on how to make this more accessible for people with limited skills very welcome, either in comments below or by other channels.]

Appendix 1.  DomainID PHP Code

Cut and paste the text below in Courier into a plain text editor, and save the file as domainid.php.  Or, download the code from http://arstweb.clayton.edu/domainid/domainid.zip

<?php


/****************************************************************
domainid.php
Richard Pearce-Moses
pearcemoses@gmail.com

Creates a list of domains referenced on a website, allowing identification of other domains that may be of interest.  Intended for use by archives that harvest web content.  See Richard Pearce-Moses and Joanne Kaczmarek, “An Arizona Model for Preservation and Access of Web Documents,” DttP: Documents to the People 33:1 (Spring 2005) at http://home.comcast.net/~pearcemoses/papers/AzModel.pdf.  And,  Jackson, Zhang, and Wu, "Hyperlink Extraction Improves State of Illinois Website Identification," Proceedings of the American Society for Information Science and Technology 43:1 (2006) at http://onlinelibrary.wiley.com/doi/10.1002/meet.14504301218/abstract

1. Create a list of URLs from a single site using Xenu's LinkSleuth, available from http://home.snafu.de/tilman/xenulink.html. Note: Versions available through other sources, such as cnet.com, may embed adware in your browser. The software is distributed at no charge. Tilman Hausherr, the author of the software, invites users to send a thank you letter, an XL T-shirt, or other inexpensive token.

Using Xenu is easy and does a very good job.  It may get caught in a "spider trap" -- commonly a series of URLs that generate an endless list.  For example, links on a calendar page that point to "next month" can be followed for years (as it were).  Xenu can exclude such problem links.  If a crawl is taking exceptionally long, abort it, inspect the results, and make changes as appropriate.

2. From Xenu, export the results as a tab separated file.

3. Open a command line prompt.  For simplicity's sake, change to the directory where you saved the file.

4.  Run the php script, redirecting the file to the program using stdin and exporting the results using stdout. Your command should look like something like what follows, although the prompt will be different
     c:\dir\to\file>  php domainid.php < XenuOutput.txt > WebsiteList.txt

If you forget the < before the input file, the program will appear to hang -- it's waiting for input.  Hit CTRL-C to abort the program.

5.  The results of the first file can be used as seeds to discover still more websites of potential interest.

This work is licensed under a Creative Commons Attribution 4.0 International License. http://creativecommons.org/licenses/by/4.0/


Please report bugs and suggest improvements to pearcemoses@gmail.com

*****************************************************************/

// Initialize file handle to read from standard input
$rawURLS = fopen('php://stdin', 'r');

// Skip the first line (header info)
$URL = fgets( $rawURLS );

// Collect the domain portion of each URL
$URLlist = array();
$prevURL = "";

while( $URL = fgets( $rawURLS ) )
{
  // Strip everything after the first tab (the remaining columns in the report)
  $tabstop = stripos($URL, "\t");
  if ($tabstop === false) { continue; }          // skip lines without a tab
  $URL = substr($URL, 0, $tabstop);

  // Strip the protocol from the beginning of the URL
  $start_position = stripos($URL, '://');
  if ($start_position === false) { continue; }   // not a URL; skip the line
  $URL = substr($URL, $start_position + 3);

  // Strip directory and file information following the website domain
  $end_position = stripos($URL, '/');
  if ($end_position !== false)
  { $URL = substr($URL, 0, $end_position); }

  // Assign the domain to the array, skipping a duplicate immediate predecessor
  if ($URL != "" && $prevURL != $URL)
  { $URLlist[] = $URL;
    $prevURL = $URL;
  }
}
fclose( $rawURLS );

// Sort the array
sort($URLlist);

// Output each domain once, skipping duplicates, and count the distinct domains
$prevURL = "";
$domaincount = 0;
foreach ($URLlist as $URL)
{ if ($prevURL != $URL)
  { echo $URL . "\n";
    $prevURL = $URL;
    $domaincount = $domaincount + 1;
  }
}

echo "Number of domains referenced on site: " . $domaincount . "\n";
// Note: Output goes to stdout; redirect it to a file with >

?>




NOTES: (Not to be included in the php file above.)  Apologies to readers: Blogger makes it look like the notes are linked in the text, but they're not.

[1] See Richard Pearce-Moses and Joanne Kaczmarek, “An Arizona Model for Preservation and Access of Web Documents,” DttP: Documents to the People 33:1 (Spring 2005), p. 17–24.  Preprint available at http://home.comcast.net/~pearcemoses/papers/AzModel.pdf.
[2] In this paper, “website” refers to the whole of an organization’s web presence.  Often that presence is distributed across several webservers, each identified by its own domain, the first part of the URL.  For example, Microsoft’s website would include www.microsoft.com, support.microsoft.com.  The domain discovery tool identifies specific domains.  In some instances, as above, domains that are clearly part of the website because the last two elements are the same.  In other instances, azcleanair.com and azdeq.gov could both be considered part of the Arizona Department of Environmental Quality’s web, but that is not readily apparent from the domain.
[3] Jackson, Zhang, and Wu, "Hyperlink Extraction Improves State of Illinois Website Identification,"
Proceedings of the American Society for Information Science and Technology 43:1 (2006) at http://onlinelibrary.wiley.com/doi/10.1002/meet.14504301218/abstract
[4] “Tools Development,” http://www.ndiipp.illinois.edu/index.php?Phase_I_%282004-2007%29:Tools_Development, checked 8 July 2014.  The project was funded by the National Digital Information Infrastructure and Preservation Program of the Library of Congress.
[5] Be aware that versions available through other sources, such as cnet.com, may embed adware in your browser.
[7] http://php.net/.  Windows binaries are available at http://windows.php.net/.
[9] See “The robots.txt file” in the HTML 4.01 Specification (W3C, 1999), http://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1

18 June 2014

Janus as the patron of archives

Melissa Gonzales just tweeted, “I prefer @pearcemoses idea of Janus as patron god of archivists v. Saint Catherine as a patron saint. Let’s kick this pagan.” I’m not sure what’s going on at the Archives Leadership Institute (#ALI14) that triggered this, but I’m seeing lots of great tweets from participants.

I do like the idea of Janus as a patron, especially for digital archivists.  Many prospective students tell me they want to study archives because they love history.  Amen for that; history is good.

However, I make sure they know that archivists aren’t historians.  More important, I stress to them that Clayton State’s program emphasizes digital archives.  It doesn’t take much for them to understand the ephemeral nature of digital records and the potential that – if archivists don’t act to capture those records – they will be lost.  Archivists work with history, but I reframe that idea.  Digital archivists are focused not on their own past, but the future’s past.

Janus may be an ideal patron because he is on the cusp of the moment.  Unlike Mnemosyne, the personification of memory who looks only backwards, Janus is always in the present, looking simultaneously to the past and the future. 



08 April 2014

Radcliffe Workshop on Technology & Archival Processing

Recently I had the privilege of participating in a workshop on technology and archival processing sponsored by the Radcliffe Institute for Advanced Study and the Association for Research Libraries.[1] The workshop sought to explore ways to apply technology to tangible collections, with only secondary consideration of born-digital materials.  In particular, how can technology facilitate arranging and describing archival collections?   A second, inherent question focused on how finding aids might change or be improved through technology.

Aaron Trehub of Auburn and I were asked to offer closing comments.  I offer my observations here, after taking time to reflect on the many excellent insights and ideas.

§ § §

Throughout the workshop, I wondered (as I often do) if archivists were confronting evolution or revolution.  Are we seeing the transformation of the profession?  Or, have things changed so much that we’re really witnessing the demise of archives (and archivists) as we know them?

I believe that  archivists must persevere in their noble profession because they serve a distinct role in society.  They are focused on the long-tail value of records, the usefulness of records long after they’ve been created.  William Maher observed that archivists “must stand fast and hold true to [their] role as custodians and guardians of the authentic record of the past,” “to provide  an authentic, comprehensive record that ensures accountability for our institutions and preservation of cultural heritage for our publics.”[2]

For years, I've said that what archivists do (at the abstract level) remains the same, but how we do it must change.[3], [4]  Among other things, archivists

-      Select and acquire records that capture a complete (representative, if not exhaustive), accurate, and authentic story of the past.  If not, cultural memory will be lost, the future will not have the records it needs to understand its past, and individuals and organizations will not have the evidence necessary to protect their rights and interests.

-      Organize and describe those records to provide physical and intellectual control.  Archivists must help people find their way in what is, for many, a strange land of primary sources, where meaning often lies in the contextual relationship between records, relationships that reflect their provenance and original order – rather than in the document itself.  Archives aren’t your grandfather’s Dewey Decimal library, and can be alien and confusing for many.

-      Provide access and reference services.  Where some see archivists as gatekeepers and barriers to the records, the reality is that archivists are advocates for researchers.  Not only do archivists help researchers find relevant records, they often help researchers hone their questions.

Getting into the weeds a bit, how we do it must change.  In the past, when we transferred records from the file cabinets where they were stored during use, across the archival threshold, and into our custody, we carefully placed them in boxes to preserve their original order.  That doesn’t work for electronic records.  The records are not on paper, but in databases, and may need to be extracted from fielded data and templates to a document-like report.  Not to mention, placing electronic records in a box doesn’t make sense.  But that change is trivial, as we have a number of readily available possibilities.  Files can be placed in zip or tar files, then transferred via a network connection, thumb drive, tape, or disk.  The workshop suggested many more interesting possibilities, changes in how we do our jobs that re-envision new, more effective ways to work.

Moving finding aids from typewritten paper to DACS/EAD files on the web was just a start.  To a large extent, digital finding aids are protodigital forms, a replication of the existing structure and functionality without taking advantage of the virtual medium.[5]  Not that I’m discounting DACS or EAD.  We must continue to describe our collections, but technology offers us much more than markup.  We need to take advantage of technology to go well beyond the protodigital and find new ways to connect researchers with relevant records they might formerly have overlooked.

Many would immediately think of the scale of information as the most significant change facing archivists.  While the size of backlogs and digital information is a problem, it’s hardly new.  Archivists have struggled with information explosions for years.  After World War I, Jenkinson specifically addressed the issue in his Manual of Archive Administration: Including the Problem of War Archives and Archive Making.[6]  The volume of records that resulted from the growth of the federal government during the Depression and following World War II drove Schellenberg and others at the National Archives to come up with new ways to manage both active records and archives.  And the phrase “information explosion” takes off in the 1960s, and is largely replaced in the 1980s by “paperless office.”[7]

At the workshop, I heard three themes of how technology can change how we do our job.  (Other themes were mentioned, of course.  And, there are other areas of the archival enterprise where technology will have impact, but the workshop focused on processing and providing access to collections.) 

First, researchers asserted that finding aids remain valuable.[8]  Hierarchical description based on provenance and original order is largely derived from European tradition.  In many ways, the model is as much pragmatic as theoretical.  Archives have never had the resources for item-level description.  (In the early 20th century, the Library of Congress’ manuscripts processing manual bemoaned backlogs, even as it prescribed item-level calendaring.[9])  The structure remains useful as a framework.  The finding aid is an important means to document the original order of the collection, to preserve the contextual relationship between records.  New tools that can search repositories and assemble collections based on geotagging, name extraction, and more, described by Dan Cohen of the Digital Public Library of America, are invaluable.[10]  But those assemblages are artificial and do not have the authority of the order established by the creators, an order that reflects the primary value of the records.

Bill Landis observed that recent archival practice has trended away from item-level description, to higher and higher levels of abstraction.  I’ll argue that technology allows us to reverse that trend.  It gives us the tools to provide much more detailed access.  In the past, we didn’t have the staff or time to provide item-level access.  Now, we have access to computing power that can provide that access at an even more sophisticated level that goes beyond item-level access to data mining.  Many researchers don’t have ready access to the software or know how to use those tools.  That’s a service archivists can – and I think should – provide.  Trevor Owens noted that the 4chan records were put online as a zip file with a collection-level description.  But why not pipe the collection through a full-text indexing tool and let people have at it?  People may find what they’re looking for in the text, but not in the collection-level description.[11]

Second, archivists need to be better at what they do.  Which raises the question, what is better?  Ironically, better may be sloppier.  Lambert Schomaker, who presented on automated recognition of handwriting, noted that Google provides reasonable results.  At one point, he observed that archivists sought perfect results, an exact hit.  In archivists’ defense, I think there’s a profound difference between searching the web and searching records.  More often than not, the web has a range of documents that contain overlapping information, where archives hold unique documents that may be the only authoritative, authentic source of a very specific piece of data.  You might find someone’s birthday scattered across the web, but their birth certificate is likely in one place.  Even so, Schomaker’s point is well-taken.  It’s better to have a mess of reasonably relevant documents than nothing at all.  Google can get you in the neighborhood and give you clues where to look.

Luis Francisco‐Revilla noted that there was no consistency in how a group of archivists – working separately – arranged a small collection of personal papers.  In response, one participant[12] expressed her concern that there were no normative practices for arrangement and much of archival practice.  (I expressed some skepticism about the test.  Original order is a normative principle, but personal collections are notorious for being chaotic with no meaningful order to preserve. Moreover, I argued – to tweets in agreement – that such a small collection didn’t merit any arrangement; to the extent arrangement facilitates rapid access, it would take very little time for a researcher to peruse such few records.  Again, providing access without arrangement may be an example where sloppy may be better.)

Better also means that we need to think about what the finding aids say about the collections.  Do they answer users’ questions, help them find relevant collections and records?  One researcher wanted more back story on how the collections were acquired, something usually missing from finding aids.  One researcher’s comment that scope notes were of little value might have pained the archivists in attendance (it broke my heart), but I don’t find the observation surprising.  Recently, I asked my students to do a survey of mission statements and collecting policies on university archives’ websites.  What they found were often little more than a few bullet points of questionable value because they had little substance that would help users (or archivists) know what was in or out of scope.  A recurring theme at the workshop was that finding aids needed to do more than report the structure of the collection.  I’ve always admired Cutter’s Rules, although more than a hundred years old, because he begins with a strategy that focuses on the user.  His last object for the catalog is “to assist in the choice of a book as to its edition [and] as to its character.”[13]  I believe that spirit needs to be at the heart of finding aids, to be way-finders, to help researchers make sense of the collection.  The quality of description must be measured by the degree to which it communicates the information researchers need, not the degree to which it complies with formal rules.

Finally, and possibly most important, are archivists so wed to the tradition of how we do things that we can’t (or won’t) innovate?  When working on a project to explore automated workflows to process digital collections, a participant whose job was processing collections and proud of her craft fumed at her supervisor, “You can’t automate what I do!”  He responded, “You’re exactly right!  We don’t want to automate what you do.  We need to do something different.” 

That is a revolutionary statement that could portend the demise of archivists.  I am concerned that if archivists don’t step up to the plate, if they don’t adapt and take advantage of technology, they may become extinct and others may take our place.  I’ve already seen examples of this.  When heads of companies and government agencies get questions about email, they call the head of IT, not the records manager or archivist.  I suspect most archives are struggling with limited resources to manage an overwhelming number of tangible records.  But to ignore these tools, to be tied to historical approaches, can paint records managers and archivists into a corner.  Investing at least some time experimenting with and touting innovative uses of technology may be an essential part of outreach that demonstrates we remain relevant and current.

At the closing reception, a participant questioned my observation, asking if the archival function would persist, even if others took our place.  I don’t know that the fundamental value of archives – the function of cultural memory that sees the long-tail value of some records – will persist.  Technologists, like the record creators, are appropriately focused on the job at hand, the here and now.  They aren’t focused on “paperwork” or how the records that result from the work might be needed in ten, fifty, or a hundred years.

Archivists, I believe, should view the present from a future perspective.  What will the future need to remember about its past (our present)?  We need to be creative, and we need to put aside practical worries long enough to think big, think outside the proverbial box (records center or virtual).  We can’t let the desire for the perfect finding aid be the enemy of the possible.  After all, our patrons are accustomed to Google search results.
 



[1] See Corydon Ireland, “Books meet Bytes,” Harvard Gazette (4 April 2014) for a description of the first day of the conference.  http://news.harvard.edu/gazette/story/2014/04/books-meet-bytes/.  See also the Twitter feed by searching #radtech14.  Shane Landrum was actively tweeting and captured a summary at https://github.com/cliotropic/radtech14.
[2] “Lost in a Disneyfied World: Archivists and Society in Late-Twentieth-Century America,” American Archivist 61 (Fall 1998), p. 261, 263.
[3] “Janus in Cyberspace: Archives on the Threshold of the Digital Era,” American Archivist 70 (Summer/Spring 2007), p. 13-22.  Available online at http://archivists.metapress.com/content/n7121165223j6t83/fulltext.pdf.
[4] I would like to acknowledge that Catherine Stollar and Thomas Kiehne challenged my formulation, proposing instead “What we do as archivists will change (practice), but why we do it will not (theory).”  See Richard Pearce-Moses and Susan E. David, New Skills for a Digital Era (Society of American Archivists, 2008), p. 64.  Available online at http://www.archivists.org/publications/proceedings/NewSkillsForADigitalEra.pdf.
[5] Kudos to Ken Withers of the Sedona Conference for coining the term ‘protodigital.’
[6] (Clarendon Press, 1922).  Available through Google Books.
[7] Dates based on Google ngram analysis.
[8] Suzanne Kahn and Rhae Lynn Barnes, two historians actively involved in research, discussed their perspectives on finding aids as part of the program.  Both noted that finding aids, even if imperfect, were valuable for a variety of reasons.  Other speakers on the panel, moderated by Ellen Shea, included Trevor Owens and Maureen Callahan.  Callahan’s presentation is on her blog at http://icantiemyownshoes.wordpress.com/2014/04/04/the-value-of-archival-description-considered/
[9] J. C. Fitzpatrick, Notes on the Care, Cataloguing, Calendaring and Arranging of Manuscripts (Library of Congress, 1913).  Available from the Hathi Trust at http://hdl.handle.net/2027/uc2.ark:/13960/t7br8zr3b.
[10] Cohen gave a brilliant opening plenary that did a great job setting the stage for the discussion. 
[11] In defense, Rome was not built in a day, and the archives deserves credit for what it did, not criticism for not doing even more.  I ask the question to illustrate how these approaches must become so commonplace that they’re routine.
[12] In the spirit of the Chatham House Rules, I omitted names of people making comments unless they were part of the published program or unless they tweeted their comments publicly.  Anyone who wishes to be acknowledged may contact me to have this piece edited, or they may identify themselves in the comments.
[13] Charles A. Cutter, Rules for a Printed Dictionary Catalogue (Department of the Interior, Bureau of Education, 1876).  Accessible through Google Books.

8 April 2014 : 1:48 p.m. EDT.  Corrected Dan Cohen's name.  I have no idea who Fred Cohen is; perhaps a participant in the InterPARES Trust project?  Apologies! <g>

29 October 2015 : 12:15 p.m. EDT. Grammatical edit, and I remembered who Fred Cohen is.