17 July 2014

A Tool to Discover Web Content For Archives and Libraries (New, Improved Version 2.0)

Abstract

Version 2 of the Domain ID tool improves the process of identifying websites by adding a database that combines the domains identified from several seed sites and avoids re-reviewing sites already reviewed.  Installing and using the software is only slightly more complicated than with the previous version, and the software still runs on consumer-grade PCs using software available at no charge.

§ § §

Version 1 of the Domain ID tool described an approach for archives and libraries to identify websites that might be candidates for acquisition.  Finding potentially relevant websites among the thousands on the web is like searching for the proverbial needle in a haystack.  To increase the odds, the approach behind the Domain ID tool assumes that a relevant website will likely have links to other relevant sites.  Manually searching a site for links to all other sites is tedious, time-consuming, and prone to errors.  The complete background and rationale remain the same.  Rather than repeating them here, please see the previous entry.

Note: The boundaries of a website are somewhat arbitrary.  A simple site may be on a single domain (www.simple.com).  A more complex site may be on several domains (www.complex.com, apps.complex.com, complex.differentdomain.com).  Often the website's components can be identified by the shared second- and top-level domains (complex.com).  Sometimes a portion may be on an entirely different domain, and the only way to recognize that connection is to look at that site.  This tool identifies domains, which are specific.  The domains are grouped by the second- and top-level domains (rather than alphabetically) to help make connections obvious.

The Domain ID tool automates the process in two steps.  First, it uses a program (Xenu's Link Sleuth) to download all the links on a site.  Second, it uses another program (domainid2_add.php) to analyze those links, extract a list of all the domains they contain, and save the results to a database.  The list can then be viewed using a third program (domainid2_new.php).  It can be reviewed relatively quickly, especially as it will be readily apparent that many sites are in or out of scope.
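The extraction step can be illustrated with a short sketch.  This is not the code in domainid2_add.php (which is written in PHP); it is a minimal Python illustration of the idea.  It assumes only that the Link Sleuth report is tab-separated text in which some fields are URLs, so it scans every field rather than relying on a particular column layout.  The sample rows, including the header row, are hypothetical.

```python
from urllib.parse import urlsplit


def extract_domains(lines):
    """Return the set of host names found in tab-separated report lines."""
    domains = set()
    for line in lines:
        for field in line.rstrip("\r\n").split("\t"):
            field = field.strip()
            # Only fields that look like web addresses are of interest.
            if not field.lower().startswith(("http://", "https://")):
                continue
            try:
                host = urlsplit(field).hostname  # lower-cased host, no port
            except ValueError:
                continue  # skip malformed URLs
            if host:
                domains.add(host)
    return domains


# Hypothetical report rows, for illustration only.
sample_report = [
    "Address\tStatus-Text",
    "http://www.az.gov/\tok",
    "http://revenue.az.gov/forms\tok",
    "https://en.wikipedia.org/wiki/Arizona\tok",
    "http://www.az.gov/about\tok",
]

print(sorted(extract_domains(sample_report)))
# ['en.wikipedia.org', 'revenue.az.gov', 'www.az.gov']
```

Collecting the host names in a set means a domain that appears in hundreds of links is reported only once.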

Version 1 emphasized simplicity over efficiency.  It produced a report based on a single website.  Further, the entire list had to be reviewed each time the tool was run, whether on the same website or a different one.

Version 2 addresses those problems by storing the domains in a database.  Domains extracted from analyzing many websites are integrated into a single list.  The tool also tracks when the domain was added to the database; a simple query to the database can return only newly-discovered domains.  Version 2 is also designed with the same considerations to make it accessible to those with limited technical skills or resources.
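The idea of merging domains from many sites and asking only for the new ones can be sketched as follows.  This is an illustration under stated assumptions, not the author's implementation: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, and the two-column table and the function names (open_db, add_domains, new_since) are hypothetical simplifications of the structure in Appendix 1.

```python
import sqlite3


def open_db():
    """Create an in-memory stand-in for the domains database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE domains ("
        " domain TEXT PRIMARY KEY,"   # one row per domain, no duplicates
        " firstseen TEXT NOT NULL)"   # ISO date, YYYY-MM-DD
    )
    return conn


def add_domains(conn, domains, seen_on):
    """Insert domains not yet in the table; return the ones that were new."""
    added = []
    for domain in sorted(domains):
        cur = conn.execute(
            "INSERT OR IGNORE INTO domains (domain, firstseen) VALUES (?, ?)",
            (domain, seen_on),
        )
        if cur.rowcount == 1:  # 0 means the domain was already present
            added.append(domain)
    conn.commit()
    return added


def new_since(conn, date):
    """Return domains first seen on or after the given YYYY-MM-DD date."""
    rows = conn.execute(
        "SELECT domain FROM domains WHERE firstseen >= ? ORDER BY domain",
        (date,),
    )
    return [row[0] for row in rows]


conn = open_db()
add_domains(conn, {"www.az.gov", "revenue.az.gov"}, "2014-07-01")
print(add_domains(conn, {"www.az.gov", "en.wikipedia.org"}, "2014-07-17"))
# ['en.wikipedia.org']
print(new_since(conn, "2014-07-10"))
# ['en.wikipedia.org']
```

Because a domain keeps the date it was first added, a second crawl that finds www.az.gov again does not make it look new.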

Version 3 will provide a web interface to the database, allowing users to make notes about specific domains, to flag whether a domain is in or out of scope, and to group domains that share the same second-level domain.  Version 3 should be considered vaporware with no promise that it will be available REAL SOON NOW.  It will require greater technical expertise to install, although it will continue to use tools that are available at no cost and run on consumer-grade equipment.  The web interface should make the tools easier to use.

Requisite Software and Configuration

1-4.  Version 2 requires the same software and configuration as Version 1: Xenu's LinkSleuth and PHP.  The domainid.php tool is not used in Version 2, but it does not have to be deleted.  See the previous entry for installation instructions.  If you've not already installed Version 1, follow those instructions, then continue here.

5.  MySQL

MySQL is a commercial-grade database management system.  Installing MySQL is a bit tricky, but more intimidating than difficult.  Although unlikely, MySQL might cause an older system with limited resources to run slowly.  If things go awry, open the MySQL folder under Start | Programs and look for the MySQL Installer.  Although the name is counterintuitive, launch that program and select Uninstall to remove MySQL from the system.

Download the MySQL Installer 5.6 for Windows from http://dev.mysql.com/downloads/mysql/.  (Note that the MySQL Installer is 32-bit, but is used to install both the 32- and 64-bit versions of the program.) Also note that after clicking the link to the installer, you'll see options for two installers – one small, the other large.  The former downloads the program during installation; the latter downloads it all at once. 

The Installer offers to install all products.  The Server Only option is adequate for the Domain ID programs. You can use the Installer to add additional features at a later date, if you want to invest some time learning this very powerful tool so that you can build your own queries.  
-      Select Development Machine for the Server Configuration Type. 
-      When prompted for a root password for MySQL, be sure to note it as you will need it in subsequent steps.  If this is the first time you've set up MySQL, leave "Current Root Password" blank. 
-      Accept the defaults to start MySQL at Windows Startup and to run the Windows service as a standard system user.
-      Accept the defaults for Config Type (Development Machine), to Enable TCP/IP Networking, and no Advanced Configuration.
-      Restart your computer.

6.      Download the code from http://arstweb.clayton.edu/domainid/domainid2.zip and extract the files.  In the instructions for Version 1, I recommended that you create the directory c:\domainid for those files.  You may continue to use that directory for these new files.  I also noted that more experienced users may choose to use another directory; in the instructions below they should replace c:\domainid\ with the directory of their choice.

Note:  Several of the programs below create, update, and query the database.  They include the userid 'archivist' and password 'Kaczmarek'.  (The latter in honor of my colleague who helped with the Arizona model years ago.)  In general, it's a bad idea to use userids and passwords you found on the Internet.  You would be well advised to take a few minutes to change the userid and password as you save the files on your computer.  Be consistent, using the same userid and password throughout.

At the same time, using the defaults in a local installation on a workstation that's password protected poses relatively little risk.  As configured, the user can access the database only from the local system.  Anyone with that access could view the contents of the programs to discover the userid and password combination. 

If the programs are run from a server – especially if that server has other MySQL databases – you or someone you work with can likely address these security issues.  Take care not to forget this step.

7.      Create the database by opening a window with the command prompt (Windows Key, then cmd).  The database structure is described in Appendix 1.  Change to the directory you've created with the command
                   cd c:\domainid

then issue the command
                   mysql -u root -p < domainid2_createdb.sql

You will be prompted for the MySQL root password you created above.  Type it, and hit enter.

Note: Be aware that this command is intended to be used only for initial configuration of a blank database.  If you run it again after capturing websites, all data will be deleted.  You may want to do that to start from a clean slate after playing with the tools to see how they work.  If you have data that needs to be preserved, see the instructions for backup below.

Test the database by issuing the command
                   php domainid2_testdb.php

You should see the following results
                   C:\Users\rpm\Dropbox\DomainID\v2>php domainid2_testdb.php
          <h2>Found 1 domains</h2>This database currently has 0 rows


Instructions for Use

1.      Xenu's LinkSleuth

Follow the same instructions for step 1 in the instructions for Version 1 to run Xenu's LinkSleuth and save the file as tab separated values.

2.      domainid2_add.php

Open a window with the command prompt and change to c:\domainid (or your preferred working directory) as above.

Run the command as follows, substituting the file name for the LinkSleuth report.  Note that in Version 2, you do not supply a name for a file to store the discovered domains.
    php domainid2_add.php < LinkSleuthFile.txt

If you did not add PHP to your path, use the following form:
    c:\php\php domainid2_add.php < LinkSleuthFile.txt

You'll see a list of newly discovered domains scroll up the screen as they are added to the database.  If you want to capture that list to a file, add > ExtractedDomains.txt to the end of the command.

3.    domainid2_new.php

To see a list of all domains in the database, issue the command
          php domainid2_new.php all

To see a list of all domains new to the database since a given date, issue the command
          php domainid2_new.php YYYY-MM-DD

To capture the report to a file, append > newdomains.txt to the end of either command, substituting something appropriate for the filename.

Note: the domains are sorted by the second- and top-level domains.  The result looks a bit odd at first; en.wikipedia.org might sort right next to my.yahoo.com (W is just before Y), but after www.az.gov (wikipedia.org follows az.gov).  However, the sort groups all sites ending with the same two levels, so az.gov, procurement.az.gov, and revenue.az.gov are together.
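The sort order described in this note can be reproduced with a small sketch.  It is an illustration only and makes a simplifying assumption: the second- and top-level domains are taken to be the last two dot-separated labels of the host name, which would not be right for suffixes such as .co.uk.  The function name sort_key is hypothetical, and the tool itself may derive the grouping differently.

```python
def sort_key(domain):
    """Sort by the last two labels first, then by the full host name."""
    labels = domain.lower().split(".")
    second_and_top = ".".join(labels[-2:])  # e.g. "az.gov"
    return (second_and_top, domain.lower())


domains = [
    "my.yahoo.com",
    "revenue.az.gov",
    "en.wikipedia.org",
    "www.az.gov",
    "az.gov",
    "procurement.az.gov",
]

ordered = sorted(domains, key=sort_key)
for name in ordered:
    print(name)
# az.gov
# procurement.az.gov
# revenue.az.gov
# www.az.gov
# en.wikipedia.org
# my.yahoo.com
```

All the az.gov hosts land together, followed by wikipedia.org and then yahoo.com, which matches the ordering the note describes.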
         
4.      Repeat

Version 2 combines the domains found on multiple sites into a single database.  You can continue to grow the list of domains by running Xenu's LinkSleuth against other websites, then running domainid2_add.php to add the list of URLs on those websites to the database.

5.      Backup

If you find the tool useful and the database grows over time, you'll want to back it up to avoid loss in case of a system failure.

From the command prompt, change to a directory where you want to initially store the backup file.  If you have a directory linked to a cloud storage service such as Dropbox, you may want to store the file there so it is automatically copied to another system that's physically remote.

Issue the command below, providing the root password when prompted
    mysqldump -u root -p --quick domains > domains_YYYY-MM-DD.sql

Copy the file domains_YYYY-MM-DD.sql elsewhere for safekeeping.
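If you back up regularly, filling in the date by hand becomes a chore.  The sketch below, a minimal illustration rather than part of the Domain ID tools, builds the dated file name and the same mysqldump arguments shown above.  The function name backup_plan is hypothetical, and the database name domains is taken from the command above.

```python
from datetime import date


def backup_plan(day):
    """Return the dated backup file name and the mysqldump argument list."""
    filename = f"domains_{day.isoformat()}.sql"  # isoformat() gives YYYY-MM-DD
    command = ["mysqldump", "-u", "root", "-p", "--quick", "domains"]
    return filename, command


filename, command = backup_plan(date(2014, 7, 17))
print(filename)
# domains_2014-07-17.sql

# To actually run the backup, the command's output would be written to the
# file, for example:
#     import subprocess
#     with open(filename, "wb") as out:
#         subprocess.run(command, stdout=out, check=True)
```

In everyday use you would pass date.today() instead of a fixed date.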

If you have to restore the database entirely, recreate the empty database with domainid2_createdb.sql as above, then issue the command below, providing the root password when prompted.  Note that the restore uses the mysql program, not mysqldump.
          mysql -u root -p domains < domains_YYYY-MM-DD.sql

6.      Optionally: Use Microsoft Access to work with the MySQL database

If you're familiar with Microsoft Access, you can link to the MySQL database and create queries and forms to work with the data.  You'll need to install the MySQL ODBC connector and create a Data Source Name (DSN).  Instructions are beyond the scope of this post, but can be easily found on the web.  (A quick search revealed a likely starting point at http://dev.mysql.com/doc/connector-odbc/en/connector-odbc-examples-tools-with-access-linked-tables.html.)


Appendix 1.  Domains Database Structure

The domainid database includes the following fields

-      domainKey, int(11): The unique identifier for each row in the database.
-      domain, varchar(250): The domain of the website server.
-      TLD, tinytext: The top-level domain (.gov, .org, .edu, .com, etc.).
-      SLD, varchar(150): The second-level domain below the TLD.  This field is useful for grouping websites that span several domains, such as www.clayton.edu and faculty.clayton.edu.
-      firstseen, varchar(25): Date first added to the database.  Note: the site may have existed long before it was seen by the domainid tool.
-      inscope, tinyint: A binary field to indicate if a domain is in or out of scope.  Set to NULL by default.  To be used in Version 3.
-      notes, longtext: A field to make comments about the domain.  To be used in Version 3.

