Website inventory with CSE?

Posted: Tue Oct 09, 2012 3:53 pm
by mdevette
Can we perform a full website content inventory with CSE HTML Validator (latest version)? If so, how?
It's to run on a live public website.

I was hoping that CSE HTML Validator could also do something like what this online tool does: http://www.pagetrawler.com.

Thank you.

Re: Website inventory with CSE?

Posted: Wed Oct 10, 2012 9:14 am
by Albert Wiersch
Hello,

Thank you. I've tried that site and looked at the CSV file. It looks like it is just a summary of page titles and headers on the pages, with some additional information.

It may be possible to do something similar in CSE HTML Validator v12 using the customization ability and the Batch Wizard to crawl a site.

If you'd like me to investigate this, and are willing to work with me and try the v12 BETA as this is developed and tested, then please let me know.

Re: Website inventory with CSE?

Posted: Wed Oct 10, 2012 2:02 pm
by mdevette
Yes, sure, I'm willing to help on this. Just let me know.

Marco

Re: Website inventory with CSE?

Posted: Fri Oct 12, 2012 6:09 am
by Albert Wiersch
mdevette wrote:Yes, sure, I'm willing to help on this. Just let me know.

Marco


Great, thanks. I'll get back to you. I have a few things I need to complete first.

Re: Website inventory with CSE?

Posted: Fri Oct 12, 2012 6:16 pm
by Albert Wiersch
Hi Marco,

I hope to be able to look into this next week. Can you provide specific details on what you'd like the website inventory to include? I assume you want a CSV (comma-separated values) file. Exactly what fields and what information did you want in there? Please be as specific as possible.

Thanks.

Re: Website inventory with CSE?

Posted: Fri Oct 12, 2012 8:55 pm
by Lou
Albert Wiersch wrote:Exactly what fields and what information did you want in there? Please be as specific as possible.


Albert, after a quick run of pageTrawler I see two things:
  • It would be nice to be able to exclude directories.
    When I ran it, it wasted time running through the phpBB subdirectory, which gives me no information.
  • It would be nice to know where/how a file was found.
    pageTrawler found two 404 references. There is no clue where these links are.
    I have not seen any indication of a 404 before. Of course just because I don't find them doesn't mean they are not there.

Re: Website inventory with CSE?

Posted: Tue Oct 16, 2012 12:45 pm
by Albert Wiersch
Hi Lou and all,

If you'd like to try making a "website inventory/content map" using CSE HTML Validator, then please download and install the latest v12 PUBLIC BETA (BETA 5), which I just released here:
http://www.htmlvalidator.com/freebeta/

You'll also need this 'user functions' file:
http://www.htmlvalidator.com/user-funct ... entmap.cfg

The 'user functions' file adds this additional functionality. It creates a simple CSV (comma-separated values) file with the document location, document title, document keywords, h1 text, and h2 text for every document that is validated. You can configure it differently if you want, but you will have to edit the code in the user functions file.
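To make the five fields concrete, here is a minimal Python sketch of the same idea. It is not the code in the 'user functions' file and does not use CSE HTML Validator's scripting language; the names `InventoryParser`, `inventory_row`, and `to_csv_line`, and the " | " separator for multiple headings, are assumptions made for illustration only.

```python
import csv
import io
from html.parser import HTMLParser


class InventoryParser(HTMLParser):
    """Collects title, meta keywords, and h1/h2 text from one document."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "keywords": [], "h1": [], "h2": []}
        self._current = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2"):
            self._current = tag
            self.fields[tag].append("")
        elif tag == "meta":
            attr_map = dict(attrs)
            if (attr_map.get("name") or "").lower() == "keywords":
                self.fields["keywords"].append(attr_map.get("content") or "")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        # Text inside nested tags (for example <em>) is still captured.
        if self._current:
            self.fields[self._current][-1] += data


def _join(items):
    """Normalize whitespace and join repeated values (assumed separator)."""
    return " | ".join(" ".join(item.split()) for item in items)


def inventory_row(location, html):
    """Return [location, title, keywords, h1, h2] for one document."""
    parser = InventoryParser()
    parser.feed(html)
    parser.close()
    names = ("title", "keywords", "h1", "h2")
    return [location] + [_join(parser.fields[name]) for name in names]


def to_csv_line(row):
    """Format one row as a CSV line, quoting fields that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    page = "<title>Home</title><h1>Welcome</h1><h2>News</h2>"
    print(to_csv_line(inventory_row("http://www.example.com/", page)), end="")
```

The real config file may order, join, or quote the fields differently, so treat the output layout here as a guess.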

One important thing you will have to do is to change the hardcoded filename to what you want (where you want the CSV data stored). Currently it is set to "T:\content_map.csv". To do this, simply edit the user functions config file with a text editor and make the change where this code is:
Code:
 $cmap.filename='T:\content_map.csv';


In the Validator Engine Options, Validator Engine->Config File page, set the 'user functions' file to the user functions config file and check the 'Enable potentially destructive functions' option because this feature will need writeFile() to write the data. If it is not checked, then no data will be written to the CSV file.

Finally, be sure to reload the configuration or restart CSE HTML Validator for the new 'user functions' file to take effect.

If everything is working well, then every time you validate an HTML/XHTML document (using the editor or the Batch Wizard), it should append a line to the CSV file. It will just keep appending data, so you'll need to delete or move or rename the file when you want to start a new file.
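Because the file only ever grows, it can help to move the old one aside before each new crawl. The following Python sketch shows one way to do that outside CSE HTML Validator; the function name `start_new_csv` and the timestamp suffix are assumptions, not part of the product.

```python
import os
import tempfile
import time


def start_new_csv(path, now=None):
    """Rename an existing CSV so the next run starts a fresh file.

    Returns the archive path, or None if there was nothing to rename.
    `now` is an optional epoch timestamp (defaults to the current time).
    """
    if not os.path.exists(path):
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    root, ext = os.path.splitext(path)
    archived = f"{root}-{stamp}{ext}"
    os.replace(path, archived)  # overwrites an archive with the same stamp
    return archived


if __name__ == "__main__":
    # Demonstrate on a throwaway file instead of the real content map.
    folder = tempfile.mkdtemp()
    demo = os.path.join(folder, "content_map.csv")
    with open(demo, "w", encoding="utf-8") as handle:
        handle.write("http://www.example.com/,Home,,Welcome,\n")
    print(start_new_csv(demo))
```

Run it before starting the Batch Wizard so the validator creates a new, empty file on its first append.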

This is an example of the improved customization ability in CSE HTML Validator v12. It will let you add your own fields and other information you want. However, not everything may be available, like link checking status. If there is anything that you'd really like to see added, then please let me know.

Re: Website inventory with CSE?

Posted: Tue Oct 16, 2012 12:50 pm
by Albert Wiersch
Lou wrote:Albert, after a quick run of pageTrawler I see two things:
  • It would be nice to be able to exclude directories.
    When I ran it, it wasted time running through the phpBB subdirectory, which gives me no information.


Hi Lou, if you use the Batch Wizard, then you can exclude folders from being checked using the 'Don't process these targets' option in the Target List Options Tab of the Batch Wizard.
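For anyone filtering a URL list before or after a crawl, the folder exclusion can be sketched in Python as below. This only illustrates the idea; the exact matching rules of the 'Don't process these targets' option are not described in this thread, so the plain path-prefix test and the name `is_excluded` are assumptions.

```python
from urllib.parse import urlsplit


def is_excluded(url, excluded_folders):
    """Return True if the URL's path lies inside any excluded folder.

    Matching is a plain path-prefix test (an assumption, not the
    Batch Wizard's actual rule).
    """
    path = urlsplit(url).path
    for folder in excluded_folders:
        prefix = "/" + folder.strip("/") + "/"
        if path.startswith(prefix) or path == prefix.rstrip("/"):
            return True
    return False


if __name__ == "__main__":
    urls = [
        "http://www.example.com/index.html",
        "http://www.example.com/phpBB/viewtopic.php?t=1",
    ]
    for url in urls:
        print(url, is_excluded(url, ["phpBB"]))
```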

Lou wrote:
  • It would be nice to know where/how a file was found.
    pageTrawler found two 404 references. There is no clue where these links are.
    I have not seen any indication of a 404 before. Of course just because I don't find them doesn't mean they are not there.


The current CSV config file doesn't support link information. Of course you could do a link check using the Batch Wizard and the link reports should tell you which documents contain bad links. Due to the current architecture of CSE HTML Validator, adding link information to the CSV file is not simple, because validation is a separate process from link checking.
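If you already have link data from a separate link check, answering "where was this 404 found?" is a matter of inverting the page-to-link mapping. The Python sketch below assumes two hypothetical inputs that you would have to build yourself (a page-to-links dictionary and a link-to-status dictionary); it does not read CSE HTML Validator's link reports.

```python
from collections import defaultdict


def referrers_of_bad_links(page_links, link_status):
    """Map each bad link to the pages that reference it.

    page_links: page URL -> list of link URLs found on that page.
    link_status: link URL -> HTTP status code from a separate link check.
    Links with no recorded status are skipped.
    """
    report = defaultdict(set)
    for page, links in page_links.items():
        for link in links:
            status = link_status.get(link)
            if status is not None and status >= 400:
                report[link].add(page)
    return {link: sorted(pages) for link, pages in report.items()}


if __name__ == "__main__":
    pages = {
        "http://www.example.com/": ["http://www.example.com/old.html"],
        "http://www.example.com/about.html": ["http://www.example.com/old.html"],
    }
    statuses = {"http://www.example.com/old.html": 404}
    print(referrers_of_bad_links(pages, statuses))
```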

    Re: Website inventory with CSE?

    Posted: Tue Oct 16, 2012 8:40 pm
    by Lou
    Albert Wiersch wrote:Hi Lou and all,

    If everything is working well,

    Albert, how can we figure out why everything is not working well?
    • I downloaded the new files
    • edited the new config file to the output file location
    • made changes to the CSE HTML configuration.

    I don't get an output file when validating a single file or using the batch wizard.

    Your instructions seem straightforward, but...

    Re: Website inventory with CSE?

    Posted: Tue Oct 16, 2012 10:01 pm
    by Albert Wiersch
    Lou wrote:I don't get an output file when validating a single file or using the batch wizard.

    Your instructions seem straightforward, but...


    Hi Lou,

    Please:

    1. Make sure CSE HTML Validator has write access (permission) to the file & folder that you've set.
    2. Make sure you checked the 'Enable potentially destructive functions' option.

    Can you confirm the above two conditions are met?
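The first condition can be checked quickly from outside the program. This Python sketch tries to open the target file for appending; it tests the permissions of the account running the script, which is assumed (not guaranteed) to be the same account that runs CSE HTML Validator. The name `can_write_csv` is made up for this example.

```python
import os
import tempfile


def can_write_csv(path):
    """Return (ok, message) after trying to open `path` for appending.

    Note: a successful check creates an empty file if none exists yet.
    """
    folder = os.path.dirname(os.path.abspath(path))
    if not os.path.isdir(folder):
        return False, f"folder does not exist: {folder}"
    try:
        with open(path, "a", encoding="utf-8"):
            pass
    except OSError as exc:
        return False, f"cannot open for appending: {exc}"
    return True, "ok"


if __name__ == "__main__":
    # Replace this with the filename set in the user functions file.
    target = os.path.join(tempfile.mkdtemp(), "content_map.csv")
    print(can_write_csv(target))
```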

    Re: Website inventory with CSE?

    Posted: Tue Oct 16, 2012 11:06 pm
    by Lou
    So when you change a config file, you need to reload the program - right?

    Duh. All seems to work fine now.

    Re: Website inventory with CSE?

    Posted: Tue Oct 16, 2012 11:36 pm
    by Albert Wiersch
    Lou wrote:So when you change a config file, you need to reload the program - right?


    Yep! Sorry, I should have explicitly said that. I've updated the directions.

    Lou wrote:All seems to work fine now.


    Great!

    Re: Website inventory with CSE?

    Posted: Wed Oct 17, 2012 6:13 am
    by Lou
    Albert Wiersch wrote:Yep! Sorry, I should have explicitly said that. I've updated the directions.

    Some days I'm a bit slower than others. Now I need to work on fixing the (lack of) content this add-on reveals.

    Re: Website inventory with CSE?

    Posted: Wed Jan 30, 2013 11:35 am
    by mdevette
    Hi Albert,

    I just installed the latest version (12).
    Is the feature available in that version? Do we have to follow all of the same steps that you have described in this topic (that applied to the beta version) or is there an easier way?

    Thanks,

    marco

    Re: Website inventory with CSE?

    Posted: Wed Jan 30, 2013 4:37 pm
    by Albert Wiersch
    mdevette wrote:Hi Albert,

    I just installed the latest version (12).
    Is the feature available in that version? Do we have to follow all of the same steps that you have described in this topic (that applied to the beta version) or is there an easier way?


    Hello,

    Yes, this is available in v12, but only in the pro and enterprise editions. The steps should be the same, except use the release version instead of the old BETA.

    Here are the updated steps:

    1. If you haven't already, download and install CSE HTML Validator v12, pro or enterprise edition. The lite and standard editions don't support this ability, and neither do versions prior to v12.

    2. Download this 'user functions' file: http://www.htmlvalidator.com/user-funct ... entmap.cfg

    3. Change the hardcoded filename in the user functions file to what you want (where you want the CSV data stored). Currently it is set to "T:\content_map.csv". To do this, simply edit the user functions config file with a text editor and make the change where this code is:

    Code:
     $cmap.filename='T:\content_map.csv';


    Make sure CSE HTML Validator has write access (permission) to the file & folder that you've set.

    4. In the Validator Engine Options, Validator Engine->Config File page, set the 'user functions' file to the user functions config file and check the 'Enable potentially destructive functions' option because this feature will need writeFile() to write the data. If it is not checked, then no data will be written to the CSV file.

    5. Finally, be sure to reload the configuration or restart CSE HTML Validator for the new 'user functions' file to take effect.

    The 'user functions' file adds this additional functionality. It creates a simple CSV (comma-separated values) file with the document location, document title, document keywords, h1 text, and h2 text for every document that is validated. You can configure it differently if you want, but you will have to edit the code in the user functions file.

    If everything is working well, then every time you validate an HTML/XHTML document (using the editor or the Batch Wizard), it should append a line to the CSV file. It will just keep appending data, so you'll need to delete or move or rename the file when you want to start a new file.
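Once a crawl has filled the CSV file, a short script can point out documents with missing content. The Python sketch below assumes the file has no header row and that the columns appear in the order listed above (location, title, keywords, h1, h2); check your own file before relying on that. The names `FIELDS` and `find_gaps` are made up for this example.

```python
import csv
import io

# Assumed column order, taken from the field list in this post.
FIELDS = ("location", "title", "keywords", "h1", "h2")


def find_gaps(csv_text):
    """Return (location, [missing field names]) for rows with empty fields."""
    gaps = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        # Pad short rows so every field name gets a value.
        record = dict(zip(FIELDS, row + [""] * len(FIELDS)))
        missing = [name for name in FIELDS[1:] if not record[name].strip()]
        if missing:
            gaps.append((record["location"], missing))
    return gaps


if __name__ == "__main__":
    sample = (
        'http://www.example.com/,Home,"news, updates",Welcome,Latest\r\n'
        "http://www.example.com/about.html,About,,,\r\n"
    )
    for location, missing in find_gaps(sample):
        print(location, "is missing:", ", ".join(missing))
```

To use it on a real file, read the file's contents into a string first and pass that to `find_gaps`.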

    This is an example of the improved customization ability in CSE HTML Validator v12. It will let you add your own fields and other information you want. However, not everything may be available, like link checking status. If there is anything that you'd really like to see added, then please let me know.