Speeding up Batch Validate revisited

roedygr
Rank V - Professional
Posts: 370
Joined: Fri Feb 17, 2006 5:22 am
Location: Victoria BC Canada

Post by roedygr » Fri Mar 21, 2014 5:45 pm

I have asked for this many times. It is so simple to do in Java, I find it odd you would not just have knocked it off in C++. It will speed up batch validate 100 fold. My C++ is pretty rusty, but I will try to talk in those terms.

Let us say you have 3000 files in a batch. You work for a few days making edits with half a dozen different tools. You then want to reverify the batch. Nearly all the files are unchanged since the last verify. You could set the "changed since" number manually, but you don't remember how long ago you last verified. You want the effect of having that set automatically, but also to include files that have errors you have not yet corrected.


Here is how to get the effect:
PLAIN METHOD
You need a file of records containing pairs, timestamp and fully qualified file name.
This might be a CSV file, a binary file, a tiny embeddable SQL database.
In it you track all the files in the batch you have ever verified and when, but only the ones that were error/warning free for the batch.
Internally the data can be represented as a C++ unordered_map.

When the batch verifier is charging along, before it verifies a file, it looks it up in the unordered_map. If it finds it and the current timestamp of the file is less than or equal to the database timestamp, it can bypass the verify, and just generate a dummy result.

Otherwise it verifies the file, and updates the unordered_map. When the HTML Validator batch completes, it flushes the unordered_map to a file.
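
A minimal Python sketch of the PLAIN method described above, not the validator's actual implementation. It assumes the cache is a CSV file of timestamp/filename pairs, that the stored timestamp is the file's modification time at its last clean verify (rather than the wall-clock time of the verify), and that a hypothetical `validate(path)` callback returns True when a file has no errors and no warnings. The names `load_cache`, `save_cache` and `run_batch` are illustrative only.

```python
import csv
import os


def load_cache(cache_path):
    """Read 'timestamp,filename' records into a dict (empty if no file yet)."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, newline="", encoding="utf-8") as f:
            for stamp, name in csv.reader(f):
                cache[name] = float(stamp)
    return cache


def save_cache(cache, cache_path):
    """Flush the dict back to disk when the batch completes."""
    with open(cache_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for name, stamp in cache.items():
            writer.writerow([repr(stamp), name])


def run_batch(files, cache, validate):
    """Validate only files that changed since their last clean verify.

    `validate(path)` is a hypothetical callback that returns True when the
    file has no errors and no warnings.
    Returns the list of files that were actually validated.
    """
    validated = []
    for path in files:
        mtime = os.path.getmtime(path)
        if path in cache and mtime <= cache[path]:
            continue  # unchanged since its last clean verify: bypass it
        validated.append(path)
        if validate(path):
            cache[path] = mtime  # remember the clean result
        else:
            cache.pop(path, None)  # failing files stay out of the cache
    return validated
```

A file with an uncorrected error never enters the cache, so it is rechecked on every run, which is the limitation the FANCY method addresses.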

FANCY METHOD
In the PLAIN method, if every file had a warning that had never been corrected, you would get no speedup. To fix that, now track your files with three fields per record. The third field is a pointer to a file containing the results from the most recent verify run if there are errors/warnings.

If you change the HTML Validator or batch configuration, that invalidates the cache.
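
A hedged Python sketch of the FANCY method's three-field record: timestamp, file name (the dict key) and a pointer to a saved report. It assumes a hypothetical `validate(path)` callback that returns a list of non-empty, single-line error/warning messages, and it invents the `CacheEntry` and `check` names and the report file naming purely for illustration. Invalidation on a configuration change is not shown.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    mtime: float                # file timestamp at the last verify
    report_path: Optional[str]  # saved messages, or None if the file was clean


def check(path, cache, validate, report_dir):
    """Return the messages for `path`, replaying the saved report if unchanged.

    `validate(path)` is a hypothetical callback returning a list of
    error/warning strings (an empty list when the file is clean).
    """
    mtime = os.path.getmtime(path)
    entry = cache.get(path)
    if entry is not None and mtime <= entry.mtime:
        if entry.report_path is None:
            return []  # unchanged and clean last time
        with open(entry.report_path, encoding="utf-8") as f:
            return f.read().splitlines()  # unchanged: reuse the old messages

    messages = validate(path)
    report_path = None
    if messages:
        # A stable digest of the path gives a flat, filesystem-safe report name.
        digest = hashlib.sha1(path.encode("utf-8")).hexdigest()
        report_path = os.path.join(report_dir, digest + ".txt")
        with open(report_path, "w", encoding="utf-8") as f:
            f.write("\n".join(messages))
    cache[path] = CacheEntry(mtime, report_path)
    return messages
```

With this shape, a file that still carries an old warning costs one small file read on later runs instead of a full validation.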

CATCH
If the document has messages, but no errors/warnings, you will not see them with the PLAIN method.

Albert Wiersch
Site Admin
Posts: 3222
Joined: Sat Dec 11, 2004 9:23 am
Location: Near Dallas, TX

Post by Albert Wiersch » Fri Mar 21, 2014 6:05 pm

There are 2 main issues with this that I can think of:
1. The 'devil is in the details'. The general concept seems easy, but the actual implementation could be much more time consuming than anticipated.
2. There doesn't seem to be enough demand for a feature like this. I haven't had any other requests similar to this.

What about the 'Limit to Age' feature in the Batch Wizard? If you specify a folder target, then in the target properties you can limit it to only checking documents in that folder that have been modified within the last x days. This would prevent checking old documents and could significantly increase the speed of the check. Is this helpful to you?

It may also be possible to add similar functionality using a custom user function. You could limit the checking to documents that have been modified within the last x days.

Please let me know what you think about the above alternatives.
Albert Wiersch

Post by roedygr » Thu Apr 03, 2014 7:06 am

>There doesn't seem to be enough demand for a feature like this. I haven't had any other requests similar to this.

1. There is an implied "make it faster" change request to every computer program. Users presume the vendor is doing the best he can.

2. HTML editor users are not programmers. They don't have the experience to think up an internals request.

3. When is this useful?
I generate a lot of my HTML with programs by macro expansion of magic comments. I also do a lot of batch editing, using a group search/replace on all files. I need to verify the results of such changes. I don't know which files were changed, but typically maybe only 5 percent of them were. To revalidate them all takes perhaps an hour. With this new feature, it would be 20 times faster.

Each day I do edits to correct dead URLs and redirected URLs. There is no pattern to the pages I fix. I want to verify the entire site even though most pages did not change.

The problem with the current time limit feature is I don't know the magic limit number to use. If I make it too small, I miss some validations. If I make it too big, I end up validating everything anyway. The other problem with using time limit is it will not catch old flaws I failed to correct earlier. In most cases it needs to be the time since the last batch validation, a number I can only guess at. In the meantime, perhaps you could do an auto time limit that calculates time since last successful finishing of that script.


Devil Details
One brute force technique you can use is if anything goes wrong (corruption), just delete the cache and start over.
Treat old cache entries older than the time the current batch script was modified as if they had been freshly modified.
Treat old cache entries older than the time any config that could affect the validation was changed as if they had been freshly modified.
If a script is interrupted, you will have added cache entries only for the files that validated OK before the interruption. It all comes out in the wash.
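
The staleness rules above can be sketched as one small Python predicate. This is only an illustration under assumptions: it supposes each cache entry records when the file was last validated cleanly (`validated_at`), and that the caller supplies the modification times of the batch script and of any config files that affect validation. The name `entry_is_usable` and its parameters are made up for this example.

```python
def entry_is_usable(file_mtime, validated_at, invalidation_times):
    """Return True when a cached clean result can still be trusted.

    file_mtime         -- current modification time of the HTML file
    validated_at       -- when the file was last validated cleanly
    invalidation_times -- modification times of the batch script and of any
                          config that could affect validation
    """
    if file_mtime > validated_at:
        return False  # the file changed after its last clean validation
    # An entry older than any script/config change is treated as stale.
    return all(validated_at >= t for t in invalidation_times)


# Validated at t=200, file last written at t=100, script edited at t=150:
print(entry_is_usable(100.0, 200.0, [150.0]))  # True
# Same entry, but a config file was changed later, at t=300:
print(entry_is_usable(100.0, 200.0, [150.0, 300.0]))  # False
```

Deleting the whole cache after corruption needs no special handling here, since a missing entry simply means the file is validated again.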

Things are always more complicated than they appear beforehand, but I am pretty sure they would not be unmanageable. This is far from rocket science.

Post by roedygr » Thu Apr 03, 2014 7:28 am

Albert Wiersch wrote:
>There are 2 main issues with this that I can think of:
>1. The 'devil is in the details'.

We Java programmers are very comfortable with HashMaps: getting them loaded, saved, saved on error, etc.

I can see you are running in a 32-bit space, so are likely very concerned about saving RAM.

So here is a simpler, slower, but RAM-sparing implementation.

Your cache on disk is filled with zero length files. The name mirrors the entire universe of files being validated with a structure like
cache/E/mindprod/jgloss/jdk.html
The date of the file tells when the file was last successfully validated.
If the cache file entry is missing, then it has either not yet been validated, or has not been validated successfully (and unchanged).

When you change the config, you delete the directory tree.

You can periodically prune the tree of cache files whose match has disappeared, or whose match has a more recent date. You try to keep the tree containing only perfect files.

In a fancier version the cache file contains the list of errors if the file failed. Then you don't have to regenerate that either when you revalidate an old file.
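
A rough Python sketch of the zero-length marker file scheme, for illustration only. It assumes ordinary drive-letter or POSIX paths (UNC paths are not handled), Python 3.8 or later for `unlink(missing_ok=True)`, and that the caller reports whether the validation came back clean. The function names `marker_for`, `is_fresh` and `record` are invented for this example.

```python
import os
from pathlib import Path


def marker_for(cache_root, target):
    """Map a target to its marker, e.g. E:\\mindprod\\x.html -> cache/E/mindprod/x.html."""
    target = Path(target).resolve()
    drive = target.drive.replace(":", "")  # "E:" -> "E"; empty on POSIX
    return Path(cache_root, drive, *target.parts[1:])


def is_fresh(cache_root, target):
    """True if the target validated cleanly and has not changed since."""
    marker = marker_for(cache_root, target)
    try:
        return os.path.getmtime(target) <= os.path.getmtime(marker)
    except FileNotFoundError:
        return False  # no marker: never validated, or last validation failed


def record(cache_root, target, clean):
    """Touch the marker after a clean validation, remove it otherwise."""
    marker = marker_for(cache_root, target)
    if clean:
        marker.parent.mkdir(parents=True, exist_ok=True)
        marker.touch()  # zero-length file; its date is the validation time
    else:
        marker.unlink(missing_ok=True)
```

Clearing the cache after a config change then amounts to removing the `cache_root` tree, for example with `shutil.rmtree(cache_root, ignore_errors=True)`.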

Post by Albert Wiersch » Thu Apr 03, 2014 7:57 am

Hi Roedy,

Thanks for the ideas. I am intrigued by your idea of using the filesystem as a database. Combining TNPL with those cache files, there may be a practical solution that works well enough for you and whose implementation time I can justify. However, there would still be an issue if you change the configuration. The simple solution would be to manually delete all the cache files if you want everything validated again because you changed the configuration.

I will probably begin working on v15 soon. Would you be able to work with me and test out possible solutions using a v15 BETA?

Post by roedygr » Thu Apr 03, 2014 7:07 pm

>There doesn't seem to be enough demand for a feature like this.

Keep in mind my suggestion does not change the user interface in any way. It is not a user feature. It is an optimisation.

Post by Albert Wiersch » Thu Apr 10, 2014 10:06 am

Hello,

I will be emailing you a link to v14.0210, which you can use to try the below TNPL Batch Wizard functions. You can append it to your current (if any) user functions file for any Batch Wizard target list you want to use it with.

I've added new functions to support this, some of which I've been thinking of adding for a while, like the JSON encoding and decoding functions.

You will need to set $cachefolder to the folder you want to use, to store the "cache files" that are used to store the dates.

If it works, it will check the $cachefolder for every target to see if a cache file exists for that target.
1) If the cache file doesn't exist then the target will be validated.
2) If it does exist, then the target is checked using the date information in the cache file to see if it's changed since the last validation with no errors and no warnings. If it has changed, then it's validated.

When the target is finished validating:
1) if there are no errors and no warnings then a cache file is made storing the last write time of the file (year, month, day, hour, and minute) to prevent the file from being validated again until its last write time is after the time stored in the cache file.
2) if there are any errors or warnings then it makes sure there is no cache file by deleting any that might exist.

NOTE: Link checking results are not considered because they are not available until much later, as the link checker runs in the background. So if there is a bad link in the target but it has no validator errors and no validator warnings, then it will still create the cache file.

I'd really like to know how this works for you, and hope that it addresses the performance issue.

Code:

/*********************
 * Set $cachefolder in onBeforeMainStart() to the folder to use to store data (it must end in a backslash)
 *********************/

function onBeforeMainStart() {
 $cachefolder='T:\\cache\\';
}

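function onTargetCanAdd() {
 // Cache file name: the target path with every character other than
 // letters, digits, '.', '-' and '_' replaced by '_'.
 $cachefile=$cachefolder+replaceRegEx($otca_target,'[^a-zA-Z0-9\.\-\_]','_');
// ProgressMessage('$cachefile: '+$cachefile);
 $cachefilecontents=readFile($cachefile);

 if $cachefilecontents.isSet() {
  $cachedata=json_decode($cachefilecontents);
  $fileinfo=getFileInfo($otca_target,1);
  
  if $fileinfo.isSet() {
   // Skip the target unless its last write time is later than the cached one.
   $otca_add=false;

   // Compare year, month, day, hour and minute, most significant field first.
   if $fileinfo.lastwrite_year>$cachedata.lastwrite_year { $otca_add=true; }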
   else { if $fileinfo.lastwrite_year==$cachedata.lastwrite_year {
   if $fileinfo.lastwrite_month>$cachedata.lastwrite_month { $otca_add=true; }
   else { if $fileinfo.lastwrite_month==$cachedata.lastwrite_month {
   if $fileinfo.lastwrite_day>$cachedata.lastwrite_day { $otca_add=true; }
   else { if $fileinfo.lastwrite_day==$cachedata.lastwrite_day {
   if $fileinfo.lastwrite_hour>$cachedata.lastwrite_hour { $otca_add=true; }
   else { if $fileinfo.lastwrite_hour==$cachedata.lastwrite_hour {
   if $fileinfo.lastwrite_min>$cachedata.lastwrite_min { $otca_add=true; }
   }}}}}}}}
  }
 }
}

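function onTargetProcessed() {
 $cachefile=$cachefolder+replaceRegEx(getValueString(5),'[^a-zA-Z0-9\.\-\_]','_');

 // Errors or warnings: make sure no cache file is left for this target.
 if getValueInt(1) || getValueInt(2) {
  deleteFile($cachefile);
 }
 else {
  // Clean result: store the target's file info (with its last write time) as JSON.
  $fileinfo=getFileInfo(getValueString(5),1);
  writeFile($cachefile,json_encode($fileinfo),2);
 }
}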
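
The chain of nested comparisons in onTargetCanAdd() amounts to a lexicographic comparison of (year, month, day, hour, minute). A short Python sketch of the same test, for readers who find the nesting hard to follow; the dictionary keys are borrowed from the TNPL field names above, while the function name and the use of plain dicts are assumptions made for this illustration.

```python
FIELDS = ("lastwrite_year", "lastwrite_month", "lastwrite_day",
          "lastwrite_hour", "lastwrite_min")


def changed_since_cached(fileinfo, cachedata):
    """True if the file's last write time (to the minute) is after the cached one."""
    current = tuple(fileinfo[f] for f in FIELDS)
    cached = tuple(cachedata[f] for f in FIELDS)
    # Tuples compare field by field, most significant first.
    return current > cached
```

Because the comparison stops at minutes, a file rewritten within the same minute as its cached last write time would not be picked up by a test of this shape.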

Post by roedygr » Thu Apr 10, 2014 5:42 pm

To clear the cache you don't have to literally delete all the cache files. You need to record the timestamp of the date the config last changed or the cache was otherwise cleared. When you read a cache entry you ignore it if the file date (as opposed to the embedded date) is prior to that. You can provide a bat file to clear the cache with del *.* for support. You might literally clear obsolete cache entries on fire up. Even if you literally clear the cache every time the config changes, only the first change will take detectable time.

You might as well put the cache in C:\Users\xxx\AppData\Roaming\AI Internet Solutions\CSE HTML Validator\14.0\cache
because the files are not voluminous.


If the cache is cleared unnecessarily it is not the end of the world. Things just behave as they do now for one more batch.
Last edited by roedygr on Thu Apr 10, 2014 6:01 pm, edited 1 time in total.

Post by roedygr » Thu Apr 10, 2014 5:53 pm

Where do I insert that hunk of code?

Post by Albert Wiersch » Thu Apr 10, 2014 7:44 pm

roedygr wrote:
>Where do I insert that hunk of code?

Basically, put it in a text file, save it, and specify it in the Batch Wizard, 'Target List Options' tab as the 'user functions file'. If you have a file already specified there then you can append it to the file already specified. Note that you will need v14.0210 or later (I emailed you a download link).

Tomorrow I am going to try your idea of using empty files and the date on the files instead of storing date information in the file. I think you are right that it will work and be more efficient. If it works, then I will post the code for the new method.

Post by roedygr » Thu Apr 10, 2014 9:02 pm

When you put this into production, all batch scripts should share the same cache.
Sometimes scripts are subsets of others.

Post by roedygr » Thu Apr 10, 2014 9:11 pm

There is also the matter of whether you include warnings/errors in the autoload and in the cache.
There is also whether you ignore files on the batch summary list or simulate them as if no caching were being used.

Post by roedygr » Thu Apr 10, 2014 9:22 pm

The batch script is running, but no files are appearing in the cache directory.

Post by roedygr » Thu Apr 10, 2014 9:31 pm

The script finished. Still no cache files. I ran the script again. It ran at normal speed.

Post by Albert Wiersch » Thu Apr 10, 2014 9:57 pm

roedygr wrote:
>When you put this into production, all batch scripts should share the same cache.
>Sometimes scripts are subsets of others.

I don't really plan to put this "into production" as a standard feature. I consider it a customization using TNPL and the user functions. I may put the script in the help file as an example for others who might want to do something similar, and I'll definitely point it out to anyone else who requests this same type of enhancement/feature.

As for no files being created, I'm sorry that I forgot to mention that you need to go to the Validator Engine Options and the 'Config File' page and enable the option that "enables potentially destructive functions like writeFile()" before the script will be able to create any files. You should then see the files being created as targets are processed as long as the folder is correct and CSE HTML Validator has permission to create files there.