
Data corruption when using Tidy and utf-8

Posted: Sat Oct 03, 2009 2:10 pm
by tranglos
I was trying to use Tidy in TopStyle to convert HTML 4.0 to XHTML. After some trial and error I found out that Tidy (basically) does not handle non-ASCII characters in charsets other than Latin1, so the only option for Central European charsets (Windows-1250 or ISO-8859-2) is to ensure that both input and output use utf-8. Of course, utf-8 files open correctly in TopStyle.
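For anyone starting from a Windows-1250 or ISO-8859-2 file, the re-encoding step could look roughly like the sketch below. This is only an illustration in Python, not anything TopStyle or Tidy does internally; the function name `reencode_to_utf8` and the sample text are made up for the example.

```python
def reencode_to_utf8(data: bytes, source_encoding: str = "cp1250") -> bytes:
    """Decode bytes from a legacy charset and re-encode them as UTF-8.

    source_encoding is an assumption about the input file; use
    "iso8859_2" for ISO-8859-2 documents.
    """
    return data.decode(source_encoding).encode("utf-8")


# Hypothetical sample: Polish text as it would be stored in a Windows-1250 file.
sample = "Zażółć gęślą jaźń"
legacy = sample.encode("cp1250")

utf8 = reencode_to_utf8(legacy)

# The UTF-8 form is longer, because each accented letter takes two bytes.
print(len(legacy), len(utf8))
```

The output of this step is what would then be handed to Tidy with utf-8 set for both input and output.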

The problem is that while Tidy performs the task correctly, the data it returns to TopStyle are not correctly interpreted. In the result pane, raw utf-8 bytes are displayed, and they are retained after clicking "Copy to active editor".
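To show what "raw utf-8 bytes" means here, this is a small Python sketch of the symptom. It assumes that the receiving side reads Tidy's utf-8 output as a single-byte charset (Latin-1 is used for the demonstration); I don't know what TopStyle actually does internally, so treat that as a guess.

```python
# Hypothetical sample with Central European letters.
text = "żółć"

# What Tidy would emit when its output encoding is utf-8.
raw = text.encode("utf-8")

# Assumed failure mode: the bytes are read one byte per character
# instead of being decoded as utf-8.
misread = raw.decode("latin-1")

# Every accented letter turns into two unrelated characters.
print(len(text), len(misread))

# The damage is reversible as long as the bytes themselves were not changed.
recovered = misread.encode("latin-1").decode("utf-8")
print(recovered == text)
```

If this is what happens, the bytes from Tidy are fine and only the decoding step on the way into the result pane is wrong.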

Here is my Tidy configuration. (This is the only set of options that allows Tidy to do the job at all. Any other settings cause Tidy either to replace non-ASCII characters with entities or to reduce them to the 0-127 ASCII range, which corrupts the data.)
[Attachment: 01-tidy-config.png. Tidy Convert to XHTML configuration]
And this is the result - original text in the top pane, text from Tidy in the bottom pane:
[Attachment: 02-tidy-result.png. Conversion result: note raw utf-8]
I think I have eliminated the possibility that Tidy is somehow at fault here, since I got the correct result using Tidy with the same set of options in a different commercial HTML editor (a TopStyle competitor, so I won't name it here).

Is this something that can be fixed in TopStyle?

Re: Data corruption when using Tidy and utf-8

Posted: Mon Oct 05, 2009 3:33 pm
by TopStyle Support
Fixed in 4.0.0.67

Thanks, Stefan.

Re: Data corruption when using Tidy and utf-8

Posted: Tue Oct 06, 2009 7:06 am
by tranglos
Fantastic, thanks a lot, Stefan!