Data corruption when using Tidy and utf-8
Posted: Sat Oct 03, 2009 2:10 pm
I was trying to use Tidy in TopStyle to convert HTML 4.0 to XHTML. After some trial and error I found that Tidy does not handle non-ASCII characters in charsets other than (basically) Latin1, so the only option for Central European charsets (Windows-1250 or ISO-8859-2) is to ensure both input and output use UTF-8. Of course, UTF-8 files open correctly in TopStyle.
The problem is that while Tidy performs the task correctly, the data it returns to TopStyle are not correctly interpreted. In the result pane, raw utf-8 bytes are displayed, and they are retained after clicking "Copy to active editor".
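To illustrate the symptom, here is a minimal Python sketch of what "raw UTF-8 bytes displayed" looks like. It assumes the receiving side decodes Tidy's UTF-8 output with a single-byte code page (Windows-1250 is my guess, not something I have confirmed about TopStyle), and the sample word is made up for the demonstration.

```python
# Hypothetical sample text containing one Central European character.
original = "čaj"

# What Tidy emits when its output encoding is set to UTF-8:
# "č" becomes two bytes, the ASCII letters stay one byte each.
utf8_bytes = original.encode("utf-8")

# Assumption: the host application reads those bytes as Windows-1250.
# errors="replace" covers byte values that cp1250 leaves undefined.
misread = utf8_bytes.decode("cp1250", errors="replace")

# Decoding the same bytes as UTF-8 gives the text back unchanged.
roundtrip = utf8_bytes.decode("utf-8")

print(len(original), len(utf8_bytes))  # character count vs. byte count
print(misread)                         # one garbled character per byte
print(roundtrip == original)
```

Each non-ASCII character turns into two garbled characters, which matches what I see in the result pane. If this guess is right, the bytes coming back from Tidy are fine and only the decoding step on the receiving side is wrong.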
Here is my Tidy configuration. (This is the only set of options that allows Tidy to do the job at all. Any other settings cause Tidy either to replace non-ASCII characters with entities or to reduce them to the 0-127 ASCII range, which corrupts the data.)
And this is the result - original text in the top pane, text from Tidy in the bottom pane:
I think I have eliminated the possibility that Tidy is somehow at fault here, since I got the correct result using Tidy with the same set of options in a different commercial HTML editor (a TopStyle competitor, so I won't name it here).
Is this something that can be fixed in TopStyle?