Charset

For topics about current BETA or future releases, including feature requests.
ramona
Rank 0 - Newcomer
Posts: 1
Joined: Sun Jun 29, 2008 12:06 pm

Post by ramona » Sun Jun 29, 2008 12:47 pm

Hello
When I press the Fix Format and Tidy HTML button, why does it change my charset from UTF-8 to us-ascii? Many strange signs also appear in my text, like “three Doshasâ€.
I downloaded beta version 9 as you suggested and am working with it, but I still have this problem. I don't want your software to change my charset from UTF-8 to us-ascii when I tidy. Please tell me how my charset can remain UTF-8. How do I configure this? Please explain clearly and in detail, as I am not a technical person. That's why I bought this software, but it is creating another problem: many strange characters appear in my documents after I tidy with it. My pages appear fine in the browser with the UTF-8 charset, but after using this software all these strange signs appear.
See url http://www.jaisiyaram.com/acookingworkshop1.htm Please help. Thanks
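Editor's note: the "strange signs" described above look like what happens when UTF-8 bytes are read as a single-byte charset. The sketch below reproduces that effect in Python. It assumes the wrong charset is Windows-1252, which the thread does not confirm; the function names are made up for illustration.

```python
# Sketch: how UTF-8 text can turn into "strange signs" when its bytes are
# read with the wrong charset. Assumption (not confirmed in the thread):
# the wrong charset is Windows-1252.


def misread_utf8_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair_cp1252_mojibake(garbled: str) -> str:
    """Reverse the misreading: re-encode as Windows-1252, decode as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


original = "\u201cthree Doshas"   # a left curly quote followed by plain text
garbled = misread_utf8_as_cp1252(original)

print(garbled)                                      # the quote becomes three characters
print(repair_cp1252_mojibake(garbled) == original)  # the damage is reversible here
```

The single curly quote is stored as three bytes in UTF-8, so it shows up as three unrelated characters once the bytes are interpreted one at a time.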

MikeGale
Rank VI - Professional
Posts: 709
Joined: Mon Dec 13, 2004 1:50 pm
Location: Tannhauser Gate

Post by MikeGale » Sun Jun 29, 2008 4:05 pm

From my personal experience tidy is a tool I use only once on any page. (In fact I prefer not to use it at all.)

(I've found that it is not possible to set it up to do exactly what I want.)

You may be fairly happy with what it does. If so, ignore this. (Note that Tidy is not part of CSE but is a separate program.)

Do you use Tidy again and again on the same page, or just once?

Here's my suggestion:

1) Get the tidy settings right. I have found you need to convert the same page repeatedly until it does what you want. (In the process you find that some things cannot be set up in Tidy.)

2) Convert your target page, once only.

3) From then on edit that page with other tools.

I can't comment specifically on the format change issue.

Post by MikeGale » Sun Jun 29, 2008 9:12 pm

Setting up Tidy might be a bit daunting.

Here are some hints on how you might do it. (I haven't used this for a long time and don't expect to use it soon. Caveat emptor.)

1) Set up the Tidy profiles you want to use (see the CSE help), e.g. <profile name="My Settings" args="-config E:\Web\MyHTMLTidy.tdy" hint="Attempts to avoid Tidy problems, not perfect, this is from 2001/04/03" />
2) Create the file with the settings. See the sample below and adjust it.

Code:

markup:yes
wrap:255
tab-size:2
indent:yes
indent-spaces:2
hide-endtags:no
input-xml:no
output-xml:no
output-xhtml:yes
char-encoding:ISO2022
numeric-entities:no
quote-marks:no
quote-ampersand:no
quote-nbsp:yes
wrap-script-literals:no
uppercase-tags:no
break-before-br:no
uppercase-attributes:no
clean:no
write-back:yes
show-warnings:yes
split:no
add-xml-pi:no
doctype:LOOSE
fix-backslash:no
wrap-asp:no
drop-font-tags:no
word-2000:yes
tidy-mark:no
wrap-attributes:no
wrap-jste:no
wrap-php:no
assume-xml-procins:yes
logical-emphasis:no
drop-empty-paras:no
enclose-text:yes
fix-bad-comments:yes
keep-time:no
quiet:no
alt-text:
error-file:TidyOut.log
new-inline-tags:
new-blocklevel-tags:
new-empty-tags:
new-pre-tags:
logerrors:yes
runsilently:no
This is an old one (2001). You'd want to change a few things like character encoding. Check the Tidy documentation for that.
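Editor's note: changing the character encoding in a settings file like the one above comes down to editing one `name:value` line. Here is a minimal Python sketch of that edit. The helper name `set_tidy_option` and the shortened sample text are hypothetical; the value `utf8` is the one used later in this thread.

```python
# Sketch: update one option in a Tidy settings file made of "name:value" lines.
# set_tidy_option is a hypothetical helper, not part of Tidy or CSE HTML Validator.


def set_tidy_option(config_text: str, name: str, value: str) -> str:
    """Return config_text with `name` set to `value` (appended if missing).

    Later duplicate lines for the same option are dropped.
    """
    result = []
    found = False
    for line in config_text.splitlines():
        key, sep, _ = line.partition(":")
        if sep and key.strip() == name:
            if not found:
                result.append(f"{name}:{value}")
                found = True
            continue  # skip the old line (and any duplicates of it)
        result.append(line)
    if not found:
        result.append(f"{name}:{value}")
    return "\n".join(result) + "\n"


# Shortened stand-in for the settings file shown above.
sample = "markup:yes\nchar-encoding:ISO2022\nnumeric-entities:no\n"
updated = set_tidy_option(sample, "char-encoding", "utf8")
print(updated)
```

All other lines are passed through unchanged, so the rest of the profile keeps its original order.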

Albert Wiersch
Site Admin
Posts: 3416
Joined: Sat Dec 11, 2004 9:23 am
Location: Near Dallas, TX

Post by Albert Wiersch » Mon Jun 30, 2008 9:08 am

Hello,

I would suggest you try to force HTML Tidy to treat the input and output as UTF-8.

Try this with v9.0 BETA:

1. Edit the Tidy Profiles config file by going to CSE HTML Validator's editor and choosing Options->Validator Engine Options->Edit Configuration Files->HTML Tidy Profiles.

2. Insert a new "profile" tag in the "tidyprofiles" section that looks like this:

Code:

<profile name="Default (UTF-8 In/Out)" args="--char-encoding utf8" hint="Treat input and output as UTF-8." />
3. Save this file. Exit and reload CSE HTML Validator so the new profiles are loaded.

4. Load your document into the editor and go to the HTML Tidy/Fix and Format tool (with dialog).

5. Select the new profile you just created and click "Refresh".

Please let me know if that helps.
Albert Wiersch
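Editor's note: the profile above passes `--char-encoding utf8` to HTML Tidy. For anyone who wants to try the same arguments outside the editor, here is a rough Python sketch. It assumes a standalone `tidy` executable is on the PATH, which the thread does not say; the file names and function names are placeholders.

```python
# Sketch: call a standalone Tidy with the same arguments as the profile above.
# Assumptions: an executable named "tidy" is installed and on the PATH;
# "page.htm" and "page.tidy.htm" are placeholder file names.
import shutil
import subprocess
from typing import List, Optional


def build_tidy_command(src: str, dst: str, encoding: str = "utf8") -> List[str]:
    """Build the argument list: treat input and output as `encoding`."""
    return ["tidy", "--char-encoding", encoding, "-o", dst, src]


def run_tidy(src: str, dst: str) -> Optional[int]:
    """Run Tidy if it is available; return its exit code, or None if not found."""
    if shutil.which("tidy") is None:
        return None
    return subprocess.run(build_tidy_command(src, dst)).returncode


command = build_tidy_command("page.htm", "page.tidy.htm")
print(" ".join(command))
```

Writing to a separate output file keeps the original page untouched, so the two versions can be compared before anything is overwritten.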

Post by Albert Wiersch » Tue Jul 01, 2008 5:45 pm

Also, I just released v9.0 BETA 2 today. This might work in v9.0 BETA 2:

1. Go to File->Open with Encoding and choose the file and the proper encoding (Unicode (UTF-8) in your case). Click 'OK' to open the document.

2. Now use the HTML Tidy/Fix and Format tool (with dialog) and see if it works better.

If you try this, then please let me know how it works. I worked on this just before releasing BETA 2.
Albert Wiersch

Post by MikeGale » Tue Jul 01, 2008 7:39 pm

I had a very quick look at that document you mention.

In its default configuration, Tidy converts named entities to literal characters (like &acirc;&euro;&ldquo; in one of the original titles).

As I understand it, you want to retain those as named entities (&acirc;&euro;&ldquo; etc.).

Here's a tidy profile that does that:

Code:

markup:yes
wrap:0
tab-size:2
indent:yes
indent-spaces:2
hide-endtags:no
input-xml:no
output-xml:no
output-xhtml:yes
char-encoding:utf8
numeric-entities:no
preserve-entities:yes
quote-marks:no
quote-ampersand:no
quote-nbsp:yes
wrap-script-literals:no
uppercase-tags:no
break-before-br:no
uppercase-attributes:no
clean:no
write-back:yes
show-warnings:yes
split:no
doctype:LOOSE
fix-backslash:no
wrap-asp:no
drop-font-tags:no
word-2000:yes
tidy-mark:no
wrap-attributes:no
wrap-jste:no
wrap-php:no
assume-xml-procins:yes
logical-emphasis:no
drop-empty-paras:no
enclose-text:yes
fix-bad-comments:yes
keep-time:no
quiet:no
alt-text:
error-file:TidyOut.log
new-inline-tags:
new-blocklevel-tags:
new-empty-tags:
new-pre-tags:
The key bit in there is probably preserve-entities:yes.

Is that what you wanted?

I used version 9.0 Beta 2 to test this.
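Editor's note: the entity run &acirc;&euro;&ldquo; quoted in this post is worth a closer look. The Python sketch below decodes it and then treats the three resulting characters as Windows-1252 bytes read back as UTF-8. That second step is an inference, not something confirmed in the thread: it suggests the title may originally have held a single UTF-8 character that was misread before being stored as entities. The function name is hypothetical.

```python
# Sketch: decode the entity run quoted above and check what it may have been.
# Assumption (an inference, not confirmed in the thread): the three characters
# are UTF-8 bytes that were once misread as Windows-1252.
import html
import unicodedata


def decode_entity_mojibake(entities: str) -> str:
    """Turn named entities into characters, then undo a cp1252 misreading."""
    chars = html.unescape(entities)            # three separate characters
    return chars.encode("cp1252").decode("utf-8")


restored = decode_entity_mojibake("&acirc;&euro;&ldquo;")
print(len(restored))               # 1
print(unicodedata.name(restored))  # EN DASH
```

If that reading is right, preserve-entities:yes keeps the page looking the way it did before, but the entities themselves are already a garbled form of the intended character.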
