UTF-8 and PHP and html

PacificSpeech
Rank 0 - Newcomer
Posts: 4
Joined: Fri Jan 08, 2016 6:00 pm

UTF-8 and PHP and html

Post by PacificSpeech » Fri Jan 08, 2016 7:23 pm

I have used PHP for a number of years and CSE Validator for even longer, and the issues that arise from special characters in an HTML document normally have more than one solution, but yesterday I got stumped. A new project required extensive work with non-Latin characters, and a lot of them. Normally we can use "entities" like &amp;copy; written as &copy; in the source to produce ©, the copyright symbol, and others, which are found in the CSE editor drop-down lists in several places. Many numeric entities also exist in parallel with the named ones: &#169; is the numeric Unicode entity that is the same as &copy;. This is not really news, but many may be unaware that the CSE editor also handles copied and pasted non-Latin characters. For example,
高级妓女 or Ἀσπασία
These are saved properly in an HTML file, even when your operating system language is English. So there was some surprise when this method produced question marks ("??????") in the rendered output of this current PHP project. Needless to say, UTF-8 (Unicode) is implicated. But I've used this cut-and-paste method in a similar manner for years with the CSE editor without a problem.
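
As a quick check of the entity equivalence described above, here is a minimal PHP sketch. It assumes PHP 5.4 or later (for `ENT_HTML5`), and the variable names are only illustrative.

```php
<?php
// Decode the named and the numeric form of the copyright entity to UTF-8.
$named   = html_entity_decode('&copy;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
$numeric = html_entity_decode('&#169;', ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Both forms decode to the same character, U+00A9.
var_dump($named === $numeric);

// Show the UTF-8 bytes of the decoded character as hex.
echo bin2hex($named), "\n";
```

Entities like these are plain ASCII in the file, which is why they survive any save encoding, unlike pasted non-Latin characters.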

So, to simplify only a little, why does

Code: Select all

<?php print "<h1>Ἀσπασία</h1>" ?>
produce (source code)

Code: Select all

<h1>???????</h1>
?

Before everyone raises their hands, note that there indeed is

Code: Select all

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
in the head section.

James

Albert Wiersch
Site Admin
Posts: 3242
Joined: Sat Dec 11, 2004 9:23 am
Location: Near Dallas, TX

Re: UTF-8 and PHP and html

Post by Albert Wiersch » Fri Jan 08, 2016 8:31 pm

Hi James,

Do you have a URL I could access for testing? I'd like to see the HTTP headers. If the encoding is specified there (in the HTTP headers) then that overrides whatever might be specified in the document.
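
To illustrate the point about HTTP headers taking precedence, here is a minimal sketch of a PHP page that declares its encoding in the header itself. The helper name `contentTypeHeader` is hypothetical, and this assumes nothing has been sent to the browser before the call.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for an HTML page.
function contentTypeHeader(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() only works before any output has been sent.
if (!headers_sent()) {
    header(contentTypeHeader());
}

print "<h1>Ἀσπασία</h1>";
```

Note that the header only labels the bytes. If the file itself was saved in another encoding, the characters will still come out wrong.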
Albert Wiersch


Re: UTF-8 and PHP and html

Post by PacificSpeech » Sun Jan 10, 2016 8:05 pm

Test link http://31313.net/grayrabbit/_multifeed_asia.php

Headers will show UTF-8.
Apologies, my ISP has been down for the past 30 hrs. On backup system.
James


Re: UTF-8 and PHP and html

Post by Albert Wiersch » Sun Jan 10, 2016 9:39 pm

Hi James,

Sorry about your ISP being down. Hope it comes back up soon and you can get off the backup.

I can't see anything wrong. The problem may be related to the copying and pasting and/or saving the document with the correct encoding.

If you create a new PHP document in CSE HTML Validator, then copy and paste some non-Latin text into it, then go to "File->Save with Encoding" and make sure the encoding is "Unicode (UTF-8)", then click OK, does it work?

If this doesn't help, then can you provide me with exact steps that I can use to reproduce the problem? Steps like creating a new document, copying non-Latin text to it, saving it, and loading it from a web browser through a PHP server. I have a web server running PHP that I can use for this test.
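
One way to narrow down a report like this is to look at the bytes of the saved file directly. The following is a rough diagnostic sketch, not a feature of CSE HTML Validator; the function name `describeEncoding` and the command-line usage are assumptions for illustration.

```php
<?php
// Hypothetical diagnostic: classify the raw bytes of a saved file.
function describeEncoding(string $bytes): string
{
    // The /u modifier makes PCRE reject byte sequences that are not valid UTF-8.
    if (preg_match('//u', $bytes) !== 1) {
        return 'not valid UTF-8';
    }
    // Any byte above 0x7F means multi-byte UTF-8 characters are present.
    if (preg_match('/[^\x00-\x7F]/', $bytes) === 1) {
        return 'UTF-8 with non-ASCII characters';
    }
    // Pure ASCII: pasted characters may already have been replaced by "?".
    return 'ASCII only';
}

// Usage from a shell, with the path as an example: php check.php page.php
if (isset($argv[1]) && is_readable($argv[1])) {
    echo describeEncoding(file_get_contents($argv[1])), "\n";
}
```

A result of "ASCII only" for a file that should contain Greek or Chinese text suggests the characters were lost at save time rather than at display time.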
Albert Wiersch


Re: UTF-8 and PHP and html

Post by PacificSpeech » Tue Jan 12, 2016 4:57 pm

It is the "Save with Encoding" option that fixes the problem. I had not even noticed this option before. Unlike a normal HTML page, which nowadays I always build to include the meta tag with charset=utf-8, PHP pages need not have a charset specified, even when their internal HTML code may very well have one. Any UTF-8 characters that had been pasted would lose their encoding after saving, and only upon reload would the issue become visible. Saved files, including PHP, JavaScript, CSS, and .txt files, otherwise use the "system codepage" character set. By contrast, .htm and .html pages (including HTML5) may or may not save as UTF-8 by default. In fact, the CSE editor will save a blank document as UTF-8 by default!
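
The loss described above can be reproduced in PHP itself. This is only a sketch: it assumes the mbstring extension is loaded and uses Windows-1252 as a stand-in for the "system codepage", which is a guess about an English-language Windows setup. The function name `saveAsCodepage` is hypothetical.

```php
<?php
// Illustration only: mimics saving UTF-8 text in a legacy single-byte codepage.
function saveAsCodepage(string $utf8Text, string $codepage = 'Windows-1252'): string
{
    // Use "?" (0x3F) for every character the target codepage cannot represent.
    mb_substitute_character(0x3F);
    return mb_convert_encoding($utf8Text, $codepage, 'UTF-8');
}

if (extension_loaded('mbstring')) {
    // The Greek letters have no Windows-1252 equivalent, so each becomes "?".
    echo saveAsCodepage('<h1>Ἀσπασία</h1>'), "\n";
}
```

Once the file has been written this way, the original characters are gone from it, so no meta tag or HTTP header can bring them back.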
I guess there is only one question remaining: what does the check box labelled "Use encoding signature" do?
Many thanks for your assistance.
James


Re: UTF-8 and PHP and html

Post by Albert Wiersch » Tue Jan 12, 2016 6:09 pm

Hi James,

I'm glad that fixes the problem. In recent versions I've made some changes to be smarter about saving with the proper encoding and defaulting more to UTF-8. Are you using v16? If not, then it is possible that you wouldn't have encountered this issue with v16 (at least I hope not!).

The encoding signature adds some bytes to the beginning of the document to make it easier for programs that support it to use the correct encoding. Here is some more information:
https://en.wikipedia.org/wiki/Byte_order_mark
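
For UTF-8, the signature is the three bytes EF BB BF. Here is a small sketch of how a script might detect or remove it; the names `UTF8_BOM`, `hasUtf8Bom` and `stripUtf8Bom` are made up for this example.

```php
<?php
// The UTF-8 byte order mark (encoding signature) is the bytes EF BB BF.
const UTF8_BOM = "\xEF\xBB\xBF";

// True when the given bytes start with the UTF-8 signature.
function hasUtf8Bom(string $bytes): bool
{
    return strncmp($bytes, UTF8_BOM, 3) === 0;
}

// Returns the bytes without a leading signature, or unchanged if none is present.
function stripUtf8Bom(string $bytes): string
{
    return hasUtf8Bom($bytes) ? substr($bytes, 3) : $bytes;
}

var_dump(hasUtf8Bom(UTF8_BOM . '<h1>Test</h1>'));
var_dump(hasUtf8Bom('<h1>Test</h1>'));
```

One caveat for PHP files in particular: a signature placed before the opening `<?php` tag is sent to the browser as output, which can interfere with later `header()` calls, so many people save PHP scripts as UTF-8 without it.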
Albert Wiersch


Re: UTF-8 and PHP and html

Post by PacificSpeech » Tue Jan 12, 2016 8:20 pm

Indeed, my version is getting old, V11, yet trouble free, almost without exception. Quite unlike this power supply of identical age, which died two weeks ago. The BOM is what I suspected. And yet again, I know more than before.

James


Re: UTF-8 and PHP and html

Post by Albert Wiersch » Wed Jan 13, 2016 9:56 am

James, I'm glad that v11 has worked so well for you! However, v16 is a lot newer and even better than v11. :)

I do recommend an upgrade since much has changed (although the interface you know is mostly the same). Also, as of now (January 2016), versions prior to v12 are considered obsolete (which includes v11 and below).

In case you are interested, here are more details on the changes:
https://www.htmlvalidator.com/whats-new.php

I also hope to release v16.02 at the end of the month. It will be the first 2016 release and free for all customers who are already licensed for v16 (because all minor updates are free).
Albert Wiersch
