#31
Hi Schmosby,
I don't think it matters which operating system is being used. All that matters is whether you post Unicode or not. The problem looks to be that the board's Unicode support is broken. All I am doing is putting Unicode characters into the edit box. Tests show that most Unicode characters do not render. The few that do are trashed by editing the post. I am using Firefox and manually constructing the codes I want by copy and paste.
#32
I have contacted our administrator about this issue to see if he can do anything.
#33
I think there may be two issues.
The root cause may be that the Unicode support is broken by the code stuffing not being handled correctly. To work around that, a kludge may have been written to try to get a few punctuation marks to work. This works for virgin posts but breaks when a submitted post is then edited. I don't think either of these is something that the admin here can address directly. However, I have a cunning plan...
@Indigo_: Does the board have a word filter that allows you to replace one word with another? If so, maybe it could be set to replace the text ***8217; with ' and, if that works, the other codes could be caught similarly.
To cover more of the field, it may pay to try replacing the Unicode punctuation with an appropriate ASCII character. This may catch the Unicode at the earliest opportunity. This last bit may not work if Unicode characters are not supported in the filter code. How helpful this is will depend on where in the chain the filter is applied. I think it is worth a go in case it does make the posts easier to read.
Some downsides are that it won't be possible to refer to the offending string directly as it will be filtered, and when the root bug gets fixed there may be some side effects with old posts.
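To make the two replacement ideas above concrete, here is a minimal Python sketch. It is not a vBulletin feature; the mapping table and both function names are made up for illustration, and it only shows what such a filter would do to the text if the board had one.

```python
import re

# Hypothetical mapping: curly punctuation -> plain ASCII stand-ins.
PUNCT_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
    "\u201a": ",",   # single low-9 quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u201e": '"',   # double low-9 quote
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})

# Matches the starred codes seen on the board, e.g. ***8217;
STARRED_CODE = re.compile(r"\*\*\*(\d+);")


def unstar(text: str) -> str:
    """Turn starred codes such as ***8217; back into the character they stood for."""
    def restore(match: re.Match) -> str:
        code = int(match.group(1))
        # Leave the text alone if the number is not a valid code point.
        return chr(code) if code <= 0x10FFFF else match.group(0)
    return STARRED_CODE.sub(restore, text)


def fold_punctuation(text: str) -> str:
    """Replace curly punctuation with the nearest ASCII character."""
    return text.translate(PUNCT_TO_ASCII)


print(fold_punctuation(unstar("It***8217;s a \u201ctest\u201d")))
```

Running the fold as early as possible would mean the text never contains anything the later stages could mangle, at the cost of losing the curly punctuation for good.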
#34
I already thought of word filter. It doesn't.
I also thought of using JavaScript, but there is no way for the co-admins to add JavaScript either. The database charset is latin1_swedish_ci (ISO-8859-1), so it just doesn't support many characters. The others could be converted to HTML entities maybe. The other option would be to replace the common ones and strip the rest.
UPDATE: Interestingly, I don't have this issue on my install. Curly quotes get replaced with their HTML entities before being added to the database, so they display correctly. I wonder if maybe during the server move the database was created with the wrong charset, or a different one to the one specified in the admin panel.
OK, I'm 99% sure I've found the issue: &# seems to be in the censored words list. Testing if &# is a swear word...***.
#35
Hi Schmosby,
Do you mean it doesn't have a word filter or that the word filter doesn't work? I don't think it should matter too much about how the database holds characters as long as it supports the characters used to represent the Unicode character, most charsets have *; and the digits. My guess was that it was held as, for example, ***8217; and that anything looking like this in the user text was stuffed to ****8217; Or something like that anyway. I would use fewer * though. Not sure what is used as the escape character.
At least some Unicode characters make it correctly into the database and come back as themselves, so there have to be at least two layers to the issue, otherwise it would be an all or nothing effect. If you have a live copy, try putting in some Unicode and see how it is held in the database.
#36
The board doesn't have a word replacement feature at all.
Using quick reply, it's saved in the database using HTML entities. If Go Advanced is used, or if you quote someone's post (which uses the advanced page), it is saved raw. If I edit a post saved with quick reply and hit the Go Advanced button before saving, it gets converted from entities to raw.
I'm pretty sure the swear filter is the issue, as if you type &# here you will see it gets replaced with 3 stars, therefore the HTML entity for ’ (number 8217) becomes ***8217; instead of showing up as an apostrophe. That's why you only get the stars if you use quick reply, as those posts contain the entities, which include &#, while the advanced page saves it raw, which contains no entities, so no &#. ***33336;
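The difference between the two save paths can be shown with a few lines of Python. This is only a sketch of the behaviour described above, not the forum's code: the `CENSORED` list and the `censor` function are assumptions standing in for whatever the swear filter really does.

```python
# Assumption: "&#" is an entry in the board's censored words list.
CENSORED = ["&#"]


def censor(text: str) -> str:
    """Replace every censored string with three stars, as the board appears to."""
    for word in CENSORED:
        text = text.replace(word, "***")
    return text


quick_reply = "It&#8217;s"   # quick reply: apostrophe stored as an HTML entity
advanced = "It\u2019s"       # advanced page: apostrophe stored raw

print(censor(quick_reply))   # the entity loses its "&#" prefix
print(censor(advanced))      # nothing for the filter to match
```

The raw version passes through untouched because it never contains the two characters the filter is looking for.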
#39
However that still doesn't explain everything. There shouldn't be exceptions to this rule so why do some Unicode characters work?
Also, does the basement treat Unicode differently if we can swear there? And why are such an important couple of characters in the swear filter in the first place? Is it to fix some other problem? I am going into the basement to *** and ***
#40
Maybe ‟ is also added to the swear filter separately, or is automatically changed by the forum software since, being a quote, it could be used for code injection attacks. I'm just guessing.
UPDATE: Yes, there is a separate list called Blank Character Stripper, and I think U+201F is added to that on this board. This second list applies to both quick reply and advanced, so it always gets converted like a swear word.
#41
Yes, that was my guess too.
It looks like the basement is no different, so it is global. If a mod is watching, please will they kill this thread: http://www.social-anxiety-community....ad.php?t=91396 Sorry about messing the place up. I think this is above my pay grade so I am going back to lurking. Well done Schmosby.
#42
My brain will not leave this alone.
My understanding to date:
Issue 1: Unicode has been deliberately broken using the swear filter for some reason that I will probably sleep better not knowing. Let us assume it is going to stay that way.
Issue 2: For me, extended punctuation marks are accepted but are then trashed if the post is subsequently edited.
The second issue happens when characters are used that are included in charset=Windows-1252 but not included in charset=ISO-8859-1. Looking online, it does appear that these two charsets are not handled consistently and are sometimes considered synonymous. Could it be that these two different charsets (or two different interpretations of the same charset) are active in two different sections of the board’s engine room, and active translation between them causes promotion of the extended punctuation marks to Unicode and BAM! The swear filter kicks in?
The solution would seem to be to ensure that only one charset is used throughout and that it contains the extended punctuation marks, i.e. charset=windows-1252.
In August 2018 a page header was:
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en"><head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta name="generator" content="vBulletin 3.8.7">
Today it is:
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
<meta name="generator" content="vBulletin 3.8.7" />
This makes me think that there may be two different interpretations of ISO-8859-1 being used at the same time. Of course I could be completely wrong. It has happened before.
#43
To put it bluntly, whoever designed things so that happens must be as thick as two short planks.
#44
^ No one person is in control and it has taken on a life of its own. It is not that people are thick, it is that it is growing faster than the standards are developing. It is not just plants and meat that evolve.
#45
^^^ I doubt that this is the issue, as the default charset in the forum software and MySQL is Latin1/ISO-8859-1. If someone deliberately chose differently, I think they wouldn't go for windows-1252, they would go for UTF-8, and I'm sure they would update the charset in the forum software if they did that.
I think someone added it to the swear filter to stop people using Unicode characters in a negative way. My guess is that they were being used like...
#46
The reason I thought what I did was that ISO-8859-1 doesn't contain the extended punctuation and yet these characters are still stored without being converted to Unicode. It is windows-1252 that does contain them.
If the board strictly used ISO-8859-1 throughout, “ ” „ ‘ ’ ‚ – — would never get to the database, and yet here they are in the database. As I said, I could be wrong.
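The claim about which charset contains the extended punctuation is easy to check with Python's own codec tables. A quick sketch; `encodable` is just a helper name chosen here, and Python's "cp1252" is used as a stand-in for windows-1252.

```python
# The eight extended punctuation marks listed above, written as escapes.
EXTENDED = "\u201c\u201d\u201e\u2018\u2019\u201a\u2013\u2014"


def encodable(ch: str, charset: str) -> bool:
    """Return True if the character has a byte value in the given charset."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Maps each character to (fits in ISO-8859-1?, fits in windows-1252?)
report = {ch: (encodable(ch, "iso-8859-1"), encodable(ch, "cp1252")) for ch in EXTENDED}

for ch, (in_latin1, in_cp1252) in report.items():
    print(f"U+{ord(ch):04X}  iso-8859-1={in_latin1}  cp1252={in_cp1252}")
```

Every one of the eight comes out as missing from ISO-8859-1 and present in windows-1252, which is consistent with the idea that something in the chain is really working in windows-1252.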
#47
Interesting.
I just read that MySQL actually uses CP1252 for latin1 instead of true Latin1/ISO-8859-1. I guess because Latin1/ISO-8859-1 is a subset of CP1252. Code:
| CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION          |
| latin1             | latin1_swedish_ci    | cp1252 West European |
The forum software does correctly convert characters outside CP1252 to entities for storage in the database. If I try to add non-CP1252 characters to the database directly, it just throws an error.
UPDATE: I just tested changing the forum charset to windows-1252 and that also results in our issue being resolved, as the apostrophes etc. are no longer converted to entities in quick reply and so avoid the swear filter.
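Why the charset setting decides whether entities appear at all can be sketched with Python's built-in `xmlcharrefreplace` error handler. This is an illustration of the idea only; `to_storage` is a made-up name and the forum software's real conversion routine may well differ.

```python
def to_storage(text: str, charset: str) -> bytes:
    """Encode text for storage, turning anything the charset lacks into a numeric entity."""
    return text.encode(charset, errors="xmlcharrefreplace")


apostrophe = "It\u2019s"

# Forum charset ISO-8859-1: the curly apostrophe does not fit, so an entity is produced.
print(to_storage(apostrophe, "iso-8859-1"))

# Forum charset windows-1252: the apostrophe fits in one byte, so no entity and no "&#".
print(to_storage(apostrophe, "cp1252"))

# A character outside both charsets becomes an entity either way.
print(to_storage("\u4e2d", "cp1252"))
```

With windows-1252 the common punctuation never turns into an entity, so the swear filter has nothing to catch. Characters outside windows-1252 would presumably still be starred.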
#48
Good work.
... and we know that the board has changed from using windows-1252 to ISO-8859-1 sometime since May this year because it can be seen from the page headers. Coincidence? Maybe changing it back to windows-1252 will solve this issue. However, it must have been changed for some reason. You don't go changing things like that on a whim; 'if it ain't broke, don't fix it.'
This change isn't the root cause, it just manifests deeper issues. If everything was done correctly, the extended punctuation would never have made it to the database in the first place. Even my browser is a little naughty. It has been told the page is ISO-8859-1 and yet it took an unprintable character and rendered it as punctuation, as if the page was windows-1252. So much for standards.
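The browser behaviour described here can be reproduced on a single byte. A small sketch, assuming the stored byte for the apostrophe is 0x92 as it would be in windows-1252; as far as I know, the WHATWG Encoding Standard that browsers follow treats the label ISO-8859-1 as windows-1252, which would explain the rendering.

```python
import unicodedata

raw = b"It\x92s"  # assumed bytes on the wire for a curly apostrophe

strict = raw.decode("iso-8859-1")   # strict ISO-8859-1: 0x92 is a C1 control character
lenient = raw.decode("cp1252")      # windows-1252: 0x92 is the right single quote

print(repr(strict), unicodedata.category(strict[2]))
print(repr(lenient), unicodedata.name(lenient[2]))
```

So a strictly ISO-8859-1 reading gives an unprintable control character, while the windows-1252 reading gives the punctuation mark that actually shows up on screen.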
#49
Well, let's not talk about browsers regarding standards.
Where are you getting your info about the header being different previously? I checked the web archive and all the snapshots going back until the beginning of time had ISO-8859-1, even the Aug 2018 one. (I didn't check every snapshot obviously, just random ones.)
#50
I took a scrape of the site FAQ a while back and looked at that.
Checking again, I see that it was May that I moved it from one disc to another and I looked at the file date.
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en"><head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta name="generator" content="vBulletin 3.8.7">
Other pages I have from around that time are also windows-1252. The August 2018 one is windows-1252 as well.
Whatever the problem is, I think we have gone as far as we can from the outside. It is now up to 'the management' to either do or not do. It was fun but I don't think I will be giving up my day job.
#51
Ah, things are not quite as I thought.
I found it odd that we were seeing different things and so I just scraped this page. If I look at the source of this live page while looking at it I see ISO-8859-1, and if I look at the source of the scraped page it shows windows-1252. There is some transcoding going on somewhere and it is not obvious to me where it is happening. It could be the server or my browser being (un)helpful. If I have time I will log the network traffic and then I can see which it is. This is one good side to it being just HTTP.
In the meantime I cannot be sure what is going on, so the answer may or may not be a change in code page. Wrong again.
I wonder how many people on the board are still keeping up with this thread?
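One way to compare the live page with a saved copy without a browser getting in between is to read the bytes directly and pull out the declared charset. A rough Python sketch; `declared_charset` is a made-up helper, the regular expression is deliberately simple, and the two sample headers are the ones quoted in this thread.

```python
import re

# Looks for charset=... in the first few kilobytes, with or without quotes.
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def declared_charset(raw_html: bytes):
    """Return the charset named in the page's meta tag, lowercased, or None."""
    match = META_CHARSET.search(raw_html[:4096])
    return match.group(1).decode("ascii").lower() if match else None


saved = b'<head> <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'
live = b'<head> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />'

print(declared_charset(saved))
print(declared_charset(live))
```

For a file on disc, open it with `open(path, "rb")` so that nothing is transcoded on the way in. Note also that the saved header quoted earlier lacks the self-closing `/>` that the live one has, which hints that the browser rewrote the page when saving it rather than storing it byte for byte. That is only a guess.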
#53
Nothing so complex
I guess it is the browser doing the conversion because of my PC setup. It doesn't really matter anyway. All it says is that I cannot draw any conclusion about if or when the board changed code page. Everything still points to some confusion over code pages within the forum server. And they would be mad to let me close enough to the server to find out what it was.
Sorry, I am running through a firewall that blocks most things so I can't see that. But thanks anyway.
#54
Well the web archive would be just using curl to save the web page raw, so the snapshots should be correct.
What kind of firewall blocks YouTube? Are you in China?
#57
So people keep telling me. I will get around to sorting it out one day.
I was thinking of getting the new Raspberry Pi 4 and using that as a web browser for that sort of thing. Trouble is I have so many half-finished projects that it will probably never happen. This is now in danger of going well off topic and turning into a nerdfest, so we should probably get back on track.
#58
I sent you a private message quite a while back explaining how to fix this.