SAUK Discussion Board

  #31  
Old 30th July 2019, 20:42
Sisyphus Sisyphus is offline
Member
 
Join Date: Aug 2012
Location: Lost
Posts: 277
Blog Entries: 1
Default Re: ***8217;

Hi Schmosby,

I don't think it matters which operating system is being used.

All that matters is if you post Unicode or not.

The problem looks to be the board's Unicode support is broken. All I am doing is putting Unicode characters into the edit box.

Tests show that most Unicode characters do not render.

The few that do are trashed by editing the post.

I am using Firefox and manually constructing the codes I want by copy and paste.
  #32  
Old 31st July 2019, 11:31
Indigo_ Indigo_ is offline
Co-Administrator
 
Join Date: May 2013
Location: Cheshire
Posts: 19,452
Default Re: ***8217;

I have contacted our administrator about this issue to see if he can do anything
  #33  
Old 31st July 2019, 12:41
Sisyphus Sisyphus is offline
Member
 
Join Date: Aug 2012
Location: Lost
Posts: 277
Blog Entries: 1
Default Re: ***8217;

I think there may be two issues.

The root cause may be that the Unicode support is broken by the code stuffing not being handled correctly.

To work around that a kludge may have been written to try to get a few punctuation marks to work. This works for virgin posts but breaks when a submitted post is then edited.

I don't think either of these are things that the admin here can address directly.

However, I have a cunning plan...

@Indigo_: Does the board have a word filter that allows you to replace one word with another?

If so maybe it could be set to replace the text ***8217; with '

If that works then the other codes could be caught similarly.

To cover more of the field, it may pay to try replacing the Unicode punctuation with an appropriate ASCII character. This may catch the Unicode at the earliest opportunity.

This last bit may not work if Unicode characters are not supported in the filter code.

How helpful this is will depend on where in the chain the filter is applied. I think it is worth a go in case it does make the posts easier to read.

Some downsides are that it won't be possible to refer to the offending string directly as it will be filtered, and when the root bug gets fixed there may be some side effects with old posts.
  #34  
Old 31st July 2019, 14:12
Schmosby Schmosby is online now
Member
 
Join Date: Jan 2012
Location: London
Posts: 3,907
Blog Entries: 1

Mood
Relaxed

Default Re: ***8217;

I already thought of a word filter. The board doesn't have one.

I also thought of using JavaScript, but there is no way for the co-admins to add JavaScript either.

The database charset is latin1_swedish_ci (ISO-8859-1), so it just doesn't support many characters. The others could maybe be converted to HTML entities. The other option would be to replace the common ones and strip the rest.
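The "replace the common ones and strip the rest" option could look something like the Python sketch below. It is only an illustration: the `COMMON` mapping and the function name `to_latin1_safe` are assumptions, and the thread does not say which characters would count as common.

```python
# Hypothetical mapping of "common" punctuation to ASCII stand-ins.
COMMON = {
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # ellipsis
}


def to_latin1_safe(text: str) -> str:
    """Replace the mapped punctuation, then drop anything Latin-1 cannot hold."""
    replaced = text.translate(str.maketrans(COMMON))
    # errors="ignore" silently strips characters with no Latin-1 encoding.
    return replaced.encode("latin-1", errors="ignore").decode("latin-1")


print(to_latin1_safe("It\u2019s \u201cfine\u201d, caf\u00e9 \u2603"))
```

Characters that Latin-1 already supports (such as é) pass through untouched; anything else that is not in the mapping is lost, which is the price of this approach.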

UPDATE: Interestingly, I don't have this issue on my install. Curly quotes get replaced with their HTML entities before being added to the database, so they display correctly. I wonder if, during the server move, the database was created with the wrong charset, or with a different one from the one specified in the admin panel.

OK, I'm 99% sure I've found the issue: &# seems to be in the censored words list.

Testing if &# is a swear word...***.
  #35  
Old 31st July 2019, 17:53
Sisyphus Sisyphus is offline
Member
 
Join Date: Aug 2012
Location: Lost
Posts: 277
Blog Entries: 1
Default Re: ***8217;

Hi Schmosby,

Do you mean it doesn't have a word filter or that the word filter doesn't work?

I don't think it should matter too much how the database holds characters, as long as it supports the characters used to represent the Unicode character; most charsets have *; and the digits. My guess was that it was held as, for example, ***8217; and that anything looking like this in the user text was stuffed to ****8217; or something like that anyway. I would use fewer * though. I am not sure what is used as the escape character.

At least some Unicode characters make it correctly into the database and come back as themselves, so there have to be at least two layers to the issue; otherwise it would be an all or nothing effect.

If you have a live copy try putting in some Unicode and see how it is held in the database.
  #36  
Old 31st July 2019, 18:29
Schmosby Schmosby is online now
Member
 
Join Date: Jan 2012
Location: London
Posts: 3,907
Blog Entries: 1

Mood
Relaxed

Default Re: ***8217;

The board doesn't have a word replacement feature at all.

Using quick reply, it's saved in the database using HTML entities. If Go Advanced is used, or if you quote someone's post (which uses the advanced page), it is saved raw. If I edit a post saved with quick reply and hit the Go Advanced button before saving, it gets converted from entities to raw.

I'm pretty sure the swear filter is the issue. If you type &# here you will see it gets replaced with 3 stars, so the HTML entity for ’ becomes ***8217; instead of showing up as an apostrophe.

That's why you only get the stars if you use quick reply: those posts contain the entities, which include &#, while the advanced page saves the text raw, which contains no entities and so no &#.
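The chain described above can be modelled in a few lines of Python. This is only a sketch of the behaviour, not vBulletin's actual code: the function names and the contents of `CENSORED` are assumptions based on what the board appears to do.

```python
# Assumption: the board's censored list contains this literal string.
CENSORED = ["&#"]


def quick_reply_encode(text: str) -> str:
    """Stand-in for quick reply: characters outside ISO-8859-1 become numeric entities."""
    return text.encode("latin-1", errors="xmlcharrefreplace").decode("latin-1")


def censor(text: str) -> str:
    """Stand-in for the swear filter: every listed string becomes three stars."""
    for word in CENSORED:
        text = text.replace(word, "***")
    return text


post = "It\u2019s broken"
print(censor(quick_reply_encode(post)))  # quick reply path: entity created, then censored
print(censor(post))                      # advanced path: no entity, nothing to censor
```

In this model the quick reply path produces the starred text seen in the thread title, while the raw path leaves the apostrophe alone.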

***33336;
  #37  
Old 31st July 2019, 18:36
Sisyphus Sisyphus is offline
Member
 
Join Date: Aug 2012
Location: Lost
Posts: 277
Blog Entries: 1
Default Re: ***8217;

DOH!!!

I see what you mean. I am being thick.

You are saying that the ***8217; is being converted to ***8217; because the swear filter is replacing *** with *** because *** is a swear word.
  #38  
Old 31st July 2019, 18:40
Sisyphus Sisyphus is offline
Member
 
Join Date: Aug 2012
Location: Lost
Posts: 277
Blog Entries: 1
Default Re: ***8217;

I was typing while you posted.

I think you have got it.
  #39  
Old 31st July 2019, 18:50
Sisyphus Sisyphus is offline
Member
 
Join Date: Aug 2012
Location: Lost
Posts: 277
Blog Entries: 1
Default Re: ***8217;

However that still doesn't explain everything. There shouldn't be exceptions to this rule so why do some Unicode characters work?

Also does the basement treat Unicode differently if we can swear there?

And why is such an important couple of characters in the swear filter in the first place? Is it to fix some other problem?

I am going into the basement to *** and ***
  #40  
Old 31st July 2019, 19:02
Schmosby Schmosby is online now
Member
 
Join Date: Jan 2012
Location: London
Posts: 3,907
Blog Entries: 1

Mood
Relaxed

Default Re: ***8217;

Maybe ‟ is also added to the swear filter separately, or is automatically changed by the forum software because, being a quote, it could be used for code injection attacks. I'm just guessing.

UPDATE: Yes, there is a separate list called Blank Character Stripper, and I think U+201F is added to that on this board. This second list applies to both quick reply and advanced, so it always gets converted like a swear word.

  #41  
Old 31st July 2019, 19:15
Sisyphus Sisyphus is offline
Member
 
Join Date: Aug 2012
Location: Lost
Posts: 277
Blog Entries: 1
Default Re: ***8217;

Yes, that was my guess too.

It looks like the basement is no different so it is global.

If a mod is watching, please will they kill this thread: http://www.social-anxiety-community....ad.php?t=91396

Sorry about messing the place up.

I think this is above my pay grade so I am going back to lurking.

Well done Schmosby.
  #42  
Old 1st August 2019, 11:08
Sisyphus Sisyphus is offline
Member
 
Join Date: Aug 2012
Location: Lost
Posts: 277
Blog Entries: 1
Default Re: ***8217;

My brain will not leave this alone.

My understanding to date:-

Issue 1: Unicode has been deliberately broken using the swear filter for some reason that I will probably sleep better not knowing. Let us assume it is going to stay that way.

Issue 2: For me, extended punctuation marks are accepted but are then trashed if the post is subsequently edited.

The second issue happens when characters are used that are included in charset=Windows-1252 but not included in charset=ISO-8859-1.

Looking online it does appear that these two charsets are not handled consistently and sometimes considered synonymous.

Could it be that these two different charsets (or two different interpretations of the same charset) are active in two different sections of the board’s engine room, and that active translation between them causes promotion of the extended punctuation marks to Unicode and BAM! The swear filter kicks in?

The solution would seem to be to ensure that only one charset is used throughout and that it contains the extended punctuation marks, i.e. charset=windows-1252.
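The difference between the two charsets is easy to check with Python's built-in codecs. The sketch below tries a handful of extended punctuation marks against both; the particular marks and the helper name `encodable` are chosen purely for illustration.

```python
# A few extended punctuation marks: curly quotes, low quotes, en and em dash.
PUNCTUATION = "\u201c\u201d\u201e\u2018\u2019\u201a\u2013\u2014"


def encodable(ch: str, codec: str) -> bool:
    """Return True if the character has an encoding in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


for ch in PUNCTUATION:
    print(
        f"U+{ord(ch):04X}",
        f"windows-1252={encodable(ch, 'windows-1252')}",
        f"iso-8859-1={encodable(ch, 'iso-8859-1')}",
    )
```

Every mark in the sample encodes under windows-1252 and none of them encodes under ISO-8859-1, which fits the observation that only the extended punctuation is affected.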

In August 2018 a page header was:-
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en"><head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta name="generator" content="vBulletin 3.8.7">

Today it is:-
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
<meta name="generator" content="vBulletin 3.8.7" />

This makes me think that there may be two different interpretations of ISO-8859-1 being used at the same time.

Of course I could be completely wrong. It has happened before.
  #43  
Old 1st August 2019, 11:18
firemonkey firemonkey is offline
Member
 
Join Date: Apr 2010
Location: Calne,Wiltshire
Posts: 5,381

Mood
Balanced

Default Re: ***8217;

To put it bluntly, whoever designed things so that happens must be as thick as two short planks.
  #44  
Old 1st August 2019, 11:29
Sisyphus Sisyphus is offline
Member
 
Join Date: Aug 2012
Location: Lost
Posts: 277
Blog Entries: 1
Default Re: ***8217;

^ No one person is in control and it has taken on a life of its own. It is not that people are thick, it is that it is growing faster than the standards are developing. It is not just plants and meat that evolve.


  #45  
Old 1st August 2019, 14:03
Schmosby Schmosby is online now
Member
 
Join Date: Jan 2012
Location: London
Posts: 3,907
Blog Entries: 1

Mood
Relaxed

Default Re: ***8217;

^^^I doubt that this is the issue. The default charset in the forum software and MySQL is Latin1/ISO-8859-1, so if someone deliberately chose differently, I think they wouldn't go for windows-1252; they would go for UTF-8, and I'm sure they would update the charset in the forum software if they did that.

I think someone added it to the swear filter to stop people using Unicode characters in a negative way. My guess is that they were being used like...

  #46  
Old 1st August 2019, 17:46
Sisyphus Sisyphus is offline
Member
 
Join Date: Aug 2012
Location: Lost
Posts: 277
Blog Entries: 1
Default Re: ***8217;

The reason I thought what I did was that ISO-8859-1 doesn't contain the extended punctuation, and yet those characters are still stored without being converted to Unicode. It is windows-1252 that does contain them.

If the board strictly used ISO-8859-1 throughout “ ” „ ‘ ’ ‚ – — would never get to the database and yet here they are in the database.

As I said, I could be wrong.
  #47  
Old 1st August 2019, 19:19
Schmosby Schmosby is online now
Member
 
Join Date: Jan 2012
Location: London
Posts: 3,907
Blog Entries: 1

Mood
Relaxed

Default Re: ***8217;

Interesting.

I just read that MySQL actually uses CP1252 for Latin1 instead of Latin1/ISO-8859-1. I guess because Latin1/ISO-8859-1 is a subset of CP1252.

Code:
| CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION          |
| latin1             | latin1_swedish_ci    | cp1252 West European |

I guess data posted through forms is not governed by the page character set, so would get through.

The forum software does correctly convert characters outside CP1252 to entities for storage in the database.

If I try to add non-CP1252 characters to the database directly, it just throws an error.

UPDATE: I just tested changing the forum charset to windows-1252, and that also resolves our issue, as the apostrophes etc. are no longer converted to entities in quick reply and so avoid the swear filter.
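The effect of switching the forum charset can be illustrated with Python's `xmlcharrefreplace` error handler, which falls back to a numeric entity only when the target charset cannot hold a character. This is a sketch of the principle, assuming the forum software behaves in a similar way; the `store` function is hypothetical.

```python
def store(text: str, board_charset: str) -> bytes:
    """Encode for storage, using a numeric entity for any unsupported character."""
    return text.encode(board_charset, errors="xmlcharrefreplace")


apostrophe = "\u2019"  # right single quotation mark

print(store(apostrophe, "iso-8859-1"))    # becomes an entity containing &#
print(store(apostrophe, "windows-1252"))  # stored as the single byte 0x92
print(store("\u4e2d", "windows-1252"))    # outside CP1252, so still an entity
```

With windows-1252 the apostrophe never turns into an entity, so there is no &# for the filter to catch, while characters outside CP1252 would still be affected.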
  #48  
Old 2nd August 2019, 13:18
Sisyphus Sisyphus is offline
Member
 
Join Date: Aug 2012
Location: Lost
Posts: 277
Blog Entries: 1
Default Re: ***8217;

Good work.

... and we know that the board has changed from using windows-1252 to ISO-8859-1 sometime since May this year because it can be seen from the page headers.

Coincidence?

Maybe changing it back to windows-1252 will solve this issue. However it must have been changed for some reason. You don't go changing things like that on a whim; 'if it ain't broke, don't fix it.'

This change isn't the root cause, it just manifests deeper issues. If everything was done correctly the extended punctuation would have never made it to the database in the first place.

Even my browser is a little naughty. It has been told the page is ISO-8859-1 and yet it took an unprintable character and rendered it as punctuation as if the page was windows-1252.
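Browsers that follow the WHATWG Encoding Standard treat the label ISO-8859-1 as windows-1252, which would be consistent with what is described above. The difference between the two interpretations of the same byte can be shown with a short Python sketch; the sample bytes are made up for illustration.

```python
# Hypothetical stored bytes: "It", the byte 0x92, then "s".
raw = b"It\x92s"

as_latin1 = raw.decode("iso-8859-1")    # strict ISO-8859-1: 0x92 is a C1 control character
as_cp1252 = raw.decode("windows-1252")  # windows-1252: 0x92 is a right single quotation mark

print(repr(as_latin1))
print(repr(as_cp1252))
```

Under the strict reading the byte is unprintable; under the windows-1252 reading it is the curly apostrophe.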

So much for standards.
  #49  
Old 2nd August 2019, 14:04
Schmosby Schmosby is online now
Member
 
Join Date: Jan 2012
Location: London
Posts: 3,907
Blog Entries: 1

Mood
Relaxed

Default Re: ***8217;

Well, let's not talk about browsers regarding standards.

Where are you getting your info about the header being different previously? I checked the web archive and all the snapshots going back until the beginning of time had ISO-8859-1, even the Aug 2018 one. (I didn't check every snapshot obviously, just random ones).
  #50  
Old 2nd August 2019, 14:32
Sisyphus Sisyphus is offline
Member
 
Join Date: Aug 2012
Location: Lost
Posts: 277
Blog Entries: 1
Default Re: ***8217;

I took a scrape of the site FAQ a while back and looked at that.

Checking again, I see that it was May that I moved it from one disc to another and I looked at the file date. It was actually scraped in January this year:-

<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en"><head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta name="generator" content="vBulletin 3.8.7">

Other pages I have around that time are also windows-1252.

The August 2018 one is windows-1252 as well.

Whatever the problem is I think we have gone as far as we can from the outside. It is now up to 'the management' to either do or not do.

It was fun but I don't think I will be giving up my day job.
  #51  
Old 3rd August 2019, 13:28
Sisyphus Sisyphus is offline
Member
 
Join Date: Aug 2012
Location: Lost
Posts: 277
Blog Entries: 1
Default Re: ***8217;

Ah, things are not quite as I thought.

I found it odd that we were seeing different things and so I just scraped this page.

If I look at the source of this live page while looking at it I see ISO-8859-1 and if I look at the source of the scraped page it shows windows-1252.

There is some transcoding going on somewhere and it is not obvious to me where it is happening. It could be the server or my browser being (un)helpful.

If I have time I will log the network traffic and then I can see which it is. This is one good side to it being just HTTP.
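Once the raw bytes are captured, comparing what each copy declares is straightforward. The Python sketch below pulls the first charset label out of a chunk of bytes, whether that is an HTML meta tag or an HTTP Content-Type value. The function name `declared_charset` and the sample `header` value are assumptions for illustration; the two meta tags are the ones seen in the live and saved pages.

```python
import re
from typing import Optional

# Matches charset=LABEL with optional quotes and whitespace.
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def declared_charset(raw: bytes) -> Optional[str]:
    """Return the first charset label found in the bytes, lowercased, or None."""
    match = CHARSET_RE.search(raw)
    return match.group(1).decode("ascii").lower() if match else None


live = b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />'
saved = b'<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'
header = b"text/html; charset=ISO-8859-1"  # hypothetical HTTP Content-Type value

for name, raw in [("live", live), ("saved", saved), ("header", header)]:
    print(name, declared_charset(raw))
```

If the bytes on the wire say one thing and the saved file says another, the rewrite happened on the client side rather than on the server.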

In the meantime I cannot be sure what is going on so the answer may or may not be a change in code page.

Wrong again.

I wonder how many people on the board are still keeping up with this thread?
  #52  
Old 3rd August 2019, 14:06
Schmosby Schmosby is online now
Member
 
Join Date: Jan 2012
Location: London
Posts: 3,907
Blog Entries: 1

Mood
Relaxed

Default Re: ***8217;

That's odd.

Are you using file_get_contents or cURL?

I think this is the most interesting thread that's existed on this site.

You will probably enjoy this.
  #53  
Old 3rd August 2019, 18:21
Sisyphus Sisyphus is offline
Member
 
Join Date: Aug 2012
Location: Lost
Posts: 277
Blog Entries: 1
Default Re: ***8217;

Nothing so complex. I just use right click; view source, and File; save page as. I could turn on the debugging to see which it used, but I was just going to go for raw sniffing of the network and then I would know.

I guess it is the browser doing the conversion because of my PC setup. It doesn't really matter anyway. All it says is that I cannot draw any conclusion about if or when the board changed code page. Everything still points to some confusion over code pages within the forum server.

And they would be mad to let me close enough to the server to find out what it was.

Sorry, I am running through a firewall that blocks most things so I can't see that. But thanks anyway.
  #54  
Old 3rd August 2019, 18:34
Schmosby Schmosby is online now
Member
 
Join Date: Jan 2012
Location: London
Posts: 3,907
Blog Entries: 1

Mood
Relaxed

Default Re: ***8217;

Well, the web archive would just be using cURL to save the web page raw, so the snapshots should be correct.

What kind of firewall blocks YouTube? Are you in China?
  #55  
Old 3rd August 2019, 18:49
Sisyphus Sisyphus is offline
Member
 
Join Date: Aug 2012
Location: Lost
Posts: 277
Blog Entries: 1
Default Re: ***8217;

I will go with your version of events. I don't trust my machine anymore.

No I am not in China, more like the stone age. I have set my own firewall up that way and I don't watch television either.
  #56  
Old 3rd August 2019, 18:57
Schmosby Schmosby is online now
Member
 
Join Date: Jan 2012
Location: London
Posts: 3,907
Blog Entries: 1

Mood
Relaxed

Default Re: ***8217;

Ohhh OK, that's good. I don't have a TV either, but there is a lot of good stuff on YouTube among the junk. Interesting factual stuff, although I guess you can get the same from books.
  #57  
Old 3rd August 2019, 19:19
Sisyphus Sisyphus is offline
Member
 
Join Date: Aug 2012
Location: Lost
Posts: 277
Blog Entries: 1
Default Re: ***8217;

So people keep telling me. I will get around to sorting it out one day.

I was thinking of getting the new Raspberry Pi 4 and using that as a web browser for that sort of thing. Trouble is I have so many half-finished projects that it will probably never happen.

This is now in danger of going well off topic and turning into a nerdfest so we should probably get back on track.

All times are GMT +1. The time now is 19:46.

