Sorting by title or author if non-latin characters are prevalent in the catalog

1ivorytower
Edited: Feb 9, 2009, 12:10 pm

Hi,
When I try to sort all books in my catalog by title or by author (or by any non-numerical field), the titles, names, etc. are not sorted in alphabetical order for non-Latin characters. The order doesn't appear random, but it looks like the engine has its own idea of the correct sequence of letters in Russian/Bulgarian/any Cyrillic alphabet. For instance, on LT, it starts with П (actually, it starts with A ;)
I believe it is an old bug, but I had fewer books before and hence I didn't care. Now my library has grown pretty big :) so this bug has become rather annoying.
If you want to see this effect in my catalog, sort the items by title and then go to page 7
http://www.librarything.com/catalog/ivorytower

2ivorytower
Feb 25, 2009, 12:08 am

*bump*

I see this problem has been reported previously as well - Cyrillic characters appear in the middle of the English alphabet when the titles or authors are alphabetized.
Please help!

3ivorytower
May 5, 2009, 2:03 pm

*bump*
Well, while we are discussing a closely related subject here: http://www.librarything.com/topic/61229 let me bump this as well.

4ivorytower
May 13, 2009, 1:49 pm

*bump*
ok, ok, no search option for now, but could you please fix the alphabetical order issue? Please-please-please?

5vs36
May 17, 2009, 3:41 pm

A similar (smaller) problem exists for Latin-2 but non-English characters (or at least Czech and Slovak): accented characters are sorted before non-accented ones.

6gcthomas
Feb 25, 2021, 1:17 am

This is still an issue in 2021. When I sort my library by title, books with Cyrillic titles appear between "D" and "E".

7bnielsen
Feb 25, 2021, 4:29 am

There are no perfect solutions since, e.g., Swedes and Danes have different sorting orders for o ö æ ø å, and how should a Danish library sort Swedish books?
As far as I know, data in LibraryThing are stored in the Unicode character set, and my guess is that LT just uses the standard sort in the programming language of choice. Probably PHP. So changing it is not easy. But yes, Cyrillic between "D" and "E" is not pretty.

8Nicole_VanK
Edited: Feb 25, 2021, 4:51 am

Non-Latin script, names and/or titles containing diacritics and/or ligatures. Yes, they can go all over the place. My copy of "À Rebours" sorts before any title starting with any number. One of my favorite authors (Jan Švankmajer) sorts after Z, as soon as I spell his name correctly. Sigh, it's messy.

9bnielsen
Edited: Feb 26, 2021, 1:13 am

>8 Nicole_VanK: Yes. I've worked with several library systems. None of them handles this very well, imho. It also tends to confuse users when the system is too clever and sorts "The Bogen" (Danish for The Book of Tea) like "Bogen" and similar. So I prefer some predictably stupid way of sorting to some librarian-devised unpredictably clever way. Some of the Danish rules for sorting are also computer-unfriendly, like sorting "aa" as "å" if it is pronounced as one sound but as "aa" if not. Some of the rules have changed over time, so I wonder if Danish books printed after 1948 should sort differently than books printed before :-)

As long as I can't imagine a feasible system (even if completely coded from the ground up by myself and no others) I don't think LT should spend time on it. I can imagine an unfeasible system, but that includes a "collating sequence" for each user on LT PLUS sorting by that one for each list displayed on the screen. It will also certainly be slower than now, but you could code in a way that might not be hideously slower. But clever code requires more time spent on development. Development time is a limited resource.

I think the best we can hope for is some better standard sort that doesn't put Cyrillic between "D" and "E". But where would we like Pakistani or Arabic or Vietnamese or ...? I hope we don't have to vote too often on new letters :-)

10gcthomas
Edited: Feb 25, 2021, 8:16 pm

I think there are a few related issues/features here. From most to least important:

1. Having characters from different alphabets collated together.

2. Having a reasonable sort order within each alphabet. Useful for LT sites in other languages, especially those with non-Latin alphabets.

3. Correctly handling diacritics and language-specific sorting rules. This is more of a nice-to-have.

>7 bnielsen: I'm sure PHP supports some form of Unicode collation, but I think the indexing and collation capabilities of the underlying data store are the determining factor when it comes to paginated queries.
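
To make points 1 and 3 concrete, here is a minimal Python sketch (not LT's code, which nobody in this thread can see) comparing plain code point order with a hypothetical sort key that ignores diacritics. The function name `fold_key` and the sample names are assumptions for illustration only. Letters that have no Unicode decomposition (ø, æ, ł) are not affected by this approach, and it does not implement any language-specific rule.

```python
import unicodedata

def fold_key(text):
    """Hypothetical sort key: compare base letters first, ignoring diacritics."""
    # NFD splits "Š" into "S" plus a combining caron.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks so only base letters remain.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original text is a tie-breaker so the order stays deterministic.
    return (base.casefold(), text)

# "\u0160" is the precomposed letter Š.
names = ["Zola", "\u0160vankmajer", "Svevo", "Adams"]

by_code_point = sorted(names)               # Š (U+0160) lands after Z (U+005A)
by_folded_key = sorted(names, key=fold_key)  # Š is compared as S

print(by_code_point)
print(by_folded_key)
```

With the folded key, Švankmajer is compared as "svankmajer" and lands among the S names instead of after Z.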

11melannen
Edited: Feb 25, 2021, 9:38 pm

Yeah, I think LT currently sorts by extended Unicode character number, which isn't ideal but at least it now puts all the Cyrillic together at the end, and all non-Latin alphabets grouped with themselves, a vast improvement on 2009!

The only thing I can think of that might help and be reasonably doable, given all the different possible "correct" sort orders, is if the custom sort order option let you put in whatever you wanted, instead of just picking a character in the title to start with - so you could tell it to sort Švankmajer as Svankmajer or SŠvankmajer, whichever you prefer, while still displaying correctly. That would be useful for a lot of things (you could also, e.g., tell it to sort Anton Chekov as Антон Чехов, to put your translations next to your originals) and a lot of library cataloging systems have that kind of option.

>6 gcthomas: My Cyrillic authors and titles all sort to the end. That's weird.

12bnielsen
Feb 26, 2021, 1:13 am

>11 melannen: Yes. Sorting Švankmajer as Svankmajer or SŠvankmajer would be useful but also confusing. (If I look at your catalogue, will I see books in your order? Or in my order?) I hope the bug is fixed, but at some point Å would sort differently than Å etc. Numbers could be a pain too; some people prefer "8 novellas" to sort like "Eight novellas" or "Otte novellas".
Anton Chekov vs Антон Чехов, yeah, I remember searching on the shelves of the local library for books by Evgenij Evtutjenko or Jevgenij Jevtutjenko or the Russian version. And of course one of them was "Large format", so that was another place. Nice theme for a treasure hunt in the library.

(I wonder if this is doable with Collections. I.e. each book in "author-Svankmajer" and "title-Eight_novellas". That would allow sorting by Collections and getting the desired result. Probably also crashing LT if everybody did this.)

>10 gcthomas: Nice point with the LT sites in other languages. I imagine books sorting slightly differently on the Swedish version than on the Danish version. That would be fun :-)

13MarthaJeanne
Edited: Feb 26, 2021, 2:40 am

It's not just between languages. German speakers cannot agree on how to sort the Umlauts. Should Ä sort as A, as AE, or after A? Since the underlying code can vary, a computer sorted list can have them in different places, including at the end of the list.

14melannen
Edited: Feb 26, 2021, 2:24 pm

>12 bnielsen: It would work the same way as the existing "change sort order" option for titles (which is a blank dropdown next to the title on "Edit your book" pages). Right now, the option only lets you choose which character to start with - so, say, for "The Bogen" you could force it to sort under T for The by setting it to character 1, or if you had an English book called The Tea Book and you wanted to make sure it sorted under Tea, you would set it to start the sort at character 5, whichever you prefer. It still looks the same everywhere it appears, but LT knows to use something other than its standard algorithm to sort the item when it appears on your catalog pages (and only your catalog, so people wouldn't have to fight about it.) Books in your catalog show in your sort order to everyone, books in other people's catalogs show in their order to you too.

I think that was originally introduced to deal with complaints about how articles in different languages were handled in sorting, in fact. But it won't help when the issue is character sets, because Švankmajer doesn't have an S you can pick to sort on.

GR (for example) rather than letting you set a starting point, has a whole separate field for "Sorting title" where you can enter anything you want. So for The Bogen, on GR instead of setting sort to Character 1, you would just copy The Bogen to the sorting field, to make sure it knows to use the whole title, or for The Book of Tea you would leave the actual title as The Book of Tea but set the sorting title as just Book of Tea.

But since it's free entry, you could also do something like setting the sort title of "8 Novellas" to "Eight Novellas". It would still show as "8 Novellas" everywhere the title displays, but it would know to sort it under E (or o, or whatever.)

So by setting the Sort String of Švankmajer to Svankmajer instead, it would look like Švankmajer everywhere you see the name, but it would sort under S instead of Š. (Stacking SŠvankmajer as the sort string would sort it to the end of the S's.)
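
A free-entry sort field like the one described above could be sketched as follows. This is only an illustration in Python: the `Book` class, the `sort_title` field, and the sample entries are hypothetical and do not reflect how LT or GR store their data.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Book:
    title: str                        # what is displayed everywhere
    sort_title: Optional[str] = None  # hypothetical free-entry override

def sort_key(book):
    # Use the override when the owner entered one, else the displayed title.
    return (book.sort_title or book.title).casefold()

books = [
    Book("\u0160vankmajer", sort_title="Svankmajer"),  # displayed as Švankmajer
    Book("8 Novellas", sort_title="Eight Novellas"),
    Book("The Book of Tea", sort_title="Book of Tea"),
    Book("Dansk-russisk ordbog"),                      # no override
]

ordered = [b.title for b in sorted(books, key=sort_key)]
print(ordered)
```

The displayed titles never change; only the comparison uses the override, so "8 Novellas" sorts under E and Švankmajer under S.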

It would certainly be possible to use this in a way that was very confusing, but you can use the current sort field to confuse yourself too - in fact I discovered when I checked my catalog sort that Ferðafélaginn was sorting under ð, which turned out to be because I'd set that as the starting digit for sort at some point (probably while testing stuff...) And since on LT it would only affect your own data and not everybody's, even if someone decided to really misuse it, it wouldn't really matter to everybody else, and nobody would have to agree on Å and Æ.

>10 gcthomas: (I still don't know why your Cyrillic is sorting like that, nobody else's is. Sometimes LT just decides to be weird about Unicode, and if you edit and resave the entry it fixes it? Or it might be a real bug that needs a bug report.)

15AnnieMod
Edited: Feb 26, 2021, 2:33 pm

>14 melannen: When sorting by title, mine does the same - look at my "Read in 2020" collection, sort by title and the Cyrillic titles show up between D and E. At least now they seem to sort properly in there even if they are in a weird place - so I am not that bothered anymore.

Sorting by Author sends them to the bottom though. :)

16melannen
Feb 26, 2021, 5:39 pm

>15 AnnieMod: Weird! I checked several random libraries that had Cyrillic titles and they all sorted to the end. That must be some kind of weird bug with only certain libraries, but darned if I can deduce why it would happen.

17AnnieMod
Edited: Feb 26, 2021, 6:20 pm

>16 melannen: It has to do with when the books were added - LT had changed its handling of non-Latin characters quite a lot and that had changed a few things.

The newer titles (added in 2019 and 2020) end up between D and E. Older ones (early 2018 and earlier) sort at the end. I am not sure when the change happened, but records from 2018-01-01 sort at the end; any I have from 2019 and 2020 are between D and E.

Which also means that
https://www.librarything.com/catalog/AnnieMod/allcollections&language=bul (and https://www.librarything.com/catalog/AnnieMod&language=rus ) sort weirdly - first are the ones that go between D and E and then are the older records. You cannot see it on the Russian list but it is obvious on the Bulgarian one. :)

I may actually open a new bug based on that - this should give them something to explore and see if they can figure it out.

18melannen
Feb 26, 2021, 6:33 pm

Ah, that makes sense, I guess. And is super annoying - at the very least Cyrillic should all sort together!! Does it happen with other non-Latin charsets too?

And I still can't come up with any logic as to why between D and E...

19AnnieMod
Edited: Feb 26, 2021, 6:50 pm

>18 melannen: No clue about other charsets - I cannot read any other scripts (well... I can transliterate Greek and Japanese Kana but I do not speak or read a language using either so no books in it).

Tell me about it :) I rarely sort by title so had not even seen that they had diverged - I went digging to try to figure out what is going on.

About the D-E

D is U+0044
E is U+0045
The Cyrillic set starts at U+0400 for the Basic set. Something in the sorting sorts them in between 44 and 45... almost as if it treats them as 44000 and 45000 and the Cyrillic goes into 44400 and so on... That is the only thing that makes SOME sense...
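
For reference, the code points quoted above can be checked with a short Python sketch. It also shows where a Cyrillic title lands under a plain code point sort, which is not the D/E placement seen on LT. The titles other than "Dansk-russisk ordbog" are placeholders chosen for illustration.

```python
latin_d, latin_e = "D", "E"
cyrillic_es = "\u0421"  # С, the first letter of "Спасибо"

# Code points as hexadecimal numbers.
print(hex(ord(latin_d)), hex(ord(latin_e)), hex(ord(cyrillic_es)))

titles = [
    "Esperanto",                             # placeholder title
    "\u0421\u043f\u0430\u0441\u0438\u0431\u043e",  # Спасибо
    "Dansk-russisk ordbog",
    "Zebra",                                 # placeholder title
]

# Python compares strings by code point, so every Cyrillic title
# (U+0400 and up) sorts after every Latin one (below U+0080).
ordered = sorted(titles)
print(ordered)
```

So a pure code point comparison would put the Cyrillic titles after Z, and whatever produces the D/E placement must be doing something else.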

I wonder where the supplement set (U+0500) will sort - and maybe that will go between U and V :) Or who knows. Neither Bulgarian nor Russian uses the supplement. I may try just for giggles later.

PS: https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode for the sets.

20melannen
Edited: Feb 26, 2021, 9:02 pm

>19 AnnieMod: Oh, that does make some sense, if something is stripping zeros when it isn't supposed to, maybe.

21AnnieMod
Feb 26, 2021, 9:46 pm

>20 melannen: I may be very very wrong and this may be just a coincidence but... this 4xx range clustered between 44 and 45 will be a huge coincidence indeed.

22bnielsen
Feb 27, 2021, 2:52 am

>14 melannen: Thanks. The Aleph version I've worked with in the past also had this feature. Since the librarians were the only ones editing this, it was sort of consistent :-) I'm wondering what the Cyrillic titles from >19 AnnieMod: look like if you view them in the export file. I'll go experiment a bit.

23bnielsen
Edited: Feb 27, 2021, 3:19 am

>22 bnielsen: Taking notes :-)

https://www.librarything.com/catalog.php?tag=Russian&view=bnielsen in style D and sorting by title gives 33 books. In LT they sort as described above, i.e. with Cyrillic titles sorting between D and E.

Exporting them as TSV and turning them into a poor man's database, extracting the titles and sorting them using the standard sort on Linux and putting a line number on the output gives this:

$ cat /tmp/lt.rdb | perl /tmp/row Tag_List mat '/;Russian;/' | perl /tmp/column Title | perl /tmp/headchg --delete | sort | nl -w1 -nln
1 4 russiske tekster
2 4 russiske tekster, gloser
3 Dansk-russisk Lommeordbog
4 Dansk-russisk ordbog
5 De russiske bevægelsesverber
6 Kammerater : båndmanuskript
7 Learning Russian, 1
8 Learning Russian, 2
9 Learning Russian, 3
10 Learning Russian, 4
11 Lette overgangstekster. Moderne russisk
12 Lærebog i Russisk for Begyndere : Grammatik, Læsebog, Glossar
13 Practical Russian
14 Russisk
15 Russisk-dansk ordbog
16 Russiske sprogøvelser
17 Russisk for alle
18 Russisk Grammatik
19 Russisk grundgrammatik
20 Russisk, I
21 Russisk parlør
22 Russisk på en anden måde = Русский язык по-другому
23 Sovjetisk satire
24 Trojka 1, Tekstbog
25 Trojka 1, Øvebog
26 Trojka 2, Tekstbog
27 Trojka 2, Øvebog
28 Карманный датско-русский и русско-датский словарь
29 Русско-англо-немецко-французский математический словарь
30 Спасибо за внимание
31 Спасибо за внимание
32 Толковый математический словарь
33 Учебник Руссково языка

which looks like what we want. So where does the weird sorting in the Catalogue view come from?

24bnielsen
Edited: Feb 27, 2021, 9:01 am

More facts: All my Cyrillic titles sort exactly between the last D* and the first E* title.

Idea? Look at byte sequences for various encodings of Unicode. UTF-8 / UTF-16 encodings? Nope.

It looks more like some "clever" coding in LT gone bad, IMHO.

I looked at sorting in PHP, but didn't find anything useful for this problem here.
https://wojnowski.net.pl/main/index/sorting-utf-8-strings-in-php

25AnnieMod
Feb 27, 2021, 1:14 pm

>24 bnielsen: See >19 AnnieMod:
Someone tried to fix something and messed up with the codes. It is the Unicode codes.

26bnielsen
Feb 27, 2021, 5:48 pm

>25 AnnieMod: Yeah, that's my conclusion too. Some buggy program code inside LT that we cannot see, but only guess at.

27Antheii
Edited: Mar 3, 2021, 1:04 pm

>17 AnnieMod: You may be right, but .... I only started adding my books to LT last month, and maybe with one or two exceptions, all were newly created by me; some are sorted between D and E, and some at the end (after the Chinese ones).
Bit annoying ;-)

28AnnieMod
Mar 3, 2021, 11:34 pm

>27 Antheii: What sources do you usually use?

29Antheii
Mar 4, 2021, 3:59 am

>28 AnnieMod: Most are manual entry, as in copy/paste from other sites (some of these sort between D and E, some at the end), one from Russian State Library (at the end of the list), and one found at the Library of Congress (between D-E)

30AnnieMod
Mar 4, 2021, 4:53 pm

>29 Antheii: Which points to different encodings possibly being treated differently - and thus splitting them (and not just based on when they were added). Thanks for checking!

For the record: Mine are manual, typing the titles for the most part although it is possible that I may have copied a title from somewhere as well on some of them. :)

31AnnieMod
Edited: Mar 4, 2021, 11:09 pm

>29 Antheii: >30 AnnieMod: It is not the encoding, it is the sort character.

Check the ones that sort at the end for you: Their sort character (if you edit the book, this is the drop down next to the title) is set to 2. All of mine that were at the end had 2 instead of 1 (which I know I had not set so need to figure out why but it may be something with the way the site sets it for weird fonts? I will try to chase that down because this looks like a separate bug). Once updated, they all sorted between D and E.

Change it back to 2 and off they fly to the end of the list - and it is not because it is a small letter now - I tried: with sort character 1 and a title starting with a small letter, they stay between D and E. There is something weird with sorting characters and the non-Latin sets (or at least the Cyrillic one). An invisible character or counting Cyrillic characters as 2 each? Set it to 3 and it sorts based on the second letter (first all caps, then all smalls - if you change the second letter of a title to a capital letter, it sorts it with the capitals).

So
Софийско, sorting character 2 -> end of list
Софийско, sorting character 1 -> Between D and E, based on С
софийско, sorting character 1 -> Between D and E, based on с
Софийско, sorting character 3 -> Between D and E, based on о
СОфийско, sorting character 3 -> Between D and E, based on О

The order that puts the smalls after all capitals is because this is the Unicode codes order as well.

32AnnieMod
Mar 4, 2021, 11:01 pm

Bug for the sort character being set to 2 created: https://www.librarything.com/topic/330337 (easily reproducible).

I will add another one for the D/E position as soon as someone from staff comments on the sort character (maybe this setting of the sort character was to get the Cyrillic at the bottom... but then there is another bug in the sort character (see the bug I opened) and it is absolutely weird this way...)

33lorax
Mar 5, 2021, 9:47 am

AnnieMod (#31):

You probably know this, but for people who don't:

Both "C" and "O" are characters that have both ASCII and Cyrillic representations in Unicode which look identical but have different encodings.
This was a security concern when non-ASCII support in URLs was first rolled out: bad actors could snap up domains with the Cyrillic lookalikes for major sites and use them for phishing purposes. See https://en.wikipedia.org/wiki/IDN_homograph_attack#Cyrillic for details and other characters with the same characteristics. So even with a consistent encoding, if a given catalog has some titles using ASCII "C" and some using Cyrillic "C" they could sort differently for no apparent reason.
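
A quick way to tell such lookalikes apart is to ask for their Unicode names. The sketch below uses Python's standard `unicodedata` module; the variable names are just for illustration.

```python
import unicodedata

latin_c = "C"           # U+0043
cyrillic_es = "\u0421"  # U+0421, looks identical in most fonts

for ch in (latin_c, cyrillic_es):
    # Print the code point and the official Unicode character name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The two strings compare as different even though they look the same.
print(latin_c == cyrillic_es)
```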

34AnnieMod
Edited: Mar 5, 2021, 1:04 pm

>33 lorax: Yeah - not the problem here though.

These are all Cyrillic letters - probably not the best chosen examples because of the fact that they look like Latin ones, but that was the word I was playing with and I did not even think of that - I knew they were Cyrillic ones. :) And the fact that they are sorting between D and E and not with the Cs and Os also shows that :)

35melannen
Edited: Mar 5, 2021, 3:07 pm

>32 AnnieMod: looking again - my pre-2018 Cyrillic titles that sort to the end are also set to sort character 2 - two added manually, one from Moscow Library Network. (There is also a title in Han that sorts between them, also pre-2018, added from Amazon.co.jp.)

Going to go drop this in the new thread too, thanks for starting that.

36AnnieMod
Mar 5, 2021, 3:11 pm

>35 melannen: Thanks for the additional data point!

Yep - the ones flying at the end are because of the sort character - I checked a few more libraries and all are the same - 2 goes at the end; 1 sorts between D and E :) One mystery solved. Thus the bug. And it seems like it may be all/most non-Latin.

If we are supposed to use 2 instead of 1, that needs documenting and consistency and (start) should lead to 2 as well. Or it should stop adding 2 when adding.

37melannen
Mar 5, 2021, 3:15 pm

>36 AnnieMod: Even with the character set to 2 there's still something weird happening; my catalog is putting a katakana/kanji character (depending on what character it's actually sorting on) between two Cyrillic titles, and I can't think of why any deliberate sort order choice would do that.

38AnnieMod
Edited: Mar 5, 2021, 4:45 pm

>37 melannen: That kind of gives some credibility to my idea of counting 1 character as 2.
If they sort based on the first part of the character only (think of 0x+12 for Cyrillic and 1x+12 for Kana - not real values, just illustration). If we sort only based on the second part, they will get mixed.

I had seen this with double-byte character sets before when the solution forgets that these exist - it sorts based on bytes and not how a character is defined. It almost smells like that.

PS: And it could have been deliberate to try to kick the non-Latin out from D/E but if only one set was tested, they did not realize it then mixes them at the bottom... Or it IS a bug.

39Antheii
Mar 5, 2021, 4:40 pm

>31 AnnieMod: You are correct.... totally missed that (and also not sure how they got in there). But indeed after I changed all sort characters to blank (1), all books in Cyrillic moved to between D and E.

40AnnieMod
Edited: Mar 5, 2021, 4:44 pm

>39 Antheii: See the bug I opened referenced in >32 AnnieMod: -- the server set the 2 for some reason (but not always and not if you edit back to start) thus making this mess.

And leaving it to 2, leads to >37 melannen: (Japanese Kana, Cyrillic and probably other alphabets mixing up). Fun! :)

41AnnieMod
Mar 5, 2021, 5:11 pm

>37 melannen: If you set it to 1, do the Kana and Cyrillic mix again? A few tests I did just sent any Kana I added between A and B (once I fixed that pesky sort character).

I did a quick check with a couple of Kana characters and they all went between A and B.
The Kana Range is U+30A0..U+30FF
A is U+0041
B is U+0042

D is U+0044
E is U+0045
The Cyrillic set is U+0400

So something is weird. At least U+30A0..U+30FF sorts before U+0400 as expected when you put it to 1 (or so it seems) but how exactly they ended up between A/B and D/E respectively is a different headache...

42Antheii
Edited: Mar 5, 2021, 5:22 pm

>40 AnnieMod: Just added a book with a Japanese title (copied the title from a Japanese book shop); initially it also got the sorting at 2, and was sorted at the end of the list.
When removed, it was sorted between E and F. (the first character is U+822A)

I also have two books with Chinese titles; those are sorted at the end of the list, even though the sort character is 1.

43AnnieMod
Edited: Mar 5, 2021, 6:09 pm

>42 Antheii:

U+822A is part of the CJK ranges (U+4E00-U+9FFF) - Japanese Kanji, Chinese, Korean and Vietnamese (when they used them...) are all in here. But not Kana - Kana is earlier in U+30A0..U+30FF. So Kana should sort before Cyrillic and these should go after that (Latin characters go in between based on some principle... - Kana seems to go between A and B, Cyrillic between D and E and these go lower (clustered between E and F or maybe split between D/E, E/F and maybe F/G?))
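
The ranges mentioned above can be turned into a small lookup to check which range a given first character falls into. This is a Python sketch only: the `BLOCKS` table covers just the three ranges discussed in this thread (using their Unicode block names), and `block_of` is a made-up helper name.

```python
# (name, first code point, last code point) for the ranges from this thread.
BLOCKS = [
    ("Cyrillic", 0x0400, 0x04FF),
    ("Katakana", 0x30A0, 0x30FF),
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
]

def block_of(ch):
    """Return the name of the listed range containing ch, or "other"."""
    cp = ord(ch)
    for name, start, end in BLOCKS:
        if start <= cp <= end:
            return name
    return "other"

# D, Cyrillic С, Katakana カ, and the U+822A character from post 42.
for ch in ("D", "\u0421", "\u30ab", "\u822a"):
    print(f"U+{ord(ch):04X}", block_of(ch))
```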

Now... Why the Chinese are going to the bottom is an interesting question. Do you know the first character of the books you have in Unicode so we can see which range they fall into? Maybe they both use the extension sets and not the main CJK set?

I have a headache now...:)

44melannen
Mar 5, 2021, 8:11 pm

>41 AnnieMod: I was thinking maybe I should leave them be in case an example that hasn't been touched since pre-2018 was needed? But maybe the date issue is just a red herring.

45AnnieMod
Mar 5, 2021, 8:18 pm

>44 melannen: Who knows - if you want to leave them that's fine (you can always add a second book with the same title and fix it so you can see where it goes and then delete it) :)

There had been multiple changes in the handling of non-Latin characters (I almost had to go through circles to get this 2009 book added - and I had more books before I wiped them clean when I moved) and the sorting character was not initially here at all - so it is unclear what changes were done (and how many and when)... There is a time component but I do not think that there is anything in this besides at what point libraries started adding with 1 and not 2... :)

46bnielsen
Edited: Mar 6, 2021, 3:37 am

>45 AnnieMod: Thanks for the extended forensic analysis :-)

Note to self:
$ echo -n "Соф" | od -tx1
0000000 d0 a1 d0 be d1 84
$ echo -n "ÆØÅ" | od -tx1
0000000 c3 86 c3 98 c3 85
$ echo -n "BCD" | od -tx1
0000000 42 43 44

So Соф is encoded in UTF-8 as (hex bytes) d0 a1 d0 be d1 84, ÆØÅ as c3 86 c3 98 c3 85 and BCD as 42 43 44.
"Sort character" seems to be "byte count" but yes, still no idea about what goes on inside LT sorting.
The "programming" word for stuff like this is "collating sequence". But that didn't help me either :-)

We don't really know how LT represents strings internally. UTF-8 strings, UTF-16 strings, something else?
ETA: The Subjects field in the TSV export sometimes contain stuff that's not Unicode, so maybe it's just stored as a byte sequence.
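
The "sort character is really a byte count" guess can be illustrated with a small Python sketch. This is purely a model of the hypothesis, not LT's actual code: `sort_slice` is a made-up helper that treats the 1-based sort character as an offset into the UTF-8 bytes of the title.

```python
def sort_slice(title, sort_character):
    """Hypothetical: treat the 1-based sort character as a byte offset."""
    return title.encode("utf-8")[sort_character - 1:]

title = "\u0421\u043e\u0444\u0438\u0439\u0441\u043a\u043e"  # Софийско

for n in (1, 2, 3):
    chunk = sort_slice(title, n)
    # Show the first two bytes and the first character a lenient decoder sees.
    print(n, chunk[:2].hex(), chunk.decode("utf-8", errors="replace")[:1])
```

Under this model, offset 1 starts on С (d0 a1) and offset 3 starts on the second letter о (d0 be), which agrees with the observations in >31. Offset 2 starts in the middle of С, on the continuation byte a1, so the remaining bytes no longer begin with a valid character. Whether that is why such titles fly to the end of the list is a guess.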

47AnnieMod
Mar 6, 2021, 3:48 am

>46 bnielsen: Hold on.

This d0 and d1 as the first of the two bytes may be explaining exactly why the Cyrillic is showing between d and e.

Are Æ or Ø or Å sorting between c and d with sort character 1?

Does not explain what shenanigans happen on the second character but that may be the logic I had been trying to find on the first.

48Antheii
Edited: Mar 6, 2021, 4:25 am

>43 AnnieMod: Sure:
Book 1, first characters; U+5929 and U+9B42
Book 2, first characters; U+4E2D and U+56FD

And as a new snippet... a new entry for a book in Polish, mostly standard West European characters, but the first happened to be one with a diacritic, Ś (U+015A); again the starting character was set to 2, after which it sorted at the end (even after the Chinese entries), even though the second character was just a standard 'm' (U+006D).
After resetting the sorting character, it now sorts after the Z, but before the Chinese entries - which I assume might be correct (although I would personally favor it between the S and the T ;-) )

(as a side note; this entry was imported from the National Library of Poland)

49AnnieMod
Mar 6, 2021, 2:00 pm

>48 Antheii: These (the Chinese ones) are in the CJK ranges so it seems like Unicode is not what the sort is based on - or not entirely.

Try to use "3" if you want that Polish title to sort on its second character - 2 is basically splitting the first character and using just the high bytes.

The diacritics are tricky - I would love to see them where they belong as well but... it gets complicated across languages. The fact that the Polish book also got added with "2" may be worth adding to the thread referenced in >32 AnnieMod: - it looks like all the double byte characters get the treatment. What happens if you try to add a Polish book with a diacritic somewhere in it but starting with a clean Latin letter? Is it 1 or 2?

>47 AnnieMod: Now that I am awake, I think I am wrong again - d0 would have sorted before da (unless there is funny coding)... So ignore this... for now.

50Antheii
Mar 6, 2021, 4:03 pm

>49 AnnieMod: I understand sorting the Ś (and such) is tricky, it is just a personal preference. It was merely an observation the start character was automatically set at 2 for those characters too. But I'll add that to the other thread too.
And for your other question: when a title with some (Polish) diacritic is entered, but not at the first position, including ą ę ł ś ź, they sort just fine it seems.

51bnielsen
Mar 7, 2021, 9:19 am

>47 AnnieMod: "This d0 and d1 as the first of the two bytes may be explaining exactly why the Cyrillic is showing between d and e."

I hope not, since d and e would become 44 and 45 if treated the same way as the Cyrillic characters. So I think it's a red herring :-)

>50 Antheii: Yes "sort character" seems to be "sort byte" which is really weird. (But allowing experiments like >49 AnnieMod: mentions with "sort byte" set to 2 on a title like Tффф. My guess is that it sorts between D and E like the rest.)

52Antheii
Apr 21, 2021, 4:40 am

Hi,

Is this issue on somebody's to-do list?
It looks like, because it is so old, no ticket/bug was automatically created for it - so I was wondering whether the dev team is even aware (but maybe they just have it categorized as low priority ;-) )

53AnnieMod
Apr 21, 2021, 4:55 pm

Probably not... but https://www.librarything.com/topic/330337 is in the proper system (And it seems to be the root cause of the sorting issue with the non-Latin letters).

Of course, it is also just sitting there. :) Maybe in another decade? :)