Highlight search result in Arabic text

General TRichView support forum. Please post your questions here
saeid2016
Posts: 72
Joined: Wed Mar 16, 2016 11:56 am

Highlight search result in Arabic text

Post by saeid2016 »

Hello,
I use MarkSubStringW function to highlight search result in RichViewEdit. but there is a problem in Arabic texts which have diacritics.

I want to search Arabic text that contains diacritics and highlight the matched words but I don't want the user to type the words with diacritics, because most people don't use them when typing.

But what I want to happen is that if the text is " الَلَهُم صَلِ عَلى مُحَمّد و آل مُحمد " and user searches for " محمد " it should highlight "مُحَمّد " and " مُحمد " regardless of all the diacritics.
Speed of search and highlight very important for me.

please help.
Sergey Tkachenko
Site Admin
Posts: 17557
Joined: Sat Aug 27, 2005 10:28 am
Contact:

Post by Sergey Tkachenko »

Sorry, I do not understand Arabic, so I am afraid I cannot modify this function for you.

However, if you look in MarkSearch.pas, you will see that it has
function StrPosW(s, SubStr: PRVUnicodeChar): PRVUnicodeChar;

This function is used to find SubStr in s.
If you can modify it so that it handles diacritics, I can create a modified version of MarkSubStringW.

Since the found substring may have a different length than SubStr, this function must return not only the found position (SubStrPos), but also length of the text to highlight (Len).
procedure StrPosWEx(s, SubStr: PRVUnicodeChar; var SubStrPos: PRVUnicodeChar; Len: Integer);
saeid2016
Posts: 72
Joined: Wed Mar 16, 2016 11:56 am

Post by saeid2016 »

Thank you,
I found this code in JavaScript:

Code: Select all

function stripDiacritics(rawString) {
   return rawString.replace(/ُ|ِ|َ|ٍ|ً|ّ/g,"");        
}

function highlight() {
    var textblock = document.getElementById('textSection').innerHTML;
    var words = textblock.split(' ');
    var resultText = '';
    
    for (var i=0; i < words.length; i++) {
        var strippedWord = stripDiacritics(words[i]);
        if (strippedWord == 'محمد') {
            resultText += ' <span class="highlighted">'+words[i]+'</span>';
        }            
        else {
            resultText += ' '+words[i];
        }
    }
        
    document.getElementById('textSection').innerHTML = resultText;
    
}
if RichViewEdit has a function to get words of text in array for example, and has another function to highlight a word with index, it is possible to convert the java code to Delphi.
Sergey Tkachenko
Site Admin
Posts: 17557
Joined: Sat Aug 27, 2005 10:28 am
Contact:

Post by Sergey Tkachenko »

Well, this solution is far from optimal:
- it requires disassembling strings to words
- works only for whole words
- recreates the original string, even if nothing was found.

However, I can see how it could be implemented. We can create a function removing all diacritics from string, and making a map of character positions from the new string to the original string. Then we can search for a substring in the processed string, and, if found, use the map to find the position of highlighting in the original document.

It's not very difficult, but is not a trivial as well. We are currently busy on releasing Report Workshop components, so I cannot do it right now.
But I'll try to implement it later in this week.
I think it's an interesting option to add to the regular SearchText method of TRichView as well.
saeid2016
Posts: 72
Joined: Wed Mar 16, 2016 11:56 am

Post by saeid2016 »

Well, Thank you Sergey,

There is another way in my mind. that way is difining and using "?" & "*" or other characters in searching and highlighting, like searching in Windows.
For example we can define "*" character instead of diacritics in MarkSubStringW function, then when user typed any words to search we add this character after each characters of user word and send new string to the function. like this:

Code: Select all

 function CreateNewText (OriginalText : string) : string;
 var
    i : Integer;
    s : string;
 begin
   s := OriginalText[1];

   for i := 2 to Length(OriginalText) do
     s := s + '*' + OriginalText[i];

   Result := s;
 end;
procedure TForm1.btn1Click(Sender: TObject);
begin
  MarkSubStringW(CreateNewText(Edit1.Text), clRed, clYellow, False, False, RichViewEdit1)
end;
Is it a good way?
Sergey Tkachenko
Site Admin
Posts: 17557
Joined: Sat Aug 27, 2005 10:28 am
Contact:

Post by Sergey Tkachenko »

Searching by templates (or even regular expressions) would be an interesting feature, but it's not easy to implement, and it would be an overkill for a simple task of ignoring diacritics.

Also, if '*' means any number of characters, it may match not only diacritic characters, but any, may be very long, text.
saeid2016
Posts: 72
Joined: Wed Mar 16, 2016 11:56 am

Post by saeid2016 »

Sergey Tkachenko wrote:Well, this solution is far from optimal:
- it requires disassembling strings to words
- works only for whole words
- recreates the original string, even if nothing was found.

However, I can see how it could be implemented. We can create a function removing all diacritics from string, and making a map of character positions from the new string to the original string. Then we can search for a substring in the processed string, and, if found, use the map to find the position of highlighting in the original document.

It's not very difficult, but is not a trivial as well. We are currently busy on releasing Report Workshop components, so I cannot do it right now.
But I'll try to implement it later in this week.
I think it's an interesting option to add to the regular SearchText method of TRichView as well.
Hello Sergey, Did you implement it?
saeid2016
Posts: 72
Joined: Wed Mar 16, 2016 11:56 am

Post by saeid2016 »

Sergey Tkachenko wrote:Sorry, I do not understand Arabic, so I am afraid I cannot modify this function for you.

However, if you look in MarkSearch.pas, you will see that it has
function StrPosW(s, SubStr: PRVUnicodeChar): PRVUnicodeChar;

This function is used to find SubStr in s.
If you can modify it so that it handles diacritics, I can create a modified version of MarkSubStringW.

Since the found substring may have a different length than SubStr, this function must return not only the found position (SubStrPos), but also length of the text to highlight (Len).
procedure StrPosWEx(s, SubStr: PRVUnicodeChar; var SubStrPos: PRVUnicodeChar; Len: Integer);
Hi Sergey,
The Arabic diacritics are:
#1611
#1612
#1613
#1614
#1615
#1616
#1617
#1618

In Arabic Text, The diacritics are always placed after letters(not before), so the StrPosWEx function must first remove this diacritics from S then Return the address of the first occurence of SubStr in S and length of the text to highlight.
Sergey Tkachenko
Site Admin
Posts: 17557
Joined: Sat Aug 27, 2005 10:28 am
Contact:

Post by Sergey Tkachenko »

Sorry, I did not complete this work yet.
I want to write an universal search function for ignoring diacritics, not only for Arabic.
Actually, in any language, diacritic characters follow the base character.
There is one problem, though. The same text can be written differently - in decomposed form (where base characters and diacritic characters are separate) and in precomposed form (where they are represented as a single character).
saeid2016
Posts: 72
Joined: Wed Mar 16, 2016 11:56 am

Post by saeid2016 »

Sergey Tkachenko wrote:There is one problem, though. The same text can be written differently - in decomposed form (where base characters and diacritic characters are separate) and in precomposed form (where they are represented as a single character).
Yes this is a problem but solving it not very difficult, I am impatiently waiting for your help.
Sergey Tkachenko
Site Admin
Posts: 17557
Joined: Sat Aug 27, 2005 10:28 am
Contact:

Post by Sergey Tkachenko »

Well, this problem may be not very difficult, but still requires attention and accuracy. Windows includes NormalizeString function that does this work, but this function does not allow to get a correspondence between characters of an original and a resulting strings.

Finally, I created a modification of MarkSubStringW with additional Options parameter. It may include:
rvmsoNormalize - normalizes strings before searching
rvmsoStripDiacritic - normalizes strings and removes diacritics before searching.

Download the modified unit and the demo:
http://www.trichview.com/support/files/ ... search.zip
(for Arabic test, assign BiDiMode property in the demo)
Sergey Tkachenko
Site Admin
Posts: 17557
Joined: Sat Aug 27, 2005 10:28 am
Contact:

Post by Sergey Tkachenko »

Examples:

Search with ignored diacritic characters, your text:
Image

Search with ignored diacritic characters, Western text
Image

The Unicode normalization introduces some interesting side effects.

For example, different forms of the same character can be found:
Image

Or you can search "inside" composite characters:
Image

Please check if everything is ok. If yes, I'll include this change in RichViewActions, and add the same feature to TRichView.SearchText
Sergey Tkachenko
Site Admin
Posts: 17557
Joined: Sat Aug 27, 2005 10:28 am
Contact:

Post by Sergey Tkachenko »

I uploaded a fixed version.

In addition to an important fix in the functions themselves, it has the following changes:

1) compatibility with XE5 and older is restored
2) Boolean parameters (IgnoreCase and WholeWords) are removed, the corresponding options are added in the Options parameter instead
3) WinAPI NormalizeString function is now linked dynamically, rvmsoNormalize and rvmsoStripDiacritic options are processed only if this function is available.
It is available starting from Windows Vista. For XP, it requires a separate download https://msdn.microsoft.com/en-us/librar ... s.85).aspx
saeid2016
Posts: 72
Joined: Wed Mar 16, 2016 11:56 am

Post by saeid2016 »

Sergey Tkachenko wrote:I uploaded a fixed version.

In addition to an important fix in the functions themselves, it has the following changes:

1) compatibility with XE5 and older is restored
2) Boolean parameters (IgnoreCase and WholeWords) are removed, the corresponding options are added in the Options parameter instead
3) WinAPI NormalizeString function is now linked dynamically, rvmsoNormalize and rvmsoStripDiacritic options are processed only if this function is available.
It is available starting from Windows Vista. For XP, it requires a separate download https://msdn.microsoft.com/en-us/librar ... s.85).aspx
Thank you very very much Sergey, I will check it and get you feedback.

Two question:

1. if we set options to rvmsoNormalize or rvmsoStripDiacritic, is it possible to define new rule to normalization of string?
for example in Arabic text we have this two characters: ه and ة which are not same characters but if users want to search for a word which it's last character is ة like فاطمة, sometimes they type ه instead of ة like this: فاطمه. is it possible to highlight فاطمة when user types فاطمه ?

2. I want to get the Position of marked words by MarkSubStringW as a list to move caret position to them by clicking on NextButton and PrevButton. I should modify MarkSubStringW to get this list or RichView has a function to move caret position to highlighted word position?
Sergey Tkachenko
Site Admin
Posts: 17557
Joined: Sat Aug 27, 2005 10:28 am
Contact:

Post by Sergey Tkachenko »

This code uses WinAPI NormalizeString to get NFKD (compatibility decomposition normalization form) of text, when these normalized texts are compared.
While this type of normalization treats different forms of a character as the same, for example (①= 1), it still treats ه and ة as different characters.
So you need to patch the function manually.

Find

Code: Select all

Len2 := RVNormalizeString(NormalizationKD, Ptr, Len,
      PtrDest, Length(Result));
Add after:

Code: Select all

if Ptr^ = 'ة' then
  PtrDest^ := 'ه';
Find:

Code: Select all

Len2 := RVNormalizeString(NormalizationKD, Ptr, Len,
      PtrDest, Length(Result) div 2);
Add after:

Code: Select all

if Ptr^ = 'ة' then
  PtrDest^ := 'ه';
Post Reply