Highlight search result in Arabic text

saeid2016 · Post by **saeid2016** » Sat Mar 19, 2016 8:01 pm

Hello,
I use MarkSubStringW function to highlight search result in RichViewEdit. but there is a problem in Arabic texts which have diacritics.

I want to search Arabic text that contains diacritics and highlight the matched words but I don't want the user to type the words with diacritics, because most people don't use them when typing.

But what I want to happen is that if the text is " الَلَهُم صَلِ عَلى مُحَمّد و آل مُحمد " and user searches for " محمد " it should highlight "مُحَمّد " and " مُحمد " regardless of all the diacritics.
Speed of search and highlight very important for me.

please help.

Post by **Sergey Tkachenko** » Sat Mar 19, 2016 8:38 pm

Sorry, I do not understand Arabic, so I am afraid I cannot modify this function for you.

However, if you look in MarkSearch.pas, you will see that it has
function StrPosW(s, SubStr: PRVUnicodeChar): PRVUnicodeChar;

This function is used to find SubStr in s.
If you can modify it so that it handles diacritics, I can create a modified version of MarkSubStringW.

Since the found substring may have a different length than SubStr, this function must return not only the found position (SubStrPos), but also length of the text to highlight (Len).
procedure StrPosWEx(s, SubStr: PRVUnicodeChar; var SubStrPos: PRVUnicodeChar; Len: Integer);

saeid2016 · Post by **saeid2016** » Sat Mar 19, 2016 9:01 pm

Thank you,
I found this code in JavaScript:

Code: Select all

function stripDiacritics(rawString) {
   return rawString.replace(/ُ|ِ|َ|ٍ|ً|ّ/g,"");        
}

function highlight() {
    var textblock = document.getElementById('textSection').innerHTML;
    var words = textblock.split(' ');
    var resultText = '';
    
    for (var i=0; i < words.length; i++) {
        var strippedWord = stripDiacritics(words[i]);
        if (strippedWord == 'محمد') {
            resultText += ' <span class="highlighted">'+words[i]+'</span>';
        }            
        else {
            resultText += ' '+words[i];
        }
    }
        
    document.getElementById('textSection').innerHTML = resultText;
    
}

if RichViewEdit has a function to get words of text in array for example, and has another function to highlight a word with index, it is possible to convert the java code to Delphi.

Post by **Sergey Tkachenko** » Mon Mar 21, 2016 8:58 am

Well, this solution is far from optimal:
- it requires disassembling strings to words
- works only for whole words
- recreates the original string, even if nothing was found.

However, I can see how it could be implemented. We can create a function removing all diacritics from string, and making a map of character positions from the new string to the original string. Then we can search for a substring in the processed string, and, if found, use the map to find the position of highlighting in the original document.

It's not very difficult, but is not a trivial as well. We are currently busy on releasing Report Workshop components, so I cannot do it right now.
But I'll try to implement it later in this week.
I think it's an interesting option to add to the regular SearchText method of TRichView as well.

saeid2016 · Post by **saeid2016** » Mon Mar 21, 2016 10:36 am

Well, Thank you Sergey,

There is another way in my mind. that way is difining and using "?" & "*" or other characters in searching and highlighting, like searching in Windows.
For example we can define "*" character instead of diacritics in MarkSubStringW function, then when user typed any words to search we add this character after each characters of user word and send new string to the function. like this:

Code: Select all

 function CreateNewText (OriginalText : string) : string;
 var
    i : Integer;
    s : string;
 begin
   s := OriginalText[1];

   for i := 2 to Length(OriginalText) do
     s := s + '*' + OriginalText[i];

   Result := s;
 end;
procedure TForm1.btn1Click(Sender: TObject);
begin
  MarkSubStringW(CreateNewText(Edit1.Text), clRed, clYellow, False, False, RichViewEdit1)
end;

Is it a good way?

Post by **Sergey Tkachenko** » Tue Mar 22, 2016 6:04 pm

Searching by templates (or even regular expressions) would be an interesting feature, but it's not easy to implement, and it would be an overkill for a simple task of ignoring diacritics.

Also, if '*' means any number of characters, it may match not only diacritic characters, but any, may be very long, text.

saeid2016 · Post by **saeid2016** » Sat Mar 26, 2016 10:52 am

Sergey Tkachenko wrote:Well, this solution is far from optimal:
- it requires disassembling strings to words
- works only for whole words
- recreates the original string, even if nothing was found.

However, I can see how it could be implemented. We can create a function removing all diacritics from string, and making a map of character positions from the new string to the original string. Then we can search for a substring in the processed string, and, if found, use the map to find the position of highlighting in the original document.

It's not very difficult, but is not a trivial as well. We are currently busy on releasing Report Workshop components, so I cannot do it right now.
But I'll try to implement it later in this week.
I think it's an interesting option to add to the regular SearchText method of TRichView as well.

Hello Sergey, Did you implement it?

saeid2016 · Post by **saeid2016** » Sun Mar 27, 2016 11:37 am

Sergey Tkachenko wrote:Sorry, I do not understand Arabic, so I am afraid I cannot modify this function for you.

However, if you look in MarkSearch.pas, you will see that it has
function StrPosW(s, SubStr: PRVUnicodeChar): PRVUnicodeChar;

This function is used to find SubStr in s.
If you can modify it so that it handles diacritics, I can create a modified version of MarkSubStringW.

Since the found substring may have a different length than SubStr, this function must return not only the found position (SubStrPos), but also length of the text to highlight (Len).
procedure StrPosWEx(s, SubStr: PRVUnicodeChar; var SubStrPos: PRVUnicodeChar; Len: Integer);

Hi Sergey,
The Arabic diacritics are:
#1611
#1612
#1613
#1614
#1615
#1616
#1617
#1618

In Arabic Text, The diacritics are always placed after letters(not before), so the StrPosWEx function must first remove this diacritics from S then Return the address of the first occurence of SubStr in S and length of the text to highlight.

Post by **Sergey Tkachenko** » Sun Mar 27, 2016 6:51 pm

Sorry, I did not complete this work yet.
I want to write an universal search function for ignoring diacritics, not only for Arabic.
Actually, in any language, diacritic characters follow the base character.
There is one problem, though. The same text can be written differently - in decomposed form (where base characters and diacritic characters are separate) and in precomposed form (where they are represented as a single character).

saeid2016 · Post by **saeid2016** » Mon Mar 28, 2016 7:28 am

Sergey Tkachenko wrote:There is one problem, though. The same text can be written differently - in decomposed form (where base characters and diacritic characters are separate) and in precomposed form (where they are represented as a single character).

Yes this is a problem but solving it not very difficult, I am impatiently waiting for your help.

Post by **Sergey Tkachenko** » Thu Mar 31, 2016 9:09 pm

Well, this problem may be not very difficult, but still requires attention and accuracy. Windows includes NormalizeString function that does this work, but this function does not allow to get a correspondence between characters of an original and a resulting strings.

Finally, I created a modification of MarkSubStringW with additional Options parameter. It may include:
rvmsoNormalize - normalizes strings before searching
rvmsoStripDiacritic - normalizes strings and removes diacritics before searching.

Download the modified unit and the demo:
http://www.trichview.com/support/files/ ... search.zip
(for Arabic test, assign BiDiMode property in the demo)

Post by **Sergey Tkachenko** » Thu Mar 31, 2016 9:14 pm

Examples:

Search with ignored diacritic characters, your text:

Search with ignored diacritic characters, Western text

The Unicode normalization introduces some interesting side effects.

For example, different forms of the same character can be found:

Or you can search "inside" composite characters:

Please check if everything is ok. If yes, I'll include this change in RichViewActions, and add the same feature to TRichView.SearchText

Post by **Sergey Tkachenko** » Fri Apr 01, 2016 12:45 pm

I uploaded a fixed version.

In addition to an important fix in the functions themselves, it has the following changes:

1) compatibility with XE5 and older is restored
2) Boolean parameters (IgnoreCase and WholeWords) are removed, the corresponding options are added in the Options parameter instead
3) WinAPI NormalizeString function is now linked dynamically, rvmsoNormalize and rvmsoStripDiacritic options are processed only if this function is available.
It is available starting from Windows Vista. For XP, it requires a separate download https://msdn.microsoft.com/en-us/librar ... s.85).aspx

saeid2016 · Post by **saeid2016** » Fri Apr 01, 2016 5:11 pm

Sergey Tkachenko wrote:I uploaded a fixed version.

In addition to an important fix in the functions themselves, it has the following changes:

1) compatibility with XE5 and older is restored
2) Boolean parameters (IgnoreCase and WholeWords) are removed, the corresponding options are added in the Options parameter instead
3) WinAPI NormalizeString function is now linked dynamically, rvmsoNormalize and rvmsoStripDiacritic options are processed only if this function is available.
It is available starting from Windows Vista. For XP, it requires a separate download https://msdn.microsoft.com/en-us/librar ... s.85).aspx

Thank you very very much Sergey, I will check it and get you feedback.

Two question:

1. if we set options to rvmsoNormalize or rvmsoStripDiacritic, is it possible to define new rule to normalization of string?
for example in Arabic text we have this two characters: ه and ة which are not same characters but if users want to search for a word which it's last character is ة like فاطمة, sometimes they type ه instead of ة like this: فاطمه. is it possible to highlight فاطمة when user types فاطمه ?

2. I want to get the Position of marked words by MarkSubStringW as a list to move caret position to them by clicking on NextButton and PrevButton. I should modify MarkSubStringW to get this list or RichView has a function to move caret position to highlighted word position?

Post by **Sergey Tkachenko** » Sat Apr 02, 2016 12:08 pm

This code uses WinAPI NormalizeString to get NFKD (compatibility decomposition normalization form) of text, when these normalized texts are compared.
While this type of normalization treats different forms of a character as the same, for example (①= 1), it still treats ه and ة as different characters.
So you need to patch the function manually.

Find

Code: Select all

Len2 := RVNormalizeString(NormalizationKD, Ptr, Len,
      PtrDest, Length(Result));

Add after:

Code: Select all

if Ptr^ = 'ة' then
  PtrDest^ := 'ه';

Find:

Code: Select all

Len2 := RVNormalizeString(NormalizationKD, Ptr, Len,
      PtrDest, Length(Result) div 2);

Add after:

Code: Select all

if Ptr^ = 'ة' then
  PtrDest^ := 'ه';