[Israel.pm] Displaying bidi text in re (e.g. in the editor).

Amit Aronovitch aronovitch at gmail.com
Sun Feb 1 04:36:34 PST 2009


On 02/01/2009 09:13 AM, Gaal Yahas wrote:
> #2 is very near what you get today by looking at source code in an
> environment that attempts to run the Unicode BiDi algorithm on your
> source, for example a browser. In my experience, this can be very
> confusing.
>
> If I'm understanding your proposal correctly, in #2 you still want to
> fix the crazy cases of flipped capture parentheses etc., but that
> implies your editor must understand regular expressions, which at best
> it will do when something is a regexp literal but which will fail when
> the programmer is constructing a regexp piecemeal from strings.
>   
Yes, that's why I said "ideal editor" - assume for the purpose of 
discussion that it can magically distinguish string literals that are 
used as REs from general-purpose strings (which are treated as tokens 
and have the general bidi algorithm applied to them). I imagine a 
practical solution would let the user manually adjust "roles" for 
string literals (however, I do not know where it would save this info 
for future edits...).
If you construct your regexp from pieces, I expect each part to be 
meaningful as an RE on its own, so the same rules apply to it.

> Indeed, your solution doesn't attempt to handle simple strings, so the
> user can get confused by inconsistency issues: why $foo = "SHALOM" but
> $foo = qr/MOLAHS/?
Actually, the suggestion for source code (programming-language 
dependent - again we assume full syntax awareness) specifically defines 
string literals and comments as tokens, so it would be $foo = "MOLAHS" 
as well.
>  Also, there's the tokenization problem you already
> mention: NATBA"G would come out very bad from this transformation.
>   
In #1 it would certainly not be much different from NATBAG. In #2, since 
" is not a special character in RE, it will be included in the token and 
have the standard bidi algorithm applied to it - so you'd get G"ABTAN, 
just as you'd probably expect.
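
To make the two options concrete, here is a rough Python sketch of the 
display models under discussion. It assumes the convention used in this 
thread (capitals stand for RTL characters, lowercase for LTR). toy_bidi 
is only a crude stand-in for the real Unicode bidi algorithm: it reverses 
RTL runs inside an LTR context and nothing more. The function names and 
the set of "special" characters are hypothetical, chosen for illustration.

```python
import re

# Assumption: these count as RE "special" characters (separators);
# the / delimiters are treated as separators too.
SPECIAL = re.compile(r'([.^$*+?()\[\]{}|\\/])')

# A run that starts and ends with an RTL (capital) letter and contains
# no LTR (lowercase) letters; neutrals such as " stay inside the run.
RTL_RUN = re.compile(r'[A-Z](?:[^a-z]*[A-Z])?')


def toy_bidi(token):
    """Crude bidi stand-in: reverse each RTL run, keep everything else."""
    return RTL_RUN.sub(lambda m: m.group(0)[::-1], token)


def display_option1(pattern):
    """#1: every pattern character is an atom, so display is plain LTR."""
    return pattern


def display_option2(pattern):
    """#2: bidi inside each token; tokens and separators joined LTR."""
    parts = SPECIAL.split(pattern)
    # re.split with a capturing group alternates token, separator, token...
    return ''.join(p if i % 2 else toy_bidi(p) for i, p in enumerate(parts))


for src in ('/SHALOM/', '/YADAII?M/', '/NATBA"G/'):
    print(src, display_option1(src), display_option2(src))
```

With this model, /SHALOM/ comes out as /MOLAHS/ and /YADAII?M/ as 
/IIADAY?M/ under #2, while #1 leaves both untouched.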
> So if these are the two options, I'd vote for #1. (I'd also recommend
> in source code representing invisible override characters as entities,
> so that if the actual data has RLM/LRM/etc. marks, where possible it's
> better not to have those as literals in the source code. That's a
> recommendation for the end programmer, not the editor, I think.)
>   
Another recommendation would probably be to use i18n and keep all 
user-facing strings in po files, which as a nice byproduct would also 
reduce potential problems with string literals...
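
As for the quoted recommendation about invisible directional marks, a 
small helper along these lines could rewrite them as visible escapes 
before a string ends up in source code. This is only a sketch in Python: 
the name escape_bidi_controls and the choice of \uXXXX as the escape form 
are assumptions (in Perl the equivalent would be written \x{200F}).

```python
# Invisible Unicode directional controls: LRM, RLM, ALM, the
# embedding/override family (U+202A..U+202E) and the isolate family
# (U+2066..U+2069).
BIDI_CONTROLS = {
    0x200E, 0x200F, 0x061C,
    *range(0x202A, 0x202F),
    *range(0x2066, 0x206A),
}


def escape_bidi_controls(text):
    """Replace invisible bidi controls with visible \\uXXXX escapes.

    Hypothetical helper: visible characters, including Hebrew and
    Arabic letters, are left untouched.
    """
    return ''.join(
        f'\\u{ord(ch):04x}' if ord(ch) in BIDI_CONTROLS else ch
        for ch in text
    )


print(escape_bidi_controls('abc\u200fdef'))
```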
> Perhaps one day we can have an alternate surface syntax for strings
> and regular expressions that is designed to be RTL friendly. Hebrew or
> Arabic metacharacters, introduced with slashes instead of backslashes.
> This would be used in conjunction with an editor hint that makes the
> whole expression RTL (in Unicode terms, sets the paragraph
> directionality). If used on whole lines of code, it can also change
> the alignment to be right-justified. Perl 6 has many quotelike
> operators, with room for extensions like this. The problem with these
> things is that they're hard to get right, and pesky problems will
> still appear no matter how hard you try if the data really does
> contain mixed directionality characters. We can ask Larry for his
> opinion, though.
>   
Interesting idea. We tend to think in terms of adjusting bidi display 
algorithms to cope with existing syntax. Adjusting syntax to cope with 
bidi issues would probably be less quirky and easier for editor devs to 
implement.

Thanks for this input,
Amit
> On Sun, Feb 1, 2009 at 5:08 AM, Amit Aronovitch <aronovitch at gmail.com> wrote:
>   
>> Hi,
>>
>> Following a discussion I took part in about standardization of the
>> display of Hebrew text in structured expressions and source code, I
>> would be happy to hear some opinions about how we would like regular
>> expressions containing bidi chars to be displayed (in an "ideal
>> editor" that is fully syntax aware).
>>
>> In the examples below, caps represent RTL characters and lowercase LTR chars.
>>
>> The basic principle that was proposed (for structured expressions) is
>> that text should be split into "separators" and "tokens" according to
>> the relevant syntax, the general-purpose Bidi rules be applied within
>> each token only, and then tokens and separators should be concatenated
>> left to right always.
>>
>> Applied to regular expressions, I thought that since in RE each
>> pattern character is an atom (1), then this effectively means to force
>> LTR everywhere (except maybe stuff like named captures
>> (?<NAME>...) etc.).
>> However, it was suggested instead that any sequence of pattern
>> characters (not containing "special" characters) should be treated as
>> a token (2).
>> This would make simple searches easier to read,
>>
>> e.g. /SHALOM/ would be displayed /SHALOM/ by (1),
>> but /MOLAHS/ (much more readable if it was actual Hebrew) by (2).
>> On the other hand, /YADAII?M/ in (2) would show as /IIADAY?M/ ,
>>  which is very confusing, so I thought the simplification was not worth it.
>>
>> However, I am used to languages where simple searches are commonly
>> done by other means, whereas in Perl using RE for simple text search
>> might be more common because of the specialized syntax. What do you
>> think?
>>
>>    Amit
>>     


