[Israel.pm] Displaying bidi text in re (e.g. in the editor).

Amit Aronovitch aronovitch at gmail.com
Wed Feb 4 01:44:01 PST 2009


Gaal Yahas wrote:
> On Sun, Feb 1, 2009 at 2:36 PM, Amit Aronovitch <aronovitch at gmail.com> wrote:
>   
>> If you construct your regexp from pieces, I expect each part to be
>> meaningful as re on its own, so have the same rules.
>>     
>
> No, you can't make this assumption. Here's a common Perl 5 idiom:
>
> my $alternatives = "(" . (join ")|(", map quotemeta $_, @choices) . ")";
> my $re = qr/$alternatives/;
>
> There happens to be no bidi trouble here, but it's evidence that
> people assemble regexps from small pieces. I'm sure there are other
> cases.
>
>   
In fact this is very much like the example I had in mind. I expected
each part to be a meaningful re only because I could not think of any
useful counterexample. Nevertheless, you are right, and I suspect that
such counterexamples will pop up eventually.
The role-selection scheme I proposed earlier could cover that case, and
I believe that the confusion caused by such edge cases would not
outweigh the clarity that a standardized bidi display method for
complex expressions would bring.
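
To make the point concrete, here is a minimal sketch that splits the
idiom quoted above into the fragments being concatenated and checks
which of them compile as a regexp on their own. The sample @choices and
the compiles() helper are made up for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample data; any list of literal strings would do.
my @choices = ('a.b', 'c|d');

# The same fragments the idiom concatenates, kept separate for inspection.
my @pieces = ('(', join(')|(', map { quotemeta($_) } @choices), ')');

# Hypothetical helper: true if the string compiles as a regexp by itself.
sub compiles {
    my ($src) = @_;
    my $re = eval { qr/$src/ };   # a malformed pattern dies; eval traps it
    return defined $re ? 1 : 0;
}

for my $piece (@pieces) {
    printf "%-14s %s\n", $piece,
        compiles($piece) ? 'valid on its own' : 'not valid on its own';
}

# Only the assembled string is expected to be a well-formed regexp.
my $whole = join '', @pieces;
print "$whole compiles as a whole\n" if compiles($whole);
```

With this input none of the three fragments compiles alone (each has an
unmatched parenthesis), while the assembled string does. So a display
scheme cannot assume that every piece it is handed is a well-formed re.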

>> Actually, the suggestion for source code (programming-language
>> dependent - again we assume full syntax awareness) specifically defines
>> string literals and comments as tokens, so it would be $foo = "MOLAHS"
>> as well.
>>     
>>>  Also, there's the tokenization problem you already
>>> mention: NATBA"G would come out very bad from this transformation.
>>>       
>> In #1 certainly it would not be much different than NATBAG. In #2, since
>> " is not a special character in RE, it will be included in the token and
>> have the standard Bidi algorithm applied to it - so you'd get G"ABTAN ,
>> just as you'd probably expect.
>>     
>
> Hmmm, you're right. I can't off-hand give you a counterexample where
> the bidi algorithm's idea of tokenization and RE's differ, but I have
> a hunch it's difficult to cover all cases.
>
>   
We are not talking about the general-purpose bidi algorithm, but about a
new standard that would provide a "higher-level protocol" for it (as
suggested in Unicode UAX #9). The main idea is that the application (e.g.
an editor) would do a *syntax-dependent* tokenization and run the bidi
algorithm separately on each token (maybe the term "token" is misleading
here, as it may refer to long strings, e.g. comments).
If we define the "tokenization" process for RE correctly, there should
be no collisions (that is exactly its purpose).
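
A toy sketch of the idea, under loudly stated assumptions: uppercase
ASCII letters stand in for Hebrew letters (as in the examples in this
thread), the list of RE metacharacters is simplified, and reversing a
run of uppercase letters is only a crude stand-in for running the real
bidi algorithm on a token. The function names are hypothetical and not
taken from any draft.

```perl
use strict;
use warnings;

# Simplified set of RE metacharacters (an assumption, not a full grammar).
my $META = quotemeta '()|*+?[]{}^$.';

# Split a regexp into tokens: an escaped character, a single
# metacharacter, or a maximal run of literal characters.
sub tokenize_re {
    my ($src) = @_;
    return $src =~ /\G(\\.|[${META}]|[^\\${META}]+)/gs;
}

# Crude stand-in for applying the bidi algorithm to ONE token:
# every run that starts and ends with a pseudo-RTL (uppercase) letter
# and contains no lowercase letter is shown reversed.
sub display_token {
    my ($tok) = @_;
    return $tok if $tok =~ /^\\/;    # leave escapes such as \D alone
    (my $shown = $tok) =~ s/([A-Z](?:[^a-z]*[A-Z])?)/scalar reverse $1/ge;
    return $shown;
}

# Tokens keep their logical left-to-right order; only the inside of
# each token is reordered.
sub display_re {
    my ($src) = @_;
    return join '', map { display_token($_) } tokenize_re($src);
}

print display_re('NATBA"G|foo'), "\n";   # prints G"ABTAN|foo
print display_re('(abc|DEF)+'), "\n";    # prints (abc|FED)+
```

Since " is not a metacharacter here, it stays inside the literal token
and is reordered together with the letters around it, while |, ( and )
keep their places between tokens.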

> You're welcome. I may be interested in future discussions on this
> topic, if you're cooking something up.
>   
As has probably become apparent by now, what is being "cooked up" is a
proposal for a new standard for bidi display of complex expressions (it
includes general guidelines and several examples for specific syntaxes,
one of which is RE).
I'll send the current draft to you and Gabor - comments are very welcome.
I prefer not to post it publicly because it is still preliminary
and undergoing major modifications, but if anyone is interested, please
contact me.

  Amit



