[Israel.pm] Displaying bidi text in re (e.g. in the editor).

Gaal Yahas gaal at forum2.org
Sun Feb 1 05:12:45 PST 2009


On Sun, Feb 1, 2009 at 2:36 PM, Amit Aronovitch <aronovitch at gmail.com> wrote:
> On 02/01/2009 09:13 AM, Gaal Yahas wrote:
>> #2 is very near what you get today by looking at source code in an
>> environment that attempts to run the Unicode BiDi algorithm on your
>> source, for example a browser. In my experience, this can be very
>> confusing.
>>
>> If I'm understanding your proposal correctly, in #2 you still want to
>> fix the crazy cases of flipped capture parentheses etc., but that
>> implies your editor must understand regular expressions, which at best
>> it will do when something is a regexp literal but which will fail when
>> the programmer is constructing a regexp piecemeal from strings.
>>
> Yes, that's why I said "ideal editor" - assume for the purpose of
> discussion that it can magically distinguish string literals which are
> used for RE from general purpose strings (which are treated as tokens
> and have the general bidi algorithm applied to them). I imagine a
> practical solution would allow the user to manually adjust "roles" for
> literal strings (however, I do not know where it would save this info
> for future edits...).
> If you construct your regexp from pieces, I expect each part to be
> meaningful as an RE on its own, and so to follow the same rules.

No, you can't make this assumption. Here's a common Perl 5 idiom:

my $alternatives = "(" . (join ")|(", map quotemeta $_, @choices) . ")";
my $re = qr/$alternatives/;

There happens to be no bidi trouble here, but it's evidence that
people assemble regexps from small pieces: the ")|(" glue is not a
meaningful regexp on its own. I'm sure there are other cases.
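
To make that concrete, here is a minimal, self-contained sketch of
the same idiom. The helper name build_re and the sample choices are
made up for illustration, and the result is anchored with \A and \z
(which the idiom above does not do) so that the demo tests whole
strings.

```perl
use strict;
use warnings;

# Build one regexp matching any of the given literal strings.
# build_re is a hypothetical helper name, not part of any module.
sub build_re {
    my @choices = @_;
    # Each piece is escaped on its own; the ")|(" glue between pieces
    # is not a valid regexp by itself.
    my $alternatives =
        "(" . join(")|(", map { quotemeta($_) } @choices) . ")";
    return qr/\A(?:$alternatives)\z/;
}

# Hypothetical choices; the Hebrew word is spelled with escapes so the
# source itself stays pure LTR.
my $re = build_re("a.b", "x|y", "\x{5E9}\x{5DC}\x{5D5}\x{5DD}");

for my $s ("a.b", "axb", "x|y", "x") {
    print "$s: ", ($s =~ $re ? "match" : "no match"), "\n";
}
```

An editor looking only at the string literals "(", ")|(" and ")"
has no way to tell that they will end up inside a regexp.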

An RTL-biased quote op could help here (with the question of what an
editor should do). Perl 6 has those for both REs and strings.

>> Indeed, your solution doesn't attempt to handle simple strings, so the
>> user can get confused by inconsistency issues: why $foo = "SHALOM" but
>> $foo = qr/MOLAHS/?
> Actually, the suggestion for source code (programming-language
> dependent - again we assume full syntax awareness) specifically defines
> string literals and comments as tokens, so it would be $foo = "MOLAHS"
> as well.
>>  Also, there's the tokenization problem you already
>> mention: NATBA"G would come out very bad from this transformation.
>>
> In #1 certainly it would not be much different than NATBAG. In #2, since
> " is not a special character in RE, it will be included in the token and
> have the standard Bidi algorithm applied to it - so you'd get G"ABTAN ,
> just as you'd probably expect.

Hmmm, you're right. I can't off-hand give you a counterexample where
the bidi algorithm's idea of tokenization and RE's differ, but I have
a hunch it's difficult to cover all cases.
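
For reference, the #2 transformation as I understand it can be
sketched roughly as follows. This is only a toy under loud
assumptions: uppercase ASCII stands in for RTL letters as in the
examples in this thread, the set of "special" characters is an
illustrative subset rather than a real regexp grammar, and reversing
a whole token is a crude stand-in for the Unicode bidi algorithm
(it is only right for tokens made of RTL letters and neutrals). The
name display_2 is made up.

```perl
use strict;
use warnings;

# Illustrative subset of regexp metacharacters, treated as separators.
my $special = qr/[\\.\[\](){}|?*+^\$]/;

# Display order of a pattern body under proposal #2: split into
# separators and tokens, reverse each token that contains an RTL
# stand-in (uppercase), and concatenate the pieces left to right.
sub display_2 {
    my ($pattern) = @_;
    my $out = "";
    for my $piece (split /($special)/, $pattern) {
        my $is_token = $piece !~ /\A$special\z/;
        if ($is_token && $piece =~ /[A-Z]/) {
            $out .= scalar reverse $piece;
        }
        else {
            $out .= $piece;
        }
    }
    return $out;
}

print display_2($_), "\n" for 'SHALOM', 'YADAII?M', 'NATBA"G';
# prints:
#   MOLAHS
#   IIADAY?M
#   G"ABTAN
```

The second line shows the confusing case from the original mail:
the "?" now looks as if it applies to the Y.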

>> So if these are the two options, I'd vote for #1. (I'd also recommend
>> in source code representing invisible override characters as entities,
>> so that if the actual data has RLM/LRM/etc. marks, where possible it's
>> better not to have those as literals in the source code. That's a
>> recommendation for the end programmer, not the editor, I think.)
>>
> Another one would probably be to use i18n and have all user-facing
> strings in po files, which as a nice byproduct would also reduce
> potential problems with string literals...

Well, sure, but sometimes you can't or don't want to extract
everything to strings. Also, do you have a good solution for
maintaining bidi PO files? It's a similar problem there.
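
Going back to the recommendation about invisible marks quoted above,
this is roughly what I mean by writing them as escapes rather than as
literal characters. The variable names and the sample data are
hypothetical.

```perl
use strict;
use warnings;

# Invisible directional marks written as escapes, so they stay
# visible in the source and cannot affect how the line is displayed.
my $LRM = "\x{200E}";   # U+200E LEFT-TO-RIGHT MARK
my $RLM = "\x{200F}";   # U+200F RIGHT-TO-LEFT MARK

# Hypothetical data: a four-letter Hebrew word followed by an RLM.
my $text = "\x{5E9}\x{5DC}\x{5D5}\x{5DD}" . $RLM;

# The same escapes work inside a regexp: strip any LRM/RLM.
(my $clean = $text) =~ s/[\x{200E}\x{200F}]//g;

print length($text), " -> ", length($clean), "\n";   # 5 -> 4
```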

>> Perhaps one day we can have an alternate surface syntax for strings
>> and regular expressions that is designed to be RTL friendly. Hebrew or
>> Arabic metacharacters, introduced with slashes instead of backslashes.
>> This would be used in conjunction with an editor hint that makes the
>> whole expression RTL (in Unicode terms, sets the paragraph
>> directionality). If used on whole lines of code, it can also change
>> the alignment to be right-justified. Perl 6 has many quotelike
>> operators, with room for extensions like this. The problem with these
>> things is that they're hard to get right, and pesky problems will
>> still appear no matter how hard you try if the data really does
>> contain mixed directionality characters. We can ask Larry for his
>> opinion, though.
>>
> Interesting idea. We tend to think in terms of adjusting bidi display
> algorithms to cope with existing syntax. Adjusting syntax to cope with
> bidi issues would probably be less quirky and easier for editor devs to
> implement.
>
> Thanks for this input,
> Amit

You're welcome. I may be interested in future discussions on this
topic, if you're cooking something up.

>> On Sun, Feb 1, 2009 at 5:08 AM, Amit Aronovitch <aronovitch at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Following a discussion I took part in about standardization of the
>>> display of Hebrew text in structured expressions and source code, I
>>> would be happy to hear some opinions about how we would like regular
>>> expressions containing bidi chars to be displayed (in an "ideal
>>> editor" that is fully syntax aware).
>>>
>>> In the examples below, caps represent RTL characters and lowercase LTR chars.
>>>
>>> The basic principle that was proposed (for structured expressions) is
>>> that text should be split into "separators" and "tokens" according to
>>> the relevant syntax, the general-purpose Bidi rules be applied within
>>> each token only, and then tokens and separators should be concatenated
>>> left to right always.
>>>
>>> Applied to regular expressions, I thought that since in RE each
>>> pattern character is an atom (1), this effectively means forcing
>>> LTR everywhere (except maybe stuff like named captures
>>> (?<NAME>...) etc.).
>>> However, it was suggested instead that any sequence of pattern
>>> characters (not containing "special" characters) should be treated as
>>> a token (2).
>>> This would make simple searches easier to read,
>>>
>>> e.g. /SHALOM/ would be displayed /SHALOM/ by (1),
>>> but /MOLAHS/ (much more readable if it was actual Hebrew) by (2).
>>> On the other hand, /YADAII?M/ in (2) would show as /IIADAY?M/ ,
>>>  which is very confusing, so I thought the simplification was not worth it.
>>>
>>> However, I am used to languages where simple searches are commonly
>>> done by other means, whereas in Perl using RE for simple text search
>>> might be more common because of the specialized syntax. What do you
>>> think?
>>>
>>>    Amit
>>>



-- 
Gaal Yahas <gaal at forum2.org>
http://gaal.livejournal.com/