[Israel.pm] Displaying bidi text in re (e.g. in the editor).

Gaal Yahas gaal at forum2.org
Sat Jan 31 23:13:43 PST 2009


#2 is very near what you get today by looking at source code in an
environment that attempts to run the Unicode BiDi algorithm on your
source, for example a browser. In my experience, this can be very
confusing.

If I'm understanding your proposal correctly, in #2 you still want to
fix the crazy cases of flipped capture parentheses etc., but that
implies your editor must understand regular expressions. At best it
will manage that when something is a regexp literal, and it will fail
when the programmer is constructing a regexp piecemeal from strings.
Indeed, your solution doesn't attempt to handle simple strings, so the
user can get confused by the inconsistency: why is it $foo = "SHALOM"
but $foo = qr/MOLAHS/? Also, there's the tokenization problem you
already mention: NATBA"G would come out very badly from this
transformation.
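
To make the inconsistency concrete, here is a minimal sketch of what
display scheme (2) would do, using the thread's convention that caps
stand for RTL characters. The helper name display_scheme2 is
hypothetical. Reversing each run of caps is only a crude stand-in for
running the real bidi algorithm inside a token, and treating every
non-caps character as a separator is likewise an assumption made for
illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only. Caps stand in for RTL
# characters, as in the examples in this thread. Every run of caps is
# treated as one token and shown reversed; everything else is treated
# as a separator and left where it is.
sub display_scheme2 {
    my ($source) = @_;
    (my $shown = $source) =~ s/([A-Z]+)/scalar reverse $1/ge;
    return $shown;
}

# A regexp body gets the token treatment ...
print display_scheme2('SHALOM'), "\n";     # MOLAHS
print display_scheme2('YADAII?M'), "\n";   # IIADAY?M
# ... and an embedded quote mark splits what the reader sees as one word.
print display_scheme2('NATBA"G'), "\n";    # ABTAN"G

# An editor that only recognizes regexp literals would leave $stem
# alone, even though the string ends up inside a pattern.
my $stem = 'YADAI';
my $re   = qr/${stem}I?M/;
print "match\n" if 'YADAIM' =~ $re;
```

The same Hebrew word would thus be shown one way inside 'YADAI' and
another way inside the qr// literal it is interpolated into.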

So if these are the two options, I'd vote for #1. (I'd also recommend
representing invisible directional characters as entities in source
code: if the actual data has RLM/LRM/etc. marks, it's better, where
possible, not to have those as literals in the source. That's a
recommendation for the end programmer, not the editor, I think.)
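
In Perl, one reading of "entities" is a code point escape such as
\x{200F}, which keeps the mark visible in the source instead of hiding
it in a literal. The sketch below assumes that reading; the variable
names and the strip_bidi_controls helper are made up for the example.

```perl
use strict;
use warnings;

# An escape keeps the invisible character visible in the source.
# U+200F is RIGHT-TO-LEFT MARK; U+200E would be LEFT-TO-RIGHT MARK.
my $RLM = "\x{200F}";

# The mark is part of the data, but nothing invisible sits in this file.
my $label = "SHALOM${RLM}!";

# Hypothetical helper: remove the two marks plus the embedding and
# override controls U+202A..U+202E, e.g. before comparing strings.
sub strip_bidi_controls {
    my ($text) = @_;
    $text =~ s/[\x{200E}\x{200F}\x{202A}-\x{202E}]//g;
    return $text;
}

print length($label), "\n";                # 8
print strip_bidi_controls($label), "\n";   # SHALOM!
```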

Perhaps one day we can have an alternate surface syntax for strings
and regular expressions that is designed to be RTL friendly: Hebrew or
Arabic metacharacters, introduced with slashes instead of backslashes.
This would be used in conjunction with an editor hint that makes the
whole expression RTL (in Unicode terms, sets the paragraph
directionality). If used on whole lines of code, the hint could also
make the alignment right-justified. Perl 6 has many quotelike
operators, with room for extensions like this. The problem with these
things is that they're hard to get right, and pesky problems will
still appear no matter how hard you try if the data really does
contain characters of mixed directionality. We can ask Larry for his
opinion, though.

On Sun, Feb 1, 2009 at 5:08 AM, Amit Aronovitch <aronovitch at gmail.com> wrote:
> Hi,
>
> Following a discussion I took part in about standardization of the
> display of Hebrew text in structured expressions and source code, I
> would be happy to hear some opinions about how we would like regular
> expressions containing bidi chars to be displayed (in an "ideal
> editor" that is fully syntax aware).
>
> In the examples below, caps represent RTL characters and lowercase LTR chars.
>
> The basic principle that was proposed (for structured expressions) is
> that text should be split into "separators" and "tokens" according to
> the relevant syntax, the general-purpose Bidi rules be applied within
> each token only, and then tokens and separators should be concatenated
> left to right always.
>
> Applied to regular expressions, I thought that since in an RE each
> pattern character is an atom (1), this effectively means forcing
> LTR everywhere (except maybe stuff like named captures
> (?<NAME>...) etc.).
> However, it was suggested instead that any sequence of pattern
> characters (not containing "special" characters) should be treated as
> a token (2).
> This would make simple searches easier to read,
>
> e.g. /SHALOM/ would be displayed /SHALOM/ by (1),
> but /MOLAHS/ (much more readable if it was actual Hebrew) by (2).
> On the other hand, /YADAII?M/ in (2) would show as /IIADAY?M/ ,
>  which is very confusing, so I thought the simplification was not worth it.
>
> However, I am used to languages where simple searches are commonly
> done by other means, whereas in Perl using RE for simple text search
> might be more common because of the specialized syntax. What do you
> think?
>
>    Amit



-- 
Gaal Yahas <gaal at forum2.org>
http://gaal.livejournal.com/


