[Israel.pm] my first program

Gaal Yahas gaal at forum2.org
Tue Aug 7 12:19:06 PDT 2012


On Sun, Aug 5, 2012 at 10:18 PM, moshe nahmias <moshegrey at ubuntu.com> wrote:

> First of all, thanks for your comments.
>
> On Sun, Aug 5, 2012 at 10:32 AM, Gaal Yahas <gaal at forum2.org> wrote:
>
>> I wanted to mention one other thing -- how do you know this thing works?
>>
>> When you're developing it, you probably ran it several times. This is
>> fine to do occasionally, but it has a few problems:
>> - it puts load on Gabor's servers
>> - it's slow! you have to fetch a bunch of resources from the network
>> every time and uncompress them.
>> - when you have a bug, you have to figure out what's wrong from looking
>> at the output of the entire program. This can be hard.
>> - it's possible you have a bug in the statistics code that doesn't crash
>> the program, but which causes the wrong output! You'd never find it by
>> looking at the report, unless you had some prior knowledge about the actual
>> stats.
>>
>> Testing is a big subject, and I don't know how familiar you already are
>> with it, so I'll again just give a few general remarks:
>>
>> - The main point I'm making here is that your script is a single flow of
>> code, and if you break it down to functions (sub get_page, sub analyze_page
>> etc.) you can cause only specific parts of your code to run, with input
>> that's under your control, and verify that the output is what you expect.
>>
>> - You might be tempted to comment out pieces of code / add flags like
>> $TEST or whatever -- it's okay here and there during development, but it
>> doesn't scale well down the line. You end up with messy code or things you
>> have to clean up before release, and have to repeat some work every time
>> you want to test something.
>>
>> - Whatever works. Perl has spectacular support for testing, but it means
>> it's easy to get overwhelmed by the options. Start with Test::More or
>> something else that's well-known.
>>
>>
> Right, I really need to start making tests for this and doing it in parts.
> I think this will be for my next version of the script (tests and building
> it in smaller chunks).
>

Sure. One nice thing about testing is that there are gains to be had even
if you only have very simple tests -- any addition can be an improvement,
but you don't have to do it all at once.
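
For example, a very small test file might look something like this. The
count_messages() sub is inlined here just to keep the sketch self-contained;
in your case it would be one of the subs you extract from the script and load
from a module:

use strict;
use warnings;
use Test::More tests => 2;

# A stand-in for a sub you'd extract from your script: count how many
# messages each author posted in a chunk of archive text.
sub count_messages {
    my ($text) = @_;
    my %counts;
    $counts{$1}++ while $text =~ /^From: (\S+)/mg;
    return %counts;
}

# A tiny input that we control -- no network, no gzip, runs instantly.
my $sample = "From: gabor\nhello\nFrom: gabor\nworld\nFrom: moshe\nhi\n";
my %counts = count_messages($sample);

is($counts{gabor}, 2, 'gabor posted twice in the sample');
is($counts{moshe}, 1, 'moshe posted once in the sample');

Run it with "prove" (or just perl) and you get a readable report of which
expectations held and which didn't.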


>
>
>> On Sun, Aug 5, 2012 at 9:51 AM, Gaal Yahas <gaal at forum2.org> wrote:
>>
>>> Since you asked for comments, here are a few general ones.
>>>
>>> - You use autodie, but aren't telling it to capture errors in system.
>>> The most likely part of your program to fail is the network part (what you
>>> currently delegate to wget), so it's probably the one you want the most
>>> control over. Currently you'd fail silently. If you say "use autodie
>>> qw(:all)" you'd die on the first wget failure. What would you like to do?
>>> I'm guessing that ideally, you should retry (or have wget retry). But at
>>> the very least, you want good error messages. To summarize this point:
>>> don't give up control of error reporting in the heart of what you're trying
>>> to accomplish.
>>>
>>>
> OK, so I removed the autodie, but can I use it somehow and still have my
> own warnings? (I want to learn, so a link will be good enough.)
>

Sure. The autodie manual says,

       The "autodie" pragma has lexical scope, meaning that functions and
       subroutines altered with "autodie" will only change their behaviour
       until the end of the enclosing block, file, or "eval".

This means you can do:

use autodie qw(:all);  # at the top of your program

open my $fh, "<", $filename;  # benefits from autodie

{
    no autodie;
    open my $other_fh, "<", $otherfile or die "my own error message: $!";
}

close $fh;  # autodie again, because we've left the scope.


> About wget, I meant the program on Linux distros, not something in Perl
> itself.
> How do I get the errors from a system() call? I use system() for wget and
> gzip, so if there is an error I can see it in the terminal, but if I could
> capture the errors it would help if and when there are problems, and I
> could save them for a time when I can look at them.
>

system's return value is pretty unusual. "perldoc -f system" will give you
the details, but at the very least expect 0 for success and anything else for
failure.
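
For example, a rough sketch of checking it yourself ($url is just a
placeholder here):

my $url = "http://example.org/archive.txt.gz";   # placeholder
my $status = system("wget", "-q", $url);
if ($status != 0) {
    my $exit_code = $? >> 8;   # the high byte of $? is the child's exit status
    die "wget failed with exit code $exit_code\n";
}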

A modern alternative is IPC::Cmd - http://perldoc.perl.org/IPC/Cmd.html -
the example there even uses wget!

Corelist says this is a standard module since Perl 5.9.5, so it's likely
that you can use it.
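
Something along these lines (a sketch only -- see the module's docs for the
full details; $url is again a placeholder):

use IPC::Cmd qw(can_run run);

my $url = "http://example.org/archive.txt.gz";   # placeholder
my $wget = can_run("wget")
    or die "wget is not installed or not in PATH\n";

# run() gives back a success flag, an error message, and the buffered
# output, so you can log failures instead of just watching them scroll by
# in the terminal.
my ($ok, $error, $full_buf, $stdout_buf, $stderr_buf) =
    run(command => [$wget, "-q", $url], verbose => 0);

die "wget failed: $error\n" unless $ok;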


>>> - You're using use strict, which is very good, but notice that the way
>>> your file handles are named, they are unchecked globals. Avoid
>>> all-uppercase, sigilless names (FILE and so on) if you want protection;
>>> use lexicals instead:
>>>
>>>   open my $fh, "<", "myfile" or die ...
>>>
>>> (BTW, you have or die here and there, despite autodie. I'd recommend for
>>> now just getting rid of autodie entirely and doing all the errors yourself.)
>>>
>>> When you use lexical filehandles, the files are closed automatically
>>> when the variable goes out of scope. This is great for reads. For writes,
>>> you should always close the file yourself to check for errors. There are
>>> fancy ways of doing that automatically but never mind them for now.
>>>
> I understand from your comment that I can check for errors when closing
> a file. How? (Again, if you prefer giving a link, that's OK too.)
>

If you are using autodie, a plain "close $fh" will work. Otherwise:

close $fh or die "close: $!";


>
>
>>> - It's not unreasonable to use external tools in your script. Whatever
>>> gets the job done. Of course there are tradeoffs: you comment that you
>>> don't know if these tools are available on other platforms. Well, modules
>>> for gzip and http may not be available on all platforms either -- take a
>>> look at corelist if you want to make educated guesses about what to expect.
>>>
> I didn't understand what you mean. I used the Linux programs wget and gzip
> in this script, so I don't know if those programs are available for
> Windows, for example, and thus whether it will work on Windows.
>

This stuff is up to you -- you can't support every system in the world, and
if you don't know yet where your script will be running your safest bet is
to stick to modules that come bundled with Perl itself (the corelist).
Today, that's quite a lot of stuff. For example, HTTP::Tiny can fetch
things from the network. And you can read gzipped files from Perl with
IO::Uncompress::Gunzip.
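
A rough sketch of how those two fit together (the URL is a placeholder, and
I'm assuming the archives are plain gzipped text like pipermail's):

use HTTP::Tiny;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my $url = "http://example.org/2012-August.txt.gz";   # placeholder

my $response = HTTP::Tiny->new->get($url);
die "fetch failed: $response->{status} $response->{reason}\n"
    unless $response->{success};

# uncompress the downloaded bytes in memory, no temporary files needed
my $compressed = $response->{content};
gunzip(\$compressed => \my $archive_text)
    or die "gunzip failed: $GunzipError\n";

# $archive_text now holds the whole month's archive as one string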


>
>
>>> - Syntactically, this:
>>>
>>>   if (CONDITION 1) {
>>>     CODE 1
>>>   } if (CONDITION 2) {
>>>     CODE 2
>>>   } if (CONDITION 3) {
>>>     CODE 3
>>>   } ...
>>>
>>>   checks all the conditions and may run all the branches. I think you
>>> meant "elsif" there.
>>>
> I use it this way because it doesn't get the right numbers and makes
> mistakes if I use elsif (at least for some of the years, I don't remember why).
>

The only way this can happen is if, for some inputs, more than one condition
is true. What happens then is that every true branch runs and the last one
wins.
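
My guess at what bit you: once you switch to elsif, only the *first* true
branch runs, so you have to order the conditions from most specific to least
specific. A tiny illustration:

my $year = 2009;
my $bucket;

# separate ifs: every true branch runs, and the last one overwrites
if ($year >= 2000) { $bucket = "2000 or later" }   # runs
if ($year >= 2005) { $bucket = "2005 or later" }   # also runs; this one "wins"

# elsif in the same order: only the first true branch runs
if    ($year >= 2000) { $bucket = "2000 or later" }
elsif ($year >= 2005) { $bucket = "2005 or later" }   # never reached
# now $bucket is "2000 or later" -- probably not what you meant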


>
>
>>> - Semantically, what you're doing is extracting a bunch of values from a
>>> hash into a number of scalars. Why? This has a few costs:
>>>
>>> 1. You have to pick names for these. If you ever scratch your head about
>>> the name of a variable, and end up with something mundane like "first",
>>> something's wrong. (I mean, first is perfectly all right if its scope is
>>> very small, and within that block of code it's clear what it's intended to
>>> refer to. But here the scope of these variables is pretty much the whole
>>> program, and the reader is confounded with arbitrary distinctions like
>>> "first" and "firs".
>>>
> I have refactored the names; it's better, not great.
> The reason for the names is that $first is the one with the most mails in
> the year, so maybe the name isn't great, but I don't know if there is a
> much better choice. I am open to suggestions.
>

You have a collection of people, who you are putting in some order. To me
this suggests an array, where the first person is the most prolific author,
the second is the second most prolific, and so on. The decision to cut off
after 10 authors is artificial, so ideally it should be a parameter and
there should be exactly one place in the code that you have to change in
order to instead see the top 3 or 20 or whatever.

So how about @authors, or @authors_by_volume? Then $authors[0] is your
$first, $authors[1] is your $second, $authors[-1] is your $tenth and so on.

Oh wait, this is a collection of their scores, not their names. Your @sort
is *all* the authors by volume. (Bad name, probably.) So you need a
parallel array with their scores. Here are two ways to do it:

# %names is also a bad name. It stores name => score elements, so
%score_by_name, or just %scores if you can get rid of @scores. More on this
below.

my $cutoff = 10;
my @scores;
die "gaal was too lazy to handle the case where there are few authors"
  if @sorted < $cutoff;
for my $i (0 .. $cutoff - 1) {
  push @scores, $names{ $sorted[$i] };
}
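
By the way, the @sorted I'm using in these snippets can be built from %names
in one line (assuming, as above, that %names maps each author's name to
their message count):

my @sorted = sort { $names{$b} <=> $names{$a} } keys %names;
# $sorted[0] is the most prolific author, $sorted[1] the next, and so on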

Another way is to use slices. If I have a collection and want to get a
subset of it, I can use multiple keys at once to get multiple values.

@scores_for_moshe_and_gaal = @names{"moshe", "gaal"};
# or with fancier syntax:
@scores_for_moshe_and_gaal = @names{qw/moshe gaal/};

So if we had @top_10_names = qw/gabor moshe .../;, we could do

@top_10_scores = @names{ @top_10_names };

But I said I don't like hardcoding the value "10" anywhere, it's arbitrary.
So you can slice both the name-to-score hash, and the top name array!

@top_scores = @names{ @sorted[0 .. $cutoff - 1] };

You're still doing ten ($cutoff) hash lookups behind the scenes, and in
fact are creating a temporary array of size 10 for the smaller sorted
authors collection. But 10 is better than 100, and 1 line of code is nicer
than 20.

Yet another way is to use map:

@top_scores = map $names{$_}, @sorted[0 .. $cutoff - 1];


>>> 2. It's inefficient in code. You have 20 lines of code that aren't
>>> furthering your goal as a programmer here. Couldn't the user of this data
>>> just as well say $names{$some_name}?
>>>
> The point of these 20 lines is to get the numbers of the most active
> participants. I guess there are other ways (and maybe better ways), but I
> don't know them yet.
>
>
>>> 3. It's inefficient in performance. Your for loop runs as many times as
>>> there are elements in %names, and performs ten conditions. So you have 100
>>> branches in your code where you didn't need any. In this case, it doesn't
>>> matter; but you have to keep these things in mind for when you have larger
>>> data structures.
>>>
> So, what you're saying is that I should find a way to make it run only on
> the "right" parts of my variable? The big question is how I can do that.
>

You aren't ever printing the top scores, just summing them for statistics.
So really even a collection of the top scores is more than you need. So you
can do:

my $top_posters_contribution;
for my $i (0 .. $cutoff - 1) {
  $top_posters_contribution += $names{ $sorted[$i] };
}

Same thing with fancier syntax:

use List::Util qw(sum);
my $top_posters_contribution = sum map $names{$_}, @sorted[0 .. $cutoff - 1];

You should certainly only use the styles that you like. There's more than
one way to do it! The shortest one isn't necessarily the best. But try to
avoid wasteful ways.


>
>>> - Again on naming, "percent" vs. "per" -- the reader can't tell what
>>> these mangled names are for.
>>>
>>> - I see you're using "say", which is a new language feature. If you're
>>> doing that, you may as well know about and use given/when instead of
>>> if/else, though like I said, it's not actually required here.
>>>
>>>
>>> Keep having fun,
>>> And I hope this is useful,
>>> Gaal
>>>
> Thanks again for your comments; they're very helpful and I think I've learnt
> some things already :)
>
>
>>> On Sun, Aug 5, 2012 at 3:39 AM, moshe nahmias <moshegrey at ubuntu.com> wrote:
>>>
>>>> Hi all,
>>>> I just uploaded my first little program to GitHub. I want your feedback
>>>> on the script.
>>>> The script is intended to generate statistics for mailing lists from
>>>> their archives; for now it only works on Linux (or at least where wget
>>>> and gzip are installed).
>>>>
>>>> So, what do you think?
>>>> I know it's not perfect, so feel free to teach me; after all, that's one
>>>> of the reasons to publish this :)
>>>> The link: https://github.com/moshe742/mail-stats.git
>>>>
>>>> Moshe
>>>>
>>>
>>>
>>>
>>> --
>>> Gaal Yahas <gaal at forum2.org>
>>> http://gaal.livejournal.com/
>>>
>>
>>
>>
>> --
>> Gaal Yahas <gaal at forum2.org>
>> http://gaal.livejournal.com/
>>
>
>



-- 
Gaal Yahas <gaal at forum2.org>
http://gaal.livejournal.com/

