Atos

From Physics for Students of Practical Mathematics 2007 - 2008

Revision as of 12:45, 8 July 2008 by 193.2.4.4

Queuing system

As an example, let us write a program that calculates the product and the quotient of two numbers. We first create a directory test where the program will be located:

rejec@atos:~> mkdir test
rejec@atos:~> cd test
rejec@atos:~/test> 

Let's call the program programcek. The Perl code is:

rejec@atos:~/test> cat programcek
#!/usr/bin/perl
my ($a, $b) = @ARGV;
print "a = $a\n";
print "b = $b\n";
print "a * b = ", $a * $b, "\n"; 
print "a / b = ", $a / $b, "\n";

An example output:

rejec@atos:~/test> programcek 2 0
a = 2
b = 0
a * b = 0
Illegal division by zero at ./programcek line 6.
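
The run above mixes two kinds of output on the terminal: the first three lines are ordinary results, while the division-by-zero message is an error report. As a minimal sketch of how the two can be told apart, the following shell snippet sends each stream to its own file. The function demo_program is a hypothetical stand-in for programcek (so the sketch runs without Perl), and the file names out.txt and err.txt are only illustrative.

```shell
# Hypothetical stand-in for programcek: results go to standard output,
# the error message goes to standard error. Shell arithmetic is integer-only,
# unlike Perl's, which is good enough for this illustration.
demo_program() {
    echo "a = $1"
    echo "b = $2"
    echo "a * b = $(( $1 * $2 ))"
    if [ "$2" -eq 0 ]; then
        echo "Illegal division by zero" >&2
        return 1
    fi
    echo "a / b = $(( $1 / $2 ))"
}

# Work in a scratch directory so nothing in ~/test is overwritten.
tmp=$(mktemp -d)

# ">" redirects standard output, "2>" redirects standard error.
status=0
demo_program 2 0 > "$tmp/out.txt" 2> "$tmp/err.txt" || status=$?

echo "exit status: $status"
echo "--- standard output ---"
cat "$tmp/out.txt"
echo "--- standard error ---"
cat "$tmp/err.txt"
```

With the real program, the same redirection would read ./programcek 2 0 > out.txt 2> err.txt.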

Now we'll run the program using the queuing system. First we need to write a shell script. Let's call it programcek.sh.

rejec@atos:~/test> cat programcek.sh
#!/bin/bash
cd test
programcek 2 0

The shell script first changes to the directory where the program is located (cd test). The path is relative, so this assumes that the job starts in your home directory. The script then runs the program (programcek 2 0).
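
A hard-coded cd test only works if the job starts in the directory that contains test. As a hedged sketch, assuming the queuing system on atos is a PBS/Torque-style batch system (an assumption; this page does not name it), the script could instead rely on the PBS_O_WORKDIR environment variable, which such systems set to the directory from which the job was submitted. The helper names job_dir and run_job are illustrative and not part of any tool.

```shell
# Directory the job should run in: PBS_O_WORKDIR if the scheduler set it
# (assumption: PBS/Torque-style system), otherwise the tutorial's ~/test.
job_dir() {
    printf '%s\n' "${PBS_O_WORKDIR:-$HOME/test}"
}

# Change to that directory and run the program with the given arguments.
# The "./" prefix makes the call independent of whether "." is in PATH.
run_job() {
    cd "$(job_dir)" || return 1
    ./programcek "$@"
}

# In a real job script the last line would be:
#   run_job 2 0
```

The fallback keeps the script usable outside the queue, for example when testing it by hand from any directory.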

Now we are ready to put the program in the queue. We can do this with the qsub command:

rejec@atos:~/test> qsub programcek.sh
1849.atos.ijs.si

Note that the id number 1849 was assigned to the job. From this point on, you can check the status of the job, stop it, and access its results only through this id number, so don't forget it.
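
Since everything afterwards depends on the id, it can help to store it at submission time instead of memorizing it. The sketch below assumes, as in the transcript above, that qsub prints an identifier of the form number.server on standard output. The function name job_number and the file name jobid.txt are hypothetical.

```shell
# Strip the server suffix from a full job identifier:
# "1849.atos.ijs.si" becomes "1849".
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Example with the identifier from the transcript.
job_number 1849.atos.ijs.si   # prints: 1849

# On the cluster the id could be captured when submitting (not run here):
#   full_id=$(qsub programcek.sh)
#   job_number "$full_id" > jobid.txt
```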

Let's check what is going on in the queue with the showq command:

rejec@atos:~/test> showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

1732                 vilfan    Running     5  3:20:30:14  Tue Apr  1 10:32:35
1646                 vilfan    Running    12    INFINITY  Thu Mar 27 17:09:44
1716                 vilfan    Running     6    INFINITY  Mon Mar 31 17:52:49
1717                 vilfan    Running     6    INFINITY  Mon Mar 31 18:01:39
1739                 vilfan    Running    12    INFINITY  Tue Apr  1 13:36:11
1740                 vilfan    Running    12    INFINITY  Tue Apr  1 14:14:54
1826                 vilfan    Running     6    INFINITY  Wed Apr  2 09:37:38
1827                 vilfan    Running     4    INFINITY  Wed Apr  2 13:32:32
1835                  rejec    Running     4    INFINITY  Wed Apr  2 18:00:22
1836                  rejec    Running    10    INFINITY  Wed Apr  2 18:00:35
1838                  rejec    Running     1    INFINITY  Wed Apr  2 18:00:51
1839                  rejec    Running    10    INFINITY  Wed Apr  2 18:00:57
1841                  rejec    Running     4    INFINITY  Wed Apr  2 18:01:33
1845                  rejec    Running     1    INFINITY  Wed Apr  2 18:01:54
1846                  rejec    Running     1    INFINITY  Wed Apr  2 18:02:03
1847                  rejec    Running     1    INFINITY  Wed Apr  2 18:02:14
1848                  rejec    Running     1    INFINITY  Wed Apr  2 18:02:17

    17 Active Jobs      96 of   96 Processors Active (100.00%)
                        24 of   24 Nodes Active      (100.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

1849                  rejec       Idle     1    INFINITY  Wed Apr  2 18:02:20

1 Idle Job

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 18   Active Jobs: 17   Idle Jobs: 1   Blocked Jobs: 0

We note that our job is listed under IDLE JOBS. This means that all the processors are currently occupied and our job has to wait until one of them is free before it can run. Let's check again a little later:

rejec@atos:~/test> showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

1732                 vilfan    Running     5  3:20:18:17  Tue Apr  1 10:32:35
1646                 vilfan    Running    12    INFINITY  Thu Mar 27 17:09:44
1716                 vilfan    Running     6    INFINITY  Mon Mar 31 17:52:49
1717                 vilfan    Running     6    INFINITY  Mon Mar 31 18:01:39
1739                 vilfan    Running    12    INFINITY  Tue Apr  1 13:36:11
1740                 vilfan    Running    12    INFINITY  Tue Apr  1 14:14:54
1826                 vilfan    Running     6    INFINITY  Wed Apr  2 09:37:38
1827                 vilfan    Running     4    INFINITY  Wed Apr  2 13:32:32
1849                  rejec    Running     1    INFINITY  Wed Apr  2 18:14:18

     9 Active Jobs      64 of   96 Processors Active (66.67%)
                        23 of   24 Nodes Active      (95.83%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 9   Active Jobs: 9   Idle Jobs: 0   Blocked Jobs: 0
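
Reading a whole listing to find one job becomes tedious on a busy cluster. As a minimal sketch, the function below picks the STATE column for a given job id out of showq-style text. It assumes the column layout shown in the listings above (id first, state third, start or queue time last). The name job_state is a hypothetical helper, and the sample text is copied from the first listing rather than produced by a live queue.

```shell
# Print the STATE column for job id $1, reading showq-style text on stdin.
# Job lines are recognized by a trailing HH:MM:SS time, which keeps summary
# lines such as "17 Active Jobs ..." from matching a numeric id.
job_state() {
    awk -v id="$1" '$1 == id && $NF ~ /^[0-9]+:[0-9]+:[0-9]+$/ { print $3 }'
}

# Sample lines copied from the first listing.
sample='1848                  rejec    Running     1    INFINITY  Wed Apr  2 18:02:17
    17 Active Jobs      96 of   96 Processors Active (100.00%)
1849                  rejec       Idle     1    INFINITY  Wed Apr  2 18:02:20'

printf '%s\n' "$sample" | job_state 1849   # prints: Idle

# On the cluster (not run here):
#   showq | job_state 1849
```

An empty result means the id does not appear in the listing, for example because the job has already finished.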

Now the program is running: it is listed under ACTIVE JOBS. If in the meantime you find a bug in your program, you should stop it to make the resources available to other users. You can do this with the qdel command:

rejec@atos:~/test> qdel 1849

Let's assume everything was OK with your program and you waited until it finished running. The results of the calculation are available in the directory from which the job was submitted:

rejec@atos:~/test> ls
programcek  programcek.sh  programcek.sh.e1849  programcek.sh.o1849

rejec@atos:~/test> cat programcek.sh.o1849
a = 2
b = 0
a * b = 0

rejec@atos:~/test> cat programcek.sh.e1849
Illegal division by zero at ./programcek line 6.

The output and the errors of the job were saved in the programcek.sh.o1849 and programcek.sh.e1849 files, respectively.
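
The file names follow a visible pattern: the script name, then .o or .e, then the job number. Assuming that this pattern holds for other jobs as well (it is inferred from this single example), the names can be built from the script name and the id. The function names below are hypothetical.

```shell
# Build the names of the output and error files for a job, following the
# pattern seen above: <script>.o<number> and <script>.e<number>.
output_file() {
    printf '%s.o%s\n' "$1" "$2"
}

error_file() {
    printf '%s.e%s\n' "$1" "$2"
}

output_file programcek.sh 1849   # prints: programcek.sh.o1849
error_file  programcek.sh 1849   # prints: programcek.sh.e1849

# On the cluster the results could then be shown with (not run here):
#   cat "$(output_file programcek.sh 1849)"
```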

You can find additional information in the user's manual.