Wednesday, June 15, 2016

Domino Data Lab python/bash hacks

In a prior post I covered the data science platform provided by Domino Data Lab. I'm still a fan, and here are a few minor gimmicks I've found useful when working with the product, starting with the trivial. You may prefer to cut and paste from domino_data_hacks on github.

(A run utilizing 8320 processors on Amazon. See below for a bash script to do this.)

Starting with the basics... well, I did say I would start with the trivial. For readability:

import os

def domino_run_id():
    """Return the current run id, or None if we are not running on Domino."""
    try:
        return os.environ['DOMINO_RUN_ID']
    except KeyError:
        return None

def running_on_domino():
    return domino_run_id() is not None

def running_on_local():
    return not running_on_domino()
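These helpers come in handy wherever behavior should differ between environments; for instance (a sketch, with made-up paths), picking a data root:

if running_on_domino():
    data_root = '/mnt/{0}/my_project/data'.format(os.environ['DOMINO_PROJECT_OWNER'])
else:
    data_root = '/Users/projects/my_project/data'   # local, hardwired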

Domino environment variables

While we are at it:

current_project       = os.environ['DOMINO_PROJECT_NAME']
current_project_owner = os.environ['DOMINO_PROJECT_OWNER']
current_run_id        = os.environ['DOMINO_RUN_ID']
current_run_number    = os.environ['DOMINO_RUN_NUMBER']
domino_project_path   = '{0}/{1}'.format(current_project_owner, current_project)

Shelling out

Due to vagaries I don't fully understand, but which are undoubtedly related to security, permissions on files might not be what you expect. To work around this you might go so far as to set the permissions right before you need them:

    subprocess.call(["chmod", "+x", "stockfish.sh"])
    subprocess.call(["chmod", "+x", "stockfish_"+binary])
    cmd =  ' '.join( ['./stockfish.sh' ,fen, str(seconds) , binary, str(threads), str(memory) ] )
    print cmd
    subprocess.call( cmd, shell=True )
Now unfortunately, if your code is not in the same project as your data (keeping them separate is something I recommend, see below), this won't work, at least when run from the data project. One workaround is to make a temporary copy of the script you are shelling out to, just before you shell out, and call that copy instead.
interface_dir="${project}/lib/bash/interface"
interface_mirror="${project}/data_project/bash_mirror/interface"
mkdir -p "${interface_mirror}"
cp -R "${interface_dir}"/* "${interface_mirror}"
Yeah, not the most elegant, but it works. There is probably a better way. One of the Domino engineers suggested adding sh in front of commands (see this note), but whatever you try, be aware that in Domino, file permissions set on a run in one project will not be preserved when you import that project into another.
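For what it's worth, here is roughly what the sh-prefix suggestion looks like from Python; the argument values below are made up for illustration:

import subprocess

# Stand-ins for fen, seconds, binary, threads and memory from the fragment above
args = ['./stockfish.sh', 'some_fen_string', '10', 'linux64', '4', '1024']

# Running the script through sh means it does not need the execute bit,
# so the chmod workaround above becomes unnecessary.
subprocess.call(['sh'] + args)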

Setting paths in bash

For the times when you don't want to rely on Domino environment variables explicitly:

USER=$(whoami)
if [[ ${OSTYPE} == "darwin"* ]]   # Your machine, not AWS, hopefully :)
then
    # We're local
    default_project="/Users/projects/my_project"   # Hardwired
    default_size=small
else
    # We're on AWS
    default_project="/mnt/${USER}/my_project"
    default_size=full
fi

# allow override
project=${1:-${default_project}}
sz=${2:-${default_size}}

# Then do something...       
Consistent path names

Incidentally, a project importing other projects can use full paths such as

  /mnt/${USER}/my_project/etc

but a project that does not import any other projects cannot. To avoid inconsistency, just make sure every project imports one other project, even if it is a dummy project.

Drop-in multiple endpoint functions

Domino allows only one endpoint function per project. To let the client call, by name, any function you care to drop into your endpoint file, include these few lines of code at the top of that file and register dispatcher as the official endpoint function.

import sys

def dispatcher( func, *args ):
    """ As Domino allows one endpoint per project """
    module = sys.modules[ __name__ ]
    endpoint_func = getattr( module, func )
    return endpoint_func( *args )
The price you pay is specifying the function you really want as the first parameter in any call. This is not recommended for national security applications.
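For instance, with dispatcher registered as the endpoint, the client just names the function it actually wants as the first argument (the two helper functions below are made up for illustration):

def add(x, y):
    return x + y

def shout(msg):
    return msg.upper()

# The client's call, routed through the single registered endpoint:
print dispatcher('add', 1, 2)         # 3
print dispatcher('shout', 'hello')    # HELLO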

Safe sync

Very, very occasionally a failed sync can leave client and server in a state where it is inconvenient or confusing to revert to a previous code version. If you're a nervous Nellie like me there is a simple way to reduce the chance of any code confusion:

  1. Copy your source code to a second domino project
  2. Sync both with two separate domino syncs
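A rough sketch of those two steps in Python (the real save_sync.sh on github does more; the paths and project names here are made up, and I assume the domino CLI is on your PATH):

import os
import shutil
import subprocess

# Hypothetical local checkouts of the primary project and its backup twin
primary = '/Users/projects/my_project'
backup  = '/Users/projects/my_project_backup'

# 1. Mirror the source code into the backup project ...
for name in os.listdir(primary):
    if name.endswith('.py') or name.endswith('.sh'):
        shutil.copy(os.path.join(primary, name), backup)

# 2. ... then sync each project separately with the Domino CLI
for project_dir in (primary, backup):
    subprocess.call(['domino', 'sync'], cwd=project_dir)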
The odds are long against both syncs failing at all, much less in a confusing way. Obviously this needs to be scripted (and turned into a one-click as below) or you'll never actually do it. I've provided a little script save_sync.sh that you might wish to modify to your taste. It does even more, saving as many previous versions as you care for. It relies on a slightly hacky checkpoint.sh script which, again, might be swapped out for a "better" approach as you see fit, such as integration with your source control. Again, not really an issue unless you commingle code with huge piles of data.

One click sync

Moving on to profound observations that don't really have anything to do with Domino per se, I'll point out that you'll probably do the backup/sync operation so frequently that you'll end up trying to hide a terminal window in the corner of your screen. On a mac we can do a little better, creating a one-click app launched from the dock.
  1. Chmod your backup/sync script so it is executable
  2. Rename it with .app extension, so that you can then drag it into the mac dock
  3. Rename back to original .sh extension
  4. In Finder, right click the file, choose "Get Info", and under the "Open with" menu select the Terminal application
Unfortunately you'll have to repeat the last step every time you edit this file as your IDE will likely revert it - but that shouldn't be necessary too often.

Recovery

To use the server's state use domino reset. To use the local state use domino restore. More on recovery of larger projects below.

Bash script to run job on Domino and wait for completion

The script call.sh uses the Domino API to start a run and then wait for it to complete.

#!/usr/bin/env bash
#
# Run a job on Domino and wait for it to finish
#
# Usage
#        call.sh <command> <arg1> <arg2>
#        call.sh /mnt/USER/MYPROJECT/myscript.sh my_arg1 my_arg2

cmd=${1}
arg1=${2}
arg2=${3}

# Send request to start job to domino
temporary_response_file="response_.txt"
curl -X POST \
https://api.dominodatalab.com/v1/projects/USER/PROJECT/runs \
-H 'X-Domino-Api-Key: YOUR_API_KEY' \
-H "Content-Type: application/json" \
-d '{"command": ["'"${cmd}"'", "'"${arg1}"'", "'"${arg2}"'"], "isDirect": false}' > ${temporary_response_file}
echo "Sent command to start jobs with arg1 ${arg1} and arg2 ${arg2}."
runId_quoted=$(grep -oE '"runId":"(.*)",' ${temporary_response_file} | cut -d: -f2)
runId=${runId_quoted:1:${#runId_quoted}-3}
echo "The runId is ${runId}"
rm ${temporary_response_file}

# Now poll until done
while true;
do
    sleep 60s
    echo "Polling ..."
    response=$(curl https://api.dominodatalab.com/v1/projects/USER/PROJECT/runs/$runId \
    -H 'X-Domino-Api-Key: YOUR_API_KEY' | grep -oE '"isCompleted":.*')
    if [[ ${response} == *true* ]]
    then
        echo "Job $runId has finished"
        break
    fi
done
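If you would rather drive this from Python, the same two API calls work with the requests library. This is just a sketch along the lines of the bash above; the user, project, key and script path are placeholders:

import time
import requests

api     = 'https://api.dominodatalab.com/v1/projects/YOUR_USER/YOUR_PROJECT/runs'
headers = {'X-Domino-Api-Key': 'YOUR_API_KEY', 'Content-Type': 'application/json'}

# Send request to start job to domino (same body as the curl call above)
payload = {'command': ['/mnt/YOUR_USER/YOUR_PROJECT/myscript.sh', 'my_arg1', 'my_arg2'],
           'isDirect': False}
run_id = requests.post(api, headers=headers, json=payload).json()['runId']
print 'The runId is {0}'.format(run_id)

# Now poll until done
while True:
    time.sleep(60)
    status = requests.get('{0}/{1}'.format(api, run_id), headers=headers).json()
    if status.get('isCompleted'):
        print 'Job {0} has finished'.format(run_id)
        break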

Aside: passing bash variables in curl requests with "'"${}"'"

Incidentally, am I the only one who found this slightly troublesome? Here's ten minutes of my life I'm donating to you:

key=$1
curl -X POST \
https://api.dominodatalab.com/v1/projects/YOUR_USER/YOUR_PROJECT/runs \
-H 'X-Domino-Api-Key: YOUR_API_KEY' \
-H "Content-Type: application/json" \
-d '{"command": ["something.sh", "'"${key}"'"], "isDirect": false}'
Note the Double-Single-Double quoting of the bash variable.
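Another escape hatch, if the quoting gets out of hand, is to build the JSON body in Python and let json.dumps worry about the quotes. A tiny sketch, with placeholder user, project and key:

import json
import subprocess
import sys

key  = sys.argv[1]
body = json.dumps({'command': ['something.sh', key], 'isDirect': False})

# Hand the fully-formed body to curl; no nested shell quoting required
subprocess.call(['curl', '-X', 'POST',
                 'https://api.dominodatalab.com/v1/projects/YOUR_USER/YOUR_PROJECT/runs',
                 '-H', 'X-Domino-Api-Key: YOUR_API_KEY',
                 '-H', 'Content-Type: application/json',
                 '-d', body])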

Lazy man's map

Want to use more than one machine at once? I discovered that the easiest way is to have each job write to the main branch. Domino sync will take care of the syncing that way, and you need only filter each run by some key (say {A..Z}).

#!/usr/bin/env bash

# Start multiple jobs on Amazon

USER=$(whoami)
project="/Users/${USER}/project"
sz=${1:-full}      # Just an example of a parameter passed to all jobs 

# Prerequisite is a file with one line containing space-separated keys which break up the jobs - see the bash hack below

ordering_file="${project}/config/letter_ordering.txt"    
read -a kys <<<$(head -n 1 ${ordering_file})

for ky in ${kys[@]}    # or just {A..Z} if you don't care about ordering
do
    sleep 10s    # Give AWS a little time so jobs get ordered the way you expect
    curl -X POST \
    https://api.dominodatalab.com/v1/projects/YOUR_USER/YOUR_PROJECT/runs \
    -H 'X-Domino-Api-Key: YOUR_API_KEY' \
    -H "Content-Type: application/json" \
    -d '{"command": ["YOUR_COMMAND.sh", "/mnt/YOUR_USER", "'"${sz}"'", "'"${ky}"'"], "isDirect": false}'
    echo "Set command to start jobs with source_filter ${ky}"
done

Ordering your jobs so large input files go first

The bash script ordering.sh provides an easy way to divide up your input data by letter ({A..Z}, say) and order the letters by size. Hack as you see fit.

#!/usr/bin/env bash

# Create a file which sorts data sizes by letter, so we can order the parallel jobs sensibly

data_dir="/whereyouputbigdatainputfiles"
config_dir="/Users/YOUR_USER/project/my_project/config"
tmp_file="data_density.txt"
ordering_file="letter_ordering.txt"

# Delete the old statistics file if it exists
if [[ -e ${config_dir}/${tmp_file} ]]
then
   rm ${config_dir}/${tmp_file}
fi

# Create a file with one line per letter:
#      Size    key
#      1223411 A
#      1231233 B
#
for x in {A..Z}
do
   data_sz=$(du -c ${data_dir}/${x}* | awk '/./{line=$0} END{print $1}')
   echo "${data_sz} ${x}" >> ${config_dir}/${tmp_file}
done

# Sort, extract the letters, and convert to a single row
sort -r -t " " -k 1 -g ${config_dir}/${tmp_file} | awk '{print $2}' | tr '\n' ' ' > ${config_dir}/${ordering_file}

rm ${config_dir}/${tmp_file}

# Now you can easily read the ordered keys into a bash array:
#        read -a ordering <<<$(head -n 1 ${config_dir}/${ordering_file})
Adjust as you see fit. I use this in conjunction with the previous hack to reduce wall clock time of big jobs.

Lazy man's map-reduce

You'll often want to wait for the results to come back so you can get on with dependent tasks. I've posted a polling version of the task launcher as well. To poll directly from bash you can do something like this:

finished_runs=""
while true;
do
    sleep 1m
    status="finished"
    for runId in ${runIds[@]}
    do
        if [[ ${finished_runs} == *"$runId"* ]]
        then
            : # No need to check again after it has finished
        else
            response=$(curl https://api.dominodatalab.com/v1/projects/YOUR_USER/YOUR_PROJECT/runs/$runId \
            -H 'X-Domino-Api-Key: YOUR_API_KEY' | grep -oE '"isCompleted":.*')
            if [[ ${response} == *false* ]]
            then
                status="running"
            elif [[ ${response} == *true* ]]
            then
                echo "Job $runId has finished"
                finished_runs="$finished_runs $runId"
            else
                echo "We have a problem - did not expect this"
            fi
        fi
    done

    if [[ ${status} == "finished" ]]
    then
        echo "All finished"
        break
    fi
done
# Now do the reduce step because you have all the data ready
Thanks to Jonathan Schaller for helping me fix this.

Pull runId out of JSON response

As a minor digression ... there are better general purpose JSON parsers like jq, but this is good enough. Send the curl response to ${response_file} and then:

    runId_quoted=$(grep -oE '"runId":"(.*)",' ${response_file} | cut -d: -f2)
    runId=${runId_quoted:1:${#runId_quoted}-3}
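And if Python is already in the loop, the grep/cut gymnastics go away entirely; a sketch assuming the curl response was written to response_.txt as above:

import json

with open('response_.txt') as f:   # the file the curl response was sent to
    run_id = json.load(f)['runId']
print 'The runId is {0}'.format(run_id)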

Grant one project write access to another

Project A imports project B's files, so A can easily read B's files. However, A cannot write to B. The way I hack around this is to share A's files with B, and then in project B create a script that copies A's files over.

key=$1
USER=$(whoami)
echo "Chilling for a minute so Project A can finish syncing"
sleep 1m
cp -r /mnt/${USER}/Project_A/data/results_${key}* /mnt/${USER}/Project_B/data && echo "Success"
ls -l /mnt/${USER}/Project_B/data/results_${key}*
Let's suppose the above sits in Project B with the name do_the_copy.sh. In order to drive this from Project A I write a little "beam me up Scotty" script:
#!/usr/bin/env bash
key=$1
curl -X POST \
https://api.dominodatalab.com/v1/projects/$USER/mapreduce/runs \
-H 'X-Domino-Api-Key: YOUR_API_KEY' \
-H "Content-Type: application/json" \
-d '{"command": ["do_the_copy.sh", "'"${key}"'"], "isDirect": false}'
This uses the Domino API to kick-start the process. It is a kludge and only really works at the end of a run, because Project A's files (assuming they have changed) can't be seen by Project B until after the sync occurs. A more sophisticated approach (which also solves other issues) is to wrap the Domino task, for example in a Luigi task, as in this example by, you guessed it, Jonathan Schaller. As noted above, you don't necessarily need an explicit collection step if you write results to the main branch.

Domino API - job status

Checking on a job...

from domino import Domino   # the python-domino client package

domino_api            = Domino(domino_project_path)

def get_run_info(run_id):
    for run_info in domino_api.runs_list()['data']:
        if run_info['id'] == run_id:
            return run_info
Thanks Jonathan.
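A typical use is to look up the run you are currently in, combining this with the environment variables from earlier; the fields beyond 'id' are whatever the API happens to return:

info = get_run_info(current_run_id)   # current_run_id was read from DOMINO_RUN_ID above
if info is not None:
    print info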

Domino API - data status

def get_blob_key(commit_id, filepath):
    dir, filename = os.path.split(filepath)
    files = domino_api.files_list(commit_id, path=dir)['data']
    for file in files:
        if file['path'] == filepath:
            return file['key']
 
def file_exists_in_commit(filepath, commit_id):
    # filepath is relative to project root
    files_list = domino_api.files_list(commit_id, path='/')['data']
    for file in files_list:
        if filepath == file['path']: return True
    return False
Thanks again.

Rich man's map-reduce

As noted above, another way to handle large pipelines is to mix a Domino task into some pipeline framework, such as Luigi. See Jonathan's example here.
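To give the flavor (this is not Jonathan's code, just a rough sketch with placeholder URLs, keys and file names), a wrapping task might start a Domino run, poll until it completes, and only then declare itself done:

import time

import luigi
import requests

API     = 'https://api.dominodatalab.com/v1/projects/YOUR_USER/YOUR_PROJECT/runs'
HEADERS = {'X-Domino-Api-Key': 'YOUR_API_KEY', 'Content-Type': 'application/json'}

class DominoRun(luigi.Task):
    """Start a Domino run and block until the Domino API reports completion."""
    command = luigi.Parameter()             # e.g. 'YOUR_COMMAND.sh'
    key     = luigi.Parameter(default='A')  # e.g. the letter filter from the map hack

    def output(self):
        # A local marker file recording the finished runId
        return luigi.LocalTarget('domino_run_{0}.done'.format(self.key))

    def run(self):
        payload = {'command': [self.command, self.key], 'isDirect': False}
        run_id = requests.post(API, headers=HEADERS, json=payload).json()['runId']
        while True:
            time.sleep(60)
            status = requests.get('{0}/{1}'.format(API, run_id), headers=HEADERS).json()
            if status.get('isCompleted'):
                break
        with self.output().open('w') as f:
            f.write(run_id)

A downstream reduce task can then simply require one DominoRun per key and pick up the synced results once they all complete.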

Ignore it and it will go away

I assume the reader is familiar with the usage of .dominoignore. But a classic catch-22 arises when you forget to ignore files: too many are generated, and syncing becomes a pain - including syncing of the new .dominoignore file that would get you out of the predicament. Sometimes I have had difficulty editing the remote version of .dominoignore through the web browser - the obvious solution - which makes it hard to wiggle out of the space issue. So I just create a launcher for it.

#!/usr/bin/env bash
# Append a pattern to the project's .dominoignore
pattern_to_ignore=$1
echo "${pattern_to_ignore}" >> /mnt/USER/PROJECT/.dominoignore
Trivial but exceedingly useful.

Maintaining a high level of motivation

Just had a nine hour experiment crash at the final stage? Remember that we all face challenges, and none more difficult than those solved by MacGyver.py
