Minerva
Minerva is the Roman counterpart of Athena, and Athena is AWS's serverless query service that stores its results in S3. Minerva goes beyond that, though: it now eases AWS access more broadly, and even offers its own cluster management with Dask.
Athena
I wrote minerva to make programmatic access to Athena seamless and to offer
blocking access (so that your code waits for the result).
The results are returned as pyarrow datasets (with parquet files as the underlying structure).
Please follow along at examples/athena_basic_query.py.
Import the required and helpful libraries:
import minerva
import pprint
pp = pprint.PrettyPrinter(indent=4)
The first substantive line creates a handle to the AWS account, using your
AWS profile from ~/.aws/credentials:
m = minerva.Minerva("hay")
Then, we create a handle to Athena. The argument passed is the S3 output
location where results will be saved (s3://<output>/results/<random number for the query>/):
athena = m.athena("s3://haystac-pmo-athena/")
We submit the query in a non-blocking manner:
query = athena.query(
    """
    select round(longitude, 3) as lon, count(*) as count
    from trajectories.baseline
    where agent = 4
    group by round(longitude, 3)
    order by count(*) desc
    """
)
Minerva automatically wraps query() in an UNLOAD statement so that the
data is unloaded to S3 in the parquet format. As a prerequisite, all columns
must have names, which is why round(longitude, 3) is aliased as lon and
count(*) as count.
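Concretely, the statement that reaches Athena has roughly this shape. (This is an illustration based on Athena's UNLOAD syntax, not minerva's literal output; the exact WITH options are an assumption.)

UNLOAD (
    select round(longitude, 3) as lon, count(*) as count
    from trajectories.baseline
    where agent = 4
    group by round(longitude, 3)
    order by count(*) desc
)
TO 's3://haystac-pmo-athena/results/<random number for the query>/'
WITH (format = 'PARQUET')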
(If you don't need to retrieve any results, as with a CREATE TABLE
statement, use execute() to start a non-blocking query and finish() to
block until completion.)
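For instance, a statement that returns no rows could be run like this. The table name is hypothetical, and I'm assuming execute() returns a handle exposing finish(), mirroring the query()/results() pair above:

stmt = athena.execute("drop table if exists trajectories.scratch")
stmt.finish()  # block until Athena reports completion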
When we're ready to block until the results are available and retrieve them, we do:
data = query.results()
This is blocking, so the code waits here (polling AWS every 5 seconds)
until the results are ready. The results are then downloaded to /tmp/ and
lazily interpreted as parquet files in the form of a
pyarrow.dataset.Dataset.
We can sample the results easily without overloading memory:
pp.pprint(data.head(10))
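Since data is an ordinary pyarrow dataset, you can also materialize it with the standard pyarrow API once you know it fits in memory (this is plain pyarrow, not minerva-specific):

table = data.to_table(columns=["lon", "count"])  # read only the needed columns
df = table.to_pandas()                           # convert to a pandas DataFrame
print(df.describe())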
And we also get useful statistics on the query:
print(query.runtime)
print(query.cost)
Two caveats: DO NOT end your statements with a semicolon, and ONLY ONE statement per query is allowed.
Redshift
Please follow along at examples/redshift_basic_query.py.
The only difference from Athena is the creation of the Redshift handle:
red = m.redshift("s3://haystac-te-athena/",
                 db="train",
                 workgroup="phase1-trial2")
In this case, we're connecting to the train DB, and access to the workgroup
and DB is handled through our IAM role. Permission has to be granted
initially by running (in the Redshift web console):
grant usage on schema public to "IAM:<my_iam_user>"
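From there, the interface is the same as Athena's, so submitting a query and fetching results looks like this (the table and column names are hypothetical):

query = red.query(
    """
    select agent, count(*) as count
    from public.trajectories
    group by agent
    order by count(*) desc
    """
)
data = query.results()
pp.pprint(data.head(10))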
S3
import minerva
m = minerva.Minerva("hay")
objs = m.s3.ls("s3://haystac-pmo-athena/")
print(list(objs))
See minerva/s3.py for a full list of supported methods.
EC2
Please follow along at examples/simple_instance.py.
Cluster
Dask
Helpers
I wrote a Timing module to help with timing various functions:
with Timing("my cool test"):
    long_function()
# Prints the following
#
# my cool test:
# => 32.45
Basic Usage
Build
To build the project, run the following commands (requires Poetry to be installed):
poetry install
poetry build
TODO
- parallelize the downloading of files