# Minerva

Minerva is the Roman counterpart of Athena, and Athena is AWS's serverless query service that stores its results in S3. Minerva goes beyond that, though: it now eases AWS access in general and even offers its own cluster management with Dask.

## Athena

To ease programmatic access to Athena and to offer blocking access (so that your code waits for the result), I wrote Minerva to make the round trip seamless.

The results are returned as pyarrow datasets (with Parquet files as the underlying structure).

Please follow along in `examples/athena_basic_query.py`.

Import the required and helpful libraries:

```python
import minerva
import pprint

pp = pprint.PrettyPrinter(indent=4)
```

The first substantive line creates a handle to the AWS account, according to the named profile in `~/.aws/credentials`:

```python
m = minerva.Minerva("hay")
```

Then, we create a handle to Athena. The argument passed is the S3 output location where results will be saved (as `s3://<output>/results/<random number for the query>/`):

```python
athena = m.athena("s3://haystac-pmo-athena/")
```

We submit the query in a non-blocking manner:

```python
query = athena.query(
    """
    select round(longitude, 3) as lon, count(*) as count
    from trajectories.baseline
    where agent = 4
    group by round(longitude, 3)
    order by count(*) desc
    """
)
```

Minerva automatically wraps `query()` in an UNLOAD statement so that the data is written to S3 in Parquet format. As a prerequisite, all columns must have names, which is why `round(longitude, 3)` is aliased to `lon` and `count(*)` to `count`.
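Conceptually, the statement Minerva submits to Athena looks roughly like the following. This is only an illustration of Athena's UNLOAD syntax, not Minerva's literal output; the exact destination prefix is internal to Minerva:

```sql
-- Illustration of the UNLOAD wrapping (destination prefix is hypothetical)
UNLOAD (
    select round(longitude, 3) as lon, count(*) as count
    from trajectories.baseline
    where agent = 4
    group by round(longitude, 3)
    order by count(*) desc
)
TO 's3://haystac-pmo-athena/results/<random number for the query>/'
WITH (format = 'PARQUET')
```

Because UNLOAD writes columns to Parquet, every output column needs a name, which is where the aliasing requirement comes from.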

(If you don't need to retrieve any results, such as for a CREATE TABLE statement, use `execute()` to start a non-blocking query and `finish()` to block until completion.)

When we're ready to block until the results are available and retrieve them, we do:

```python
data = query.results()
```

This call blocks, checking with AWS every 5 seconds, until the results are ready. The results are then downloaded to `/tmp/` and lazily interpreted as Parquet files in the form of a `pyarrow.dataset.Dataset`.
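Under the hood, blocking like this amounts to a poll loop against the Athena API. A minimal sketch of the pattern, where `fetch_state` is a stand-in for whatever Minerva actually calls (e.g. boto3's `get_query_execution`):

```python
import time

def wait_for_query(fetch_state, interval=5.0):
    """Poll until the query reaches a terminal Athena state.

    Sketch only, not Minerva's actual implementation. fetch_state is a
    zero-argument callable returning an Athena state string such as
    "QUEUED", "RUNNING", "SUCCEEDED", "FAILED", or "CANCELLED".
    """
    while True:
        state = fetch_state()
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(interval)
```

With a 5-second interval this matches the behavior described above: the caller simply doesn't get control back until Athena reports a terminal state.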

We can sample the results easily without overloading memory:

```python
pp.pprint(data.head(10))
```

And we also get useful statistics on the query:

```python
print(query.runtime)
print(query.cost)
```
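Athena bills by data scanned ($5 per TB on-demand at the time of writing), so a cost figure like this can be derived from the query's scanned-bytes statistic. A hedged sketch; the helper below is illustrative, not Minerva's actual implementation of `query.cost`:

```python
PRICE_PER_TB = 5.00  # Athena on-demand price per TB scanned; check current pricing

def athena_cost(bytes_scanned: int) -> float:
    """Estimate query cost in dollars from bytes scanned.

    Hypothetical helper; uses 10**12 bytes per TB for simplicity and
    ignores Athena's per-query 10 MB minimum.
    """
    return bytes_scanned / 1e12 * PRICE_PER_TB
```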

DO NOT END YOUR STATEMENTS WITH A SEMICOLON

ONLY ONE STATEMENT PER QUERY ALLOWED
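The two rules above can be enforced defensively before submitting. A hypothetical pre-flight helper (not part of Minerva's API) that strips a trailing semicolon and rejects multi-statement strings:

```python
def prepare_query(sql: str) -> str:
    """Strip a trailing semicolon and reject multi-statement queries.

    Hypothetical helper, naive about semicolons inside string
    literals, which is fine for simple analytic queries.
    """
    sql = sql.strip().rstrip(";").strip()
    if ";" in sql:
        raise ValueError("only one statement per query is allowed")
    return sql
```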

## Redshift

Please follow along in `examples/redshift_basic_query.py`.

The only difference from Athena is the creation of the Redshift handle:

```python
red = m.redshift(
    "s3://haystac-te-athena/",
    db="train",
    workgroup="phase1-trial2",
)
```

In this case, we're connecting to the train DB, and access to the workgroup and DB is handled through our IAM role. Permission must first be granted by running (in the Redshift web console):

```sql
grant USAGE ON schema public to "IAM:<my_iam_user>"
```

## S3

```python
import minerva

m = minerva.Minerva("hay")
objs = m.s3.ls("s3://haystac-pmo-athena/")
print(list(objs))
```

See `minerva/s3.py` for a full list of supported methods.

## EC2

Follow along with `examples/simple_instance.py`.

## Cluster

### Dask

## Helpers

I wrote a `Timing` module to help with timing various functions:

```python
with Timing("my cool test"):
    long_function()

# Prints the following:
#
# my cool test:
# => 32.45
```
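A context manager with that behavior can be sketched in a few lines. This is an illustration only; Minerva's actual `Timing` module may differ:

```python
import time

class Timing:
    """Print the wall-clock seconds spent inside the with-block.

    Sketch of the idea, not Minerva's actual implementation.
    """

    def __init__(self, label: str):
        self.label = label

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.elapsed = time.perf_counter() - self.start
        print(f"{self.label}:\n=> {self.elapsed:.2f}")
        return False  # never swallow exceptions from the block
```

`time.perf_counter()` is the right clock here: it's monotonic and high-resolution, unlike `time.time()`, which can jump if the system clock is adjusted.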

## Basic Usage

## Build

To build the project, run the following commands (requires Poetry):

```shell
poetry install
poetry build
```

## TODO

- parallelize the downloading of files