Minerva

Minerva is the Roman equivalent of Athena, and Athena is AWS's serverless query service that stores its results in S3. Minerva goes beyond that, though: it now eases all AWS access and even offers its own cluster management with Dask.

Athena

I wrote minerva to make programmatic access to Athena seamless and to offer blocking access (so that your code waits for the result).

The results are returned as pyarrow datasets (with parquet files as the underlying structure).

Redshift

S3

EC2

Cluster

Dask

Helpers

I wrote a Timing module to help with timing various functions:

with Timing("my cool test"):
    long_function()

# Prints the following
# 
# my cool test:
# => 32.45
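
For reference, a context manager with this behavior can be written in a few lines. The sketch below is not the actual implementation, just a minimal version (using time.perf_counter) that produces output in the same format:

import time

class Timing:
    """Minimal sketch of a timing context manager (not the real implementation)."""

    def __init__(self, label):
        self.label = label

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        elapsed = time.perf_counter() - self.start
        # Matches the output format shown above: the label, then the seconds elapsed.
        print(f"{self.label}:\n=> {elapsed:.2f}")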

Basic Usage

import minerva as m

athena = m.Athena("hay", "s3://haystac-pmo-athena/")
query  = athena.query('select * from "trajectories"."kitware" limit 10')
data   = query.results()
print(data.head(10))

First, a connection to Athena is made. The first argument is the AWS profile in ~/.aws/credentials. The second argument is the S3 location where the results will be stored.

In the second substantive line, an SQL query is submitted. This call is non-blocking: the query is off and running on Athena, and you are free to do other work in the meantime.
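
Because query() returns immediately, you can submit several queries up front and only block when you collect each result. Here is a sketch using the same API as above (the second table name is purely illustrative):

queries = {
    name: athena.query(f'select count(*) as n from "trajectories"."{name}"')
    for name in ("kitware", "another_table")  # "another_table" is a hypothetical name
}

# Each .results() call blocks until that particular query has finished.
counts = {name: q.results().head(1) for name, q in queries.items()}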

In the third line, the results are requested. This is blocking, so the code will wait here (checking with AWS every 5 seconds) until the results are ready. Then, the results are downloaded to /tmp/ and lazily interpreted as parquet files in the form of a pyarrow.dataset.Dataset.
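
Since the result is a regular pyarrow dataset, the usual pyarrow machinery applies; for example, to materialize it and convert it to pandas (assuming pandas is installed):

table = data.to_table()    # materialize the lazy dataset into a pyarrow Table
df    = table.to_pandas()  # convert to a pandas DataFrame
print(df.describe())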

DO NOT END YOUR STATEMENTS WITH A SEMICOLON

ONLY ONE STATEMENT PER QUERY ALLOWED

Returning Scalar Values

In SQL, scalar results come back in an anonymous column, which Athena doesn't like. You therefore have to give the column a name yourself.

data = athena.query('select count(*) as my_col from "trajectories"."kitware"').results()
print(data.head(1))
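
If you want the scalar as a plain Python value, you can pull it out of the returned pyarrow dataset, e.g.:

count = data.head(1).column("my_col")[0].as_py()  # head() gives a pyarrow Table; as_py() unwraps the scalar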

Build

To build the project, run the following commands (requires Poetry to be installed):

poetry install
poetry build

TODO

  • parallelize the downloading of files