Minerva
Minerva is the Roman counterpart of Athena, and Athena is AWS's serverless query service that stores its results in S3. Minerva goes beyond that, though: it now eases AWS access more broadly, and even offers its own cluster management with Dask.
Athena
I wrote minerva to make programmatic access to Athena seamless and to offer
blocking access (so that your code waits for the result).
The results are returned as pyarrow datasets (with parquet files as the underlying structure).
Please follow along at examples/athena_basic_query.py.
Import the required and helpful libraries:
import minerva
import pprint
pp = pprint.PrettyPrinter(indent=4)
The first substantive line creates a handle to the AWS account, using your
AWS profile from ~/.aws/credentials:
m = minerva.Minerva("hay")
Then, we create a handle to Athena. The argument passed is the S3 output
location where results will be saved (s3://<output>/results/<random number for the query>/):
athena = m.athena("s3://haystac-pmo-athena/")
We submit the query in a non-blocking manner:
query = athena.query(
    """
    select round(longitude, 3) as lon, count(*) as count
    from trajectories.baseline
    where agent = 4
    group by round(longitude, 3)
    order by count(*) desc
    """
)
Minerva automatically wraps query() in an UNLOAD statement so that the
data is unloaded to S3 in the parquet format. As a prerequisite, all columns
must have names, which is why round(longitude, 3) is aliased as lon and
count(*) as count.
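Concretely, the statement that reaches Athena has roughly this shape. (This is an illustration based on Athena's UNLOAD syntax, not minerva's literal output; the exact WITH options are an assumption.)

UNLOAD (
    select round(longitude, 3) as lon, count(*) as count
    from trajectories.baseline
    where agent = 4
    group by round(longitude, 3)
    order by count(*) desc
)
TO 's3://haystac-pmo-athena/results/<random number for the query>/'
WITH (format = 'PARQUET')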
(If you don't need to retrieve any results, as with a CREATE TABLE
statement, use execute() to start a non-blocking query and finish() to
block until completion.)
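For instance, a statement that returns no rows could be run like this. The table name is hypothetical, and I'm assuming execute() returns a handle exposing finish(), mirroring the query()/results() pair above:

stmt = athena.execute("drop table if exists trajectories.scratch")
stmt.finish()  # block until Athena reports completion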
When we're ready to block until the results are available and retrieve them, we do:
data = query.results()
This is blocking, so the code waits here (polling AWS every 5 seconds)
until the results are ready. The results are then downloaded to /tmp/ and
lazily interpreted as parquet files in the form of a
pyarrow.dataset.Dataset.
We can sample the results easily without overloading memory:
pp.pprint(data.head(10))
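Since data is an ordinary pyarrow dataset, you can also materialize it with the standard pyarrow API once you know it fits in memory (this is plain pyarrow, not minerva-specific):

table = data.to_table(columns=["lon", "count"])  # read only the needed columns
df = table.to_pandas()                           # convert to a pandas DataFrame
print(df.describe())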
And we also get useful statistics on the query:
print(query.runtime)
print(query.cost)
Two caveats: DO NOT end your statements with a semicolon, and ONLY ONE statement per query is allowed.
Redshift
Please follow along at examples/redshift_basic_query.py.
The only difference from Athena is the creation of the Redshift handle:
red = m.redshift("s3://haystac-te-athena/",
                 db="train",
                 workgroup="phase1-trial2")
In this case, we're connecting to the train DB, and access to the workgroup
and DB is handled through our IAM role. Permission has to be granted
initially by running (in the Redshift web console):
grant usage on schema public to "IAM:<my_iam_user>"
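From there, the interface is the same as Athena's, so submitting a query and fetching results looks like this (the table and column names are hypothetical):

query = red.query(
    """
    select agent, count(*) as count
    from public.trajectories
    group by agent
    order by count(*) desc
    """
)
data = query.results()
pp.pprint(data.head(10))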
S3
import minerva
m = minerva.Minerva("hay")
objs = m.s3.ls("s3://haystac-pmo-athena/")
print(list(objs))
See minerva/s3.py for a full list of supported methods.
EC2
Please follow along at examples/simple_instance.py.
Cluster
Dask
Helpers
I wrote a Timing module to help with timing various functions:
with Timing("my cool test"):
    long_function()
# Prints the following
#
# my cool test:
# => 32.45
Basic Usage
Build
To build the project, run the following commands (requires Poetry to be installed):
poetry install
poetry build
TODO
- parallelize the downloading of files