Forked from `bellwether/minerva`.

# Minerva

Minerva is the Roman equivalent of Athena, and Athena is AWS's query service that stores its results in S3. Minerva goes beyond that: it eases all AWS access and even offers its own cluster management with Dask.

## Athena

In order to ease programmatic access to Athena and offer blocking access (so that your code waits for the result), I wrote `minerva` to make it seamless.

The results are returned as pyarrow datasets (with parquet files as the underlying structure).

Please follow along at `examples/athena_basic_query.py`.

Import the required and helpful libraries:

```python
import minerva
import pprint

pp = pprint.PrettyPrinter(indent=4)
```

The first substantive line creates a handle to the AWS account according to your AWS profile in `~/.aws/credentials`:

```python
m = minerva.Minerva("hay")
```

Then, we create a handle to `Athena`. The argument passed is the S3 output location where results will be saved (`s3://<output>/results/<random number for the query>/`):

```python
athena = m.athena("s3://haystac-pmo-athena/")
```

We submit the query in a non-blocking manner:

```python
query = athena.query(
    """
    select round(longitude, 3) as lon, count(*) as count
    from trajectories.baseline
    where agent = 4
    group by round(longitude, 3)
    order by count(*) desc
    """
)
```

Minerva will automatically wrap `query()` in an `UNLOAD` statement so that the data is unloaded to S3 in the `parquet` format. As a prerequisite for this, **all columns must have names**, which is why `round(longitude, 3)` is aliased as `lon` and `count(*)` as `count`.

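For reference, the wrapped statement Athena receives looks roughly like the following (the exact output path is illustrative; Athena's `UNLOAD` syntax requires the inner query in parentheses):

```sql
UNLOAD (
    select round(longitude, 3) as lon, count(*) as count
    from trajectories.baseline
    where agent = 4
    group by round(longitude, 3)
    order by count(*) desc
)
TO 's3://haystac-pmo-athena/results/<random number>/'
WITH (format = 'PARQUET')
```
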
(If you *don't* want to retrieve any results, such as for a `CREATE TABLE` statement, use `execute()` to start a non-blocking query and `finish()` to block until completion.)

When we're ready to **block** until the results are available **and retrieve them**, we do:

```python
data = query.results()
```

This call is **blocking**, so the code will wait here (checking with AWS every 5 seconds) until the results are ready. Then, the results are downloaded to `/tmp/` and **lazily** interpreted as parquet files in the form of a `pyarrow.dataset.dataset`.

We can sample the results easily without overloading memory:

```python
pp.pprint(data.head(10))
```

And we also get useful statistics on the query:

```python
print(query.runtime)
print(query.cost)
```

**DO NOT END YOUR STATEMENTS WITH A SEMICOLON**

**ONLY ONE STATEMENT PER QUERY ALLOWED**

## Redshift

Please follow along at `examples/redshift_basic_query.py`.

The only difference from Athena is the creation of the Redshift handle:

```python
red = m.redshift(
    "s3://haystac-te-athena/",
    db="train",
    workgroup="phase1-trial2",
)
```

In this case, we're connecting to the `train` DB, and access to the workgroup and DB is handled through our IAM role. Permission has to be granted initially by running (in the Redshift web console):

```sql
grant USAGE ON schema public to "IAM:<my_iam_user>"
```

## S3

```python
import minerva

m = minerva.Minerva("hay")
objs = m.s3.ls("s3://haystac-pmo-athena/")
print(list(objs))
```

See `minerva/s3.py` for a full list of supported methods.

## EC2

Follow along with `examples/simple_instance.py`.

## Cluster

## Dask

## Helpers

I wrote a `Timing` module to help with timing various functions:

```python
with Timing("my cool test"):
    long_function()

# Prints the following:
#
# my cool test:
# => 32.45
```

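Such a helper can be written as a simple context manager; here is a minimal sketch (the actual `minerva` implementation may differ):

```python
import time


class Timing:
    """Minimal context-manager sketch: prints a label and elapsed seconds."""

    def __init__(self, label):
        self.label = label

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self.start
        print(f"{self.label}:\n => {self.elapsed:.2f}")
        return False  # never swallow exceptions


with Timing("demo"):
    sum(range(1_000_000))
```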
|
# Basic Usage
|
|
|
|
# Build

To build the project, run the following commands (requires Poetry to be installed):

```bash
poetry install
poetry build
```

# TODO

* parallelize the downloading of files
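One way to approach the parallel-download item with the standard library; `download_one` here is a hypothetical stand-in for whatever single-file download routine minerva uses internally:

```python
from concurrent.futures import ThreadPoolExecutor


def download_many(download_one, keys, max_workers=8):
    """Download S3 objects concurrently. `download_one` is a hypothetical
    callable taking a key and returning the local path it was saved to."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with keys.
        return list(pool.map(download_one, keys))


# Example with a dummy downloader that just maps keys to /tmp/ paths:
paths = download_many(lambda key: "/tmp/" + key, ["a.parquet", "b.parquet"])
print(paths)  # ['/tmp/a.parquet', '/tmp/b.parquet']
```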