# Minerva
Minerva is the Roman equivalent of Athena, and Athena is AWS's query service
that stores its results in S3. Minerva goes beyond that, though: it eases
general AWS access and even offers its own cluster management with Dask.
## Athena
To ease programmatic access to Athena and provide blocking access (so that
your code waits for the result), I wrote `minerva` to make it seamless.
The results are returned as pyarrow datasets (with parquet files as the
underlying structure).
Please follow along at `examples/athena_basic_query.py`.
Import the required and helpful libraries:
```
import minerva
import pprint
pp = pprint.PrettyPrinter(indent=4)
```
The first substantive line creates a handle to the AWS account, according to
your AWS profile in `~/.aws/credentials`:
```
m = minerva.Minerva("hay")
```
Then, we create a handle to `Athena`. The argument passed is the S3 output
location where results will be saved (`s3://<output>/results/<random number for
the query>/`):
```
athena = m.athena("s3://haystac-pmo-athena/")
```
We place the query in a non-blocking manner:
```
query = athena.query(
    """
    select round(longitude, 3) as lon, count(*) as count
    from trajectories.baseline
    where agent = 4
    group by round(longitude, 3)
    order by count(*) desc
    """
)
```
Minerva will automatically wrap the query in an `UNLOAD` statement so that the
data is unloaded to S3 in `parquet` format. As a prerequisite for this,
**all columns must have names**, which is why `round(longitude, 3)` is aliased
as `lon` and `count(*)` as `count`.
(If you *don't* want to retrieve any results, as with a `CREATE TABLE`
statement, use `execute()` to start a non-blocking query and `finish()` to
block until it completes; see the sketch below.)
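For example, a statement with no result set might look like the following
sketch. The `execute()`/`finish()` behavior is assumed from the description
above, and the `trajectories.sample` table name is purely illustrative:
```
# Sketch: execute() is assumed to take the SQL string and return a query
# handle, and finish() to block until the statement completes
ddl = athena.execute("create table trajectories.sample as select * from trajectories.baseline limit 100")
ddl.finish()
```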
When we're ready to **block** until the results are available **and retrieve
the results**, we do:
```
data = query.results()
```
This is **blocking**, so the code will wait here (polling AWS every 5
seconds) until the results are ready. The results are then downloaded to
`/tmp/` and **lazily** interpreted as parquet files in the form of a
`pyarrow.dataset.Dataset`.
We can sample the results easily without overloading memory:
```
pp.pprint(data.head(10))
```
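Because the result behaves like a standard `pyarrow.dataset.Dataset`, you can
also project columns and push filters down before materializing anything. The
sketch below uses only the standard `pyarrow.dataset` API and assumes the
`lon`/`count` columns from the query above:
```
import pyarrow.dataset as ds

# Materialize only the rows and columns we actually need
table = data.to_table(
    columns=["lon", "count"],
    filter=ds.field("count") > 100,
)
print(table.num_rows)
```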
And we also get useful statistics on the query:
```
print(query.runtime)
print(query.cost)
```
**DO NOT END YOUR STATEMENTS WITH A SEMICOLON**

**ONLY ONE STATEMENT PER QUERY ALLOWED**
## Redshift
Please follow along at `examples/redshift_basic_query.py`.
The only difference from Athena is the creation of the Redshift handle:
```
red = m.redshift("s3://haystac-te-athena/",
                 db="train",
                 workgroup="phase1-trial2")
```
In this case, we're connecting to the `train` DB, and access to the workgroup
and DB is handled through our IAM role. Permission has to be granted initially
by running (in the Redshift web console):
```
GRANT USAGE ON SCHEMA public TO "IAM:<my_iam_user>"
```
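After that, querying and retrieving results work just like the Athena example
above (a sketch; the `public.trips` table name is purely illustrative):
```
# Hypothetical table name; the query/results flow mirrors the Athena example
query = red.query(
    """
    select count(*) as count
    from public.trips
    """
)
data = query.results()
print(data.head(5))
```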
## S3
```
import minerva
m = minerva.Minerva("hay")
objs = m.s3.ls("s3://haystac-pmo-athena/")
print(list(objs))
```
See `minerva/s3.py` for a full list of supported methods.
## EC2
Follow along with `examples/simple_instance.py`
## Cluster
## Dask
## Helpers
I wrote a `Timing` module to help with timing various functions:
```
with Timing("my cool test"):
    long_function()
# Prints the following
#
# my cool test:
# => 32.45
```
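For reference, a context manager along these lines would produce that output
(a minimal sketch, not necessarily the actual `Timing` implementation in
minerva):
```
import time

class Timing:
    """Context manager that prints the elapsed wall-clock time on exit."""

    def __init__(self, label):
        self.label = label

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        elapsed = time.perf_counter() - self.start
        print(f"{self.label}:\n=> {elapsed:.2f}")
```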
# Basic Usage
# Build
To build the project, run the following commands (requires Poetry to be installed):
```bash
poetry install
poetry build
```
# TODO
* parallelize the downloading of files