minerva/README.md
2023-10-12 14:58:18 -04:00

82 lines
2.1 KiB
Markdown

# Minerva
Minerva is the Roman equivalent of Athena, and Athena is AWS's database that
stores results in S3. However, Minerva goes beyond that, and now eases all AWS
access, and even offers its own cluster management with Dask.
## Athena
In order to ease programmatic access to Athena and offer blocking access (so
that your code waits for the result), I wrote `minerva` to make it seamless.
The results are returned as pyarrow datasets (with parquet files as the
underlying structure).
## Redshift
## S3
## EC2
## Cluster
## Dask
## Helpers
I wrote a `Timing` module to help with timing various functions:
```
with Timing("my cool test"):
long_function()
# Prints the following
#
# my cool test:
# => 32.45
```
# Basic Usage
```
import minerva as m
athena = m.Athena("hay", "s3://haystac-pmo-athena/")
query = athena.query('select * from "trajectories"."kitware" limit 10')
data = query.results()
print(data.head(10))
```
First, a connection to Athena is made. The first argument is the AWS profile in
`~/.aws/credentials`. The second argument is the S3 location where the results
will be stored.
In the second substantive line, an SQL query is made. This is **non-blocking**.
The query is off and running and you are free to do whatever you want now.
In the third line, the results are requested. This is **blocking**, so the code
will wait here (checking with AWS every 5 seconds) until the results are ready.
Then, the results are downloaded to `/tmp/` and lazily interpreted as parquet
files in the form of a `pyarrow.dataset.dataset`.
**DO NOT END YOUR STATEMENTS WITH A SEMICOLON**
**ONLY ONE STATEMENT PER QUERY ALLOWED**
# Returning Scalar Values
In SQL, scalar values get assigned an anonymous column -- Athena doesn't like
that. Thus, you have to assign the column a name.
```
data = athena.query('select count(*) as my_col from "trajectories"."kitware"').results()
print(data.head(1))
```
# Build
To build the project, run the following commands. (Requires poetry installed):
```bash
poetry install
poetry build
```
# TODO
* parallelize the downloading of files