# Minerva
Minerva is the Roman equivalent of Athena, and Athena is AWS's SQL query service
that stores its results in S3.

To ease programmatic access to Athena and offer blocking access (so that your
code waits for the result), I wrote `minerva`.

The results are returned as pyarrow datasets (with parquet files as the
underlying structure).

# Basic Usage
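```python
import minerva as m

# Connect using an AWS profile and the S3 location for query results
athena = m.Athena("hay", "s3://haystac-pmo-athena/")
# Submit the query (non-blocking)
query = athena.query('select * from "trajectories"."kitware" limit 10')
# Wait for the query to finish and load the results (blocking)
data = query.results()
print(data.head(10))
```
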
First, a connection to Athena is made. The first argument is the AWS profile in
`~/.aws/credentials`. The second argument is the S3 location where the results
will be stored.

In the second substantive line, an SQL query is submitted. This is
**non-blocking**: the query is off and running, and you are free to do whatever
you want in the meantime.

In the third substantive line, the results are requested. This is **blocking**,
so the code will wait here (checking with AWS every 5 seconds) until the results
are ready. Then, the results are downloaded to `/tmp/` and lazily interpreted as
parquet files in the form of a `pyarrow.dataset.Dataset`.

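The blocking wait described above can be pictured as a simple polling loop. This is a minimal sketch, not `minerva`'s actual code: `wait_for_query`, `get_state`, and the injectable `sleep` are hypothetical names introduced here for illustration. With boto3, `get_state` could wrap the Athena client's `get_query_execution` call and read `["QueryExecution"]["Status"]["State"]`.

```python
import time
from typing import Callable

# States Athena reports for a query that has stopped running.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(get_state: Callable[[], str],
                   poll_seconds: float = 5.0,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    """Block until get_state() reports a terminal state, then return it.

    Hypothetical helper: get_state stands in for a call to AWS.
    """
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)  # wait before asking again


# Stand-in for AWS: the query is queued, then running, then done.
states = iter(["QUEUED", "RUNNING", "SUCCEEDED"])
naps = []  # records each requested sleep instead of actually sleeping
final = wait_for_query(lambda: next(states), sleep=naps.append)
print(final, naps)
```

Passing `sleep` as a parameter keeps the example fast to run; in real use the default `time.sleep` would pause for five seconds between checks.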
**DO NOT END YOUR STATEMENTS WITH A SEMICOLON**

**ONLY ONE STATEMENT PER QUERY ALLOWED**

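If you build query strings in code, a small guard can catch both mistakes before the query is sent. The sketch below is an illustration only: `check_statement` is a hypothetical helper, not part of `minerva`, and its check is deliberately naive (a semicolon inside a string literal is rejected too).

```python
def check_statement(sql: str) -> str:
    """Return sql without surrounding whitespace, or raise ValueError.

    Hypothetical helper, not part of minerva.
    """
    statement = sql.strip()
    if statement.endswith(";"):
        raise ValueError("remove the trailing semicolon")
    if ";" in statement:
        raise ValueError("only one statement per query is allowed")
    return statement


ok = check_statement('select * from "trajectories"."kitware" limit 10')
print(ok)
```

The returned string could then be passed to `athena.query(...)` as usual.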
# Returning Scalar Values
In SQL, scalar values get assigned an anonymous column, which Athena doesn't
like. Thus, you have to give the column a name.

```python
data = athena.query('select count(*) as my_col from "trajectories"."kitware"').results()
print(data.head(1))
```

# Build

To build the project, run the following commands (requires poetry to be
installed):

```bash
poetry install
poetry build
```

# TODO
* parallelize the downloading of files