tidying up examples and readme

This commit is contained in:
Ari Brown 2024-01-30 18:13:56 -05:00
parent 819bf7abf3
commit e3c11fb1aa
5 changed files with 138 additions and 66 deletions

README.md

@@ -10,12 +10,121 @@ that your code waits for the result), I wrote `minerva` to make it seamless.
The results are returned as pyarrow datasets (with parquet files as the
underlying structure).
Please follow along at `examples/athena_basic_query.py`.
Import the required and helpful libraries:
```
import minerva
import pprint
pp = pprint.PrettyPrinter(indent=4)
```
The first substantive line creates a handle to the AWS account, using your
AWS profile in `~/.aws/credentials`:
```
m = minerva.Minerva("hay")
```
Then, we create a handle to `Athena`. The argument passed is the S3 output
location where results will be saved (`s3://<output>/results/<random number for
the query>/`):
```
athena = m.athena("s3://haystac-pmo-athena/")
```
We place the query in a non-blocking manner:
```
query = athena.query(
"""
select round(longitude, 3) as lon, count(*) as count
from trajectories.baseline
where agent = 4
group by round(longitude, 3)
order by count(*) desc
"""
)
```
Minerva will automatically wrap `query()` in an `UNLOAD` statement so that the
data is unloaded to S3 in the `parquet` format. As a prerequisite for this,
**all columns must have names**, which is why `round(longitude, 3)` is
designated `lon` and `count(*)` is `count`.
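To illustrate the idea, here is a hypothetical sketch of the kind of wrapping described above (an assumption for illustration only; the exact statement minerva generates, and the helper name `wrap_in_unload`, are not part of minerva's API):

```python
def wrap_in_unload(select_sql: str, s3_location: str) -> str:
    """Wrap a SELECT so Athena writes the result set to S3 as parquet.

    Every output column of select_sql must have a name, since UNLOAD
    refuses anonymous columns.
    """
    return (
        f"UNLOAD ({select_sql.strip()})\n"
        f"TO '{s3_location}'\n"
        "WITH (format = 'PARQUET')"
    )

print(wrap_in_unload("select 1 as one", "s3://bucket/results/123/"))
```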
(If you *don't* need to retrieve any results, such as with a `CREATE TABLE`
statement, use `execute()` to start a non-blocking query and `finish()` to
block until it completes.)
When we're ready to **block** until the results are ready **and retrieve the
results**, we do:
```
data = query.results()
```
This is **blocking**, so the code will wait here (checking with AWS every 5
seconds) until the results are ready. Then, the results are downloaded to
`/tmp/` and **lazily** interpreted as parquet files in the form of a
`pyarrow.dataset.dataset`.
We can sample the results easily without overloading memory:
```
pp.pprint(data.head(10))
```
And we also get useful statistics on the query:
```
print(query.runtime)
print(query.cost)
```
**DO NOT END YOUR STATEMENTS WITH A SEMICOLON**
**ONLY ONE STATEMENT PER QUERY ALLOWED**
## Redshift
Please follow along at `examples/redshift_basic_query.py`.
The only difference from Athena is the creation of the Redshift handle:
```
red = m.redshift("s3://haystac-te-athena/",
                 db="train",
                 workgroup="phase1-trial2")
```
In this case, we're connecting to the `train` DB, and access to the workgroup
and DB is handled through our IAM role. Permission must first be granted by
running (in the Redshift web console):
```
GRANT USAGE ON SCHEMA public TO "IAM:<my_iam_user>"
```
## S3
```
import minerva
m = minerva.Minerva("hay")
objs = m.s3.ls("s3://haystac-pmo-athena/")
print(list(objs))
```
See `minerva/s3.py` for a full list of supported methods.
## EC2
Follow along with `examples/simple_instance.py`.
## Cluster
## Dask
@@ -34,39 +143,6 @@ with Timing("my cool test"):
```
# Basic Usage
```
import minerva as m
athena = m.Athena("hay", "s3://haystac-pmo-athena/")
query = athena.query('select * from "trajectories"."kitware" limit 10')
data = query.results()
print(data.head(10))
```
First, a connection to Athena is made. The first argument is the AWS profile in
`~/.aws/credentials`. The second argument is the S3 location where the results
will be stored.
In the second substantive line, a SQL query is submitted. This is
**non-blocking**: the query starts running and you are free to do other work
in the meantime.
In the third line, the results are requested. This is **blocking**, so the code
will wait here (checking with AWS every 5 seconds) until the results are ready.
Then, the results are downloaded to `/tmp/` and lazily interpreted as parquet
files in the form of a `pyarrow.dataset.dataset`.
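The blocking behavior above boils down to a simple poll loop. Here is a hedged sketch of that pattern (the function name `wait_for_completion`, the status strings, and the injectable `get_status`/`sleep` hooks are assumptions for illustration, not minerva's actual internals):

```python
import time


def wait_for_completion(get_status, interval=5, sleep=time.sleep):
    """Poll get_status() every `interval` seconds until it is terminal."""
    while True:
        status = get_status()
        if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return status
        sleep(interval)


# Simulated status stream: two "RUNNING" checks, then success.
statuses = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
result = wait_for_completion(lambda: next(statuses), sleep=lambda s: None)
print(result)  # SUCCEEDED
```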
**DO NOT END YOUR STATEMENTS WITH A SEMICOLON**
**ONLY ONE STATEMENT PER QUERY ALLOWED**
# Returning Scalar Values
In SQL, scalar values get assigned an anonymous column -- Athena doesn't like
that. Thus, you have to assign the column a name.
```
data = athena.query('select count(*) as my_col from "trajectories"."kitware"').results()
print(data.head(1))
```
# Build