# Minerva

Minerva is the Roman counterpart of Athena, and Athena is AWS's serverless query service that stores its results in S3. Minerva, however, goes beyond that: it now eases access to several AWS services and even offers its own cluster management with Dask.

## Athena

I wrote `minerva` to make programmatic access to Athena seamless and to offer blocking access (so that your code waits for the result). Results are returned as pyarrow datasets (with parquet files as the underlying structure).

Please follow along at `examples/athena_basic_query.py`.

Import the required and helpful libraries:

```
import minerva
import pprint

pp = pprint.PrettyPrinter(indent=4)
```

The first substantive line creates a handle to the AWS account, according to your AWS profile in `~/.aws/credentials`:

```
m = minerva.Minerva("hay")
```

Then, we create a handle to Athena. The argument passed is the S3 output location where results will be saved (`s3://<bucket>/<prefix>/`):

```
athena = m.athena("s3://haystac-pmo-athena/")
```

We submit the query in a non-blocking manner:

```
query = athena.query(
    """
    select round(longitude, 3) as lon, count(*) as count
    from trajectories.baseline
    where agent = 4
    group by round(longitude, 3)
    order by count(*) desc
    """
)
```

Minerva automatically wraps `query()` in an `UNLOAD` statement so that the data is unloaded to S3 in the parquet format. As a prerequisite, **all columns must have names**, which is why `round(longitude, 3)` is aliased as `lon` and `count(*)` as `count`.

(If you *don't* want to retrieve any results, such as for a `CREATE TABLE` statement, use `execute()` to start a non-blocking query and `finish()` to block until completion.)
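For instance, a minimal sketch of that pattern (the table name and DDL below are purely illustrative, and we assume `execute()` returns a handle exposing `finish()`, per the description above):

```
# Illustrative DDL; there is no result set to retrieve,
# so no UNLOAD wrapping is needed.
ddl = athena.execute(
    """
    create table trajectories.baseline_by_agent as
    select agent, count(*) as count
    from trajectories.baseline
    group by agent
    """
)

# Block until the statement has completed on the AWS side.
ddl.finish()
```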
When we're ready to **block** until the results are ready **and retrieve the results**, we do:

```
data = query.results()
```

This is **blocking**, so the code will wait here (polling AWS every 5 seconds) until the results are ready. The results are then downloaded to `/tmp/` and **lazily** interpreted as parquet files via `pyarrow.dataset.dataset()`. We can sample the results easily without overloading memory:

```
pp.pprint(data.head(10))
```
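Because `data` is an ordinary pyarrow dataset, the standard pyarrow API applies on top of it. For example, a lazy filter-and-project before materializing (the column names match the query above; the threshold is illustrative):

```
import pyarrow.compute as pc

# Scans the parquet files in /tmp/ lazily and materializes only
# the matching rows and the requested columns.
table = data.to_table(
    columns=["lon", "count"],
    filter=pc.field("count") > 100,
)
print(table.num_rows)
```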
We also get useful statistics on the query:

```
print(query.runtime)
print(query.cost)
```

**DO NOT END YOUR STATEMENTS WITH A SEMICOLON**

**ONLY ONE STATEMENT PER QUERY ALLOWED**

## Redshift

Please follow along at `examples/redshift_basic_query.py`. The only difference from Athena is the creation of the Redshift handle:

```
red = m.redshift("s3://haystac-te-athena/", db="train", workgroup="phase1-trial2")
```

In this case, we're connecting to the `train` DB, and access to the workgroup and DB is handled through our IAM role. Permission has to be granted initially by running (in the Redshift web console):

```
grant USAGE ON schema public to "IAM:<username>"
```

## S3

```
import minerva

m = minerva.Minerva("hay")
objs = m.s3.ls("s3://haystac-pmo-athena/")
print(list(objs))
```

See `minerva/s3.py` for a full list of supported methods.

## EC2

Follow along with `examples/simple_instance.py`.

## Cluster

## Dask

## Helpers

I wrote a `Timing` module to help with timing various functions:

```
with Timing("my cool test"):
    long_function()

# Prints the following
#
# my cool test:
# => 32.45
```
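The real implementation ships with minerva; a minimal sketch of such a context manager (assuming the printed number is elapsed seconds) could look like:

```
import time

class Timing:
    """Print the wall-clock time spent inside a `with` block."""

    def __init__(self, label):
        self.label = label

    def __enter__(self):
        # Monotonic, high-resolution clock suited to interval timing.
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        elapsed = time.perf_counter() - self.start
        print(f"{self.label}:")
        print(f"=> {elapsed:.2f}")
```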
# Basic Usage

# Build

To build the project, run the following commands (requires Poetry to be installed):

```bash
poetry install
poetry build
```

# TODO

* parallelize the downloading of files
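One way the download parallelization could be approached (a sketch only, not part of minerva yet; it assumes plain boto3 and that the bucket and keys are already known):

```
from concurrent.futures import ThreadPoolExecutor

import boto3

def download_all(bucket, keys, dest="/tmp"):
    """Download the given S3 keys into `dest` concurrently."""
    s3 = boto3.client("s3")

    def fetch(key):
        local = f"{dest}/{key.rsplit('/', 1)[-1]}"
        s3.download_file(bucket, key, local)
        return local

    # boto3 clients are thread-safe, so a single shared client is fine.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(fetch, keys))
```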