tidying up examples and readme

This commit is contained in:
Ari Brown 2024-01-30 18:13:56 -05:00
parent 819bf7abf3
commit e3c11fb1aa
5 changed files with 138 additions and 66 deletions

README.md

@@ -10,12 +10,121 @@ that your code waits for the result), I wrote `minerva` to make it seamless.
The results are returned as pyarrow datasets (with parquet files as the
underlying structure).
Please follow along at `examples/athena_basic_query.py`.
Import the required and helpful libraries:
```
import minerva
import pprint
pp = pprint.PrettyPrinter(indent=4)
```
The first substantive line creates a handle to the AWS account, using your
AWS profile in `~/.aws/credentials`:
```
m = minerva.Minerva("hay")
```
Then, we create a handle to `Athena`. The argument passed is the S3 output
location where results will be saved (`s3://<output>/results/<random number for
the query>/`):
```
athena = m.athena("s3://haystac-pmo-athena/")
```
We place the query in a non-blocking manner:
```
query = athena.query(
"""
select round(longitude, 3) as lon, count(*) as count
from trajectories.baseline
where agent = 4
group by round(longitude, 3)
order by count(*) desc
"""
)
```
Minerva will automatically wrap `query()` in an `UNLOAD` statement so that the
data is unloaded to S3 in the `parquet` format. As a prerequisite for this,
**all columns must have names**, which is why `round(longitude, 3)` is
designated `lon` and `count(*)` is `count`.
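To illustrate the idea, here is a hypothetical sketch of the kind of wrapping described above (an assumption for illustration only; the exact statement minerva generates, and the helper name `wrap_in_unload`, are not part of minerva's API):

```python
def wrap_in_unload(select_sql: str, s3_location: str) -> str:
    """Wrap a SELECT so Athena writes the result set to S3 as parquet.

    Every output column of select_sql must have a name, since UNLOAD
    refuses anonymous columns.
    """
    return (
        f"UNLOAD ({select_sql.strip()})\n"
        f"TO '{s3_location}'\n"
        "WITH (format = 'PARQUET')"
    )

print(wrap_in_unload("select 1 as one", "s3://bucket/results/123/"))
```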
(If you *don't* need to retrieve any results, such as with a `CREATE TABLE`
statement, use `execute()` to start a non-blocking query and `finish()` to
block until it completes.)
When we're ready to **block** until the results are ready **and retrieve the
results**, we do:
```
data = query.results()
```
This is **blocking**, so the code will wait here (checking with AWS every 5
seconds) until the results are ready. Then, the results are downloaded to
`/tmp/` and **lazily** interpreted as parquet files in the form of a
`pyarrow.dataset.dataset`.
We can sample the results easily without overloading memory:
```
pp.pprint(data.head(10))
```
And we also get useful statistics on the query:
```
print(query.runtime)
print(query.cost)
```
**DO NOT END YOUR STATEMENTS WITH A SEMICOLON**
**ONLY ONE STATEMENT PER QUERY ALLOWED**
## Redshift
Please follow along at `examples/redshift_basic_query.py`.
The only difference from Athena is the creation of the Redshift handle:
```
red = m.redshift("s3://haystac-te-athena/",
                 db="train",
                 workgroup="phase1-trial2")
```
In this case, we're connecting to the `train` DB, and access to the workgroup
and DB is handled through our IAM role. Permission must first be granted by
running (in the Redshift web console):
```
GRANT USAGE ON SCHEMA public TO "IAM:<my_iam_user>"
```
## S3
```
import minerva
m = minerva.Minerva("hay")
objs = m.s3.ls("s3://haystac-pmo-athena/")
print(list(objs))
```
See `minerva/s3.py` for a full list of supported methods.
## EC2
Follow along with `examples/simple_instance.py`.
## Cluster
## Dask
@@ -34,39 +143,6 @@ with Timing("my cool test"):
```
# Basic Usage
```
import minerva as m
athena = m.Athena("hay", "s3://haystac-pmo-athena/")
query = athena.query('select * from "trajectories"."kitware" limit 10')
data = query.results()
print(data.head(10))
```
First, a connection to Athena is made. The first argument is the AWS profile in
`~/.aws/credentials`. The second argument is the S3 location where the results
will be stored.
In the second substantive line, a SQL query is submitted. This is
**non-blocking**: the query starts running and you are free to do other work
in the meantime.
In the third line, the results are requested. This is **blocking**, so the code
will wait here (checking with AWS every 5 seconds) until the results are ready.
Then, the results are downloaded to `/tmp/` and lazily interpreted as parquet
files in the form of a `pyarrow.dataset.dataset`.
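The blocking behavior above boils down to a simple poll loop. Here is a hedged sketch of that pattern (the function name `wait_for_completion`, the status strings, and the injectable `get_status`/`sleep` hooks are assumptions for illustration, not minerva's actual internals):

```python
import time


def wait_for_completion(get_status, interval=5, sleep=time.sleep):
    """Poll get_status() every `interval` seconds until it is terminal."""
    while True:
        status = get_status()
        if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return status
        sleep(interval)


# Simulated status stream: two "RUNNING" checks, then success.
statuses = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
result = wait_for_completion(lambda: next(statuses), sleep=lambda s: None)
print(result)  # SUCCEEDED
```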
**DO NOT END YOUR STATEMENTS WITH A SEMICOLON**
**ONLY ONE STATEMENT PER QUERY ALLOWED**
# Returning Scalar Values
In SQL, scalar values get assigned an anonymous column -- Athena doesn't like
that. Thus, you have to assign the column a name.
```
data = athena.query('select count(*) as my_col from "trajectories"."kitware"').results()
print(data.head(1))
```
# Build