significant improvement to the readme and verification that all the examples work

This commit is contained in:
Ari Brown 2024-01-31 16:18:32 -05:00
parent e3c11fb1aa
commit 5dccce53e9
9 changed files with 275 additions and 109 deletions

README.md

@@ -7,6 +7,9 @@ access, and even offers its own cluster management with Dask.
In order to ease programmatic access to Athena and offer blocking access (so
that your code waits for the result), I wrote `minerva` to make it seamless.
**IN ORDER TO GET UNLOAD TIMESTAMPS, YOU MUST FOLLOW THE INSTRUCTIONS IN THE
TIMESTAMPS SECTION**
The results are returned as pyarrow datasets (with parquet files as the
underlying structure).
@@ -109,6 +112,8 @@ by running (in the Redshift web console):
grant USAGE ON schema public to "IAM:<my_iam_user>"
```
Run the query in the same way as you would on Athena.
## S3
```
@@ -125,9 +130,146 @@ See `minerva/s3.py` for a full list of supported methods.
Follow along with `examples/simple_instance.py`
Like in the Athena example, we first need to gain access to our desired AWS
account via the profile in `~/.aws/credentials`:
```
import minerva
m = minerva.Minerva("hay")
```
With that, we now create a `Pier`, which is the base from which we launch our
machines. To keep things simple, we specify the subnet, security group, IAM
profile, and PKI key pair here: all machines launched off this `pier` will share
those qualities:
```
pier = m.pier(subnet_id = "subnet-05eb26d8649a093e1", # project-subnet-public1-us-east-1a
sg_groups = ["sg-0f9e555954e863954", # ssh
"sg-0b34a3f7398076545"], # default
iam = "S3+SSM+CloudWatch+ECR",
key_pair = ("Ari-Brown-HAY", "~/.ssh/Ari-Brown-HAY.pem"))
```
In this example, we'll only create one machine, but this shows the method you'd
use to create multiple.
We're going to specify the AMI, the instance type, and the username to log in
with. We can then specify the name of the instance (always a good idea), the
EBS disk size in GB, and any variables that we want to make available to the
shell environment.
```
def worker(pier, n=0):
mach = pier.machine(ami = "ami-0399a4f70ca684620", # dask on ubuntu 22.04 x86
instance_type = "t3.medium",
username = "ubuntu",
name = f"test-{n}",
disk_size = 32,
variables = {"type": "worker",
"number": n})
return mach
```
Here, we create a single machine. Nothing will happen on AWS yet!
```
mach = worker(pier)
```
Then we tell AWS to *asynchronously* create our instance:
```
mach.create()
```
Then we can **block** until the instance has started and establish an SSH
connection. This has the potential to fail if AWS takes too long to start the
instance (> 3 minutes) or if the internal services on the machine take too
long to start up (> 35 seconds).
```
mach.login()
```
Now we can begin to use it! Using `mach.cmd()` will let you run commands on the
server via SSH. Unfortunately, they all take place within `/bin/bash -c`.
The return value of `cmd()` combines STDOUT and STDERR; each stream can be
accessed individually via its `stdout` and `stderr` attributes, respectively.
```
print("*******")
print(repr(mach.cmd("echo 'hello world'").stdout))
print("*******")
print(mach.cmd("echo I am machine $number of type $type"))
print("*******")
```
When you're done, terminate the instance:
```
mach.terminate()
```
## Cluster
### Dask
Creating a `dask` cluster builds on what we've seen in the other EC2 examples.
On top of that, we'll have to load in extra libraries:
```
from dask.distributed import Client
import dask
```
We define functions for creating a scheduler and our workers:
```
def worker(pier, n):
...
def scheduler(pier):
...
```
Our pier needs to include a security group that allows the cluster to talk to
itself (ports 8786 and 8787):
```
m = minerva.Minerva("hay")
pier = m.pier(...
sg_groups = [...,
...,
"sg-04cd2626d91ac093c"], # dask (8786, 8787)
...
```
Finally, our hard work is done and we can start the cluster! We specify the
pier, the scheduler creation function, the worker creation function, and the
number of workers.
Note that since `Cluster` isn't fully production-ready, it's not yet available
via `pier.cluster()`.
```
cluster = pier.cluster(scheduler, worker, num_workers=5)
cluster.start()
```
This will start up a scheduler (whose dashboard will be visible at
http://scheduler_ip:8787 once startup completes) and 5 workers that are
available via the Dask library.
Connect to the dask cluster:
```
client = Client(cluster.public_location)
```
And use according to standard dask instructions.
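For example, work can be submitted through the standard `dask.distributed` API. This is a sketch, not part of minerva: `demo` is a hypothetical helper, and it expects the connected `client` from above.

```python
from dask.distributed import Client

def demo(client: Client):
    """Run a couple of tasks on the cluster and return their results."""
    # Submit a single task and block on its result.
    single = client.submit(pow, 2, 10).result()
    # Map over many inputs and gather all the results at once.
    many = client.gather(client.map(abs, [-2, -1, 0, 1]))
    return single, many
```

With the `client` connected above, `demo(client)` returns `(1024, [2, 1, 0, 1])`.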
## Helpers
I wrote a `Timing` module to help with timing various functions:
@@ -142,7 +284,29 @@ with Timing("my cool test"):
# => 32.45
```
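The usage above can be reproduced with a small context manager. This is a hypothetical sketch, not minerva's actual `Timing` implementation:

```python
import time

class Timing:
    """Sketch of a timing context manager (hypothetical;
    see minerva's Timing module for the real thing)."""

    def __init__(self, label):
        self.label = label

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Record and report elapsed seconds when the block exits.
        self.elapsed = time.perf_counter() - self.start
        print(f"{self.label}: {self.elapsed:.2f}")
        return False  # don't swallow exceptions
```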
# Basic Usage
## Timestamps
Athena can't unload timestamps that have a timezone. Thus, you have to create a
view in order to abstract away the detail of the UTC timestamp. This is required
for any unloaded format that isn't CSV.
When we ingest data, we ingest it into e.g. `baseline_original` and then create
a view `baseline` that doesn't have the timezone involved.
```
-- view to accommodate the fact that Athena can't handle timezones
-- required in order to unload the data in any format that's not CSV
create or replace view my_data AS
select
agent, from_unixtime(cast(to_unixtime(timestamp) AS bigint)) as timestamp, latitude, longitude
from my_data_original
```
## Writing SQL
I recommend keeping the SQL in separate files (be organized!) and then
processing them with the `Mako` templating library. Use `${}` within your SQL
files in order to pass variables.
# Build