forked from bellwether/minerva

significant improvement to the readme and verification that all the examples work

parent e3c11fb1aa
commit 5dccce53e9

9 changed files with 275 additions and 109 deletions

README.md

@@ -7,6 +7,9 @@ access, and even offers its own cluster management with Dask.
In order to ease programmatic access to Athena and offer blocking access (so
that your code waits for the result), I wrote `minerva` to make it seamless.

**IN ORDER TO GET UNLOAD TIMESTAMPS, YOU MUST FOLLOW THE INSTRUCTIONS IN THE
TIMESTAMPS SECTION**

The results are returned as pyarrow datasets (with parquet files as the
underlying structure).
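
For a sense of what that looks like in practice, here's a minimal sketch of
consuming a result; `m.query()` is a hypothetical entry point used purely for
illustration, not necessarily the real API:

```
import minerva

m = minerva.Minerva("hay")  # profile name in ~/.aws/credentials

# Hypothetical query call returning a pyarrow dataset backed by parquet files.
ds = m.query("select * from my_data limit 100")

# A pyarrow dataset can be materialized as a table or a pandas DataFrame.
df = ds.to_table().to_pandas()
print(df.head())
```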

@@ -109,6 +112,8 @@ by running (in the Redshift web console):
grant usage on schema public to "IAM:<my_iam_user>"
```

Run the query in the same way as you would on Athena.

## S3

```

@@ -125,9 +130,146 @@ See `minerva/s3.py` for a full list of supported methods.
Follow along with `examples/simple_instance.py`.

Like in the Athena example, we first need to gain access to our desired AWS
account via the profile in `~/.aws/credentials`:

```
import minerva

m = minerva.Minerva("hay")
```
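
The profile name (`"hay"` above) corresponds to a named profile section in the
standard `~/.aws/credentials` file, along these lines (keys elided):

```
[hay]
aws_access_key_id = AKIA...
aws_secret_access_key = ...
```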

With that, we now create a `Pier`, which is the base from which we launch our
machines. To keep things simple, we specify the subnet, security groups, IAM
profile, and PKI key pair here: all machines launched off this `pier` will share
those settings:

```
pier = m.pier(subnet_id = "subnet-05eb26d8649a093e1", # project-subnet-public1-us-east-1a
              sg_groups = ["sg-0f9e555954e863954", # ssh
                           "sg-0b34a3f7398076545"], # default
              iam = "S3+SSM+CloudWatch+ECR",
              key_pair = ("Ari-Brown-HAY", "~/.ssh/Ari-Brown-HAY.pem"))
```

In this example, we'll only create one machine, but this shows the method you'd
use to create multiple.

We're going to specify the AMI, the instance type, and the username to log in
with. We can then specify the name of the instance (always a good idea), the
EBS disk size in GB, and any variables that we want to make available to the
shell environment.

```
def worker(pier, n=0):
    mach = pier.machine(ami = "ami-0399a4f70ca684620", # dask on ubuntu 22.04 x86
                        instance_type = "t3.medium",
                        username = "ubuntu",
                        name = f"test-{n}",
                        disk_size = 32,
                        variables = {"type": "worker",
                                     "number": n})
    return mach
```

Here, we create a single machine. Nothing will happen on AWS yet!

```
mach = worker(pier)
```

Then we tell AWS to *asynchronously* create our instance:

```
mach.create()
```

Then we can **block** and wait until the instance has started and establish an
SSH connection. This has the potential to fail if AWS takes too long to start
the instance (> 3 minutes) or if the internal services on the machine take too
long to start up (> 35 seconds).

```
mach.login()
```

Now we can begin to use it! Using `mach.cmd()` will let you run commands on the
server via SSH. Unfortunately, they all take place within `/bin/bash -c`.

The return value of the `cmd()` function combines STDOUT and STDERR; the
individual streams can be accessed via its `stdout` and `stderr` attributes,
respectively.

```
print("*******")
print(repr(mach.cmd("echo 'hello world'").stdout))
print("*******")
print(mach.cmd("echo I am machine $number of type $type"))
print("*******")
```

When you're done, terminate the instance:

```
mach.terminate()
```
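
Putting the pieces together, a minimal end-to-end sketch (using only the calls
shown above, and assuming `login()` raises on timeout) might look like:

```
# End-to-end sketch: try/finally guarantees the instance is terminated
# even if login() or a command fails partway through.
mach = worker(pier)
mach.create()      # asynchronous: returns immediately
try:
    mach.login()   # blocks until SSH is up; assumed to raise on timeout
    print(mach.cmd("uname -a").stdout)
finally:
    mach.terminate()
```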

## Cluster

### Dask

Creating a `dask` cluster builds on what we've seen in the other EC2 examples.

On top of that, we'll have to load in extra libraries:

```
from dask.distributed import Client
import dask
```

We define functions for creating a scheduler and our workers:

```
def worker(pier, n):
    ...

def scheduler(pier):
    ...
```

Our pier needs to include a security group that allows the cluster to talk to
itself (ports 8786 and 8787):

```
m = minerva.Minerva("hay")
pier = m.pier(...,
              sg_groups = [...,
                           ...,
                           "sg-04cd2626d91ac093c"], # dask (8786, 8787)
              ...)
```

Finally, our hard work is done and we can start the cluster! We specify the
pier, the scheduler creation function, the worker creation function, and the
number of workers.

Note that `Cluster` isn't fully production-ready yet, so the `pier.cluster()`
interface shown here may still change.

```
cluster = pier.cluster(scheduler, worker, num_workers=5)
cluster.start()
```

This will start up a scheduler (whose dashboard will be visible at
http://scheduler_ip:8787 once startup completes) and 5 workers that are
available via the Dask library.

Connect to the Dask cluster:

```
client = Client(cluster.public_location)
```

And use it according to standard Dask practice.
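
For example, continuing with the `client` created above, a quick smoke test
using the standard `dask.distributed` API (nothing minerva-specific here):

```
def square(x):
    return x * x

# Fan work out to the workers and collect the results.
futures = client.map(square, range(10))
print(client.gather(futures))  # => [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```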

## Helpers

I wrote a `Timing` module to help with timing various functions:

@@ -142,7 +284,29 @@ with Timing("my cool test"):
# => 32.45
```

# Basic Usage

## Timestamps

Athena can't unload timestamps that have a timezone. Thus, you have to create a
view in order to abstract away the detail of the UTC timestamp. This is required
for any unloaded format that isn't CSV.

When we ingest data, we ingest it into e.g. `baseline_original` and then create
a view `baseline` that strips the timezone from the timestamp.

```
-- view to accommodate the fact that Athena can't handle timezones
-- required in order to unload the data in any format that's not CSV

create or replace view my_data as
select
    agent, from_unixtime(cast(to_unixtime(timestamp) as bigint)) as timestamp, latitude, longitude
from my_data_original
```

## Writing SQL

I recommend keeping the SQL in separate files (be organized!) and then
processing them with the `Mako` templating library. Use `${}` within your SQL
files in order to pass variables.
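
As a minimal sketch (the file name and variables are made up for illustration),
rendering such a template with Mako looks like this:

```
from mako.template import Template

# queries/top_agents.sql is a hypothetical file containing, e.g.:
#   select agent, count(*) as n from ${table} group by agent limit ${limit}
sql = Template(filename="queries/top_agents.sql").render(table="my_data", limit=10)
print(sql)
```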

# Build