A New Beginning
January 17, 2024
Background
For a number of reasons, I became obsessed with an MMORPG called Old School RuneScape in the latter half of 2023. OSRS is a game that brings a sense of achievement, but it also consumes so much time that my life was becoming unbalanced.
At the same time, I was spearheading the implementation of a data orchestration platform called dagster at work. This project has been incredibly exciting, and has already had a profound impact on the way we move and track data within our organization. Several challenges arose while implementing dagster, many of them related to standing up a hybrid-hosted cloud deployment. An engineer at heart, I found myself wanting to understand how everything I was building actually worked at a finer level.
A week ago today, I officially dropped OSRS and began a new chapter in my life: a journey into the Cloud.
Where I'm Going
Ultimately, I'd love to implement an inexpensive, fully-scalable data platform for various initiatives:
- Crowd-funded, open-source data for environmental, socio-economic, and educational public use such as Data Science for Social Good
- Subscription model data for commercial use cases like real-estate arbitrage, social media insights, consumer analysis
- Ad-funded data for "fun" use-cases like my BrawlStars dashboard, Music Bingo game, or maybe a hiking analytics application
The Tools
- Python - the tried-and-true 'one-size-fits-all' programming language
- dbt - a "data build tool" for maintaining and testing data models
- dagster - data orchestration tool of choice
- ECS or EKS - scalable service for self-hosting compute and logging
- ChatGPT & GitHub Copilot - critical tools for accelerating development
- git & GitHub Actions - version control and deployment automation
What's Missing
- Data ingestion tool - Something like airbyte would make it easier to ingest data from all different types of sources
- Database selection - Thinking big data, so an OLAP solution would be ideal. Will likely use PostgreSQL until there is funding to migrate to Snowflake or Redshift
- BI dashboard tool - I'm a huge fan of streamlit, and self-hosting streamlit apps shouldn't be an issue
- Auto ML solution - The data science ecosystem evolves rapidly. Keras, sklearn and (py)caret were big when I was taking classes, but better tools will emerge
Progress Update
After starting the A Cloud Guru Solutions Architect - Associate course last week, I was immediately given a deeper dive into several resource types I thought I already knew well. Boy, was I wrong!
The Course
This week I covered IAM, S3, EC2, EBS, EFS, Databases, and VPC.
IAM seems pretty straightforward - users, roles, and groups with access policies applied.
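To make "access policies" a little more concrete, here is a minimal sketch of an identity-based policy built as a plain Python dictionary. The function name and the bucket name are hypothetical, chosen only for illustration; the JSON shape (Version, Statement, Effect, Action, Resource) is the standard IAM policy layout.

```python
import json


def read_only_bucket_policy(bucket_name: str) -> dict:
    """Build a policy document allowing read-only access to one S3 bucket.

    bucket_name is a placeholder; nothing here talks to AWS.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                # Reading applies to the objects inside the bucket.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(read_only_bucket_policy("example-bucket"), indent=2))
```

A document like this could be attached to a user, a group, or a role, which is what ties the three concepts together.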
S3 is so much more capable than I knew. Not only is S3 used as general-purpose file storage, but it also integrates with many other AWS resource types: CloudFront can publish logs to an S3 bucket, S3 can serve as a static web host, and databases can store point-in-time snapshots in S3.
EC2 also has more depth than I previously thought. Creating AMIs from EBS volume snapshots to launch new EC2 instances is a fascinating concept. Most of the security around allowed inbound and outbound ports is still way over my head.
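One way to get a feel for the inbound-port side is a toy model of how a security group decides: it holds only allow rules, and any inbound traffic that matches none of them is dropped. The sketch below is not an AWS API; the `InboundRule` class, the function name, and the sample addresses are all made up for illustration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class InboundRule:
    """One hypothetical allow rule: protocol, port range, and source CIDR."""
    protocol: str   # e.g. "tcp" or "udp"
    from_port: int
    to_port: int
    cidr: str       # e.g. "0.0.0.0/0" for anywhere


def is_inbound_allowed(rules, protocol, port, source_ip):
    """Return True if any rule allows the traffic; otherwise it is denied."""
    addr = ipaddress.ip_address(source_ip)
    for rule in rules:
        if (
            rule.protocol == protocol
            and rule.from_port <= port <= rule.to_port
            and addr in ipaddress.ip_network(rule.cidr)
        ):
            return True
    return False  # no matching allow rule means the traffic is dropped


# Example rule set: HTTPS from anywhere, SSH only from one office range.
RULES = [
    InboundRule("tcp", 443, 443, "0.0.0.0/0"),
    InboundRule("tcp", 22, 22, "203.0.113.0/24"),
]

if __name__ == "__main__":
    print(is_inbound_allowed(RULES, "tcp", 443, "198.51.100.7"))
    print(is_inbound_allowed(RULES, "tcp", 22, "198.51.100.7"))
```

Real security groups are also stateful, so replies to allowed inbound traffic go back out without needing a separate outbound rule; this sketch leaves that part out.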
The Platform
AWS Adventures
Shortly into the course, I was inspired to host my own website. Intrigued by the cost-effectiveness of S3 static sites, I set out to create my own publicly-facing S3 bucket behind a domain.
After wrestling with Google Domains (RIP), Squarespace, Route 53, CNAME records, Alias records, and CloudFront, I finally got my site online! The journey involved:
- Purchasing bstroh.com, then realizing I needed Route 53
- Buying strohb.com after learning about the 60-day transfer lock
- Discovering "hosted zones" after almost being charged $50/month for policy records
- Realizing S3 bucket names must match domain names exactly
- Finally embracing CloudFront for free SSL certificates and caching
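Two of the steps above can be sketched in a few lines, assuming the site is served straight from the S3 website endpoint: the bucket name has to equal the domain, and the bucket needs a policy that lets anyone read its objects. The function names are hypothetical and example.com is a placeholder. With CloudFront in front, access would typically be locked down to the distribution instead, so treat this as the simplest variant rather than the final setup.

```python
import json


def bucket_matches_domain(bucket_name: str, domain: str) -> bool:
    """Check the naming rule for S3 website hosting behind a custom domain."""
    # Domains are case-insensitive and may carry a trailing dot in DNS tools.
    return bucket_name == domain.lower().rstrip(".")


def public_read_policy(bucket_name: str) -> dict:
    """Build a bucket policy that lets anyone fetch objects from the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadGetObject",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


if __name__ == "__main__":
    domain = "example.com"  # placeholder domain
    if bucket_matches_domain("example.com", domain):
        print(json.dumps(public_read_policy(domain), indent=2))
```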
GitHub Integration
After getting the AWS side working, I connected GitHub Actions for automated deployments. Now my site updates automatically whenever I push changes!
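The workflow itself isn't shown here, so the following is only a guess at the kind of step a deploy job might run: walk the built site folder and work out which S3 key and content type each file should get. The `plan_uploads` name and the folder layout are assumptions for illustration, and the sketch stops short of actually uploading anything.

```python
import mimetypes
from pathlib import Path


def plan_uploads(site_dir):
    """Return (local_path, s3_key, content_type) for every file under site_dir."""
    root = Path(site_dir)
    plan = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue  # skip directories
        # S3 keys use forward slashes regardless of the local OS.
        key = path.relative_to(root).as_posix()
        content_type, _ = mimetypes.guess_type(path.name)
        # Fall back to a generic binary type when the extension is unknown.
        plan.append((path, key, content_type or "application/octet-stream"))
    return plan


if __name__ == "__main__":
    for local_path, key, content_type in plan_uploads("."):
        print(f"{local_path} -> {key} ({content_type})")
```

Each entry could then be handed to something like boto3's `upload_file` with `ExtraArgs={"ContentType": content_type}`, or the whole folder could be pushed with `aws s3 sync` instead.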
What's Next
Continuing with the A Cloud Guru course, next up are Route 53, Elastic Load Balancing, and High Availability.
For the platform, I'm looking to:
- Set up CloudWatch for monitoring
- Explore cost optimization strategies
- Start planning the data ingestion pipeline
Thanks for reading! Check back next week for more updates on my cloud journey.