Pairing Terraform and Ansible allows for a clean separation of concerns between provisioning infrastructure and configuring workloads, while Traefik, Docker Compose, and cloud-native logging tools simplify service deployment and observability.
Challenge
At early stages, I needed to reliably manage infrastructure for small-scale deployments. At the time, I only needed two environments: a stable one (production) and an experimental one (development). My goals were:
- Minimal cloud console clicking
- Scriptable, auditable infrastructure
- Rapid provisioning of Linux VMs with Docker-based services
- Dynamic routing and TLS termination
- Centralized logs for diagnostics
Initially, I was deploying VMs and services manually on Azure and GCP, often SSH-ing into boxes to configure things. This did not scale. I wanted reproducibility, lower human error, and faster onboarding for future teammates or contractors.
Constraints and limitations
- Time: I had ~1–2 weekends to set up infra automation and CI hooks.
- Team size: Solo at the time. This meant favouring simplicity over complexity.
- Budget: Limited monthly credits on Azure and GCP; had to avoid managed Kubernetes or high-cost logging pipelines.
- Deployment targets: Ubuntu-based VMs with Docker services, mostly internal tools and staging APIs.
- Logging & monitoring: Needed to be centralized but low-maintenance, ideally something I could “set and forget.”
Approach
🧱 Infrastructure provisioning with Terraform
Terraform defined all the infrastructure-as-code:
- Resource groups, subnets, firewalls (Azure)
- Compute instances with static IPs and SSH keys
- Output variables for passing IPs and credentials to Ansible
- Provisioned both Azure and GCP environments depending on the workspace selected
Each environment (dev, staging) had its own backend state file and variables.
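A minimal sketch of what one of these definitions looked like, using the GCP side as an example (the resource names, machine size, paths, and the output name are illustrative placeholders rather than the exact module):

```hcl
# main.tf — simplified sketch (names and values are illustrative)
terraform {
  backend "gcs" {}               # bucket/prefix supplied per environment via -backend-config
}

provider "google" {
  project = var.project_id
  region  = "europe-west1"
}

variable "project_id" {
  type = string
}

variable "environment" {
  type = string                  # e.g. dev or production, matched to the Terraform workspace
}

resource "google_compute_instance" "app" {
  name         = "app-${var.environment}"
  machine_type = "e2-small"
  zone         = "europe-west1-b"

  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-2204-lts"
    }
  }

  network_interface {
    network = "default"
    access_config {}             # external IP (the real setup reserved a static address)
  }

  metadata = {
    ssh-keys = "deploy:${file("~/.ssh/id_ed25519.pub")}"
  }
}

# Consumed by the Ansible inventory after `terraform apply`
output "app_public_ip" {
  value = google_compute_instance.app.network_interface[0].access_config[0].nat_ip
}
```

Selecting the workspace before `terraform apply` determined which environment's state and variables were used, and the output fed the IP to the Ansible inventory.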
⚙️ Configuration with Ansible
Once the VMs were provisioned, Ansible handled all post-boot configuration:
- Installed Docker & Docker Compose
- Set up Traefik as a reverse proxy with ACME auto-certs
- Deployed service containers (via `docker-compose.yml`) from GitHub
- Set up healthchecks and log forwarding agents
This ensured that a fresh VM would become fully ready to serve within ~5–10 minutes.
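In outline, the playbook looked something like this (the modules are standard Ansible built-ins, but the paths, file names, and task grouping are simplified assumptions rather than the actual roles):

```yaml
# deploy.yml — simplified sketch of the post-provisioning playbook
- hosts: app_servers
  become: true
  tasks:
    - name: Install Docker and Docker Compose
      ansible.builtin.apt:
        name:
          - docker.io
          - docker-compose
        state: present
        update_cache: true

    - name: Render the Traefik static configuration
      ansible.builtin.template:
        src: traefik.yml.j2
        dest: /opt/traefik/traefik.yml

    - name: Copy the service stack definition
      ansible.builtin.copy:
        src: docker-compose.yml
        dest: /opt/app/docker-compose.yml

    - name: Start (or restart) the stack
      ansible.builtin.command: docker-compose up -d
      args:
        chdir: /opt/app

    # Further tasks installed the cloud logging agent and basic health checks
```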
📊 Logging via cloud-native tools (no Grafana)
I intentionally avoided self-hosting Grafana or using Grafana Cloud to keep things lightweight. Instead:
- On Azure, I installed the OMS Agent to forward syslogs and Docker container logs to Azure Log Analytics
- On GCP, I configured the Ops Agent and journald integration to push logs to Cloud Logging (Logs Explorer)
- These tools were already available in the cloud environments and required minimal additional infra
- I could view filtered logs, query by labels or severities, and set up alerts, all without provisioning or maintaining a separate observability stack
This decision saved time and complexity while providing just enough visibility for early-stage infrastructure.
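As an example of how little configuration this required on GCP, pointing the Ops Agent at an extra log file is a few lines of YAML (a sketch; the Traefik log path below is an assumed mount point, not a documented default):

```yaml
# /etc/google-cloud-ops-agent/config.yaml — sketch; restart the agent after editing
logging:
  receivers:
    traefik_access:
      type: files
      include_paths:
        - /opt/traefik/logs/access.log   # assumed host path for Traefik's JSON access log
  service:
    pipelines:
      traefik:
        receivers: [traefik_access]
```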
🐳 Service orchestration with Docker Compose + Traefik
Each VM ran a `docker-compose.yml` that booted multiple microservices (e.g., APIs, frontend, webhook handlers).
- Traefik (v2.10+) served as an edge router using file-based configuration and Docker labels.
- ACME via Let’s Encrypt automatically handled certs.
- Each service exposed ports internally; Traefik routed HTTPS requests to them.
- Deployment was as easy as `ansible-playbook deploy.yml`.
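A trimmed-down Compose file shows the pattern; the real setup templated Traefik's static configuration through Ansible, but the Docker-label routing looked roughly like this (the domain, email, and image names are placeholders):

```yaml
# docker-compose.yml — abridged sketch
services:
  traefik:
    image: traefik:v2.10
    command:
      - --providers.docker=true
      - --providers.docker.exposedbydefault=false
      - --entrypoints.websecure.address=:443
      - --certificatesresolvers.le.acme.email=ops@example.com
      - --certificatesresolvers.le.acme.storage=/letsencrypt/acme.json
      - --certificatesresolvers.le.acme.tlschallenge=true
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - letsencrypt:/letsencrypt

  api:
    image: ghcr.io/example/api:latest        # placeholder image
    labels:
      - traefik.enable=true
      - traefik.http.routers.api.rule=Host(`api.example.com`)
      - traefik.http.routers.api.entrypoints=websecure
      - traefik.http.routers.api.tls.certresolver=le
      - traefik.http.services.api.loadbalancer.server.port=8080

volumes:
  letsencrypt:
```

Services only expose their ports on the internal Docker network; Traefik discovers them through the labels and terminates TLS at the edge.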
📊 Centralized logging with Azure Log Analytics / Google Cloud Logging
Logging strategy varied slightly:
- Azure: Installed the Log Analytics agent to forward `/var/log` and Docker logs.
- GCP: Used the Ops Agent and `journald` to pipe logs into Cloud Logging.
- Logs were filtered by severity and tagged by VM hostname and environment.
- Traefik logs and access logs were configured for structured output and forwarded as well.
This let me inspect errors, track deployments, and monitor uptime using queries.
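Turning on that structured output is only a few lines in Traefik's static configuration; here is a sketch (the file path is an assumption about where the host agent picked logs up):

```yaml
# traefik.yml (static configuration) — sketch
log:
  level: INFO
  format: json
accessLog:
  format: json
  filePath: /logs/access.log     # assumed container path, mounted so the host agent can tail it
  fields:
    headers:
      defaultMode: drop          # avoid forwarding sensitive headers to the log backend
```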
Results
- Setup time dropped from 2 hours (manual) → 7 minutes (infra + services)
- New VMs could be deployed with 1–2 commands: `terraform apply` and `ansible-playbook`
- Re-deploying applications dropped from 20-30 minutes (manual) → 5 minutes (with automated health checks)
- Brief-downtime deployments using Docker Compose with Traefik routing. Service restarts cause a short unavailability window, acceptable for internal tools or low-traffic environments.
- Auditability improved via Git versioning of infra definitions and Ansible roles
- Logs were searchable in cloud-native consoles with no manual tailing
What we could have done better
- Secrets management: Kept a gitignored inventory secrets file, but should've adopted HashiCorp Vault or SOPS from the start.
- No health check dashboard: Couldn’t visualize all services’ status across VMs. Next iteration should use something like Uptime Kuma or Grafana + Prometheus.
- Tight coupling of Docker Compose to Ansible made it hard to redeploy specific services without a full run. Moving toward GitHub Actions triggered via webhook might decouple this better.
- Traefik config drift: As the number of services grew, managing static and dynamic configs in Ansible became error-prone.
- Downtime during deploys: Because `docker-compose up` stops and replaces containers, users experienced brief downtime during each deploy. Implementing a blue-green deployment strategy (running new containers in parallel and switching Traefik routing only after health checks pass) would minimize this gap and support near-zero-downtime rollouts (sketched below).
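A rough sketch of that blue-green routing with Traefik's file provider (service names and the domain are placeholders):

```yaml
# dynamic.yml (Traefik file provider) — illustrative blue-green sketch
http:
  routers:
    app:
      rule: "Host(`app.example.com`)"      # placeholder domain
      entryPoints: [websecure]
      tls:
        certResolver: le
      service: app-green                   # flip between app-blue and app-green once health checks pass
  services:
    app-blue:
      loadBalancer:
        servers:
          - url: http://app-blue:8080
    app-green:
      loadBalancer:
        servers:
          - url: http://app-green:8080
```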