Terraform and Ansible: Infrastructure and Deployments

Pairing Terraform and Ansible allows for a clean separation of concerns between provisioning infrastructure and configuring workloads, while Traefik, Docker Compose, and cloud-native logging tools simplify service deployment and observability.

Challenge

In the early stages, I needed to reliably manage infrastructure for small-scale deployments. At the time, I only needed two environments: a stable one (production) and an experimental one (development). My goals were:

  • Minimal cloud console clicking
  • Scriptable, auditable infrastructure
  • Rapid provisioning of Linux VMs with Docker-based services
  • Dynamic routing and TLS termination
  • Centralized logs for diagnostics

Initially, I was deploying VMs and services manually on Azure and GCP, often SSH-ing into boxes to configure things. This did not scale. I wanted reproducibility, lower human error, and faster onboarding for future teammates or contractors.


Constraints and limitations

  • Time: I had ~1–2 weekends to set up infra automation and CI hooks.
  • Team size: Solo at the time, which meant favoring simplicity over complexity.
  • Budget: Limited monthly credits on Azure and GCP; had to avoid managed Kubernetes or high-cost logging pipelines.
  • Deployment targets: Ubuntu-based VMs with Docker services, mostly internal tools and staging APIs.
  • Logging & monitoring: Needed to be centralized but low-maintenance, ideally something I could “set and forget.”

Approach

🧱 Infrastructure provisioning with Terraform

Terraform defined all of the infrastructure as code:

  • Resource groups, subnets, firewalls (Azure)
  • Compute instances with static IPs and SSH keys
  • Output variables for passing IPs and credentials to Ansible
  • Provisioned both Azure and GCP environments depending on the workspace selected

Each environment (development, production) had its own backend state file and variables.
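A minimal Terraform sketch of this pattern, assuming the Google provider and a workspace-per-environment layout; the resource names, zone, machine type, and image below are illustrative, not the actual configuration:

```hcl
# Sketch: one Ubuntu VM per workspace, with its IP exported for Ansible.
resource "google_compute_instance" "app" {
  name         = "app-${terraform.workspace}"   # e.g. app-dev, app-production
  machine_type = "e2-small"
  zone         = "europe-west1-b"

  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-2204-lts"
    }
  }

  network_interface {
    network       = "default"
    access_config {}   # ephemeral public IP; a google_compute_address would pin it
  }

  metadata = {
    ssh-keys = "deploy:${file("~/.ssh/id_rsa.pub")}"
  }
}

# Consumed by Ansible, e.g. rendered into an inventory file
output "app_public_ip" {
  value = google_compute_instance.app.network_interface[0].access_config[0].nat_ip
}
```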

⚙️ Configuration with Ansible

Once the VMs were provisioned, Ansible handled all post-boot configuration:

  • Installed Docker & Docker Compose
  • Set up Traefik as a reverse proxy with ACME auto-certs
  • Deployed service containers (via docker-compose.yml) from GitHub
  • Set up healthchecks and log forwarding agents

This ensured that a fresh VM would become fully ready to serve within ~5–10 minutes.
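A trimmed playbook sketch of those steps; the module choices, package names, and repository URL are illustrative assumptions (the compose module requires the community.docker collection):

```yaml
# deploy.yml — post-boot configuration sketch
- hosts: app_servers
  become: true
  tasks:
    - name: Install Docker and the Compose plugin (package names vary by distro)
      ansible.builtin.apt:
        name:
          - docker.io
          - docker-compose-v2
        state: present
        update_cache: true

    - name: Pull service definitions from GitHub (illustrative URL)
      ansible.builtin.git:
        repo: https://github.com/example/services.git
        dest: /opt/services
        version: main

    - name: Start the stack (Traefik + services)
      community.docker.docker_compose_v2:
        project_src: /opt/services
        state: present
```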

📊 Logging via cloud-native tools (no Grafana)

I intentionally avoided self-hosting Grafana or using Grafana Cloud to keep things lightweight. Instead:

  • On Azure, I installed the OMS Agent to forward syslogs and Docker container logs to Azure Log Analytics
  • On GCP, I configured the Ops Agent and journald integration to push logs to Cloud Logging (Logs Explorer)
  • These tools were already available in the cloud environments and required minimal additional infra
  • I could view filtered logs, query by labels or severities, and set up alerts, all without provisioning or maintaining a separate observability stack

This decision saved time and complexity while providing just enough visibility for early-stage infrastructure.
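On the GCP side, for example, the Ops Agent is driven by a small YAML file; a sketch with one illustrative receiver for Traefik's access log (the agent's default pipeline already covers syslog/journald):

```yaml
# /etc/google-cloud-ops-agent/config.yaml — sketch; paths are illustrative
logging:
  receivers:
    traefik_access:
      type: files
      include_paths:
        - /var/log/traefik/access.log
  service:
    pipelines:
      traefik:
        receivers: [traefik_access]
```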

🐳 Service orchestration with Docker Compose + Traefik

Each VM ran a docker-compose.yml that booted multiple microservices (e.g., APIs, frontend, webhook handlers).

  • Traefik (v2.10+) served as an edge router using file-based configuration and Docker labels.
  • ACME via Let’s Encrypt automatically handled certs.
  • Each service exposed ports internally; Traefik routed HTTPS requests to them.
  • Deployment was as easy as ansible-playbook deploy.yml.
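A condensed compose sketch of this layout, using CLI flags for Traefik's static configuration for brevity (the labels are what drive routing); the domain, email, and image are placeholders:

```yaml
# docker-compose.yml — Traefik edge router plus one routed service
services:
  traefik:
    image: traefik:v2.10
    command:
      - --providers.docker=true
      - --providers.docker.exposedbydefault=false
      - --entrypoints.websecure.address=:443
      - --certificatesresolvers.le.acme.tlschallenge=true
      - --certificatesresolvers.le.acme.email=ops@example.com
      - --certificatesresolvers.le.acme.storage=/letsencrypt/acme.json
    ports:
      - "443:443"
    volumes:
      - ./letsencrypt:/letsencrypt
      - /var/run/docker.sock:/var/run/docker.sock:ro

  api:
    image: ghcr.io/example/api:latest   # placeholder image
    labels:
      - traefik.enable=true
      - traefik.http.routers.api.rule=Host(`api.example.com`)
      - traefik.http.routers.api.entrypoints=websecure
      - traefik.http.routers.api.tls.certresolver=le
```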

📊 Centralized logging with Azure Log Analytics / Google Cloud Logging

The logging setup varied slightly between clouds:

  • Azure: Installed the Log Analytics agent to forward /var/log and Docker logs.
  • GCP: Used the Ops Agent and journald to pipe logs into Cloud Logging.
  • Logs were filtered by severity and tagged by VM hostname and environment.
  • Traefik logs and access logs were configured for structured output and forwarded as well.

This let me inspect errors, track deployments, and monitor uptime using queries.
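On the Traefik side, structured output is a couple of lines of static configuration; a sketch (file path illustrative):

```yaml
# traefik.yml (static config) — JSON logs so the cloud agents can parse fields
log:
  level: INFO
  format: json
accessLog:
  format: json
  filePath: /var/log/traefik/access.log
  fields:
    headers:
      defaultMode: drop   # keep access logs lean; no header values
```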

Results

  • Setup time dropped from 2 hours (manual) → 7 minutes (infra + services)
  • New VMs could be deployed with 1–2 commands: terraform apply and ansible-playbook
  • Re-deploying applications dropped from 20–30 minutes (manual) → 5 minutes (with automated health checks)
  • Deployments incurred only brief downtime: Docker Compose restarts behind Traefik cause a short unavailability window, acceptable for internal tools and low-traffic environments
  • Auditability improved via Git versioning of infra definitions and Ansible roles
  • Logs were searchable in cloud-native consoles with no manual tailing

What we could have done better

  • Secrets management: Kept secrets in a gitignored inventory file, but should’ve adopted HashiCorp Vault or SOPS from the start.
  • No health check dashboard: Couldn’t visualize all services’ status across VMs. Next iteration should use something like Uptime Kuma or Grafana + Prometheus.
  • Tight coupling of Docker Compose to Ansible made it hard to redeploy specific services without a full run. Moving toward GitHub Actions triggered via webhook might decouple this better.
  • Traefik config drift: As the number of services grew, managing static and dynamic configs in Ansible became error-prone.
  • Downtime during deploys: Because docker-compose up stops and replaces containers, users experienced brief downtimes during each deploy. Implementing a blue-green deployment strategy (running new containers in parallel and switching Traefik routing only after health checks pass) would minimize this gap and support near-zero-downtime rollouts.
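A minimal sketch of that routing switch using Traefik's file provider, assuming blue and green copies of a service are both running; the names, ports, and health endpoint are illustrative:

```yaml
# dynamic.yml — point the router at the blue stack; once the green stack's
# health checks pass, flip `service` to api-green and let Traefik hot-reload
http:
  routers:
    api:
      rule: Host(`api.example.com`)
      service: api-blue
  services:
    api-blue:
      loadBalancer:
        servers:
          - url: http://api-blue:8080
        healthCheck:
          path: /healthz
          interval: 10s
    api-green:
      loadBalancer:
        servers:
          - url: http://api-green:8080
        healthCheck:
          path: /healthz
          interval: 10s
```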
