Terraform and Ansible: Infrastructure and Deployments

Pairing Terraform and Ansible allows for a clean separation of concerns between provisioning infrastructure and configuring workloads, while Traefik, Docker Compose, and cloud-native logging tools simplify service deployment and observability.

Challenge

In the early stages, I needed to reliably manage infrastructure for small-scale deployments. At the time, I only needed two environments: a stable one (production) and an experimental one (development). My goals were:

  • Minimal cloud console clicking
  • Scriptable, auditable infrastructure
  • Rapid provisioning of Linux VMs with Docker-based services
  • Dynamic routing and TLS termination
  • Centralized logs for diagnostics

Initially, I was deploying VMs and services manually on Azure and GCP, often SSH-ing into boxes to configure things. This did not scale. I wanted reproducibility, lower human error, and faster onboarding for future teammates or contractors.


Constraints and limitations

  • Time: I had ~1–2 weekends to set up infra automation and CI hooks.
  • Team size: Solo at the time, which meant favoring simplicity over complexity.
  • Budget: Limited monthly credits on Azure and GCP; had to avoid managed Kubernetes or high-cost logging pipelines.
  • Deployment targets: Ubuntu-based VMs with Docker services, mostly internal tools and staging APIs.
  • Logging & monitoring: Needed to be centralized but low-maintenance, ideally something I could “set and forget.”

Approach

🧱 Infrastructure provisioning with Terraform

Terraform defined all of the infrastructure as code:

  • Resource groups, subnets, firewalls (Azure)
  • Compute instances with static IPs and SSH keys
  • Output variables for passing IPs and credentials to Ansible
  • Provisioned both Azure and GCP environments depending on the workspace selected

Each environment (development, production) had its own backend state file and variables.
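A minimal Terraform sketch of this pattern, assuming the Google provider and a workspace-per-environment layout; the resource names, zone, machine type, and image below are illustrative, not the actual configuration:

```hcl
# Sketch: one Ubuntu VM per workspace, with its IP exported for Ansible.
resource "google_compute_instance" "app" {
  name         = "app-${terraform.workspace}"   # e.g. app-dev, app-production
  machine_type = "e2-small"
  zone         = "europe-west1-b"

  boot_disk {
    initialize_params {
      image = "ubuntu-os-cloud/ubuntu-2204-lts"
    }
  }

  network_interface {
    network       = "default"
    access_config {}   # ephemeral public IP; a google_compute_address would pin it
  }

  metadata = {
    ssh-keys = "deploy:${file("~/.ssh/id_rsa.pub")}"
  }
}

# Consumed by Ansible, e.g. rendered into an inventory file
output "app_public_ip" {
  value = google_compute_instance.app.network_interface[0].access_config[0].nat_ip
}
```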

⚙️ Configuration with Ansible

Once the VMs were provisioned, Ansible handled all post-boot configuration:

  • Installed Docker & Docker Compose
  • Set up Traefik as a reverse proxy with ACME auto-certs
  • Deployed service containers (via docker-compose.yml) from GitHub
  • Set up healthchecks and log forwarding agents

This ensured that a fresh VM would become fully ready to serve within ~5–10 minutes.
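A trimmed playbook sketch of those steps; the module choices, package names, and repository URL are illustrative assumptions (the compose module requires the community.docker collection):

```yaml
# deploy.yml — post-boot configuration sketch
- hosts: app_servers
  become: true
  tasks:
    - name: Install Docker and the Compose plugin (package names vary by distro)
      ansible.builtin.apt:
        name:
          - docker.io
          - docker-compose-v2
        state: present
        update_cache: true

    - name: Pull service definitions from GitHub (illustrative URL)
      ansible.builtin.git:
        repo: https://github.com/example/services.git
        dest: /opt/services
        version: main

    - name: Start the stack (Traefik + services)
      community.docker.docker_compose_v2:
        project_src: /opt/services
        state: present
```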

📊 Logging via cloud-native tools (no Grafana)

I intentionally avoided self-hosting Grafana or using Grafana Cloud to keep things lightweight. Instead:

  • On Azure, I installed the OMS Agent to forward syslogs and Docker container logs to Azure Log Analytics
  • On GCP, I configured the Ops Agent and journald integration to push logs to Cloud Logging (Logs Explorer)
  • These tools were already available in the cloud environments and required minimal additional infra
  • I could view filtered logs, query by labels or severities, and set up alerts, all without provisioning or maintaining a separate observability stack

This decision saved time and complexity while providing just enough visibility for early-stage infrastructure.
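On the GCP side, for example, the Ops Agent is driven by a small YAML file; a sketch with one illustrative receiver for Traefik's access log (the agent's default pipeline already covers syslog/journald):

```yaml
# /etc/google-cloud-ops-agent/config.yaml — sketch; paths are illustrative
logging:
  receivers:
    traefik_access:
      type: files
      include_paths:
        - /var/log/traefik/access.log
  service:
    pipelines:
      traefik:
        receivers: [traefik_access]
```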

🐳 Service orchestration with Docker Compose + Traefik

Each VM ran a docker-compose.yml that booted multiple microservices (e.g., APIs, frontend, webhook handlers).

  • Traefik (v2.10+) served as an edge router using file-based configuration and Docker labels.
  • ACME via Let’s Encrypt automatically handled certs.
  • Each service exposed ports internally; Traefik routed HTTPS requests to them.
  • Deployment was as easy as ansible-playbook deploy.yml.
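A condensed compose sketch of this layout, using CLI flags for Traefik's static configuration for brevity (the labels are what drive routing); the domain, email, and image are placeholders:

```yaml
# docker-compose.yml — Traefik edge router plus one routed service
services:
  traefik:
    image: traefik:v2.10
    command:
      - --providers.docker=true
      - --providers.docker.exposedbydefault=false
      - --entrypoints.websecure.address=:443
      - --certificatesresolvers.le.acme.tlschallenge=true
      - --certificatesresolvers.le.acme.email=ops@example.com
      - --certificatesresolvers.le.acme.storage=/letsencrypt/acme.json
    ports:
      - "443:443"
    volumes:
      - ./letsencrypt:/letsencrypt
      - /var/run/docker.sock:/var/run/docker.sock:ro

  api:
    image: ghcr.io/example/api:latest   # placeholder image
    labels:
      - traefik.enable=true
      - traefik.http.routers.api.rule=Host(`api.example.com`)
      - traefik.http.routers.api.entrypoints=websecure
      - traefik.http.routers.api.tls.certresolver=le
```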

📊 Centralized logging with Azure Log Analytics / Google Cloud Logging

The logging setup varied slightly between clouds:

  • Azure: Installed the Log Analytics agent to forward /var/log and Docker logs.
  • GCP: Used the Ops Agent and journald to pipe logs into Cloud Logging.
  • Logs were filtered by severity and tagged by VM hostname and environment.
  • Traefik logs and access logs were configured for structured output and forwarded as well.

This let me inspect errors, track deployments, and monitor uptime using queries.
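On the Traefik side, structured output is a couple of lines of static configuration; a sketch (file path illustrative):

```yaml
# traefik.yml (static config) — JSON logs so the cloud agents can parse fields
log:
  level: INFO
  format: json
accessLog:
  format: json
  filePath: /var/log/traefik/access.log
  fields:
    headers:
      defaultMode: drop   # keep access logs lean; no header values
```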

Results

  • Setup time dropped from 2 hours (manual) → 7 minutes (infra + services)
  • New VMs could be deployed with 1–2 commands: terraform apply and ansible-playbook
  • Re-deploying applications dropped from 20–30 minutes (manual) → 5 minutes (with automated health checks)
  • Deployments incurred only brief downtime: Docker Compose restarts behind Traefik cause a short unavailability window, acceptable for internal tools and low-traffic environments
  • Auditability improved via Git versioning of infra definitions and Ansible roles
  • Logs were searchable in cloud-native consoles with no manual tailing

What we could have done better

  • Secrets management: Kept secrets in a gitignored inventory file, but should’ve adopted HashiCorp Vault or SOPS from the start.
  • No health check dashboard: Couldn’t visualize all services’ status across VMs. Next iteration should use something like Uptime Kuma or Grafana + Prometheus.
  • Tight coupling of Docker Compose to Ansible made it hard to redeploy specific services without a full run. Moving toward GitHub Actions triggered via webhook might decouple this better.
  • Traefik config drift: As the number of services grew, managing static and dynamic configs in Ansible became error-prone.
  • Downtime during deploys: Because docker-compose up stops and replaces containers, users experienced brief downtimes during each deploy. Implementing a blue-green deployment strategy (running new containers in parallel and switching Traefik routing only after health checks pass) would minimize this gap and support near-zero-downtime rollouts.
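A minimal sketch of that routing switch using Traefik's file provider, assuming blue and green copies of a service are both running; the names, ports, and health endpoint are illustrative:

```yaml
# dynamic.yml — point the router at the blue stack; once the green stack's
# health checks pass, flip `service` to api-green and let Traefik hot-reload
http:
  routers:
    api:
      rule: Host(`api.example.com`)
      service: api-blue
  services:
    api-blue:
      loadBalancer:
        servers:
          - url: http://api-blue:8080
        healthCheck:
          path: /healthz
          interval: 10s
    api-green:
      loadBalancer:
        servers:
          - url: http://api-green:8080
        healthCheck:
          path: /healthz
          interval: 10s
```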
