**Log Aggregation: WePay’s Logging Infrastructure**
Without logs, navigating through issues and errors would be like stumbling in the dark. This article discusses how WePay has set up its logging infrastructure and the components it uses for log aggregation, processing, enrichment, buffering, ingestion, storage, and searching.
WePay uses Filebeat to collect and ship logs from virtual machines (VMs) and microservices. Filebeat delivers each event at least once and without data loss by recording the delivery state of each event in a registry file. If the configured output is blocked or unavailable, Filebeat keeps retrying until the output acknowledges receipt of the events.
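The registry-and-retry behavior described above can be illustrated with a toy shipper. This is a minimal sketch, not Filebeat's implementation: the class `RegistryShipper`, the helper `make_flaky_output`, and the JSON registry layout are all hypothetical names chosen for illustration. Unlike Filebeat, which keeps retrying, the sketch caps the number of attempts so that it always terminates.

```python
import json
from pathlib import Path


class RegistryShipper:
    """Toy at-least-once shipper: the offset advances only after an ack."""

    def __init__(self, registry_path, send):
        self.registry_path = Path(registry_path)
        self.send = send  # callable(line) -> bool; True means acknowledged
        self.offset = self._load_offset()

    def _load_offset(self):
        # Resume from the last acknowledged position, if a registry exists.
        if self.registry_path.exists():
            return json.loads(self.registry_path.read_text())["offset"]
        return 0

    def _save_offset(self):
        self.registry_path.write_text(json.dumps({"offset": self.offset}))

    def ship(self, lines, max_attempts=5):
        """Send every line past the stored offset; return False if blocked."""
        for line in lines[self.offset:]:
            for _ in range(max_attempts):
                if self.send(line):
                    break
            else:
                # Output still blocked: keep the offset so a later call retries.
                return False
            self.offset += 1
            self._save_offset()
        return True


def make_flaky_output(fail_first):
    """Hypothetical output that rejects the first `fail_first` calls."""
    delivered = []
    calls = {"n": 0}

    def send(line):
        calls["n"] += 1
        if calls["n"] <= fail_first:
            return False
        delivered.append(line)
        return True

    return send, delivered
```

Because the offset is persisted only after an acknowledgement, a restarted shipper resumes from the last confirmed event. If a crash happens between the acknowledgement and the registry write, the event is sent again, which is why this style of delivery is at-least-once rather than exactly-once.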
**Log Processing and Enrichment**
WePay utilizes Logstash for advanced log processing and enrichment. Logstash filters parse each log event, identify named fields, and transform them into a common format for analysis. Logstash’s features include:
– Deriving structure from unstructured data using grok and mutate filters
– Looking up geographic coordinates from IP addresses
– Anonymizing personally identifiable information (PII) data and excluding sensitive fields from logs
– Simplifying processing, regardless of data source, format, or schema
– Converting JSON log messages into Avro format matching an Avro schema for storage in Kafka
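The parsing and anonymization steps in the list above can be sketched in plain Python. This is an illustration of the idea, not WePay's Logstash configuration: the log line format, the field names (`email`, `card_number`), and the `_parse_failure` tag are assumptions made for the example. In Logstash itself, comparable work is done by filters such as grok, mutate, and fingerprint.

```python
import hashlib
import re

# Hypothetical log line format, used only for illustration.
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>[A-Z]+) "
    r"(?P<client_ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"user=(?P<email>\S+) (?P<message>.*)"
)

# Assumed names of fields that must never reach the log store.
SENSITIVE_FIELDS = {"card_number"}


def parse_line(line):
    """Derive named fields from an unstructured line (grok-style)."""
    match = LINE_PATTERN.match(line)
    if match is None:
        # Keep the raw line and tag it so failures can be found later.
        return {"message": line, "tags": ["_parse_failure"]}
    return match.groupdict()


def anonymize(event, pii_fields=("email",)):
    """Drop sensitive fields and replace PII values with SHA-256 digests."""
    cleaned = {k: v for k, v in event.items() if k not in SENSITIVE_FIELDS}
    for field in pii_fields:
        if field in cleaned:
            digest = hashlib.sha256(cleaned[field].encode("utf-8")).hexdigest()
            cleaned[field] = digest
    return cleaned
```

Hashing keeps the field usable for correlation (the same input always yields the same digest) while hiding the original value. An unsalted hash of a low-entropy value can still be guessed by brute force, so a keyed hash may be preferable in practice.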
**Log Buffering and Ingestion**
To absorb sudden log surges and protect Elasticsearch from being overwhelmed, WePay uses Apache Kafka as a log buffer. In WePay's logging pipeline, Logstash forwards logs to Kafka in Avro format. WePay also uses lightweight Confluent Kafka connectors to ingest logs from Kafka into Elasticsearch, Google BigQuery, and Google Cloud Storage buckets.
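The effect of placing a buffer between producers and Elasticsearch can be shown with a small simulation. This is a toy model under stated assumptions, not a Kafka client: the function `simulate_buffer` and its tick-based rates are hypothetical, and a `deque` stands in for a Kafka topic.

```python
from collections import deque


def simulate_buffer(arrivals_per_tick, drain_per_tick):
    """Return (backlog, indexed) per tick for a buffered pipeline.

    arrivals_per_tick: number of log events produced in each tick.
    drain_per_tick: maximum events the downstream store accepts per tick.
    """
    buffer = deque()
    backlog, indexed = [], []
    for arrivals in arrivals_per_tick:
        # Producers append to the buffer regardless of downstream speed.
        buffer.extend(range(arrivals))
        # The consumer drains at most its fixed capacity.
        drained = min(drain_per_tick, len(buffer))
        for _ in range(drained):
            buffer.popleft()
        indexed.append(drained)
        backlog.append(len(buffer))
    return backlog, indexed
```

With a surge of 100 events in one tick and a downstream capacity of 40 per tick, the downstream store never sees more than 40 events per tick; the surplus waits in the buffer and is drained over the following ticks.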
As a payments company, WePay follows PCI DSS audit requirements, which state that logs must be retained for a minimum of one year, with 90 days of logs available for immediate analysis. WePay stores logs in three places:
1. Elasticsearch: Logs are retained for 90 days for immediate analysis.
2. Google BigQuery: Logs are stored for long-term retention and compliance requirements.
3. Google Cloud Storage: Logs are stored for long-term retention and to support data backfilling in case of BigQuery connector issues.
WePay employs separate Elastic clusters for different environments: development, testing, staging, and production. This separation offers advantages such as isolating log surges caused by bugs or incorrect logging formats and minimizing risks during upgrades.
**Hot Warm Cold Architecture**
To manage data retention and index lifecycle, WePay has set up three types of Elasticsearch data nodes: hot, warm, and cold. Hot nodes handle new indexes with extensive reads and writes, while warm and cold nodes host older indexes that are read but no longer written to. Elastic's Index Lifecycle Management (ILM) triggers actions on an index when it meets configured conditions, such as reaching a given age. WePay's ILM policy moves new indexes to warm nodes after one week and freezes indexes on cold nodes after one month. Indexes are deleted after three months.
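The lifecycle described above could be expressed as an ILM policy body along the following lines. This is a sketch under assumptions, not WePay's actual policy: the node attribute name `data`, the `set_priority` action in the hot phase, and the conversion of "one month" and "three months" to 30 and 90 days are all illustrative choices. Depending on the Elasticsearch version, the `freeze` action may be deprecated or have no effect.

```python
import json

# Assumes data nodes carry a custom attribute named "data" with the value
# hot, warm, or cold; the attribute name is hypothetical.
ILM_POLICY = {
    "policy": {
        "phases": {
            "hot": {"actions": {"set_priority": {"priority": 100}}},
            "warm": {
                "min_age": "7d",  # one week
                "actions": {"allocate": {"require": {"data": "warm"}}},
            },
            "cold": {
                "min_age": "30d",  # one month, assumed to mean 30 days
                "actions": {
                    "allocate": {"require": {"data": "cold"}},
                    "freeze": {},
                },
            },
            "delete": {
                "min_age": "90d",  # three months, assumed to mean 90 days
                "actions": {"delete": {}},
            },
        }
    }
}


def phase_for_age(age_days):
    """Return the phase an index of the given age would be in."""
    current = "hot"
    for name in ("warm", "cold", "delete"):
        min_age = ILM_POLICY["policy"]["phases"][name]["min_age"]
        if age_days >= int(min_age.rstrip("d")):
            current = name
    return current


if __name__ == "__main__":
    # Print the JSON body that would be sent when creating the policy.
    print(json.dumps(ILM_POLICY, indent=2))
```

The helper `phase_for_age` reads its thresholds from the policy itself, so the two cannot drift apart if the ages are changed.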
**Searching with Kibana**
WePay runs two Elasticsearch clusters but wanted a single Kibana instance for searching logs across both. They achieved this using cross-cluster search: a third Elasticsearch cluster receives search requests and forwards them to both the development and production clusters. This setup has several benefits, including centralized security roles and permissions, dedicated machine learning nodes, search thread throttling, and increased overall security.
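In cross-cluster search, a remote index is addressed by prefixing the index pattern with the remote cluster's alias and a colon. The sketch below builds such a search target; the aliases `dev` and `prod`, the index pattern `logs-*`, and the helper names are assumptions made for illustration, not WePay's configuration.

```python
def cross_cluster_target(cluster_aliases, index_pattern):
    """Build a comma-separated target such as 'dev:logs-*,prod:logs-*'."""
    return ",".join(f"{alias}:{index_pattern}" for alias in cluster_aliases)


def search_request(cluster_aliases, index_pattern, query_text):
    """Describe a search sent to the coordinating cluster.

    Returns the request path and JSON body as plain Python data; sending
    the request is left out so the sketch has no network dependency.
    """
    target = cross_cluster_target(cluster_aliases, index_pattern)
    return {
        "path": f"/{target}/_search",
        "body": {"query": {"query_string": {"query": query_text}}},
    }
```

For example, `search_request(["dev", "prod"], "logs-*", "level:ERROR")` describes one request that the coordinating cluster would fan out to both remote clusters, which is what allows a single Kibana instance to search them together.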
WePay plans to integrate application performance monitoring (APM) data into Elasticsearch for better correlation with logging events. They also aim to implement Cross Cluster Replication for improved fault tolerance and to use the Frozen Tier feature in Elastic to search logs older than the 90-day retention period.
WePay’s logging infrastructure, with its log aggregation, processing, enrichment, buffering, ingestion, storage, and searching capabilities, has been designed to handle high volumes of logs while ensuring data integrity, security, and compliance. The separation of environments and the utilization of different Elastic clusters have proven effective in managing and protecting the logging pipeline. With continuous improvements and future developments, WePay continues to enhance its logging capabilities for efficient troubleshooting and analysis.