TECHMONARCH INSIGHTS · SOC OPERATIONS & SECURITY ENGINEERING
Log management is the intelligence layer that determines whether your SOC and NOC can investigate incidents with evidence or with guesswork. Most MSPs are collecting a fraction of what they should and analyzing even less. Here is the framework to fix that.
By TechMonarch Editorial Team | 9 min read | SOC Operations, Security Engineering & Compliance
204 days average time to identify a breach in environments without centralized log management, versus 72 days in environments with it
76% of MSPs cannot reconstruct a complete attack timeline after an incident due to insufficient log collection and retention practices
90 days minimum immediately available log history expected by major compliance frameworks such as PCI-DSS, with total retention floors of 12 months or more
When a security incident occurs in a client environment — a compromised credential, a lateral movement event, data exfiltration from a file server, a ransomware detonation — there are two fundamentally different situations a response team can find themselves in. In the first, there is a centralized log management environment with comprehensive collection, proper retention, and structured analysis capabilities. The incident response team can reconstruct the attack timeline, identify patient zero, trace every lateral movement, and determine the full blast radius of the compromise. In the second, the logs that exist are fragmented across individual systems, incomplete, partially overwritten, and covering only the last 48 to 72 hours because no one configured longer local retention. The team is investigating with fragments.
The difference between these two situations is log management — and for the majority of SMB and mid-market clients in a typical MSP portfolio, the second scenario is significantly more common than the first. Not because log data is not generated — modern IT environments produce enormous volumes of log data across every layer of the stack — but because that data is not being collected centrally, not being retained for long enough, and not being analyzed in a way that surfaces actionable intelligence.
Log management is also no longer purely a security concern. The compliance frameworks that govern a growing proportion of MSP client environments (HIPAA, PCI-DSS, SOC 2, CMMC, ISO 27001) all have specific log collection and retention requirements, and failing to meet them results in audit findings and potential regulatory exposure. For MSPs whose clients are in regulated industries, log management is a compliance deliverable that belongs in the managed service scope alongside patch management and backup management.
This guide is a practical framework for MSPs building or improving a log management capability for their client base — covering what to collect, how long to store it, and what to analyze within it.
The Log Collection Hierarchy: Prioritizing by Risk and Compliance Value
Not all log sources are equally valuable. In a resource-constrained environment — which describes almost every SMB client’s IT budget — the temptation is to collect everything and deal with the storage and analysis complexity later. The more practical approach is to build collection in priority tiers, ensuring that the highest-value log sources are covered completely before expanding to lower-value sources.
Tier 1: Authentication and Identity Logs
Authentication logs are the single most valuable log source for both security investigation and compliance purposes. They answer the foundational questions of any security investigation: who authenticated, from where, at what time, to what resource, and whether the authentication succeeded or failed. In environments using Active Directory or Azure AD, authentication logs include domain controller Security Event Logs (Event IDs 4624 successful logon, 4625 failed logon, 4648 logon with explicit credentials, 4768 Kerberos ticket request, 4776 NTLM authentication), Azure AD sign-in logs, and VPN authentication logs from the firewall or VPN concentrator.
Authentication log collection should be comprehensive and without sampling. A sampled authentication log is nearly useless for security investigation because the event you need — the single successful authentication from an attacker who tried 500 passwords before finding the right one — is the one most likely to fall outside the sample window. Every authentication event, success or failure, from every system in the environment, belongs in the centralized log store.
Tier 2: Endpoint and Server Security Logs
Endpoint and server security logs cover process execution, privilege escalation, service installation, scheduled task creation, and registry modification — the behavioral indicators of compromise that distinguish attacker activity from normal user behavior. On Windows systems, the relevant Event IDs include 4688 (process creation, which requires process auditing to be enabled), 4698 and 4702 (scheduled task creation and modification), 4720 and 4722 (user account creation and enabling), 4732 (member added to security-enabled local group), and 7045 (new service installed in the system).
For MSP clients running an EDR solution — Microsoft Defender for Endpoint, CrowdStrike, SentinelOne, or similar — the EDR telemetry is the richest available source of endpoint behavioral data and should be integrated into the centralized log management environment. EDR logs provide process trees, network connection records, and file system activity at a granularity that Windows Event Logs alone do not match.
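The Windows Event IDs named in Tiers 1 and 2 can be captured as a small lookup that a collector or parser uses to tag incoming events. This is a minimal sketch: the dictionary names and the `classify_event` helper are hypothetical, and only the Event IDs themselves come from the text above.

```python
from typing import Optional, Tuple

# Event IDs taken from the Tier 1 and Tier 2 descriptions in this guide.
TIER1_AUTH_EVENTS = {
    4624: "successful logon",
    4625: "failed logon",
    4648: "logon with explicit credentials",
    4768: "Kerberos ticket request",
    4776: "NTLM authentication",
}
TIER2_ENDPOINT_EVENTS = {
    4688: "process creation",
    4698: "scheduled task created",
    4702: "scheduled task modified",
    4720: "user account created",
    4722: "user account enabled",
    4732: "member added to security-enabled local group",
    7045: "new service installed",
}


def classify_event(event_id: int) -> Optional[Tuple[int, str]]:
    """Return (tier, label) for a known Event ID, or None if it is in neither tier."""
    if event_id in TIER1_AUTH_EVENTS:
        return 1, TIER1_AUTH_EVENTS[event_id]
    if event_id in TIER2_ENDPOINT_EVENTS:
        return 2, TIER2_ENDPOINT_EVENTS[event_id]
    return None
```

A parser could call `classify_event` on each record and route anything tagged Tier 1 to the unsampled, highest-priority pipeline.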
Tier 3: Network Infrastructure Logs
Network infrastructure logs — firewall allow and deny logs, DNS query logs, DHCP lease records, VPN session logs, and proxy logs where applicable — provide the network-layer visibility that endpoint logs cannot. A firewall log shows external IP addresses communicating with internal systems, unusual outbound connections that may indicate command-and-control traffic, and policy violations that reveal misconfigurations or unauthorized access attempts. DNS query logs are particularly valuable because they reveal domain lookups that precede network connections, including lookups for known malicious domains that are a reliable indicator of compromise before the connection itself is established.
DHCP records are often overlooked but provide the mapping between IP addresses and device identities at specific points in time — a mapping that is essential for correlating network-layer events in firewall logs with specific devices identified in endpoint logs. Without DHCP records, an IP address in a firewall log is unidentifiable after a lease expires and is reassigned to a different device.
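The IP-to-device correlation described above amounts to a time-bounded lookup. The sketch below assumes lease records have already been parsed into a hypothetical `Lease` structure; the field names and the `device_for` helper are illustrative, not a real DHCP server API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Lease:
    ip: str
    hostname: str
    start: datetime  # lease granted
    end: datetime    # lease expired or released


def device_for(ip: str, at: datetime, leases: List[Lease]) -> Optional[str]:
    """Return the hostname that held `ip` at time `at`, or None if no lease covers it."""
    for lease in leases:
        if lease.ip == ip and lease.start <= at < lease.end:
            return lease.hostname
    return None
```

The same IP address resolves to different devices depending on the timestamp, which is why lease history has to be retained at least as long as the firewall logs it explains.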
Tier 4: Application and Cloud Service Logs
Application logs and cloud service audit logs round out the collection framework. For Microsoft 365 environments, which account for the majority of MSP client bases, the Unified Audit Log (UAL) is a critical source that captures user activity across Exchange Online, SharePoint, OneDrive, Teams, and Azure AD. Email access, file sharing events, mailbox permission changes, mail forwarding rule creation, and OAuth application consent grants are all captured in the UAL and are among the most commonly relevant data sources in business email compromise investigations.
For clients with line-of-business applications, ERP systems, or other enterprise platforms that generate their own audit logs, those logs should be evaluated for collection based on the sensitivity of the data they process and the compliance requirements applicable to the client. A healthcare client’s EHR system audit log is a HIPAA requirement. A financial services client’s trading platform audit log may be a regulatory requirement. These are not optional collection targets in regulated environments.
“Collecting logs is not the same as having log management. The collection is the foundation. The value — for security, compliance, and incident response — comes from what is done with the data after it is collected: how long it is retained, how it is correlated, and what analysis is applied to surface the signals that matter.”
Retention Policy: How Long to Store What, and Why It Matters
Log retention is where the gap between what MSPs are doing and what compliance frameworks and security best practices require is most pronounced. The default retention period in many environments is whatever the local system retains before overwriting — which is often 72 hours to 7 days for high-volume log sources on systems with limited disk allocation for logs. This is operationally useless for security investigation and non-compliant for virtually every regulated industry.
The retention requirements that govern most MSP client environments are as follows. PCI-DSS requires a minimum of 12 months of retention, with at least the most recent 3 months immediately available for analysis. HIPAA's 6-year documentation retention requirement is commonly applied to security event logs that relate to protected health information. SOC 2 Type II typically requires a minimum of 12 months to cover the audit period. CMMC Level 2 and above requires a minimum of 3 years for audit log retention. These are floors, not recommendations, and failure to meet them is an audit finding that MSPs managing compliance obligations for their clients need to treat as a service delivery requirement.
For clients without specific regulatory requirements, the security-driven retention recommendation is a minimum of 90 days hot retention — immediately queryable in your SIEM or log management platform — and 12 months cold retention in compressed, tamper-evident archival storage. The 90-day hot retention covers the mean attacker dwell time in environments with reasonable detection capabilities. The 12-month archival period covers threat actors with longer dwell times and provides the historical context for year-over-year trend analysis.
The practical implementation of a tiered retention model uses storage architecture to balance cost and accessibility. High-value, high-volume log sources like authentication logs and EDR telemetry are retained hot in the SIEM for 90 days and then compressed and archived to lower-cost object storage (Azure Blob Storage, AWS S3) for the balance of the retention period. Lower-volume, lower-change-rate log sources like DHCP records and application logs may be retained entirely in lower-cost archival storage with a defined retrieval time for investigation access.
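The tiered model can be expressed as two small decisions: how long a client's logs must be kept in total, and which storage tier a given day's logs belong in. This is a sketch under stated assumptions: the retention figures are the ones quoted in this guide (not legal advice), and the names `FRAMEWORK_FLOOR_DAYS`, `total_retention_days`, and `storage_tier` are hypothetical.

```python
from datetime import date

# Retention floors in days, as quoted in this guide.
FRAMEWORK_FLOOR_DAYS = {
    "PCI-DSS": 365,
    "HIPAA": 6 * 365,
    "SOC 2 Type II": 365,
    "CMMC L2+": 3 * 365,
}
DEFAULT_HOT_DAYS = 90     # immediately queryable in the SIEM
DEFAULT_TOTAL_DAYS = 365  # security-driven minimum when no framework applies


def total_retention_days(frameworks):
    """Longest floor among the client's frameworks, never below the default."""
    floors = [FRAMEWORK_FLOOR_DAYS[name] for name in frameworks]
    return max(floors + [DEFAULT_TOTAL_DAYS])


def storage_tier(event_date, today, total_days, hot_days=DEFAULT_HOT_DAYS):
    """Classify logs from `event_date` as 'hot', 'cold', or 'expired'."""
    age = (today - event_date).days
    if age < 0:
        raise ValueError("event_date is in the future")
    if age <= hot_days:
        return "hot"
    if age <= total_days:
        return "cold"
    return "expired"
```

In practice the same decision is usually delegated to the storage platform's lifecycle rules; the value of writing it down is that each client's retention period becomes an explicit, reviewable number.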
What to Analyze: Detection Logic That Surfaces Real Threats
Log collection and retention without analysis is a compliance exercise, not a security capability. The analysis layer is where log management becomes a genuine threat detection function — and it is the layer that requires the most operational investment and ongoing refinement.
The analysis framework for MSP client environments should be built around four detection categories, each of which addresses a distinct class of threat behavior.
Category 1: Authentication Anomalies
Authentication anomaly detection is the highest-return analysis investment because it catches the most common initial access and privilege escalation techniques. The specific detection rules that generate the most actionable alerts in MSP client environments include: impossible travel detection (successful authentications from geographically disparate locations within a timeframe that precludes physical travel), password spray detection (a pattern of distributed failed authentications against many accounts from a single source IP, typically 1 to 3 failures per account rather than the repeated failures against a single account that lockout policies prevent), off-hours authentication from unfamiliar locations for privileged accounts, and new device registrations for accounts that have not registered a new device in the preceding 90-day baseline period.
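The password spray pattern described above (many accounts, few failures each, one source) can be sketched as a simple aggregation over failed logon events in a time window. The function name, input shape, and thresholds are assumptions for illustration and would need tuning per client environment.

```python
from collections import defaultdict


def find_password_spray(failed_logons, min_accounts=10, max_failures_per_account=3):
    """Flag source IPs that failed against many accounts with few attempts each.

    failed_logons: iterable of (source_ip, account) pairs from one time window,
    for example parsed from Event ID 4625 records.
    """
    per_source = defaultdict(lambda: defaultdict(int))
    for source_ip, account in failed_logons:
        per_source[source_ip][account] += 1

    flagged = []
    for source_ip, accounts in per_source.items():
        # Count only accounts with a low number of failures: the spray signature.
        low_count = [a for a, n in accounts.items() if n <= max_failures_per_account]
        if len(low_count) >= min_accounts:
            flagged.append(source_ip)
    return sorted(flagged)
```

A brute-force attempt against a single account produces one account with many failures and is deliberately not flagged here; lockout policy and a separate rule cover that case.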
For Microsoft 365 environments, Microsoft’s Identity Protection already implements many of these detections natively and generates risky sign-in and risky user alerts. Integrating these alerts into the centralized log management environment, correlating them with on-premises authentication logs, and ensuring they generate tickets in the PSA are the operational steps that make the native Microsoft detection capabilities operationally effective for MSP service delivery.
Category 2: Lateral Movement Indicators
Lateral movement — an attacker who has compromised one system moving to other systems within the environment — is the phase of an attack where containment is still possible if detection is fast enough. The log-based indicators of lateral movement that are most reliably detectable include: successful authentications to multiple systems within a short time window from a single account (particularly to systems the account does not normally access), remote service execution events (Event ID 7045 on a remote system preceded by a network logon from the originating system), administrative share access (Event ID 5140 showing access to C$, ADMIN$, or IPC$ shares from a system that does not normally access them), and pass-the-hash or pass-the-ticket indicators in Kerberos and NTLM authentication logs.
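The first indicator above, one account authenticating to many systems in a short window, can be sketched with a sliding window over successful logons. The function name, the 10-minute window, and the 5-host threshold are hypothetical defaults, and a production rule would also exclude systems the account normally accesses.

```python
from collections import defaultdict
from datetime import timedelta


def fan_out_accounts(logons, window=timedelta(minutes=10), min_hosts=5):
    """Return accounts that logged on to >= min_hosts distinct hosts within `window`.

    logons: iterable of (timestamp, account, host) for successful logons.
    """
    by_account = defaultdict(list)
    for ts, account, host in logons:
        by_account[account].append((ts, host))

    flagged = set()
    for account, events in by_account.items():
        events.sort()
        start = 0
        for end in range(len(events)):
            # Shrink the window from the left until it spans at most `window`.
            while events[end][0] - events[start][0] > window:
                start += 1
            hosts = {host for _, host in events[start:end + 1]}
            if len(hosts) >= min_hosts:
                flagged.add(account)
                break
    return sorted(flagged)
```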
Category 3: Data Exfiltration Indicators
Data exfiltration detection requires correlation between multiple log sources — file access logs, network traffic volume logs, cloud storage activity logs, and email audit logs. The indicators that most reliably signal exfiltration activity are: anomalous outbound data volume from specific endpoints compared to a 30-day baseline, bulk file download from SharePoint or OneDrive that significantly exceeds a user’s normal access pattern, email forwarding rule creation in Exchange Online (particularly rules that forward to external addresses), and access to file share resources by accounts that do not normally access them in volumes that suggest bulk copying rather than normal file access.
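The 30-day baseline comparison for outbound volume can be sketched as a threshold at the baseline mean plus a multiple of its standard deviation. This is one common way to define "anomalous", not the only one; the function name and the default of three standard deviations are assumptions.

```python
from statistics import mean, pstdev


def is_anomalous_volume(baseline_bytes, today_bytes, sigmas=3.0):
    """Return True if today's outbound volume exceeds mean + sigmas * stddev.

    baseline_bytes: daily outbound byte counts for one endpoint (e.g. 30 days).
    """
    if len(baseline_bytes) < 2:
        raise ValueError("need at least two baseline days")
    threshold = mean(baseline_bytes) + sigmas * pstdev(baseline_bytes)
    return today_bytes > threshold
```

An endpoint with a perfectly flat baseline has zero deviation, so any increase is flagged; a minimum absolute threshold is a reasonable addition for such cases.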
For clients in environments where sensitive data has a defined classification and location, data loss prevention integration with the log management environment provides a more precise exfiltration detection capability. DLP policy match events, when correlated with the network and file access log indicators above, significantly reduce the false positive rate in exfiltration detection.
Category 4: Infrastructure Integrity Indicators
Infrastructure integrity detection covers the log indicators that signal an attacker establishing persistence or preparing for a high-impact action: new local administrator accounts created outside of your change management process, Group Policy modification events on domain controllers, firewall rule changes that open new inbound access, DNS record modifications on authoritative servers, and certificate issuance events from internal certificate authorities for certificates covering unfamiliar subjects. These are lower-frequency events than authentication and endpoint events, but their occurrence outside of a documented change window is a high-confidence indicator of malicious activity.
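The "outside of a documented change window" test is the part of this category that is easiest to automate. A minimal sketch, assuming change windows are exported from the change management system as start and end timestamps (the function name and input shape are hypothetical):

```python
def outside_change_windows(event_time, windows):
    """Return True if `event_time` falls in none of the approved change windows.

    windows: iterable of (start, end) datetime pairs, both inclusive.
    """
    return not any(start <= event_time <= end for start, end in windows)
```

Applied to the event types listed above, a `True` result is what elevates a routine-looking administrative event to a high-priority alert.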
“The detection rules in a SIEM are hypotheses about attacker behavior. Each rule says: if this pattern of events occurs, it probably means this. Building a detection library is a continuous process of refining those hypotheses against real attacker behavior observed in your client environments.”
SIEM Selection and Architecture for MSP Multi-Tenant Environments
The platform that underpins log management for MSP environments must support multi-tenant architecture: the ability to ingest, store, and analyze logs from multiple client environments with strict logical separation, so that a query or alert for Client A never surfaces data from Client B. This is both a security requirement and a compliance requirement, and it eliminates from consideration a significant portion of the SIEM market, namely products designed for single-tenant deployment.
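The isolation property can be illustrated with a toy in-memory store in which every read and write is scoped to a tenant identifier, so there is no code path that queries across clients. This is purely a sketch of the principle; the `TenantScopedStore` class is hypothetical and real platforms enforce separation at the workspace, index, or account level.

```python
class TenantScopedStore:
    """Toy log store where every operation requires an explicit tenant ID."""

    def __init__(self):
        self._events = {}  # tenant_id -> list of event dicts

    def ingest(self, tenant_id, event):
        self._events.setdefault(tenant_id, []).append(event)

    def query(self, tenant_id, predicate):
        # Only the named tenant's partition is ever read.
        return [e for e in self._events.get(tenant_id, []) if predicate(e)]
```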
Microsoft Sentinel is the most widely adopted platform in MSP environments serving Microsoft-centric client bases, largely because of its native integration with the Microsoft 365 ecosystem and Azure AD, its MSSP multi-tenant workspace architecture, and its cost model based on data ingestion volume rather than per-device licensing. The Sentinel content hub provides a library of pre-built detection rules, workbooks, and connectors that significantly reduces the time required to stand up effective detection logic for common threat patterns. For MSPs delivering white-label SOC services, Azure Lighthouse, which provides cross-tenant management of Sentinel workspaces, is purpose-built for the managed service use case.
Alternative platforms with strong MSP positioning include Elastic SIEM, which offers a flexible open-source-based architecture with strong support for custom log sources and no per-device licensing, and Devo, which is designed explicitly for high-volume, multi-tenant security operations. The selection criteria that matter most for MSP use cases are multi-tenant data isolation, API-based connector availability for the log sources in your client base, the quality and coverage of the out-of-the-box detection rule library, and the integration depth with your PSA for automated ticket generation from SIEM alerts.
Log Management as a White-Label SOC Capability
For MSPs delivering security services through a white-label SOC partner, log management is the foundational capability that determines whether the SOC service delivers genuine threat detection or monitoring theater that generates reports without real security outcomes. The quality of alert triage, threat hunting, and incident response that a SOC can deliver is entirely bounded by the quality and completeness of the log data it has access to.
When structuring a white-label SOC engagement that includes log management, the scope should explicitly specify the log sources to be collected from each client environment (mapped to the tier framework above, with client-specific additions for regulated industry requirements), the retention policy applicable to each log source category, the detection rule library that will be applied and the process for adding new detection rules as threat intelligence evolves, and the alert triage SLA — the timeframe within which the SOC team investigates and classifies each alert generated by the detection logic.
A white-label SOC partner who is managing log collection and analysis on your behalf should be providing you with monthly reporting that covers: alert volume by category and client, true positive and false positive rates by detection rule, mean time to detect for the incidents that were identified through log analysis, and any gaps in log collection that were identified during the period. This reporting is both a quality assurance mechanism for the SOC engagement and a source of client-facing QBR content that demonstrates the ongoing security value of the managed service.
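Two of the reporting metrics above, true positive rate by detection rule and mean time to detect, reduce to simple arithmetic over triaged alerts and confirmed incidents. The sketch below assumes the SOC can export alerts as (rule, verdict) pairs and incidents as (occurred, detected) timestamps; both function names and input shapes are hypothetical.

```python
from collections import defaultdict
from datetime import timedelta


def true_positive_rate_by_rule(alerts):
    """alerts: iterable of (rule_name, is_true_positive). Returns {rule: fraction}."""
    counts = defaultdict(lambda: [0, 0])  # rule -> [true positives, total alerts]
    for rule, is_true_positive in alerts:
        counts[rule][1] += 1
        if is_true_positive:
            counts[rule][0] += 1
    return {rule: tp / total for rule, (tp, total) in counts.items()}


def mean_time_to_detect(incidents):
    """incidents: non-empty list of (occurred_at, detected_at) datetime pairs."""
    total = sum((detected - occurred for occurred, detected in incidents), timedelta())
    return total / len(incidents)
```

Rules with a persistently low true positive rate are candidates for tuning or retirement, which ties the monthly report back to the detection library refinement process.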
Log management is one of the clearest examples of the compounding value of a well-structured managed security service. The longer a comprehensive log collection environment has been in place, the more historical baseline data is available for anomaly detection, the more refined the detection rules become based on observed normal behavior in each client environment, and the more complete the forensic record available for incident investigations. A client who has been in a properly managed log environment for 24 months has a security intelligence asset that a client who signed on last month does not. That accumulated value is a tangible argument for retention and renewal that goes well beyond the terms of the contract.
