Intermediate AI-Assisted Network Security

L06: Network Anomaly Detection with ML

Apply unsupervised machine learning (Isolation Forest, DBSCAN) to detect anomalous network traffic patterns without labeled data. Capture live traffic with tshark, engineer flow features, and tune your model to surface port scans, C2 beaconing, and data exfiltration.

Python 3 scikit-learn tshark pandas Kali Linux Wireshark
Phase 1: Traffic Capture & Flow Feature Extraction
1 Capture baseline traffic with tshark

On Kali Linux, capture 5 minutes of baseline traffic on the host-only network (192.168.56.0/24):

sudo tshark -i eth1 -a duration:300 \
  -w ~/network-lab/baseline_capture.pcap \
  -f "net 192.168.56.0/24"

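# Export to CSV for feature engineering
# (-E occurrence=f keeps only the first value of each field, so packets that
# embed a second IP header, such as ICMP errors, do not add extra columns)
tshark -r ~/network-lab/baseline_capture.pcap \
  -T fields \
  -e frame.time_epoch \
  -e ip.src \
  -e ip.dst \
  -e ip.proto \
  -e tcp.srcport \
  -e tcp.dstport \
  -e udp.srcport \
  -e udp.dstport \
  -e frame.len \
  -e tcp.flags \
  -E header=y -E separator=, -E occurrence=f \
  > ~/network-lab/packets.csv

# Subtract 1 for the CSV header row
echo "Captured $(( $(wc -l < ~/network-lab/packets.csv) - 1 )) packets"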
2 Engineer NetFlow-style features from packets

Aggregate raw packets into bidirectional flows with statistical features:

cat > ~/network-lab/flow_features.py << 'EOF'
import pandas as pd
import numpy as np

df = pd.read_csv('packets.csv')
df['time'] = pd.to_numeric(df['frame.time_epoch'], errors='coerce')
df['length'] = pd.to_numeric(df['frame.len'], errors='coerce')

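# Create a direction-independent flow key from the 5-tuple
# Non-IP packets (e.g. ARP) have no addresses, so drop them first
df = df.dropna(subset=['ip.src', 'ip.dst']).copy()

def port_col(name):
    # Columns missing from the CSV are treated as empty
    s = df[name] if name in df.columns else pd.Series(np.nan, index=df.index)
    return pd.to_numeric(s, errors='coerce')

# Use the TCP port, fall back to the UDP port, and use 0 for portless protocols
sport = port_col('tcp.srcport').fillna(port_col('udp.srcport')).fillna(0).astype(int)
dport = port_col('tcp.dstport').fillna(port_col('udp.dstport')).fillna(0).astype(int)

# Order the two (ip:port) endpoints so both directions share one key
a = df['ip.src'].astype(str) + ':' + sport.astype(str)
b = df['ip.dst'].astype(str) + ':' + dport.astype(str)
df['flow_id'] = a.where(a <= b, b) + '|' + b.where(a <= b, a) + '|' + df['ip.proto'].astype(str)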

# Aggregate flow statistics
flows = df.groupby('flow_id').agg(
    packet_count=('length', 'count'),
    total_bytes=('length', 'sum'),
    mean_bytes=('length', 'mean'),
    std_bytes=('length', 'std'),
    min_bytes=('length', 'min'),
    max_bytes=('length', 'max'),
    duration=('time', lambda x: x.max() - x.min()),
    start_time=('time', 'min'),
).reset_index()

# Derived features
flows['bytes_per_pkt'] = flows['total_bytes'] / flows['packet_count']
flows['pkt_rate'] = flows['packet_count'] / (flows['duration'] + 0.001)
flows['byte_rate'] = flows['total_bytes'] / (flows['duration'] + 0.001)
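# Fraction of packets in each flow that are smaller than 100 bytes
small = df['length'].lt(100).groupby(df['flow_id']).mean()
flows['small_pkt_ratio'] = flows['flow_id'].map(small)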

flows.fillna(0, inplace=True)
flows.to_csv('flow_features.csv', index=False)
print(f"Generated {len(flows)} flows with {flows.shape[1]} features")
print(flows.describe())
EOF
cd ~/network-lab && python3 flow_features.py
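To see what "bidirectional" means for the flow key, here is a minimal sketch on three hand-made packets. The `sport`/`dport` column names and the `bidirectional_key` helper are hypothetical and exist only for this illustration. The two directions of one TCP conversation should collapse into a single flow, while the unrelated UDP packet stays separate.

```python
import pandas as pd

# Hypothetical packets: both directions of one TCP conversation plus one UDP packet
packets = pd.DataFrame({
    'ip.src':   ['192.168.56.10', '192.168.56.101', '192.168.56.10'],
    'ip.dst':   ['192.168.56.101', '192.168.56.10', '192.168.56.101'],
    'ip.proto': [6, 6, 17],
    'sport':    [40000, 80, 5353],
    'dport':    [80, 40000, 53],
    'length':   [74, 1500, 90],
})

def bidirectional_key(p):
    """Build one key per conversation by ordering the two ip:port endpoints."""
    a = p['ip.src'] + ':' + p['sport'].astype(str)
    b = p['ip.dst'] + ':' + p['dport'].astype(str)
    first = a.where(a <= b, b)    # lexicographically smaller endpoint
    second = b.where(a <= b, a)   # the other endpoint
    return first + '|' + second + '|' + p['ip.proto'].astype(str)

packets['flow_id'] = bidirectional_key(packets)

# Aggregate per flow, as the lab script does with more statistics
flows = packets.groupby('flow_id').agg(
    packet_count=('length', 'count'),
    total_bytes=('length', 'sum'),
)
print(flows)
```

Keeping each port attached to its own IP address matters: sorting addresses and ports separately can merge two different conversations between the same pair of hosts.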
Phase 2: Unsupervised Anomaly Detection
3 Train Isolation Forest on flow features

Isolation Forest builds random trees that repeatedly split the data; anomalous points end up isolated after fewer splits than normal points, which gives them a lower score. No labels required:

cat > ~/network-lab/isolation_forest.py << 'EOF'
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

flows = pd.read_csv('flow_features.csv')

feature_cols = ['packet_count', 'total_bytes', 'mean_bytes', 'std_bytes',
                'duration', 'bytes_per_pkt', 'pkt_rate', 'byte_rate',
                'small_pkt_ratio']

X = flows[feature_cols].fillna(0)

# Normalize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train Isolation Forest
# contamination=0.05 flags the 5% of flows with the lowest scores as anomalous
iso = IsolationForest(
    n_estimators=200,
    contamination=0.05,
    max_samples='auto',
    random_state=42
)
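# fit_predict returns labels (-1 = anomaly, 1 = normal);
# score_samples returns the raw score (lower = more anomalous)
flows['anomaly_score'] = iso.fit_predict(X_scaled)
flows['anomaly_raw'] = iso.score_samples(X_scaled)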

anomalies = flows[flows['anomaly_score'] == -1]
print(f"\nDetected {len(anomalies)} anomalous flows ({len(anomalies)/len(flows)*100:.1f}%)")
print("\nTop anomalies by score:")
print(anomalies.nsmallest(10, 'anomaly_raw')[
    ['flow_id', 'packet_count', 'total_bytes', 'pkt_rate', 'anomaly_raw']])

# PCA visualization (2-D projection used for plotting only)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
is_anomaly = (flows['anomaly_score'] == -1).to_numpy()
plt.figure(figsize=(10,6))
# Plot each class separately so the legend entries match the colors
plt.scatter(X_pca[~is_anomaly,0], X_pca[~is_anomaly,1], c='steelblue', alpha=0.6, s=20, label='Normal')
plt.scatter(X_pca[is_anomaly,0], X_pca[is_anomaly,1], c='red', alpha=0.6, s=20, label='Anomaly')
plt.title('Network Flow Anomaly Detection: Isolation Forest')
plt.xlabel('PCA Component 1'); plt.ylabel('PCA Component 2')
plt.legend(loc='upper right')
plt.savefig('anomaly_pca.png', dpi=150)
print("\nVisualization saved to anomaly_pca.png")
EOF
python3 isolation_forest.py
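The scoring behavior is easier to trust after seeing it on data where the answer is known. The sketch below uses synthetic points rather than lab traffic (200 points around the origin plus one injected point far away, all assumed for illustration) and checks that the injected point gets label -1 and the lowest raw score.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for scaled flow features: a dense cloud plus one far-away point
normal = rng.normal(0, 1, size=(200, 2))
injected = np.array([[10.0, 10.0]])
X = np.vstack([normal, injected])

iso = IsolationForest(n_estimators=200, contamination=0.05, random_state=42)
labels = iso.fit_predict(X)      # -1 = anomaly, 1 = normal
scores = iso.score_samples(X)    # lower = more anomalous

print("Label of injected point:", labels[-1])
print("Index of lowest score:", int(scores.argmin()), "(injected point is index 200)")
print("Total flagged:", int((labels == -1).sum()))
```

Note that about 5% of the points are flagged regardless, because `contamination` fixes the flagged fraction; the raw score is what separates the injected point from merely unusual ones.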
4 Apply DBSCAN for cluster-based anomaly detection

DBSCAN labels flows not belonging to any cluster as noise — potential anomalies:

cat > ~/network-lab/dbscan_anomaly.py << 'EOF'
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

flows = pd.read_csv('flow_features.csv')
feature_cols = ['packet_count', 'total_bytes', 'mean_bytes', 'bytes_per_pkt',
                'pkt_rate', 'duration']
X = StandardScaler().fit_transform(flows[feature_cols].fillna(0))

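# Find a candidate epsilon using the k-distance graph
# k = 4 matches min_samples=5 below, since DBSCAN counts the point itself
nbrs = NearestNeighbors(n_neighbors=4).fit(X)
# Calling kneighbors() without arguments excludes each point from its own
# neighbor list, so the last column really is the 4th nearest neighbor
distances, _ = nbrs.kneighbors()
distances = np.sort(distances[:, -1])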
plt.figure(figsize=(8,4))
plt.plot(distances)
plt.xlabel('Points sorted by distance')
plt.ylabel('4th NN distance (epsilon)')
plt.title('K-Distance Graph — Choose epsilon at elbow')
plt.savefig('kdistance.png')
print("Examine kdistance.png to choose epsilon value")

# Apply DBSCAN (adjust eps from k-distance graph)
db = DBSCAN(eps=1.5, min_samples=5)
flows['cluster'] = db.fit_predict(X)
noise = flows[flows['cluster'] == -1]

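# Count clusters without the noise label (-1), which may or may not be present
n_clusters = flows.loc[flows['cluster'] != -1, 'cluster'].nunique()
print(f"\nDBSCAN clusters found: {n_clusters}")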
print(f"Noise points (anomalies): {len(noise)}")
print("\nNoise flows sample:")
print(noise[['flow_id', 'packet_count', 'total_bytes', 'pkt_rate']].head(10))
flows.to_csv('flows_clustered.csv', index=False)
EOF
python3 dbscan_anomaly.py
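As a quick check of how the noise label works, the sketch below runs DBSCAN on synthetic data: two tight clusters and one stray point, all invented for illustration. The stray point has no dense neighborhood, so it should come back with label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)

# Two tight synthetic clusters and one stray point far from both
cluster_a = rng.normal(loc=[0, 0], scale=0.1, size=(50, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.1, size=(50, 2))
stray = np.array([[2.5, -3.0]])
X = np.vstack([cluster_a, cluster_b, stray])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Noise points carry the label -1 and are excluded from the cluster count
n_clusters = len(set(labels) - {-1})
n_noise = int((labels == -1).sum())
print(f"Clusters: {n_clusters}, noise points: {n_noise}, stray label: {labels[-1]}")
```

On real flows the result depends heavily on `eps`: too small and most flows become noise, too large and the anomalies are absorbed into a cluster.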
Phase 3: Attack Traffic Injection & Detection
5 Simulate port scan (Nmap) and verify detection

From Kali, run a port scan against Metasploitable and check if your model flags it:

cd ~/network-lab

# Cache sudo credentials so the background capture does not block on a password prompt
sudo -v

# Start the capture in the background. It stops on its own after 11 minutes,
# long enough to cover the scan and all 20 beacons (about 10 minutes)
sudo tshark -i eth1 -a duration:660 -w ~/network-lab/attack_capture.pcap &
sleep 2

# Port scan: generates many short-duration flows with small packets
sudo nmap -sS -p 1-1000 192.168.56.101

# C2 beacon simulation: regular interval, small packets
python3 -c "
import socket, time, random
for i in range(20):
    try:
        s = socket.socket()
        s.settimeout(1)
        s.connect(('192.168.56.101', 4444))
        s.send(b'beacon\x00' * 10)
        s.close()
    except OSError:
        pass  # a refused or timed-out connection still puts a SYN on the wire
    time.sleep(30 + random.uniform(-2, 2))  # jitter
print('Beacon simulation complete')
" &

# Wait for the beacon loop and the capture to finish
wait

# Now export packet fields from the attack capture
tshark -r ~/network-lab/attack_capture.pcap \
  -T fields -e frame.time_epoch -e ip.src -e ip.dst \
  -e ip.proto -e tcp.srcport -e tcp.dstport \
  -e frame.len -E header=y -E separator=, > ~/network-lab/attack_packets.csv
6 Score attack traffic against trained model

Extract flow features from the attack capture and apply your trained Isolation Forest:

cd ~/network-lab

# flow_features.py reads packets.csv and writes flow_features.csv, so stage the
# attack packets through it, then restore the baseline files
cp packets.csv baseline_packets.csv
cp flow_features.csv baseline_flows.csv
cp attack_packets.csv packets.csv
python3 flow_features.py
mv flow_features.csv attack_flows.csv
mv baseline_packets.csv packets.csv
mv baseline_flows.csv flow_features.csv

cat > ~/network-lab/score_attacks.py << 'EOF'
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

feature_cols = ['packet_count', 'total_bytes', 'mean_bytes', 'std_bytes',
                'duration', 'bytes_per_pkt', 'pkt_rate', 'byte_rate',
                'small_pkt_ratio']

# Refit the scaler and model on the baseline flows (same settings as Step 3)
baseline = pd.read_csv('flow_features.csv')
scaler = StandardScaler().fit(baseline[feature_cols].fillna(0))
iso = IsolationForest(n_estimators=200, contamination=0.05, random_state=42)
iso.fit(scaler.transform(baseline[feature_cols].fillna(0)))

# Score the attack flows with the baseline scaler and model
attack_flows = pd.read_csv('attack_flows.csv')
X_scaled = scaler.transform(attack_flows[feature_cols].fillna(0))
attack_flows['anomaly'] = iso.predict(X_scaled)
attack_flows['score'] = iso.score_samples(X_scaled)

flagged = attack_flows[attack_flows['anomaly'] == -1]
print(f"Attack flows flagged: {len(flagged)}/{len(attack_flows)}")
print("\nHighest anomaly scores:")
print(flagged.nsmallest(15, 'score')[
    ['flow_id', 'packet_count', 'pkt_rate', 'duration', 'score']])
EOF
python3 score_attacks.py
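Refitting on the baseline CSV works, but an alternative is to save the fitted scaler and model once and reload them when scoring. The sketch below is one possible approach, not part of the lab scripts: it wraps both steps in a scikit-learn pipeline and round-trips it through `pickle`. The random arrays are stand-ins for the baseline and attack feature matrices.

```python
import pickle
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
baseline = rng.normal(size=(300, 4))        # stand-in for baseline flow features
new_flows = rng.normal(size=(20, 4)) * 3    # stand-in for later traffic

# Bundle scaling and the model so they are always applied together
model = make_pipeline(
    StandardScaler(),
    IsolationForest(n_estimators=200, contamination=0.05, random_state=42),
)
model.fit(baseline)

# Serialize and restore; in practice the bytes would be written to a file
blob = pickle.dumps(model)
restored = pickle.loads(blob)

same_labels = np.array_equal(model.predict(new_flows), restored.predict(new_flows))
same_scores = np.allclose(model.score_samples(new_flows), restored.score_samples(new_flows))
print("Labels match after reload:", same_labels)
print("Scores match after reload:", same_scores)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.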
7 Detect C2 beaconing with periodicity analysis

C2 beacons connect at highly regular intervals. Detect them by measuring how regular the inter-arrival times are (a low coefficient of variation):

cat > ~/network-lab/beacon_detect.py << 'EOF'
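import pandas as pd, numpy as np
import matplotlib.pyplot as plt

# Load raw packet timestamps
df = pd.read_csv('attack_packets.csv')
df['time'] = pd.to_numeric(df['frame.time_epoch'], errors='coerce')
df = df.dropna(subset=['time', 'ip.src', 'ip.dst', 'tcp.dstport'])

# Group by source, destination, and destination port so that scan traffic
# to the same host does not drown out the beacon's timing
for (src_ip, dst_ip, dst_port), flow in df.groupby(['ip.src', 'ip.dst', 'tcp.dstport']):
    flow = flow.sort_values('time')

    # Each connection is a burst of packets; keep only the first packet of
    # each burst (gap > 1s) so intervals are measured between connections
    gaps = flow['time'].diff()
    starts = flow.loc[gaps.isna() | (gaps > 1.0), 'time']
    if len(starts) < 10: continue

    # Inter-arrival times between connection attempts
    iats = starts.diff().dropna()

    # Low standard deviation relative to mean = regular beaconing
    cv = iats.std() / (iats.mean() + 0.001)  # coefficient of variation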

    if cv < 0.3 and iats.mean() > 5:  # regular and interval > 5s
        print(f"\n[!] POTENTIAL C2 BEACON DETECTED")
        print(f"    Destination: {dst_ip}")
        print(f"    Packet count: {len(flow)}")
        print(f"    Mean interval: {iats.mean():.2f}s")
        print(f"    Std interval:  {iats.std():.2f}s")
        print(f"    Coefficient of variation: {cv:.3f} (low = regular)")

        plt.figure(figsize=(8,3))
        plt.plot(iats.values, marker='o', markersize=3)
        plt.axhline(iats.mean(), color='red', linestyle='--', label='Mean IAT')
        plt.title(f'Inter-Arrival Times — {dst_ip}')
        plt.xlabel('Packet #'); plt.ylabel('Seconds')
        plt.legend(); plt.savefig(f'beacon_{dst_ip.replace(".","-")}.png')
EOF
python3 beacon_detect.py
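To get a feel for the 0.3 threshold, the sketch below computes the coefficient of variation for two synthetic timelines: a beacon with the same 30 s ± 2 s jitter as the simulation, and irregular traffic with exponentially distributed gaps. Both series and the `coefficient_of_variation` helper are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def coefficient_of_variation(timestamps):
    """Std / mean of inter-arrival times; values near 0 mean very regular timing."""
    iats = np.diff(np.sort(timestamps))
    return iats.std(ddof=1) / (iats.mean() + 0.001)

# Beacon: one connection every 30 s with up to 2 s of jitter either way
beacon_times = np.cumsum(30 + rng.uniform(-2, 2, size=20))

# Irregular traffic: same average gap, but exponentially distributed
irregular_times = np.cumsum(rng.exponential(scale=30, size=200))

cv_beacon = coefficient_of_variation(beacon_times)
cv_irregular = coefficient_of_variation(irregular_times)
print(f"Beacon CV:    {cv_beacon:.3f}")
print(f"Irregular CV: {cv_irregular:.3f}")
```

Malware that adds heavy jitter pushes its CV up toward the irregular case, so this check catches only beacons with modest randomization.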
Phase 4: Sigma Rules & Detection Reporting
8 Tune model contamination threshold

Adjust the contamination parameter to balance detection rate vs false positives:

python3 << 'EOF'
import pandas as pd, numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

flows = pd.read_csv('flow_features.csv')
feature_cols = ['packet_count', 'total_bytes', 'mean_bytes', 'bytes_per_pkt',
                'pkt_rate', 'duration', 'std_bytes', 'small_pkt_ratio']
X = StandardScaler().fit_transform(flows[feature_cols].fillna(0))

print(f"{'Contamination':>15} | {'Flagged':>8} | {'Flag Rate':>10} | {'Min Score':>10}")
print("-" * 55)
for c in [0.01, 0.02, 0.05, 0.10, 0.15, 0.20]:
    iso = IsolationForest(n_estimators=200, contamination=c, random_state=42)
    preds = iso.fit_predict(X)
    scores = iso.score_samples(X)
    flagged = (preds == -1).sum()
    print(f"{c:>15.2f} | {flagged:>8} | {flagged/len(flows)*100:>9.1f}% | {scores.min():>10.4f}")
EOF

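Set contamination based on your environment's expected anomaly rate. Because contamination fixes the fraction of flows that get flagged, the flag rate in the table tracks the value you set, so choose a rate your analysts can realistically triage. This lab suggests 2-5% as a starting point for SOC environments.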
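In scikit-learn's IsolationForest, contamination only sets the score threshold; for a fixed random_state the trees are built the same way. That suggests a cheaper way to explore thresholds, sketched here on synthetic data: fit once, then cut the raw scores at different percentiles. The random matrix stands in for the scaled flow features.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))   # stand-in for scaled flow features

# Fit once; the raw scores do not depend on the contamination setting
iso = IsolationForest(n_estimators=200, contamination='auto', random_state=42).fit(X)
scores = iso.score_samples(X)   # lower = more anomalous

flag_counts = {}
for c in [0.01, 0.02, 0.05, 0.10]:
    threshold = np.percentile(scores, 100 * c)   # score below which c of the flows fall
    flag_counts[c] = int((scores < threshold).sum())
    print(f"contamination={c:.2f}  threshold={threshold:.4f}  flagged={flag_counts[c]}")
```

Looking at the threshold values side by side can show where the scores drop off sharply, which is often a more meaningful cut than a round percentage.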

9 Write Sigma rules for port scan and beacon patterns
cat > ~/network-lab/sigma_port_scan.yml << 'EOF'
title: Network Port Scan Detection
id: a1b2c3d4-e5f6-7890-abcd-ef1234567890
status: experimental
description: Detects vertical port scanning (many ports probed by one source) based on high distinct destination port count
author: CyberSec Pro Academy - L06
date: 2024/01/15
logsource:
  category: network_connection
  product: zeek
detection:
  selection:
    src_ip|startswith: '192.168.'
  timeframe: 60s
  condition: selection | count(dst_port) by src_ip > 100
falsepositives:
  - Network scanners (Nessus, Qualys); allowlist scanner IPs
  - Load balancer health checks
level: medium
tags:
  - attack.discovery
  - attack.t1046
EOF

cat > ~/network-lab/sigma_c2_beacon.yml << 'EOF'
title: C2 Beaconing - Regular Interval Connections
id: b2c3d4e5-f6a7-8901-bcde-f01234567891
status: experimental
description: Detects C2 beaconing via regular-interval connections to same external host
author: CyberSec Pro Academy - L06
logsource:
  category: network_connection
detection:
  selection:
    dst_port:
      - 443
      - 80
      - 4444
      - 8080
    connection_count|gte: 10
  filter_internal:
    dst_ip|startswith:
      - '10.'
      - '172.16.'
      - '192.168.'
  timeframe: 1h
  condition: selection and not filter_internal | count() by src_ip,dst_ip > 8
level: high
tags:
  - attack.command_and_control
  - attack.t1071
  - attack.t1571
EOF
echo "Sigma rules written"
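The port scan rule's logic can be checked offline before it is deployed to a SIEM. The sketch below approximates the rule in pandas: it counts distinct destination ports per source IP in fixed 60-second buckets and reports sources above 100. The `port_scan_sources` helper, the `ts`/`src_ip`/`dst_port` column names, and the sample connections are all hypothetical, and fixed buckets are a simplification of how a SIEM backend may evaluate the timeframe.

```python
import pandas as pd

def port_scan_sources(conns, window_s=60, min_ports=100):
    """Return source IPs that hit more than min_ports distinct ports in one time bucket."""
    bucketed = conns.assign(bucket=(conns['ts'] // window_s).astype(int))
    distinct_ports = bucketed.groupby(['src_ip', 'bucket'])['dst_port'].nunique()
    hits = distinct_ports[distinct_ports > min_ports]
    return sorted(hits.index.get_level_values('src_ip').unique())

# Hypothetical scanner: 500 different ports within 5 seconds
scanner = pd.DataFrame({
    'ts': [i * 0.01 for i in range(500)],
    'src_ip': '192.168.56.10',
    'dst_port': range(1, 501),
})

# Hypothetical normal client: 30 connections to port 443 over 5 minutes
client = pd.DataFrame({
    'ts': [i * 10.0 for i in range(30)],
    'src_ip': '192.168.56.20',
    'dst_port': 443,
})

conns = pd.concat([scanner, client], ignore_index=True)
flagged_sources = port_scan_sources(conns)
print("Sources matching the port scan logic:", flagged_sources)
```

A scan that straddles a bucket boundary is split across two counts here, so a slow scan near the threshold could be missed; a sliding window avoids that at the cost of more computation.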
10 Map detections to MITRE ATT&CK

Document which ATT&CK techniques your detections cover:

Detection | ATT&CK Technique | Tactic
Port scan (Isolation Forest) | T1046 - Network Service Discovery | Discovery
C2 beaconing (periodicity) | T1071 - Application Layer Protocol | Command and Control
Large upload flows (byte_rate) | T1048 - Exfiltration Over Alternative Protocol | Exfiltration
DBSCAN noise flows | T1571 - Non-Standard Port | Command and Control
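If the detection scripts emit alerts programmatically, the same mapping can be kept as a small lookup so every alert carries its technique ID. This is a minimal sketch; the dictionary keys and the `tag_detection` helper are hypothetical names chosen for illustration, while the technique IDs come from the table above.

```python
# Hypothetical detection names mapped to (technique ID, technique name, tactic)
ATTACK_MAP = {
    'port_scan':    ('T1046', 'Network Service Discovery', 'Discovery'),
    'c2_beacon':    ('T1071', 'Application Layer Protocol', 'Command and Control'),
    'large_upload': ('T1048', 'Exfiltration Over Alternative Protocol', 'Exfiltration'),
    'dbscan_noise': ('T1571', 'Non-Standard Port', 'Command and Control'),
}

def tag_detection(name):
    """Return a short ATT&CK label for a detection, or 'unmapped' if unknown."""
    if name not in ATTACK_MAP:
        return 'unmapped'
    technique_id, _, tactic = ATTACK_MAP[name]
    return f"{technique_id} ({tactic})"

for detection in ['port_scan', 'c2_beacon', 'dns_tunnel']:
    print(f"{detection}: {tag_detection(detection)}")
```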
11 Document findings and produce detection report

Record your lab results. Use the AI analyst to help structure your network detection report.

Lab Findings

Metric | Value
Total flows analyzed |
Anomalies flagged (Isolation Forest) |
Port scan detected |
Beacon detected |
Optimal contamination value |
MITRE techniques covered |

Next: Lab L07 — Behavioral Analytics (UEBA)

Detect insider threats and compromised accounts using user and entity behavior analytics.
