Intermediate AI-Assisted Network Security

L06: Network Anomaly Detection with ML

Apply unsupervised machine learning (Isolation Forest, DBSCAN) to detect anomalous network traffic patterns without labeled data. Capture live traffic with tshark, engineer flow features, and tune your model to surface port scans, C2 beaconing, and data exfiltration.

Python 3, scikit-learn, tshark, pandas, Kali Linux, Wireshark

Phase 1: Traffic Capture & Flow Feature Extraction

1 Capture baseline traffic with tshark

On Kali Linux, capture 5 minutes (300 seconds) of baseline traffic on the host-only network (192.168.56.0/24):

sudo tshark -i eth1 -a duration:300 \
  -w ~/network-lab/baseline_capture.pcap \
  -f "net 192.168.56.0/24"

# Export to CSV for feature engineering
tshark -r ~/network-lab/baseline_capture.pcap \
  -T fields \
  -e frame.time_epoch \
  -e ip.src \
  -e ip.dst \
  -e ip.proto \
  -e tcp.srcport \
  -e tcp.dstport \
  -e udp.srcport \
  -e udp.dstport \
  -e frame.len \
  -e tcp.flags \
  -E header=y -E separator=, -E occurrence=f \
  > ~/network-lab/packets.csv

# Subtract 1 for the CSV header row
echo "Captured $(( $(wc -l < ~/network-lab/packets.csv) - 1 )) packets"

2 Engineer NetFlow-style features from packets

Aggregate raw packets into bidirectional flows with statistical features:

cat > ~/network-lab/flow_features.py << 'EOF'
import pandas as pd
import numpy as np

df = pd.read_csv('packets.csv')
df['time'] = pd.to_numeric(df['frame.time_epoch'], errors='coerce')
df['length'] = pd.to_numeric(df['frame.len'], errors='coerce')

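# Keep IP packets only, and take ports from TCP or UDP, whichever is present
df = df.dropna(subset=['ip.src', 'ip.dst']).copy()
for col in ['tcp.srcport', 'tcp.dstport', 'udp.srcport', 'udp.dstport']:
    if col not in df.columns:
        df[col] = np.nan
df['sport'] = pd.to_numeric(df['tcp.srcport'].fillna(df['udp.srcport']), errors='coerce').astype('Int64')
df['dport'] = pd.to_numeric(df['tcp.dstport'].fillna(df['udp.dstport']), errors='coerce').astype('Int64')

# Create a bidirectional flow key (5-tuple): sort the two (ip, port) endpoints
# together so both directions of a conversation map to the same flow
def flow_key(row):
    a = (str(row['ip.src']), str(row['sport']))
    b = (str(row['ip.dst']), str(row['dport']))
    lo, hi = sorted([a, b])
    return f"{lo[0]}|{hi[0]}|{row['ip.proto']}|{lo[1]}|{hi[1]}"

df['flow_id'] = df.apply(flow_key, axis=1)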

# Aggregate flow statistics
flows = df.groupby('flow_id').agg(
    packet_count=('length', 'count'),
    total_bytes=('length', 'sum'),
    mean_bytes=('length', 'mean'),
    std_bytes=('length', 'std'),
    min_bytes=('length', 'min'),
    max_bytes=('length', 'max'),
    duration=('time', lambda x: x.max() - x.min()),
    start_time=('time', 'min'),
).reset_index()

# Derived features
flows['bytes_per_pkt'] = flows['total_bytes'] / flows['packet_count']
flows['pkt_rate'] = flows['packet_count'] / (flows['duration'] + 0.001)
flows['byte_rate'] = flows['total_bytes'] / (flows['duration'] + 0.001)
# Fraction of packets under 100 bytes in each flow
flows['small_pkt_ratio'] = flows['flow_id'].map(
    df['length'].lt(100).groupby(df['flow_id']).mean())

flows.fillna(0, inplace=True)
flows.to_csv('flow_features.csv', index=False)
print(f"Generated {len(flows)} flows with {flows.shape[1]} features")
print(flows.describe())
EOF
cd ~/network-lab && python3 flow_features.py
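
To illustrate what "bidirectional" means for the flow key, here is a minimal standalone sketch. The helper name `canonical_flow_key` and the Kali address 192.168.56.10 are assumptions made for this example only; the idea is that the two (IP, port) endpoints are sorted as pairs, so a request and its reply share one key while a connection from a different source port does not.

```python
def canonical_flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Return the same key for both directions of one conversation."""
    a = (src_ip, src_port)
    b = (dst_ip, dst_port)
    # Sort the endpoints as (ip, port) pairs so each port stays attached to its IP
    lo, hi = sorted([a, b])
    return f"{lo[0]}:{lo[1]}|{hi[0]}:{hi[1]}|{proto}"

# Hypothetical addresses: 192.168.56.10 stands in for the Kali host
request = canonical_flow_key("192.168.56.10", 51514, "192.168.56.101", 80, 6)
reply = canonical_flow_key("192.168.56.101", 80, "192.168.56.10", 51514, 6)
other = canonical_flow_key("192.168.56.10", 51515, "192.168.56.101", 80, 6)

print(request == reply)   # both directions collapse into one flow
print(request == other)   # a new source port is a different flow
```

Sorting the IPs and the ports independently would also produce a direction-independent key, but it can merge unrelated conversations because a port is no longer tied to the host that used it.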

Phase 2: Unsupervised Anomaly Detection

3 Train Isolation Forest on flow features

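Isolation Forest partitions the data with random splits. Anomalous points sit far from the bulk of the data, so they are isolated after fewer splits than normal points, and a short average path length marks a flow as an outlier. No labels are required: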

cat > ~/network-lab/isolation_forest.py << 'EOF'
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

flows = pd.read_csv('flow_features.csv')

feature_cols = ['packet_count', 'total_bytes', 'mean_bytes', 'std_bytes',
                'duration', 'bytes_per_pkt', 'pkt_rate', 'byte_rate',
                'small_pkt_ratio']

X = flows[feature_cols].fillna(0)

# Normalize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train Isolation Forest
# contamination=0.05 sets the score cutoff so that ~5% of the training flows are flagged
iso = IsolationForest(
    n_estimators=200,
    contamination=0.05,
    max_samples='auto',
    random_state=42
)
flows['anomaly_score'] = iso.fit_predict(X_scaled)
flows['anomaly_raw'] = iso.score_samples(X_scaled)

anomalies = flows[flows['anomaly_score'] == -1]
print(f"\nDetected {len(anomalies)} anomalous flows ({len(anomalies)/len(flows)*100:.1f}%)")
print("\nTop anomalies by score:")
print(anomalies.nsmallest(10, 'anomaly_raw')[
    ['flow_id', 'packet_count', 'total_bytes', 'pkt_rate', 'anomaly_raw']])

# PCA visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.figure(figsize=(10,6))
is_anomaly = (flows['anomaly_score'] == -1).to_numpy()
# Plot each class in its own call so the legend labels match the colors
plt.scatter(X_pca[~is_anomaly,0], X_pca[~is_anomaly,1], c='steelblue', alpha=0.6, s=20, label='Normal')
plt.scatter(X_pca[is_anomaly,0], X_pca[is_anomaly,1], c='red', alpha=0.6, s=20, label='Anomaly')
plt.title('Network Flow Anomaly Detection (Isolation Forest)')
plt.xlabel('PCA Component 1'); plt.ylabel('PCA Component 2')
plt.legend(loc='upper right')
plt.savefig('anomaly_pca.png', dpi=150)
print("\nVisualization saved to anomaly_pca.png")
EOF
python3 isolation_forest.py
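
The isolation idea can be seen without scikit-learn. The following is a simplified one-feature sketch, not the library's implementation: the function names, the synthetic packet rates, and the single extreme value of 500 are all assumptions for illustration. It counts how many random splits are needed before a chosen value is alone in its partition and averages that over many trials.

```python
import random

def isolation_depth(values, target, rng, max_depth=50):
    """Count random splits until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:
            break  # only duplicates left, cannot split further
        split = rng.uniform(lo, hi)
        # Keep the side of the split that contains the target
        current = [v for v in current if (v < split) == (target < split)]
        depth += 1
    return depth

def mean_depth(values, target, trials=200, seed=42):
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(trials)) / trials

# Synthetic packet rates: 200 ordinary flows plus one extreme flow (hypothetical data)
data_rng = random.Random(0)
pkt_rates = [data_rng.gauss(10, 1) for _ in range(200)] + [500.0]

normal_depth = mean_depth(pkt_rates, pkt_rates[0])
outlier_depth = mean_depth(pkt_rates, 500.0)
print(f"normal flow:  {normal_depth:.2f} splits on average")
print(f"extreme flow: {outlier_depth:.2f} splits on average")
```

The extreme value is usually separated by the very first split, while an ordinary value needs many more. Isolation Forest turns this average depth into the score reported by `score_samples`, where lower means more anomalous.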

4 Apply DBSCAN for cluster-based anomaly detection

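DBSCAN groups flows that lie in dense regions of feature space into clusters and labels every flow that fits no cluster as noise (cluster label -1). Treat the noise flows as potential anomalies: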

cat > ~/network-lab/dbscan_anomaly.py << 'EOF'
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

flows = pd.read_csv('flow_features.csv')
feature_cols = ['packet_count', 'total_bytes', 'mean_bytes', 'bytes_per_pkt',
                'pkt_rate', 'duration']
X = StandardScaler().fit_transform(flows[feature_cols].fillna(0))

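# Find a candidate epsilon using the k-distance graph.
# kneighbors() without arguments excludes each point itself, so the last
# column is the true 4th-nearest-neighbor distance (min_samples=5 counts the point)
nbrs = NearestNeighbors(n_neighbors=4).fit(X)
distances, _ = nbrs.kneighbors()
distances = np.sort(distances[:, -1])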
plt.figure(figsize=(8,4))
plt.plot(distances)
plt.xlabel('Points sorted by distance')
plt.ylabel('4th NN distance (epsilon)')
plt.title('K-Distance Graph — Choose epsilon at elbow')
plt.savefig('kdistance.png')
print("Examine kdistance.png to choose epsilon value")

# Apply DBSCAN (adjust eps from k-distance graph)
db = DBSCAN(eps=1.5, min_samples=5)
flows['cluster'] = db.fit_predict(X)
noise = flows[flows['cluster'] == -1]

n_clusters = flows.loc[flows['cluster'] != -1, 'cluster'].nunique()  # exclude the noise label -1
print(f"\nDBSCAN clusters found: {n_clusters}")
print(f"Noise points (anomalies): {len(noise)}")
print("\nNoise flows sample:")
print(noise[['flow_id', 'packet_count', 'total_bytes', 'pkt_rate']].head(10))
flows.to_csv('flows_clustered.csv', index=False)
EOF
python3 dbscan_anomaly.py
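
Reading the elbow off kdistance.png is a visual judgment. As a rough cross-check, the sketch below picks the point on a sorted k-distance curve that lies furthest below the straight line joining its first and last points. This heuristic and the name `knee_index` are assumptions added for this example, not part of the lab's script, and the distances are synthetic.

```python
def knee_index(sorted_dists):
    """Index of the point furthest below the chord from first to last value."""
    n = len(sorted_dists)
    first, last = sorted_dists[0], sorted_dists[-1]
    best_i, best_gap = 0, float("-inf")
    for i, d in enumerate(sorted_dists):
        chord = first + (last - first) * i / (n - 1)  # height of the chord at i
        gap = chord - d
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i

# Synthetic curve: 95 flows in dense regions, then 5 increasingly isolated flows
dists = [0.1 + 0.001 * i for i in range(95)] + [1.0, 2.0, 3.0, 4.0, 5.0]

knee = knee_index(dists)
eps_candidate = dists[knee]
print(f"knee at index {knee}, candidate eps = {eps_candidate:.3f}")
```

On real flow data the curve is noisier, so treat the result as a starting value for `eps` and confirm it against the plot.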

Phase 3: Attack Traffic Injection & Detection

5 Simulate port scan (Nmap) and verify detection

From Kali, capture a port scan and a simulated C2 beacon against Metasploitable (192.168.56.101) so the following steps can check whether your models flag them:

# Terminal 1: start a 15-minute capture (covers the scan plus ~10 minutes of beacons)
sudo tshark -i eth1 -a duration:900 \
  -w ~/network-lab/attack_capture.pcap \
  -f "net 192.168.56.0/24"

# Terminal 2: inject attack traffic while the capture runs
# Port scan: generates many short-duration flows with small packets (SYN scan needs root)
sudo nmap -sS -p 1-1000 192.168.56.101

# C2 beacon simulation: 20 connections at regular ~30 s intervals, small packets
python3 -c "
import socket, time, random
for i in range(20):
    try:
        s = socket.socket()
        s.settimeout(1)
        s.connect(('192.168.56.101', 4444))
        s.send(b'beacon\x00' * 10)
        s.close()
    except OSError:
        pass  # port closed or host unreachable
    time.sleep(30 + random.uniform(-2, 2))  # jitter
print('Beacon simulation complete')
"

# After the capture ends, export the attack packets with the same fields as step 1
cd ~/network-lab
tshark -r attack_capture.pcap \
  -T fields -e frame.time_epoch -e ip.src -e ip.dst \
  -e ip.proto -e tcp.srcport -e tcp.dstport \
  -e udp.srcport -e udp.dstport \
  -e frame.len -e tcp.flags \
  -E header=y -E separator=, -E occurrence=f > attack_packets.csv

6 Score attack traffic against trained model

Extract flow features from the attack capture, then score them with an Isolation Forest fitted on the baseline flows:

# Build attack flow features in a scratch directory so the baseline
# packets.csv and flow_features.csv are not overwritten
cd ~/network-lab
mkdir -p attack_tmp
cp attack_packets.csv attack_tmp/packets.csv
(cd attack_tmp && python3 ../flow_features.py)
mv attack_tmp/flow_features.csv attack_flows.csv

cat > ~/network-lab/score_attacks.py << 'EOF'
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

feature_cols = ['packet_count', 'total_bytes', 'mean_bytes', 'std_bytes',
                'duration', 'bytes_per_pkt', 'pkt_rate', 'byte_rate',
                'small_pkt_ratio']

# Fit the scaler and model on the baseline flows (same settings as step 3)
baseline = pd.read_csv('flow_features.csv')
X_base = baseline[feature_cols].fillna(0)
scaler = StandardScaler().fit(X_base)
iso = IsolationForest(n_estimators=200, contamination=0.05, random_state=42)
iso.fit(scaler.transform(X_base))

# Score the attack flows with the baseline scaler and model
attack_flows = pd.read_csv('attack_flows.csv')
X_scaled = scaler.transform(attack_flows[feature_cols].fillna(0))
attack_flows['anomaly'] = iso.predict(X_scaled)
attack_flows['score'] = iso.score_samples(X_scaled)

flagged = attack_flows[attack_flows['anomaly'] == -1]
print(f"Attack flows flagged: {len(flagged)}/{len(attack_flows)}")
print("\nMost anomalous flows (lowest score first):")
print(flagged.nsmallest(15, 'score')[
    ['flow_id', 'packet_count', 'pkt_rate', 'duration', 'score']])
EOF
python3 score_attacks.py
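
The scaler is fitted on baseline flows only, so attack flows are measured against what "normal" looked like. A minimal sketch of that idea for one feature follows; the packet-rate numbers are invented for illustration. It uses the population standard deviation, which is what `StandardScaler` computes.

```python
from statistics import mean, pstdev

# Hypothetical pkt_rate values; the second attack value mimics a SYN scan burst
baseline_pkt_rate = [8.0, 9.5, 10.0, 10.5, 11.0, 12.0]
attack_pkt_rate = [10.2, 950.0]

# Statistics come from the baseline only
mu = mean(baseline_pkt_rate)
sigma = pstdev(baseline_pkt_rate)

def zscore(x):
    """Scale a value with the baseline mean and standard deviation."""
    return (x - mu) / sigma

scaled = [zscore(x) for x in attack_pkt_rate]
print([round(z, 2) for z in scaled])
```

If the scaler were refitted on the attack capture instead, the scan flows would shift the mean and inflate the standard deviation, and the same burst would look far less extreme.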

7 Detect C2 beaconing with periodicity analysis

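C2 beacons have highly regular inter-packet timing. Detect them by measuring how regular the gaps are: a low coefficient of variation (standard deviation divided by mean) of the inter-arrival times indicates beaconing: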

cat > ~/network-lab/beacon_detect.py << 'EOF'
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

# Load raw packet timestamps grouped by flow
df = pd.read_csv('attack_packets.csv')
df['time'] = pd.to_numeric(df['frame.time_epoch'], errors='coerce')
df = df.dropna(subset=['time'])

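# Group by (source, destination, destination port) so scan traffic to other
# ports on the same host does not drown out a beacon to a single service
for (src_ip, dst_ip, dport), flow in df.groupby(['ip.src', 'ip.dst', 'tcp.dstport']):
    flow = flow.sort_values('time')
    if len(flow) < 10: continue

    # Inter-arrival times; gaps under 1 s are treated as packets of the same
    # connection, so only the gaps between connections are kept
    iats = flow['time'].diff().dropna()
    iats = iats[iats > 1.0]
    if len(iats) < 5: continue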

    # Low standard deviation relative to mean = regular beaconing
    cv = iats.std() / (iats.mean() + 0.001)  # coefficient of variation

    if cv < 0.3 and iats.mean() > 5:  # regular and interval > 5s
        print(f"\n[!] POTENTIAL C2 BEACON DETECTED")
        print(f"    Destination: {dst_ip}")
        print(f"    Packet count: {len(flow)}")
        print(f"    Mean interval: {iats.mean():.2f}s")
        print(f"    Std interval:  {iats.std():.2f}s")
        print(f"    Coefficient of variation: {cv:.3f} (low = regular)")

        plt.figure(figsize=(8,3))
        plt.plot(iats.values, marker='o', markersize=3)
        plt.axhline(iats.mean(), color='red', linestyle='--', label='Mean IAT')
        plt.title(f'Inter-Arrival Times — {dst_ip}')
        plt.xlabel('Packet #'); plt.ylabel('Seconds')
        plt.legend(); plt.savefig(f'beacon_{dst_ip.replace(".","-")}.png')
EOF
python3 beacon_detect.py
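
To see why the coefficient of variation separates beacons from ordinary traffic, the sketch below compares two synthetic sets of gaps: a 30-second beacon with the same ±2 s jitter used in the simulation, and irregular gaps drawn from an exponential distribution with the same mean. The exponential model for ordinary traffic is an assumption chosen for illustration.

```python
import random
from statistics import mean, stdev

def coefficient_of_variation(intervals):
    """Standard deviation relative to the mean; low values mean regular timing."""
    return stdev(intervals) / mean(intervals)

rng = random.Random(7)
beacon_gaps = [30 + rng.uniform(-2, 2) for _ in range(50)]    # jittered beacon
browsing_gaps = [rng.expovariate(1 / 30) for _ in range(50)]  # irregular, mean ~30 s

cv_beacon = coefficient_of_variation(beacon_gaps)
cv_browsing = coefficient_of_variation(browsing_gaps)
print(f"beacon CV:   {cv_beacon:.3f}")
print(f"browsing CV: {cv_browsing:.3f}")
```

Both series have a similar mean interval, so the mean alone cannot tell them apart. The 0.3 cutoff used in beacon_detect.py sits between the two; a beacon with heavier jitter would need a looser cutoff.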

Phase 4: Sigma Rules & Detection Reporting

8 Tune model contamination threshold

Adjust the contamination parameter to balance detection rate vs false positives:

python3 << 'EOF'
import pandas as pd, numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

flows = pd.read_csv('flow_features.csv')
feature_cols = ['packet_count', 'total_bytes', 'mean_bytes', 'std_bytes',
                'duration', 'bytes_per_pkt', 'pkt_rate', 'byte_rate',
                'small_pkt_ratio']  # same features as step 3
X = StandardScaler().fit_transform(flows[feature_cols].fillna(0))

print(f"{'Contamination':>15} | {'Flagged':>8} | {'Flag Rate':>10} | {'Threshold':>10}")
print("-" * 55)
for c in [0.01, 0.02, 0.05, 0.10, 0.15, 0.20]:
    iso = IsolationForest(n_estimators=200, contamination=c, random_state=42)
    preds = iso.fit_predict(X)
    flagged = (preds == -1).sum()
    # offset_ is the score_samples cutoff implied by this contamination value
    print(f"{c:>15.2f} | {flagged:>8} | {flagged/len(flows)*100:>9.1f}% | {iso.offset_:>10.4f}")
EOF

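Set contamination to the fraction of flows you expect to be anomalous in your environment. The value only moves the score cutoff; it does not change how flows are ranked. For SOC environments, 2-5% is a typical starting point, to be validated against traffic you know is benign.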
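
Roughly speaking, contamination picks a percentile of the anomaly scores as the cutoff. The sketch below imitates that on made-up scores; the helper `flag_lowest` is hypothetical and only approximates scikit-learn's exact percentile and comparison rules.

```python
import random

def flag_lowest(scores, contamination):
    """Flag the lowest-scoring fraction of points; return flags and the cutoff."""
    k = max(1, round(contamination * len(scores)))
    threshold = sorted(scores)[k - 1]
    return [s <= threshold for s in scores], threshold

# Made-up anomaly scores (more negative = more anomalous)
rng = random.Random(1)
scores = [-rng.uniform(0.35, 0.75) for _ in range(200)]

flags_5, cutoff_5 = flag_lowest(scores, 0.05)
flags_10, cutoff_10 = flag_lowest(scores, 0.10)
print(sum(flags_5), round(cutoff_5, 4))
print(sum(flags_10), round(cutoff_10, 4))
```

Every flow flagged at 5% is also flagged at 10%, because raising contamination only moves the cutoff up the same ranking. This is why the flagged percentage in the tuning table tracks the contamination value almost exactly and says little on its own about detection quality.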

9 Write Sigma rules for port scan and beacon patterns

cat > ~/network-lab/sigma_port_scan.yml << 'EOF'
title: Network Port Scan Detection
id: a1b2c3d4-e5f6-7890-abcd-ef1234567890
status: experimental
description: Detects vertical port scanning (one source probing many ports) based on a high distinct destination port count
author: CyberSec Pro Academy - L06
date: 2024/01/15
logsource:
    category: network_connection
    product: zeek
detection:
    selection:
        src_ip|startswith: '192.168.'
    timeframe: 60s
    # Legacy Sigma aggregation syntax; newer pySigma backends express this as a correlation rule
    condition: selection | count(dst_port) by src_ip > 100
falsepositives:
    - Network scanners (Nessus, Qualys); allowlist scanner IPs
    - Load balancer health checks
level: medium
tags:
    - attack.discovery
    - attack.t1046
EOF

cat > ~/network-lab/sigma_c2_beacon.yml << 'EOF'
title: C2 Beaconing - Regular Interval Connections
id: b2c3d4e5-f6a7-8901-bcde-f01234567891
status: experimental
description: Detects C2 beaconing via regular-interval connections to the same external host
author: CyberSec Pro Academy - L06
logsource:
    category: network_connection
detection:
    selection:
        dst_port:
            - 443
            - 80
            - 4444
            - 8080
        connection_count|gte: 10
    # Excludes all RFC 1918 destinations, so the lab beacon to 192.168.56.101
    # will not match unless this filter is relaxed for testing
    filter_internal:
        dst_ip|cidr:
            - '10.0.0.0/8'
            - '172.16.0.0/12'
            - '192.168.0.0/16'
    timeframe: 1h
    condition: selection and not filter_internal | count() by src_ip,dst_ip > 8
level: high
tags:
    - attack.command_and_control
    - attack.t1071
    - attack.t1571
EOF

echo "Sigma rules written"
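
To make the port scan rule's aggregation concrete, here is a minimal sketch of the same logic in Python: count distinct destination ports per source within a 60-second window and flag sources above 100. The function name, the event tuple layout, the fixed (non-sliding) windows, and the example addresses are assumptions for illustration; a real SIEM backend may window events differently.

```python
from collections import defaultdict

def port_scan_sources(events, window=60, threshold=100):
    """events: iterable of (timestamp, src_ip, dst_port) tuples.

    Returns the sorted source IPs that touched more than `threshold`
    distinct destination ports inside any single fixed window.
    """
    ports_seen = defaultdict(set)
    for ts, src_ip, dst_port in events:
        bucket = int(ts // window)  # fixed windows, an assumed simplification
        ports_seen[(src_ip, bucket)].add(dst_port)
    return sorted({src for (src, _), ports in ports_seen.items()
                   if len(ports) > threshold})

# Hypothetical events: one scanner sweeping ports 1-1000, one ordinary client
scan = [(1000.0 + i * 0.01, "192.168.56.10", port)
        for i, port in enumerate(range(1, 1001))]
client = [(1000.0 + i, "192.168.56.20", 443) for i in range(50)]

flagged_sources = port_scan_sources(scan + client)
print(flagged_sources)
```

The client makes many connections but to a single port, so it stays under the threshold; the rule keys on port diversity, not connection volume.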

10 Map detections to MITRE ATT&CK

Document which ATT&CK techniques your detections cover:

Detection	ATT&CK Technique	Tactic
Port scan (Isolation Forest)	T1046: Network Service Discovery	Discovery
C2 beaconing (periodicity)	T1071: Application Layer Protocol	Command and Control
Large upload flows (byte_rate)	T1048: Exfiltration Over Alternative Protocol	Exfiltration
DBSCAN noise flows	T1571: Non-Standard Port	Command and Control

11 Document findings and produce detection report

Record your lab results. Use the AI analyst to help structure your network detection report.

Lab Findings

Metric	Value
Total flows analyzed
Anomalies flagged (Isolation Forest)
Port scan detected
Beacon detected
Optimal contamination value
MITRE techniques covered


Next: Lab L07 — Behavioral Analytics (UEBA)

Detect insider threats and compromised accounts using user and entity behavior analytics.
