Apply unsupervised machine learning (Isolation Forest, DBSCAN) to detect anomalous network traffic patterns without labeled data. Capture live traffic with tshark, engineer flow features, and tune your model to surface port scans, C2 beaconing, and data exfiltration.
On Kali Linux, capture 5 minutes of baseline traffic in the host-only network (192.168.56.0/24):
sudo tshark -i eth1 -a duration:300 \
  -w ~/network-lab/baseline_capture.pcap \
  -f "net 192.168.56.0/24"
# Export to CSV for feature engineering
tshark -r ~/network-lab/baseline_capture.pcap \
  -T fields \
  -e frame.time_epoch \
  -e ip.src \
  -e ip.dst \
  -e ip.proto \
  -e tcp.srcport \
  -e tcp.dstport \
  -e udp.srcport \
  -e udp.dstport \
  -e frame.len \
  -e tcp.flags \
  -E header=y -E separator=, \
  > ~/network-lab/packets.csv
echo "Captured $(wc -l < ~/network-lab/packets.csv) packets"
Aggregate raw packets into bidirectional flows with statistical features:
cat > ~/network-lab/flow_features.py << 'EOF'
import pandas as pd
import numpy as np
df = pd.read_csv('packets.csv')
df['time'] = pd.to_numeric(df['frame.time_epoch'], errors='coerce')
df['length'] = pd.to_numeric(df['frame.len'], errors='coerce')
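# Create a direction-independent flow key from the 5-tuple
def first_present(row, *cols):
    # Return the first non-empty value among the columns (TCP port, else UDP port).
    # A plain row.get(a, row.get(b)) never falls back, because the TCP column
    # exists for every row and simply holds NaN for UDP packets.
    for col in cols:
        val = row.get(col)
        if pd.notna(val):
            return str(val)
    return ''

def flow_key(row):
    src, dst = sorted([str(row.get('ip.src','')), str(row.get('ip.dst',''))])
    proto = str(row.get('ip.proto',''))
    sp = first_present(row, 'tcp.srcport', 'udp.srcport')
    dp = first_present(row, 'tcp.dstport', 'udp.dstport')
    p1, p2 = sorted([sp, dp])
    return f"{src}|{dst}|{proto}|{p1}|{p2}"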
df['flow_id'] = df.apply(flow_key, axis=1)
# Aggregate flow statistics
flows = df.groupby('flow_id').agg(
    packet_count=('length', 'count'),
    total_bytes=('length', 'sum'),
    mean_bytes=('length', 'mean'),
    std_bytes=('length', 'std'),
    min_bytes=('length', 'min'),
    max_bytes=('length', 'max'),
    duration=('time', lambda x: x.max() - x.min()),
    start_time=('time', 'min'),
).reset_index()
# Derived features
flows['bytes_per_pkt'] = flows['total_bytes'] / flows['packet_count']
flows['pkt_rate'] = flows['packet_count'] / (flows['duration'] + 0.001)
flows['byte_rate'] = flows['total_bytes'] / (flows['duration'] + 0.001)
flows['small_pkt_ratio'] = (flows['min_bytes'] < 100).astype(int)
flows.fillna(0, inplace=True)
flows.to_csv('flow_features.csv', index=False)
print(f"Generated {len(flows)} flows with {flows.shape[1]} features")
print(flows.describe())
EOF
cd ~/network-lab && python3 flow_features.py
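To sanity-check the idea of a bidirectional flow, the sketch below builds a direction-independent key for three hand-made packets. It is an illustrative variant rather than the lab script itself: the helper name `bidirectional_flow_id` is hypothetical, and it pairs each IP with its own port before ordering the two endpoints, which is an assumption made here so that two different connections between the same hosts are not merged.

```python
import numpy as np
import pandas as pd

def bidirectional_flow_id(df):
    """Return one flow label per packet; both directions share the same label."""
    # Use the TCP port when present, otherwise fall back to the UDP port
    sport = df['tcp.srcport'].fillna(df['udp.srcport']).astype(str)
    dport = df['tcp.dstport'].fillna(df['udp.dstport']).astype(str)
    src_ep = df['ip.src'].astype(str) + ':' + sport
    dst_ep = df['ip.dst'].astype(str) + ':' + dport
    proto = df['ip.proto'].astype(str)
    # Sorting the two endpoints makes A->B and B->A produce the same string
    ids = ['|'.join(sorted([a, b]) + [p]) for a, b, p in zip(src_ep, dst_ep, proto)]
    return pd.Series(ids, index=df.index)

# Toy packets: a TCP request, its reply, and an unrelated UDP packet
packets = pd.DataFrame({
    'ip.src': ['10.0.0.1', '10.0.0.2', '10.0.0.1'],
    'ip.dst': ['10.0.0.2', '10.0.0.1', '10.0.0.3'],
    'ip.proto': [6, 6, 17],
    'tcp.srcport': [50000, 443, np.nan],
    'tcp.dstport': [443, 50000, np.nan],
    'udp.srcport': [np.nan, np.nan, 5353],
    'udp.dstport': [np.nan, np.nan, 53],
})
packets['flow_id'] = bidirectional_flow_id(packets)
print(packets[['ip.src', 'ip.dst', 'flow_id']])
```

The request and its reply land in the same flow, while the UDP packet gets its own key.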
Isolation Forest builds random partitioning trees and flags the points that can be isolated with unusually few splits, since outliers sit far from dense regions. No labels are required:
cat > ~/network-lab/isolation_forest.py << 'EOF'
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
flows = pd.read_csv('flow_features.csv')
feature_cols = ['packet_count', 'total_bytes', 'mean_bytes', 'std_bytes',
                'duration', 'bytes_per_pkt', 'pkt_rate', 'byte_rate',
                'small_pkt_ratio']
X = flows[feature_cols].fillna(0)
# Normalize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train Isolation Forest
# contamination=0.05 means ~5% of flows expected to be anomalous
iso = IsolationForest(
    n_estimators=200,
    contamination=0.05,
    max_samples='auto',
    random_state=42
)
flows['anomaly_score'] = iso.fit_predict(X_scaled)
flows['anomaly_raw'] = iso.score_samples(X_scaled)
anomalies = flows[flows['anomaly_score'] == -1]
print(f"\nDetected {len(anomalies)} anomalous flows ({len(anomalies)/len(flows)*100:.1f}%)")
print("\nTop anomalies by score:")
print(anomalies.nsmallest(10, 'anomaly_raw')[
    ['flow_id', 'packet_count', 'total_bytes', 'pkt_rate', 'anomaly_raw']])
# PCA visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
plt.figure(figsize=(10,6))
is_anomaly = (flows['anomaly_score'] == -1).to_numpy()
# Plot each class separately so the legend labels match the colors
plt.scatter(X_pca[~is_anomaly,0], X_pca[~is_anomaly,1], c='steelblue', alpha=0.6, s=20, label='Normal')
plt.scatter(X_pca[is_anomaly,0], X_pca[is_anomaly,1], c='red', alpha=0.6, s=20, label='Anomaly')
plt.title('Network Flow Anomaly Detection: Isolation Forest')
plt.xlabel('PCA Component 1'); plt.ylabel('PCA Component 2')
plt.legend(loc='upper right')
plt.savefig('anomaly_pca.png', dpi=150)
print("\nVisualization saved to anomaly_pca.png")
EOF
python3 isolation_forest.py
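To build intuition for the scores printed above, here is a minimal sketch on synthetic data, assuming a 2-D Gaussian cloud as "normal" traffic and one far-away point as the anomaly. The data and variable names are made up for illustration and are not part of the lab capture.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # dense cluster
outlier = np.array([[8.0, 8.0]])                         # isolated point
X_demo = np.vstack([normal, outlier])

demo_iso = IsolationForest(n_estimators=200, random_state=42).fit(X_demo)

# score_samples: the lower the score, the fewer splits were needed to isolate the point
scores = demo_iso.score_samples(X_demo)
print("median score of normal points:", round(float(np.median(scores[:-1])), 3))
print("score of the outlier:         ", round(float(scores[-1]), 3))
print("prediction for the outlier:   ", demo_iso.predict(outlier)[0])  # -1 means anomaly
```

The isolated point receives a clearly lower score than the bulk of the cluster, which is the same signal the `anomaly_raw` column carries for network flows.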
DBSCAN groups flows into dense clusters and labels any flow that belongs to no cluster as noise, which makes noise points candidate anomalies:
cat > ~/network-lab/dbscan_anomaly.py << 'EOF'
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
flows = pd.read_csv('flow_features.csv')
feature_cols = ['packet_count', 'total_bytes', 'mean_bytes', 'bytes_per_pkt',
                'pkt_rate', 'duration']
X = StandardScaler().fit_transform(flows[feature_cols].fillna(0))
# Find a suitable epsilon using the k-distance graph.
# kneighbors() without arguments excludes each point itself, so the last
# column is the true 4th-nearest-neighbor distance (k = min_samples - 1).
nbrs = NearestNeighbors(n_neighbors=4).fit(X)
distances, _ = nbrs.kneighbors()
distances = np.sort(distances[:, -1])
plt.figure(figsize=(8,4))
plt.plot(distances)
plt.xlabel('Points sorted by distance')
plt.ylabel('4th NN distance (epsilon)')
plt.title('K-Distance Graph — Choose epsilon at elbow')
plt.savefig('kdistance.png')
print("Examine kdistance.png to choose epsilon value")
# Apply DBSCAN (adjust eps from k-distance graph)
db = DBSCAN(eps=1.5, min_samples=5)
flows['cluster'] = db.fit_predict(X)
noise = flows[flows['cluster'] == -1]
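# Count real clusters only; subtracting 1 blindly is wrong when there is no noise label
n_clusters = flows.loc[flows['cluster'] != -1, 'cluster'].nunique()
print(f"\nDBSCAN clusters found: {n_clusters}")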
print(f"Noise points (anomalies): {len(noise)}")
print("\nNoise flows sample:")
print(noise[['flow_id', 'packet_count', 'total_bytes', 'pkt_rate']].head(10))
flows.to_csv('flows_clustered.csv', index=False)
EOF
python3 dbscan_anomaly.py
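Reading the elbow off kdistance.png by eye works, but a rough automatic estimate can serve as a starting point. The sketch below is one possible heuristic, not a standard scikit-learn feature: the hypothetical helper `knee_eps` picks the point where the sorted k-distance curve sags furthest below the straight line joining its endpoints. The synthetic blobs and outliers are assumptions for demonstration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def knee_eps(X, min_samples=5):
    """Estimate eps as the knee of the sorted k-distance curve (heuristic)."""
    # kneighbors() with no argument excludes each point from its own neighbors
    nbrs = NearestNeighbors(n_neighbors=min_samples - 1).fit(X)
    dist, _ = nbrs.kneighbors()
    y = np.sort(dist[:, -1])
    # Normalize both axes to [0, 1]; the knee maximizes the gap below the diagonal
    xn = np.linspace(0.0, 1.0, len(y))
    yn = (y - y[0]) / (y[-1] - y[0])
    return float(y[np.argmax(xn - yn)])

rng = np.random.default_rng(1)
blob_a = rng.normal([0, 0], 0.3, size=(100, 2))
blob_b = rng.normal([5, 5], 0.3, size=(100, 2))
outliers = np.array([[20, 20], [-15, 10], [10, -20], [-20, -20], [25, -5]], dtype=float)
X_demo = np.vstack([blob_a, blob_b, outliers])

eps = knee_eps(X_demo, min_samples=5)
labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_demo)
print(f"estimated eps: {eps:.3f}")
print("labels of the five planted outliers:", labels[-5:])
```

Treat the estimate as a first guess and confirm it against the plotted curve, since real flow data rarely has an elbow this sharp.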
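From Kali, run a port scan and a simulated C2 beacon against Metasploitable (192.168.56.101), then check whether your model flags them:
# Cache sudo credentials so the background capture does not stall on a password prompt
sudo -v
# Start the capture in the background; it stops on its own after 11 minutes,
# long enough to cover the scan plus all 20 beacons (about 10 minutes)
sudo tshark -i eth1 -a duration:660 -w ~/network-lab/attack_capture.pcap &
TSHARK_PID=$!
# Inject attack traffic from the same shell
# Port scan: generates many short-duration flows with small packets (SYN scans need root)
sudo nmap -sS -p 1-1000 192.168.56.101
# C2 beacon simulation: regular interval, small packets
python3 -c "
import socket, time, random
for i in range(20):
    try:
        s = socket.socket()
        s.settimeout(1)
        s.connect(('192.168.56.101', 4444))
        s.send(b'beacon\x00' * 10)
        s.close()
    except OSError:
        pass
    time.sleep(30 + random.uniform(-2, 2))  # jitter
print('Beacon simulation complete')
"
# Wait for the capture to reach its time limit
wait $TSHARK_PID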
# Now export packet fields from the attack capture
cd ~/network-lab
tshark -r attack_capture.pcap \
  -T fields -e frame.time_epoch -e ip.src -e ip.dst \
  -e ip.proto -e tcp.srcport -e tcp.dstport \
  -e udp.srcport -e udp.dstport \
  -e frame.len -E header=y -E separator=, > attack_packets.csv
Extract flow features from the attack capture and score them with an Isolation Forest fitted on the baseline flows:
cat > ~/network-lab/score_attacks.py << 'EOF'
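import shutil, subprocess
import pandas as pd, numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
# Build attack_flows.csv. flow_features.py always reads packets.csv and writes
# flow_features.csv, so stage the attack packets under that name and restore
# the baseline files afterwards.
shutil.copy('packets.csv', 'baseline_packets.csv')
shutil.copy('flow_features.csv', 'baseline_flows.csv')
try:
    shutil.copy('attack_packets.csv', 'packets.csv')
    subprocess.run(['python3', 'flow_features.py'], check=True)
    shutil.move('flow_features.csv', 'attack_flows.csv')
finally:
    shutil.move('baseline_packets.csv', 'packets.csv')
    shutil.move('baseline_flows.csv', 'flow_features.csv')
attack_flows = pd.read_csv('attack_flows.csv')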
feature_cols = ['packet_count', 'total_bytes', 'mean_bytes', 'std_bytes',
'duration', 'bytes_per_pkt', 'pkt_rate', 'byte_rate', 'small_pkt_ratio']
X_attack = attack_flows[feature_cols].fillna(0)
# Load baseline scaler and model (retrain from baseline CSV for consistency)
baseline = pd.read_csv('flow_features.csv')
scaler = StandardScaler().fit(baseline[feature_cols].fillna(0))
iso = IsolationForest(n_estimators=200, contamination=0.05, random_state=42)
iso.fit(scaler.transform(baseline[feature_cols].fillna(0)))
X_scaled = scaler.transform(X_attack)
attack_flows['anomaly'] = iso.predict(X_scaled)
attack_flows['score'] = iso.score_samples(X_scaled)
flagged = attack_flows[attack_flows['anomaly'] == -1]
print(f"Attack flows flagged: {len(flagged)}/{len(attack_flows)}")
print("\nHighest anomaly scores:")
print(flagged.nsmallest(15, 'score')[
['flow_id', 'packet_count', 'pkt_rate', 'duration', 'score']])
EOF
python3 score_attacks.py
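Refitting the scaler and model inside every scoring script works, but it is easy to let the two drift apart. One possible alternative, sketched below on synthetic stand-in data, is to wrap both steps in a scikit-learn pipeline and persist it once. The file name `baseline_iforest.pkl` and the random feature matrices are assumptions for illustration.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
baseline_X = rng.normal(size=(300, 4))  # stand-in for baseline flow features
new_X = rng.normal(size=(20, 4))        # stand-in for flows captured later

# Scaler and model are fitted together, so they can never get out of sync
model = make_pipeline(
    StandardScaler(),
    IsolationForest(n_estimators=200, contamination=0.05, random_state=42),
)
model.fit(baseline_X)

# Persist once after training on the baseline (hypothetical file name)
model_path = Path(tempfile.gettempdir()) / 'baseline_iforest.pkl'
with open(model_path, 'wb') as fh:
    pickle.dump(model, fh)

# Later, in the scoring script: load and score without refitting
with open(model_path, 'rb') as fh:
    restored = pickle.load(fh)

same_labels = np.array_equal(model.predict(new_X), restored.predict(new_X))
print("restored model gives identical predictions:", same_labels)
```

Only unpickle files you created yourself, and load them with the same scikit-learn version that wrote them.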
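C2 beacons call home at highly regular intervals. Detect them by checking how regular the inter-arrival times are, using their coefficient of variation (standard deviation divided by mean):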
cat > ~/network-lab/beacon_detect.py << 'EOF'
import pandas as pd
import matplotlib.pyplot as plt
# Load raw packet timestamps
df = pd.read_csv('attack_packets.csv')
df['time'] = pd.to_numeric(df['frame.time_epoch'], errors='coerce')
df = df.dropna(subset=['time', 'ip.src', 'ip.dst', 'tcp.srcport', 'tcp.dstport'])
# Collapse each TCP connection to its first packet so that bursts of packets
# inside one connection do not distort the interval statistics
conns = (df.groupby(['ip.src', 'ip.dst', 'tcp.dstport', 'tcp.srcport'])['time']
           .min().reset_index())
# For each (source, destination, destination port), compute inter-arrival statistics.
# Grouping by port keeps the port scan from drowning out the beacon to the same host.
for (src_ip, dst_ip, dport), flow in conns.groupby(['ip.src', 'ip.dst', 'tcp.dstport']):
    if len(flow) < 10:
        continue
    # Inter-arrival times between successive connections
    iats = flow['time'].sort_values().diff().dropna()
    # Low standard deviation relative to mean = regular beaconing
    cv = iats.std() / (iats.mean() + 0.001)  # coefficient of variation
    if cv < 0.3 and iats.mean() > 5:  # regular and interval > 5s
        print("\n[!] POTENTIAL C2 BEACON DETECTED")
        print(f"    Source: {src_ip}")
        print(f"    Destination: {dst_ip}:{int(dport)}")
        print(f"    Connection count: {len(flow)}")
        print(f"    Mean interval: {iats.mean():.2f}s")
        print(f"    Std interval: {iats.std():.2f}s")
        print(f"    Coefficient of variation: {cv:.3f} (low = regular)")
        plt.figure(figsize=(8,3))
        plt.plot(iats.values, marker='o', markersize=3)
        plt.axhline(iats.mean(), color='red', linestyle='--', label='Mean IAT')
        plt.title(f'Inter-Arrival Times: {dst_ip}:{int(dport)}')
        plt.xlabel('Connection #'); plt.ylabel('Seconds')
        plt.legend(); plt.savefig(f'beacon_{dst_ip.replace(".","-")}_{int(dport)}.png')
        plt.close()
EOF
python3 beacon_detect.py
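A frequency-domain check can complement the interval statistics. The following is a minimal sketch, assuming connection timestamps are binned into one-second counts and the analyst supplies a plausible range of beacon periods; the function name `dominant_period` and the synthetic 30-second schedule are illustrative assumptions. The demo applies jitter around a fixed schedule; sleep-based beacons accumulate drift, which broadens the spectral peak.

```python
import numpy as np

def dominant_period(timestamps, bin_s=1.0, min_period=20.0, max_period=120.0):
    """Return the strongest period (seconds) inside the search band, or None."""
    t = np.sort(np.asarray(timestamps, dtype=float))
    # Turn event times into a count-per-bin signal
    bins = ((t - t[0]) / bin_s).astype(int)
    counts = np.bincount(bins).astype(float)
    # Remove the mean so the zero-frequency term does not dominate
    spectrum = np.abs(np.fft.rfft(counts - counts.mean()))
    freqs = np.fft.rfftfreq(len(counts), d=bin_s)
    # Only consider frequencies whose period lies in [min_period, max_period]
    band = (freqs >= 1.0 / max_period) & (freqs <= 1.0 / min_period)
    if not band.any():
        return None
    peak = int(np.argmax(np.where(band, spectrum, 0.0)))
    return float(1.0 / freqs[peak])

# Synthetic beacon: 40 connections on a 30 s schedule with +/- 2 s jitter
rng = np.random.default_rng(0)
beacon_times = 1_700_000_000 + 30.0 * np.arange(40) + rng.uniform(-2, 2, size=40)

period = dominant_period(beacon_times)
print(f"estimated beacon period: {period:.1f} s")
```

Restricting the search band keeps harmonics of the beacon interval (15 s, 10 s, and so on) from being reported as the period.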
Adjust the contamination parameter to balance detection rate vs false positives:
python3 << 'EOF'
import pandas as pd, numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
flows = pd.read_csv('flow_features.csv')
feature_cols = ['packet_count', 'total_bytes', 'mean_bytes', 'bytes_per_pkt',
'pkt_rate', 'duration', 'std_bytes', 'small_pkt_ratio']
X = StandardScaler().fit_transform(flows[feature_cols].fillna(0))
print(f"{'Contamination':>15} | {'Flagged':>8} | {'Flag Rate':>10} | {'Threshold':>10}")
print("-" * 55)
for c in [0.01, 0.02, 0.05, 0.10, 0.15, 0.20]:
    iso = IsolationForest(n_estimators=200, contamination=c, random_state=42)
    preds = iso.fit_predict(X)
    flagged = (preds == -1).sum()
    # offset_ is the score_samples cutoff implied by this contamination value;
    # the raw scores themselves do not change with contamination
    print(f"{c:>15.2f} | {flagged:>8} | {flagged/len(flows)*100:>9.1f}% | {iso.offset_:>10.4f}")
EOF
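Set contamination to the anomaly rate you expect in your environment. A value of 2-5% is a common starting point; lower it if the flagged flows are mostly benign, and raise it if known attack traffic is being missed.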
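Because contamination only moves the cutoff applied to the scores, different alert rates can also be compared without refitting. The sketch below assumes synthetic stand-in features and derives each threshold as a quantile of `score_samples`; it illustrates the relationship and is not a replacement for the sweep above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))  # stand-in for scaled flow features

# Fit once; the anomaly scores do not depend on the contamination setting
iso = IsolationForest(n_estimators=200, random_state=42).fit(X_demo)
scores = iso.score_samples(X_demo)  # lower = more anomalous

flag_counts = {}
for rate in [0.01, 0.02, 0.05, 0.10]:
    # Flag everything scoring below the chosen quantile
    threshold = np.quantile(scores, rate)
    flag_counts[rate] = int((scores < threshold).sum())
    print(f"rate={rate:.2f}  threshold={threshold:.4f}  flagged={flag_counts[rate]}")
```

This makes it cheap to show analysts how many alerts each candidate rate would generate before committing to one.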
cat > ~/network-lab/sigma_port_scan.yml << 'EOF'
title: Network Port Scan Detection
id: a1b2c3d4-e5f6-7890-abcd-ef1234567890
status: experimental
description: Detects port scanning based on a high count of distinct destination ports per source
author: CyberSec Pro Academy - L06
date: 2024/01/15
logsource:
  category: network_connection
  product: zeek
detection:
  selection:
    src_ip|startswith: '192.168.'
  timeframe: 60s
  condition: selection | count(dst_port) by src_ip > 100
falsepositives:
  - Network scanners (Nessus, Qualys); allowlist scanner IPs
  - Load balancer health checks
level: medium
tags:
  - attack.discovery
  - attack.t1046
EOF
cat > ~/network-lab/sigma_c2_beacon.yml << 'EOF'
title: C2 Beaconing - Regular Interval Connections
id: b2c3d4e5-f6a7-8901-bcde-f01234567891
status: experimental
description: Detects C2 beaconing via regular-interval connections to same external host
author: CyberSec Pro Academy - L06
logsource:
  category: network_connection
detection:
  selection:
    dst_port:
      - 443
      - 80
      - 4444
      - 8080
    connection_count|gte: 10
  filter_internal:
    dst_ip|startswith:
      - '10.'
      - '172.16.'
      - '192.168.'
  timeframe: 1h
  condition: selection and not filter_internal | count() by src_ip,dst_ip > 8
level: high
tags:
  - attack.command_and_control
  - attack.t1071
  - attack.t1571
EOF
echo "Sigma rules written"
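To check the port-scan rule's logic before deploying it to a SIEM, the same condition can be approximated offline with pandas. This is a rough sketch under stated assumptions: the column names `ts`, `src_ip`, and `dst_port` and the helper `port_scan_sources` are hypothetical, and fixed 60-second windows stand in for whatever windowing the target Sigma backend applies.

```python
import pandas as pd

def port_scan_sources(conns, window_s=60, min_ports=100):
    """Return distinct-port counts per (src_ip, window) that exceed min_ports."""
    # Mirror the rule's selection: internal sources only
    df = conns[conns['src_ip'].str.startswith('192.168.')].copy()
    # Assign each connection to a fixed-size time window
    df['window'] = (df['ts'] // window_s).astype(int)
    distinct = df.groupby(['src_ip', 'window'])['dst_port'].nunique()
    return distinct[distinct > min_ports]

# Toy data: one host sweeps 500 ports in 10 s, another keeps hitting port 443
scanner = pd.DataFrame({
    'ts': [i * 0.02 for i in range(500)],
    'src_ip': '192.168.56.10',
    'dst_port': list(range(1, 501)),
})
normal = pd.DataFrame({
    'ts': [float(i) for i in range(50)],
    'src_ip': '192.168.56.20',
    'dst_port': 443,
})
conns = pd.concat([scanner, normal], ignore_index=True)

hits = port_scan_sources(conns)
print(hits)
```

Running this against exported connection logs gives a quick view of how often the threshold of 100 ports would fire in your environment.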
Document which ATT&CK techniques your detections cover:
| Detection | ATT&CK Technique | Tactic |
|---|---|---|
| Port scan (Isolation Forest) | T1046: Network Service Discovery | Discovery |
| C2 beaconing (periodicity) | T1071: Application Layer Protocol | Command and Control |
| Large upload flows (byte_rate) | T1048: Exfiltration Over Alternative Protocol | Exfiltration |
| DBSCAN noise flows | T1571: Non-Standard Port | Command and Control |
Record your lab results. Use the AI analyst to help structure your network detection report.
| Metric | Value |
|---|---|
| Total flows analyzed | |
| Anomalies flagged (Isolation Forest) | |
| Port scan detected | |
| Beacon detected | |
| Optimal contamination value | |
| MITRE techniques covered | |