ePDG Monitoring [VAS Experts Documentation]

Differences

This shows you the differences between two versions of the page.

Old revision: en:dpi:epdg:monitoring [2026/04/30 08:51] – [5.2 SNMP v2c — EPDG-MIB] elena.krasnobryzh
New revision: en:dpi:epdg:monitoring [2026/05/07 09:11] (current) elena.krasnobryzh
Line 1: Line 1:
 {{indexmenu_n>6}}
-====== Monitoring of ePDG ======
+====== ePDG Monitoring ======
 
 ===== Integrated VoWiFi Gateway Monitoring System (ePDG) =====
 
-===== 1. Solution overview =====
+===== Solution overview =====
 
 The VAS Experts ePDG Monitoring system provides full operational control of the **fast-epdg** component, the VoWiFi (Voice over WiFi) gateway operating according to 3GPP TS 29.273 and TS 24.302. The gateway secures voice and packet traffic over untrusted Wi-Fi networks with IPSec/IKEv2 tunneling and integrates with the EPC core through the SWu, SWm, SWx, S2b, and S6b interfaces.
Line 18: Line 18:
   * **Coverage of the entire SWu → S2b stack**: IKEv2 (SWu), Diameter SWm/SWx/S6b, GTPv2-C (S2b), and the GTP-U data plane, all in one place. The 33 metrics cover both the control plane and the data plane.
  
-===== 2. Architecture of the monitoring system ===== 
- 
-<mermaid> 
-flowchart TB 
-    subgraph DataPlane["Data Plane"] 
-        IPSEC["IPSec ESP<br/>IKEv2 SA / Child SA<br/>Kernel xfrm"] 
-        GTPU["GTP-U Tunneller<br/>S2b Data<br/>ePDG ↔ PGW"] 
-    end 
- 
-    subgraph ControlPlane["Control Plane"] 
-        IKE["IKEv2 SWu<br/>EAP-AKA' auth"] 
-        DIAM["Diameter Client<br/>SWx/SWm/S6b"] 
-        GTPC["GTPv2-C S2b<br/>to PGW/SMF"] 
-        CTRL["ePDG Controller<br/>Attach/Detach FSM"] 
-    end 
- 
-    subgraph Collection["Metrics Collection"] 
-        PROMEXP["fast-epdg<br/>/metrics endpoint<br/>:9817"] 
-    end 
- 
-    subgraph Storage["Storage"] 
-        PROM["Prometheus<br/>TSDB<br/>15-day retention"] 
-    end 
- 
-    subgraph Visualization["Visualization"] 
-        GRAF["Grafana<br/>4 dashboards, 35+ panels"]
-    end 
- 
-    subgraph Alerting["Alerting"] 
-        AM["Alertmanager<br/>Routing / Inhibition"] 
-        EMAIL["Email SMTP"] 
-        SNMPGW["SNMP Trap Sender<br/>Webhook → Trap gateway"] 
-        NMS["External NMS<br/>SNMP v2c UDP/162"]
-        WH["Webhooks<br/>Telegram / PagerDuty"] 
-    end 
- 
-    IKE --> PROMEXP 
-    IPSEC --> PROMEXP 
-    GTPC --> PROMEXP 
-    GTPU --> PROMEXP 
-    DIAM --> PROMEXP 
-    CTRL --> PROMEXP 
- 
-    PROMEXP --> PROM 
-    PROM --> GRAF 
-    PROM --> AM 
- 
-    AM --> EMAIL 
-    AM --> SNMPGW 
-    SNMPGW --> NMS 
-    AM --> WH 
-</mermaid> 
  
 ==== Four-level monitoring architecture ====
Line 79: Line 27:
 | **Alerting** | Alertmanager + SNMP Trap Sender | PromQL rules → webhook → SNMP v2c trap |
  
- 
-===== 3. Components and indicators ===== 
- 
-==== Monitoring coverage ==== 
- 
-<mermaid> 
-flowchart LR 
-    EXP["fast-epdg<br/>/metrics :9817"] 
- 
-    EXP --> CFG["Config<br/>2 metrics"] 
-    EXP --> NET["Network<br/>1 metric"] 
-    EXP --> PROTO["Protocols L5-L7<br/>15 metrics"] 
-    EXP --> SVC["Service KPI<br/>4 metrics"] 
-    EXP --> SESS["Session State<br/>4 metrics"] 
-    EXP --> APP["Application<br/>3 metrics"] 
-    EXP --> SYS["System<br/>4 metrics"] 
- 
-    PROTO --> IKEV2["IKEv2<br/>SWu — 3"] 
-    PROTO --> GTPC["GTPv2-C<br/>S2b — 4"] 
-    PROTO --> GTPU["GTP-U<br/>S2b data — 3"] 
-    PROTO --> DIA["Diameter<br/>SWm/SWx/S6b — 5"] 
-</mermaid> 
  
 ==== Quantitative overview by category ====
 
-^ Category ^ Number of metrics ^ Polling interval ^ Key indicators ^
-| **Config** | 2 | 10 | Configuration status, reload counter |
-| **Network** | 1 | 10 | Node connection status (PGW/AAA/HSS) |
-| **IKEv2 (SWu)** | 3 | 10 | Messages by type (IKE_SA_INIT, IKE_AUTH, CREATE_CHILD_SA), latency histogram, errors |
-| **GTPv2-C (S2b)** | 4 | 10 | Messages (Create/Modify/Delete Session), latency, errors, retransmissions |
-| **GTP-U data plane** | 3 | 10 with | Packets/bytes, tunneling errors |
-| **Diameter (SWm/SWx/S6b)** | 5 | 10 | Messages by command code (DER/DEA, MAR/MAA, AAR/AAA), latency, errors, watchdog, connection status |
-| **Service KPI** | 4 | 10c | Attach success rate, attach duration histogram, service availability, uptime |
-| **Session State** | 4 | 10 with | IKE SAs, Child SAs, GTP sessions, total subscribers |
-| **Application** | 3 | 10 | Number of threads, memory, log messages by level |
-| **System** | 4 | 10c | CPU usage, memory, disk usage, open FDs |
-| **Total** | **33 metrics** | | |
+^ Category ^ Number of metrics ^ Polling interval ^ Key indicators ^
+| **Config** | 2 | 10 sec | Configuration status, reload counter |
+| **Network** | 1 | 10 sec | Node connection status (PGW/AAA/HSS) |
+| **IKEv2 (SWu)** | 3 | 10 sec | Messages by type (IKE_SA_INIT, IKE_AUTH, CREATE_CHILD_SA), latency histogram, errors |
+| **GTPv2-C (S2b)** | 4 | 10 sec | Messages (Create/Modify/Delete Session), latency, errors, retransmissions |
+| **GTP-U data plane** | 3 | 10 sec | Packets/bytes, tunneling errors |
+| **Diameter (SWm/SWx/S6b)** | 5 | 10 sec | Messages by command code (DER/DEA, MAR/MAA, AAR/AAA), latency, errors, watchdog, connection status |
+| **Service KPI** | 4 | 10 sec | Attach success rate, attach duration histogram, service availability, uptime |
+| **Session State** | 4 | 10 sec | IKE SAs, Child SAs, GTP sessions, total subscribers |
+| **Application** | 3 | 10 sec | Number of threads, memory, log messages by level |
+| **System** | 4 | 10 sec | CPU usage, memory, disk usage, open FDs |
+| **Total** | **33 metrics** | | |
  
-==== Naming principles ==== 
- 
-All metrics have the prefix ''epdg_'' and are organized in a hierarchy: 
- 
-<code> 
-epdg_ 
-├── config_*           # Configuration 
-├── network_*          # Network layer 
-├── ikev2_*            # SWu (IKEv2/IPSec) 
-├── gtp_*              # S2b control-plane GTPv2-C 
-├── gtpu_*             # S2b data-plane GTP-U 
-├── diameter_*         # SWm/SWx/S6b 
-├── service_*          # Service KPIs (attach, availability, uptime) 
-├── session_*          # Session state (IKE SA, Child SA, GTP, subscribers)
-├── app_*              # Application metrics (memory, threads, logs)
-└── system_*           # System metrics (CPU, memory, disk, open FDs)
-</code> 
- 
-===== 4. List of metrics ===== 
- 
-All metrics are exported through a single ''/metrics'' endpoint in the Prometheus text format. Names follow the Prometheus conventions: ''epdg_<group>_<name>[_unit]''. Counters carry the ''_total'' suffix, and histograms carry a unit suffix such as ''_seconds'' or ''_bytes''.
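
The naming rules above can be checked mechanically. The following is a minimal sketch, not part of fast-epdg: the function name ''check_metric'' and the violation messages are hypothetical, and the group list is copied from the naming hierarchy in this document.

```python
import re

# Groups taken from the epdg_ naming hierarchy described in this document.
GROUPS = ("config", "network", "ikev2", "gtp", "gtpu", "diameter",
          "service", "session", "app", "system")

# epdg_<group>_<name>[_unit], lowercase letters, digits and underscores only.
NAME_RE = re.compile(r"^epdg_(%s)_[a-z0-9_]+$" % "|".join(GROUPS))


def check_metric(name: str, metric_type: str) -> list[str]:
    """Return a list of convention violations (empty if the name conforms)."""
    problems = []
    if not NAME_RE.match(name):
        problems.append("name does not match epdg_<group>_<name>[_unit]")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counter without _total suffix")
    if metric_type == "histogram" and not name.endswith(("_seconds", "_bytes")):
        problems.append("histogram without _seconds/_bytes suffix")
    return problems


if __name__ == "__main__":
    for metric, kind in [("epdg_gtpu_packets_total", "counter"),
                         ("epdg_ikev2_errors", "counter")]:
        print(metric, check_metric(metric, kind))
```

Such a check could run against the list of metrics in section 4 before new metrics are added; it only validates names, not label sets.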
- 
-==== 4.1 Config (2) ==== 
- 
-^ Name ^ Type ^ Purpose ^
-| ''epdg_config_status'' | Gauge | Component configuration status (0=error, 1=ok) |
-| ''epdg_config_reload_total'' | Counter | Configuration reload counter (success/failure) |
- 
-==== 4.2 Network (1) ==== 
- 
-^ Name ^ Type ^ Purpose ^
-| ''epdg_network_connection_status'' | Gauge | TCP/UDP connection status to a node (0=down, 1=up); applies to PGW (S2b), AAA (SWm), HSS (SWx) |
- 
-==== 4.3 IKEv2 SWu (3) ==== 
- 
-^ Name ^ Type ^ Purpose ^
-| ''epdg_ikev2_messages_total'' | Counter | IKEv2 message counter by exchange type (IKE_SA_INIT / IKE_AUTH / CREATE_CHILD_SA / INFORMATIONAL) |
-| ''epdg_ikev2_request_duration_seconds'' | Histogram | IKEv2 response time |
-| ''epdg_ikev2_errors_total'' | Counter | IKEv2 errors (NO_PROPOSAL_CHOSEN, AUTHENTICATION_FAILED, INVALID_SYNTAX, etc.) |
- 
-==== 4.4 GTPv2-C S2b (4) ==== 
- 
-^ Name ^ Type ^ Purpose ^
-| ''epdg_gtp_messages_total'' | Counter | GTPv2-C messages (Create/Modify/Delete Session, Echo) |
-| ''epdg_gtp_request_duration_seconds'' | Histogram | Request-to-response latency |
-| ''epdg_gtp_errors_total'' | Counter | GTP-C errors by Cause code |
-| ''epdg_gtp_retransmissions_total'' | Counter | GTP-C request retransmissions |
- 
-==== 4.5 GTP-U data plane (3) ====
-
-^ Name ^ Type ^ Purpose ^
-| ''epdg_gtpu_packets_total'' | Counter | Packets through the GTP-U tunnel (uplink/downlink) |
-| ''epdg_gtpu_bytes_total'' | Counter | Bytes through the GTP-U tunnel |
-| ''epdg_gtpu_errors_total'' | Counter | Tunneling errors (TEID mismatch, decap fail) |
- 
-==== 4.6 Diameter SWm/SWx/S6b (5) ==== 
- 
-^ Name ^ Type ^ Purpose ^
-| ''epdg_diameter_messages_total'' | Counter | Diameter messages by command: DER/DEA (SWm), MAR/MAA (SWx), AAR/AAA (S6b), STR/STA |
-| ''epdg_diameter_request_duration_seconds'' | Histogram | Diameter request-to-response latency |
-| ''epdg_diameter_errors_total'' | Counter | Errors by Experimental-Result-Code |
-| ''epdg_diameter_watchdog_status'' | Gauge | DWR/DWA watchdog status to node (0=timeout, 1=ok) |
-| ''epdg_diameter_connection_status'' | Gauge | Diameter connection status to node (0=disconnected, 1=connected) |
- 
-==== 4.7 Service KPI (4) ==== 
- 
-^ Name ^ Type ^ Purpose ^
-| ''epdg_service_attach_total'' | Counter | Attach attempts (success/failure) by APN |
-| ''epdg_service_attach_duration_seconds'' | Histogram | Attach duration (IKE_SA_INIT → session ready) |
-| ''epdg_service_availability'' | Gauge | Availability flag (0=down, 1=up) |
-| ''epdg_service_uptime_seconds'' | Gauge | Service uptime in seconds |
- 
-==== 4.8 Session State (4) ==== 
- 
-^ Name ^ Type ^ Purpose ^
-| ''epdg_session_ike_sa_total'' | Gauge | Active IKE SA | 
-| ''epdg_session_child_sa_total'' | Gauge | Active Child SA (IPSec tunnels) | 
-| ''epdg_session_gtp_sessions_total'' | Gauge | Active GTP-C sessions on S2b | 
-| ''epdg_session_subscribers_total'' | Gauge | Unique subscribers (UE connected) | 
- 
-==== 4.9 Application (3) ==== 
- 
-^ Name ^ Type ^ Purpose ^
-| ''epdg_app_threads_total'' | Gauge | Total number of worker threads |
-| ''epdg_app_memory_bytes'' | Gauge | Process memory by type |
-| ''epdg_app_log_messages_total'' | Counter | Log messages by level (debug/info/warn/error/fatal) |
- 
-==== 4.10 System (4) ==== 
- 
-^ Name ^ Type ^ Purpose ^
-| ''epdg_system_cpu_usage_percent'' | Gauge | CPU load |
-| ''epdg_system_memory_bytes'' | Gauge | System memory |
-| ''epdg_system_disk_bytes'' | Gauge | Disk space |
-| ''epdg_system_open_fds'' | Gauge | Open file descriptors |
- 
-==== Types of metrics (reminder) ==== 
- 
-^ Type ^ Purpose ^
-| **Counter** | Monotonically increasing counter (messages, errors, reloads) |
-| **Gauge** | Current value (active sessions, memory, status) |
-| **Histogram** | Distribution of values across predefined buckets (durations, latencies) |
- 
-===== 5. Integration interfaces ===== 
- 
-<mermaid> 
-flowchart LR 
-    CORE["VAS Experts<br/>ePDG Monitoring"] 
- 
-    CORE --> P["Prometheus<br/>CNCF / OpenMetrics"] 
-    CORE --> S["SNMP v2c<br/>EPDG-MIB"] 
-    CORE --> G["Grafana<br/>JSON Provisioning"] 
-    CORE --> W["Webhooks<br/>ChatOps"] 
-    CORE --> AM["Alertmanager<br/>Routing"] 
- 
-    P --> P1["Cloud-native NMS<br/>Thanos / Cortex / Mimir"] 
-    S --> S1["Legacy NMS<br/>HP OpenView, NetAct<br/>IBM Tivoli"] 
-    G --> G1["NOC Wall Displays<br/>Drill-down Analytics"] 
-    W --> W1["Telegram / Slack<br/>PagerDuty / OpsGenie"] 
-    AM --> AM1["Smart routing<br/>Severity-based"] 
-</mermaid> 
- 
-==== 5.1 Prometheus (CNCF Standard) ==== 
- 
-The native ''/metrics'' endpoint on port **9817** is built into fast-epdg. It serves the standard Prometheus text exposition format v0.0.4 (compatible with OpenMetrics). Metrics can be aggregated by a central Prometheus, and ''remote_write'' is supported for long-term storage in Thanos, Cortex, or Grafana Mimir.
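
To illustrate what the text format looks like to a consumer, here is a minimal parsing sketch. The sample payload is hypothetical (the values and the ''type'' label name are assumptions, not actual fast-epdg output), and the parser deliberately ignores optional timestamps and ''# HELP''/''# TYPE'' metadata.

```python
def parse_exposition(text: str) -> dict[str, float]:
    """Parse 'series value' lines of the Prometheus text format.

    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    Optional trailing timestamps are not handled in this sketch.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field on the line.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


# Hypothetical payload; metric names come from this document, values do not.
SAMPLE = """\
# HELP epdg_service_availability Availability flag
# TYPE epdg_service_availability gauge
epdg_service_availability 1
epdg_session_ike_sa_total 1250
epdg_ikev2_messages_total{type="IKE_AUTH"} 48213
"""

if __name__ == "__main__":
    for series, value in parse_exposition(SAMPLE).items():
        print(series, value)
```

In production a Prometheus server does this scraping itself; a hand-written parser like this is only useful for quick smoke tests of the endpoint.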
- 
-==== 5.2 SNMP v2c — EPDG-MIB ==== 
- 
-**47 OIDs** cover the Prometheus metrics, plus **14 trap notifications** (raise/clear pairs per RFC 3877 ALARM-MIB). Compatible with HP OpenView, IBM Tivoli Netcool, Nokia NetAct, Huawei U2000.
- 
-<mermaid> 
-flowchart TB 
-    IANA["IANA PEN<br/>enterprises<br/>.1.3.6.1.4.1"] 
-    VAS["VAS Experts<br/>.1.3.6.1.4.1.43823<br/>(vas.expert)"] 
-    EPDG["EPDG-MIB<br/>.43823.1"] 
-    EPC["EPC Monitoring<br/>.43823.100"] 
- 
-    IANA --> VAS 
-    VAS --> EPDG 
-    VAS --> EPC 
- 
-    EPDG --> OBJ["epdgObjects<br/>.43823.1.1"] 
-    EPDG --> NOTIF["epdgNotifications<br/>.43823.1.2<br/>14 trap types"] 
-    EPDG --> CONF["epdgConformance<br/>.43823.1.3"] 
- 
-    OBJ --> SERVICE["service .1.1.1<br/>4 OID"] 
-    OBJ --> IKE["ikev2 .1.1.2<br/>6 OID"] 
-    OBJ --> GTP["gtp .1.1.3<br/>8 OID"] 
-    OBJ --> DIAM["diameter .1.1.4<br/>7 OID"] 
-    OBJ --> SESS["sessions .1.1.5<br/>8 OID"] 
-    OBJ --> SYS["system .1.1.6<br/>8 OID"] 
-    OBJ --> NET["network .1.1.7<br/>6 OID"] 
- 
-    NOTIF --> TRAPAGR["7 raise / 7 clear<br/>pairs"] 
-</mermaid> 
- 
-Examples of SNMP requests: 
- 
-<code bash> 
-# Walk the entire ePDG subtree
-snmpwalk -v2c -c public <host> .1.3.6.1.4.1.43823.1
-
-# Service availability (Gauge 0..1)
-snmpget -v2c -c public <host> .1.3.6.1.4.1.43823.1.1.0
-</code> 
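
The OID tree in the diagram above can be reproduced with simple arc concatenation. This is a sketch under the assumption that the arc numbers shown in the diagram are exact; the helper name ''group_oid'' is hypothetical, and the OIDs of individual scalar objects below each group are not derived here because the document does not list them.

```python
# Arc values copied from the EPDG-MIB diagram in this document.
ENTERPRISES = (1, 3, 6, 1, 4, 1)   # IANA enterprises subtree
VAS_EXPERTS_PEN = 43823            # VAS Experts private enterprise number
EPDG_MIB = 1                       # EPDG-MIB under the enterprise node
EPDG_OBJECTS = 1                   # epdgObjects branch

# Object groups and the number of OIDs each one holds, per the diagram.
OBJECT_GROUPS = {"service": 1, "ikev2": 2, "gtp": 3, "diameter": 4,
                 "sessions": 5, "system": 6, "network": 7}
OID_COUNTS = {"service": 4, "ikev2": 6, "gtp": 8, "diameter": 7,
              "sessions": 8, "system": 8, "network": 6}


def group_oid(group: str) -> str:
    """Return the dotted OID of an object group, e.g. 'service'."""
    arcs = ENTERPRISES + (VAS_EXPERTS_PEN, EPDG_MIB, EPDG_OBJECTS,
                          OBJECT_GROUPS[group])
    return "." + ".".join(str(arc) for arc in arcs)


if __name__ == "__main__":
    for name in OBJECT_GROUPS:
        print(name, group_oid(name), OID_COUNTS[name])
    print("total OIDs:", sum(OID_COUNTS.values()))
```

The per-group counts add up to the 47 OIDs stated above, which is a quick consistency check on the diagram.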
- 
- 
-==== 5.3 Grafana ==== 
- 
-**4 JSON dashboards** (35+ panels in total):
-  * **ePDG Overview**: availability, attach KPIs, sessions, state of interfaces
-  * **IKEv2 Details**: messages, performance, errors, IKE SA life cycle
-  * **GTP Details**: GTPv2-C + GTP-U data per PGW node
-  * **Diameter Details**: application messages, latency, watchdog
-
-Dashboards are installed automatically through the Grafana API. The layout is adapted to Network Operations Center (NOC) wall displays and refreshes automatically every 15 seconds.
- 
-==== 5.4 Alertmanager Webhooks ==== 
- 
-Webhook interface for integration with any notification system: Telegram Bot, Slack, PagerDuty Events API v2, OpsGenie, Microsoft Teams. A separate **SNMP Trap Sender** service converts Alertmanager webhooks to SNMP v2c traps with Enterprise OID. 
- 
-===== 6. The alarm system ===== 
  
 ==== Alarm categories ====
 
-^ Severity ^ Alarm ^ Description ^ Reaction ^
-| **Critical** | ''ePDG_Service_Down'', ''ePDG_High_Attach_Failure_Rate'', ''ePDG_PGW_Unreachable', ''ePDG_AAA_Unreachable', ''ePDG_Diameter_Watchdog_Timeout'' | Component unavailable, mass connection failure, nodes available | Immediate escalation: ENMP Trapmail Webmail. Repeat every 1 o'clock
-| **Warning** | ''ePDG_High_IKEv2_Latency'', ''ePDG_High_GTP_Latency'', ''ePDG_High_IKEv2_Error_Rate'', ''ePDG_High_GTP_Error_Rate'', ''ePDG_High_Memory_Usage'', ''ePDG_High_CPUsage_Usage_Usage'', ''ePDG_Error_Error_Rate'', ''PDHigh'', ''PDHigh_Rate_Rate'' Repeat every 4 hours. Suppressed if Critical is present on the same component
+^ Severity ^ Alarm ^ Description ^ Reaction ^
+| **Critical** | ''ePDG_Service_Down'', ''ePDG_High_Attach_Failure_Rate'', ''ePDG_PGW_Unreachable'', ''ePDG_AAA_Unreachable'', ''ePDG_Diameter_Watchdog_Timeout'' | Component is unavailable, widespread connection failures, nodes are unavailable | Immediate escalation: Email + SNMP Trap + Webhook. Repeat every hour |
+| **Warning** | ''ePDG_High_IKEv2_Latency'', ''ePDG_High_GTP_Latency'', ''ePDG_High_IKEv2_Error_Rate'', ''ePDG_High_GTP_Error_Rate'', ''ePDG_High_Memory_Usage'', ''ePDG_High_CPU_Usage'', ''ePDG_Low_Disk_Space'', ''ePDG_High_Error_Log_Rate'' | Performance degradation, resource anomalies | Email. Resend every 4 hours. Suppressed if a “Critical” status is present on the same component |
- +
-==== Complete list of alarms (20+ rules) ====
-
-<mermaid>
-flowchart LR
-    AL["ePDG Alert Rules<br/>20+"]
-
-    AL --> CR["Critical<br/>5 rules"]
-    AL --> WR["Warning<br/>8 rules"]
-    AL --> INFO["Recording<br/>34 rules"]
-
-    CR --> C1["Service_Down<br/>availability == 0"]
-    CR --> C2["Attach_Failure_Rate<br/>> 10%"]
-    CR --> C3["PGW_Unreachable<br/>connection_status{s2b} == 0"]
-    CR --> C4["AAA_Unreachable<br/>connection_status{swm} == 0"]
-    CR --> C5["Diameter_Watchdog_Timeout<br/>watchdog_status == 0"]
-
-    WR --> W1["High_IKEv2_Latency<br/>p95 > 1.0 s"]
-    WR --> W2["High_GTP_Latency<br/>p95 > 0.5 s"]
-    WR --> W3["High_IKEv2_Error_Rate<br/>> 5%"]
-    WR --> W4["High_GTP_Error_Rate<br/>> 5%"]
-    WR --> W5["High_Memory_Usage<br/>> 80%"]
-    WR --> W6["High_CPU_Usage<br/>> 80%"]
-    WR --> W7["Low_Disk_Space<br/>< 10%"]
-    WR --> W8["High_Error_Log_Rate<br/>> 10/s"]
-
-    INFO --> I1["attach_success_rate<br/>preaggregated"]
-    INFO --> I2["p95_p99_latency<br/>preaggregated"]
-    INFO --> I3["throughput<br/>preaggregated"]
-</mermaid>
- +
-==== Alarm handling process ====
-
-<mermaid>
-sequenceDiagram
-    participant M as Metric (Prometheus)
-    participant R as Alert Rule (PromQL)
-    participant AM as Alertmanager
-    participant E as Email (SMTP)
-    participant SG as SNMP Trap Gateway
-    participant NMS as External NMS
-    participant W as Webhook (ChatOps)
-
-    M->>R: Value exceeds threshold
-    R->>R: Waiting (for: 1-10 min)
-    R->>AM: Alert FIRING
-    AM->>AM: Group by [alertname, component]
-    AM->>AM: Inhibition check (critical suppresses warning)
-
-    alt severity = critical
-        AM->>E: Email [CRITICAL]
-        AM->>SG: Webhook → SNMP Trap
-        SG->>NMS: SNMP v2c Trap (OID .1.3.6.1.4.1.43823.1.2.X)
-        AM->>W: Webhook (Telegram/PagerDuty)
-    else severity = warning
-        AM->>E: Email [WARNING]
-    end
-
-    Note over M,R: Metric returns to normal
-    R->>AM: Alert Resolved
-    SG->>NMS: clear-trap (paired notification)
-    AM->>E: Email [RESOLVED]
-</mermaid>
- +
-==== Features ====
-
-  * **Inhibition**: Critical alarms automatically suppress Warning alarms for the same component
-  * **Grouping**: Alarms are grouped by ''alertname'' + ''component'' with a 30-second window
-  * **Hold time / hysteresis**: a ''for'' duration of 1 to 10 minutes prevents false positives
-  * **Trap pairing**: paired raise/clear events for compliance with RFC 3877 ALARM-MIB
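
The inhibition and grouping behaviour listed above can be sketched in a few lines. This is an illustration of the logic only, not Alertmanager's implementation: the ''Alert'' class, both function names, and the ''component'' label values used in the example are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    alertname: str
    component: str
    severity: str  # "critical" or "warning"


def apply_inhibition(alerts: list[Alert]) -> list[Alert]:
    """Drop warning alerts whose component also has a firing critical alert."""
    critical_components = {a.component for a in alerts
                           if a.severity == "critical"}
    return [a for a in alerts
            if a.severity == "critical"
            or a.component not in critical_components]


def group_key(alert: Alert) -> tuple[str, str]:
    """Alerts sharing this key would be batched into one notification."""
    return (alert.alertname, alert.component)


if __name__ == "__main__":
    firing = [
        Alert("ePDG_PGW_Unreachable", "gtp", "critical"),
        Alert("ePDG_High_GTP_Latency", "gtp", "warning"),      # inhibited
        Alert("ePDG_High_IKEv2_Latency", "ikev2", "warning"),  # kept
    ]
    for alert in apply_inhibition(firing):
        print(group_key(alert), alert.severity)
```

In a real deployment these rules live in the Alertmanager configuration (inhibit rules and route grouping) rather than in application code.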
- +
- +
-===== 7. Visualization and operational dashboards =====
-
-==== Composition of dashboards ====
-
-^ Dashboard ^ Panels ^ Purpose ^
-| **ePDG Overview** | 10 | Service availability, attach success rate, number of active sessions, SWu/SWm/S2b status, interface throughput |
-| **IKEv2 Details** | 10 | Messages per second by type, request duration histogram, 95th-percentile latency, errors by type, IKE SA life cycle |
-| **GTP Details** | 8 | GTPv2-C messages per PGW, retransmissions, errors by cause code, GTP-U traffic (uplink/downlink) |
-| **Diameter Details** | 7 | Number of application messages (SWm/SWx/S6b), request duration, watchdog timer state, distribution of result codes, timeline of connection states |
- +
-==== Design for the Network Operations Center (NOC) ====
-
-<mermaid>
-flowchart TB
-    NOC["NOC Dashboard Layer"]
-
-    NOC --> OVER["ePDG Overview<br/>KPI Summary"]
-    NOC --> IKE["IKEv2 Details<br/>Drill-down"]
-    NOC --> GTP["GTP Details<br/>Drill-down"]
-    NOC --> DIA["Diameter Details<br/>Drill-down"]
-
-    OVER -->|Click attach KPI| IKE
-    OVER -->|Click session count| GTP
-    OVER -->|Click peer status| DIA
-</mermaid>
-
-  * **Auto-refresh**: every 15 seconds
-  * **Adaptive color scheme**: green → yellow → red by threshold values
-  * **Drill-down**: from Overview to Detail to Component
-  * **Time-range selector**: 5 minutes to 30 days of history
-  * **JSON provisioning**: dashboards are deployed automatically
- +
-===== 8. Integration into a single EPC Monitoring stack =====
-
-ePDG monitoring is fully integrated into overall packet core monitoring:
-
-<mermaid>
-flowchart TB
-    subgraph Common["Unified Monitoring Stack"]
-        PROM["Prometheus"]
-        GRAF["Grafana"]
-        AM["Alertmanager"]
-    end
-
-    subgraph Sources["EPC Sources"]
-        DPI["FastDPI<br/>:9110"]
-        SMF["SMF/metrics<br/>:9090"]
-        PCEF["fast-pcef/metrics<br/>:9090"]
-        PCRF["FastPCRF"]
-        EPDG["fast-epdg<br/>:9817"]
-    end
-
-    DPI --> PROM
-    SMF --> PROM
-    PCEF --> PROM
-    PCRF --> PROM
-    EPDG --> PROM
-
-    PROM --> GRAF
-    PROM --> AM
-</mermaid>
-
-The NOC operator sees **all EPC components** (DPI, SMF, PCEF, FastPCRF, ePDG) in a single Grafana interface, with a single alarm system and notification routing through one Alertmanager.
- +
-===== 9. Coverage of metrics by OSI levels =====
-
-<mermaid>
-graph LR
-    L1["L1 Physical<br/>NIC counters via system"]
-    L2["L2 Data Link<br/>MAC, VLAN"]
-    L3["L3 Network<br/>IP, IPSec ESP, GTP-U"]
-    L4["L4 Transport<br/>TCP/UDP/SCTP"]
-    L5["L5 Session<br/>GTPv2-C, IKEv2"]
-    L6["L6 Presentation<br/>IKEv2/IPSec encryption, EAP-AKA'"]
-    L7["L7 Application<br/>Diameter, service bearer ops"]
-    Operations["Operations<br/>KPI, SLA, Capacity"]
-    CX["CX Level<br/>Subscriber Experience"]
-
-    L1 --> L2 --> L3 --> L4 --> L5 --> L6 --> L7 --> Operations --> CX
-
-    style L1 fill:#e74c3c,color:#fff
-    style L2 fill:#e67e22,color:#fff
-    style L3 fill:#f39c12,color:#fff
-    style L4 fill:#2ecc71,color:#fff
-    style L5 fill:#1abc9c,color:#fff
-    style L6 fill:#3498db,color:#fff
-    style L7 fill:#9b59b6,color:#fff
-    style Operations fill:#34495e,color:#fff
-    style CX fill:#2c3e50,color:#fff
-</mermaid>
- +
-==== Metrics by level ====
-OSI model:
-
-^ Level ^ Metrics ^ Examples ^
-| **L1/L2 Physical / Data Link** | - | Covered by a separate node_exporter or OS-level equivalent (not included in the ePDG metrics list) |
-| **L3 Network / IPSec tunnels** | 3 | ''epdg_gtpu_packets_total'', ''epdg_gtpu_bytes_total'', ''epdg_gtpu_errors_total'' (GTP-U data plane) |
-| **L4 Transport** | 1 | ''epdg_network_connection_status'' (TCP/UDP to PGW/AAA/HSS nodes) |
-| **L5 Session** | 3 | ''epdg_session_ike_sa_total'', ''epdg_session_child_sa_total'', ''epdg_session_gtp_sessions_total'' |
-| **L6 Presentation/Security** | 3 | ''epdg_ikev2_messages_total'', ''epdg_ikev2_request_duration_seconds'', ''epdg_ikev2_errors_total'' (IKEv2/IPSec encryption and EAP-AKA' authentication) |
-| **L7 Application** | 9 | ''epdg_diameter_*'' (SWm/SWx/S6b, 5 metrics), ''epdg_gtp_*'' (GTPv2-C, 4 metrics) |
-
-Operator level:
-^ Level ^ Metrics ^ Examples ^
-| **Operations** | 11 | ''epdg_service_availability'', ''epdg_service_uptime_seconds'', ''epdg_app_*'' (3), ''epdg_system_*'' (4), ''epdg_config_*'' (2) |
-| **Customer Experience** | 3 | ''epdg_service_attach_duration_seconds'' p95, ''epdg_service_attach_total'' (success rate), ''epdg_ikev2_request_duration_seconds'' p99 |
- +
-==== Level 9: VoWiFi quality of experience (QoE) ====
-
-^ QoE indicator ^ Source metrics ^ Interpretation ^
-| **VoWiFi connection time** | ''epdg_service_attach_duration_seconds'' p95 | > 3 seconds: the subscriber notices a delay when switching to Wi-Fi |
-| **Continuity of service** | ''epdg_session_ike_sa_total'' delta | Mass drop of > 50 IKE SAs = availability issue |
-| **Authentication success** | ''ePDG_High_Attach_Failure_Rate'' alert rate | > 5% = HSS/AAA node problem |
-| **Bearer setup delay** | ''epdg_gtp_request_duration_seconds{msg=create-session}'' p99 | > 500 ms: delayed availability of the voice channel |
-| **GTP-U tunnel** | ''epdg_gtpu_errors_total'' rate / ''epdg_gtpu_packets_total'' rate | > 0.1% = degradation of voice quality |
-| **IKEv2 reliability** | ''epdg_ikev2_errors_total'' by type | NO_PROPOSAL_CHOSEN / AUTHENTICATION_FAILED: problems with certificates / UE |
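
Several indicators above are percentiles (p95, p99) of histogram metrics. The sketch below shows how such a percentile can be estimated from cumulative histogram buckets by linear interpolation, which is the same general idea PromQL's ''histogram_quantile()'' is based on. The bucket boundaries and counts are hypothetical; the document does not specify the actual buckets of ''epdg_service_attach_duration_seconds''.

```python
INF = float("inf")


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (+Inf, total).
    Values are assumed to be spread evenly inside each bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == INF:
                return lower_bound  # cannot interpolate into the +Inf bucket
            if count == lower_count:
                return upper_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical attach-duration buckets: (upper bound in seconds, cumulative count).
ATTACH_BUCKETS = [(0.5, 600), (1.0, 850), (2.0, 940), (5.0, 990), (INF, 1000)]

if __name__ == "__main__":
    p95 = histogram_quantile(0.95, ATTACH_BUCKETS)
    print("attach p95 seconds:", p95, "exceeds 3 s threshold:", p95 > 3.0)
```

Because the estimate depends on bucket boundaries, a threshold such as 3 seconds is only meaningful if a bucket boundary lies reasonably close to it.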
- +
- +
-===== 10. Standards and compatibility =====
-
-^ Standard ^ Area ^ Application ^
-| **3GPP TS 29.273** | SWx/S6b/SWm | Accounting methodology for Diameter messages and result codes |
-| **3GPP TS 24.302** | SWu (IKEv2) | Definition of IKEv2 message types and error codes |
-| **3GPP TS 33.402** | 3GPP security for non-3GPP access | EAP-AKA'/IKEv2 security parameters |
-| **3GPP TS 23.402** | Non-3GPP access architecture | Interface structure (SWu/SWm/SWx/S6b/S2b) |
-| **3GPP TS 32.421** | Performance Measurement | KPI collection methodology |
-| **3GPP TS 32.409** | Performance measurement charging | Counter structure |
-| **IETF RFC 7296** | IKEv2 | Message types, error notifications, SA states |
-| **IETF RFC 6733** | Diameter | Command codes, Result-Codes |
-| **IETF RFC 4187** | EAP-AKA | Authentication via SIM |
-| **IETF RFC 3877** | ALARM MIB | Enterprise MIB structure for alarms |
-| **IETF RFC 3418** | SNMPv2 MIB | SNMP v2c compatibility |
-| **Prometheus Exposition Format** | Metrics (v0.0.4) | Metrics export format |
-| **OpenMetrics** | CNCF Standard | Forward compatibility |
- +
- +
-===== 11. Deployment model =====
-
-<mermaid>
-flowchart TB
-    subgraph Host1["ePDG server"]
-        EPDG["fast-epdg<br/>(VoWiFi gateway)"]
-        PLUGIN["/metrics endpoint<br/>:9817"]
-        EPDG -.-> PLUGIN
-    end
-
-    subgraph Host2["Monitoring server"]
-        PROM["Prometheus"]
-        GRAF["Grafana"]
-        AM["Alertmanager"]
-        SNMPTRAP["SNMP Trap Sender<br/>(webhook gateway)"]
-        PROM --> GRAF
-        PROM --> AM
-        AM --> SNMPTRAP
-    end
-
-    subgraph Host3["External Systems"]
-        NMS["Operator NMS<br/>(HP OpenView /<br/>NetAct / Tivoli)"]
-        CHAT["ChatOps<br/>(Telegram/PagerDuty)"]
-    end
-
-    PLUGIN -->|HTTP:9817/metrics| PROM
-    SNMPTRAP -->|UDP 162| NMS
-    AM -->|Webhook| CHAT
-</mermaid>
- +
-==== Deployment characteristics ====
-
-^ Parameter ^ Value ^
-| **Metrics footprint** | Integrated (~2 MB memory overhead) |
-| **External dependencies** | Self-contained ''fast-epdg'' package (rpm) |
-| **Management** | ''fast-epdg.service'' systemd unit |
-| **Configuration** | ''monitoring'' section in ''fast-epdg.conf'' |
-| **Update** | Configuration update without interruption |
-| **OS** | Linux (RHEL/CentOS 8+, Ubuntu 22.04+) |
-| **Port** | 9817 TCP (listens on 0.0.0.0, configurable) |
-| **Deployment time** | < 5 minutes (enable the plugin in the config + restart) |
- +
-==== Placement options ====
-
-  * **On-premise**: the plugin runs in the fast-epdg address space, with no separate exporter process (about 2 MB of memory overhead)
-  * **Co-located Prometheus**: Prometheus collects metrics from an application running on the same host
-  * **Centralized**: a single Prometheus collects from all ePDG nodes
- +
-===== 12. Metric exporter configuration =====
-
-The ''monitoring'' section in ''fast-epdg.conf'':
  
-<code> 
-monitoring { 
-    enabled = yes 
-    listen_port = 9817 
-    listen_address = 0.0.0.0 
-    update_interval = 10 
-    metrics { 
-        ikev2 = yes 
-        gtp = yes 
-        diameter = yes 
-        service = yes 
-        session = yes 
-        app = yes 
-        system = yes 
-    } 
-} 
-</code> 
  
-Each group of metrics can be independently turned on/off without recompilation.