Agent Gateway Connection Logic
This post describes the connection logic between the agent and the gateway, including all the retry mechanisms, timeouts, and connection management details.
Overview
The agent connects to the gateway using gRPC with a streaming connection. The connection logic handles authentication, retries, exponential backoff, and keepalive mechanisms.
Connection Modes
1. Standard Mode
The default mode where the agent connects directly to the gateway. Uses exponential backoff for reconnection attempts.
2. Embedded Mode
The agent runs inside another process and performs a pre-connect step before opening the main connection. Mainly used for DSN configurations.
3. Multi-Connection Mode
Allows multiple connections from a single agent instance. Each connection has its own lifecycle.
Connection Flow
Initial Connection
- Configuration Loading (agent/config/Load):
  - Loads config from environment or files
  - Validates required fields (URL, token, etc.)
  - Sets up TLS configuration if not insecure
- Pre-Connect Phase (for embedded mode):
  - Sends PreConnectRequest to gateway
  - Gateway responds with status:
    - CONNECT: Proceed with connection
    - BACKOFF: Wait before retrying
  - Retries every 5 seconds on failure
- Main Connection (grpc.Connect):
  - 15-second timeout for initial connection
  - Sends metadata: version, platform, hostname, machine-id
  - Establishes bidirectional streaming connection
- Post-Connection:
  - Receives GatewayConnectOK packet
  - Starts keepalive goroutine
  - Begins processing incoming packets
Connection Parameters
// Timeouts
const (
	InitialConnectTimeout = 15 * time.Second
	TCPDialTimeout        = 10 * time.Second
	TCPLivenessTimeout    = 5 * time.Second
	PreConnectRetryDelay  = 5 * time.Second
)

// Backoff Configuration
const (
	InitialBackoff     = 1 * time.Second
	MaxBackoffAttempts = 9  // Max backoff: 512 seconds
	BackoffResetTime   = 60 // Reset after 60s of stable connection
)

// Keepalive
const DefaultKeepAlive = 10 * time.Second
Retry Logic
Exponential Backoff (backoff.Exponential2x)
The agent uses exponential backoff with 2x multiplier:
Attempt 1: 1s
Attempt 2: 2s
Attempt 3: 4s
Attempt 4: 8s
Attempt 5: 16s
Attempt 6: 32s
Attempt 7: 64s
Attempt 8: 128s
Attempt 9: 256s
Attempt 10+: 512s (capped)
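The schedule above follows directly from doubling a 1-second base delay and capping the exponent at 9. Here is a minimal sketch of that arithmetic; `backoffDelay` is a hypothetical helper, not the signature of backoff.Exponential2x, and the constants simply mirror the values listed under Connection Parameters.

```go
package main

import "time"

// Values mirror the documented constants; the lowercase names are illustrative.
const (
	initialBackoff     = 1 * time.Second
	maxBackoffAttempts = 9
)

// backoffDelay returns the wait before the given 1-based attempt.
// The delay doubles on every attempt and is capped at
// initialBackoff << maxBackoffAttempts (512 seconds).
func backoffDelay(attempt int) time.Duration {
	if attempt < 1 {
		attempt = 1
	}
	shift := attempt - 1
	if shift > maxBackoffAttempts {
		shift = maxBackoffAttempts
	}
	return initialBackoff << uint(shift)
}
```

Capping the shift rather than the resulting duration also keeps the computation safe from overflow when the attempt counter grows large.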
Backoff Reset Conditions
- Time-based: Connection stable for >60 seconds
- Context canceled: Always resets backoff
- Nil error: Successful operation resets backoff
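The three reset conditions can be expressed as a single predicate. This is a sketch under assumptions: `shouldResetBackoff` and `backoffResetAfter` are hypothetical names, and cancellation is detected here with the standard library's `context.Canceled` rather than a gRPC status code.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// backoffResetAfter mirrors the documented 60-second stability window.
const backoffResetAfter = 60 * time.Second

// errDial is a sample non-cancellation failure used for illustration.
var errDial = errors.New("dial failed")

// shouldResetBackoff reports whether the retry counter should return to zero
// after a connection that lasted connectedFor and ended with err.
func shouldResetBackoff(err error, connectedFor time.Duration) bool {
	switch {
	case err == nil:
		return true // successful operation
	case errors.Is(err, context.Canceled):
		return true // context canceled always resets
	case connectedFor > backoffResetAfter:
		return true // connection was stable long enough
	}
	return false
}
```

For example, `shouldResetBackoff(errDial, 5*time.Second)` is false, so the next retry would use the next, longer delay in the schedule.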
Error Handling
Different gRPC status codes trigger different behaviors:
- codes.Canceled: Reset backoff, clean shutdown
- codes.Unauthenticated: Continue backoff, log error
- Other errors: Continue backoff with current delay
Keepalive Mechanism
Once connected, the agent sends keepalive packets every 10 seconds:
func (c *mutexClient) StartKeepAlive() {
	go func() {
		for {
			proto := &pb.Packet{Type: pbgateway.KeepAlive}
			if err := c.Send(proto); err != nil {
				break // Connection lost
			}
			time.Sleep(pb.DefaultKeepAlive)
		}
	}()
}
If keepalive fails, the connection is considered dead and reconnection begins.
Session Management
Session Lifecycle
- SessionOpen: Gateway requests new session
  - Validates connection parameters
  - Checks TCP liveness for database connections
  - Sends SessionOpenOK or SessionClose
- Active Session:
  - Processes protocol-specific packets
  - Maintains connection state in memory store
  - Handles concurrent connections (TCP mode)
- SessionClose:
  - Cleans up all associated connections
  - Removes from memory store
  - Sends exit code to gateway
TCP Liveness Check
For database connections (Postgres, MySQL, MSSQL, MongoDB):
- Attempts TCP dial with 5-second timeout
- Validates host:port connectivity
- Fails session if unreachable
Connection Storage
The agent maintains connections in a thread-safe memory store:
- Key format: {sessionID} or {sessionID}:{connectionID}
- Supports filtering by prefix for cleanup
- Handles graceful shutdown of all connections
Error Recovery
Network Failures
- Exponential backoff retry
- Keepalive detection of dead connections
- Automatic reconnection
Authentication Failures
- Logged but continues retry attempts
- No special handling vs other errors
Protocol Errors
- Session closed with error message
- Exit code sent to gateway
- Connection cleaned up
Environment Variables
Runtime environment variables are merged from three sources, listed here from highest to lowest priority:
- Agent runtime envs (passed to RunV2)
- Connection-specific envs from gateway
- Client-provided envs (lowest priority)
Special handling for the system.agent.envs marker: it pulls values from the agent's environment.
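The merge can be sketched as layering three maps. This reading assumes that client-provided values have the lowest priority and agent runtime values the highest; `mergeEnvs` is a hypothetical helper, and the system.agent.envs marker is left out because its exact semantics are not described here.

```go
package main

// mergeEnvs layers the three sources so that higher-priority values win.
// Assumed priority, lowest to highest: client, connection, agent.
func mergeEnvs(agent, connection, client map[string]string) map[string]string {
	merged := make(map[string]string, len(agent)+len(connection)+len(client))
	// Apply the lowest-priority layer first so later layers overwrite it.
	for _, layer := range []map[string]string{client, connection, agent} {
		for key, value := range layer {
			merged[key] = value
		}
	}
	return merged
}
```

The input maps are never modified, so the same agent-level map can be reused across sessions.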
TLS Configuration
- Default: TLS enabled with system CA pool
- Custom CA: Via TLSCA config field
- Insecure mode: Only for testing/secure networks
- Server name override: TLSServerName field
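These options map naturally onto the standard library's `tls.Config`. The sketch below is illustrative only: `tlsOptions` and `buildTLSConfig` are hypothetical, the `Insecure` field name is an assumption, and `TLSCA` is assumed to hold PEM text rather than a file path.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
)

// tlsOptions groups the settings listed above (hypothetical struct).
type tlsOptions struct {
	Insecure      bool   // assumed flag name: disables TLS entirely
	TLSCA         string // custom CA, assumed to be PEM-encoded text
	TLSServerName string // server name override
}

// buildTLSConfig returns nil for insecure mode, otherwise a TLS config.
// Leaving RootCAs nil makes crypto/tls fall back to the system CA pool.
func buildTLSConfig(o tlsOptions) (*tls.Config, error) {
	if o.Insecure {
		return nil, nil // caller connects without TLS
	}
	cfg := &tls.Config{
		MinVersion: tls.VersionTLS12,
		ServerName: o.TLSServerName,
	}
	if o.TLSCA != "" {
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM([]byte(o.TLSCA)) {
			return nil, errors.New("TLSCA contains no valid PEM certificate")
		}
		cfg.RootCAs = pool
	}
	return cfg, nil
}
```

Rejecting an unparsable CA up front avoids silently falling back to the system pool when the operator intended to pin a private CA.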
Debugging
Enable gRPC debug logging:
export LOG_GRPC=1 # Basic logging
export LOG_GRPC=2 # Verbose logging
Common Issues
- "context canceled": Normal shutdown, not an error
- "unauthenticated": Check token validity
- "failed connecting to remote host": Database unreachable
- Backoff stuck at max: Connection failing for >9 attempts
Remember: If the connection fails, check the logs. If logging fails, check the network. If the network fails, verify DNS resolution.