Agent Gateway Connection Logic

This post describes the connection logic between the agent and the gateway, including all the retry mechanisms, timeouts, and connection management details.

Overview

The agent connects to the gateway using gRPC with a streaming connection. The connection logic handles authentication, retries, exponential backoff, and keepalive mechanisms.

Connection Modes

1. Standard Mode

The default mode where the agent connects directly to the gateway. Uses exponential backoff for reconnection attempts.

2. Embedded Mode

The agent runs as part of another process and goes through a pre-connect phase before the main connection. Mainly used for DSN configurations.

3. Multi-Connection Mode

Allows multiple connections from a single agent instance. Each connection has its own lifecycle.

Connection Flow

Initial Connection

  1. Configuration Loading (agent/config/Load):
     • Loads config from environment or files
     • Validates required fields (URL, token, etc.)
     • Sets up TLS configuration if not insecure
  2. Pre-Connect Phase (for embedded mode):
     • Sends PreConnectRequest to gateway
     • Gateway responds with status:
       • CONNECT: Proceed with connection
       • BACKOFF: Wait before retrying
     • Retries every 5 seconds on failure
  3. Main Connection (grpc.Connect):
     • 15-second timeout for initial connection
     • Sends metadata: version, platform, hostname, machine-id
     • Establishes bidirectional streaming connection
  4. Post-Connection:
     • Receives GatewayConnectOK packet
     • Starts keepalive goroutine
     • Begins processing incoming packets
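
The pre-connect phase can be sketched as a small polling loop. This is a minimal illustration, not the agent's actual code: PreConnectStatus, waitForConnect, and the preConnect callback are hypothetical names, and it assumes a BACKOFF answer waits the same delay as a failed request.

```go
package main

import (
	"errors"
	"time"
)

// PreConnectStatus models the two gateway answers described above (hypothetical type).
type PreConnectStatus int

const (
	StatusConnect PreConnectStatus = iota // proceed with the main connection
	StatusBackoff                         // wait, then ask again
)

// errStopped is returned when the caller gives up before CONNECT arrives.
var errStopped = errors.New("pre-connect stopped before CONNECT")

// waitForConnect calls preConnect until the gateway answers CONNECT.
// It waits retryDelay after a BACKOFF answer or a failed request (assumption:
// both cases use the same delay) and returns the number of requests made.
func waitForConnect(done <-chan struct{}, retryDelay time.Duration,
	preConnect func() (PreConnectStatus, error)) (int, error) {
	for attempts := 1; ; attempts++ {
		status, err := preConnect()
		if err == nil && status == StatusConnect {
			return attempts, nil
		}
		select {
		case <-done:
			return attempts, errStopped
		case <-time.After(retryDelay):
		}
	}
}
```

A caller would pass 5*time.Second as retryDelay to match PreConnectRetryDelay, and a channel such as ctx.Done() to stop waiting on shutdown.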

Connection Parameters

// Timeouts
const (
    InitialConnectTimeout = 15 * time.Second
    TCPDialTimeout        = 10 * time.Second
    TCPLivenessTimeout    = 5 * time.Second
    PreConnectRetryDelay  = 5 * time.Second
)

// Backoff configuration
const (
    InitialBackoff     = 1 * time.Second
    MaxBackoffAttempts = 9  // Max backoff: 1s * 2^9 = 512 seconds
    BackoffResetTime   = 60 // In seconds: reset after 60s of stable connection
)

// Keepalive
const DefaultKeepAlive = 10 * time.Second

Retry Logic

Exponential Backoff (backoff.Exponential2x)

The agent uses exponential backoff with a 2x multiplier. The delay before each reconnection attempt doubles until it reaches the 512-second cap:

Attempt 1: 1s
Attempt 2: 2s  
Attempt 3: 4s
Attempt 4: 8s
Attempt 5: 16s
Attempt 6: 32s
Attempt 7: 64s
Attempt 8: 128s
Attempt 9: 256s
Attempt 10+: 512s (capped)
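
The schedule above can be reproduced with a few lines of Go. This is a sketch of the arithmetic only, not the real backoff.Exponential2x implementation; backoffDelay and the lowercase constants are illustrative names.

```go
package main

import "time"

const (
	initialBackoff     = 1 * time.Second
	maxBackoffAttempts = 9 // exponent cap: 1s * 2^9 = 512s
)

// backoffDelay returns the wait before the given 1-based attempt:
// 1s for attempt 1, doubling each time, capped at 512s.
func backoffDelay(attempt int) time.Duration {
	if attempt < 1 {
		attempt = 1
	}
	exp := attempt - 1
	if exp > maxBackoffAttempts {
		exp = maxBackoffAttempts
	}
	return initialBackoff << uint(exp)
}
```

Capping the exponent rather than the resulting duration keeps the shift from overflowing when the attempt counter grows large.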

Backoff Reset Conditions

  1. Time-based: Connection stable for >60 seconds
  2. Context canceled: Always resets backoff
  3. Nil error: Successful operation resets backoff

Error Handling

Different gRPC status codes trigger different behaviors:

  • codes.Canceled: Reset backoff, clean shutdown
  • codes.Unauthenticated: Continue backoff, log error
  • Other errors: Continue backoff with current delay
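
As a rough illustration, the mapping from status code to behavior can be written as a single switch. To keep the sketch dependency-free, the code type below is a local stand-in for the gRPC status codes; decision and decide are hypothetical names.

```go
package main

// code is a local stand-in for the gRPC status codes named above.
type code int

const (
	codeCanceled code = iota
	codeUnauthenticated
	codeUnavailable // one example of "other errors"
)

// decision describes what the reconnect loop does after a failure.
type decision struct {
	ResetBackoff bool // start again from the initial delay
	Shutdown     bool // stop reconnecting
	LogError     bool // surface the error in the logs
}

// decide maps a status code to the behavior listed above.
func decide(c code) decision {
	switch c {
	case codeCanceled:
		return decision{ResetBackoff: true, Shutdown: true}
	case codeUnauthenticated:
		return decision{LogError: true}
	default:
		return decision{} // keep backing off with the current delay
	}
}
```

Note that an unauthenticated error does not stop the loop in this model, which matches the behavior described here: the agent keeps retrying with backoff.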

Keepalive Mechanism

Once connected, the agent sends keepalive packets every 10 seconds:

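// StartKeepAlive sends a KeepAlive packet every pb.DefaultKeepAlive (10s)
// from a background goroutine until a send fails.
func (c *mutexClient) StartKeepAlive() {
    go func() {
        for {
            proto := &pb.Packet{Type: pbgateway.KeepAlive}
            if err := c.Send(proto); err != nil {
                // A failed send means the connection is lost; stop the loop.
                break
            }
            time.Sleep(pb.DefaultKeepAlive)
        }
    }()
}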

If keepalive fails, the connection is considered dead and reconnection begins.

Session Management

Session Lifecycle

  1. SessionOpen: Gateway requests new session
     • Validates connection parameters
     • Checks TCP liveness for database connections
     • Sends SessionOpenOK or SessionClose
  2. Active Session:
     • Processes protocol-specific packets
     • Maintains connection state in memory store
     • Handles concurrent connections (TCP mode)
  3. SessionClose:
     • Cleans up all associated connections
     • Removes from memory store
     • Sends exit code to gateway

TCP Liveness Check

For database connections (Postgres, MySQL, MSSQL, MongoDB), the agent checks that the target is reachable before it opens the session:

  1. Attempts TCP dial with 5-second timeout
  2. Validates host:port connectivity
  3. Fails session if unreachable

Connection Storage

The agent maintains connections in a thread-safe memory store:

  • Key format: {sessionID} or {sessionID}:{connectionID}
  • Supports filtering by prefix for cleanup
  • Handles graceful shutdown of all connections
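
A store with these properties can be sketched as a mutex-guarded map. The type and method names below are hypothetical, and values are left as `any` because the real connection type is not shown here.

```go
package main

import (
	"strings"
	"sync"
)

// connStore is a mutex-guarded map keyed by "{sessionID}" or
// "{sessionID}:{connectionID}".
type connStore struct {
	mu    sync.RWMutex
	items map[string]any
}

func newConnStore() *connStore {
	return &connStore{items: make(map[string]any)}
}

// Set stores val under key, replacing any previous value.
func (s *connStore) Set(key string, val any) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[key] = val
}

// Get returns the value stored under key, if any.
func (s *connStore) Get(key string) (any, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.items[key]
	return v, ok
}

// DeleteSession removes the session entry and all of its connections,
// and returns how many keys were removed.
func (s *connStore) DeleteSession(sessionID string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	removed := 0
	for key := range s.items {
		// Match "sid" exactly or "sid:..." so that session "s1"
		// does not also remove the keys of session "s10".
		if key == sessionID || strings.HasPrefix(key, sessionID+":") {
			delete(s.items, key)
			removed++
		}
	}
	return removed
}
```

Including the ":" separator in the prefix is the design choice worth noting: a bare prefix match on the session ID would also catch other sessions whose IDs merely start with the same characters.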

Error Recovery

Network Failures

  • Exponential backoff retry
  • Keepalive detection of dead connections
  • Automatic reconnection

Authentication Failures

  • Logged but continues retry attempts
  • No special handling vs other errors

Protocol Errors

  • Session closed with error message
  • Exit code sent to gateway
  • Connection cleaned up

Environment Variables

Runtime environment variables are merged in this order:

  1. Agent runtime envs (passed to RunV2)
  2. Connection-specific envs from gateway
  3. Client-provided envs (lowest priority)

The system.agent.envs marker gets special handling: it pulls values from the agent's own environment.
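
The merge itself can be pictured as layering three maps. The sketch below assumes the list is ordered from highest to lowest priority, so agent runtime envs win over connection envs, which win over client envs; only the "lowest priority" label on client envs is stated outright. mergeEnvs is a hypothetical name, and the system.agent.envs marker is not modeled.

```go
package main

// mergeEnvs combines the three sources into one map. Layers are applied
// from lowest to highest priority so that later writes overwrite earlier
// ones (assumed order: client < connection < agent runtime).
func mergeEnvs(agentRuntime, connection, client map[string]string) map[string]string {
	merged := make(map[string]string)
	for _, layer := range []map[string]string{client, connection, agentRuntime} {
		for key, val := range layer {
			merged[key] = val
		}
	}
	return merged
}
```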

TLS Configuration

  • Default: TLS enabled with system CA pool
  • Custom CA: Via TLSCA config field
  • Insecure mode: Skips TLS setup; use only for testing or on trusted networks
  • Server name override: TLSServerName field
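
These options map naturally onto Go's crypto/tls package. The sketch below is an assumption about how they could be wired together: tlsOptions and buildTLSConfig are hypothetical, only the TLSCA and TLSServerName field names come from the list above, and it assumes a custom CA replaces the system pool rather than extending it.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
)

// tlsOptions is a hypothetical subset of the agent configuration.
type tlsOptions struct {
	Insecure      bool   // skip TLS setup entirely
	TLSCA         string // PEM-encoded CA certificate(s), optional
	TLSServerName string // server name override, optional
}

// buildTLSConfig returns nil when Insecure is set. Otherwise it returns
// a config that verifies the gateway against the custom CA if provided,
// or against the system CA pool when RootCAs is left nil.
func buildTLSConfig(opts tlsOptions) (*tls.Config, error) {
	if opts.Insecure {
		return nil, nil
	}
	cfg := &tls.Config{ServerName: opts.TLSServerName}
	if opts.TLSCA != "" {
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM([]byte(opts.TLSCA)) {
			return nil, errors.New("TLSCA contains no valid PEM certificate")
		}
		cfg.RootCAs = pool
	}
	return cfg, nil
}
```

Leaving RootCAs nil is what selects the system CA pool in crypto/tls, so the default case needs no extra code.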

Debugging

Enable gRPC debug logging:

export LOG_GRPC=1  # Basic logging
export LOG_GRPC=2  # Verbose logging

Common Issues

  1. "context canceled": Normal shutdown, not an error
  2. "unauthenticated": Check token validity
  3. "failed connecting to remote host": Database unreachable
  4. Backoff stuck at max (512s): The connection has failed more than 9 times without a backoff reset

Remember: If the connection fails, check the logs. If logging fails, check the network. If the network fails, verify DNS resolution.