Imperva Cyber Community

Expand all | Collapse all

When a gateway ec2 instance is created it takes too long to come into service and is terminated.

  • 1.  When a gateway ec2 instance is created it takes too long to come into service and is terminated.

    Posted 01-16-2020 07:13

    We run SecureSphere Gateways in AWS using the marketplace ami version 12.5.

    We are having problems when they auto scale and also during a CloudFormation rolling update.

    Below is a (long winded) summary of the issue if anyone can provide any help.

    This has been an issue for 2 years now and I've raised 2 cases over that time but without any resolution or conclusion.

    -------------------------

    ISSUE
    When a gateway ec2 instance is created it takes too long to come into service and is terminated.
    This creates a cycle of creation and termination until one instance comes into service quick enough.
    To 'come into service' means to pass the AWS loadbalancer health checks.
    To pass the health checks means a gateway must download its configuration and then reply on a specific port.


    MONITORING
    Running in the background is a Lambda function which checks the health status of all gateway instances in all the loadbalancers attached to the scaling group.
    If it finds an instance that is unhealthy in all loadbalancers then an assumption is made that this instance is not functioning, as it is not replying to any loadbalancer.
    If an instance is unhealthy in a single (or multiple but not all) loadbalancer then an operator is informed.
    This prevents the issue where an incorrectly configured loadbalancer healthcheck could instruct the scaling group to terminate all gateways, causing outages for all services.
    An exception to this is made if the instance startup time is within the scaling group healthcheck grace period, for us this is 20 minutes.


    GATEWAY BUILD FLOW
    The AWS scaling group creates a new instance.
    We inject a 10 minutes delay into the scaling lifecycle.
    That delay gives the new instance 10 minutes to configure itself before the scaling group then adds the instance to the loadbalancers.
    The load balances then starts to healthcheck the instance and will send traffic to it once it passes.
    After the healthcheck grace period has ended, the instance is still not passing healthchecks and the monitor terminates it.
    So taking into account the delays and grace periods, a new instance has a total of 20 minutes to build and pass healthchecks before it is terminated.


    WHAT COULD WE DO
    We could increase the grace period, but this means that we are accepting the fact that a new gateway instance takes longer than 20 minutes to build.
    This seems like a long time to wait, especially if you are resonding to an increase in traffic that is overwelming your existing instances.


    LOGS
    Below is an extract from a new gateway instance that managed to come into service within the 20 minutes.
    Taken from log file: opt/SecureSphere/etc/logs/ConfigurationLogs/ConfigurationProviderLog/ConfigurationProviderLog.html
    Looking at this extract, can we tell how long the gateway took to finish building and downloading its config.


    LOG EXTRACT
    16/01/2020 02:50:33.430989 [debug] ConfigurationProvider.cpp:666 startProviderLogging - started configuration provider log
    16/01/2020 02:50:46.825895 [NOTIFICATION] Gateway.cpp:4542 Working in large scale configuration mode
    16/01/2020 02:50:46.825955 [debug] ConfigurationDispatcher.cpp:506 !!! updated controller IP to ***
    16/01/2020 02:50:46.826162 [NOTIFICATION] ConfigManager.cpp:6537 Started loading local configuration
    16/01/2020 02:50:46.833368 [debug] ConfigManager.cpp:6608 loadLocalConfiguration - Local configuration - compacting all files.
    16/01/2020 02:50:46.833400 [debug] ConfigManager.cpp:2511 compactFiles - Starting compacting config files
    16/01/2020 02:50:46.833405 [debug] ConfigManager.cpp:2524 compactFiles - Finished compacting config files
    16/01/2020 02:50:46.833409 [NOTIFICATION] ConfigManager.cpp:6619 Local configuration has been loaded successfully in 0 seconds (revision 0)
    16/01/2020 02:50:47.638069 [debug] ConfigurationDispatcher.cpp:496 Gateway configuration mode changed to 'READY_FOR_UPDATES'.
    16/01/2020 02:51:22.281782 [debug] ConfigurationDispatcher.cpp:32 Handling new configuration notification
    16/01/2020 02:51:22.281987 [NOTIFICATION] ConfigurationDispatcher.cpp:591 GW was instructed to refresh all of its configuration - cleaning local configuration and configuration repository
    16/01/2020 02:51:22.374849 [NOTIFICATION] ConfigurationDispatcher.cpp:73 Received configuration notification from server, fetching configuration...
    16/01/2020 02:51:22.374878 [NOTIFICATION] ConfigurationDispatcher.cpp:831 requesting a baseline revision id of at least 1
    16/01/2020 02:51:22.374935 [debug] ConfigurationDownloader.cpp:46 Referring to provider: https://***:443/configuration/get-spec
    16/01/2020 02:51:22.375035 [debug] ConfigurationDownloader.cpp:53 Requesting configuration spec from the provider, try #1
    16/01/2020 02:51:22.495829 [debug] ConfigurationDownloader.cpp:78 getSpecFromProvider - result code: 'SUCCESS', provider: https://***:443/configuration/get-spec
    16/01/2020 02:51:22.495866 [debug] ConfigurationDownloader.cpp:151 Parsing configuration spec
    16/01/2020 02:51:22.504824 [debug] ConfigurationDownloader.cpp:253 Downloading revision 155044...
    ... too many to list here.
    16/01/2020 02:51:30.966165 [debug] ConfigurationDownloader.cpp:253 Downloading revision 156737...
    16/01/2020 02:51:30.974142 [debug] ConfigurationDownloader.cpp:253 Downloading revision 156738...
    16/01/2020 02:51:30.982111 [debug] ConfigurationDispatcher.cpp:182 Configuration was downloaded successfully (took 8 sec)
    16/01/2020 02:51:30.982334 [NOTIFICATION] ConfigManager.cpp:6647 Started applying configuration revisions 155044 (baseline update) to 156738 (incremental update)
    16/01/2020 02:59:06.902846 [debug] ConfigManager.cpp:2511 compactFiles - Starting compacting config files
    16/01/2020 02:59:06.903456 [debug] ConfigManager.cpp:2524 compactFiles - Finished compacting config files
    16/01/2020 02:59:06.930127 [NOTIFICATION] ConfigManager.cpp:6692 Configuration revisions 155044 (baseline update) to 156738 (incremental update) were applied successfully in 455 sec
    16/01/2020 02:59:07.077753 [debug] ConfigurationDispatcher.cpp:620 Saving configuration files to repository
    16/01/2020 03:00:09.167358 [debug] ConfigurationDispatcher.cpp:217 Configuration apply finished successfully (took 518 sec)
    16/01/2020 03:00:09.246838 [debug] ConfigurationDispatcher.cpp:32 Handling new configuration notification
    16/01/2020 03:00:09.247068 [NOTIFICATION] ConfigurationDispatcher.cpp:73 Received configuration notification from server, fetching configuration...
    16/01/2020 03:00:09.247139 [debug] ConfigurationDownloader.cpp:46 Referring to provider: https://***:443/configuration/get-spec



    #On-PremisesWAF(formerlySecuresphere)

    ------------------------------
    Richard Bowden
    Aviva
    ------------------------------


  • 2.  RE: When a gateway ec2 instance is created it takes too long to come into service and is terminated.

    Imperva Employee
    Posted 01-16-2020 07:31
    can you explain why it was decided to inject a 10 minutes delay into the scaling lifecycle.
    was this done thru the AWS console
    How many devices are you spinning up during the scale-up situation 

    Also can you provide the number of the last case you opened 

    I personally have never encountered this scenario so I dont have any guidance right now 
    hopefully a case review will help 


    ------------------------------
    Phil Klassen
    ------------------------------



  • 3.  RE: When a gateway ec2 instance is created it takes too long to come into service and is terminated.

    Posted 01-16-2020 08:41

    Hi Phil,

    Thanks for the reply.
    Initially there was no delay.
    As soon as the instance was deemed ready by the auto scaling group it was added to the loadbalancers.

    Because the instance was not yet ready, i.e. it still had to download its config, all the loadbalancers would start triggering the UnhealthyHost CloudWatch alarms and we would be flooded with alerts.
    After consulting with AWS we decided to add a 10 minute delay to allow the instance time to configure itself and be able to pass healthchecks as soon as it was added to the loadbalancers.

    Regards
    Rick



    ------------------------------
    Richard Bowden
    Aviva
    ------------------------------



  • 4.  RE: When a gateway ec2 instance is created it takes too long to come into service and is terminated.

    Imperva Employee
    Posted 01-17-2020 10:38
    a few things 

    I wasnt able to find any previous cases for your compnay specifically but I did find some cases similar ot what you have described 
    - we did some tuning on our processes as it was taking too long 
    - but those changes were made prior to V12 so those situations should not occur with your deployment 

    One factoid that I did find - and makes sense - the larger the configuration, the longer it will take to initially load 
    remember once a GW is initiated it MUST load a full config 
    incremental updates only occur after the initial baseline/full config has been loaded

    I went through the log entries you provided and it took right at 10 mins - here is the breakdown

    Local config is simply a config that is always available and simply allow the GW to come up
    16/01/2020 02:50:46.833368 [debug] ConfigManager.cpp:6608 loadLocalConfiguration - Local configuration - compacting all files.

    process to get new/full conifg is starting
    16/01/2020 02:51:22.281782 [debug] ConfigurationDispatcher.cpp:32 Handling new configuration notification
    16/01/2020 02:51:22.281987 [NOTIFICATION] ConfigurationDispatcher.cpp:591 GW was instructed to refresh all of its configuration - cleaning local configuration and configuration repository

    Starting to download full config
    16/01/2020 02:51:22.374849 [NOTIFICATION] ConfigurationDispatcher.cpp:73 Received configuration notification from server, fetching configuration...

    Download completed
    16/01/2020 02:51:30.982111 [debug] ConfigurationDispatcher.cpp:182 Configuration was downloaded successfully (took 8 sec)

    Apply config
    16/01/2020 02:59:06.930127 [NOTIFICATION] ConfigManager.cpp:6692 Configuration revisions 155044 (baseline update) to 156738 (incremental update) were applied successfully in 455 sec
    16/01/2020 02:59:07.077753 [debug] ConfigurationDispatcher.cpp:620 Saving configuration files to repository

    config fully loaded
    16/01/2020 03:00:09.167358 [debug] ConfigurationDispatcher.cpp:217 Configuration apply finished successfully (took 518 sec)

    So you may want to set the delay to 11 mins - as we seem to be right on the limit 



    ------------------------------
    Phil Klassen
    ------------------------------



  • 5.  RE: When a gateway ec2 instance is created it takes too long to come into service and is terminated.

    Posted 01-20-2020 03:24
    Hi Phil,

    Thank you very much for taking the time to look at the log extract and provide some insight into what is happening.
    I'll do some more testing, as the log I provided was from an instance that actually came into service successfully.
    Hopefully I'll be able to get one that fails.
    We have had many instances that have not come into service and been terminated.
    As these have happened out of office hours we couldn't get the logs off the gateways to look at them (the SQS queue config to get messages from terminating instances does not work but that's another story).

    Regards
    Rick

    ------------------------------
    Richard Bowden
    Aviva
    ------------------------------



  • 6.  RE: When a gateway ec2 instance is created it takes too long to come into service and is terminated.

    Posted 01-20-2020 06:34

    Hi Phil,

    I did some more testing this morning and created a new gateway instance and after an hour it was still not in service.
    I repeated this and had the same result.
    I downloaded the logs from the last created instance and it shows that it is failing to download or apply it's configuration.
    It is getting it's configuration not from the MX but from an existing (working) gateway instance.

    I downloaded the log from the existing gateway instance and can see the requests for config coming from the IP address of the new instance, so there is communication between them.
    Below is some log extracts.

    NEW INSTANCE
    20/01/2020 10:01:21.089476 [debug] ConfigurationDispatcher.cpp:506 !!! updated controller IP to *.*.*.59
    20/01/2020 10:01:21.089649 [NOTIFICATION] ConfigManager.cpp:6537 Started loading local configuration
    20/01/2020 10:01:21.094988 [debug] ConfigManager.cpp:6608 loadLocalConfiguration - Local configuration - compacting all files.
    ...
    20/01/2020 10:02:02.398500 [debug] ConfigurationDownloader.cpp:46 Referring to provider: https://*.*.*.154:443/configuration/get-spec
    20/01/2020 10:02:02.645709 [debug] ConfigurationDownloader.cpp:53 Requesting configuration spec from the provider, try #1
    20/01/2020 10:02:02.767304 [debug] ConfigurationDownloader.cpp:78 getSpecFromProvider - result code: 'SUCCESS', provider: https://*.*.*.154:443/configuration/get-spec
    20/01/2020 10:02:02.767663 [debug] ConfigurationDownloader.cpp:151 Parsing configuration spec
    20/01/2020 10:02:02.775065 [debug] ConfigurationDownloader.cpp:253 Downloading revision 157115...
    20/01/2020 10:02:23.811167 [ERROR] ConfigurationDownloader.cpp:288 downloadSingleConfigurationContent - failed to download file 1.local.xml from the download server: https://*.*.*.154:443/configuration/configuration-repository/
    20/01/2020 10:02:23.811176 [ERROR] ConfigurationDownloader.cpp:267 downloadSingleRevision - failed to download revision 157115
    20/01/2020 10:02:23.811313 [ERROR] ConfigurationDownloader.cpp:35 work - failed to download files, provider: *.*.*.154, download server: https://*.*.*.154:443/configuration/configuration-repository/
    20/01/2020 10:02:23.811342 [ERROR] ConfigurationDispatcher.cpp:190 Failed to download configuration, result code: FILES_DOWNLOAD_FAILED (took 21 sec)
    20/01/2020 10:04:28.308184 [debug] ConfigurationDispatcher.cpp:32 Handling new configuration notification
    20/01/2020 10:04:28.308479 [NOTIFICATION] ConfigurationDispatcher.cpp:73 Received configuration notification from server, fetching configuration...
    20/01/2020 10:04:28.308659 [debug] ConfigurationDownloader.cpp:46 Referring to provider: https://*.*.*.154:443/configuration/get-spec
    20/01/2020 10:04:28.308760 [debug] ConfigurationDownloader.cpp:53 Requesting configuration spec from the provider, try #1
    20/01/2020 10:04:29.079721 [debug] ConfigurationDownloader.cpp:78 getSpecFromProvider - result code: 'SUCCESS', provider: https://*.*.*.154:443/configuration/get-spec
    20/01/2020 10:04:29.079736 [debug] ConfigurationDownloader.cpp:151 Parsing configuration spec
    20/01/2020 10:04:29.080476 [debug] ConfigurationDownloader.cpp:253 Downloading revision 157115...
    20/01/2020 10:04:50.105143 [ERROR] ConfigurationDownloader.cpp:288 downloadSingleConfigurationContent - failed to download file 1.local.xml from the download server: https://*.*.*.154:443/configuration/configuration-repository/
    20/01/2020 10:04:50.105151 [ERROR] ConfigurationDownloader.cpp:267 downloadSingleRevision - failed to download revision 157115
    20/01/2020 10:04:50.105326 [ERROR] ConfigurationDownloader.cpp:35 work - failed to download files, provider: *.*.*.154, download server: https://*.*.*.154:443/configuration/configuration-repository/
    20/01/2020 10:04:50.105355 [ERROR] ConfigurationDispatcher.cpp:190 Failed to download configuration, result code: FILES_DOWNLOAD_FAILED (took 21 sec)
    ... 22 more failures
    20/01/2020 10:58:05.743778 [debug] ConfigurationDownloader.cpp:46 Referring to provider: https://*.*.*.154:443/configuration/get-spec
    20/01/2020 10:58:05.743905 [debug] ConfigurationDownloader.cpp:53 Requesting configuration spec from the provider, try #1
    20/01/2020 10:58:05.789806 [warning] ConfigurationDownloader.cpp:69 Attempt 1 out of 3 to get spec failed (provider: *.*.*.154, result code: 'PROVIDER_BUSY'). Trying again in 39 seconds
    20/01/2020 10:58:44.789948 [debug] ConfigurationDownloader.cpp:53 Requesting configuration spec from the provider, try #2
    20/01/2020 10:58:44.836342 [warning] ConfigurationDownloader.cpp:69 Attempt 2 out of 3 to get spec failed (provider: *.*.*.154, result code: 'PROVIDER_BUSY'). Trying again in 94 seconds
    20/01/2020 11:00:18.836512 [debug] ConfigurationDownloader.cpp:53 Requesting configuration spec from the provider, try #3
    20/01/2020 11:00:18.883281 [debug] ConfigurationDownloader.cpp:78 getSpecFromProvider - result code: 'PROVIDER_BUSY', provider: https://*.*.*.154:443/configuration/get-spec
    20/01/2020 11:00:18.883327 [ERROR] ConfigurationDownloader.cpp:21 work - failed to get configuration spec, error: PROVIDER_BUSY, provider: https://*.*.*.154:443/configuration/get-spec
    20/01/2020 11:00:18.883335 [ERROR] ConfigurationDispatcher.cpp:190 Failed to download configuration, result code: PROVIDER_BUSY (took 133 sec)


    EXISTING INSTANCE
    20/01/2020 10:02:10.605078 [debug] ConfigurationProvider.cpp:112 handleConfigurationSpecRequestMessage - Provider- Spec request arrived
    20/01/2020 10:02:10.605273 [debug] ConfigurationProvider.cpp:186 handleConfigurationSpecRequestMessage - Providing Spec to requestor's IP *.*.*.94 - Provider cache miss
    20/01/2020 10:02:10.607056 [debug] ConfigurationProvider.cpp:196 handleConfigurationSpecRequestMessage - Providing Spec to requestor's IP *.*.*.94 - Provider- New revisions list was generated
    20/01/2020 10:02:10.610642 [info] ConfigurationProvider.cpp:256 Spec for requestor IP *.*.*.94; Response code- SUCCESS; Unique id- e4637e1e-b71e-4e37-a918-96bbfd850dc1; Baseline- true; Source revision id- 0; Target revision id- 157221
    20/01/2020 10:02:10.610675 [debug] ConfigurationProvider.cpp:224 handleConfigurationSpecRequestMessage - Provider- Spec respond delivered
    20/01/2020 10:03:14.127503 [info] ConfigurationProvider.cpp:573 Content download for requestor IP *.*.*.94 - Delivering content file /opt/SecureSphere/etc/configuration/configuration-repository/157115_baseline_not-applied/1.local.xml
    20/01/2020 10:04:36.260049 [debug] ConfigurationProvider.cpp:112 handleConfigurationSpecRequestMessage - Provider- Spec request arrived
    20/01/2020 10:04:36.260281 [debug] ConfigurationProvider.cpp:186 handleConfigurationSpecRequestMessage - Providing Spec to requestor's IP *.*.*.94 - Provider cache miss
    20/01/2020 10:04:36.261591 [debug] ConfigurationProvider.cpp:196 handleConfigurationSpecRequestMessage - Providing Spec to requestor's IP *.*.*.94 - Provider- New revisions list was generated
    20/01/2020 10:04:36.265127 [info] ConfigurationProvider.cpp:256 Spec for requestor IP *.*.*.94; Response code- SUCCESS; Unique id- e4637e1e-b71e-4e37-a918-96bbfd850dc1; Baseline- false; Source revision id- 0; Target revision id- 157221
    ... about another 22 of the above entries.
    20/01/2020 10:55:43.541185 [debug] ConfigurationProvider.cpp:224 handleConfigurationSpecRequestMessage - Provider- Spec respond delivered
    20/01/2020 10:57:04.093764 [info] ConfigurationProvider.cpp:573 Content download for requestor IP *.*.*.94 - Delivering content file /opt/SecureSphere/etc/configuration/configuration-repository/157115_baseline_not-applied/1.local.xml
    20/01/2020 10:58:13.737041 [debug] ConfigurationProvider.cpp:112 handleConfigurationSpecRequestMessage - Provider- Spec request arrived
    20/01/2020 10:58:13.737087 [debug] ConfigurationProvider.cpp:123 handleConfigurationSpecRequestMessage - Providing Spec to requestor's IP *.*.*.94 - Provider dismissed spec request- Provider busy. Too many concurrent contants downloads
    20/01/2020 10:58:13.737177 [info] ConfigurationProvider.cpp:239 Spec for requestor IP *.*.*.94; response code- PROVIDER_BUSY
    20/01/2020 10:58:13.737216 [debug] ConfigurationProvider.cpp:224 handleConfigurationSpecRequestMessage - Provider- Spec respond delivered
    20/01/2020 10:58:52.784161 [debug] ConfigurationProvider.cpp:112 handleConfigurationSpecRequestMessage - Provider- Spec request arrived
    20/01/2020 10:58:52.784186 [debug] ConfigurationProvider.cpp:123 handleConfigurationSpecRequestMessage - Providing Spec to requestor's IP *.*.*.94 - Provider dismissed spec request- Provider busy. Too many concurrent contants downloads





    ------------------------------
    Richard Bowden
    Aviva
    ------------------------------



  • 7.  RE: When a gateway ec2 instance is created it takes too long to come into service and is terminated.

    Posted 01-21-2020 04:36

    Something odd is happening here.
    On our NonProd MX we have two gateway groups, one for NonProd gateways and one for PreProd gateways.

    The NonPorod and PreProd gateways are created using two separate CloudFormation stacks and so have separate autoscaling groups.
    When launching a new PreProd instance it always goes to the MX and downloads its configuration from there on port 8083 without any issues.
    If I launch a new NonProd instance it initially goes to the MX but then almost always attempts to download its configuration from another gateway on port 443 and then fails to download it.



    ------------------------------
    Richard Bowden
    Aviva
    ------------------------------



  • 8.  RE: When a gateway ec2 instance is created it takes too long to come into service and is terminated.

    Posted 30 days ago
    I'll raise a case regarding this and see how I get on.

    ------------------------------
    Richard Bowden
    Aviva
    ------------------------------



  • 9.  RE: When a gateway ec2 instance is created it takes too long to come into service and is terminated.

    Imperva Employee
    Posted 30 days ago
    And this is a WAF deployment - I do see where we have timeouts trying to download the config files from the provider 
    And in this case the provider - seems to be another GW 

    I do know of a DAM configuration where this MAY happen - but not for WAF
    please confirm this is WAF only

    ------------------------------
    Phil Klassen
    ------------------------------