Interruption anormale des instances Simplicité déployées sur un serveur AWS

Tags: #<Tag:0x00007f9e4da8bf18>

Bonjour,

Nous remarquons une interruption anormale de nos instances qui sont déployées sur serveur AWS (CentOS).
Cette interruption se manifeste tous les jours entre 15h et 16h.
Pour résoudre, nous redémarrons l’instance EC2 à chaque fois. Ce n’est pas une solution optimale !

Ce que nous interceptons comme logs du journal AWS :

[e[32m  OK  e[0m] Started Monitoring of LVM2 mirrors,...ng dmeventd or progress polling.
[e[32m  OK  e[0m] Reached target Local File Systems (Pre).
[e[32m  OK  e[0m] Reached target Local File Systems.
         Starting Import network configuration from initramfs...
         Starting Preprocess NFS configuration...
[e[32m  OK  e[0m] Started Preprocess NFS configuration.
[e[32m  OK  e[0m] Started Import network configuration from initramfs.
         Starting Create Volatile Files and Directories...
[e[32m  OK  e[0m] Started Create Volatile Files and Directories.
         Starting Security Auditing Service...
[    7.035311] type=1305 audit(1593784721.845:3): audit_pid=549 old=0 auid=4294967295 ses=4294967295 res=1
[e[32m  OK  e[0m] Started Security Auditing Service.
         Starting Update UTMP about System Boot/Shutdown...
[e[32m  OK  e[0m] Started Update UTMP about System Boot/Shutdown.
[e[32m  OK  e[0m] Reached target System Initialization.
[e[32m  OK  e[0m] Listening on D-Bus System Message Bus Socket.
         Starting Initial cloud-init job (pre-networking)...
[e[32m  OK  e[0m] Started Run docker-cleanup every hour.
[e[32m  OK  e[0m] Listening on RPCbind Server Activation Socket.
[e[32m  OK  e[0m] Reached target Sockets.
[e[32m  OK  e[0m] Reached target Basic System.
[e[32m  OK  e[0m] Started D-Bus System Message Bus.
[e[32m  OK  e[0m] Started irqbalance daemon.
         Starting Login Service...
         Starting NTP client/server...
         Starting GSSAPI Proxy Daemon...
         Starting Dump dmesg to /var/log/dmesg...
         Starting Authorization Manager...
         Starting RPC bind service...
[e[32m  OK  e[0m] Started Daily Cleanup of Temporary Directories.
[e[32m  OK  e[0m] Reached target Timers.
[e[32m  OK  e[0m] Started Dump dmesg to /var/log/dmesg.
[e[32m  OK  e[0m] Started RPC bind service.
[e[32m  OK  e[0m] Started Login Service.
[e[32m  OK  e[0m] Started NTP client/server.
[e[32m  OK  e[0m] Started GSSAPI Proxy Daemon.
[e[32m  OK  e[0m] Reached target NFS client services.
[e[32m  OK  e[0m] Reached target Remote File Systems (Pre).
[e[32m  OK  e[0m] Reached target Remote File Systems.
[e[32m  OK  e[0m] Started Authorization Manager.
[    7.900017] cloud-init[606]: Cloud-init v. 18.5 running 'init-local' at Fri, 03 Jul 2020 13:58:42 +0000. Up 7.88 seconds.
[e[32m  OK  e[0m] Started Initial cloud-init job (pre-networking).
[e[32m  OK  e[0m] Reached target Network (Pre).
         Starting LSB: Bring up/down networking...
[e[32m  OK  e[0m] Started LSB: Bring up/down networking.
[e[32m  OK  e[0m] Reached target Network.
         Starting Enable periodic update of entitlement certificates....
[e[32m  OK  e[0m] Started Apache Tomcat Web Application Container.
         Starting Dynamic System Tuning Daemon...
         Starting Postfix Mail Transport Agent...
         Starting The nginx HTTP and reverse proxy server...
         Starting PostgreSQL database server...
         Starting Initial cloud-init job (metadata service crawler)...
[e[32m  OK  e[0m] Started Enable periodic update of entitlement certificates..
[e[32m  OK  e[0m] Started The nginx HTTP and reverse proxy server.
[   11.123846] cloud-init[899]: Cloud-init v. 18.5 running 'init' at Fri, 03 Jul 2020 13:58:45 +0000. Up 11.10 seconds.
[e[32m  OK  e[0m] Started Dynamic System Tuning Daemon.
[   11.165378] cloud-init[899]: ci-info: ++++++++++++++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++++++++++++
[   11.173281] cloud-init[899]: ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
[   11.184579] cloud-init[899]: ci-info: | Device |  Up  |          Address           |      Mask     | Scope  |     Hw-Address    |
[   11.192039] cloud-init[899]: ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
[   11.198727] cloud-init[899]: ci-info: |  eth0  | True |       172.31.26.159        | 255.255.240.0 | global | 02:40:12:c6:83:e6 |
[   11.206173] cloud-init[899]: ci-info: |  eth0  | True | fe80::40:12ff:fec6:83e6/64 |       .       |  link  | 02:40:12:c6:83:e6 |
[   11.213120] cloud-init[899]: ci-info: |   lo   | True |         127.0.0.1          |   255.0.0.0   |  host  |         .         |
[   11.223556] cloud-init[899]: ci-info: |   lo   | True |          ::1/128           |       .       |  host  |         .         |
[   11.233575] cloud-init[899]: ci-info: +--------+------+----------------------------+---------------+--------+-------------------+
[   11.244642] cloud-init[899]: ci-info: +++++++++++++++++++++++++++++Route IPv4 info+++++++++++++++++++++++++++++
[   11.261700] cloud-init[899]: ci-info: +-------+-------------+-------------+---------------+-----------+-------+
[   11.273933] cloud-init[899]: ci-info: | Route | Destination |   Gateway   |    Genmask    | Interface | Flags |
[   11.284555] cloud-init[899]: ci-info: +-------+-------------+-------------+---------------+-----------+-------+
[   11.290358] cloud-init[899]: ci-info: |   0   |   0.0.0.0   | 172.31.16.1 |    0.0.0.0    |    eth0   |   UG  |
[   11.300627] cloud-init[899]: ci-info: |   1   | 169.254.0.0 |   0.0.0.0   |  255.255.0.0  |    eth0   |   U   |
[   11.311956] cloud-init[899]: ci-info: |   2   | 172.31.16.0 |   0.0.0.0   | 255.255.240.0 |    eth0   |   U   |
[   11.317694] cloud-init[899]: ci-info: +-------+-------------+-------------+---------------+-----------+-------+
[   11.323447] cloud-init[899]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
[   11.328432] cloud-init[899]: ci-info: +-------+-------------+---------+-----------+-------+
[   11.334595] cloud-init[899]: ci-info: | Route | Destination | Gateway | Interface | Flags |
[   11.341261] cloud-init[899]: ci-info: +-------+-------------+---------+-----------+-------+
[   11.346236] cloud-init[899]: ci-info: |   9   |  fe80::/64  |    ::   |    eth0   |   U   |
[   11.352933] cloud-init[899]: ci-info: |   13  |   ff00::/8  |    ::   |    eth0   |   U   |
[   11.358110] cloud-init[899]: ci-info: +-------+-------------+---------+-----------+-------+
[e[32m  OK  e[0m] Started Initial cloud-init job (metadata service crawler).
[e[32m  OK  e[0m] Reached target Network is Online.
         Starting Notify NFS peers of a restart...
         Starting Crash recovery kernel arming...
         Starting System Logging Service...
         Starting Docker Storage Setup...
[e[32m  OK  e[0m] Reached target Cloud-config availability.
         Starting Apply the settings specified in cloud-config...
         Starting OpenSSH server daemon...
         Starting Permit User Sessions...
[e[32m  OK  e[0m] Started Notify NFS peers of a restart.
[e[32m  OK  e[0m] Started Permit User Sessions.
[e[32m  OK  e[0m] Started Getty on tty1.
[e[32m  OK  e[0m] Started Serial Getty on ttyS0.
[e[32m  OK  e[0m] Reached target Login Prompts.
[e[32m  OK  e[0m] Started Command Scheduler.
[e[32m  OK  e[0m] Started System Logging Service.
[e[32m  OK  e[0m] Started OpenSSH server daemon.
[e[32m  OK  e[0m] Started Postfix Mail Transport Agent.
[   11.799092] TECH PREVIEW: Overlay filesystem may not be fully supported.
[   11.799092] Please review provided documentation for limitations.
[e[32m  OK  e[0m] Started Docker Storage Setup.
         Starting Docker Application Container Engine...
[   11.847353] cloud-init[1075]: Cloud-init v. 18.5 running 'modules:config' at Fri, 03 Jul 2020 13:58:46 +0000. Up 11.79 seconds.
[e[32m  OK  e[0m] Started Apply the settings specified in cloud-config.
         Starting Execute cloud user/final scripts...
[e[32m  OK  e[0m] Started PostgreSQL database server.
[   12.157639] cloud-init[1215]: Cloud-init v. 18.5 running 'modules:final' at Fri, 03 Jul 2020 13:58:46 +0000. Up 12.11 seconds.
[   12.186111] cloud-init[1215]: Cloud-init v. 18.5 finished at Fri, 03 Jul 2020 13:58:46 +0000. Datasource DataSourceEc2Local.  Up 12.18 seconds
[e[32m  OK  e[0m] Started Execute cloud user/final scripts.
[e[32m  OK  e[0m] Started Crash recovery kernel arming.
[e[32m  OK  e[0m] Created slice User Slice of centos.
[e[32m  OK  e[0m] Started Session 1 of user centos.
[   15.261254] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[   15.271488] Bridge firewalling registered
[   15.286850] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[   15.376477] IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
[   15.408110] IPv6: ADDRCONF(NETDEV_UP): br-8b52be162713: link is not ready
[   15.915410] br-8b52be162713: port 1(veth5ae32ca) entered blocking state
[   15.924179] br-8b52be162713: port 1(veth5ae32ca) entered disabled state
[   15.954190] device veth5ae32ca entered promiscuous mode
[   15.960380] IPv6: ADDRCONF(NETDEV_UP): veth5ae32ca: link is not ready
[   15.966599] br-8b52be162713: port 1(veth5ae32ca) entered blocking state
[   15.972533] br-8b52be162713: port 1(veth5ae32ca) entered forwarding state
[   15.978896] IPv6: ADDRCONF(NETDEV_CHANGE): br-8b52be162713: link becomes ready
[   15.985607] br-8b52be162713: port 1(veth5ae32ca) entered disabled state
[   15.992246] br-8b52be162713: port 2(veth8ba7827) entered blocking state
[   15.998190] br-8b52be162713: port 2(veth8ba7827) entered disabled state
[   16.004832] device veth8ba7827 entered promiscuous mode
[   16.011150] IPv6: ADDRCONF(NETDEV_UP): veth8ba7827: link is not ready
[   16.016660] br-8b52be162713: port 2(veth8ba7827) entered blocking state
[   16.020849] br-8b52be162713: port 2(veth8ba7827) entered forwarding state
[   16.041026] br-8b52be162713: port 3(veth24c71bd) entered blocking state
[   16.045345] br-8b52be162713: port 3(veth24c71bd) entered disabled state
[   16.049696] device veth24c71bd entered promiscuous mode
[   16.066195] IPv6: ADDRCONF(NETDEV_UP): veth24c71bd: link is not ready
[   16.071847] br-8b52be162713: port 3(veth24c71bd) entered blocking state
[   16.076920] br-8b52be162713: port 3(veth24c71bd) entered forwarding state
[   16.110787] br-8b52be162713: port 4(vethfbd47cf) entered blocking state
[   16.116660] br-8b52be162713: port 4(vethfbd47cf) entered disabled state
[   16.122849] device vethfbd47cf entered promiscuous mode
[   16.127199] IPv6: ADDRCONF(NETDEV_UP): vethfbd47cf: link is not ready
[   16.132130] br-8b52be162713: port 4(vethfbd47cf) entered blocking state
[   16.137391] br-8b52be162713: port 4(vethfbd47cf) entered forwarding state
[   16.143662] br-8b52be162713: port 5(vethe59793d) entered blocking state
[   16.148809] br-8b52be162713: port 5(vethe59793d) entered disabled state
[   16.154072] device vethe59793d entered promiscuous mode
[   16.158397] IPv6: ADDRCONF(NETDEV_UP): vethe59793d: link is not ready
[   16.163452] br-8b52be162713: port 5(vethe59793d) entered blocking state
[   16.168911] br-8b52be162713: port 5(vethe59793d) entered forwarding state
[e[32m  OK  e[0m] Started libcontainer container 74f3...72a6cb35225f87cd08e6da19bdb3273.
[   16.350083] br-8b52be162713: port 2(veth8ba7827) entered disabled state
[   16.355671] br-8b52be162713: port 4(vethfbd47cf) entered disabled state
[   16.361671] br-8b52be162713: port 5(vethe59793d) entered disabled state
[   16.369116] br-8b52be162713: port 3(veth24c71bd) entered disabled state
[   16.537251] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   16.542577] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   16.546825] IPv6: ADDRCONF(NETDEV_CHANGE): veth5ae32ca: link becomes ready
[   16.551026] br-8b52be162713: port 1(veth5ae32ca) entered blocking state
[   16.555906] br-8b52be162713: port 1(veth5ae32ca) entered forwarding state
[   16.869269] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   16.881309] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   16.886538] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   16.890663] IPv6: ADDRCONF(NETDEV_CHANGE): vethe59793d: link becomes ready
[   16.895115] br-8b52be162713: port 5(vethe59793d) entered blocking state
[   16.899581] br-8b52be162713: port 5(vethe59793d) entered forwarding state
[   16.903933] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   16.907716] IPv6: ADDRCONF(NETDEV_CHANGE): vethfbd47cf: link becomes ready
[   16.912056] br-8b52be162713: port 4(vethfbd47cf) entered blocking state
[   16.916215] br-8b52be162713: port 4(vethfbd47cf) entered forwarding state
[   16.926285] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   16.931088] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   16.935976] IPv6: ADDRCONF(NETDEV_CHANGE): veth24c71bd: link becomes ready
[   16.941292] br-8b52be162713: port 3(veth24c71bd) entered blocking state
[   16.946133] br-8b52be162713: port 3(veth24c71bd) entered forwarding state
[   16.973797] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   16.978253] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   16.982258] IPv6: ADDRCONF(NETDEV_CHANGE): veth8ba7827: link becomes ready
[   16.986345] br-8b52be162713: port 2(veth8ba7827) entered blocking state
[   16.990052] br-8b52be162713: port 2(veth8ba7827) entered forwarding state
CentOS Linux 7 (Core)
Kernel 3.10.0-1127.el7.x86_64 on an x86_64
ip-172-31-26-159 login:

Screenshot de l’instance au moment de l’occurance :

Auriez-vous une idée ?

Merci d’avance.

Healthcheck simplicité :

[Platform]
Status=OK
Version=4.0.P24
BuiltOn=2020-07-02 15:57 (revision 0403c3304089466badd16eb604bbab80f1c5ba49)
Encoding=UTF-8
EndpointIP=172.18.0.3
EndpointURL=http://878ff8084a4e:8080
TimeZone=UTC
SystemDate=2020-07-06 15:26:09

[Application]
ApplicationVersion=4.0
ContextPath=
ContextURL=https://devsimplicite.mineralconseil.fr
ActiveSessions=2
TotalUsers=19
EnabledUsers=9
LastLoginDate=

[Server]
ServerInfo=Apache Tomcat/9.0.36
ServerType=WEB
User=root

[OS]
Name=Linux
Architecture=amd64
Version=3.10.0-1127.el7.x86_64
SystemEncoding=UTF-8

[Disk]
DiskFree=67490
DiskUsable=67490
DiskTotal=81908

[JavaVM]
Version=14.0.1
Vendor=Red Hat, Inc.
VMName=OpenJDK 64-Bit Server VM
VMVersion=14.0.1+7
ScriptEngine=rhino
ScriptEngineVersion=Rhino 1.7.11 2019 05 30
HeapFree=61224
HeapSize=237568
HeapMaxSize=970752
TotalFreeSize=794408

[Cache]
GrantCache=3
GrantCacheMax=0
GrantCacheRatio=0
ObjectCache=201
ObjectCacheMax=10000
ObjectCacheRatio=2
ProcessCache=1
ProcessCacheMax=10000
ProcessCacheRatio=0

[Database]
Vendor=3
ProductName=PostgreSQL
ProductVersion=10.5 (Debian 10.5-2.pgdg90+1)
DriverName=PostgreSQL JDBC Driver
DriverVersion=42.2.14
DBDate=2020-07-06 15:26:09
DBDateOffset=0
DBPatchLevel=P24;eea8d51834dcd04b456d8cf624b95f57
UsingBLOBs=true

[Healthcheck]
Date=2020-07-06 15:26:09
ElapsedTime=9

Je ne vois rien au niveau Simplicité qui pourrait induire une “interruption” de service à heure fixe.

A mon avis c’est :

  • soit un pb technique purement au niveau de l’infrastructure (genre une sauvegarde programmée qui coupe des connexions, et/ou démonte un filesystem externe, et/ou suspend des processus, et/ou “fige” la VM, etc.)
  • soit vous lancez un traitement applicatif spécifique à cette heure là via la cron et celui-ci part en boucle (ou dans le genre)
  • soit un “agent” externe effectue une opération à cette heure là (appel de service, import/export, …)
  • etc.

Commencez déjà par regardez les logs Tomcat (logs serveur et access logs) et Simplicité à cette heure là pour voir ce qui s’y trouve.

Mais bon, si vous êtes vraiment obligé de rebooter votre VM, je pense franchement que c’est un pb d’infrastructure, je n’ai pas le souvenir d’avoir vu une instance Simplicité “planter” un serveur Linux.

Bonjour,

A priori nous voyons pas d’explications côté infrastructure.
Nous avons eu le problème encore une fois. Nous avons extrait les logs (Tomcat serveur+access) et logs simplicité.

catalina.out


access-logs

simplicite.log

Merci

Comme vous le voyez il n’y a strictement rien dans ces copies d’écran (i.e. pas de stacktrace avec des exceptions système graves, ni trace de plantage de la JVM) qui pourrait donner la moindre information sur un “plantage” système

En l’absence de ce genre de signes visibles je continue à penser que votre pb est soit lié à l’infrastructure soit peut être à un pb grave dans votre code (genre une boucle infinie, ce qu’une supervision du CPU peut révéler)

Par contre je vois des erreurs applicatives très bizarres (“unknown field .”) que vous devriez investiguer, mais rien à voir à priori avec un éventuel plantage sauf si c’est la cause de choses qui partent en boucle dans votre code.

De manière générale:

  1. quand vous nous transmettez des logs, merci de nous fournir des fichier logs complets, pas des copies d’écran partielles comme ici
  2. quand il s’agit d’un pb système, comme ici, merci de nous transmettre toutes les informations de contexte qui nous permettront de savoir comment vous exécutez la plateforme (ex: s’il s’agit de containers Docker merci de nous fournir la commande exacte de lancement du container ou les fichiers Docker-compose ou Kubernetes) que vous utilisez.

Sans ces éléments de base il est vraiment impossible de vous apporter une aide pertinente.

PS: Ce n’est pas specifique à Simplicité mais pour investiguer des pbs de “plantage” de JVM il est possible de plugger une supervision JMX externe sur la JVM (cf. https://docs.simplicite.io/documentation/90-operation/docker.md où vous cherchez “JMX”)