-- AAISP's status page

Recent posts

BT planned work on one of our interlinks

MAINTENANCE Planned BT

AFFECTING

STARTING

May 24, 12:01 AM (20¾ days)

DESCRIPTION

BT have planned work on their side of one of our hostlinks on 24th May between midnight and 6AM. We will move traffic away from this hostlink beforehand, so we don't expect this to impact customers.

Last updated: 7¼ days ago
Update due: 22¼ days

View full details

Small number of BT lines dropped and reconnected

MINOR Closed BT

AFFECTED

STARTED

May 02, 01:30 PM (12 hours ago)

CLOSED

May 02, 02:30 PM (11 hours ago)

DESCRIPTION

At around 13:30 we saw a small number of BT lines drop and reconnect. Customers are back online, and we're investigating the cause.

Resolution:

Closed: 11 hours ago

View full details

Call issues ( unrelated to https://aastatus.net/42669 )

MINOR Closed VoIP and SIMs

AFFECTED

VoIP and SIMs

STARTED

May 02, 11:57 AM (13¾ hours ago)

CLOSED

May 02, 12:05 PM (13½ hours ago)

DESCRIPTION

One of our upstream carriers is having call issues. We've routed calls away from them as much as possible, but this may still affect some inbound calls. The symptom is unexpected call rejections or audio issues. They're aware and are working to fix it as soon as possible.

Resolution: One of our upstream carriers reported: at 11:04 hrs BST the call handling units in one of our nodes started to behave unexpectedly, causing some audio issues and call failures. We are investigating some unusual traffic received at that time. The units were restarted starting at 11:22 hrs BST at which point traffic quickly restored to normal. The remaining units were fully restarted by 11:44 hrs BST when we were back at full redundancy.

Closed: 13½ hours ago

View full details

Call issues

MINOR Open SIP2SIM

AFFECTING

SIP2SIM

STARTED

May 02, 11:32 AM (14 hours ago)

DESCRIPTION

We have had reports of some issues with calls to SIP2SIM mobiles. This is being investigated. Note that SMS and calls from mobiles are unaffected.

Last updated: 11 hours ago
Expected close: 0 seconds ago

View full details

Data SIMs reconnecting

MINOR Closed Data SIMs

AFFECTED

Data SIMs

STARTED

May 01, 09:07 PM (1 day ago)

CLOSED

May 01, 09:11 PM (1 day ago)

DESCRIPTION

We're investigating our Data SIM service as we have a large number of SIMs reconnecting and dropping their connection to us.

Resolution:

Closed: 1 day ago

View full details

ONSIM SIP2SIM changes - possible loss of calls this evening.

MAINTENANCE Assumed Completed VoIP and SIMs

AFFECTING

VoIP and SIMs

STARTED

May 01, 06:00 PM (1¼ days ago)

DESCRIPTION

We had an issue yesterday caused by us having more than one call server and the carrier not coping with that. They have suggested changes that will rectify this. We plan to make changes this evening to address it, but there is known to be something of a lag in them picking up DNS changes, so if things do not work it could result in a period where calls are not working. From what we saw, calls from mobiles should continue reliably, or with a brief issue if this does not work. Calls to mobiles may have an issue for longer. Sorry for the problems. However, we are going to see if we can get them to test on some test SIMs first, which would mean we are able to fix this without disruption.

Last updated: 17¼ hours ago

View full details

ONSIM SIP2SIM calls to phones

MINOR Closed VoIP and SIMs

AFFECTED

VoIP and SIMs

STARTED

Apr 30, 12:00 PM (2½ days ago)

CLOSED

Apr 30, 09:18 PM (2 days ago)

DESCRIPTION

We have had some reports today of some calls not getting to phones. This is being investigated.

Resolution: For now, only one server is in use, until we sort out multiple-server handling with the carrier.

Closed: 2 days ago

View full details

Some TalkTalk instabilities

MINOR Closed TalkTalk

AFFECTED

TalkTalk

STARTED

Apr 27, 10:10 AM (5½ days ago)

CLOSED

Apr 28, 06:46 PM (4¼ days ago)

DESCRIPTION

We've seen a few TalkTalk ADSL/VDSL lines drop and reconnect this morning. This was caused by a power alert on a network card on a device in TalkTalk's network that routes L2TP traffic in Telehouse. The card reset itself and re-routed traffic, causing a small number of lines to drop their PPP and reconnect a few moments later.

Resolution: Two separate incidents within TalkTalk's network caused a small number of our customers to drop their connection and reconnect moments later on Saturday mid-morning and Sunday early morning.

Closed: 4¼ days ago

View full details

Routing problem to some of our services

MINOR Closed AA Services

AFFECTED

AA Services

STARTED

Apr 23, 05:30 PM (9¼ days ago)

CLOSED

Apr 23, 08:49 PM (9 days ago)

DESCRIPTION

There is a routing problem affecting access to some of our services, eg our website and L2TP service among others. We're investigating.

Resolution: This was caused by a third party internet provider with whom we have been in talks about providing us with some transit. They had provisionally configured some of their routers to allow us to announce our IP blocks through them, although we had not got to the point of actually setting up the service. One of their routers malfunctioned and got into a state where it was re-announcing our IP blocks to some of the internet, which meant some of the internet was sending traffic bound for us to them. We mitigated some of the problems by announcing more specific routes, and also got in touch with the provider, who promptly fixed their router.
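As a rough illustration of why announcing more specific routes helps: routers forward on the longest matching prefix, so a more specific announcement is preferred over a covering one regardless of where the covering one came from. The sketch below is a minimal model only. It uses a documentation prefix (198.51.100.0/24), not one of our real IP blocks, the route labels are made up, and it models only the view of a network that had been preferring the re-announced route.

```python
import ipaddress


def best_route(destination, routes):
    """Return the (prefix, origin) pair with the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, origin) for net, origin in routes if addr in net]
    if not matches:
        return None
    # Longest prefix match: the most specific covering prefix wins.
    return max(matches, key=lambda item: item[0].prefixlen)


# Hypothetical block (documentation range), standing in for a real announcement.
covering = ipaddress.ip_network("198.51.100.0/24")

# View of an affected network before mitigation: it follows the re-announced block.
routes_before = [(covering, "leaked path")]

# After mitigation: two more specific halves of the same block are also announced.
routes_after = routes_before + [
    (half, "legitimate path") for half in covering.subnets(prefixlen_diff=1)
]

print(best_route("198.51.100.200", routes_before))
print(best_route("198.51.100.200", routes_after))
```

In this model the second lookup selects one of the more specific halves, so traffic follows the legitimate path even though the covering route is still present.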

Closed: 9 days ago

View full details

Some VoIP Registration Problems

MINOR Closed VoIP

AFFECTED

VoIP

STARTED

Apr 22, 12:14 PM (10½ days ago)

CLOSED

Apr 22, 12:30 PM (10½ days ago)

DESCRIPTION

Some customers are having problems with registering their VoIP phone. Investigations are underway. This will cause problems for some customers with receiving and making calls.

Resolution: There was a problem with us storing the port for some SIP registrations between 11:14 and 12:30 which was causing some registrations to fail.

Closed: 10½ days ago

View full details

TalkTalk planned work on our interlinks

MAINTENANCE Assumed Completed TalkTalk

AFFECTING

TalkTalk

STARTED

Apr 21, 09:00 PM (11 days ago)

DESCRIPTION

We have multiple interlinks to TalkTalk that carry our broadband traffic. TalkTalk have scheduled planned work on both these links during a four week period from Tuesday 23rd April until 16th May (specifically midnight to 6AM on 23rd, 25th, 25th April and 1st, 2nd, 9th, 16th May).

Due to the work being carried out (Software updates of their "LTSs") we are unable to move traffic seamlessly between our interlinks and so TalkTalk customers will see their connections drop and reconnect on these early mornings.

Last updated: 18 hours ago
Update due: 7½ days ago (overdue)

View full details

Overnight router work affecting some CityFibre customers

MAINTENANCE Completed CityFibre

AFFECTING

CityFibre

STARTED

Apr 20, 04:00 AM (12¾ days ago)

CLOSED

Apr 22, 07:25 AM (10¾ days ago)

DESCRIPTION

We'll be performing some work on one of our routers for CityFibre connections. Some CityFibre customers will see their connection drop at 4AM on Saturday morning and reconnect moments later.

Resolution: This work has been completed.

Closed: 10¾ days ago

View full details

Ethernet Services and Customer BGP Router Maintenance

MAINTENANCE Completed Ethernet BGP

AFFECTING

Ethernet BGP

STARTED

Apr 15, 10:30 PM (17 days ago)

CLOSED

Apr 17, 09:30 PM (15 days ago)

DESCRIPTION

We're performing some maintenance on the primary router used for our Ethernet (Etherway/Etherflow) services and our customer BGP sessions (eg customers with their own BGP sessions to us) - "A.Weightless". We have moved traffic on to our secondary router for the next 48 hours. This move is seamless and customer traffic is not affected.

Resolution: This has been completed.

Closed: 15 days ago

View full details

Hypervisor disk problem

MINOR Closed Servers

AFFECTED

Servers

STARTED

Apr 15, 01:36 PM (17½ days ago)

CLOSED

Apr 15, 03:07 PM (17¼ days ago)

DESCRIPTION

At around 13:30 one of our servers had a disk problem, and it needs to be rebooted and fixed. This server is a hypervisor that runs some of our core services. As we run many redundant and spare servers, which fail over to other servers when a problem occurs, the customer impact is minimal.

Resolution:

Closed: 17¼ days ago

View full details

Some Data SIM drops

MINOR Closed DATA SIMs

AFFECTED

DATA SIMs

STARTED

Apr 11, 12:15 PM (21½ days ago)

CLOSED

Apr 11, 01:30 PM (21½ days ago)

DESCRIPTION

We've seen some Data SIMs drop and reconnect from 12:15 today. We suspect this was caused by something upstream, probably in the mobile network.

Resolution:

Closed: 21½ days ago

View full details

L2TP Router Replacement

MAINTENANCE Completed L2TP

AFFECTING

L2TP

STARTED

Apr 10, 04:00 AM (22¾ days ago)

CLOSED

Apr 11, 04:10 AM (21¾ days ago)

DESCRIPTION

We will be replacing the hardware of our main L2TP router during the day on Wednesday 10th April. As part of this work we will be moving L2TP customers over to the backup L2TP server shortly after 4AM on 10th April. This will cause customers to drop and reconnect.

Resolution: This has been completed.

Closed: 21¾ days ago

View full details

L2TP Service Drop

MINOR Closed L2TP

AFFECTED

L2TP

STARTED

Apr 09, 02:00 PM (23¼ days ago)

CLOSED

Apr 09, 02:07 PM (23¼ days ago)

DESCRIPTION

At 2pm L2TP customers experienced a drop and reconnect of their service.

Resolution: Hardware replacement underway: https://aastatus.net/42656

Closed: 23¼ days ago

View full details

S.Gormless LNS Reboot

MAINTENANCE Completed LNS

AFFECTING

LNS

STARTED

Apr 09, 04:00 AM (23¾ days ago)

CLOSED

Apr 09, 10:00 AM (23½ days ago)

DESCRIPTION

We've had a few customers on the S.Gormless LNS report slow speeds, and moving them on to different LNSs has helped. As there is no obvious reason for this, we will reboot the LNS at 4AM on Tuesday 9th April. The small number of customers on this LNS will experience a drop and reconnect of their service.

Resolution: This has been completed.

Closed: 23½ days ago

View full details

TalkTalk planned work on one of our interlinks

MAINTENANCE Completed TalkTalk

AFFECTING

TalkTalk

STARTED

Apr 09, 03:00 AM (23¾ days ago)

CLOSED

Apr 11, 07:01 PM (21¼ days ago)

DESCRIPTION

We have multiple interlinks to TalkTalk that carry our broadband traffic. TalkTalk have scheduled planned work on our links in our Equinix LD8 datacentre for 11th April between 1AM and 6AM.

So as to minimise the impact on our customers, we will move traffic off these links on 9th April at 3AM. This should be seamless, but there is a risk of some customers having a brief interruption to their service.

Resolution: This work has been completed with no customer impact.

Closed: 21¼ days ago

View full details

General SMS improvements

MAINTENANCE Assumed Completed SMS

AFFECTING

SMS

STARTED

Apr 08, 02:13 PM (24¼ days ago)

DESCRIPTION

This work has started, but we did not post a planned work notice as we expected it to be seamless. Sadly that was not quite the case today, so this is more detail on what we are planning over the next few weeks. The main thing is, if you see any problems, please tell us right away.

Some cosmetic improvements (nicer format phone numbers) in emailed or tooted SMS (done)
Additional options (such as forcing the email/toots to be E.123 + format numbers) (done)
Additional options for posting JSON to http/https (TODO)
Allowing SMS to be relayed (chargeable) to other numbers (done)
We already allow multiple targets for a number for SMS (done)
Some improvements for 8 bit SMS, which are rare, as we previously treated them as latin1, which is not correct (TODO)
Some new features for trialling a new SIP2SIM platform (TODO)
Improve "visible" format for content in email/toot when special characters are used (e.g. NULL as ␀) (TODO)

The 8 bit data format changes are likely to be the least "backwards compatible" changes, but should not impact anyone as they are not generally encountered. I.e. incoming SMS will rarely (if ever) be 8 bit coded, and when they were, we would get special characters wrong. Similarly, sending 8 bit SMS would only show the expected characters on some older phones, and would be wrong on many others as the specification does not say the character set to use. We will, however, handle NULLs much better, which are relevant for some special use cases.
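As an illustration of the "visible" format idea for special characters, here is a minimal sketch that swaps control characters for their Unicode Control Pictures symbols (so NULL becomes ␀). It is not our actual implementation. The function name and the choice to leave newlines untouched are assumptions made for the example, and it works on text that has already been decoded, so it says nothing about which character set 8 bit SMS should use.

```python
def visible(text):
    """Replace control characters with printable Unicode Control Pictures symbols."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 0x20 and ch != "\n":
            # U+2400..U+241F mirror the C0 controls, e.g. U+2400 is the symbol for NULL.
            out.append(chr(0x2400 + code))
        elif code == 0x7F:
            # U+2421 is the symbol for DELETE.
            out.append("\u2421")
        else:
            out.append(ch)
    return "".join(out)


# Assumption for the example: newlines are left as-is so multi-line messages stay readable.
print(visible("A\x00B"))
```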

Last updated: 24¼ days ago

View full details

Problem with SMS delivery via HTTP

MINOR Closed SMS

AFFECTED

SMS

STARTED

Apr 08, 09:15 AM (24½ days ago)

CLOSED

Apr 08, 11:21 AM (24½ days ago)

DESCRIPTION

SMS delivery via HTTP POST was broken via one of our SMS relays for a while this morning. The symptom was that "da", the destination address, was being posted as the "target" rather than the destination number. This means if we post to your server on https://example.com/sms/, we could have posted the SMS with the destination number of literally "https://example.com/sms/". This would have broken anything depending on the "da" to make decisions on what to do with the message. This is fixed now, and the problem occurred between around 9:15 and 11:21. Apologies for any inconvenience.
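For anyone receiving SMS by HTTP POST, a defensive check on "da" before acting on it would have caught this kind of problem. The following is a minimal sketch only: the "da" field name is from the description above, but the handler shape, the function names and the rule for what counts as a plausible number are assumptions for illustration, not part of our API.

```python
import re

# Assumption: a plausible destination is an optional "+" followed by 3 to 15 digits.
NUMBER_RE = re.compile(r"\+?[0-9]{3,15}")


def is_plausible_da(da):
    """Return True if da looks like a phone number rather than, say, a URL."""
    return isinstance(da, str) and NUMBER_RE.fullmatch(da) is not None


def handle_sms(fields):
    """Hypothetical handler: decide what to do with a posted SMS based on "da"."""
    da = fields.get("da", "")
    if not is_plausible_da(da):
        # Do not route on a destination that cannot be a number; flag it instead.
        return ("rejected", da)
    return ("accepted", da)


print(handle_sms({"da": "https://example.com/sms/"}))
print(handle_sms({"da": "+447700900123"}))
```

Rejecting or logging an implausible "da" rather than routing on it means a malformed post is noticed instead of being silently misdirected.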

Resolution:

Closed: 24½ days ago

View full details

Overnight LNS shuffling

MAINTENANCE Completed LNS

AFFECTING

LNS

STARTED

Apr 08, 01:00 AM (25 days ago)

CLOSED

Apr 08, 08:08 AM (24½ days ago)

DESCRIPTION

We'll be moving customers off the B.Gormless and G.Gormless LNS during the early hours of Monday 8th April. These customers will see their line drop and reconnect from 1AM.

Resolution: This has been completed.

Closed: 24½ days ago

View full details

Intermittent DoH/DoT problem - fixed

MINOR Closed DNS, Email and Web Hosting

AFFECTED

DNS, Email and Web Hosting

STARTED

Apr 03, 10:54 AM (29½ days ago)

CLOSED

Apr 04, 10:54 AM (28½ days ago)

DESCRIPTION

Our DoH/DoT resolvers ( https://support.aa.net.uk/DoH_and_DoT ) were intermittently failing DNS lookups. It seemed to start over the Easter weekend. Our DoT/DoH front ends are DNS aware proxies (dnsdist) to back ends running unbound. dnsdist uses TLS to speak DNS to the back ends. Some of the back ends had failed to reload their TLS certificates after renewal, so although the renewed certificates were valid, unbound was still serving the old certs and they eventually expired. This resulted in broken back ends in the pool, which dnsdist kept trying to bring back into service. The intermittent nature of the failures meant that it wasn't obvious to users, as clients generally retry silently in the background. Of course our monitoring should have caught this! We've fixed the underlying problem which caused unbound not to pick up the renewed certificates, and we've improved monitoring to catch similar problems should they occur in future.
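The general lesson is to monitor the certificate a service is actually serving, not the file on disk. Below is a minimal sketch of the expiry arithmetic such a check might use. It is not our monitoring code: the function names and the 14 day threshold are assumptions. In a live check the notAfter string would come from the TLS session itself (for example the "notAfter" entry returned by Python's ssl.SSLSocket.getpeercert()).

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days until a certificate expires, given its notAfter string.

    not_after uses the format returned by getpeercert(), e.g.
    "Jan  5 09:34:43 2018 GMT". A negative result means already expired.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400.0


def needs_attention(not_after, warn_days=14, now=None):
    """True if the served certificate has expired or expires within warn_days."""
    # Assumption: 14 days is an arbitrary example threshold.
    return days_until_expiry(not_after, now) < warn_days
```

Run against each back end individually rather than through the front-end pool, a check like this would flag a back end still serving an old certificate before it expires.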

Resolution:

Closed: 28½ days ago

View full details

Overnight work - Software Upgrades [on hold]

MAINTENANCE Completed LNS and Routers

AFFECTING

LNS and Routers

STARTED

Mar 23, 03:00 AM (1¼ months ago)

CLOSED

Apr 07, 12:15 PM (25½ days ago)

DESCRIPTION

We will be performing software upgrades on our FB9000 LNSs during the early hours of ~~Saturday 23rd, Sunday 24th and Monday 25th~~ this week. This will cause customer lines to drop and reconnect a couple of times between the hours of 3AM and 4:30AM.

Customers who will be affected by this are those with line speeds of 80Mb/s and above.

The software upgrade being applied does have a plausible fix for the CPU hang that we have been seeing. However, if we see any further CPU hangs we will revert back to the seemingly stable version of the software.

Resolution: We have seen some CPU hangs with the latest software, so will be reverting back to the more stable 'Factory' version.

Closed: 25½ days ago

View full details

Work to help resolve recent LNS problems (Updated 30th April)

MAINTENANCE Assumed Completed Broadband

AFFECTING

Broadband

STARTED

Jan 19, 03:50 PM (3¼ months ago)

DESCRIPTION

This is a summary and update regarding the problems we've been having with our network, causing line drops for some customers, interrupting their Internet connections for a few minutes at a time. It carries on from the earlier, now out of date, post: https://aastatus.net/42577

We are not only an Internet Service Provider.

We also design and build our own routers under the FireBrick brand. This equipment is what we predominantly use in our own network to provide Internet services to customers. These routers are installed between our wholesale carriers (e.g. BT, CityFibre and TalkTalk) and the A&A core IP network. The type of router is called an "LNS", which stands for L2TP Network Server.

FireBricks are also deployed elsewhere in the core; providing our L2TP and Ethernet services, as well as facing the rest of the Internet as BGP routers to multiple Transit feeds, Internet Exchanges and CDNs.

Throughout the entire existence of A&A as an ISP, we have been running various models of FireBrick in our network.

Our newest model is the FB9000. We have been running a mix of prototype, pre-production and production variants of the FB9000 within our network since early 2022.

As can sometimes happen with a new product, at a certain point we started to experience some strange behaviour; essentially the hardware would lock-up and "watchdog" (and reboot) unpredictably.

Compared to a software 'crash' a hardware lock-up is very hard to diagnose, as little information is obtainable when this happens. If the FireBrick software ever crashes, a 'core dump' is posted with specific information about where the software problem happened. This makes it a lot easier to find and fix.

After intensive work by our developers, the cause was identified as (unexpectedly) something to do with the NVMe socket on the motherboard. At design time, we had included an NVMe socket connected to the PCIe pins on the CPU, for undecided possible future uses. We did not populate the NVMe socket, though. The hanging issue completely cleared up once an NVMe was installed, even though it was not used for anything at all.

As a second approach, the software was then modified to force the PCIe to be switched off such that we would not need to install NVMes in all the units.

This certainly did solve the problem in our test rig (which is multiple FB9000s, PCs to generate traffic, switches etc). For several weeks FireBricks which had formerly been hanging often in "artificially worsened" test conditions, literally stopped hanging altogether, becoming extremely stable.

So, we thought the problem was resolved. And, indeed, in our test rig we still have not seen a hang. Not even once, across multiple FB9000s.

However...

We did then start seeing hangs in our Live prototype units in production (causing dropouts to our broadband customers).

At the same time, the FB9000s we have elsewhere in our network, not running as LNS routers, are stable.

We are still working on pinpointing the cause of this, which we think is highly likely to be related to the original (now solved) problem.

Further work...

Over the next 1-2 weeks we will be installing several extra FB9000 LNS routers. We are installing these with additional low-level monitoring capabilities in the form of JTAG connections from the main PCB so that in the event of a hardware lock-up we can directly gather more information.

The enlarged pool of LNSs will also reduce the number of customers affected if there is a lock-up of one LNS.

We obviously do apologise for the blips customers have been seeing. We do take this very seriously, and are not happy when customers are inconvenienced.

We can imagine some customers might also be wondering why we bother to make our own routers, and not just do what almost all other ISPs do, and simply buy them from a major manufacturer. This is a fair question. At times like this, it is a question we ask ourselves!

Ultimately, we do still firmly believe the benefits of having the FireBrick technology under our complete control outweigh the disadvantages. CQM graphs are still almost unique to us, and these would simply not be possible without FireBrick. There have also been numerous individual cases where our direct control over the firmware has enabled us to implement individual improvements and changes that have benefitted one or many customers.

Many times over the years we have been able to diagnose problems with our carrier partners, which they themselves could not see or investigate. This level of monitoring is facilitated by having FireBricks.

But in order to have finished FireBricks, we have to develop them. And development involves testing, and testing can sometimes reveal problems, which then affect customers.

We do not feel we were irrationally premature in introducing prototype FireBricks into our network, having had them under test not routing live customer traffic for an appropriate period beforehand.

But some problems can only reveal themselves once a "real world" level and nature of traffic is being passed. This is unavoidable, and whilst we do try hard to minimise disruption, we still feel the long term benefits of having FireBricks more than offset the short term problems in the late stages of development. We hope our detailed view on this is informative, and even persuasive.

Last updated: 2½ days ago
Update due: 3½ days

View full details