Analysis and suggestions on government "i-Invigilation" and "Check-in Smart" system failures

– Article is translated by Google Translate –

Recently, the Hong Kong Examinations and Assessment Authority’s “i-Invigilation” and “Check-in Smart” systems suffered large-scale failures that drew public concern. Problems can occur in any system, but it is puzzling that a project involving 9 million in public funds, with ample resources and time to prepare, still proved so fragile.

Although our company was not involved in developing this system, our extensive experience in high-traffic system development and third-party stress testing leads us to believe that the failure was most likely caused by insufficient stress testing. The purpose of stress testing is to identify system bottlenecks early. When the system cannot handle the expected traffic, countermeasures such as a virtual queuing system can then be put in place in advance to keep the user experience under control and stop problems from spreading.
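To make the virtual queuing idea concrete, here is a minimal sketch in Python. It is our own illustration, not the design of the systems discussed here: the class name `VirtualQueue`, its methods and the `capacity` parameter are all hypothetical. The backend admits only as many users as stress testing has shown it can serve, and everyone else waits in first-come, first-served order.

```python
from collections import deque


class VirtualQueue:
    """Admit at most `capacity` active users; everyone else waits in FIFO order."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity  # assumed to come from stress-test results
        self.active = set()       # users currently allowed to use the backend
        self.waiting = deque()    # users waiting, oldest first

    def arrive(self, user_id: str) -> int:
        """Register a user. Returns 0 if admitted, else the 1-based queue position."""
        if user_id in self.active:
            return 0
        if user_id in self.waiting:
            return self.waiting.index(user_id) + 1
        if len(self.active) < self.capacity:
            self.active.add(user_id)
            return 0
        self.waiting.append(user_id)
        return len(self.waiting)

    def leave(self, user_id: str) -> None:
        """Release a slot and admit the next waiting user, if any."""
        if user_id in self.active:
            self.active.remove(user_id)
            if self.waiting:
                self.active.add(self.waiting.popleft())
        elif user_id in self.waiting:
            self.waiting.remove(user_id)
```

A client would call `arrive` repeatedly: a result of 0 means it may proceed, and any other value is its current place in line, which can be shown to the user in place of an error page.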

The traffic of “i-Invigilation” and “Check-in Smart” is entirely predictable, so stress test targets are easy to set. Had the targets been set correctly and tested against, a failure on this scale should not have occurred. We therefore speculate that the test targets were set too low, or that the tests did not fully simulate real usage.

In addition, publicly available information shows that the system’s error message displayed database request code on the client. This raises a question: does the client connect directly to the database? If so, the database may have to accept connections from tens of thousands of clients at the same time, which can easily overload it and may also pose security risks. If the connection is not direct, it is abnormal for the backend to return such a message to the client. Why show users data they cannot understand? Was debugging code left in place after development?
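As an illustration of how a backend can avoid exposing such details, the following Python sketch logs the full error on the server and returns only a generic message plus a short reference ID. The function name `to_client_error` and the response fields are our own assumptions, not taken from the actual system.

```python
import logging
import uuid

logger = logging.getLogger("backend")


def to_client_error(exc: Exception) -> dict:
    """Log full details server-side; return only a generic message and a reference ID."""
    reference = uuid.uuid4().hex[:8]  # lets support staff find the matching log entry
    logger.error("request failed [ref=%s]: %r", reference, exc)
    return {
        "error": "The service is temporarily unavailable. Please try again shortly.",
        "reference": reference,
    }
```

The user sees a message they can act on, while the database query and any internal details stay in the server log.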

Since we did not develop the system, we cannot know exactly where the bottleneck lies, and it is difficult to propose precise solutions. However, if our company were designing this system, we would make the following suggestions:

1 Avoid single points of failure: Split the system across multiple backends and assign each examination venue, or a small group of venues, to a specific backend. This avoids concentrating every candidate and invigilator in Hong Kong on the same backend and reduces the risk of a territory-wide failure. Where necessary, background batch operations or asynchronous processing can be used to relieve pressure on the central system.

2 Reduce client requests: Fewer interactions between the client and the backend mean less backend load and a lower probability of failure. For example, the client can be designed to operate independently or offline, communicating with the backend only when necessary; alternatively, only a small number of clients need to connect to the system. In fact, “Safe Travel” adopted a similar approach. The public-facing side does not need to stay connected at all times; a connection is needed only when updating data, or when the restaurant side has to verify check-ins in real time.
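The two suggestions above can be sketched together in Python. This is a rough illustration under our own assumptions, not a description of any existing system: `backend_for`, `Outbox`, `send_batch` and the backend names are hypothetical. The first function pins each venue to one backend; the class buffers records on the device and uploads them in batches, keeping them locally if the upload fails.

```python
import hashlib


def backend_for(venue_id: str, backends: list[str]) -> str:
    """Map a venue to one backend deterministically, so it always uses the same one."""
    digest = hashlib.sha256(venue_id.encode("utf-8")).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


class Outbox:
    """Buffer records on the device and upload them in batches instead of one by one."""

    def __init__(self, send_batch, batch_size: int = 50) -> None:
        self.send_batch = send_batch  # hypothetical callable that uploads a list of records
        self.batch_size = batch_size
        self.pending = []

    def record(self, item) -> None:
        """Store a record locally; upload only once a full batch has accumulated."""
        self.pending.append(item)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> bool:
        """Try to upload everything pending; keep the records if the upload fails."""
        if not self.pending:
            return True
        try:
            self.send_batch(list(self.pending))
        except OSError:
            return False  # offline or backend down: retry on a later flush
        self.pending.clear()
        return True
```

Hashing is only one way to assign venues. A fixed assignment table would work equally well and avoids reshuffling venues when the number of backends changes.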

Although a virtual queuing system alone might not have fully solved the problems in this incident, we have extensive experience in developing high-traffic systems. Even if we cannot take part in the system design, we are willing to assist with stress testing to bring a better experience to the government and the public.
