Federated Learning - Part 2 - Risks and Challenges
Intro
In the first article of our mini-series on Federated Learning (FL) [Privacy-First AI: Exploring Federated Learning] we introduced the basic concepts behind this decentralized training approach and presented potential applications in several domains. Undoubtedly, FL is a framework that holds great promise for enhancing privacy and efficiency in machine learning. However, despite its powerful potential, FL comes with its own set of unique risks and challenges that must be addressed in order to realize its full benefits. In this article, we delve into these challenges and discuss some of the strategies for mitigating the risks that come with adopting FL.
Three ‘Bottlenecks’
The literature distinguishes three bottlenecks that FL currently faces [1]:
- Privacy and security: although FL is designed with privacy in mind, in that it never exposes clients' local data, research shows that sensitive information can still be inferred from the parameters exchanged during the learning process [2, 3]. Information leakage through shared parameters therefore remains a significant concern, as do malicious attacks that target the integrity of training data and models.
- Data and model heterogeneity: data distributions and computational resources can vary widely between clients, which leads to issues such as global model drift or the need to support multiple model architectures.
- Significant communication overhead: as the number of clients participating in FL grows, the communication overhead between nodes starts to significantly exceed the computational overhead of training a model. Improving communication efficiency therefore becomes another major challenge.
Privacy
The main privacy vulnerability in FL lies in the fact that the updates communicated by a client node can leak information about its underlying private data samples. To protect the transmitted model parameters or gradients, FL systems therefore usually employ a set of established privacy-preserving techniques adapted to their specific needs. Some of the most commonly used methods are:
- Secure Multi-Party Computation (SMPC): SMPC is a broad subfield of cryptography first introduced in the 1980s [4]; its general purpose is to enable multiple parties to jointly compute a function over their inputs while keeping those inputs private. In recent years, SMPC has been adapted to federated systems mainly for secure model aggregation [5, 6] (a minimal sketch of the idea follows this list). Through dedicated privacy-preserving protocols, SMPC ensures that individual contributions remain hidden from the central server, which only sees the final aggregate result.
- Differential Privacy (DP): generally speaking, the main idea of DP is to make it impossible to infer an individual's private information from the output of any statistical analysis [7]. The most common way to achieve this is to add random noise that drowns out each individual's contribution to the analysis output while preserving the overall group statistics. In the context of FL, DP is applied in two forms:
- Central Differential Privacy, where noise is added to the globally aggregated parameters at the central server.
- Local Differential Privacy, where noise is added at the client node to the updates that will be sent to the server (see the sketch after this list).
- Homomorphic Encryption (HE): HE allows users to perform mathematical operations directly on the ciphertext and guarantees that the result is the same as if the operations had been applied to the plaintext. In FL, HE enables the central server to carry out algebraic operations directly on encrypted parameters without decryption, so potentially sensitive information stays private throughout the whole computation process [8, 9].
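To make the secure-aggregation idea behind SMPC concrete, here is a minimal NumPy sketch (our own illustration, not the protocol from [5]): every pair of clients agrees on a random mask that one adds and the other subtracts, so the masks cancel in the server-side sum and the server only ever sees masked updates together with the correct aggregate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy local updates (e.g., flattened gradients) from three clients.
updates = {cid: rng.normal(size=4) for cid in ["A", "B", "C"]}
clients = sorted(updates)

# Every client pair (i, j), i < j, shares a random mask: i adds it, j subtracts it.
pair_masks = {(i, j): rng.normal(size=4)
              for a, i in enumerate(clients) for j in clients[a + 1:]}

def masked_update(cid):
    """What client `cid` actually sends to the server."""
    masked = updates[cid].copy()
    for (i, j), mask in pair_masks.items():
        if cid == i:
            masked += mask
        elif cid == j:
            masked -= mask
    return masked

# The server sums the masked updates; all pairwise masks cancel out.
server_sum = sum(masked_update(cid) for cid in clients)
true_sum = sum(updates.values())
print(np.allclose(server_sum, true_sum))  # True: aggregate is exact, individual updates stay masked
```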
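Local Differential Privacy can be sketched just as briefly: each client clips its update to a fixed L2 norm and adds Gaussian noise before transmission. The clipping norm and noise scale below are illustrative placeholders rather than values calibrated to a specific privacy budget.

```python
import numpy as np

rng = np.random.default_rng(42)

def privatize(update, clip_norm=1.0, noise_std=0.5):
    """Clip the update to `clip_norm` in L2, then add Gaussian noise.
    The values here are illustrative; a real deployment derives noise_std
    from a target (epsilon, delta) privacy budget."""
    scale = min(1.0, clip_norm / (np.linalg.norm(update) + 1e-12))
    return update * scale + rng.normal(scale=noise_std, size=update.shape)

local_update = rng.normal(size=4)   # raw (private) client update
print(privatize(local_update))      # noisy update that is actually sent to the server
```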
Security
In addition to the privacy concerns related to parameter leakage, there is a broader category of attacks that tamper with the local training data or the aggregated models. These attacks threaten the integrity and performance of federated learning systems and are usually carried out by faulty or malicious clients. They are known as Byzantine attacks and can be classified into two types:
- Training data attacks, also known as the data poisoning attacks, involve hostile clients deliberately manipulating their local training data to mislead the global model. Possible approaches include:
- Label-flipping, which means changing the labels of the training data to incorrect ones.
- Noise addition, which means degrading the quality of the dataset by introducing noise.
- Backdoor triggers, which embed specific patterns into the data so that the model behaves in a particular way whenever those patterns appear at inference time.
- Parameter attacks, which focus on altering the local model's parameters and thereby corrupt the aggregated central model. The parameters can be modified either directly, for example by replacing a parameter with a randomly sampled value, or indirectly, for example by flipping the signs of the local gradients (both attack flavors are sketched below).
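To make the two attack flavors tangible, the toy sketch below simulates a label-flipping client and a sign-flipping client on synthetic data; the logistic-regression setup is our own illustration and is not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy local dataset and current global weights for one client (binary classification).
X = rng.normal(size=(32, 4))
y = rng.integers(0, 2, size=32)
w = rng.normal(size=4) * 0.1

def logistic_gradient(X, y, w):
    """Gradient of the mean logistic loss; stands in for one local training step."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

honest_update = logistic_gradient(X, y, w)

# Data poisoning via label flipping: the malicious client trains on inverted labels.
poisoned_update = logistic_gradient(X, 1 - y, w)

# Parameter attack via sign flipping: the malicious client negates its honest gradient.
sign_flipped_update = -honest_update

print(honest_update, poisoned_update, sign_flipped_update, sep="\n")
```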
To protect against Byzantine attacks, servers can introduce various defense schemes [10]:
- Distance schemes focus on identifying and rejecting harmful updates by evaluating the distances between them. Updates that deviate significantly from the norm are flagged as malicious. Krum [11] is a well-known example: only the update that is most similar to its nearest neighbors is selected to update the global model.
- Performance schemes evaluate and filter updates based on their contribution to the performance of the global model. The quality of the updates can be assessed on a clean validation dataset, typically using several scoring metrics from which the server estimates the probability that an update is reliable. Examples include [12, 13, 14].
- Statistical schemes derive a robust aggregate and diminish the impact of abnormal updates by exploiting the statistical properties of the updates. For example, the trimmed mean method removes a fixed percentage of the highest and lowest values in each coordinate and averages the remaining ones. To further reduce the impact of extreme values, the median of the updates is also often used instead of, or alongside, the mean [15, 16]. A minimal sketch of both follows this list.
- Target optimization schemes focus on modifying the objective loss function so that it favors a more robust global model. For example, a local model can be forced to stay close to the global model, as proposed by Byzantine-Robust Stochastic Aggregation [17].
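As an illustration of the statistical schemes, here is a minimal sketch of a coordinate-wise trimmed mean and median over a stack of client updates; it conveys the idea behind [15] without reproducing the exact algorithms.

```python
import numpy as np

def trimmed_mean(updates, trim_ratio=0.2):
    """Coordinate-wise trimmed mean: sort each coordinate across clients and drop
    the `trim_ratio` fraction of lowest and highest values before averaging."""
    updates = np.sort(np.asarray(updates), axis=0)   # shape: (num_clients, num_params)
    k = int(trim_ratio * updates.shape[0])
    return updates[k:updates.shape[0] - k].mean(axis=0)

def coordinate_median(updates):
    """Coordinate-wise median, another simple robust aggregate."""
    return np.median(np.asarray(updates), axis=0)

rng = np.random.default_rng(0)
honest = [rng.normal(size=4) for _ in range(8)]
byzantine = [np.full(4, 100.0), np.full(4, 100.0)]   # two extreme malicious updates
all_updates = honest + byzantine

print(np.mean(all_updates, axis=0))        # plain mean: badly skewed by the attackers
print(trimmed_mean(all_updates))           # trimmed mean: outliers are discarded
print(coordinate_median(all_updates))      # median: barely affected
```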
Heterogeneity
Heterogeneity challenges arise naturally from the decentralized nature of FL and the differences between participating clients. Those differences can be found in either data or supported model architectures.
- Data Heterogeneity: in most real-world scenarios, it is practically impossible to guarantee that each local dataset follows the same distribution and contains the same number of data samples. As a result, aggregating the updates computed on those datasets with a simple uniform averaging scheme can introduce biases into the model, hinder its convergence and degrade the precision of its predictions [18]. The standard mitigation is some form of client-specific weighted aggregation that gives more weight to updates derived from higher-quality and more representative data [19, 20] (see the sketch after this list).
- Model Heterogeneity: it is also a mistake to assume that the local model structure is uniform across clients. Clients differ in their computational capabilities and storage capacity, and may also require models customized to their local tasks. This leads to a discrepancy between the locally used model architectures and makes it difficult to transfer knowledge between model-heterogeneous clients in a model-agnostic manner. To overcome this challenge, some form of Knowledge Transfer [21] or Architecture Sharing [22] is usually used.
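The weighting idea can be illustrated with a FedAvg-style weighted average in which each client's update is weighted by its local sample count; in practice the weights could just as well reflect data quality or freshness, but the aggregation formula stays the same. A minimal sketch:

```python
import numpy as np

def weighted_aggregate(updates, num_samples):
    """Weighted average of client updates, weighted by local dataset size."""
    updates = np.asarray(updates, dtype=float)       # shape: (num_clients, num_params)
    weights = np.asarray(num_samples, dtype=float)
    weights /= weights.sum()
    return (weights[:, None] * updates).sum(axis=0)

rng = np.random.default_rng(1)
client_updates = [rng.normal(size=4) for _ in range(3)]
client_sizes = [1000, 200, 50]                       # highly imbalanced local datasets

print(weighted_aggregate(client_updates, client_sizes))
print(np.mean(client_updates, axis=0))               # naive uniform average, for comparison
```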
Communication
FL eliminates the costs related to dataset transfers; however, when numerous client nodes exchange their updates with each other or with the central server, the communication overhead becomes a significant factor. As with the previously discussed challenges, the research community is developing strategies to keep FL systems efficient.
- Optimization algorithms: to reduce the communication overhead, the generally preferred approach is to increase the number of local training iterations and decrease the number of global communication rounds, i.e., to train locally for longer before sending an update to the global server. In addition, researchers seek to improve local training efficiency and optimize the global aggregation algorithms so that federated models converge faster and reach satisfactory performance within a smaller number of global updates [23, 24].
- Client selection: if the number of local nodes is sufficiently large, it is common practice to select only a subset of the clients in each communication round to update the global model. Random client selection is the most straightforward approach but, in most scenarios, not the most efficient one. First, random selection may discard some important updates and degrade model quality. Second, we can be unlucky and repeatedly draw a subset of the slowest participants. Studies therefore focus on client selection algorithms that take both the upload time of the updates and their relevance into account [25, 26].
- Model compression: the most straightforward strategy for reducing communication costs is to reduce the size of the transferred parameters. This is particularly important for deep neural network architectures, which often consist of millions of weights. Because compression often comes at the price of losing some of the original information, significant research effort is devoted to techniques that compress the models without degrading their prediction quality [27, 28] (see the sketch below).
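One simple way to picture compression is top-k sparsification: each client transmits only the k largest-magnitude entries of its update (indices plus values) and the server rebuilds a sparse vector. The sketch below illustrates this general strategy and is not the specific method of [27] or [28].

```python
import numpy as np

def top_k_compress(update, k):
    """Keep only the k largest-magnitude entries; return their indices and values."""
    idx = np.argsort(np.abs(update))[-k:]
    return idx, update[idx]

def top_k_decompress(idx, values, size):
    """Rebuild a dense vector that is zero everywhere except the transmitted entries."""
    dense = np.zeros(size)
    dense[idx] = values
    return dense

rng = np.random.default_rng(7)
update = rng.normal(size=1000)

idx, values = top_k_compress(update, k=50)           # transmit only ~5% of the entries
restored = top_k_decompress(idx, values, update.size)
print(np.linalg.norm(update - restored) / np.linalg.norm(update))  # relative reconstruction error
```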
Summary
Federated Learning presents both opportunities and challenges in the distributed machine learning domain. Key risks include data and model security vulnerabilities, potential biases due to non-representative local data, and high communication costs. The research community is already addressing these challenges by introducing robust encryption protocols as well as efficient aggregation schemes and optimization algorithms. In the closing article of the FL mini-series, we will give you a brief overview of software frameworks that enable smooth adoption of FL principles in your own projects. Stay tuned!
References
- Wen, J., Zhang, Z., Lan, Y., Cui, Z., Cai, J., & Zhang, W. (2023). A survey on federated learning: challenges and applications. International Journal of Machine Learning and Cybernetics, 14(2), 513-535.
- Wang, Z., Song, M., Zhang, Z., Song, Y., Wang, Q., & Qi, H. (2019, April). Beyond inferring class representatives: User-level privacy leakage from federated learning. In IEEE INFOCOM 2019-IEEE conference on computer communications (pp. 2512-2520). IEEE.
- Zhu, L., Liu, Z., & Han, S. (2019). Deep leakage from gradients. Advances in neural information processing systems, 32.
- Yao, A. C. (1982, November). Protocols for secure computations. In 23rd annual symposium on foundations of computer science (sfcs 1982) (pp. 160-164). IEEE.
- Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., Patel, S., ... & Seth, K. (2017, October). Practical secure aggregation for privacy-preserving machine learning. In proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (pp. 1175-1191).
- Xu, Y., Peng, C., Tan, W., Tian, Y., Ma, M., & Niu, K. (2022). Non-interactive verifiable privacy-preserving federated learning. Future Generation Computer Systems, 128, 365-380.
- Wei, K., Li, J., Ding, M., Ma, C., Yang, H. H., Farokhi, F., ... & Poor, H. V. (2020). Federated learning with differential privacy: Algorithms and performance analysis. IEEE transactions on information forensics and security, 15, 3454-3469.
- Fang, H., & Qian, Q. (2021). Privacy preserving machine learning with homomorphic encryption and federated learning. Future Internet, 13(4), 94.
- Zhang, C., Li, S., Xia, J., Wang, W., Yan, F., & Liu, Y. (2020). BatchCrypt: Efficient homomorphic encryption for cross-silo federated learning. In 2020 USENIX annual technical conference (USENIX ATC 20) (pp. 493-506).
- Shi, J., Wan, W., Hu, S., Lu, J., & Zhang, L. Y. (2022, December). Challenges and approaches for mitigating byzantine attacks in federated learning. In 2022 IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom) (pp. 139-146). IEEE.
- Blanchard, P., El Mhamdi, E. M., Guerraoui, R., & Stainer, J. (2017). Machine learning with adversaries: Byzantine tolerant gradient descent. Advances in neural information processing systems, 30.
- Xie, C., Koyejo, S., & Gupta, I. (2019, May). Zeno: Distributed stochastic gradient descent with suspicion-based fault-tolerance. In International Conference on Machine Learning (pp. 6893-6901). PMLR.
- Cao, X., & Lai, L. (2019). Distributed gradient descent algorithm robust to an arbitrary number of byzantine attackers. IEEE Transactions on Signal Processing, 67(22), 5850-5864.
- Cao, X., Fang, M., Liu, J., & Gong, N. Z. (2020). Fltrust: Byzantine-robust federated learning via trust bootstrapping. arXiv preprint arXiv:2012.13995.
- Yin, D., Chen, Y., Kannan, R., & Bartlett, P. (2018, July). Byzantine-robust distributed learning: Towards optimal statistical rates. In International conference on machine learning (pp. 5650-5659). PMLR.
- Guerraoui, R., & Rouault, S. (2018, July). The hidden vulnerability of distributed learning in byzantium. In International Conference on Machine Learning (pp. 3521-3530). PMLR.
- Li, L., Xu, W., Chen, T., Giannakis, G. B., & Ling, Q. (2019, July). RSA: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets. In Proceedings of the AAAI conference on artificial intelligence (Vol. 33, No. 01, pp. 1544-1551).
- Zhang, J., Cheng, X., Wang, C., Wang, Y., Shi, Z., Jin, J., ... & Zhang, T. (2022). FedAda: Fast-convergent adaptive federated learning in heterogeneous mobile edge computing environment. World Wide Web, 25(5), 1971-1998.
- Wang, C., Yang, Y., & Zhou, P. (2020). Towards efficient scheduling of federated mobile devices under computational and statistical heterogeneity. IEEE Transactions on Parallel and Distributed Systems, 32(2), 394-410.
- Taïk, A., Mlika, Z., & Cherkaoui, S. (2021). Data-aware device scheduling for federated edge learning. IEEE Transactions on Cognitive Communications and Networking, 8(1), 408-421.
- Li, D., & Wang, J. (2019). Fedmd: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581.
- Arivazhagan, M. G., Aggarwal, V., Singh, A. K., & Choudhary, S. (2019). Federated learning with personalization layers. arXiv preprint arXiv:1912.00818.
- McMahan, B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017, April). Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics (pp. 1273-1282). PMLR.
- Liu, W., Chen, L., Chen, Y., & Zhang, W. (2020). Accelerating federated learning via momentum gradient descent. IEEE Transactions on Parallel and Distributed Systems, 31(8), 1754-1766.
- Liu, S., Yu, G., Yin, R., Yuan, J., Shen, L., & Liu, C. (2021). Joint model pruning and device selection for communication-efficient federated edge learning. IEEE Transactions on Communications, 70(1), 231-244.
- Du, C., Xiao, J., & Guo, W. (2022). Bandwidth constrained client selection and scheduling for federated learning over SD‐WAN. IET Communications, 16(2), 187-194.
- Zhu, H., & Jin, Y. (2019). Multi-objective evolutionary federated learning. IEEE transactions on neural networks and learning systems, 31(4), 1310-1322.
- Li, C., Li, G., & Varshney, P. K. (2021). Communication-efficient federated learning based on compressed sensing. IEEE Internet of Things Journal, 8(20), 15531-15541.