Libraries and security – SE metrics (Software Engineering)

https://arxiv.org/pdf/2309.11021.pdf

I often use python because of the large ecosystem of libraries. Thanks to these libraries, I do not have to focus on the details of the implementation, but I can focus on the task at hand. However, not all libraries are good, and therefore this paper captured my attention. The study aims to understand the characteristics and lifecycle of malicious code in PyPI by building an automated data collection framework and analyzing a dataset of malicious package files.

Key findings and contributions of the paper include:

Empirical Analysis: The authors conducted an empirical study to understand the characteristics and lifecycle of malicious code in the PyPI ecosystem.
Automated Data Collection: They built an automated data collection framework to gather a high-quality dataset of malicious code from PyPI mirrors and other sources.
Dataset Construction: The dataset includes 4,669 malicious package files, making it one of the largest publicly available datasets of PyPI malicious packages.
Classification Framework: An automated classification framework was developed to categorize the collected malicious code into different types based on their behavior characteristics.
Malicious Behavior: The research found that over 50% of the malicious code exhibits multiple malicious behaviors, with information stealing and command execution being particularly prevalent.
Novel Attack Vectors and Anti-Detection Techniques: The study observed several novel attack vectors and anti-detection techniques used by malicious code.
Impact on End-User Projects: It was revealed that 74.81% of all malicious packages successfully entered end-user projects through source code installation, increasing security risks.
Persistence in Mirror Servers: Many reported malicious packages persist in PyPI mirror servers globally, with over 72% remaining for an extended period after being discovered.
Lifecycle Portrait: The paper sketches a portrait of the malicious code lifecycle in the PyPI ecosystem, reflecting the characteristics of malicious code at different stages.
Suggested Mitigations: The authors present some suggested mitigations to improve the security of the Python open-source ecosystem.

The study is significant as it provides a systematic understanding of the propagation patterns, influencing factors, and potential hazards of malicious code in the PyPI ecosystem. It also offers a foundation for developing more efficient detection methods and improving the security practices within the software supply chain.

Author: Miroslaw Staron

I’m professor in Software Engineering at Computer Science and Engineering. I usually blog about interesting articles (for me) and my own reflections on the development of Software Engineering, AI, computer science and automotive software. View all posts by Miroslaw Staron