5 Minutes of Data Science - week 3

Highlights from January 16 to January 22


There were a couple of big news items last week: Google launched a free Deep Learning Tuning Book, and a how-to on building GPT-2 from scratch came out.

Come say hi on Mastodon. See you next week!

Reddit’s top posts

  • 300,000+ Tech jobs have been vanished in the last 12 months. (Sad but true fact), at r/Data Science (💬186)
  • Thoughts?, at r/Data Science (💬81)
  • I asked ChatGPT to explain ROC AUC, the level of collaboration is beyond my expectation, at r/Data Science (💬81)
  • OpenAI Used Kenyan Workers on Less Than $2 Per Hour to Make ChatGPT Less Toxic, at r/Machine Learning (💬246)
  • Getty Images is suing the creators of AI art tool Stable Diffusion for scraping its content, at r/Machine Learning (💬270)
  • [ICLR'2023 Spotlight🌟]: The first BERT-style pretraining on CNNs!, at r/Machine Learning (💬20)
  • How do you lie about numbers?, at r/Ask Statistics (💬27)
  • [QUESTION] GLM, at r/Ask Statistics (💬3)
  • What to do if nothing in class makes sense?, at r/Ask Statistics (💬15)
  • This AI can clone your voice! VALL-E (explained), at r/Latest in ML (💬0)

GitHub repositories

  • AI-For-Beginners: 12 Weeks, 24 Lessons, AI for All!
  • nn-zero-to-hero: Neural Networks: Zero to Hero
  • diff-svc: Singing Voice Conversion via diffusion model
  • micrograd: A tiny scalar-valued autograd engine and a neural net library on top of it with PyTorch-like API
  • VToonify: [SIGGRAPH Asia 2022] VToonify: Controllable High-Resolution Portrait Video Style Transfer
  • YOLOv6: YOLOv6: a single-stage object detection framework dedicated to industrial applications.
  • whisper: Robust Speech Recognition via Large-Scale Weak Supervision
  • ML-For-Beginners: 12 weeks, 26 lessons, 52 quizzes, classic Machine Learning for all
  • FinRL: FinRL: Financial Reinforcement Learning.🔥
  • PythonDataScienceHandbook: Python Data Science Handbook: full text in Jupyter Notebooks
  • chatgpt-comparison-detection: Human ChatGPT Comparison Corpus (HC3), Detectors, and more!🔥
  • mlbookcamp-code: The code from the Machine Learning Bookcamp book and a free course based on the book
  • amazon-sagemaker-examples: Example📓Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using🧠Amazon SageMaker.
  • mlops-zoomcamp: Free MLOps course from DataTalks.Club
  • nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs.
  • balance: The balance python package offers a simple workflow and methods for dealing with biased data samples when looking to infer from them to some target population of interest.
  • blog: Public repo for HF blog posts
  • DeepLearningSystem: Deep Learning System core principles introduction.
  • ColabFold: Making Protein folding accessible to all!
  • Python-and-Machine-Learning: 6th Feb 2021
  • langchain: ⚡Building applications with LLMs through composability⚡
  • gpt_index: An index created by GPT to organize external information and answer queries!
  • CodeFormer: [NeurIPS 2022] Towards Robust Blind Face Restoration with Codebook Lookup Transformer
  • Gymnasium: A standard API for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym)
  • mage-ai: 🧙The modern replacement for Airflow.
  • dvc: 🦉Data Version Control | Git for Data & Models | ML Experiments Management
  • ultralytics: YOLOv8🚀in PyTorch > ONNX > CoreML > TFLite
  • SHARK: SHARK - High Performance Machine Learning for CPUs, GPUs, Accelerators and Heterogeneous Clusters
  • mmyolo: OpenMMLab YOLO series toolbox and benchmark. Implemented RTMDet, YOLOv5, YOLOv6, YOLOv7, YOLOv8,YOLOX, PPYOLOE, etc.
  • tiktoken: (no description)
  • azure-cli: Azure Command-Line Interface
  • CodeGen: CodeGen is an open-source model for program synthesis. Trained on TPU-v4. Competitive with OpenAI Codex.
  • black: The uncompromising Python code formatter

Podcasts

  • Do Results Generalize for Privacy and Security Surveys, by Data Skeptic
  • Machine learning at small organizations, by Practical AI
  • Indie Hacking - Pauline Clavelloux, by Data Talks

Blog posts

  • Using large language models (LLMs) to synthesize training data, by Amazon Science
  • Domain data trumps teacher knowledge for distilling NLU models, by Amazon Science