Improving forecasting by learning quantile functions
The quantile function is a mathematical function that takes a percentage of a distribution as input and outputs the value of a variable. It can answer questions like, “If I want to guarantee that 95% of my customers receive their orders within 24 hours, how much inventory do I need to keep on hand?” In practical cases, however, we rarely have a tidy formula for computing it.
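When no closed-form quantile function is available, one common fallback is to estimate it from observed samples. The sketch below is only an illustration of that idea, not the method from the article: the function name `empirical_quantile` and the demand figures are made up for the example.

```python
import math


def empirical_quantile(samples, p):
    """Return the smallest observed value v such that at least a
    fraction p of the samples are less than or equal to v."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    ordered = sorted(samples)
    # Position of the order statistic that covers a fraction p of the data.
    # The small tolerance guards against floating-point noise in p * n.
    k = max(0, math.ceil(p * len(ordered) - 1e-9) - 1)
    return ordered[k]


# Hypothetical daily demand figures, invented for illustration only.
daily_demand = [12, 15, 9, 20, 18, 11, 14, 25, 16, 13]

# Stock level that would have covered demand on 95% of the observed days.
stock_level = empirical_quantile(daily_demand, 0.95)
median_demand = empirical_quantile(daily_demand, 0.5)
```

With only ten observations the 95% estimate is simply the largest value seen, which is one reason learned quantile functions are attractive when data is sparse.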
Answering Questions with HuggingFace Pipelines and Streamlit
HuggingFace’s Transformers library is full of SOTA NLP models which can be used out of the box as-is, as well as fine-tuned for specific uses and high performance. Both Streamlit and the Transformers library’s pipelines can help make data science projects much easier to implement. This article will show just how easy this can be, and how few lines of code are needed to achieve something interesting.
Inside Meta’s AI optimization platform for engineers across the company
AI is an important part of making modern software systems and products work as well as possible. To leverage AI more effectively in our products, we need to address several challenges. To address these challenges, we’ve built an end-to-end AI platform called Looper. Looper supports the full machine learning lifecycle from model training, deployment, and inference to evaluation and tuning of products.
Summarization with GPT-3
In this article, we look at the impressive power of OpenAI’s GPT-3 engines by looking at an example of summarizing complex text. This article is an excerpt from the book Transformers for Natural Language Processing, Second Edition. This edition includes more use cases, such as casual language analysis and computer vision tasks, and an introduction to OpenAI’s Codex.
The 8 Basic Statistics Concepts for Data Science
Statistics is a form of mathematical analysis that uses quantified models and representations for a given set of experimental data or real-life studies. Descriptive Analytics tells us what happened in the past and helps a business understand how it is performing by providing context to help stakeholders interpret information. P(A|B) is a measure of the probability of one event occurring with some relationship to one or more other events.
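The conditional probability P(A|B) mentioned above can be made concrete by counting outcomes. The following is a minimal sketch under an assumed setup (two fair six-sided dice); the events and helper names are chosen for illustration and do not come from the article.

```python
from fractions import Fraction

# Sample space: every outcome of rolling two fair six-sided dice.
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]


def probability(event, given=lambda outcome: True):
    """P(event | given), computed by counting equally likely outcomes."""
    conditioned = [o for o in outcomes if given(o)]
    if not conditioned:
        raise ValueError("conditioning event has probability zero")
    favorable = [o for o in conditioned if event(o)]
    return Fraction(len(favorable), len(conditioned))


def total_is_eight(outcome):
    return outcome[0] + outcome[1] == 8


def first_is_six(outcome):
    return outcome[0] == 6


p_a = probability(total_is_eight)                      # P(A)
p_a_given_b = probability(total_is_eight, first_is_six)  # P(A|B)
```

Here knowing that the first die shows six changes the probability of a total of eight from 5/36 to 1/6.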
How AWS uses graph neural networks to meet customer needs
At Amazon Web Services, the use of machine learning (ML) to make the information encoded in graphs more useful to our customers has been a major research focus. In this post, we’ll showcase a variety of graph ML applications that customers have developed in collaboration with AWS scientists. Nodes are often associated with data features, such as a product’s price or text description.
Modeling urban heat wave risk with satellite imagery and lidar data
A method for modeling environmental risk with GIS, statistical techniques and open python libraries
Recently the World Resources Institute office in Brazil tasked me with a consultancy under the Cities4Forests project on modeling the risk for several hazards linked to climate change, such as floods, landslides and heat waves, at urban scale for the municipality of Campinas, Brazil’s most populous city outside of a metropolitan region, with just over 1.2M…
Guide to Iteratively Tuning GNNs
This blog walks through a process for experimenting with hyperparameters, training algorithms and other parameters of Graph Neural Networks. In this post, we share the first two phases of our experiment chain. We tuned two popular GNN variants to minimize training cost (time and number of epochs) for future reference. We then designed and executed an iterative experimentation approach for hyperparameter tuning in which we seek a quality model that takes minimal time to train.
Discovering the systematic errors made by machine learning models
Machine learning models that achieve high overall accuracy often make systematic errors on coherent slices of validation data. A slice is a set of data samples that share a common characteristic. A model underperforms on a slice if performance on the data samples in the slice is significantly worse than its overall performance. The search for underperforming slices is a critical, but often overlooked, part of model evaluation.
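The definition of an underperforming slice can be sketched in a few lines. This example assumes the slices are already defined by a metadata column and uses a fixed accuracy margin as a stand-in for "significantly worse"; both are simplifications, and the record fields (`label`, `prediction`, `lighting`) are hypothetical. It does not show how slices are discovered automatically.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Group records by slice_key and return {slice_value: (accuracy, count)}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        value = record[slice_key]
        total[value] += 1
        correct[value] += int(record["label"] == record["prediction"])
    return {value: (correct[value] / total[value], total[value]) for value in total}


def underperforming_slices(records, slice_key, margin=0.1):
    """Return slice values whose accuracy is more than `margin` below overall accuracy."""
    overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
    per_slice = accuracy_by_slice(records, slice_key)
    return sorted(v for v, (acc, _) in per_slice.items() if acc < overall - margin)


# Hypothetical validation records, invented for illustration only.
records = (
    [{"label": 1, "prediction": 1, "lighting": "day"} for _ in range(6)]
    + [{"label": 1, "prediction": 1, "lighting": "night"}]
    + [{"label": 1, "prediction": 0, "lighting": "night"} for _ in range(3)]
)

weak = underperforming_slices(records, "lighting")
```

Overall accuracy here is 70%, which hides the fact that the model is right on every daytime sample and on only one in four nighttime samples.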
Documenting Python code with Sphinx
Sphinx-quickstart is an interactive tool that asks some questions about your project and then generates a complete documentation directory along with a file which will be used later to generate HTML. Major Python libraries like Django, NumPy, SciPy, and many more are documented using Sphinx. Sphinx takes in your .rst files and converts them to HTML, and all that is done using a handful of commands.
Apache Spark for Data Science — How to Work with Spark RDDs
Spark is based on Resilient Distributed Datasets (RDD) — Make sure you know how to use them
RDDs, or Resilient Distributed Datasets, are core objects in Apache Spark. They are a primary abstraction Spark uses for fast and efficient MapReduce operations. As the name suggests, these datasets are resilient…
An introduction to the generalized linear model (GLM)
What it is and how the model is fitted & Application to housing prices prediction
In the classical linear model, normality is usually required. As Figure 0.1 shows, with the random variable X fixed, the distribution of Y is normal (illustrated by each small bell curve), and the regression curve goes across the mean…
Exploring SageMaker Canvas
Building Machine Learning models takes knowledge, experience, and a lot of time. Sometimes different personas, such as Business Analysts or other technologists who do not have experience with ML, might have an ML use case that they want to address but lack the expertise to do so. Even ML engineers and Data Scientists who have ML experience may want a model built quickly.
Multinomial Naïve Bayes’ For Documents Classification and Natural Language Processing (NLP)
Naïve Bayes is a probabilistic approach for constructing data classification models. It treats probability as the “likelihood” that data belongs to a specific class. The multinomial naïve Bayes algorithm is widely used for assigning documents to classes based on statistical analysis of their contents. It provides an alternative to the “heavy” AI-based semantic analysis and drastically simplifies textual data classification.
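A compact sketch of multinomial naïve Bayes for document classification follows. It is a generic textbook formulation with add-one (Laplace) smoothing, not the article's implementation; tokenization by whitespace, the function names, and the four training documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(documents, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log likelihoods."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(documents, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocabulary.update(tokens)

    model = {}
    for label, n_docs_in_class in class_doc_counts.items():
        total_words = sum(word_counts[label].values())
        # Additive smoothing keeps unseen (class, word) pairs from zeroing the score.
        denominator = total_words + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs_in_class / len(documents)),
            "log_likelihood": {
                word: math.log((word_counts[label][word] + alpha) / denominator)
                for word in vocabulary
            },
        }
    return model


def classify(model, text):
    """Return the class with the highest log posterior score.
    Words that never appeared in training are ignored."""
    tokens = text.lower().split()
    scores = {}
    for label, params in model.items():
        likelihood = params["log_likelihood"]
        scores[label] = params["log_prior"] + sum(
            likelihood[t] for t in tokens if t in likelihood
        )
    return max(scores, key=scores.get)


# Hypothetical training documents, invented for illustration only.
docs = [
    "cheap pills buy now",
    "buy cheap watches",
    "meeting agenda for monday",
    "project meeting notes",
]
doc_labels = ["spam", "spam", "ham", "ham"]

nb_model = train_multinomial_nb(docs, doc_labels)
```

Calling `classify(nb_model, "cheap pills")` scores the text against each class using only word counts, which is what makes the approach lightweight compared with semantic models.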
The Complete Collection Of Data Repositories – Part 2
Most of the data sources listed below are free and open to the public. However, some are not. The collection of data repositories is divided into 2 parts, which consist of 20 categories based on various fields of science. The Healthcare category consists of patients and hospital records. You can find data on air quality, viruses, diseases, mortality statistics, and vaccination progress.
Understanding the Difference between Loss Functions and Metrics in Machine Learning/Deep Learning
There’s indeed a difference between loss functions and metrics in the field of Machine Learning, although the two terms are often used interchangeably. The primary goal of this article is to shed some light on the concepts and how they apply differently in building a Machine Learning model. An Evaluation Metric, also known as a “Criterion”, is a method of evaluating/comparing the performance of a learning function. A loss function is more like an “error” function that calculates how far the output/predicted value of a learning function deviates/differs from the ground truth/actual value.
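One way to see the distinction is to compute both on the same predictions. The sketch below pairs binary cross-entropy (a typical loss) with accuracy (a typical metric); this pairing and the example probabilities are assumptions chosen for illustration, not taken from the article.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average negative log likelihood: a loss that measures how far
    predicted probabilities are from the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of thresholded predictions that match the labels: a metric
    used to judge the model, not to drive gradient updates."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


y_true = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]    # hypothetical model A
hesitant = [0.6, 0.4, 0.55, 0.45]   # hypothetical model B

loss_a = binary_cross_entropy(y_true, confident)
loss_b = binary_cross_entropy(y_true, hesitant)
acc_a = accuracy(y_true, confident)
acc_b = accuracy(y_true, hesitant)
```

Both hypothetical models reach the same accuracy, yet the loss still separates them because it is sensitive to how confident each prediction is.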
Understanding and mitigating dimensional collapse
Self-supervised learning has revolutionized AI training in several domains, including vision, language, and speech. But it sometimes triggers dimensional collapse, in which a model fails to take advantage of its full capacity to encode information. We have developed DirectCLR, a training method that overcomes this problem by optimizing a model’s ability to create rich representations of knowledge.
Pythagorean Expectation In Sports Analytics With Examples From Different Sports
Pythagorean Expectation is used in sports such as baseball, basketball, football, and hockey to support data-driven analytics and predictive modeling.
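For reference, the classic form of the Pythagorean Expectation estimates a team's winning fraction as points scored raised to an exponent, divided by the sum of points scored and points allowed each raised to that exponent. The sketch below uses the exponent of 2 from the original baseball formula as a default; other sports typically use different exponents, and the season totals shown are hypothetical.

```python
def pythagorean_expectation(points_for, points_against, exponent=2.0):
    """Estimated winning fraction: PF^x / (PF^x + PA^x)."""
    if points_for < 0 or points_against < 0:
        raise ValueError("point totals must be non-negative")
    if points_for == 0 and points_against == 0:
        raise ValueError("at least one total must be positive")
    scored = points_for ** exponent
    allowed = points_against ** exponent
    return scored / (scored + allowed)


# Hypothetical season totals, invented for illustration only.
runs_scored, runs_allowed = 800, 700
expected_win_fraction = pythagorean_expectation(runs_scored, runs_allowed)
expected_wins = expected_win_fraction * 162  # assuming a 162-game season
```

Comparing `expected_wins` with a team's actual win total is a common way to judge whether its record overstates or understates its underlying performance.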
An NLP Movie History
Plot plots show how the language used to describe movies changes depending on the year when the movie was produced. The data and the code that I have used to generate this plot and the ones below are available in this Jupyter Notebook. Can data science help us make sense of these sudden changes in cinematic taste and perhaps predict what kinds of movies might soon be coming to, or departing, a screen near us?
Nearest Neighbors for Classification
K-nearest neighbors (KNN) is a supervised machine learning algorithm used for both regression and classification tasks. KNN makes predictions on the test data set based on the characteristics of the current training data points. It assigns a new data point to the group of existing data points that lie closest to it, that is, those that share similar characteristics or features. K is a positive integer and is typically small in value.
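The classification case can be sketched as a majority vote among the K closest training points. This is a minimal illustration assuming Euclidean distance and a tiny hand-made data set; the function name and the points are hypothetical, and a production system would normally rely on an established library.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    if not 1 <= k <= len(train_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-cluster training data, invented for illustration only.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]

near_first_cluster = knn_predict(points, labels, (1.5, 1.5), k=3)
near_second_cluster = knn_predict(points, labels, (6.5, 6.5), k=3)
```

Choosing an odd K for two-class problems is a common way to avoid tied votes.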