DBSCAN clustering of NYC restaurant data
How concentrated or diffuse is each cuisine across NYC? Using the same DBSCAN parameters as the map, this table shows what share of each cuisine's restaurants fall into clusters versus remaining unclustered. The diffusion score uses normalized Shannon entropy — lower values indicate a cuisine concentrated in a few hotspots (e.g., Russian in Southern Brooklyn), while higher values indicate even spread across many clusters and noise (e.g., Pizza places everywhere).
| Cuisine | # Restaurants | % Clustered | # Clusters | Largest Cluster % | Diffusion Score |
|---|---|---|---|---|---|
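As a rough illustration of how a normalized Shannon entropy score of this kind could be computed, the sketch below takes the DBSCAN labels for one cuisine and returns a value between 0 and 1. It is not necessarily how the table's score is calculated. The function name `diffusion_score`, the use of -1 as the noise label, the choice to count every unclustered restaurant as its own group of one, and the normalization by log(n) are all assumptions made for this example.

```python
import math
from collections import Counter

NOISE = -1  # assumed label for unclustered restaurants


def diffusion_score(labels):
    """Normalized Shannon entropy of one cuisine's cluster assignments.

    Assumption (not from the source): each noise point is treated as its own
    group of size 1, and the entropy is divided by log(n), the maximum
    possible when every restaurant stands alone.
    """
    n = len(labels)
    if n <= 1:
        return 0.0
    cluster_sizes = Counter(label for label in labels if label != NOISE)
    n_noise = sum(1 for label in labels if label == NOISE)
    group_sizes = list(cluster_sizes.values()) + [1] * n_noise
    entropy = -sum((s / n) * math.log(s / n) for s in group_sizes)
    return max(0.0, entropy / math.log(n))


# One big hotspot scores lowest; fully scattered restaurants score highest.
concentrated = diffusion_score([0] * 10)
scattered = diffusion_score([NOISE] * 10)
```

Under these assumptions a cuisine whose restaurants all sit in a single cluster scores 0, and a cuisine with no clusters at all scores 1.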
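DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, is an algorithm that uses a density concept based on three types of points (restaurants in this case):
Core points: have at least M neighbors within distance Eps
Border points: have fewer than M neighbors but are within Eps of a core point
Noise points: neither core nor border, so they belong to no cluster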
Where M is the minPts parameter. This allows for arbitrarily shaped clusters that can ignore noise and outliers. The number of clusters is determined by the data rather than set in advance, and can be as low as zero.
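The three point types in DBSCAN are usually called core, border, and noise. The sketch below shows one way to sort points into them. It assumes plain 2D coordinates with Euclidean distance and counts only other points as neighbors; some libraries (scikit-learn's `min_samples`, for example) count the point itself as well. The function name `classify_points` is made up for this example.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'.

    Assumption: a neighbor is any OTHER point within distance eps.
    """
    n = len(points)
    neighborhoods = [
        [j for j in range(n) if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Core points have at least min_pts neighbors.
    core = {i for i, nb in enumerate(neighborhoods) if len(nb) >= min_pts}
    kinds = []
    for i, nb in enumerate(neighborhoods):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in nb):
            kinds.append("border")  # not dense itself, but next to a core point
        else:
            kinds.append("noise")
    return kinds


points = [(0, 0), (1, 0), (0, 1), (2, 0), (10, 10)]
kinds = classify_points(points, eps=1.1, min_pts=2)
# -> ['core', 'core', 'border', 'border', 'noise']
```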
From DOHMH via NYC Open Data:
This field describes the entity (restaurant) cuisine;
Optional field provided by the restaurant owner/manager
An overwhelming majority of entries in the DOHMH database contain this
information, but there may be inaccuracies due to the inconsistent nature
of label assignment.
Eps: How close to be considered neighbors
The DBSCAN clustering algorithm uses the idea of neighbors to assign clusters.
The epsilon parameter (Eps) determines which nodes (restaurants) are
considered "neighbors" of each other. A larger value for Eps means restaurants that are
farther apart will still count as neighbors. A smaller value means a group of restaurants has to be more
"packed in" to be considered a hotspot.
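To make the neighbor test concrete, here is a small sketch that finds the Eps-neighbors of one restaurant from latitude/longitude pairs. The source does not say which distance measure is used, so the haversine (great-circle) distance in meters, the helper names `haversine_m` and `eps_neighbors`, and the sample coordinates are all assumptions for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def eps_neighbors(restaurants, index, eps_m):
    """Indices of all other restaurants within eps_m meters of restaurants[index]."""
    origin = restaurants[index]
    return [
        j
        for j, other in enumerate(restaurants)
        if j != index and haversine_m(origin, other) <= eps_m
    ]


# Made-up coordinates: the second point is roughly 110 m north of the first,
# the third roughly 1.1 km north.
restaurants = [(40.7500, -73.9900), (40.7510, -73.9900), (40.7600, -73.9900)]
tight = eps_neighbors(restaurants, 0, eps_m=200)   # only the nearby point
loose = eps_neighbors(restaurants, 0, eps_m=2000)  # both other points
```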
As the boroughs differ significantly in density, each has its own Eps value:
a distance that seems close in Staten Island could span a few neighborhoods in
Manhattan, and this is accounted for. Because of this, the slider is an Eps multiplier,
preserving the borough-based scaling.
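The multiplier idea can be sketched in a few lines. The base Eps values below are placeholders invented for this example, not the values the map uses, and the names `BASE_EPS_METERS` and `effective_eps` are hypothetical.

```python
# Placeholder per-borough base Eps values in meters (NOT the real settings).
BASE_EPS_METERS = {
    "Manhattan": 100.0,
    "Brooklyn": 200.0,
    "Queens": 250.0,
    "Bronx": 250.0,
    "Staten Island": 400.0,
}


def effective_eps(borough, multiplier=1.0):
    """Scale a borough's base Eps by the slider multiplier."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return BASE_EPS_METERS[borough] * multiplier


# Doubling the slider doubles every borough's Eps, so the ratio between
# boroughs stays the same.
manhattan = effective_eps("Manhattan", multiplier=2.0)
staten_island = effective_eps("Staten Island", multiplier=2.0)
```

Because every borough is scaled by the same factor, a dense borough keeps a tighter neighborhood than a sparse one at any slider position.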
minPts: How many neighbors are needed to start a hotspot
In the DBSCAN algorithm, clusters are created when nodes (restaurants) have enough neighbors.
The core nodes around which a cluster (hotspot) grows must have a certain number of neighbors, defined by
the minPts parameter. Only a restaurant with at least this many neighbors can start a cluster,
and it will bring each of its neighbors in with it. A restaurant without this many nearby connections
may still be in a cluster if it is brought in by one of its own (more connected) neighbors.
This is a bit different from just the minimum number of points in a cluster - the minPts parameter instead
determines the number of points needed to start a cluster.
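The way a cluster starts from a well-connected restaurant and pulls in its neighbors can be sketched with a minimal DBSCAN. This is a teaching sketch, not the code behind the map. It assumes 2D points with Euclidean distance, counts only other points as neighbors, and uses -1 for noise; the function name `dbscan` is chosen for this example.

```python
import math
from collections import deque

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point (NOISE for unclustered points)."""
    n = len(points)
    neighborhoods = [
        [j for j in range(n) if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighborhoods[i]) < min_pts:
            labels[i] = NOISE  # may be claimed later by a neighboring core point
            continue
        # Point i has enough neighbors to start a new cluster.
        labels[i] = cluster_id
        queue = deque(neighborhoods[i])
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # brought in as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighborhoods[j]) >= min_pts:
                queue.extend(neighborhoods[j])  # j is core too, keep growing
        cluster_id += 1
    return labels


# The first point has only one neighbor, so it cannot start a cluster,
# but it is pulled into the cluster started by the second point.
points = [(0, 1), (0, 0), (1, 0), (2, 0), (10, 10)]
labels = dbscan(points, eps=1.1, min_pts=2)
# -> [0, 0, 0, 0, -1]
```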
Because the cuisines represented in the data have vastly different numbers of restaurants
(compare nearly 5000 American restaurants to the dozen or so types with fewer than 10), setting
one universal minPts value doesn't make sense. The default minPts value is 1.5% of the total for
each label, or 4, whichever is higher. The slider is a multiplier on this value - a
higher value means more points are needed to start a cluster.
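The default rule above works out as follows. The rounding behavior is not stated in the text, so rounding up to a whole number is an assumption here, and the function name `default_min_pts` is made up for this example.

```python
import math


def default_min_pts(n_restaurants, multiplier=1.0):
    """minPts for one cuisine: max(4, 1.5% of its restaurants), times the slider.

    Assumption: fractional values are rounded up to the next whole number.
    """
    # Integer arithmetic for ceil(0.015 * n) to avoid floating-point surprises.
    one_and_a_half_percent = (15 * n_restaurants + 999) // 1000
    base = max(4, one_and_a_half_percent)
    return max(1, math.ceil(base * multiplier))


large = default_min_pts(5000)          # 1.5% of 5000 is 75
small = default_min_pts(8)             # the floor of 4 applies
doubled = default_min_pts(5000, 2.0)   # slider at 2x
```

With these assumptions a cuisine with 5000 restaurants gets a default of 75, while one with only 8 falls back to the floor of 4.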