DBSCAN algorithm explanation

Posted: 2022-07-30 00:30:31

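I'm using the DBSCAN clustering algorithm for an NLP task, to see whether there are any clusters among the words.

The dataset I have is in the following format: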

dbscan_df1 =
0.218139  -0.187614  0.369169  -0.092737  -0.557485   0.081599  -0.128570   0.176931   0.017034   0.037340  ...   0.144367  -0.016442  -0.035220  -0.054845  -0.066294   0.478536  -0.009905  0.020157  'to'
0.337206  -0.198735  0.111028  -0.260347  -0.398793  -0.154748   0.000725  -0.066050   0.053052   0.049676  ...   0.472701   0.218401  -0.048758  -0.178615   0.202199   0.206250   0.120965  0.644875  'antigens'
...
0.361761  -0.415790  0.122305   0.254493   0.088273   0.322763   0.041017  -0.041002  -0.169154  -0.420434  ...  -0.131940  -0.104786  -0.072195  -0.322311   0.065886  -0.074053  -0.090491  0.276367  'sailboat'

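So there are 65 columns, of which the first 64 are the vector representation of each unique word and the last one holds the word itself. I then do the following to prepare the dataset for the DBSCAN algorithm: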


# Select all columns except the 'Unique Words' column
dbscan_df = dbscan_df1.loc[:, dbscan_df1.columns != 'Unique Words']

Finally, I apply the DBSCAN algorithm, following this tutorial to determine its parameters: https://www.reneshbedre.com/blog/dbscan-python.html

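from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Compute the parameters of the DBSCAN method
# For a multidimensional dataset, minPts should be 2 * number of dimensions (here 2 * 64 = 128)
# To determine the optimal ε parameter, I compute the k-NN distances of the input dataset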

# n_neighbors = 129 as kneighbors function returns distance of point to itself (i.e. first column will be zeros) 
nbrs = NearestNeighbors(n_neighbors=129).fit(dbscan_df)

# Find the k-neighbors of a point
neigh_dist, neigh_ind = nbrs.kneighbors(dbscan_df)

# sort the neighbor distances (lengths to points) in ascending order
# axis=0 sorts each column independently, i.e. down the rows
sort_neigh_dist = np.sort(neigh_dist, axis=0)

k_dist = sort_neigh_dist[:, 128]
plt.plot(k_dist)
plt.axhline(y=5, linewidth=1, linestyle='dashed', color='k')
plt.ylabel("k-NN distance")
plt.xlabel("Sorted observations (128th NN)")
plt.show()

# Fit DBSCAN method
dbscan_clustering = DBSCAN(min_samples=129, eps=5, n_jobs=-1).fit(dbscan_df)
Counter(dbscan_clustering.labels_)

output: Counter({0: 1201, -1: 1})

Question: Is there any way to see which words have label 0 and which have label -1? For example, can I check whether the word antigens has label 0 or label -1?
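For reference, here is a minimal sketch of one way to look this up. It relies on `labels_` being returned in the same row order as the data passed to `fit()`, so the labels can be attached back to the word column. In scikit-learn's DBSCAN, the label -1 marks noise points rather than a cluster. The toy frame `toy_df`, its column names `d0`..`d2`, the word list, and the `eps`/`min_samples` values are all made up for illustration and stand in for `dbscan_df1` and the real parameters.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical stand-in for dbscan_df1: 3 embedding columns plus the word column.
rng = np.random.default_rng(0)
words = ["to", "antigens", "sailboat", "of", "and", "outlier"]
vectors = rng.normal(scale=0.1, size=(len(words), 3))
vectors[-1] += 10.0  # move the last word far away so DBSCAN treats it as noise
toy_df = pd.DataFrame(vectors, columns=["d0", "d1", "d2"])
toy_df["Unique Words"] = words

# Cluster on the numeric columns only (illustrative parameters).
features = toy_df.drop(columns="Unique Words")
clustering = DBSCAN(min_samples=3, eps=2.0).fit(features)

# labels_ follows the row order of `features`, so it lines up with the words.
labelled = toy_df[["Unique Words"]].assign(cluster=clustering.labels_)

# Label of one specific word.
antigens_label = labelled.loc[labelled["Unique Words"] == "antigens", "cluster"].iloc[0]

# All words flagged as noise (label -1).
noise_words = labelled.loc[labelled["cluster"] == -1, "Unique Words"].tolist()

# Every label mapped to its list of words.
words_by_cluster = labelled.groupby("cluster")["Unique Words"].apply(list).to_dict()

print(labelled)
```

On the data from the question, the same idea would be `dbscan_df1[['Unique Words']].assign(cluster=dbscan_clustering.labels_)`, assuming the rows of `dbscan_df` were not reordered or dropped before fitting.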

Best answer