Removing a custom stop word list from a column in a pandas DataFrame

Posted: 2022-06-30 00:01:38 Views: 111

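I am analyzing a long list of survey responses. I can remove the stop words in the standard nltk list without any trouble. However, I have created a modified list and can't work out how to incorporate it into my code. The original code I used for the standard list is: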

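# Create a column with stop words removed, starting from the column where the responses have already been stripped of punctuation, tokenized, and lowercased.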

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

df['stopwords_removed'] = df['no_punc'].apply(lambda x: [word for word in x if word not in stop_words])

df.head()

I used the following code to add my own words to the standard list:

stop_words = set(stopwords.words('english'))

new_stopwords = ['satisfying', 'satisfy', 'satisfied', 'clemson', 'university', 'institution', 'disappointing', 'disappoint', 'disappointed', 'experience', 'would', 'should']

new_stopwords_list = stop_words.union(new_stopwords)

My question is: how do I modify my original code to use new_stopwords_list instead of the standard list?

🎖️ Best answer

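I'm not sure I fully understand, but why can't you use the same line of code and check for membership in the new set instead? Note that the combined set is new_stopwords_list; new_stopwords on its own holds only the extra words. So: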

df['stopwords_removed'] = df['no_punc'].apply(lambda x: [word for word in x if word not in new_stopwords_list])
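
To see why the choice of set matters, here is a minimal sketch in plain Python. It assumes a tiny hand-written stand-in for the NLTK list (the real list from stopwords.words('english') is much longer), a shortened version of the custom word list, and a made-up tokenized response. The helper name remove_stop_words is hypothetical and not part of the original code.

```python
# Stand-in for set(stopwords.words('english')); an assumption for illustration,
# the real NLTK list contains many more words.
standard_stop_words = {"i", "the", "a", "was", "with", "my", "and"}

# Shortened version of the custom additions from the question
new_stopwords = ['satisfied', 'clemson', 'university', 'experience', 'would', 'should']

# union() accepts any iterable and returns a new set; the original set is unchanged
new_stopwords_list = standard_stop_words.union(new_stopwords)


def remove_stop_words(tokens, stop_words=new_stopwords_list):
    """Return the tokens that are not in stop_words, preserving their order."""
    return [word for word in tokens if word not in stop_words]


# Hypothetical response that has already been tokenized and lowercased
tokens = ["i", "was", "satisfied", "with", "my", "clemson", "advisor"]

# Filtering against the combined set removes standard and custom words alike
cleaned = remove_stop_words(tokens)

# Filtering against the bare list of additions leaves the standard stop words in
custom_only = remove_stop_words(tokens, new_stopwords)
```

With a DataFrame like the one in the question, the same helper could be passed straight to apply: df['stopwords_removed'] = df['no_punc'].apply(remove_stop_words). Keeping the stop words in a set rather than a list also makes each membership check a constant-time lookup on average, which helps on a long list of responses.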