Question: Improving the extraction of human names with nltk


I am trying to extract human names from text.

Does anyone have a method they would recommend?

This is what I tried (code is below): I am using nltk to find everything marked as a person and then build a list of all the NNP parts of that person. I skip persons that consist of a single NNP, which avoids grabbing a lone surname.

I am getting decent results, but I was wondering whether there are better ways to solve this problem.

Code:

import nltk
from nameparser.parser import HumanName

def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)
    person_list = []
    person = []
    name = ""
    # Note: t.node is the NLTK 2.x spelling; NLTK 3 replaced it with t.label().
    for subtree in sentt.subtrees(filter=lambda t: t.node == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []

    return (person_list)

text = """
Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that’s exactly what is happening to BTC prices."
"""

names = get_human_names(text)
print("LAST, FIRST")
for name in names:
    parsed = HumanName(name)
    print(parsed.last + ', ' + parsed.first)

Output:

LAST, FIRST
Velde, Francois
Branson, Richard
Galactic, Virgin
Krugman, Paul
Summers, Larry
Colas, Nick

Apart from Virgin Galactic, this is all valid output. Of course, knowing that Virgin Galactic isn't a human name in the context of this article is the hard (maybe impossible) part.


32
2017-11-29 17:33




Answers:


I have to agree with the suggestion that "make my code better" isn't well suited to this site, but I can point you to a place where you can start digging.

Take a look at the Stanford Named Entity Recognizer (NER). Its binding has been included in NLTK v2.0, but you must download some core files. Here is a script that can do all of that for you.

I wrote this script:

import nltk
from nltk.tag.stanford import NERTagger
st = NERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        if tag[1] == 'PERSON':
            print(tag)

and got a not-too-bad result:

('Francois', 'PERSON')
('R.', 'PERSON')
('Velde', 'PERSON')
('Richard', 'PERSON')
('Branson', 'PERSON')
('Virgin', 'PERSON')
('Galactic', 'PERSON')
('Bitcoin', 'PERSON')
('Bitcoin', 'PERSON')
('Paul', 'PERSON')
('Krugman', 'PERSON')
('Larry', 'PERSON')
('Summers', 'PERSON')
('Bitcoin', 'PERSON')
('Nick', 'PERSON')
('Colas', 'PERSON')
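The tagger returns one tag per token, so a full name still has to be stitched together from adjacent PERSON tokens. A minimal sketch of that step, assuming the tagger marks non-entity tokens with "O"; group_person_tokens and sample_tags are illustrative names and hand-written data, not part of NLTK or of the answer above:

```python
from itertools import groupby

def group_person_tokens(tagged_tokens, label="PERSON"):
    """Join runs of adjacent tokens that share `label` into full names."""
    names = []
    # groupby only merges neighbours, so two people separated by other
    # tokens end up as two separate names.
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == label:
            names.append(" ".join(token for token, _ in run))
    return names

# Hand-written stand-in for the tagger output of one sentence.
sample_tags = [
    ("Economist", "O"), ("Paul", "PERSON"), ("Krugman", "PERSON"),
    ("has", "O"), ("suggested", "O"),
]
print(group_person_tokens(sample_tags))
```

Pass it the complete list returned by st.tag(tokens) rather than the filtered one, otherwise names from different parts of the sentence would be glued together.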

Hope this is helpful.


14
2018-06-09 11:13



You can try to resolve the names you find and check whether they appear in a database such as freebase.com. Get the data locally and query it (it's in RDF), or use Google's API: https://developers.google.com/freebase/v1/getting-started. Most big companies, geographical locations, etc. (which your snippet would catch) could then be discarded based on the Freebase data.
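A minimal sketch of the discarding step, assuming the database lookup has already been reduced to a local set of known company and place names. KNOWN_NON_PERSONS is hypothetical stand-in data, not real Freebase output, and drop_known_non_persons is an illustrative helper:

```python
def drop_known_non_persons(candidates, known_non_persons):
    """Keep only candidates that are absent from the lookup set (case-insensitive)."""
    blocked = {entry.lower() for entry in known_non_persons}
    return [name for name in candidates if name.lower() not in blocked]

# Hypothetical stand-in for the result of a database query.
KNOWN_NON_PERSONS = {"Virgin Galactic", "Federal Reserve", "ConvergEx Group"}

candidates = ["Francois R. Velde", "Richard Branson", "Virgin Galactic", "Paul Krugman"]
print(drop_known_non_persons(candidates, KNOWN_NON_PERSONS))
```

An exact-match set is the simplest possible lookup; a real entity database would probably need fuzzier matching for abbreviations and alternate spellings.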


5
2017-12-08 23:57



For anyone else looking, I found this article useful: http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code

>>> import nltk
>>> def extract_entities(text):
...     for sent in nltk.sent_tokenize(text):
...         for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
...             if hasattr(chunk, 'label'):  # entity chunks are Trees; plain tokens are tuples
...                 print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))
...

5
2018-02-25 20:27



spaCy can be a good alternative for getting names out of a text.

https://spacy.io/usage/training#ner
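A rough sketch of what that could look like, assuming spaCy and an English pipeline are installed. The model name en_core_web_sm is my assumption, and person_names and extract_with_spacy are illustrative helpers rather than spaCy functions:

```python
def person_names(doc):
    """Return the text of every entity labelled PERSON, without duplicates."""
    seen = []
    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.text not in seen:
            seen.append(ent.text)
    return seen

def extract_with_spacy(text, model="en_core_web_sm"):
    """Run a spaCy pipeline over `text`; needs spaCy and the model installed."""
    import spacy  # imported lazily so person_names() works without spaCy
    nlp = spacy.load(model)
    return person_names(nlp(text))
```

Calling extract_with_spacy(text) on the paragraph from the question should return whole names directly, since spaCy entities are spans rather than single tokens. How well it separates people from companies depends on the model.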


2
2017-12-06 15:39



@trojane's answer didn't quite work for me, but it helped a lot in getting to this one.

Prerequisites

Create a folder named stanford-ner and download into it the two files the script loads: english.all.3class.distsim.crf.ser.gz and stanford-ner.jar.

Script

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import nltk
from nltk.tag.stanford import StanfordNERTagger

text = u"""
Some economists have responded positively to Bitcoin, including
Francois R. Velde, senior economist of the Federal Reserve in Chicago
who described it as "an elegant solution to the problem of creating a
digital currency." In November 2013 Richard Branson announced that
Virgin Galactic would accept Bitcoin as payment, saying that he had invested
in Bitcoin and found it "fascinating how a whole new global currency
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical.
Economist Paul Krugman has suggested that the structure of the currency
incentivizes hoarding and that its value derives from the expectation that
others will accept it as payment. Economist Larry Summers has expressed
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market
strategist for ConvergEx Group, has remarked on the effect of increasing
use of Bitcoin and its restricted supply, noting, "When incremental
adoption meets relatively fixed supply, it should be no surprise that
prices go up. And that’s exactly what is happening to BTC prices."
"""

st = StanfordNERTagger('stanford-ner/english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner/stanford-ner.jar')

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        if tag[1] in ["PERSON", "LOCATION", "ORGANIZATION"]:
            print(tag)

Results

(u'Bitcoin', u'LOCATION')       # Wrong
(u'Francois', u'PERSON')
(u'R.', u'PERSON')
(u'Velde', u'PERSON')
(u'Federal', u'ORGANIZATION')
(u'Reserve', u'ORGANIZATION')
(u'Chicago', u'LOCATION')
(u'Richard', u'PERSON')
(u'Branson', u'PERSON')
(u'Virgin', u'PERSON')         # Wrong
(u'Galactic', u'PERSON')       # Wrong
(u'Bitcoin', u'PERSON')        # Wrong
(u'Bitcoin', u'LOCATION')      # Wrong
(u'Bitcoin', u'LOCATION')      # Wrong
(u'Paul', u'PERSON')
(u'Krugman', u'PERSON')
(u'Larry', u'PERSON')
(u'Summers', u'PERSON')
(u'Bitcoin', u'PERSON')        # Wrong
(u'Nick', u'PERSON')
(u'Colas', u'PERSON')
(u'ConvergEx', u'ORGANIZATION')
(u'Group', u'ORGANIZATION')     
(u'Bitcoin', u'LOCATION')       # Wrong
(u'BTC', u'ORGANIZATION')       # Wrong

1
2017-07-12 15:49



I actually wanted to extract only the person's name, so I thought of checking every name in the output against WordNet (a large lexical database of English). More information on WordNet can be found here: http://www.nltk.org/howto/wordnet.html

import nltk
from nameparser.parser import HumanName
from nltk.corpus import wordnet

def get_human_names(text):
    """Return the multi-token PERSON chunks found in text, without duplicates."""
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)

    person_list = []
    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []
    return person_list

text = """

Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that’s exactly what is happening to BTC prices."
"""

person_list = get_human_names(text)
person_names = list(person_list)  # copy, so removing entries does not disturb the loop
for person in person_list:
    for name in person.split(" "):
        # Drop the whole candidate if any of its parts is a WordNet entry.
        if wordnet.synsets(name):
            person_names.remove(person)
            break

print(person_names)

OUTPUT

['Francois R. Velde', 'Richard Branson', 'Economist Paul Krugman', 'Nick Colas']

Apart from Larry Summers, all the names are correct; Larry Summers is dropped because of the last name "Summers".
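One variation that might keep names like this one is to reject a candidate only when every token is a dictionary word, instead of when any token is. This is a sketch of that idea, not what the code above does. filter_names and TOY_DICTIONARY are hypothetical, and the toy set stands in for a real lookup such as bool(wordnet.synsets(token)); which words WordNet actually contains is not verified here:

```python
def filter_names(names, is_dictionary_word):
    """Drop a name only when every token in it is a dictionary word."""
    kept = []
    for name in names:
        tokens = name.split()
        if not all(is_dictionary_word(token) for token in tokens):
            kept.append(name)
    return kept

# Hypothetical stand-in for a WordNet lookup.
TOY_DICTIONARY = {"virgin", "galactic", "summers", "economist"}

def toy_lookup(token):
    return token.lower() in TOY_DICTIONARY

names = ["Richard Branson", "Virgin Galactic", "Larry Summers"]
print(filter_names(names, toy_lookup))
```

The trade-off is that a stricter rule lets through more false positives whenever a non-name phrase contains one uncommon word.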


1
2018-03-26 20:37



This worked pretty well for me. I only had to change one line to get it to run.

    for subtree in sentt.subtrees(filter=lambda t: t.node == 'PERSON'):

needs to be

    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):

There were imperfections in the output (for example, it identified "Money Laundering" as a person), but with my data a name database may not be dependable.


0
2017-07-27 13:11