Data Science is the extraction of actionable knowledge from raw data through the complete data lifecycle process. It combines different tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data.
Data science is the study of where information comes from, what it represents and how it can be turned into a valuable resource in the creation of business and IT strategies.
Mining large amounts of structured and unstructured data to identify patterns can help an organization rein in costs, increase efficiencies, recognize new market opportunities and increase the organization's competitive advantage.
The key objective is to extract required or valuable information that may be used for multiple purposes, such as decision making, product development, trend analysis and forecasting.
NLP is a specialized field of computer science and artificial intelligence with roots in computational linguistics. It is primarily concerned with designing and building applications and systems that enable interaction between machines and the natural languages that have evolved for use by humans.
Natural Language Processing (NLP) is all about leveraging tools, techniques and algorithms to process and understand natural language-based data, which is usually unstructured, such as text and speech.
Natural Language Processing includes Natural Language Understanding (NLU) and Natural Language Generation (NLG). NLG can be described as the “process of producing meaningful phrases and sentences in the form of natural language”: it takes the data input and maps it into natural language. NLU covers the interpretation side and supports information extraction and retrieval, sentiment analysis, and more.
NLP is used for text analysis, automatic text summarization, sentiment analysis, topic extraction, part-of-speech tagging, stemming, and more.
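As a rough illustration of two of the tasks listed above, the sketch below tokenizes a sentence and reduces each token to a crude stem before counting terms. The function names and the suffix list are assumptions made for this example; the suffix stripping is a deliberately naive stand-in and not an established stemmer such as Porter's.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def crude_stem(token):
    """Strip a few common English suffixes (illustrative only, not the Porter stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("Mining patterns helps analysts; analysts mined more patterns.")
stems = [crude_stem(t) for t in tokens]
term_counts = Counter(stems)  # term frequencies over the stemmed tokens
```

Grouping "mining" and "mined" under one stem is what lets a later frequency count treat them as the same term.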
Chronological ordering of documents
Documents or text files can be ordered chronologically, based on the events (or occurrences of events) mentioned in them.
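One possible way to order documents by the events they mention is sketched below. It assumes, purely for illustration, that events are marked with ISO-style dates (YYYY-MM-DD) and that a document is positioned by the earliest date it mentions; the function names and sample file names are hypothetical.

```python
import re
from datetime import date

DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def earliest_date(text):
    """Return the earliest ISO-style date mentioned in the text, or None."""
    dates = [date(int(y), int(m), int(d)) for y, m, d in DATE_PATTERN.findall(text)]
    return min(dates) if dates else None

def order_chronologically(documents):
    """Sort a {name: text} mapping by earliest mentioned date; undated documents go last."""
    def key(item):
        found = earliest_date(item[1])
        return (found is None, found or date.max)
    return [name for name, _ in sorted(documents.items(), key=key)]

documents = {
    "report_b.txt": "The outage began on 2021-03-14 and was resolved on 2021-03-15.",
    "report_a.txt": "Planning started on 2020-11-02.",
    "notes.txt": "No dates are mentioned here.",
}
ordered = order_chronologically(documents)
```

Real documents usually express dates in many formats, so a practical system would need a more tolerant date extractor than this single regular expression.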
Document summarization using NLP
Extractive summarization is the process of creating a shorter version of the original text that satisfies the user's requirements. The extraction approach selects the most important sentences in a document: a score is calculated for each sentence, and the top n sentences are selected as the summary, where n is determined by a user-defined summary ratio.
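The extraction approach can be sketched as follows. The text above does not specify how sentences are scored, so this example assumes a simple scheme (the average document-wide frequency of a sentence's words); the function name, the regular expressions and the default ratio are likewise illustrative choices.

```python
import math
import re
from collections import Counter

def summarize(text, ratio=0.3):
    """Return the top-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    # Word frequencies over the whole document (assumed scoring basis).
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # The user-defined ratio decides how many sentences (n) are kept.
    n = max(1, math.ceil(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]

sample = "Data cleaning removes errors. Data cleaning fills missing data. Cats sleep."
summary = summarize(sample, ratio=0.3)
```

Returning the selected sentences in document order, rather than score order, keeps the summary readable.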
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format.
Real-world data is often incomplete and inconsistent, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.
1. Data Cleaning:
Data cleaning, also called data cleansing or scrubbing, includes filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. It is required because source systems contain dirty data that must be cleaned.
2. Data Integration:
Combines data from multiple sources into a coherent data store, e.g. a data warehouse. Sources may include multiple databases, data cubes or data files.
3. Data Transformation:
The transformation process deals with rectifying any inconsistency. One of the most common transformation issues is ‘Attribute Naming Inconsistency’. It is common for a given data element to be referred to by different data names in different databases, e.g. Employee Name may be EMP_NAME in one database and ENAME in another. Thus one set of data names is picked and used consistently in the data warehouse.
4. Data Reduction:
Obtains a representation of the data that is reduced in volume but produces the same or similar analytical results.
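The four preprocessing steps above can be sketched on a small, made-up data set. Everything in this example is an assumption for illustration: the records, the salary and dept fields, the canonical name employee_name, the choice of median filling with a median-absolute-deviation outlier test for cleaning, and per-group averaging as the form of reduction. In the usage at the end, integration and transformation are applied before cleaning, since the sources must be combined before their values can be compared.

```python
from collections import defaultdict
from statistics import median

# Hypothetical alias table: both source names map to one warehouse name.
ALIASES = {"EMP_NAME": "employee_name", "ENAME": "employee_name"}

def clean(records, field, threshold=3.0):
    """1. Cleaning: fill missing values with the median, then drop outliers."""
    observed = [r[field] for r in records if r[field] is not None]
    center = median(observed)
    filled = [{**r, field: center if r[field] is None else r[field]} for r in records]
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median(abs(r[field] - center) for r in filled)
    if mad == 0:
        return filled
    return [r for r in filled if abs(r[field] - center) / mad <= threshold]

def integrate(*sources):
    """2. Integration: combine records from several sources into one list."""
    return [record for source in sources for record in source]

def transform(records):
    """3. Transformation: rename attribute aliases to one consistent name."""
    return [{ALIASES.get(k, k): v for k, v in r.items()} for r in records]

def reduce_by(records, key, value):
    """4. Reduction: keep one average per group instead of every record."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r[value])
    return {k: sum(vs) / len(vs) for k, vs in groups.items()}

database_a = [
    {"EMP_NAME": "Asha", "dept": "ops", "salary": 50000},
    {"EMP_NAME": "Ravi", "dept": "ops", "salary": None},  # missing value
    {"EMP_NAME": "Lena", "dept": "dev", "salary": 54000},
]
database_b = [
    {"ENAME": "Omar", "dept": "dev", "salary": 56000},
    {"ENAME": "Mei", "dept": "ops", "salary": 5200000},  # likely a data-entry error
    {"ENAME": "Jon", "dept": "dev", "salary": 52000},
]

records = transform(integrate(database_a, database_b))
records = clean(records, "salary")
average_salary = reduce_by(records, "dept", "salary")
```

Aggregation is only one kind of reduction; sampling or dimensionality reduction would be equally valid instances of the same step.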
Pattern recognition is the process of recognizing patterns by using machine learning algorithms. It can be defined as the classification of data based on knowledge already gained or on statistical information extracted from patterns and/or their representation. One important aspect of pattern recognition is its application potential.
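A minimal sketch of classification based on knowledge already gained is a nearest-centroid classifier: the labelled samples are summarized as one average point per class, and a new observation receives the label of the closest average. The choice of this particular classifier, the function names and the sample data are assumptions made for illustration.

```python
import math

def train_centroids(samples):
    """Average the feature vectors of each class (the knowledge already gained)."""
    sums, counts = {}, {}
    for features, label in samples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            total[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in total] for label, total in sums.items()}

def classify(features, centroids):
    """Assign the label whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))

training = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.0), "large"),
]
centroids = train_centroids(training)
prediction = classify((2.0, 2.0), centroids)
```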
Designing a pattern recognition system involves a sequence of activities. These activities are as follows: